What can we learn about how the brain processes sound from the patterns in the world?
Caveat: I am only an amateur audio engineer so there may/will be much wrongness below.
Frequency analysis, for example via the Fourier Transform, is often the first port of call when analysing audio recordings. This analysis decomposes a signal into oscillating components (such as sine or cosine waves) at different frequencies. “Frequency” here can be thought of as the number of times the wave repeats (up and down) in a given time period: high frequency means rapid repetition; low frequency means slow repetition. A common visualisation is a spectrogram, where the magnitudes of a set of frequencies present in a small time window are shown as the window moves forward through time.
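As a minimal sketch (the window and hop sizes here are arbitrary illustrative choices, not standards), a spectrogram can be computed by sliding a window along the signal and taking the magnitude of the Fourier Transform of each windowed segment:

```python
import numpy as np

def spectrogram(signal, window=256, hop=128):
    """Magnitudes of windowed FFTs as the window slides through time."""
    frames = [signal[i:i + window] * np.hanning(window)
              for i in range(0, len(signal) - window + 1, hop)]
    # rfft keeps only the non-negative frequency bins of a real signal
    return np.abs(np.fft.rfft(frames, axis=1))

# A 5 Hz sine sampled at 1000 Hz: energy should sit in a low-frequency bin
fs = 1000
t = np.arange(fs) / fs
spec = spectrogram(np.sin(2 * np.pi * 5 * t))
print(spec.shape)  # (6, 129): 6 time windows x 129 frequency bins
```

Each row of `spec` is one time window; plotting the rows as columns of an image gives the familiar spectrogram picture.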
We know a fair amount about how the brain processes different frequencies. The ear contains the cochlea, where hair cells at different positions respond to different frequencies, and the cortex contains tonotopically mapped regions (e.g. each frequency band, within a particular resolution, is first processed by a dedicated portion of the primary auditory cortex, A1). However, we still know comparatively little about the underlying temporal structure that the brain is detecting in the world.
We do know that changes in frequency patterns over time are important for how we perceive the auditory world. The distinctive sound of many instruments (e.g. a trumpet vs a saxophone) is characterised by different patterns of “attack” – how the amplitude and frequency content of a note change in the moments after it starts. We also know that there are patterns within the range of audible frequencies – notes separated by an octave are deemed similar. Here an octave represents a base-2 (doubling) pattern within the range of frequencies.
One interesting question to start with is that of timing patterns. Are there patterns in how we detect frequencies over time?
To answer this we can look at our own auditory reflections. The modern world is full of speech and music. These sources can provide clues as to how we perceive and create sounds. We need to imagine that we are an alien observing the strange world of human beings; although it all seems natural to us, most of it is dependent on the environment of earth and our unique evolutionary path.
Most music has an underlying beat, and the rate of that beat is referred to as “tempo”. Tempo is often measured in beats per minute (bpm). Within most music, the tempo tends to range from 40 bpm (slow) to 170 bpm (jungle). Common tempos are 60 to 120 bpm – 1 to 2 Hz.
These ranges appear familiar: they are the common heart beat timings for human beings. A resting heart rate is around 50-60 bpm and an active “cardio workout” heart rate is around 150-180 bpm. Beat thus appears to be perceived in relation to interoceptive perception of our own heart rhythms.
The beat is also typically driven by low frequency sources (drums, bass guitar, etc.). It would thus appear that our interoceptive processing is somehow coupled to our low frequency audio processing. This seems logical – at very low frequencies (~40 Hz and below) we cease to hear and start to feel.
Music is often said to have a time signature. Written music is split into bars (UK) or measures (US). A common time signature making up a large proportion of music is 4/4, meaning there are four quarter notes per bar. Each note indicates a pitch and a duration.
As the vertical position of a note indicates its pitch, the staff (i.e. the horizontal lines in a sheet of music) can be considered a form of vertical frequency scale. The horizontal axis in sheet music represents time. Hence, sheet music provides a rough form of spectrogram.
The popularity of the 4/4 time signature suggests that there is something our brain “likes” about this arrangement. For a song at 60 bpm this represents 4 second segments but the time signature is also common across different tempos. This suggests it is not the timing per se (4/4 at 60 bpm is 4s per bar, at 120 bpm is 2s per bar) but the ratio that is important – our brain seems to like grouping things in 2s or 4s. This may be likened to how we tend to group letters together to make words.
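The ratio-versus-timing point is simple arithmetic, which we can make explicit (a toy calculation, not a claim about any particular piece of music):

```python
# Length of one bar in seconds: the absolute time scales with tempo,
# but the grouping ratio (4 beats per bar) stays fixed.
def bar_seconds(bpm, beats_per_bar=4):
    return beats_per_bar * 60.0 / bpm

print(bar_seconds(60))   # 4.0 seconds per bar at 60 bpm
print(bar_seconds(120))  # 2.0 seconds per bar at 120 bpm
```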
The drum pattern at the bottom of the article here shows how notes may be played above an underlying beat. The upper limit for my perception was around 32 notes of a ride pattern for every beat at 60 bpm; 64 or 128 notes per beat sounded like a drone rather than distinct sounds. This suggests an upper note frequency within music of around 30 to 40 Hz.
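The subdivision arithmetic behind this can be sketched as follows (a toy calculation based on my own informal listening, not a measured threshold):

```python
# Note onsets per second when each beat at a given tempo is split
# into a number of equal notes.
def note_rate_hz(bpm, notes_per_beat):
    return (bpm / 60.0) * notes_per_beat

print(note_rate_hz(60, 32))  # 32.0 Hz - still heard as distinct notes
print(note_rate_hz(60, 64))  # 64.0 Hz - heard as a drone
```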
Speech in English typically runs at between 100 and 200 words per minute, with around 100 to 150 being a comfortable rate and above 200 being fast but comprehensible. This gives us roughly 2 to 3 words per second. This is a similar order of magnitude to the number of notes per bar.
Words are made up of syllables. A typical speaking rate for English is 4 syllables per second, with a range of between 3 and 6 syllables a second being common. This matches the words per minute rate above, if we take it that most words have 1 to 4 syllables.
If we switch to audio patterns, we can represent speech using phones. A phone is a particular distinct speech sound, whereas a phoneme is a phone that is used by a particular language to shape the comprehension of a word – swapping phonemes would change a word. Phones are language independent but phonemes are language dependent; hence, certain languages share phones but these relate to different phonemes. Normal speech has about 10 to 15 phonemes per second, with an upper range of 40-50 phonemes per second. This also makes sense, if a typical syllable has between 3 and 5 phonemes.
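The consistency of these quoted rates is easy to check (the figures are the approximate ranges above, not measurements):

```python
# Rough consistency check: syllable rate x phonemes per syllable
# should land near the quoted 10-15 phonemes-per-second range.
def phonemes_per_second(syllable_rate, phonemes_per_syllable):
    return syllable_rate * phonemes_per_syllable

print(phonemes_per_second(4, 3))  # 12 - within the 10-15/s range
print(phonemes_per_second(4, 5))  # 20 - upper end, fast speech territory
```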
Now, an upper range of 40-50 phonemes per second represents a sound unit frequency of around 40 to 50 Hz. This is similar to the upper range of our note frequency set out above.
Chords are common in Western music. They consist of groups of notes (normally three) played simultaneously or in an overlapping manner. A basic chord consists of a root note, a third and a fifth. This roughly translates as: a first note, a second note three or four semitones above the first, and a third note typically seven semitones above the first (six to eight for diminished or augmented variants). A semitone represents a set frequency spacing – adjacent piano keys (counting both white and black keys) are a semitone apart, and two semitones make a whole tone. So a chord is a pattern of frequencies that occur together in time.
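In equal temperament each semitone multiplies frequency by the twelfth root of two, so a chord is a fixed pattern of frequency ratios above its root. A quick sketch, using A = 440 Hz as an example root:

```python
# Equal temperament: each semitone scales frequency by 2**(1/12).
def semitones_above(root_hz, semitones):
    return root_hz * 2 ** (semitones / 12)

# A major triad: root, major third (4 semitones), perfect fifth (7 semitones)
a_major = [semitones_above(440.0, s) for s in (0, 4, 7)]
print([round(f, 1) for f in a_major])  # [440.0, 554.4, 659.3]
```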
When we move between chords in time we get a chord progression. Chord progressions are the basis for harmonies. Most pop songs use progressions of between 2 and 6 chords, with 4 being the most common. Interestingly, the patterns of progression can be independent of the underlying chords; you can change the key of a song and the chord progression remains the same, but with different notes being played. This suggests that the brain is storing relative changes in frequency groupings over time in a consistent and possibly common way. For example, within our brains there is a representation of a key-independent progression pattern. This is a pattern in time.
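One way to picture a key-independent representation is to store a progression as semitone offsets from the tonic and render it in any key. This sketch uses MIDI note numbers (one integer step per semitone); the example progression is the common I-V-vi-IV:

```python
# A progression stored as semitone offsets from the tonic: the pattern
# itself carries no key. Rendering adds a tonic to recover actual notes.
I_V_vi_IV = [0, 7, 9, 5]  # root offsets of the I-V-vi-IV progression

def render(progression, tonic_midi):
    return [tonic_midi + offset for offset in progression]

print(render(I_V_vi_IV, 60))  # in C major: [60, 67, 69, 65]
print(render(I_V_vi_IV, 62))  # same pattern in D major: [62, 69, 71, 67]
```

Changing key changes every note, but the stored pattern is untouched; that stored pattern is the “pattern in time” referred to above.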
A chord progression with multiple chords per measure or bar is perceived as energetic or hectic; a chord progression spread over multiple measures or bars is perceived as slow and drawn out. In pop songs a chord is often maintained for one measure or bar, e.g. 4 seconds at 60 bpm in 4/4 time. Hence, it appears that our brains have some way of detecting groups of frequencies (chords) and detecting changes in these groups (chord progressions) over time periods of a few seconds.
Poems and songs tend to have a fairly standard length: about a page or two of iambic pentameter verse, or a three-verse song interspersed with a chorus. There are plenty of outliers, but this seems to be the peak of the bell curve. Good pop songs also average around 3 minutes 30 seconds, which is about the time it takes to speak that length of text.
In conversation, on average, each turn for a speaking party lasts around 2 seconds, and the typical gap between turns is just 200 milliseconds. This pattern is found across cultures. This is fairly short – it appears that the sentence is a result of this conversational pattern of speech. It is similar in length to a line of a verse of poetry or song. So a song or a poem is similar to a conversation (explicitly so in call-and-response patterns).
The paragraph and the stanza are sub-units of prose and poetry respectively, each of a similar average length (around 4 to 8 sentences or lines). This is of a similar order of magnitude to the number of measures or bars in a song, with sentences being roughly mapped onto bars. A common pattern is 4 bars being mapped to one “phrase”. Different multiples of 2 and 4 are common.
Useful Information for Intelligent Systems
So what useful information have we learned for intelligent systems that process audio information, especially human-generated audio?
First, we know that a base sampling rate for frequency patterns is somewhere around 30 to 60 Hz – patterns faster than this become incomprehensible. Note this is not the sampling rate for the raw audio data that is used to generate frequency information. That is typically much higher (e.g. 44.1 kHz and above) to capture the full frequency range of human hearing (~20 Hz to 20 kHz). Rather it is the rate at which audio features arise in frequency data, e.g. the sampling rate for the spectrogram. Interestingly, this also tends to be a base sampling rate for visual information in video data. This makes sense – the McGurk effect shows that audio and visual features extracted from our environment are integrated, with each modality influencing, and helping to predict, the other.
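The relationship between the two rates can be sketched as follows (the frame rates and hop sizes here are illustrative choices, not standards): the feature rate is set by the analysis hop size, not by the raw audio sampling rate.

```python
# Hop size (in raw audio samples) needed to produce spectrogram frames
# at a desired feature rate. The raw rate stays high; the feature rate
# sits in the ~30-60 Hz band discussed above.
def hop_samples(audio_rate_hz, frame_rate_hz):
    return audio_rate_hz // frame_rate_hz

print(hop_samples(44_100, 50))  # 882-sample hop gives a 50 Hz feature rate
print(hop_samples(48_000, 60))  # 800-sample hop gives a 60 Hz feature rate
```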
Second, we know that several patterns exist over a time scale of a few seconds: measures or bars of music, turns in conversation, chord progressions. There is no fixed time scale but much change occurs in a time window of between 1 and 10 seconds.
Third, we know that the regular patterns that exist over the time scale of seconds often repeat regularly over longer time periods. For example, measures or bars of music repeat to form song portions (verse or chorus) and a motif of a first verse will often be repeated in subsequent verses, together with repetition of a chorus motif. Similarly, we alternate turns in a conversation, and repeat rhythmic line structures within poetry.
Fourth, we know that speech and music borrow from each other and share patterns. This suggests a common underlying processing methodology for audio information.
If we were to generate an intelligent audio processing system, it would appear that we need to extract, detect or generate features or representations at multiple time scales. However, it would also appear that we have similar patterns occurring at each time scale, indicating recursion or reuse of common processing algorithms.
It would also appear that we are learning something useful for the processing of language. The patterns we have discussed here seem to form a starting point for the development of linguistic structures.