Monday, August 13, 2012

Physics of Sound: The Spectrogram (or what the heck am I looking at part 2)

I probably should give a little lesson on how to read a spectrogram, since my next post will feature spectrograms rather heavily.  I made all these spectrograms on PRAAT, which is free and downloadable if you wish to play with it.  (I know the website looks a little sketch, but I had to get it for my classes using the site I linked to and it's totally safe for your computer.)  PRAAT is a lovely piece of software that will record a sound and then give you both a spectrogram and a waveform of that sound.  As you look at the images below, the waveform is the image on the top with the thick black band and blue vertical lines, and the spectrogram is the grey-scale mess below that waveform.  So, on to the important part of this post!

How to read a spectrogram:

The x-axis (horizontal) is time, the y-axis (vertical) is frequency, and the grey-scale shows amplitude.  So a spectrogram can show three dimensions (time, frequency, and amplitude) vs. a waveform, which shows only two (time and amplitude).  On fancier programs, the amplitude is sometimes shown in color, like having blue be the softest sounds and red being the loudest, but in PRAAT, the darker the band, the higher the amplitude.  In terms of the frequencies, I set the spectrograms to show from 0 Hz to 7000 Hz.  PRAAT can display up to 20,000 Hz, but then the formant bands I want to focus on get too squished together.  If you click and make the image bigger, you can see a dotted red line with a frequency number off to the left.  I set those lines there just to give you some idea of where the upper formant lies in terms of Hz.  And remember from the last post that the formant will be somewhere around this frequency, not right at the single frequency itself.
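If you're curious how those three dimensions come out of a recording, here's a rough sketch in Python.  This is not how PRAAT does it internally (I don't know the details of its analysis); it's just a generic short-time Fourier transform.  The function name, the frame size, and the toy two-tone signal are all my own assumptions for illustration.  It chops the sound into short overlapping frames (time), measures how strong each frequency is in each frame (frequency), and keeps that strength as a number (amplitude, the thing a spectrogram draws as darkness).

```python
import cmath
import math

def spectrogram(samples, sample_rate, frame_size=256, hop=128):
    """Return (times, freqs, magnitudes) from a short-time Fourier transform.

    magnitudes[t][k] is the strength near freqs[k] Hz at times[t] seconds.
    """
    # A Hann window tapers each frame so its edges do not smear the spectrum.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    n_bins = frame_size // 2 + 1
    freqs = [k * sample_rate / frame_size for k in range(n_bins)]
    times, magnitudes = [], []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        row = []
        for k in range(n_bins):
            # Plain DFT: correlate the frame with a sinusoid at bin k.
            total = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n in range(frame_size))
            row.append(abs(total))
        times.append((start + frame_size / 2) / sample_rate)
        magnitudes.append(row)
    return times, freqs, magnitudes

# Toy signal (made up): a loud 500 Hz tone plus a quieter 1500 Hz tone.
sample_rate = 8000
samples = [1.0 * math.sin(2 * math.pi * 500 * n / sample_rate)
           + 0.3 * math.sin(2 * math.pi * 1500 * n / sample_rate)
           for n in range(1024)]

times, freqs, mags = spectrogram(samples, sample_rate)
loudest_bin = max(range(len(freqs)), key=lambda k: mags[0][k])
print(freqs[loudest_bin])  # prints 500.0
```

On a real spectrogram, the 500 Hz row would show up as the darkest horizontal band and the 1500 Hz row as a lighter one.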

So this is what a typical spectrogram will look like with the upper frequency set at 7000 Hz.  (I think PRAAT's default setting is usually 5000 Hz.):


The spectrogram above is me sustaining the vowel /a/ with my speaking voice.  You can clearly see five dark bands going horizontally across the image, but the bottom two dark bands are the darkest, indicating that those are the highest amplitude formants.

Sustained-speech of an /a/ vowel with formants marked.
This is the same spectrogram as the one above it, but I've set PRAAT to show me the first five formants, which it does by adding in those red lines.  The software is simply determining where the highest amplitudes are and sticking bands in there.  I'm not controlling where those thick red lines go.

Sustained, spoken /i/ vowel, no formants marked in.
Here's me sustaining an /i/ vowel with my speaking voice.  Note the wide distance between the first and second formants, which is just what the /i/ vowel does.  Oh /i/, you so crazy!

Sustained, spoken /i/ vowel, first five formants marked in red.
Above is the same spectrogram again, but with PRAAT marking the first five formants in red.

Spoken phrase:  "One, two, three, go," no formants marked.
And there's a spectrogram of me speaking the phrase, "one, two, three, go."  Here, you can see the movement of the formants as I go through those words and the "white space" between the words.  (Those areas where there's a thick blue vertical band on the waveform are where the /t/ of "two" and the "th" sound of "three" are.  And, you can see the antiformants present in the /n/ sound right at the end of the first word, "one."  Pretty cool, huh?  Scroll to the bottom of page 2 on that antiformant link to read more about them.)  And here's the same phrase with the formants marked in:

"One, two, three, go," with formants marked in red.

Now, some super cool people can actually read spectrograms like they're reading words off the page.  I'm not quite that awesome yet, but if you tell me what the phrase is, I can pick out where each specific word is using my knowledge of vowel formants and consonant frequencies.  It'd be cool to become that person who can just read them, though!

Now the reason I kept setting the spectrogram to 7000 Hz instead of 5000 Hz is two-fold:  First, I wanted to make sure the upper formant wasn't cut off, since that formant does occasionally go higher than 5000 Hz, and second, I wanted you to see that there actually is a thick band of amplitude above the 5000 Hz mark, which you can see in the spectrogram above.  So there are more "formants" above that 5000 Hz mark...we just don't pay much attention to frequencies higher than 5000 Hz when discussing speech or singing.  (Although, this article does!)  Heck, PRAAT doesn't even mark in any formants above the 5000 Hz area, which is usually the fifth formant area.  But, I wanted to make sure you know that it's not like formants and harmonics just disappear above 5000 Hz.  Mathematically speaking, harmonics would just keep on going higher and higher, and so would formants.  However, the amplitude lessens the higher you go, so vocal harmonics and formants do die away eventually...just not at 5000 Hz.

Up next:  The singer's formant!  I'mma gonna break apart a common misconception in the hopes that it clarifies what it is we're actually doing when we carry over that orchestra.

Physics of Sound Series: Formants, formants, and more formants

According to Raphael et al., the source-filter theory of speech production states that the source of vocal sound, i.e. the vocal folds, is filtered through the air spaces in the vocal tract (p. 330).*  This is a fairly simplistic model of vocal production, but it is very useful precisely because of its simplicity.  Other models of speech production out there get a lot more detailed, but for a general, conceptual knowledge of the relationship between the vocal folds and vocal tract in terms of acoustic output, I think the source-filter model can't really be beat.

So what does this have to do with formants?  Well, on the last physics post, I left off by stating that the vocal tract can change its shape and configuration to filter out different harmonics from the same sound source.  The shape of the vocal tract will also amplify certain harmonic frequencies, while dampening others.  The resulting "peaks" in amplitude at specific frequency ranges are what we call formants.  One important thing to note here is that formants are not the same thing as harmonics.  You can think of formants as being a certain specific collection of harmonics, so the first formant is not the same as the first harmonic.  The idea of a harmonic is that it is one particular sine wave that is related, mathematically, to the fundamental, but the formants are collections of these sine waves.  The language you typically see is that the first formant is around a specific frequency.  So while you might read about the singer's formant being somewhere around 3000 Hz, the formant isn't actually only at 3000 Hz; it's a collection of frequencies centered somewhere around 3000 Hz.  I think the semantics might get a little fuzzy there for a lot of people, but what seems like a little, unimportant detail actually makes a big difference when discussing harmonics vs. formants.  If you use those terms interchangeably, you'll just confuse the folks who know they're different things and then you'll get confused that they're confused and yadda yadda yadda...
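Here's a little Python sketch of that difference, if numbers help.  Everything in it is made up for illustration: the two pitches, the 400 Hz width of the peak, and the bell-shaped gain curve are my assumptions, not measurements of a real voice.  The harmonics are whole-number multiples of the fundamental, so they move whenever the pitch moves.  The "formant" is the peak of the filter, which stays centered around 3000 Hz and simply boosts whichever harmonics happen to land near it.

```python
def harmonics(f0_hz, max_hz):
    """Whole-number multiples of the fundamental, up to max_hz."""
    count = int(max_hz // f0_hz)
    return [f0_hz * n for n in range(1, count + 1)]

def resonance_gain(freq_hz, center_hz, bandwidth_hz):
    """Toy bell-shaped gain: 1.0 at center_hz, 0.5 at half a bandwidth away."""
    offset = (freq_hz - center_hz) / (bandwidth_hz / 2)
    return 1.0 / (1.0 + offset ** 2)

def boosted_harmonics(f0_hz, center_hz=3000.0, bandwidth_hz=400.0, max_hz=4000.0):
    """Harmonics that fall inside the peak centered around center_hz."""
    return [h for h in harmonics(f0_hz, max_hz)
            if resonance_gain(h, center_hz, bandwidth_hz) >= 0.5]

print(boosted_harmonics(220.0))  # prints [2860.0, 3080.0]
print(boosted_harmonics(330.0))  # prints [2970.0]
```

Notice that neither voice has a harmonic at exactly 3000 Hz.  The peak is in the same place both times, but the harmonics that fill it are different.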

Think of it like this:  Let's say you have a collection of all the Star Trek episodes from every Star Trek series, even the crappy ones.  If you consider the first series, the original Star Trek, as the fundamental, the first "harmonic" would then be Star Trek:  The Next Generation, the second would be Deep Space Nine, the third Voyager, etc.  However, it's possible that if these "harmonics" get filtered into formants, the first formant could consist of the first five seasons of The Next Generation, with the last two seasons filtered down to really low amplitude.  The second formant could be the last four seasons of Deep Space Nine, with the first three seasons of DS9 being filtered down.  The third formant could be the last five seasons of Voyager with the first two seasons filtered down, etc.  See the difference?  So harmonics are the building blocks of formants, but harmonics come from the vibration of the vocal folds themselves and formants come from the resonance of the acoustic filter, or vocal tract.

What's great about formants is that they happen to be the way we distinguish vowels during speech.  In fact, vocal tract shape and the acoustic output (vocal sound once it exits the mouth) are so closely related that we are able to classify vowels by either one, depending on what we're talking about.  I.e.:  Talking about articulation?  You'll be talking about the shape of the vocal tract made by the articulators (tongue, soft palate, etc.).  Talking about acoustics?  You'll be talking about the formant frequencies that shape produces.

If you happened to click over to that Wikipedia article on vowels, you probably noticed there's a section on articulation and a separate section on acoustics.  The position of the tongue in the mouth happens to make the biggest difference to the overall shape of the vocal tract, and so, a lot of vowels can be categorized by place of tongue articulation during production.  For example:  An /i/ ("ee") vowel is categorized as a high-front vowel because the tongue is positioned very high near the roof of the mouth, but it is also positioned quite forward in the mouth.  A high-back vowel, such as /u/, has the tongue positioned as a "hump" near the back of the mouth, so it's high, but in the back.  A low vowel, such as /a/, doesn't involve the tongue in a raised position at all, and is closer to a neutral vowel position, of which the schwa sound is considered the most neutral.  (I know a lot of singers consider /a/ as the most neutral vowel, but linguists and speech scientists have researched tongue positions, and schwa is indeed the most neutral.  I think the reason singers like to focus on /a/ so much more is that we don't tend to sing schwa very often, and if we do, we don't sustain a sound on schwa.  So schwa gets kind of a bad rap in the singing world, but it is an important little vowel in spoken language.)

Because a larger space will resonate at lower frequencies, and a smaller one at higher frequencies, the formants are a result of the size of the pharyngeal space and/or oral space as determined, primarily, by the tongue position.  A good example of this is if you blow across the top of a bottle with some water in it, then blow again after drinking the water, the second note will be a lower pitch than the first because there is more air inside the bottle to resonate.  Or a better example:  A cello is bigger than a violin.  So...there you go.  Therefore, in a simplified sense, these tongue positions all correspond to the formant frequencies of each vowel.  The /i/ vowel is known for having a low first formant (more pharyngeal space created by the high tongue position) and a high second formant (small oral space created by the tongue position), and in fact, this vowel has the widest space between the first and second formant as its trademark sound.  The /u/ vowel has a low first formant (from the high tongue position creating more pharyngeal space), but also has a low second formant (from the tongue position being near the back of the mouth, creating more space in the oral cavity).  Once again, this is a very simplified way of looking at this, but it's an easy way to understand the basic idea.  Just be aware that the science of acoustics can get pretty darn complicated in this area.
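To put some rough numbers on "bigger space, lower frequency," here's a sketch using a standard physics idealization that I'm bringing in myself: a plain, uniform tube that is closed at one end and open at the other.  A real vocal tract is not a uniform tube, and the speed of sound and the tube lengths below are round-number assumptions, so treat the output as a ballpark for a neutral, schwa-like shape and nothing more.  For that kind of tube, the resonances fall at odd multiples of the speed of sound divided by four times the tube length.

```python
SPEED_OF_SOUND_CM_PER_S = 35000.0  # rough round value for warm air (assumed)

def tube_resonances(length_cm, count=3):
    """Resonances (Hz) of a uniform tube closed at one end, open at the other."""
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_PER_S / (4 * length_cm)
            for n in range(1, count + 1)]

longer = tube_resonances(17.5)   # assumed length, roughly an adult vocal tract
shorter = tube_resonances(14.0)  # assumed length for a smaller tube

print([round(f) for f in longer])   # prints [500, 1500, 2500]
print([round(f) for f in shorter])  # prints [625, 1875, 3125]
```

Every resonance of the shorter tube sits higher than the matching resonance of the longer one, which is the cello-versus-violin idea in number form.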


*Raphael, L. J., Borden, G. J., & Harris, K. S. (2007).  Speech science primer:  Physiology, acoustics, and perception of speech (5th ed.).  Philadelphia, PA:  Lippincott Williams & Wilkins.