Monday, August 13, 2012

Physics of Sound Series: Formants, formants, and more formants

According to Raphael et al., the source-filter theory of speech production states that the sound produced at the source, i.e. the vocal folds, is filtered through the air spaces in the vocal tract (p. 330).*  This is a fairly simplistic model of vocal production, but it is very useful precisely because of its simplicity.  Other models of speech production get a lot more detailed, but for a general, conceptual knowledge of the relationship between the vocal folds and vocal tract in terms of acoustic output, I think the source-filter model can't really be beat.

So what does this have to do with formants?  Well, in the last physics post, I left off by stating that the vocal tract can change its shape and configuration to filter out different harmonics from the same sound source.  The shape of the vocal tract will also amplify certain harmonic frequencies while damping others.  The resulting "peaks" in amplitude at specific frequency ranges are what we call formants.  One important thing to note here is that formants are not the same thing as harmonics.  You can think of a formant as a specific collection of harmonics, so the first formant is not the same as the first harmonic.  A harmonic is one particular sine wave that is mathematically related to the fundamental (its frequency is a whole-number multiple of the fundamental frequency), but the formants are collections of these sine waves.  The language you typically see is that the first formant is "around" a specific frequency.  So while you might read about the singer's formant being somewhere around 3000 Hz, the formant isn't actually only at 3000 Hz; it's a band of frequencies centered somewhere around 3000 Hz.  I think the semantics get a little fuzzy there for a lot of people, but what seems like a little, unimportant detail actually makes a big difference when discussing harmonics vs. formants.  If you use those terms interchangeably, you'll just confuse the folks who know they're different things and then you'll get confused that they're confused and yadda yadda yadda...
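To make the harmonic vs. formant distinction a bit more concrete, here is a minimal Python sketch.  Everything in it is an assumption made for illustration: the function names, the simple bell-shaped gain curve standing in for the vocal tract filter, the 220 Hz fundamental, and the 1000 Hz bandwidth are not taken from Raphael et al. or from any real voice measurement.  It just lists which harmonics fall inside a resonance peak centered at 3000 Hz.

```python
def harmonic_series(f0, max_freq):
    """Harmonics of f0 (whole-number multiples of f0) up to max_freq, in Hz."""
    return [f0 * n for n in range(1, int(max_freq // f0) + 1)]


def resonance_gain(freq, center, bandwidth):
    """Bell-shaped gain curve: 1.0 at the center, 0.5 at center +/- bandwidth / 2.

    This is a stand-in for a vocal tract resonance, not a real acoustic model.
    """
    return 1.0 / (1.0 + ((freq - center) / (bandwidth / 2.0)) ** 2)


def harmonics_in_formant(f0, center, bandwidth, max_freq=5000):
    """Harmonics whose gain is at least half of the peak gain."""
    return [
        f for f in harmonic_series(f0, max_freq)
        if resonance_gain(f, center, bandwidth) >= 0.5
    ]


# Hypothetical values: a 220 Hz fundamental and a peak centered at 3000 Hz.
print(harmonics_in_formant(220, 3000, 1000))  # [2640, 2860, 3080, 3300]
```

Notice that none of the harmonics sits exactly at 3000 Hz.  The peak is a property of the filter, and the harmonics that happen to land near it are the ones that get boosted.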

Think of it like this:  Let's say you have a collection of all the Star Trek episodes from every Star Trek series, even the crappy ones.  If you consider the first series, the original Star Trek, as the fundamental, the first "harmonic" would then be Star Trek:  The Next Generation, the second would be Deep Space Nine, the third Voyager, etc.  However, it's possible that if these "harmonics" get filtered into formants, the first formant could consist of the first five seasons of The Next Generation, with the last two seasons filtered down to really low amplitude.  The second formant could be the last four seasons of Deep Space Nine, with the first three seasons of DS9 being filtered down.  The third formant could be the last five seasons of Voyager with the first two seasons filtered down, etc.  See the difference?  So harmonics are the building blocks of formants, but harmonics come from the vibration of the vocal folds themselves and formants come from the resonance of the acoustic filter, or vocal tract.

What's great about formants is that they happen to be the way we distinguish vowels during speech.  In fact, vocal tract shape and acoustic output (vocal sound once it exits the mouth) are so closely related that we are able to classify vowels by either one, depending on what we're talking about.  For example:  Talking about articulation?  You'll be talking about the shape of the vocal tract made by the articulators (tongue, soft palate, etc.).  Talking about acoustics?  You'll be talking about the formants.

If you happened to click over to the Wikipedia article on vowels, you probably noticed there's a section on articulation and a separate section on acoustics.  The position of the tongue in the mouth happens to make the biggest difference to the overall shape of the vocal tract, and so a lot of vowels can be categorized by place of tongue articulation during production.  For example:  An /i/ ("ee") vowel is categorized as a high-front vowel because the tongue is positioned very high, near the roof of the mouth, and also quite far forward in the mouth.  A high-back vowel, such as /u/, has the tongue positioned as a "hump" near the back of the mouth, so it's high, but in the back.  A low vowel, such as /a/, doesn't involve the tongue in a raised position at all, and is closer to a neutral vowel position, of which the schwa sound is considered the most neutral.  (I know a lot of singers consider /a/ the most neutral vowel, but linguists and speech scientists have researched tongue positions, and schwa is indeed the most neutral.  I think the reason singers like to focus on /a/ so much more is that we don't tend to sing schwa very often, and if we do, we don't sustain a sound on schwa.  So schwa gets kind of a bad rap in the singing world, but it is an important little vowel in spoken language.)

Because a larger space will resonate at lower frequencies, and a smaller one at higher frequencies, the formants are a result of the size of the pharyngeal space and/or oral space as determined, primarily, by the tongue position.  A good example of this is blowing across the top of a bottle with some water in it, then blowing again after drinking the water:  the second note will be lower in pitch than the first because there is more air inside the bottle to resonate.  Or a better example:  A cello is bigger than a violin.  So...there you go.  Therefore, in a simplified sense, these tongue positions all correspond to the formant frequencies of each vowel.  The /i/ vowel is known for having a low first formant (more pharyngeal space created by the high tongue position) and a high second formant (small oral space created by the forward tongue position), and in fact, the widest space between the first and second formants is this vowel's trademark sound.  The /u/ vowel has a low first formant (from the high tongue position creating more pharyngeal space), but also has a low second formant (from the tongue position being near the back of the mouth, creating more space in the oral cavity).  Once again, this is a very simplified way of looking at this, but it's an easy way to understand the basic idea.  Just be aware that the science of acoustics can get pretty darn complicated in this area.
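If you want to see the "bigger space, lower resonance" idea in numbers, here is a minimal Python sketch.  It uses a standard textbook idealization that is not spelled out in this post: treating the air space as a uniform tube closed at one end and open at the other, whose resonances fall at odd multiples of c / (4L).  The function name, the speed of sound (350 m/s), and the tube lengths (17.5 cm, a commonly quoted rough figure for an adult vocal tract, and a tube twice that long) are all assumptions for illustration.  A real vocal tract is not a uniform tube, so treat this as a sketch of the trend only.

```python
SPEED_OF_SOUND = 350.0  # m/s, rough value for warm, humid air (assumption)


def tube_resonances(length_m, count=3, speed=SPEED_OF_SOUND):
    """Resonances (Hz) of an idealized uniform tube, closed at one end and open at the other.

    Uses the quarter-wavelength relation: (2n - 1) * speed / (4 * length).
    """
    return [(2 * n - 1) * speed / (4.0 * length_m) for n in range(1, count + 1)]


# Hypothetical tube lengths in meters: the longer tube resonates lower.
for length in (0.175, 0.35):
    freqs = [round(f) for f in tube_resonances(length)]
    print(f"{length} m tube: {freqs} Hz")
```

The 17.5 cm tube comes out at roughly 500, 1500, and 2500 Hz, and doubling the length cuts each of those in half.  Same idea as the cello and the violin.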


*Raphael, L. J., Borden, G. J., & Harris, K. S. (2007).  Speech science primer:  Physiology, acoustics, and perception of speech (5th ed.).  Philadelphia, PA:  Lippincott Williams & Wilkins.
