Speech Perception
The world around us is presented to us through physical mechanisms. Our understanding of the world depends on how we perceive signals from people and things through these mechanisms. One of these mechanisms is hearing, the faculty which enables us to perceive and interpret sound.
In order to understand listening, it is important to understand how the hearing mechanism works and what hearing contributes to language understanding. Hearing is the basis of language perception, and perception is the basis for listening. When we understand what aural perception does and does not provide us as we listen, we can understand how hearing is complemented by thinking and interpretation processes.
What are the characteristics of sound?
Sound surrounds us. It is caused by objects in contact with each other. Sound is produced by the movements of air particles created by the objects in contact. The movement of air particles emanates from the point of contact, and continues from the contact point in the form of fluctuations of air. Some of these sound waves are very simple in their structure (such as the sound of a note on a piano), but most sound waves are quite complex in their formation.
One basic fact about sound, one that is essential in articulatory phonetics (the study of how speech sounds are produced), is that sound reveals the structure of the objects that produce it. For example, if you knock on a door you can tell if it is hollow or solid, or if you jingle the coins in your pocket, you can tell if you have thick coins or thin ones. Perception of the structure of speech sounds is much more subtle, but follows the same principle. We hear differences in speech sounds because of the structure of the speech organs that produced them.
Speech sounds are produced by continuous fluctuations of air particles, originally produced in the lungs, coming into contact with the various parts of our vocal apparatus (most notably, the larynx, velar ridge, tongue, teeth, and lips). Each point of contact or modulation changes the shape of the sound waves that we perceive. Sound consists of variations in pressure (either air pressure or fluid pressure) as a function of time. The intensity of a sound can be measured in decibels.
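To make the decibel measurement concrete, here is a minimal sketch in Python. It assumes the conventional reference pressure for sound in air (20 micropascals), which is not given in this chapter. The function name and the sample pressures are illustrative choices only.

```python
import math

# Conventional reference pressure for sound in air: 20 micropascals.
# (An assumption brought in for this sketch; the chapter does not state it.)
REFERENCE_PRESSURE_PA = 20e-6


def sound_pressure_level_db(pressure_pa: float) -> float:
    """Return the sound pressure level in decibels for an RMS pressure in pascals."""
    if pressure_pa <= 0:
        raise ValueError("pressure must be positive")
    return 20.0 * math.log10(pressure_pa / REFERENCE_PRESSURE_PA)


# Two illustrative pressures, the second ten times larger than the first.
quiet = sound_pressure_level_db(0.002)
louder = sound_pressure_level_db(0.02)
print(round(quiet, 1), round(louder, 1))
```

Because the decibel scale is logarithmic, multiplying the pressure variation by ten adds 20 decibels rather than multiplying the decibel value by ten.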
A second fact about sound, one that is central to the process of hearing speech, is that sound is evanescent, that is, it has a fleeting existence. Sound has an essential temporal and temporary nature. Once the sound waves are no longer emanated, they quickly fade away. We typically have only a second or two to perceive sound and make sense of it.
How do we hear sounds?
The human auditory system consists of the outer ear, the middle ear, the inner ear, and the auditory nerves connecting to the brain stem. The auditory system consists of several interdependent sub-systems.
Figure 2.1 The human auditory system.
/insert Figure 2.1 here/
[Caption:]
The human auditory system is a series of stages for converting sound to neural stimuli.
Hearing occurs when (1) sound vibrations reach the eardrum, (2) causing the ossicles to vibrate and the stapes to move; (3) the vibrations pass through the oval window to the fluid-filled canals of the cochlea, and (4) are transmitted to the cochlear duct, where they set off nerve impulses that are sent along the cochlear nerve to the brain.
The outer ear consists of the pinna (this is the part of the ear we can see) and the auditory canal. The pinna modifies the incoming sound, in particular the higher frequencies, and helps us to locate the source of the sound.
Sound waves travel down the canal and cause the eardrum to vibrate. The vibrations are passed along through the middle ear, which is a remarkable transformer consisting of three small bones (the ossicles) leading to a small opening in the skull (the oval window). The major function of the middle ear is to ensure efficient transfer of sounds (which arrive as movements of air particles) to the fluids inside the cochlea.
In addition to this transmission function, the middle ear also has a protective function. The ossicles have tiny muscles which can contract (this is called the reflex action) to reduce the level of sound that reaches the inner ear. This reflex action occurs when we are presented with loud sounds such as the roar of an airplane engine, and it protects the delicate hearing mechanism from damage. Interestingly, the reflex action also occurs when we begin to speak. In this way the reflex protects us from too much feedback: it prevents us from hearing too much of our own speech and thus becoming distracted by it.
The cochlea is the most important part of the ear in terms of auditory perception. It is a small bony structure, about the size of your thumbnail, which is narrow at one end and wide at the other. It is filled with fluid. The membranes inside the cochlea respond mechanically to movements of the fluid (this is called sinusoidal stimulation). Lower-frequency sounds stimulate primarily the narrower end of the membrane and higher frequencies stimulate only the broader end. Each different sound, however, produces varying patterns of movement in the fluid and the membrane.
At the side of the cochlea nearest the brain stem are thousands of tiny hair cells, with ends both inside and outside the cochlea. The outer hair cells are connected to the auditory nerve fibers which lead to the auditory cortex of the brain. These hair cells respond to minute movements of the fluid in the membrane, and transduce the mechanical movements of the fluid into nerve (neural) activity.
As with other neural networks in the human body, these nerves have evolved to a high degree of specialization. Different auditory nerve fibers have different characteristic frequencies (CF) that they respond to. Fibers with high CFs are found in the periphery of the nerve bundle and there is an orderly decrease in CF toward the center of the nerve bundle (this is called tonotopic organization).
The distribution of the neural activity (as a function of CF) is called the excitation pattern, and this excitation pattern is the fundamental 'result' or 'output' of the human hearing mechanism. For instance, if you hear the word, "bye", there is a specific excitation pattern produced in response.
Not everyone hears the same thing, however, even though the excitation pattern for a particular tone will be similar in all of us. The difference in our perception is due to the fact that the individual neurones that make up the nerve fibers are interactive: they are affected by the action of other neurones. For example, before the excitation pattern caused by my saying "bye", you may have just heard the word "good": the closely juxtaposed excitation patterns from the two signals may partially overlap and be confused. Sometimes, the activity of one neurone is suppressed or amplified by the presence of a second tone. In addition, since these nerves are physical structures, they are affected by our general health and level of arousal or fatigue. Another factor that interferes with accurate hearing is that these nerves sometimes seem to fire randomly, even when no hearing stimulus is present. This is due to the fact that the auditory nerve is intertwined with the vestibular nerve, which helps us keep our balance.
How do we hear speech?
When sounds reach our inner ear and excite the auditory nerve, they are passed to the auditory cortex of the brain. Here we quickly, and practically automatically, classify them as speech or non-speech. If they are speech sounds, we begin phonological decoding.
The first step in decoding happens without conscious thought. This is the step of discriminating between sounds, or putting the sounds into categories. This is called categorical perception. As we acquire our first language as young children, largely by listening to people around us speak it, we acquire prototypes, or typical examples, for each of the sounds of our language. Gradually, over the course of the first few years of our lives, we begin to hear all speech sounds as falling into one of the fifty or so categories that our language has. The categories are phonemes, which are the smallest units of sound that can distinguish meaning in a language.
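The idea of hearing every incoming sound as an instance of its nearest prototype can be sketched as a toy program. This is only an illustration of the principle, not a model of how the brain does it. The two acoustic cues (two resonance frequencies in Hz) and all the prototype values are hypothetical numbers chosen for the example.

```python
import math

# Hypothetical prototypes: each category is a point in a two-cue space
# (two resonance frequencies in Hz). The values are illustrative only.
PROTOTYPES = {
    "i": (280.0, 2250.0),  # as in "sheep"
    "I": (400.0, 1900.0),  # as in "ship"
    "u": (310.0, 900.0),   # as in "boot"
}


def categorize(cues):
    """Return the label of the prototype closest to the heard cues."""
    return min(PROTOTYPES, key=lambda label: math.dist(cues, PROTOTYPES[label]))


# Two physically different tokens fall into the same category...
first_token = categorize((300.0, 2200.0))
second_token = categorize((260.0, 2350.0))
# ...while a third token is heard as a different category.
third_token = categorize((420.0, 1850.0))
print(first_token, second_token, third_token)
```

The point of the sketch is that the listener's output is a category label, not the raw acoustic values: the first two tokens differ physically but are reported as the same sound.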
These phonemes can be further classified as consonants and vowels. Although it is not necessary to know how these sounds are made in order to learn to hear them, it is most common to classify them by how and where they are produced:
The 22 consonants used in most varieties of English are fairly easy to classify, as they have an identifiable point of constriction and articulation:
labial (articulated at the lips): p, b, m
labio-dental (articulated at the lips and teeth): f, v
dental (articulated at the teeth): th, th
alveolar (articulated at the ridge behind the top teeth): t, d, s, z, n, l, r
palatal (articulated at the ridge along the roof of the mouth): y, sh, zh, ch, j
velar (articulated at the ridge of the mouth near the throat): k, g, ng
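The classification above can also be written as a simple lookup table, which makes it easy to check the count of 22 and to look up where a given consonant is articulated. This is a minimal sketch. The names are hypothetical, and the labels "zh" (for the sound in "measure") and the two example words distinguishing the "th" sounds are notational assumptions made here.

```python
# Place-of-articulation table, transcribed from the list above.
# Assumptions: "zh" stands for the sound in "measure"; the two "th"
# sounds are told apart by example words added for this sketch.
CONSONANTS_BY_PLACE = {
    "labial": ["p", "b", "m"],
    "labio-dental": ["f", "v"],
    "dental": ["th (thin)", "th (this)"],
    "alveolar": ["t", "d", "s", "z", "n", "l", "r"],
    "palatal": ["y", "sh", "zh", "ch", "j"],
    "velar": ["k", "g", "ng"],
}


def place_of(consonant):
    """Return the place of articulation for a consonant, or None if unlisted."""
    for place, sounds in CONSONANTS_BY_PLACE.items():
        if consonant in sounds:
            return place
    return None


total_consonants = sum(len(sounds) for sounds in CONSONANTS_BY_PLACE.values())
print(total_consonants, place_of("m"), place_of("ng"))
```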
Vowels are much less easy to classify, since the speech organs are more open and there is no clear point of contact. (As a result, there are many more dialect variations for vowels than for consonants.) Vowel sounds are often classified as open or closed (depending on how open the lips are when we make the sound), tense or lax (depending on contraction of tongue muscle) and front or back (depending on where the most constriction in the tongue is when we are making the sound). The following classification of the 10 vowels used in most varieties of English utilizes positional and tenseness features:
high-front/tense i (as in sheep)
high-front/lax I (as in ship)
mid-front/tense e (as in shape)
mid-front/lax E (as in kept)
low-front/tense æ (as in American English can)
mid/lax ə (as in about)
high-back/tense u (as in boot)
high-back/lax U (as in book)
mid-back/tense o (as in boat)
low-back/tense ɔ (as in bought)
These same sounds can be described from the perspective of the hearer, not as means of articulation by a speaker, but as physical sounds consisting of minute and subtle variations of loudness, frequency, and duration. However, from the hearer's perspective, speech sounds can only be perceived as part of the next larger units in speech, which are syllables.
Even this, of course, is not a very realistic hearing task, since virtually all of the speech we hear occurs as continuous, contextualized sounds in phrases or clauses. For example, if we take any phrase, such as "see you tomorrow", we can see its representation on a spectrogram (see Figure 2-2) as a continuing pattern of sound waves. These sound waves have dimensions of time, frequency, and loudness that vary from moment to moment.
Of course, we do not specifically hear rising or falling pitch glides, fluctuations in loudness, or lengthening of phones; we hear speech. While we do use information about pitch, duration, and loudness to determine what sounds we hear, we do this quickly and virtually automatically, with little conscious thought.
Figure 2-2 A spectrogram
/insert Figure 2-2 here/
[Caption] A spectrogram for the word "read". A spectrogram is a photograph of speech, showing the pattern of sound waves. From a spectrogram we can detect dimensions of (a) duration (length) of each speech sound, (b) frequency, and (c) loudness. Note that the sound /r/ has a low frequency, that the vowel sound /i/ is much longer than the other sounds, and that the /d/ sound is relatively louder than the other sounds.
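The three dimensions shown in a spectrogram (time, frequency, and loudness) can be illustrated with a short program that cuts a signal into brief overlapping frames and measures how much energy each frame has at each frequency. This is a rough sketch only. It uses a synthetic 1000 Hz tone rather than recorded speech, and the sample rate, frame size, overlap, and function name are arbitrary assumptions. Real spectrogram software uses a much faster algorithm than the plain transform written out here.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return one list of magnitudes (one per frequency bin) for each time frame."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # A Hann window tapers the frame edges to reduce spectral smearing.
        windowed = [
            x * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, x in enumerate(frame)
        ]
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            # Plain discrete Fourier transform for frequency bin k.
            total = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(total))
        frames.append(magnitudes)
    return frames


SAMPLE_RATE = 8000  # samples per second (assumed for this example)
# Synthetic test signal: a steady 1000 Hz tone lasting 400 samples.
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(400)]

frames = spectrogram(tone)
bin_width_hz = SAMPLE_RATE / 64  # frequency spacing between bins
# For each time frame, find the frequency bin with the greatest magnitude.
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in frames]
peak_hz = peak_bins[0] * bin_width_hz
print(len(frames), peak_hz)
```

Each row of the result corresponds to a moment in time, each position in the row to a frequency, and each value to the strength of the signal at that time and frequency. For a steady tone the strongest frequency is the same in every frame. For speech it would shift from frame to frame.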
How do we recognize phonemes?
Phonemes are considered the smallest units of speech that can be reliably produced and identified by speakers and hearers of a language.
However, in connected speech, individual phonemes usually cannot be isolated. If, for example, you were to record the word "sprint" on audio tape, you would find it virtually impossible (even with precision equipment) to cut the tape into phonemic segments of /s/ + /p/ + /r/ + /I/ + /n/ + /t/. As Figure 2-3 suggests, phonemic features overlap and are transmitted in parallel.
[figure 2-3]
/insert Figure 2-3 here/
[caption] Speech sounds overlap and influence each other, so we do not hear specific phonemes in isolation.
If you were to view a spectrogram of the word /sprint/, you would also find it quite difficult to identify where the sound formants for /s/ end and the sound formants for /p/ begin or where the vowel /I/ begins and ends. Sounds within the same utterance are colored by effects of co-articulation with other sounds; this is particularly so for sounds immediately next to each other (Liberman, 1970). When we listen to speech, we cannot anticipate hearing clear pronunciations of words since all phonemes change their features depending on the words or phrases they are part of. These changes are called allophonic variations and are the result of connected speech variations.
Another class of allophonic variations is accents, the forms of speech used in differing speech communities (Southern American, Western Scottish, Southern English, etc., within which there are many additional variations). When we listen to speakers whose accents differ from our own, we must make adjustments. With practice, we can usually learn to hear speaker differences in accent as allophonic variations.
TASK 2-1: Connected speech patterns.
Many variations can be described in terms of assimilation, reduction, and elision:
assimilation: nasalization, labialization, palatalization, glottalization, voicing, de-voicing, or lengthening that results from two sounds being pronounced in sequence
reduction: centering of vowels or weakening of consonants that results from a phoneme being in an unstressed syllable
elision: omission of individual phonemes that results from simplifying a cluster of sounds for easier pronunciation
Read the expressions below aloud, as you would in normal speech. Can you notice how the sounds indicated by the letters in bold are changed from their "ideal" form?
assimilations
has a nice shape: ni(s) shape
thieves stole most of them: thie(z)stole
there seems to be a mistake: see(m)z-to
was quite difficult: quite(t)ifficult
it can carry four people: i(c)can
owing to our negligence: owin(g)-to
didn't you see her: di(d)n(t)-chu
What's this: wat-s(th)is
who asked him: as(k)-t(h)im
your handbag: ham-bag
not that boy: tha(t)poy
reductions and elisions
where he lived: where (h)e lived
comfortable chair: comf(or)table
going to be here: go(i)n(gt)o be here
I'll pay for it: a(l) pay
given to them: given to (th)em
succeed in imagining: succeed in (i)magining
terrorist attack: terr(or)ist attack
in the environment: in the envir(on)ment
Comments on the task
You probably found yourself pronouncing these phrases in various ways until you could detect a sound change. It is important to note that often there is no single answer for any pronunciation item. We all speak somewhat differently. These differences are based on our physiological differences and on our dialect and also on how carefully we are trying to enunciate. Many of us may think that using "fast speech", full of assimilations and reductions, is sub-standard. This is not true. If we record and analyze speech, we discover that virtually all speakers of all varieties of English do use these patterns. These patterns are a result of the sound rules of the language, not of preference of the speakers!
How do stress and intonation influence speech perception?
Stress is an extremely important factor in speech perception. Stressed syllables are generally the best articulated syllables in each word. Therefore, stressed syllables provide islands of sound reliability in the normal blur of speech.
The vowels in stressed syllables, in particular, are usually longer and louder. More importantly, they tend to keep their full vowel value. By contrast, the vowels in unstressed syllables (reduced syllables at fast speaking rates) all tend to move towards a central or neutral vowel sound, like the /schwa/ sound in about.
For example, in the following utterance, the bold syllables will tend to be stressed and the vowels will be clearer. The vowels in the unstressed syllables will tend toward a neutral vowel sound.
The independence movement was something of a surprise to their opponents.
What this means is that if unstressed syllables change their form, they become somewhat useless for recognition purposes. We can't count on them to recognize the words they appear in. It is the stressed syllables that provide the best discriminatory content. Moreover, as we will discuss in the upcoming section on new and given information, the most important words in a discourse context will also tend to be stressed, while the function words (articles, conjunctions, auxiliary verbs) tend to be unstressed or reduced.
Like stress, pitch or intonation can influence speech perception. Sometimes, pitch correlates with stress: a speaker often raises pitch on the stressed syllable in a word or phrase. Sometimes, pitch can help us identify phrase or clause boundaries, since speakers typically drop in pitch at the end of a phrase or clause. At the same time, pitch has other functions in carrying emotional information that helps us to understand the speaker's meaning more fully.
CONCLUSION: Speech in context
One of the most obvious features of human language is that it occurs in sequences. The various sounds in English (or in any language) can be arranged in various sequences to form thousands of different words, and the various words in turn can be arranged in different sequences to form a nearly infinite number of phrases. This sequential principle of language entails problems of efficiency. It takes longer to produce multiple signals than a single signal. Therefore, for any language system to be efficient, the signals have to be brief and follow each other in very quick succession.
In conversational English, the average word (such as "sprint" or "olive" or "speech") has about five phones, or distinct sounds. Since most of us typically speak at a rate of about 150 words per minute, this means that we are producing 12.5 sounds per second, and as listeners, we are hearing 12.5 sounds per second. (These computations for other languages show similar results.)
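The arithmetic behind this figure can be checked with a few lines of code. The sketch below simply restates the numbers given above (150 words per minute, about five phones per word). The function name is illustrative, and the time-per-sound value is derived here rather than stated in the text.

```python
def sounds_per_second(words_per_minute, phones_per_word):
    """Convert a speaking rate in words per minute to sounds per second."""
    return words_per_minute * phones_per_word / 60


rate = sounds_per_second(150, 5)        # 750 sounds per minute / 60 seconds
milliseconds_per_sound = 1000 / rate    # average time available per sound
print(rate, milliseconds_per_sound)
```

At this rate each sound occupies, on average, well under a tenth of a second of the speech signal.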
As experiments have shown, however, the human auditory system cannot distinguish more than two or three sounds per second. Therefore, when we listen to language, we must depend on a sampling of sounds from the stream of speech. Based on this sampling and employing other information to predict likely sounds, we can still hear all of the sounds of language as someone speaks to us (Marslen-Wilson and Tyler, 1981).
As with other perception processes in the brain, speech perception involves processing at many different levels, and information at one level may be used to resolve problems at another level. For speech comprehension, information at the sound level may be re-analyzed on the basis of information learned at other levels: lexical, syntactic, semantic, or pragmatic. Even some details of the speech sound wave may be retained in 'echoic memory' so that re-analysis of the sound itself may be made.
The three dimensions of sound are to some extent redundant in the acoustic signal. Cues in one dimension (e.g. length) can often also be recovered from cues in other dimensions (e.g. frequency). Because of this redundancy, the listener needs to rely only on samples of features in the stream of speech in order to make sense of a speech signal. Even when there is a lot of background noise or when the speech signal is corrupted, we can usually still make sense of it.
All of this can be done, however, only with continuous speech in context. We cannot perform speech perception well with sounds, syllables, and words in isolation. The human auditory system has evolved to allow us to succeed in hearing speech, but only when we have a context to guide our interpretation.
PROJECT
Tape a short conversation. Listen to part of the conversation, perhaps two or three exchanges, carefully. Using the phoneme symbols in this chapter, transcribe that part of the conversation. Mark the most stressed (loudest) parts of each utterance. Underline any parts of the language that are reduced or assimilated, that is, which have some kind of allophonic variation.
To make your transcription reflect the way language is spoken, try transcribing it in pause units. Each time you hear a significant pause, start a new line of the transcription. (Don't use grammatical units, such as sentences or clauses to guide you.)