Dawson Margin Notes On Green Chapter 5

"Producing And Perceiving Speech"

by Peter Howell

Relating The Reading To The Lectures

I decided to include this chapter with lectures on "connectionist case studies" because speech perception is one area in which classical techniques have failed, and in which there is growing interest in connectionism. In the chapter you will find two different sections that briefly mention relevant PDP models. To strengthen the relationship between the reading and the lecture, you might consider how you would build a connectionist network to perceive speech. What would the input be like? What kind of network would you use? How would training proceed?

Margin Notes On The Chapter

What are the cognitive mechanisms for the production and understanding of speech?

Basic Background On Production And Perception Of Speech

"The notion of a phoneme is basic to the understanding of speech perception and production." Phonemes are the minimal sounds required to distinguish one word from another. So speech perception and production involve the processing of strings of phonemes. A key issue in understanding phonemes is understanding which articulators (vocal organs) are involved in producing them.

Three technical terms are relevant here. 1) The manner of production involves how close together two articulators are, and can be used (for example) to distinguish plosives from vowels. 2) Place of articulation: where in vocal tract do articulators come closest together? 3) Voicing: when air flows, do vocal cords vibrate (voiced) or not (voiceless)?

Vowels are all voiced; different vowels are distinguished by a) where tongue and roof of mouth come closest together, and b) how close the tongue is to the roof of the mouth.

How are articulatory configurations translated into sound? Let's consider voiced plosives as an example. "Recall that during voiced sounds, the air issuing from the lungs is modulated by the opening and closing of the vocal cords." Therefore there is a periodic buildup (and sudden decay) of energy (amplitude) in the system. "Besides representing the energy entering the vocal tract in terms of energy and time as in figure 5.2, the energy in this source of excitation can also be represented in terms of its frequency content." Frequency is calculated from the period of the periodic waveform. Fourier's theorem tells us that complex periodic waveforms can be described as the sum of simple sine wave components. Amplitude spectrum: graph of the amplitude of each sinusoid component of a complex waveform.
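Fourier's theorem can be made concrete with a few lines of code. Below is a minimal sketch (my own illustration, not from the chapter): a complex periodic waveform is built from three sine components, and its amplitude spectrum, computed with a Fourier transform, recovers exactly those component frequencies. All numbers (sampling rate, fundamental frequency, amplitudes) are assumed for the example.

```python
import numpy as np

fs = 8000                       # sampling rate in Hz (assumed)
t = np.arange(fs) / fs          # one second of time samples
f0 = 100                        # fundamental frequency; period = 1/f0

# A complex periodic waveform: the sum of three simple sine components.
wave = (1.0 * np.sin(2 * np.pi * f0 * t)
        + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)
        + 0.25 * np.sin(2 * np.pi * 3 * f0 * t))

# Amplitude spectrum: the magnitude of each sinusoidal component.
spectrum = np.abs(np.fft.rfft(wave)) / (len(wave) / 2)
freqs = np.fft.rfftfreq(len(wave), d=1 / fs)

# The three largest spectral peaks sit at the component frequencies.
peaks = freqs[np.argsort(spectrum)[-3:]]
print(sorted(peaks))            # → [100.0, 200.0, 300.0]
```

The amplitude spectrum is just this `spectrum`-versus-`freqs` plot: one bar per sinusoid component of the complex wave.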

The vocal tract filters these component frequencies. "The transmission of certain frequencies through a vocal tract with a given shape is better than that of others." The filtering characteristics of different articulatory configurations can be measured. "The regions in which good transmission occurs are called the formants (they are the resonant frequencies of the cavities of the vocal tract.) The formants are numbered from the lowest frequency up, F1 refers to the first (lowest) frequency formant, F2 to the second lowest frequency, and so on."
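The source-filter idea amounts to a pointwise multiplication: the output spectrum is the source spectrum times the vocal tract's transmission at each frequency. A toy sketch, with made-up numbers (none of these values come from the chapter):

```python
import numpy as np

# Assumed illustration: a glottal source whose amplitude falls off with
# frequency, and a vocal-tract filter that transmits formant regions best.
freqs = np.array([200, 400, 600, 800, 1000])    # Hz, harmonics of the source
source = np.array([1.0, 0.8, 0.6, 0.4, 0.2])    # source amplitude spectrum
filt = np.array([0.3, 1.0, 0.3, 0.2, 0.9])      # vocal-tract transmission

# Output spectrum = source spectrum x filter, frequency by frequency.
output = source * filt

# The strongest output frequency falls in a well-transmitted (formant) region.
print(freqs[np.argmax(output)])                 # → 400
```

Here the regions of good transmission (0.9-1.0 in `filt`) play the role of formants: energy near them survives into the output even though the source is strongest elsewhere.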

When the amplitude spectrum of a sound source is multiplied by the filtering properties of the articulatory configuration, this gives the amplitude properties of vocal output. The formant frequencies of the vocal tract change over time. "The spectrogram is a plot of the frequencies in the signal (vertical axis). The amplitude at a particular frequency and time is represented on a grey-scale (the darker the point, the higher the amplitude)." As the vocal tract changes in shape, the formants move over time. So, spectrograms reveal information about speech movements being made. Also, "the information in the spectrographic representation has been considered to be a clue about what information the speaker has available for making perceptual decisions about speech."
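A spectrogram is just a sequence of short-time amplitude spectra. The sketch below (my own illustration; a rising tone stands in for a moving formant) computes one from scratch by windowing the signal and taking a Fourier transform of each frame:

```python
import numpy as np

def spectrogram(signal, fs, win=256, hop=128):
    """Short-time amplitude spectra: rows = frequency bins, cols = time."""
    window = np.hanning(win)
    frames = [signal[i:i + win] * window
              for i in range(0, len(signal) - win + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T

fs = 8000
t = np.arange(fs) / fs
# A tone rising from ~500 Hz stands in for a formant moving over time.
chirp = np.sin(2 * np.pi * (500 + 1000 * t) * t)

spec = spectrogram(chirp, fs)
# The peak frequency bin rises across frames, tracking the "formant" movement.
print(spec[:, 0].argmax() < spec[:, -1].argmax())   # → True
```

On a real spectrogram the grey-scale value at each (time, frequency) point is this magnitude, which is why moving formants show up as dark rising or falling bands.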

Production Accounts

Basic production issue is variability. "We cannot find a direct relationship between that sound when it is spoken in different verbal contexts, and the resulting action at the muscular, articulatory, or acoustic levels." I.e., when the same speech sound is uttered in different contexts, there is not a single acoustic, muscular, or articulatory event involved. The same speech sound can be produced by a variety of events!

The locus theory has attempted to explain this variation. "The main assumptions in the theory are that fixed commands are issued whenever a particular phoneme is produced, and that variation arises because the speaker has to move his or her articulators from the position specified for one phoneme to that of the next, which will vary with what the adjacent contextual phonemes are." Locus theory has been successful, but does not account for all of the data. It doesn't account for some coarticulatory phenomena, and it doesn't explain the variation in speech spoken at different rates. In this latter case, "this restructuring of the commands implies that speakers do not use fixed muscle commands for a particular phoneme as required by locus theory."

An alternative approach is to explain constraints on when articulators can be positioned. "Henke argued for a look-ahead mechanism. The tenet behind the theory is that a speaker will start to position an articulator appropriately as early as possible."

Recurrent PDP networks have also been used to study coarticulation, because such networks can handle temporal stimuli. "The input to the network, for predicting coarticulation, is a plan of the phonemes to be produced. The output will be some feature of articulation...Since the networks are of the recurrent type, some of the output is fed back to the input." The network is trained with a version of backprop. In this network, the effect of later inputs is influenced by the network's current output, because of the recurrent connections feeding the output back to the input.
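The output-to-input feedback described above can be sketched as a tiny Jordan-style recurrent network. This is my own minimal illustration of the architecture, not the chapter's actual model: the phoneme plan is presented one phoneme at a time, the previous output (standing in for an articulatory feature) is concatenated with the current input, and the weights here are random rather than trained with backprop.

```python
import numpy as np

rng = np.random.default_rng(0)

n_phonemes, n_hidden, n_out = 4, 8, 1
W_in = rng.normal(size=(n_hidden, n_phonemes + n_out))  # input + feedback
W_out = rng.normal(size=(n_out, n_hidden))

def run(plan):
    """Process a phoneme plan (one-hot vectors), feeding output back to input."""
    feedback = np.zeros(n_out)
    outputs = []
    for phoneme in plan:
        x = np.concatenate([phoneme, feedback])  # plan + previous output
        h = np.tanh(W_in @ x)
        feedback = np.tanh(W_out @ h)            # "articulatory feature"
        outputs.append(feedback.copy())
    return outputs

# The same phoneme (index 0) in two contexts: plan is /0, 1, 0/.
out = run([np.eye(n_phonemes)[i] for i in (0, 1, 0)])
print(np.allclose(out[0], out[2]))   # → False
```

The point of the final line: the identical phoneme input produces different articulation depending on what preceded it, because the fed-back output carries context, which is exactly the coarticulation-like behavior such networks are used to model.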

Now let's turn to theories of speech perception. Key problem is "the perception of a particular phoneme in the face of different cues in different contexts." There aren't any invariant formant transitions.

One solution is the view that other acoustic signals are used. "One possibility is a brief burst of noise which occurs as the plosive is released." Unfortunately, such cues are also not invariant.

"The evidence indicates that a direct mapping between single acoustic cues and phoneme percept is untenable: both burst frequencies and formant transitions exhibit considerable contextual variability so no acoustic property can be identified which indicates what phoneme is being spoken." This leads to a very different approach: motor theory.

Assumption: phonemes are the consequence of fixed articulatory intent. So, "early versions of motor theory proposed that perception took place by interpreting the acoustic cues in terms of the muscular commands which, in turn, indicated what phoneme had been said whatever its context." Synthetic speech continua have been used to study this theory. For example, hold F1 constant, but artificially manipulate the leading edge of F2. Continuous manipulations of F2 lead to discrete (and abrupt) classifications of speech sounds by subjects: there is a "phoneme boundary" and categorical perception. Categorical perception was predicted by motor theory, because on that view it was supposed to be speech specific. But "views changed in the middle 1970s when it was reported that categorical perception occurs with certain nonspeech continua more complex than those used in the classic psychophysical studies." Even chinchillas show categorical perception!
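The shape of categorical perception data can be sketched numerically. Below is an illustrative simulation (all numbers are made up, including the boundary location and the /ba/-/da/ labels): a continuum of F2 onset values is identified via a steep sigmoid, so continuous acoustic change yields an abrupt switch in the reported phoneme.

```python
import numpy as np

# Synthetic F2-onset continuum, in Hz (assumed values for illustration).
f2_onset = np.linspace(1000, 2000, 11)

# Steep identification function: an assumed phoneme boundary at 1500 Hz.
boundary, steepness = 1500, 0.05
p_ba = 1 / (1 + np.exp(steepness * (f2_onset - boundary)))   # P(report "ba")

labels = np.where(p_ba > 0.5, "ba", "da")
print(list(labels))   # five "ba" responses, then an abrupt switch to "da"
```

Despite eleven equal-sized acoustic steps, listeners (here, the sigmoid) report essentially only two categories, with the change concentrated at the phoneme boundary; that is the categorical-perception pattern the continua experiments produce.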

Why, then, does categorical perception occur? Perhaps "there are invariant auditory properties that mediate phoneme perception." The auditory system does not respond equally to all points along a physical dimension. "Changes at some points on physical continua are barely noticeable, while other equally large differences at different points on the same continuum are clearly distinct. Stevens (1981) has argued that speech has evolved to take advantage of these perceptual discontinuities to distinguish one phoneme from another." This has led to "natural-sensitivities theory", but the evidence for this theory is not conclusive.

PDP approach, again with recurrent networks, has also been applied to speech perception. "In one of their studies, Elman and Zipser applied spectra of digitized real speech waveforms as inputs to their networks. The outputs that they required were the phoneme labels." Input was variable, but the network was able to learn the task. (NB: For some reason, this is really downplayed in the book!)
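A stripped-down stand-in for the Elman and Zipser setup can show why learning the mapping is feasible even with variable input. In this sketch (entirely my own toy example, not their network or data), "spectra" are noisy random vectors with a class-specific peak, the two "phoneme labels" are arbitrary, and a simple one-layer network is trained by gradient descent rather than their full backprop architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

n_bins, n_samples = 16, 200
X = rng.normal(0, 0.3, size=(n_samples, n_bins))   # noisy "spectra"
y = rng.integers(0, 2, n_samples)                  # phoneme label 0 or 1
X[y == 0, 3] += 1.0                                # class 0: low-freq peak
X[y == 1, 12] += 1.0                               # class 1: high-freq peak

# One-layer network (logistic regression) trained by gradient descent.
w, b = np.zeros(n_bins), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))             # predicted P(label 1)
    grad = p - y
    w -= 0.1 * X.T @ grad / n_samples
    b -= 0.1 * grad.mean()

acc = ((X @ w + b > 0).astype(int) == y).mean()
print(acc > 0.9)    # → True
```

Even though every individual input is variable (noisy), the network learns the spectrum-to-label mapping reliably, which is the qualitative result the chapter reports for Elman and Zipser's real-speech study.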
