From: malcolm@interval.com (Malcolm Slaney)
Date: Mon, 6 Mar 1995 14:20:17 -0800
Subject: How do Humans Process and Recognize Speech?
Message-Id: <v02110100ab8137d550e8@[192.203.7.70]>


I'm very pleased to announce that Jont Allen will be visiting CCRMA this
week.  Jont is a distinguished researcher at AT&T Bell Laboratories and has
been doing work on auditory perception for many years.  Some of you might
know him for his passionate views on cochlear models.

Most recently, he has revived the results of experiments on speech
perception performed many years ago by Harvey Fletcher.  He will be
describing Harvey's results, and drawing new conclusions based on the
wealth of data.

The experiments all consisted of trained listeners recognizing nonsense
words through degraded channels.  What can we tell about the perceptual
system when you filter out a portion of the waveform?  Is the information
in each frequency band as important as the others?  Can we say anything
about the organization of the system?

        Who:    Jont Allen
        What:   How do humans process and recognize speech
        When:   Thursday March 9 at 11AM
        Where:  CCRMA Library (Top Floor of the Knoll)

Be sure to come hear a fascinating speaker and what will undoubtedly be
a spirited discussion.

-- Malcolm
P.S.  We're actually blessed with two distinguished guests this week.  On
Friday morning, Bill Yost, head of the Parmly Hearing Institute at Loyola
in Chicago, will be speaking about temporal mechanisms of pitch perception!
More details to follow.


How do humans process and recognize speech?
J. B. Allen
Acoustics Research Dept. Rm2D553
AT&T Bell Labs
Murray Hill, NJ 07974

Until the performance of automatic speech recognition (ASR) hardware
surpasses human performance in accuracy and robustness, we stand to gain
by understanding the basic principles behind how humans recognize speech.
This problem was studied exhaustively at Bell Labs between 1919
and 1950 by Harvey Fletcher and his colleagues.
The motivation for these studies was to quantify the quality of speech
sounds in the telephone plant to improve speech
intelligibility and preference. To do this he and his group
studied the effects of filtering and noise on speech
recognition accuracy for nonsense consonant-vowel-consonant
(CVC) syllables, words, and sentences.  Fletcher coined the term
articulation as the probability of correct recognition
for nonsense sounds, and intelligibility as the probability
of correct recognition for words (sounds having meaning).
In 1919, Fletcher derived a linear speech articulation density
function $D(f)$ for the CVCs and found a formula that accurately predicts
the CVC and phoneme errors.  The average area of $D$, quantized to
critical bands, is called the Articulation Index.
My interpretation of these results is that humans extract
speech features in independent channels. I will drive this point
home with a dramatic demonstration of the McGurk effect.
Under ideal conditions the articulation density along the basilar membrane
is 20\%/mm.  Fletcher then went on to find relationships between the
recognition errors for the nonsense speech sounds, words, and sentences.
Once the AI is determined, the human recognition error can be accurately
predicted. The formulas for doing this will be discussed.
A possible resolution of the problem of coarticulation will be discussed.
This work has recently been reviewed and partially
replicated by Boothroyd and by Bronkhorst et al.
Taken as a whole, these studies tell us a great deal about
how humans process and recognize speech sounds.
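
The abstract does not state the formulas, so the following is only a rough
sketch of what "independent channels" could mean numerically. It assumes
that the errors of independent bands multiply, that phone articulation
follows s = 1 - e_min^AI, and that a CVC is correct only when all three of
its phones are. The function names and the value e_min = 0.015 are
illustrative assumptions, not figures taken from this announcement.

```python
import math

E_MIN = 0.015  # assumed residual phone error at AI = 1 (illustrative value)

def total_error(band_errors):
    """Error over all bands, assuming the bands err independently."""
    return math.prod(band_errors)

def phone_articulation(ai, e_min=E_MIN):
    """Assumed mapping from Articulation Index (0..1) to phone score."""
    return 1.0 - e_min ** ai

def cvc_articulation(s):
    """Assumed CVC score: all three phones recognized independently."""
    return s ** 3

# Splitting the full band into K equally important bands, each carrying
# AI = 1/K, gives the same total error as the full band under this model.
K = 20
band_error = E_MIN ** (1.0 / K)
full_band_error = total_error([band_error] * K)
```

Under these assumptions, removing bands lowers the AI and raises the
predicted error smoothly, which is the kind of prediction the filtering
experiments described above were designed to test.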