From published by the American Psychological Association

Seeing Is Hearing
“Speaking” computerized faces elucidate speech perception and have applications in multimedia, spoken language comprehension by the deaf, and foreign language learning

A research tool was all that psychologist Dominic Massaro and his team at the University of California, Santa Cruz, had wanted. They needed something to help them do basic perceptual laboratory work on how people perceive and recognize speech by eye and how, as listeners, they combine their visual perceptions of speakers with what they hear.
But now the research tools they developed, computerized “talking heads,” are attracting national attention. These talking heads appear to have many more potential applications in multimedia communications. And in the nearer term, maybe as near as five years, they could deliver important help to persons who have hearing impairments and those learning a second language.

A License to Call

New Jersey’s AT&T Bell Laboratories have perked up their sensitive antennas to the work Massaro and his research associate, Michael Cohen, have been doing. A five-person AT&T team visited the Santa Cruz lab late last year. One outcome has been a licensing agreement that provides Massaro’s group with AT&T software for speech synthesizers. AT&T is also adding somewhat to the program’s overall funding, which has come almost entirely from the National Institute on Deafness and Other Communication Disorders (NIDCD) since 1990. And now NIDCD has extended its support to cover the project for four more years.

Massaro says all this goes well beyond anything he and Cohen had in mind in the mid-1980s when they first conceived of their “talking heads.” At that time, Massaro and Cohen were using video clips of natural faces in their efforts to isolate “visual speech” - the visual perception of speech - from auditory speech.

Natural Isn’t Everything

But normal faces couldn’t give all the information that perceivers need, Massaro says. Moreover, he and his team wanted stimuli that could be controlled more rigorously than natural faces. This would allow the researchers to manipulate facial and lip movement precisely, as well as movement of the tongue and jaw. (The current underlying computer-controllable grid allows control of as many as 60 parameters.) They could also present complicated sounds, even sounds that contradict facial stimuli, while measuring subjects’ perceptions. Their goal was to develop a tool that would do as much for research on visible speech as synthetic speech was already doing for investigators into auditory speech perception.

“That's how the talking heads developed,” Massaro says, “as a tool to do perceptual work. But then, thanks to the serendipity in science, we soon saw there was a lot of value in talking heads, not just as an experimental tool but as a device that could help the hearing impaired and people learning a second language, and also in multimedia, human-machine interaction and many other applications in education and entertainment.” The possibilities are practically endless, he says.

Everyday Importance

One everyday indication of the value of visible speech is the fact that the hearing impaired can significantly augment their speech comprehension through lip reading, something that is also important to people with normal hearing. “I'm sure you've probably heard elderly friends or relatives say that they hear the television better with their glasses on,” Massaro says.

But in the 1980s Massaro looked long and hard to find research funding programs willing to support his development of an animated head that talks.

“We applied for money from the National Science Foundation and several other funding agencies in 1985 and in 1986,” Massaro said. Despite receiving good reviews, they were unsuccessful there and elsewhere.

They finally received support in 1990 from NIDCD, and the same institute will fund them for the next four years, “so we will be able to continue this work, refine it, and go in some slightly broader directions,” Massaro said.

Hearing the Picture

With NIDCD’s past support, today's state-of-the-art talking head is a computerized image that resembles a highly expressive mannequin. An underlying grid allows researchers to control about 60 parameters to animate the face and create other movements in speech. Researchers can manipulate the jaw, mouth, lips, and tongue to mimic the visible component of speech. (Massaro emphasizes that the basic design came from the 1970s doctoral dissertation of Fred Parke, a computer scientist; but Massaro didn't have the computers required to start the project until the mid-1980s.)

To start a session, researchers can type in English text of almost any length into the computer. It then produces the text as spoken language, complete with corresponding facial movements, pausing for a second or two between sentences.

But investigators can also program novel or ambiguous sounds, halfway between “ba” and “da,” for example. They can also program the talking head to say “doll” visually, for example, while the word “ball” is sounded audibly. The result in this case is that most people watching the talking head hear “wall.” Similarly, if a researcher makes an audible recording of the nonsense sentence, “My bab pop me poo brive” and dubs it into a video of the head saying “My gag kock me koo grive,” most viewers will report having heard “My dad taught me to drive.”

This is evidence that “people are always trying to impose meaning [on stimuli] at the highest level, even when they're given conflicting information,” Massaro explains. “Although you might expect people to ignore either the sound or the visible speech, in fact they use all the evidence and come up with the best solution. When there is inconsistent or ambiguous information, people will try to put all the pieces together in the way that makes the most sense.”
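The idea that perceivers "use all the evidence" rather than ignoring one channel can be illustrated with a small sketch. The article does not say which model the laboratory uses, so the rule below is an assumption chosen for illustration: each source assigns a degree of support between 0 and 1 to each candidate syllable, the supports are multiplied, and the products are normalized to sum to 1. The function name and all the numbers are hypothetical.

```python
def integrate(audio_support, visual_support):
    """Combine two sources of evidence about the same set of alternatives.

    Illustrative assumption (not taken from the article): support values
    are multiplied per alternative and then normalized to sum to 1.
    """
    combined = {
        alt: audio_support[alt] * visual_support[alt]
        for alt in audio_support
    }
    total = sum(combined.values())
    if total == 0:
        raise ValueError("no alternative has any support from both sources")
    return {alt: value / total for alt, value in combined.items()}


# Hypothetical case 1: the sound is ambiguous, halfway between "ba" and "da",
# while the face clearly favors "da". The visible speech decides the outcome.
ambiguous = integrate(
    audio_support={"ba": 0.5, "da": 0.5},
    visual_support={"ba": 0.1, "da": 0.9},
)

# Hypothetical case 2: sound and face conflict with equal strength.
# Neither channel is ignored, so the two alternatives end up tied.
conflicting = integrate(
    audio_support={"ba": 0.8, "da": 0.2},
    visual_support={"ba": 0.2, "da": 0.8},
)

print(ambiguous)
print(conflicting)
```

Under this assumed rule an ambiguous source contributes little and a clear source dominates, which is one simple way to picture how a perceiver might settle on "the best solution" from inconsistent cues.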

Research seems to support this contention, with some studies showing that listeners who rely solely on lip reading have a comprehension rate of about 25 percent. Those who receive only audio signals in an environment like a noisy cocktail party have a similar rate of comprehension. However, when the same listeners both lip read and receive audio messages, the rate of comprehension jumps to about 80 percent.

Prosopagnosia

Massaro’s talking heads have now started to appear in psychology laboratories in a few other parts of the world, for example in London with Ruth Campbell and at the University of Western Ontario with Mel Goodale. Both are using the tapes with prosopagnosic subjects, persons who have difficulty recognizing faces, even those of close relatives.

As to their interest in talking heads, AT&T laboratory heads haven't been talking much. One of the company's representatives who visited Massaro’s laboratory told the press, “I think this is an important technology for the future, but I don't think the future is quite here yet.”

Massaro himself is more open and sanguine. He sees talking heads being useful in computing, “so if, for example, you are using Microsoft Windows, a talking head could give you instructions. When you click on a menu, a talking head could read you the menu, or it could serve as an alerting device.”

In the Future

With four more years of funding now assured by NIDCD, Massaro says a priority is further work on a talking head that will give the hearing impaired more visible-speech information than they can get from natural faces.

Another goal is to bring affect and emotional expression to the talking heads by manipulating the eyes, eyebrows, and corners of the mouth, in part to determine whether people can discriminate between emotions on the basis of cues in the face. Graduate student John Ellison is involved in most of the work on emotion.

Massaro has been working in speech perception for 20 years, striving to uncover fundamental rules about the way the mind works with language. His general approach is to identify how people perceive and recognize patterns. One theme of this approach is how people use many different sources of information to do so. The sources may be ambiguous, but a perceiver pieces them together to interpret what the situation actually is, Massaro notes.

This general theoretical framework for describing the process of perception and pattern recognition also works in other language domains, Massaro says, such as reading and sentence interpretation. But pattern recognition also functions in situations like natural object recognition, cues to depth perception, and memory. The memory research is done by putting several cues together, like doing a crossword puzzle in which you work with a definition plus some letters from other words already written in.