From Wednesday, May 6, 1998

Lessons from a talking head
Baldi, a computerized teaching tool, helps deaf students understand visual and auditory mechanisms of speech.

By Oz Hopkins Koglin

The new teaching assistant in George Fortier’s classroom is a talking head named Baldi.

This guy teaches for free, never tires or takes breaks, and never goes home. Students love him; he's patient and is a good listener.

If Baldi seems a bit too dedicated, it’s because he’s, well, a tool – a marvel of spoken language technology.

Baldi is a listening, speaking, three-dimensional computerized talking head. When he speaks, his jaw, lips, tongue and facial movements are manipulated to mimic human speech.

But don’t mistake Baldi for just another animated computer image. Behind his self-assured exterior is a tool kit that meshes voice recognition and text-to-speech synthesis programs with his talking head.

In research circles, Baldi is known as a conversational agent. His teaching assignment at Portland's Tucker-Maxon Oral School is to help 8- to 12-year-old deaf children speed their use and understanding of the auditory and visual mechanisms associated with speech. Baldi is the key figure in a study by the Oregon Graduate Institute of Science and Technology’s Center for Spoken Language Understanding in Hillsboro. The study is financed by a $1.8 million grant from the National Science Foundation.

Ronald A. Cole, professor and director of the Center for Spoken Language Understanding and principal investigator in the study, said his vision is to provide the average person with language technologies that allow people to talk to computers.

“If our students from grade school to graduate school are going to be able to play with and understand and become developers of tomorrow's technology, then we have to put it in their hands and make it widely available,” Cole said.

Tucker-Maxon, founded in 1947 by four parents who wanted their deaf children to grow up speaking, now has 55 students who wear powerful hearing aids, cochlear implants or both. Cochlear implants allow deaf students to be aware of sounds by sending electrical signals to the auditory nerve. Tucker-Maxon offers a standard elementary school curriculum, and its goal is to help students transfer to schools for hearing children as soon as they are ready.

“We think all children can learn to talk,” said Patrick S. Stone, executive director of Tucker-Maxon. “We don’t use sign language.”

The school is recruiting four hearing fifth graders next fall to participate in the classroom with Baldi.

“It will give deaf children an opportunity to be in a classroom with hearing children, and hearing children will have the experience of being in a small classroom of 10 and they will have access to state-of-the-art technology,” Stone said.

Last fall, researchers trained Tucker-Maxon teachers to use the CSLU Toolkit that operates Baldi. Intel Corp. donated five top-of-the-line Pentium II computer platforms to the project.

To create conversation with Baldi, all the teacher has to do is use programs from the CSLU Toolkit. Teachers can type in the words they want Baldi to say and the words they want Baldi to recognize in response. Baldi’s speech comes from the text-to-speech Festival system, developed by a team at the University of Edinburgh in Scotland. It turns any English or Spanish text into intelligible speech. A facial animation program developed at the University of California, Santa Cruz, takes speech segments called phonemes, produced by the Festival system, and uses them to move Baldi’s lips, tongue and jaw, synchronizing movement to speech.

“Nobody has ever had an animated face like this, that accurately produces words that can be lip read,” said Fortier.

At Tucker-Maxon, Baldi asks questions as part of a word game, listens to the children's answers and tells them whether they are correct. When the answer is wrong, Baldi takes them through a training exercise until they arrive at the right answer. The topics can include whatever the students are studying, such as geography, science or history.

In the past, students have used computer programs that presented information with pictures and words, typed questions and answers, but they weren't always able to use what they learned in conversation, Fortier said.

“What I see now is that when we sit down and discuss concepts, the practice Baldi gives children enables them to understand others when they are talking about these concepts,” Fortier said. “The people recognize the words better and people understand their speech better.”

Baldi started out as a wire frame head model that Dominic W. Massaro, a professor and chairman of psychology at the University of California, Santa Cruz, and Michael M. Cohen, research associate, refined and used to measure how people put together information from a face, independent of voice.

“We can think of Baldi now as a puppet on about 60 strings and we control those strings over time so that Baldi says appropriate things, makes the appropriate mouth movements,” Massaro said.

Using texture mapping, Massaro and Cohen can wrap any still video picture over the framework to produce a more natural or familiar image. So, in the future, students might see their own faces on the screen, for instance.

In feedback to the Oregon Graduate Institute’s researchers, Fortier has passed along his students’ desire to have a pause button so they can stop Baldi and return to where they were without starting him again. And they would like to have real speech, rather than synthesized speech, which is something the researchers are working on.

Massaro, a cognitive psychologist who studies speech perception and comprehension, heads one of the few laboratories in the world using facial animation in the quest for understanding. He is the author of a new book, “Perceiving Talking Faces: From Speech Perception to a Behavioral Principle.”

The value of visible speech and talking heads extends well beyond therapy for the deaf, Massaro said. Auditory cues play a large role in comprehension, but we also rely on what we see when we hear. For example, some people don’t like to talk on the telephone because they don’t get the visual cues from people on the other end. And many elderly people say they “hear” the television better with their glasses on.

In laboratory studies, people with normal hearing are able to comprehend about 25 percent of a message when they rely solely on lip-reading. Those who receive only audio signals in a noisy environment, such as a cocktail party, do as poorly. But when the same research subjects lip-read and receive audio messages, the rate of comprehension jumps to about 80 percent.

“Traditionally, people thought about spoken language as simply being auditory, and what our research along with others has revealed is that people are very good at putting together many sources of information to make sense of a situation,” Massaro said.