Methods for Hair Dynamics Description

Ken Anjyo
Hitachi Research Laboratory, Hitachi, Ltd.
email: anjyo@hrl.hitachi.co.jp

Tsuneya Kurihara
Yoshiaki Usami
Central Research Laboratory, Hitachi, Ltd.
email: kurihara@hrl.hitachi.co.jp

Our talk presents some experimental results in describing the visual realism of hair dynamics. To obtain a computationally tractable hair dynamics model, it may be unavoidable to greatly simplify the rigorous physics. This is because of the large amount of hair and the extreme complexity and diversity of actual hair dynamics. In the proposed method, each hair strand is represented as a collection of linked segments and governed by the one-dimensional projective equations. For each segment of the hair strand, the differential equations describe the projective behavior of the segment. This means that the equations govern a pair of time-dependent unknown functions: the azimuth Θ(t) and zenith Φ(t), which give the three-dimensional polar coordinates of the segment. The projective equations are then easy-to-solve second-order ordinary differential equations in Θi(t) and Φi(t), for the i-th segment with 1 <= i <= k. The equations describe the azimuth and zenith functions independently, so their discretization provides simple recurrence formulae. This assures fast generation of animated sequences and, more importantly, quick feedback in previewing.
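
As a rough illustration of how a second-order equation for one angle function (azimuth or zenith) of one segment becomes a recurrence, the sketch below discretizes an assumed equation of the form inertia * theta'' + damping * theta' = moment with finite differences. The form of the equation, the coefficient names, and the default values are placeholders chosen for illustration; they are not the projective equations of the talk.

```python
def step_angle(theta_prev, theta_curr, moment, inertia, damping, dt):
    """Advance one angle by one time step.

    Assumed model (illustrative only):
        inertia * theta'' + damping * theta' = moment
    discretized with
        theta'' ~ (theta_next - 2*theta_curr + theta_prev) / dt**2
        theta'  ~ (theta_curr - theta_prev) / dt
    and solved for theta_next.
    """
    velocity = (theta_curr - theta_prev) / dt
    return (2.0 * theta_curr - theta_prev
            + (dt * dt / inertia) * (moment - damping * velocity))


def simulate_segment(theta0, moments, inertia=1.0, damping=0.5, dt=0.05):
    """Run the recurrence for a segment that starts at rest at theta0.

    `moments` holds one applied moment per time step. The returned list
    contains the initial angle followed by one angle per step.
    """
    thetas = [theta0, theta0]  # two equal samples encode zero initial velocity
    for moment in moments:
        thetas.append(step_angle(thetas[-2], thetas[-1], moment,
                                 inertia, damping, dt))
    return thetas[1:]
```

Because each angle is advanced by its own recurrence, the same function could be called once for the azimuth and once for the zenith of every segment, which is what makes a quick preview with only a few strands cheap.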

The dual problem in hair dynamics description is (1) which dynamics equations are suitable for computer graphics modeling, and (2) how to define and specify the external force to obtain a desired result. As described above, we employ the projective equations as a (tentative) answer to the first problem. As for the second problem, a discontinuous force field is introduced as the answer associated with the projective equations. The force field is considered to be an ``easy to define'' version of a spatially uniform external force field, which is meant to provide a constant vector, independent of position, during a certain period of time. Despite the ``rough'' approximation of hair dynamics by the formulation, the experimental results obtained illustrate the efficiency and descriptive power of the method. For example, several wind gust scenes, along with hair swaying according to human movement, were obtained by specifying the discontinuous force fields. In the previewing processes, only a few hundred hair strands were used for quick feedback. This did not, however, cause much difference between the preview and a full animation with tens of thousands of hairs, because the hair dynamics algorithm performs each hair strand calculation independently of the other hairs. The discontinuous force field in the method is rather simplified compared to an applied force field in existing physically based approaches. As shown with the examples, specifying the discontinuous force field means prescribing the rough directions in which the hair strands are going to move as time varies, rather than the physically correct force vectors. This allows a user to intuitively specify the hair movement.
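
One simple way to picture such a force field is as a table of time intervals, each with its own constant vector. The sketch below is an assumption about how this could be coded, not the implementation used for the animations; the class name and the convention that the force is zero before the first interval are hypothetical.

```python
from bisect import bisect_right


class PiecewiseConstantForce:
    """Spatially uniform force that is constant on each time interval.

    vectors[i] applies for start_times[i] <= t < start_times[i + 1];
    the last vector applies from the last start time onward.
    """

    def __init__(self, start_times, vectors):
        if list(start_times) != sorted(start_times):
            raise ValueError("start_times must be sorted")
        if len(start_times) != len(vectors):
            raise ValueError("one vector is needed per start time")
        self.start_times = list(start_times)
        self.vectors = list(vectors)

    def __call__(self, t, position=None):
        # `position` is accepted but ignored: the field is spatially uniform.
        index = bisect_right(self.start_times, t) - 1
        if index < 0:
            return (0.0, 0.0, 0.0)  # assumed: no force before the first interval
        return self.vectors[index]


# Example: a gust along +x, a pause, then a gust along +y.
gust = PiecewiseConstantForce(
    [0.0, 1.0, 2.5],
    [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
)
```

The user only edits the list of start times and direction vectors, which matches the idea of prescribing rough directions over time rather than physically correct forces.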

Let us consider the problem of how to treat inter-hair effects, such as collision or friction between hairs. The projective equations involve an empirical rule concerning the inertia moment, which simply means that, in a hair animation, the hair segments near a pore tend to move relatively slowly. This is considered to roughly describe a frictional effect between hairs or between hairs and a head. As for the collision between hairs, the method neglects the collision detection calculation, for simplicity. Collisions of hair with a head model are drastically simplified using the concept of pseudo-force. This is a ``rough'' treatment of the collision phenomenon in that some hair strands are allowed to get into the head. The collision detection between hair and a human body or other objects is neglected in the method.

In our talk, a variation of the above method is also described, with an emphasis on collision detection between hair and a human body. In the alternative method, more accurate hair dynamics is considered, and a reaction-constraint technique is also used for fast collision detection between hair and the human body. The efficiency of the approach is demonstrated with the short animations obtained, including a head-shaking scene.


Christian BENOIT
Institut de la Communication Parlee
Universite Stendhal
Grenoble FRANCE
email: benoit@icp.grenet.Fr

As a ``Speech Scientist'', I am mainly concerned with the visible aspects of facial gestures in the production of speech. It is well known that speech perception is dramatically enhanced by watching the speaker's face, especially when the acoustic signal is degraded. I will demonstrate this through presentation of recent intelligibility results obtained under uni-modal and bi-modal presentation conditions, with natural and synthetic faces. In fact, the originality of bimodal speech relies on the intrinsic coherence of the sources of information: The acoustic and the optic transmission of spoken information are simultaneously excited by the same source, e.g., geometric changes in the human vocal tract. Therefore, a synthetic face will be able to simulate speech only if the lip, jaw and (more generally) face gestures are strictly coherent with the acoustic utterance that is supposedly produced.

Our knowledge of the very complex articulatory commands humans make in order for their vocal tract to be properly controlled is as yet very crude. We can only control the few existing parametric models of the vocal tract in the production of steady vowels, or in the transitions between vowels, but we are still far from being able to anthropomorphically simulate the production of continuous speech. While it is of the first importance to continue making (a major) effort and (slow) progress in this area, we must now deal with the problem of synchronizing two different sources of information so that we can give the listener/viewer the illusion that both modalities are coherent.

In an attempt to partly solve this problem, we, at the ICP, have first focused on the image analysis/synthesis of talking faces in synchrony with natural acoustic speech. I will present a geometrically-based parametric model of the lips that has recently been developed and the basic principles of a real-time analysis/synthesis demo (Angola, 1993). The lip model has been evaluated in terms of the intelligibility it adds to acoustically degraded natural speech, i) in isolation, ii) when superimposed on Parke's model of the whole face. Finally, these results are compared to those obtained with the original face of the speaker. Such an evaluation of a parametric model will shed some light on the general discussion of ``What parameters for which facial model?''


Coarticulated Synthetic Visual Speech from English Text

Michael M. Cohen & Dominic W. Massaro
Program in Experimental Psychology
68 Clark Kerr Hall
University of California - Santa Cruz
Santa Cruz, CA 95064, USA
Email: mmcohen@fuzzy.ucsc.edu

After describing the importance of visual information in speech perception and sketching the history of visual speech synthesis, we consider the problem of coarticulation in human speech. Coarticulation refers to changes in the articulation of a speech segment depending on preceding (backward coarticulation) and upcoming segments (forward coarticulation). An example of forward coarticulation is the anticipatory lip rounding at the beginning of the word ``stew''. An implementation of Lofqvist's (1990) gestural theory of speech production is described for visual speech synthesis. This approach entails overlapping dominance functions specifying the degree to which the speech articulators achieve phoneme target values. We also describe the graphically controlled development system for visual-auditory speech. Finally, we describe how MITalk is used to provide overall control for automatic text-to-visual/auditory speech.
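
The idea of overlapping dominance functions can be sketched as follows. Each segment gets a dominance curve that peaks at the segment's center and decays with distance in time, and the articulator parameter at any instant is the dominance-weighted average of the segment targets. The negative-exponential shape, the parameter names, and all numeric values are assumptions made for illustration; they are not taken from the talk.

```python
import math


def dominance(t, center, magnitude=1.0, rate=5.0):
    """Assumed dominance shape: peaks at `center`, decays exponentially."""
    return magnitude * math.exp(-rate * abs(t - center))


def blended_parameter(t, segments):
    """Dominance-weighted average of segment targets at time t.

    `segments` is a list of (center_time, target_value, magnitude, rate)
    tuples, one per speech segment, for a single articulator parameter
    such as lip rounding.
    """
    weights = [dominance(t, center, magnitude, rate)
               for center, _, magnitude, rate in segments]
    total = sum(weights)
    return sum(w * seg[1] for w, seg in zip(weights, segments)) / total


# Two hypothetical segments: an unrounded consonant, then a rounded vowel.
example_segments = [
    (0.0, 0.2, 1.0, 5.0),  # consonant: low lip-rounding target
    (0.3, 0.9, 1.0, 5.0),  # vowel: high lip-rounding target
]
```

In this sketch, giving the vowel a smaller `rate` widens its dominance curve, so its rounding target starts to influence the lips earlier, which is one way to obtain anticipatory rounding of the kind heard in ``stew''.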


Sandy Pentland & Irfan Essa

Perceptual Computing Section, MIT Media Lab
20 Ames Street,
Cambridge, MA 02139
email: sandy@media.mit.edu, irfan@media.mit.edu

We have developed a series of machine vision tools for tracking and analyzing human behavior and expression. These range from very robust but coarse real-time systems for tracking the human body, to analysis systems for accurate recovery of shape and motion (head, lips, cheeks, and eyes). We are currently developing a system for detailed interpretation of human facial motion within an active control framework. The goal of this work is to derive a model of muscle control that is firmly based on experimental data. Initial results show that current FACS and muscle models cannot easily account for real human face movement, but that there may be simple ``fixes'' that will make them more accurate.


Continuous Automatic Speech Recognition by Lipreading

Alan Jeffrey Goldschen
email: ajg@seas.gwu.edu

This study describes the design and implementation of a novel continuous speech recognizer that uses optical information from the oral-cavity shadow of a speaker. The system uses hidden Markov models (HMMs) trained to discriminate optical information and achieves a recognition rate of 25.3 percent on 150 test sentences. This is the first system to accomplish continuous optical automatic speech recognition (OASR). This level of performance - without the use of syntactical, semantic, or any other contextual guide to the recognition process - indicates that OASR may be used as a major supplement for robust multi-modal recognition in noisy environments. Additionally, new features important for OASR were discovered, and novel approaches to vector quantization, training, and clustering were utilized.

This study contains three major components. First, it hypothesizes 35 static and dynamic optical features to characterize the shadow of the oral cavity for the speaker. Using the corresponding correlation matrix and a principal component analysis, the study discarded 22 oral-cavity features. The remaining 13 oral-cavity features are mostly dynamic features, unlike the static features used by previous researchers. Second, the study merged phonemes that appear optically similar on the speaker's oral-cavity region into visemes. The visemes were objectively analyzed and discriminated using HMM and clustering algorithms. Most significantly, the visemes for the speaker, obtained through computation, are consistent with the phoneme-to-viseme mapping discussed by most lipreading experts. This similarity, in a sense, verifies the selection of oral-cavity features. Third, the study trained the HMMs to recognize, without a grammar, a set of sentences having a perplexity of 150, using visemes, trisemes (triplets of visemes), and generalized trisemes (clustered trisemes). The system achieved recognition rates of 2 percent, 12.7 percent, and 25.3 percent using, respectively, viseme HMMs, triseme HMMs, and generalized triseme HMMs.
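
The step from phonemes to visemes to trisemes can be pictured with a small sketch. The mapping table below is a tiny hypothetical excerpt (bilabials grouped together, labiodentals grouped together), and treating a triseme as a viseme together with its left and right neighbors, padded with a boundary symbol, is an assumption about the construction; the study's actual viseme classes came from HMM analysis and clustering.

```python
# Hypothetical excerpt of a phoneme-to-viseme table (illustrative only).
PHONEME_TO_VISEME = {
    "p": "P", "b": "P", "m": "P",   # bilabials look alike on the lips
    "f": "F", "v": "F",             # labiodentals look alike on the lips
    "aa": "A",
}


def to_visemes(phonemes, table=PHONEME_TO_VISEME):
    """Replace each phoneme by the viseme class it was merged into."""
    return [table[p] for p in phonemes]


def trisemes(visemes, boundary="#"):
    """Return one (left, center, right) triplet per viseme.

    The sequence is padded with `boundary` so that the first and last
    visemes also get a triplet.
    """
    padded = [boundary] + list(visemes) + [boundary]
    return [tuple(padded[i:i + 3]) for i in range(len(visemes))]
```

Clustering similar triplets into shared classes would then give the generalized trisemes; that step is not shown here.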

The study concludes that methodologies used in this investigation demonstrate the need for further research on continuous OASR and on the integration of optical information with other recognition methods. While this study focuses on the feasibility, validity, and segregated contribution of exclusively continuous OASR, future highly robust recognition systems should combine optical and acoustic information with syntactic, semantic and pragmatic aids.


Joseph C. Hager

P.O. Box 883843
San Francisco, CA 94188-3843
email: joehager@ucsfvm.ucsf.edu

Synthetic models of the head, face, facial features, and facial muscular action could make a valuable contribution to behavioral science research. The usefulness of such models in this area depends upon how well each model incorporates certain key parameters of the face as a signal system. This talk summarizes the important facts about facial signals. The focus of our research is on the expression and interpretation of signals about emotion. I present some specific examples of the kind of synthetic images that would aid our research. One of our projects is classifying facial muscular action with neural network tools. We are compiling a database of facial images to use in this project. I describe these images and suggest how they might be of interest to those working on the animation and modeling of the face and muscular action.


Face Models and Interactive Neuroscience

John Hestenes
Biomedical Engineering and Science Institute
Drexel University, 32nd and Chestnut Streets
Room 7-706, Philadelphia, PA 19104
email: jhestene@ece.drexel.edu

A position held by many is that face animation modeling methods should be motivated and guided by the potential applications. The variety of applications for face animation is potentially large, but the number of basic, underlying approaches to producing face animation is probably small. Virtual reality is an example of a recent application area where face animation is becoming an important next step toward realism. It is fair to say that no virtual reality implementation has yet demonstrated facial expression animations that are indistinguishable from reality in the eyes of the human participant. Attempts at face animation in 2 and 2-1/2 D computer graphics and in virtual reality have been either model driven or have used a teleoperator master-slave approach where synthetic faces are manipulated more or less directly by a human (like a ``Wizard of Oz''), with simple exaggerations to provide emphasis and interest. Future telecommunication technology will surely require compression schemes based on facial models. One research area in communications which will be enabled by good facial models is the application of transforms of facial and gesture models to achieve appropriate communication between disparate cultures. Autonomous and intelligent agents that interact with facial expressions in virtual worlds are another research area that may eventually emerge. Other research areas that may build on facial model developments are human-computer interaction research, psychology, psychiatry, psychophysiology, cognitive science and cognitive neuroscience. These areas may have unique requirements for the definition and functionality of face models and model compression schemes.

At Drexel University we are motivated by basic neuroscience issues such as the relation of facial expression to neural activation in the brain and, in particular, the potential use of such models as intermediate steps in understanding human perceptual, cognitive and affective function and dysfunction. Our current interest is to determine the nature and type of facial modeling approaches that might be useful in such studies. Exploratory studies are in progress to identify and develop measurement techniques that are non-invasive and suitable for assessing a range of interactive models, including those emerging from human-computer interaction research and physiological models emerging from electrical and magnetic field brain measurements. The latter approaches are closely related to PET and MRI imaging and brain function modeling and simulation. Potential interactive scenarios for research include individual humans working on computer-based tasks and computer-mediated interaction between humans remotely or in shared worlds. These need facial models.

In one effort, we are examining the use of structured light illumination to capture facial expression using video methods and triangulation. Labeling points in the rectangular grid illumination requires excessively long searches to resolve correspondences. An illuminating array of 45 by 45 colored dots with unique nearest-neighbors may solve the labeling problem and may be suitable for real-time studies of muscle action groups or master-slave face animation. The technique uses three colors in the visible spectrum but could be implemented in non-visible regions of the spectrum and with various scanning techniques. Multiple illumination sources and cameras may be used.
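
To make the labeling idea concrete, the sketch below checks whether a colored dot array has the property that every interior dot can be identified from its own color plus the colors of its eight surrounding dots. Reading ``unique nearest-neighbors'' as uniqueness of this 3 by 3 color window is an assumption, and the function names and the small test pattern are hypothetical.

```python
def neighborhood_signature(grid, row, col):
    """Colors of the 3 by 3 window centered on (row, col), row by row."""
    return tuple(grid[row + dr][col + dc]
                 for dr in (-1, 0, 1)
                 for dc in (-1, 0, 1))


def has_unique_neighborhoods(grid):
    """True if no two interior dots share the same 3 by 3 color window.

    `grid` is a list of equal-length rows; each row is a string or list
    of color codes such as "R", "G", "B".
    """
    seen = set()
    for row in range(1, len(grid) - 1):
        for col in range(1, len(grid[0]) - 1):
            signature = neighborhood_signature(grid, row, col)
            if signature in seen:
                return False
            seen.add(signature)
    return True


# A small hypothetical pattern that passes the check.
small_pattern = ["RGBR",
                 "GBRG",
                 "BRGB",
                 "RRGB"]
```

With three colors there are 3^9 = 19683 possible 3 by 3 windows, while a 45 by 45 array has 43 * 43 = 1849 interior dots, so under this reading a pattern with all windows distinct is at least numerically possible. When the property holds, a detected dot can be labeled by a table lookup on its window rather than by a long correspondence search.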

The impact on face models and facial expression recognition is that range data can be used to interpret gray scale or color information from the face as well as in the identification of active muscle action groups. To be useful, the raw data must be transformed into a normalized space to deal with sampling issues and individual differences. Important issues are the choice of the normalized space, coordinate representation and the nature and specificity of the transforms required for normalization. The normalized range data set could be included as preprocessed inputs to neural network models for classification and control schemes with or without the application of segmentation rules. Using range data alone as the basis for geometric or functional representation is insufficient in many cases since it cannot, in itself, always distinguish between different states of the facial skin and underlying muscle. Functional models that include muscle dynamics are required along with certain aspects of muscle action groups and perhaps of innervation pathways.

The task of identifying a minimal set of meaningful features for good perception of faces begs for methods and metrics to assess how humans actually perceive faces and what humans find meaningful (unless we limit ourselves merely to orthogonality and sensitivity properties). This problem is closely related to that addressed by neuroscience studies on human allocation of attention and resource utilization. A cross-disciplinary approach is required and interactive, real-time studies must be designed. Only by such an approach will we have reason for confidence in the criteria for an adequate notational system for computer graphics facial models. Only by such an approach will we be able to identify the vocabulary of signals (physical signals such as wrinkles, textures of the skin, elasticity of the skin, and so on, and expressive signals such as eye blinks, smile, frown and so on) that not only characterize faces and their motions but are also actually utilized by humans in their functioning perceptual models. Human expectations become important and tightly coupled, for example, when a human views a simulated image of his own face and head as in a mirror. In the extreme, this might suggest that aspects of human expectation and attention might need to be included in certain functional facial models. Alternatively, one might say that facial models (especially for pair-wise interactions) might need to be intelligent and able to be adaptive and sensitive to the course of the immediate dialog. The levels at which adaptive and intelligent functionality might occur are unknown. In any case, this line of argument leads to a mandate for the development and use of scientific discovery methods linked with real-time computational advances and modeling.


Multi-level Facial Animation System

Prem Kalra, Nadia Magnenat-Thalmann
MIRAlab, University of Geneva.
email: KALRA@uni2a.unige.ch

Our prototype system for facial animation encapsulates different groups of activities of facial expressions arising from speech and emotions. The multi-level configuration of system structure reduces complexity and provides independent control for each level. For the low level deformation controller, where we simulate muscle actions, we employ an approach based on rational free form deformations. This offers a simple and intuitive design for muscular activities. The high level controller provides the ability to define the animation in terms of more abstract entities such as sentences and emotions. A mechanism to control and synchronize the different actions is added. For realistic rendering, we use texture mapping. An emotion model is included to incorporate change in the color of a face due to vascular expressions. This allows us to provide for emotional clues such as paleness due to fear, or blushing due to embarrassment. The system is multimodal in the sense that it offers a platform for experimentation with various types of input accessories to interact and control the animation.
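
For readers unfamiliar with rational free form deformations, the following generic sketch evaluates one: a point with local coordinates (u, v, w) in a control lattice is mapped to a weighted, Bernstein-blended combination of the lattice control points, normalized by the blended weights. This is a textbook-style formulation written for illustration, with hypothetical function names; it is not the system's own code.

```python
from math import comb


def bernstein(n, i, t):
    """Bernstein polynomial B_{i,n}(t)."""
    return comb(n, i) * t ** i * (1 - t) ** (n - i)


def rational_ffd(u, v, w, control_points, weights):
    """Deform the point with lattice coordinates (u, v, w) in [0, 1]^3.

    control_points[i][j][k] is an (x, y, z) tuple and weights[i][j][k]
    is the positive weight attached to that control point.
    """
    l = len(control_points) - 1
    m = len(control_points[0]) - 1
    n = len(control_points[0][0]) - 1
    numerator = [0.0, 0.0, 0.0]
    denominator = 0.0
    for i in range(l + 1):
        for j in range(m + 1):
            for k in range(n + 1):
                blend = (weights[i][j][k] * bernstein(l, i, u)
                         * bernstein(m, j, v) * bernstein(n, k, w))
                denominator += blend
                for axis in range(3):
                    numerator[axis] += blend * control_points[i][j][k][axis]
    return tuple(value / denominator for value in numerator)


def unit_lattice():
    """2 x 2 x 2 lattice whose control points sit on the unit cube corners."""
    points = [[[(float(i), float(j), float(k)) for k in range(2)]
               for j in range(2)] for i in range(2)]
    unit_weights = [[[1.0 for _ in range(2)] for _ in range(2)]
                    for _ in range(2)]
    return points, unit_weights
```

With all weights equal to one and undisplaced control points the mapping is the identity. Moving a control point, or raising its weight, pulls nearby surface points toward it, which gives two kinds of handle for shaping a muscle-like bulge.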


Talking Heads Made Simple

A. E. Kaplan and S. Keshav
AT&T Bell Laboratories, Murray Hill, N.J.
email: aek@research.att.com

We present a minimal approach to facial animation for a text to speech application. A text to speech program was changed to add a synchronized ``talking head''. The ``talking head'' is actually a series of digitized photographs. The digitized photographs are aligned and normalized for intensity. The text to speech program generates a series of phones and their durations. (A phone is an acoustical concept corresponding to the linguistic concept of a phoneme.) A look-up table chooses a photograph corresponding to a phone and displays it for the phone duration generated by the text to speech program. (Additionally, eyes blink during silences of 150 milliseconds to 1 second. The blinking makes the head seem much more alive.) The suggested limited set of mouth positions (10) corresponds to facial positions obtained from a book on drawing cartoons [1]. No interpolation of facial position is done between phones. Despite the highly simplified approach we have taken, the ``talking heads'' seem fairly realistic. A larger set of facial positions and more than one facial position per phone are easily accomplished by trivial software modifications. A larger set of visemes might produce more realistic results. (A sequence of one or more facial positions corresponding to a phone or phoneme is referred to as a viseme [2].) We are in the process of doing experiments (with Alan Goldschen) on how well trained lip-readers comprehend the ``talking head's'' speech when no audio is played. We will compare this with a video of a human speaking (again with no audio). We may also try seeing how the number and character of the visemes chosen affect the intelligibility of the ``speech''.
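
The look-up scheme described above can be sketched in a few lines. The phone labels, file names, and the `sil` silence symbol are hypothetical; only the table look-up, the per-phone duration, and the 150 millisecond to 1 second blink window come from the description.

```python
def schedule_frames(phones, phone_to_photo, silence="sil",
                    blink_min=0.150, blink_max=1.0):
    """Turn (phone, duration) pairs into (photo, duration, blink) triples.

    `phone_to_photo` must contain an entry for `silence`, which is also
    used as the fallback picture for phones missing from the table.
    A blink is scheduled only during silences whose duration lies in
    the blink window. Durations are in seconds.
    """
    frames = []
    for phone, duration in phones:
        photo = phone_to_photo.get(phone, phone_to_photo[silence])
        blink = phone == silence and blink_min <= duration <= blink_max
        frames.append((photo, duration, blink))
    return frames


# Hypothetical table with three of the mouth positions.
photo_table = {"sil": "rest.png", "m": "closed.png", "aa": "open.png"}
```

Since no interpolation is done between phones, the player only needs to show each photograph for its duration; supporting more than one facial position per phone would amount to letting a table entry hold a short list of photographs.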

References

[1] Animation, Preston Blair, Walter T. Foster Art Books.
[2] Continuous Automatic Speech Recognition by Lipreading, Alan J. Goldschen, Ph.D. Dissertation, George Washington University, 1993.


Facial Modeling and Animation

Tsuneya KURIHARA and Kiyoshi ARAI
Central Research Laboratory, Hitachi, Ltd.,
1-280, Higashi-koigakubo, Kokubunji-shi, Tokyo 185 Japan.
kurihara@crl.hitachi.co.jp

The generation and animation of human facial images has many applications. Many techniques for facial animation have been investigated over the past 20 years. For entertainment and communication use, the important issues are:
  1. Efficient generation of individual facial models,
  2. Intuitive and interactive manipulation of facial expressions,
  3. Realistic rendering of the human face.
To solve these issues, we have developed facial animation techniques based on a 3-D canonical facial model and an interactive transformation method. Creating a 3-D facial model of a specific person is difficult and time-consuming because of the complex geometry of the human face. However, all human faces share the same structure and are broadly similar in shape. Therefore, we prepare a canonical (prototype) model of the human face and transform it into the face of a specific person. The canonical facial model plays an important role in facial animation.

The canonical facial model is represented by polygonal meshes (4500 polygons). Several control points are selected and their displacements are specified to transform the model. The proposed transformation method then interpolates the displacements of the remaining vertices. For this interpolation, the 3-D facial model is projected onto a 2-D parameter space using cylindrical projection. The 2-D parameter space is triangulated by the control points using Delaunay triangulation. The displacement of each vertex is determined by the weighted sum of the displacements of three neighboring control points. This transformation technique has two advantages: it is direct and intuitive because it can handle multiple control points directly and simultaneously, and it is fast because it depends on simple interpolation. One disadvantage of this method is that it may lead to first-derivative discontinuities. However, it is easy to add control points and their displacements to make the transformation smoother.

A wide variety of faces can be generated using the transformation method. In addition to the interactive transformation, we can use photographs of the specific person as a reference. Vertices that are important for transforming a facial model (such as the corners of the eyes and mouth) are specified as control points. The displacements of the control points are estimated using the photographs, after which the canonical facial model is transformed into the specified model.

The transformation technique is also applied to modifying facial expressions. We manipulate about 100 control points for the facial expression. Complex facial expressions can be generated because the transformation technique can handle multiple control points and is very fast. We have developed transformation elements for all Action Units of FACS (Facial Action Coding System). The combined expression of these Action Units is generated by composition of the transformation elements. A one-to-one mapping can be defined between the canonical model and the specified model using the 2-D parameter space, because the facial model of the specified person is transformed from the canonical facial model. Therefore, facial expression data for one person can be applied to another facial model using this mapping. This means that the facial expression data is represented independently of the facial geometry. In addition, we can morph a facial model and modify facial expression simultaneously.

Facial images are generated using texture mapping. When photographs of a face are used for the creation of the specific facial model, the same photographs can be used to generate the texture. The texture-mapped face can be animated in real time on a current graphics workstation. Hair is also represented by texture mapping if quick response is required. When a more realistic image of the hair is required, we model the hair as many line segments. The facial model of a specified person can be generated efficiently by interactive transformation or by using photographs as a reference, and then animated interactively with the proposed techniques.

All the aforementioned techniques were applied in the interactive computer graphics theater ``KA-O-RI''. ``KA-O-RI'' was performed in Tokyo during March 11-14, 1993. The play was 100 minutes long. The virtual actress KAORI and actual actors performed in real time according to the scenario. KAORI was manipulated by an actual actress and a few operators. The real actress talked, with her voice reaching the audience by microphone. Using a lip-synchronizer, the movements of her lips were translated into those of KAORI's lips. Other operators selected the typical facial expressions using dial button boxes. ``KA-O-RI'' is a joint project of Hitachi, Ltd. and Fuji Television Network, Inc. Planned future work includes creation of a more precise facial model using a 3-D laser scanner, real-time acquisition of facial expression using a video camera or voice, and real-time manipulation of the whole human body.
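
The displacement interpolation described earlier in this abstract (cylindrical projection, triangulation by control points, weighted sum over three control points) can be sketched for a single vertex and a single triangle. Using barycentric coordinates as the weights, taking the vertical axis as y, and the function names are all assumptions; finding the enclosing triangle in the Delaunay triangulation is omitted.

```python
import math


def cylindrical_uv(x, y, z):
    """Project a 3-D point to 2-D: (angle around the vertical y-axis, height)."""
    return math.atan2(x, z), y


def barycentric(p, a, b, c):
    """Barycentric weights of 2-D point p with respect to triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return wa, wb, 1.0 - wa - wb


def interpolate_displacement(p, triangle_uv, triangle_displacements):
    """Weighted sum of the three control-point displacements at point p.

    `triangle_uv` holds the 2-D parameter-space positions of the three
    control points; `triangle_displacements` holds their 3-D displacements.
    """
    weights = barycentric(p, *triangle_uv)
    return tuple(sum(w * d[axis]
                     for w, d in zip(weights, triangle_displacements))
                 for axis in range(3))
```

Weights of this kind vary linearly inside each triangle and change slope across triangle edges, which is consistent with the first-derivative discontinuities mentioned above and with the remedy of adding more control points.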


Pete Litwinowicz, Lance Williams
Apple
email: litwinow@apple.com, lance.w@applelink.apple.com

Our goal in animating faces is to retrieve facial expressions from a performer and transfer the motion from the actor to another character. In this way we are able to capture the nuances of a particular performer and reuse the digitized motion to animate many characters, in essence providing a clip motion library for facial animation. Most of the work presented will be 2D, but a preliminary 3D face will be shown. Finally, MacHeadroom will be shown, which combines precomputed facial animation sequences with synthesized speech to produce talking agents from input text.


Fred Parke
email: parke@futserv.austin.ibm.com

My current work and interests are centered on ``Conversational Interfaces'' which integrate speech recognition, speech analysis, realtime 3D facial animation, ``agent'' technologies, text-to-speech synthesis, and rapid creation of specific person 3D face models. The basic idea is to support two way spoken dialogs as a primary form of human-computer interaction.


Manjula Patel
School of Mathematical Sciences
University of Bath
Bath, AVON
U.K.
email: mp@maths.bath.ac.uk

My interest in facial modelling and animation stems from the work I did as part of my doctorate. For the purposes of this work, faces were viewed as having two major functions: identification and communication. The communication aspect can be further broken down into verbal (speech) and non-verbal (expression) communication. Most of my work so far has concentrated on the latter of these.

In order to investigate issues concerning the modelling and animation of faces, a system was implemented which integrates these two functions using a three-layer anatomical model comprising bone, muscle and surface features. The resulting system is called FACES, which is an acronym for the Facial Animation, Construction and Editing System (Patel and Willis 1991). Major issues addressed as part of the project concern the modelling of a variety of faces and their subsequent animation. The problem of providing adequate and effective control for the user (Parke 1991) has also been considered.

FACES is an interactive system which helps with both the generation and animation of faces, while hiding the structural complexities of the face from the user. The software consists of three sub-systems named: Construct, Modify and Animate. Construct and Modify cater for modelling functionality to enable creation of distinct faces. The Animate sub-system allows sequences which comprise facial movements to be generated. Further control is provided over facial colouration and motion evaluation. Several levels of control are available within both Construct and Modify. Essentially, the Construct sub-system deals with the bony structure of the head, while the Modify sub-system is concerned with skin, muscle and surface features. At a global level, changes can be made to overall proportions of the head and face. At a regional level, the head is considered in terms of three sections, so that modifications can be made to relative proportions. Local control facilitates amendments to individual bones, such as the zygomatic which is responsible for the prominence of the cheeks, as well as features such as the eyes, nose and lips.

The Animate sub-system caters for motion specification and control. At a basic level, facial movement is generated through simulation of muscular contraction (Waters 1987). However, since generation of facial movement through manipulation of individual muscles would be a cumbersome task, the user is provided with two higher levels of control. Facial actions may be specified using a `kit-of-parts' approach through selection from a repertoire of 31 Action Units which have been derived from the Facial Action Coding System (Ekman and Friesen 1978). These consist of actions such as `raise-inner-eyebrow', `jaw-drop' and `wrinkle-nose'. At an even higher level, predefined expressions such as happiness, sadness, disgust, anger, surprise and fear, may also be used. Such expressions provide control at an ``emotional'' level, while Action Units provide flexibility for animators to create their own effects.
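
The two higher levels of control can be pictured as a small data structure: Action Units are named building blocks with an activation level, and a predefined expression is simply a stored set of Action Unit activations. The activation values, the particular Action Units assigned to each expression, and the rule of keeping the larger level when two sets overlap are all hypothetical choices made for this sketch, not details of FACES.

```python
# Hypothetical predefined expressions, each a set of Action Unit
# activations in the range 0.0 (relaxed) to 1.0 (full action).
EXPRESSIONS = {
    "surprise": {"raise-inner-eyebrow": 1.0, "jaw-drop": 0.8},
    "disgust": {"wrinkle-nose": 1.0},
}


def combine(*action_unit_sets):
    """Merge several Action Unit sets into one.

    Where the same Action Unit appears more than once, the larger
    activation is kept, and every level is clamped to at most 1.0.
    """
    combined = {}
    for action_units in action_unit_sets:
        for name, level in action_units.items():
            combined[name] = min(1.0, max(combined.get(name, 0.0), level))
    return combined
```

An animator could start from `EXPRESSIONS["surprise"]` and layer individual Action Units on top with `combine`, which mirrors the split between ``emotional'' control and the kit-of-parts level.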

The experience of implementing this system has highlighted several outstanding problems, which it may be useful to ``air'' during the workshop. With regard to facial expression and communication, there are issues concerning the modelling of facial deformation, levels of control over facial actions, and timing and synchronization.

In the area of facial modelling and appearance there are also problems which need to be addressed. These include: facial deformation for modelling; types of faces and their classification and characteristics; the establishment of parameters for various parts of the face; higher levels of control such as age, race and gender characteristics; whether there are ``rules'' which indicate how a modification in one part of the face may affect another; the modelling and rendering of hair, beards, moustaches, spectacles etc., which bear on appearance and recognition; and of course how control over all of these is to be provided for the user.

References

P. Ekman and W. Friesen (1978) Manual for the Facial Action Coding System. Consulting Psychologists Press, Palo Alto, California
F.I. Parke (1991) Control Parameterization for Facial Animation. Proceedings Computer Animation '91, pp. 3--31
M. Patel and P.J. Willis (1991) FACES---The Facial Animation, Construction and Editing System. Proceedings Eurographics '91, pp. 33--45
K. Waters (1987) A Muscle Model for Animating Three-Dimensional Facial Expressions. Proceedings ACM SIGGRAPH 21(4):17--24


Two Agents Dialogue Simulation

Catherine Pelachaud, Norman Badler, Mark Steedman, Tripp Becket, Scott Prevost

Center for Human Modeling and Simulation (HMS)
email: pelachau@graphics.cis.upenn.edu
email: badler@central.cis.upenn.edu
email: steedman@linc.cis.upenn.edu
email: becket@graphics.cis.upenn.edu
email: prevost@linc.cis.upenn.edu

A current challenge to many applications of real-time animation is the absence of virtual human participants, represented as programs capable of autonomous reactions, including conversation. This paper presents preliminary results from a study of the construction of dialogue involving animated agents. We are particularly interested in the integration of facial gestures (smiling, frowning, nodding, eye movements, etc.) with spoken language. To limit the problems that arise from the involvement of human conversants, we present the work in the form of a dialogue generation program in which two copies of an identical program, having different knowledge of the world, must cooperate to accomplish a goal. Both agents collaborate via the dialogue to develop a simple plan of action. They interact with each other to exchange information, ask for a favor, and so on.

The appropriate dialogue is automatically generated as suggested by Richard Power (1977). Each utterance of the generated dialogue contains a specification of intonation. The model is based on Combinatory Categorial Grammar, a formalism allowing syntactic and semantic derivations that are structurally isomorphic to prosodic phrasing. Each utterance is output using a voice synthesizer.

We are particularly interested in the facial gestures accompanying such dialogues. We have characterized nonverbal signs by their syntactic definitions and their functional significance in transmitting information. The set of facial clusters contains:

We have begun to look more closely at the problem of eye movement. Eye behavior during a conversation shows a complex pattern. Eye movements can be classified into four main subclasses depending on their role in the conversation. Personality is an important factor in the occurrence of facial and gaze behavior. We partially embody this factor by varying the probability of occurrence of actions.
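
Varying the probability of occurrence of actions can be illustrated with a small sampling routine. This is a hedged sketch, not the authors' program: the personality labels, the action names, the probability values, and the assumption that actions are drawn independently per utterance are all hypothetical.

```python
import random

# Hypothetical personality profiles: the probability that each facial or
# gaze action accompanies an utterance. Labels and values are illustrative
# assumptions only.
PERSONALITIES = {
    "expressive": {"smile": 0.8, "nod": 0.6, "gaze_at_listener": 0.9},
    "reserved": {"smile": 0.2, "nod": 0.3, "gaze_at_listener": 0.4},
}


def sample_actions(personality, rng):
    """Return the actions that occur for one utterance.

    Each action is drawn independently with the probability given by the
    personality profile (independence is an assumption of this sketch).
    """
    profile = PERSONALITIES[personality]
    return [action for action, p in profile.items() if rng.random() < p]


def occurrence_rate(personality, action, trials=10000, seed=0):
    """Estimate how often an action occurs over many simulated utterances."""
    rng = random.Random(seed)
    hits = sum(action in sample_actions(personality, rng) for _ in range(trials))
    return hits / trials
```

With these placeholder profiles, the estimated smile rate of the "expressive" agent stays close to 0.8 and that of the "reserved" agent close to 0.2, so the same dialogue yields visibly different nonverbal behavior.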

Automatic lipreading to enhance speech recognition

Eric Petajan
AT&T Bell Laboratories
600 Mountain Ave. Rm 2B-231 Murray Hill
NJ 07974
email: edp@allegra.att.com

Current acoustic speech recognition technology performs well with very small vocabularies in noise or with large vocabularies in very low noise. Accurate acoustic speech recognition in noise with vocabularies over 100 words has yet to be achieved. Humans frequently lipread the visible facial speech articulations to enhance speech recognition, especially when the acoustic signal is degraded by noise or hearing impairment. Automatic lipreading has been found to significantly improve acoustic speech recognition and could be advantageous in noisy environments such as offices, aircraft and factories.

Several generations of automatic lipreading systems have been developed during the last decade. An overview of this work will be provided with a discussion of applications to face animation.


Steve Pieper

Medical Media Systems
Steve.Pieper@dartmouth.edu

The human face is an amazingly complex and subtle mechanical structure. Over the past several years, I've been trying to capture some aspects of this behavior in numerical simulations. I've been especially interested in the way the soft tissues of the face respond to both internal and external forces and in the way soft tissues are manipulated during surgical procedures. After experimenting with several simulation techniques, I found that I got the most satisfying results through the application of the Finite Element Method (FEM). Two major advantages of this method are the ability to work with the tissue as a continuum that may be sampled at various resolutions for analysis or display, and the existence of efficient and stable solution techniques.

I have built a prototype of a Computer-Aided Plastic Surgery (CAPS) system based on an FEM approach to soft tissue modelling. The CAPS system includes (1) tools for designing surgical incisions and closures on a graphical patient model, (2) a mesh generation algorithm to create custom continuum finite element meshes, (3) a solution module to predict post-operative tissue displacement as a function of material properties and wound closure constraints, (4) a visualization module to allow interactive exploration of the simulated surgical result using a texture and displacement mapping technique to display full resolution video surface scan (Cyberware) data on a lower resolution FEM mesh, and (5) a model of the force generating properties of muscles based on volumes of force within a continuum of the soft tissue.

In the future I hope that CAPS systems will advance to the point where a complete working model of a patient's face may be created before surgery, and that the exact muscle activations used by the patient can be applied to the post-operative patient model. This type of system would give the surgeon the ability to custom design a surgical procedure around the patient's use of the face as a functioning communication organ. That is, the surgery could better preserve the look of a smile or frown, and in that way help maintain the patient's identity over a range of expressions.


Realistic Facial Modeling and Facial Image Analysis

Demetri Terzopoulos
University of Toronto
email: dt@vis.toronto.edu

We have developed a highly automated approach to constructing realistic, working models of human heads for use in animation. These physics-based models are anatomically accurate and may be made to conform closely to specific individuals. We begin by scanning a person with a laser sensor which circles around the head to acquire detailed range and reflectance information. Next, an automatic conformation algorithm adapts a triangulated face mesh of predetermined topological structure to these data. The generic mesh, which is reusable with different individuals, reduces the range data to an efficient, polygonal approximation of the facial geometry and supports a high-resolution texture mapping of the skin reflectivity.

The conformed polygonal mesh forms the epidermal layer of a sophisticated physics-based model of facial tissue. An automatic algorithm constructs the multilayer synthetic skin and estimates an underlying skull substructure with a jointed jaw. Finally, the algorithm inserts synthetic muscles into the deepest layer of the facial tissue. These contractile actuators, which emulate the primary muscles of facial expression, generate forces that deform the synthetic tissue into meaningful expressions. To increase realism, we include constraints to emulate tissue incompressibility and to enable the tissue to slide over the skull without penetrating into it. The resulting animate models appear significantly more realistic than our earlier physics-based facial models.

We have developed a new approach to the analysis of dynamic facial images for the purposes of estimating and resynthesizing dynamic facial expressions. Motivated by the anatomically consistent musculature in our model, we consider the estimation of dynamic facial muscle contractions from video sequences of expressive faces. We develop an analysis technique that uses deformable contour models (snakes) to track the nonrigid motions of facial features in video. The technique estimates and encodes muscle actuator controls with sufficient accuracy to permit the face model to resynthesize transient expressions. Our model-based analysis/synthesis approach is potentially useful for performance driven animation and low bandwidth telecommunication.


Modelling Facial Transformation

Marie-Luce Viaud
Institut National de l'Audiovisuel, France
email: luce@ina.fr
Hussein Yahia
INRIA Rocquencourt, France
B.P. 105, 78153 Le Chesnay
email: hussein@bora.inria.fr

The observation of details, such as expressive wrinkles, marks of aging, and conversational signals in speech, is an important part of understanding human faces. We propose here a facial animation system integrating such details. We first deform a reference wrinkle mask, built as a spline surface, to match each observable wrinkle of a particular face. A dynamical system then controls facial animation: we associate a mesh of springs with this spline mask, and define forces corresponding to each muscle of the face. In our system, expressive wrinkles are modelled geometrically and linked to the system of forces. The shapes of the bulges also vary according to a parametrized function related to the age of the person whose face is being modelled. We present a simulation of the three phenomena which make people's faces look older: persistence of expressive wrinkles, changes in face shape (depending on gravity and skin loosening), and changes in skin texture (with the outbreak of micro-wrinkles).
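
A mesh of springs driven by muscle forces, as described above, can be illustrated with a generic damped mass-spring step. This is only a sketch under stated assumptions: the abstract gives no equations, so the Hookean spring law, unit node masses, velocity damping, semi-implicit Euler integration, and every name and constant below are illustrative choices rather than the authors' formulation.

```python
# Generic damped mass-spring sketch (assumed model, not the authors' system).
# Nodes are 2D points with unit mass, springs are Hookean, and a "muscle"
# is represented as a constant external force applied to a node.


def step(positions, velocities, springs, forces, fixed,
         k=10.0, damping=2.0, dt=0.01):
    """Advance the mesh by one time step using semi-implicit Euler.

    positions, velocities: lists of [x, y] per node, updated in place.
    springs: list of (i, j, rest_length) tuples.
    forces: list of (fx, fy) external (muscle) forces per node.
    fixed: set of node indices that never move (e.g. attached to bone).
    """
    total = [[fx, fy] for fx, fy in forces]
    for i, j, rest in springs:
        dx = positions[j][0] - positions[i][0]
        dy = positions[j][1] - positions[i][1]
        length = (dx * dx + dy * dy) ** 0.5
        if length == 0.0:
            continue  # direction undefined for coincident nodes
        f = k * (length - rest)  # Hooke's law: positive when stretched
        fx, fy = f * dx / length, f * dy / length
        total[i][0] += fx
        total[i][1] += fy
        total[j][0] -= fx
        total[j][1] -= fy
    for n in range(len(positions)):
        if n in fixed:
            continue
        for axis in (0, 1):
            acc = total[n][axis] - damping * velocities[n][axis]
            velocities[n][axis] += dt * acc
            positions[n][axis] += dt * velocities[n][axis]


def relax(positions, springs, forces, fixed, steps=5000, **params):
    """Run the simulation from rest for a number of steps; return positions."""
    velocities = [[0.0, 0.0] for _ in positions]
    for _ in range(steps):
        step(positions, velocities, springs, forces, fixed, **params)
    return positions
```

For a single spring with one end fixed and a constant pull of magnitude 1 on the free end, the free node settles where the spring force balances the pull, at an extension of 1/k beyond the rest length.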


Carol Leon-Yun Wang

University of Calgary, Department of Computer Science
wangc@cpsc.ucalgary.ca

Langwidere, a facial animation system, is intended to serve as the basis for a flexible system capable of imitating realistic characters and actions as well as creating the exaggerated and fantastic characters found in traditional animation.

In the field of computer generated characters, facial animation is difficult because of the complexity of the surface and the fact that humans have entire sections of their brains devoted to facial processing. Typically, the construction of new characters involves tedious and repetitive labor. An unwieldy level of detail is often needed for a realistic model. Apart from any geometric consideration, animating a new character often requires recoding or changing and retuning a large number of arcane parameters.

Langwidere integrates a hierarchical spline modeling system with simulated muscles based on local area surface deformation. The multi-level shape representation allows control over the extent of deformations, at the same time reducing the number of control vertices needed to define the surface. The head model is constructed from a closed surface (the surface also includes a rudimentary body) allowing the modeling of internal structures such as tongue and teeth, unlike some models that are just masks. Simulated muscles are attached to various levels of the surface with more rudimentary levels substituting for bone such as the skull and jaw.

The combination of a hierarchical model and simulated muscles provides precise, flexible surface control and supports easy creation of new characters without reprogramming.


Keith Waters
Cambridge Research Laboratory
Digital Equipment Corporation
1 Kendall Square
Building 700
MA 02139
email: waters@crl.dec.com

My face-based research has been focused on the following topics: