Our talk presents some experimental results in describing the visual realism of hair dynamics. To obtain a computationally tractable hair dynamics model, it may be unavoidable to greatly simplify the rigorous physics, because of the large amount of hair and the extreme complexity and diversity of actual hair dynamics. In the proposed method, each hair strand is represented as a collection of linked segments and governed by the one-dimensional projective equations. For each segment of the hair strand, the differential equations describe the projective behavior of the segment. This means that the equations govern a pair of time-dependent unknown functions: the azimuth Θ(t) and the zenith Φ(t), which give the three-dimensional polar coordinates of the segment. The projective equations are then easy-to-solve second-order ordinary differential equations in Θ_i(t) and Φ_i(t) for the i-th segment, with 1 <= i <= k. Because the equations describe the azimuth and zenith functions independently, their discretization provides simple recurrence formulae. This assures fast generation of animated sequences and, more importantly, quick feedback in previewing.
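To make the idea of a "simple recurrence formula" concrete, the following is a minimal sketch, not the formulation used in the talk. It assumes, purely for illustration, that each angle of a segment obeys a generic damped equation of the form inertia * angle'' + damping * angle' = forcing, and it discretizes that equation with an explicit finite-difference scheme. The function names, the parameters (`inertia`, `damping`, `forcing_*`) and the form of the equation are all assumptions.

```python
# Illustrative sketch only: the ODE form and all parameter names are
# assumptions, not the projective equations of the method described above.

def step_angle(angle_now, angle_prev, inertia, damping, forcing, dt):
    """One explicit step of  inertia*a'' + damping*a' = forcing.

    Uses a central difference for a'' and a backward difference for a',
    which yields a recurrence in the two most recent angle values.
    """
    delta = angle_now - angle_prev
    return (angle_now + delta
            - (damping * dt / inertia) * delta
            + (dt * dt / inertia) * forcing)


def simulate_segment(azimuth0, zenith0, inertia, damping,
                     forcing_azimuth, forcing_zenith, dt, steps):
    """Advance the azimuth and zenith of one segment independently.

    The segment starts at rest. Returns a list of (azimuth, zenith) pairs,
    one per time level, including the initial state.
    """
    az_prev = az_now = azimuth0
    ze_prev = ze_now = zenith0
    history = [(az_now, ze_now)]
    for _ in range(steps):
        az_next = step_angle(az_now, az_prev, inertia, damping,
                             forcing_azimuth, dt)
        ze_next = step_angle(ze_now, ze_prev, inertia, damping,
                             forcing_zenith, dt)
        az_prev, az_now = az_now, az_next
        ze_prev, ze_now = ze_now, ze_next
        history.append((az_now, ze_now))
    return history
```

Because the two angles never appear in each other's update, each segment costs a handful of arithmetic operations per frame, which is consistent with the fast previewing described above.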
The dual problem in hair dynamics description is (1) which dynamics equations are suitable for computer graphics modeling, and (2) how to define and specify the external force to obtain a desired result. As described above, we employ the projective equations as a (tentative) answer to the first problem. As for the second problem, a discontinuous force field is introduced as the answer associated with the projective equations. The force field is considered to be an ``easy to define'' version of a spatially uniform external force field: it provides a constant vector, independent of position, during a certain period of time. Despite the ``rough'' approximation of hair dynamics by the formulation, the experimental results obtained illustrate the efficiency and descriptive power of the method. For example, several wind gust scenes, along with hair swaying according to human movement, were obtained by specifying the discontinuous force fields. In the previewing processes, only a few hundred hair strands were used for quick feedback. This did not, however, cause much difference between the preview and a full animation with tens of thousands of hairs, because the hair dynamics algorithm performs each hair strand calculation independently of the other hairs. The discontinuous force field in the method is rather simplified compared to an applied force field in existing physically based approaches. As shown with the examples, specifying the discontinuous force field means prescribing the rough directions in which the hair strands are going to move as time varies, rather than the physically correct force vectors. This allows a user to intuitively specify the hair movement.
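One way to picture a force field that is constant in space over each period of time, and jumps between periods, is as a lookup table keyed by time. The sketch below is an illustration under that reading only; the class name, its arguments and the zero default before the first interval are hypothetical and not taken from the method itself.

```python
from bisect import bisect_right


class PiecewiseConstantForceField:
    """Hypothetical force field: spatially uniform, piecewise constant in time."""

    def __init__(self, start_times, vectors, default=(0.0, 0.0, 0.0)):
        if len(start_times) != len(vectors):
            raise ValueError("need exactly one vector per start time")
        if list(start_times) != sorted(start_times):
            raise ValueError("start times must be in ascending order")
        self.start_times = list(start_times)
        self.vectors = list(vectors)
        self.default = default

    def force(self, t, position=None):
        """Return the force at time t; the position argument is ignored."""
        index = bisect_right(self.start_times, t) - 1
        if index < 0:
            return self.default  # before the first interval begins
        return self.vectors[index]


# Example: a gust along +x for one second, a lull, then a downward push.
gust = PiecewiseConstantForceField(
    [0.0, 1.0, 2.5],
    [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, -1.0, 0.0)],
)
```

Specifying an animation then amounts to listing a few times and rough directions, which matches the intuitive, direction-based control described above.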
Let us consider the problem of how to treat inter-hair effects, such as collision or friction between hairs. The projective equations involve the empirical rule concerning the inertia moment, which simply means that, in a hair animation, the hair segments near a pore tend to move relatively slowly. This is considered to roughly describe a frictional effect between hairs or between hairs and a head. As for the collision between hairs, the method neglects the collision detection calculation, for simplicity. Collisions of hair with a head model are drastically simplified using the concept of pseudo-force. This is a ``rough'' treatment of the collision phenomenon in that some hair strands are allowed to penetrate the head. The collision detection between hair and a human body or other objects is neglected in the method.
In our talk, a variation of the above method is also described, with an emphasis on collision detection between hair and a human body. In the alternative method, more accurate hair dynamics is considered, and a reaction-constraint technique is also used for fast collision detection between hair and the human body. The efficiency of the approach is demonstrated with the short animations obtained, including a head-shaking scene.
As a ``Speech Scientist'', I am mainly concerned with the visible aspects of facial gestures in the production of speech. It is well known that speech perception is dramatically enhanced by watching the speaker's face, especially when the acoustic signal is degraded. I will demonstrate this through presentation of recent intelligibility results obtained under unimodal and bimodal presentation conditions, with natural and synthetic faces. In fact, the originality of bimodal speech relies on the intrinsic coherence of the sources of information: the acoustic and the optic transmission of spoken information are simultaneously excited by the same source, e.g., geometric changes in the human vocal tract. Therefore, a synthetic face will be able to simulate speech only if the lip, jaw and (more generally) face gestures are strictly coherent with the acoustic utterance that is supposedly produced.
Our knowledge of the very complex articulatory commands humans make in order for their vocal tract to be properly controlled is as yet very crude. We can only control the few existing parametric models of the vocal tract in the production of steady vowels, or in the transitions between vowels, but we are still far from being able to anthropomorphically simulate the production of continuous speech. While it is of the first importance to continue making (a major) effort and (slow) progress in this area, we must now deal with the problem of synchronizing two different sources of information so that we can give the listener/viewer the illusion that both modalities are coherent.
In an attempt to partly solve this problem, we, at the ICP, have first focused on the image analysis/synthesis of talking faces in synchrony with the natural acoustic speech. I will present a geometrically-based parametric model of the lips that has recently been developed and the basic principles of a real-time analysis/synthesis demo (Angola, 1993). The lip model has been evaluated in terms of the intelligibility it adds to acoustically degraded natural speech, i) in isolation, ii) when superimposed on Parke's model of the whole face. Finally, these results are compared to those obtained with the original face of the speaker. Such an evaluation of a parametric model will shed some light on the general discussion on ``What parameters for which facial model?''
This study describes the design and implementation of a novel continuous speech recognizer that uses optical information from the oral-cavity shadow of a speaker. The system uses hidden Markov models (HMMs) trained to discriminate optical information and achieves a recognition rate of 25.3 percent on 150 test sentences. This is the first system to accomplish continuous optical automatic speech recognition (OASR). This level of performance - without the use of syntactical, semantic, or any other contextual guide to the recognition process - indicates that OASR may be used as a major supplement for robust multi-modal recognition in noisy environments. Additionally, new features important for OASR were discovered, and novel approaches to vector quantization, training, and clustering were utilized.
This study contains three major components. First, it hypothesizes 35 static and dynamic optical features to characterize the shadow of the oral cavity for the speaker. Using the corresponding correlation matrix and a principal component analysis, the study discarded 22 oral-cavity features. The remaining 13 oral-cavity features are mostly dynamic features, unlike the static features used by previous researchers. Second, the study merged phonemes that appear optically similar on the speaker's oral-cavity region into visemes. The visemes were objectively analyzed and discriminated using HMM and clustering algorithms. Most significantly, the visemes for the speaker, obtained through computation, are consistent with the phoneme-to-viseme mapping discussed by most lipreading experts. This similarity, in a sense, verifies the selection of oral-cavity features. Third, the study trained the HMMs to recognize, without a grammar, a set of sentences having a perplexity of 150, using visemes, trisemes (triplets of visemes), and generalized trisemes (clustered trisemes). The system achieved recognition rates of 2 percent, 12.7 percent, and 25.3 percent using, respectively, viseme HMMs, triseme HMMs, and generalized triseme HMMs.
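The relationship between phonemes, visemes and trisemes can be sketched in a few lines. This is an illustration only: the small phoneme-to-viseme table below is hypothetical (it is not the mapping computed in the study), and the sliding-window construction is an assumption based on reading ``triplets of visemes'' literally.

```python
# Hypothetical mapping for illustration; the study derived its own viseme
# classes by HMM analysis and clustering.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "aa": "open", "iy": "spread",
}


def to_visemes(phonemes):
    """Replace each phoneme by the viseme class it belongs to."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]


def to_trisemes(visemes):
    """Return every run of three consecutive visemes as a tuple."""
    return [tuple(visemes[i:i + 3]) for i in range(len(visemes) - 2)]
```

With this table, the phoneme sequences m-aa-p and b-aa-b collapse to the same viseme sequence, which illustrates why optical information alone leaves many words ambiguous and why context-dependent units such as trisemes may help.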
The study concludes that methodologies used in this investigation demonstrate the need for further research on continuous OASR and on the integration of optical information with other recognition methods. While this study focuses on the feasibility, validity, and segregated contribution of exclusively continuous OASR, future highly robust recognition systems should combine optical and acoustic information with syntactic, semantic and pragmatic aids.
Synthetic models of the head, face, facial features, and facial muscular action could make a valuable contribution to behavioral science research. The usefulness of such models in this area depends upon how well each model incorporates certain key parameters of the face as a signal system. This talk summarizes the important facts about facial signals. The focus of our research is on the expression and interpretation of signals about emotion. I present some specific examples of the kind of synthetic images that would aid our research. One of our projects is classifying facial muscular action with neural network tools. We are compiling a database of facial images to use in this project. I describe these images and suggest how they might be of interest to those working on the animation and modeling of the face and muscular action.
A position held by many is that face animation modeling methods should be motivated and guided by the potential applications. The variety of applications for face animation is potentially large but the number of basic, underlying approaches to produce face animation is probably small. Virtual reality is an example of a recent application area where face animation is becoming an important next step toward realism. It is fair to say that no virtual reality implementation has yet demonstrated facial expression animations that are indistinguishable from reality in the eyes of the human participant. Attempts at face animation in 2D and 2-1/2D computer graphics and in virtual reality have been either model driven or have used a teleoperator master-slave approach where synthetic faces are manipulated more or less directly by a human (like a ``Wizard of Oz''), with simple exaggerations to provide emphasis and interest. Future telecommunication technology will surely require compression schemes based on facial models. One research area in communications which will be enabled by good facial models is the application of transforms of facial and gesture models to achieve appropriate communication between disparate cultures. Autonomous and intelligent agents that interact with facial expressions in virtual worlds are another research area that may eventually emerge. Other research areas that may leverage facial model developments are human-computer interaction research, psychology, psychiatry, psychophysiology, cognitive science and cognitive neuroscience. These areas may have unique requirements for the definition and functionality of face models and model compression schemes.
At Drexel University we are motivated by basic neuroscience issues such as the relation of facial expression to neural activation in the brain and, in particular, the potential use of such models as intermediate steps in understanding human perceptual, cognitive and affective function and dysfunction. Our current interest is to determine the nature and type of facial modeling approaches that might be useful in such studies. Exploratory studies are in progress to identify and develop measurement techniques that are non-invasive and suitable to assess a range of interactive models, including those emerging from human-computer interaction research and physiological models emerging from electrical and magnetic field brain measurements. The latter approaches are closely related to PET and MRI imaging and brain function modeling and simulation. Potential interactive scenarios for research include individual humans working on computer-based tasks and computer-mediated interaction between humans remotely or in shared worlds. These need facial models.
In one effort, we are examining the use of structured light illumination to capture facial expression using video methods and triangulation. Labeling points in the rectangular grid illumination requires excessively long searches to resolve correspondences. An illuminating array of 45 by 45 colored dots with unique nearest-neighbors may solve the labeling problem and may be suitable for real-time studies of muscle action groups or master-slave face animation. The technique uses three colors in the visible spectrum but could be implemented in non-visible regions of the spectrum and with various scanning techniques. Multiple illumination sources and cameras may be used.
The impact on face models and facial expression recognition is that range data can be used to interpret gray scale or color information from the face as well as in the identification of active muscle action groups. To be useful, the raw data must be transformed into a normalized space to deal with sampling issues and individual differences. Important issues are the choice of the normalized space, coordinate representation and the nature and specificity of the transforms required for normalization. The normalized range data set could be included as preprocessed inputs to neural network models for classification and control schemes with or without the application of segmentation rules. Using range data alone as the basis for geometric or functional representation is insufficient in many cases since it cannot, in itself, always distinguish between different states of the facial skin and underlying muscle. Functional models that include muscle dynamics are required along with certain aspects of muscle action groups and perhaps of innervation pathways.
The task of identifying a minimal set of meaningful features for good perception of faces begs for methods and metrics to assess how humans actually perceive faces and what humans find meaningful (unless we limit ourselves merely to orthogonality and sensitivity properties). This problem is closely related to that addressed by neuroscience studies on human allocation of attention and resource utilization. A cross-disciplinary approach is required and interactive, real-time studies must be designed. Only by such an approach will we have reason for confidence in the criteria for an adequate notational system for computer graphics facial models. Only by such an approach will we be able to identify the vocabulary of signals (physical signals such as wrinkles, textures of the skin, elasticity of the skin, and so on, and expressive signals such as eye blinks, smile, frown and so on) that not only characterize faces and their motions but are also actually utilized by humans in their functioning perceptual models. Human expectations become important and tightly coupled, for example, when a human views a simulated image of his own face and head as in a mirror. In the extreme, this might suggest that aspects of human expectation and attention might need to be included in certain functional facial models. Alternatively, one might say that facial models (especially for pair-wise interactions) might need to be intelligent and able to be adaptive and sensitive to the course of the immediate dialog. The levels at which adaptive and intelligent functionality might occur are unknown. In any case, this line of argument leads to a mandate for the development and use of scientific discovery methods linked with real-time computational advances and modeling.
My interest in facial modelling and animation stems from the work I did as part of my doctorate. For the purposes of this work, faces were viewed as having two major functions: identification and communication. The communication aspect can be further broken down into verbal (speech) and non-verbal (expression) communication. Most of my work so far has been concentrated on the latter of these.
In order to investigate issues concerning the modelling and animation of faces, a system was implemented which integrates these two functions using a three-layer anatomical model comprising bone, muscle and surface features. The resulting system is called FACES, which is an acronym for the Facial Animation, Construction and Editing System (Patel and Willis 1991). Major issues addressed as part of the project concern the modelling of a variety of faces and their subsequent animation. The problem of providing adequate and effective control for the user (Parke 1991) has also been considered.
FACES is an interactive system which helps with both the generation and animation of faces, while hiding the structural complexities of the face from the user. The software consists of three sub-systems named: Construct, Modify and Animate. Construct and Modify cater for modelling functionality to enable creation of distinct faces. The Animate sub-system allows sequences which comprise facial movements to be generated. Further control is provided over facial colouration and motion evaluation. Several levels of control are available within both Construct and Modify. Essentially, the Construct sub-system deals with the bony structure of the head, while the Modify sub-system is concerned with skin, muscle and surface features. At a global level, changes can be made to overall proportions of the head and face. At a regional level, the head is considered in terms of three sections, so that modifications can be made to relative proportions. Local control facilitates amendments to individual bones, such as the zygomatic which is responsible for the prominence of the cheeks, as well as features such as the eyes, nose and lips.
The Animate sub-system caters for motion specification and control. At a basic level, facial movement is generated through simulation of muscular contraction (Waters 1987). However, since generation of facial movement through manipulation of individual muscles would be a cumbersome task, the user is provided with two higher levels of control. Facial actions may be specified using a `kit-of-parts' approach through selection from a repertoire of 31 Action Units which have been derived from the Facial Action Coding System (Ekman and Friesen 1978). These consist of actions such as `raise-inner-eyebrow', `jaw-drop' and `wrinkle-nose'. At an even higher level, predefined expressions such as happiness, sadness, disgust, anger, surprise and fear, may also be used. Such expressions provide control at an ``emotional'' level, while Action Units provide flexibility for animators to create their own effects.
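The three levels of control described above (muscles, Action Units, expressions) can be pictured as nested lookup tables. The sketch below is hypothetical and is not the FACES implementation: the muscle identifiers, the contraction values and the composition of ``surprise'' are placeholders chosen only to show the layering; only the Action Unit names come from the text.

```python
# Placeholder data: muscle ids and contraction amounts are invented for
# illustration and do not describe real anatomy or the FACES system.
ACTION_UNITS = {
    "raise-inner-eyebrow": {"muscle_01": 0.8},
    "jaw-drop": {"muscle_02": 0.6},
    "wrinkle-nose": {"muscle_03": 0.7, "muscle_04": 0.3},
}

# Hypothetical composition of one predefined expression from Action Units.
EXPRESSIONS = {
    "surprise": ["raise-inner-eyebrow", "jaw-drop"],
}


def muscle_contractions(action_units, intensity=1.0):
    """Expand a list of Action Unit names into per-muscle contractions."""
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    contractions = {}
    for name in action_units:
        for muscle, amount in ACTION_UNITS[name].items():
            total = contractions.get(muscle, 0.0) + intensity * amount
            contractions[muscle] = min(1.0, total)  # clamp to full contraction
    return contractions


def expression_contractions(expression, intensity=1.0):
    """Highest level of control: expand a named expression."""
    return muscle_contractions(EXPRESSIONS[expression], intensity)
```

An animator working at the ``emotional'' level would call only `expression_contractions`, while one building custom effects would pass a hand-picked list of Action Units to `muscle_contractions`.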
The experience of implementing this system has highlighted several outstanding problems, which it may be useful to ``air'' during the workshop. With regard to facial expression and communication, there are issues concerning the modelling of facial deformation, levels of control over facial actions, and timing and synchronization.
In the area of facial modelling and appearance there are also problems which need to be addressed. For example, facial deformation for modelling; types of faces and their classification and characteristics; the establishment of parameters for various parts of the face; higher levels of control such as age, race and gender characteristics; whether there are ``rules'' which indicate how a modification in one part of the face may affect another; appearance and recognition aspects also involve the modelling and rendering of hair, beards, moustache, spectacles etc.; and of course how control over all of these is to be provided for the user.
Current acoustic speech recognition technology performs well with very small vocabularies in noise or with large vocabularies in very low noise. Accurate acoustic speech recognition in noise with vocabularies over 100 words has yet to be achieved. Humans frequently lipread the visible facial speech articulations to enhance speech recognition, especially when the acoustic signal is degraded by noise or hearing impairment. Automatic lipreading has been found to significantly improve acoustic speech recognition and could be advantageous in noisy environments such as offices, aircraft and factories.
Several generations of automatic lipreading systems have been developed during the last decade. An overview of this work will be provided with a discussion of applications to face animation.
The conformed polygonal mesh forms the epidermal layer of a sophisticated physics-based model of facial tissue. An automatic algorithm constructs the multilayer synthetic skin and estimates an underlying skull substructure with a jointed jaw. Finally, the algorithm inserts synthetic muscles into the deepest layer of the facial tissue. These contractile actuators, which emulate the primary muscles of facial expression, generate forces that deform the synthetic tissue into meaningful expressions. To increase realism, we include constraints to emulate tissue incompressibility and to enable the tissue to slide over the skull without penetrating into it. The resulting animate models appear significantly more realistic than our earlier physics-based facial models.
We have developed a new approach to the analysis of dynamic facial images for the purposes of estimating and resynthesizing dynamic facial expressions. Motivated by the anatomically consistent musculature in our model, we consider the estimation of dynamic facial muscle contractions from video sequences of expressive faces. We develop an analysis technique that uses deformable contour models (snakes) to track the nonrigid motions of facial features in video. The technique estimates and encodes muscle actuator controls with sufficient accuracy to permit the face model to resynthesize transient expressions. Our model-based analysis/synthesis approach is potentially useful for performance driven animation and low bandwidth telecommunication.
In the field of computer generated characters, facial animation is difficult because of the complexity of the surface and the fact that humans have entire sections of their brains devoted to facial processing. Typically, the construction of new characters involves tedious and repetitive labor. An unwieldy level of detail is often needed for a realistic model. Apart from any geometric consideration, animating a new character often requires recoding or changing and retuning a large number of arcane parameters.
Langwidere integrates a hierarchical spline modeling system with simulated muscles based on local area surface deformation. The multi-level shape representation allows control over the extent of deformations, at the same time reducing the number of control vertices needed to define the surface. The head model is constructed from a closed surface (the surface also includes a rudimentary body) allowing the modeling of internal structures such as tongue and teeth, unlike some models that are just masks. Simulated muscles are attached to various levels of the surface with more rudimentary levels substituting for bone such as the skull and jaw.
The combination of a hierarchical model and simulated muscles provides precise, flexible surface control and supports easy creation of new characters without reprogramming.