This study describes the design and implementation of a novel continuous speech recognizer that uses optical information from the oral-cavity shadow of a speaker. The system uses hidden Markov models (HMMs) trained to discriminate optical information and achieves a recognition rate of 25.3 percent on 150 test sentences, making it the first system to accomplish continuous optical automatic speech recognition (OASR). This level of performance, obtained without syntactic, semantic, or any other contextual guide to the recognition process, indicates that OASR may serve as a major supplement for robust multimodal recognition in noisy environments. Additionally, new features important for OASR were discovered, and novel approaches to vector quantization, training, and clustering were employed.
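As a rough illustration of the vector-quantization step mentioned above, the sketch below builds a k-means codebook over optical feature vectors and maps each frame to its nearest codeword. The codebook size, iteration count, and function names are illustrative assumptions, not the study's actual implementation.

```python
import numpy as np

def build_codebook(frames, codebook_size=64, iters=20, seed=0):
    """k-means codebook over optical feature vectors (one row per frame).

    Hypothetical parameters: the study's actual codebook size and
    distance measure are not specified here.
    """
    frames = np.asarray(frames, dtype=float)
    rng = np.random.default_rng(seed)
    codebook = frames[rng.choice(len(frames), codebook_size, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest codeword (Euclidean distance).
        d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each codeword to the centroid of its assigned frames.
        for k in range(codebook_size):
            members = frames[labels == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook

def quantize(frames, codebook):
    """Map each feature vector to the index of its nearest codeword."""
    frames = np.asarray(frames, dtype=float)
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)
```

The resulting symbol sequences would then serve as the observation streams for discrete HMM training.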
This study contains three major components. First, it hypothesizes 35 static and dynamic optical features to characterize the shadow of the speaker's oral cavity. Using the corresponding correlation matrix and a principal component analysis, the study discarded 22 oral-cavity features. The remaining 13 oral-cavity features are mostly dynamic features, unlike the static features used by previous researchers. Second, the study merged phonemes that appear optically similar on the speaker's oral-cavity region into visemes. The visemes were objectively analyzed and discriminated using HMMs and clustering algorithms. Most significantly, the computationally derived visemes for the speaker are consistent with the phoneme-to-viseme mapping discussed by most lipreading experts; this agreement, in a sense, validates the selection of oral-cavity features. Third, the study trained HMMs to recognize, without a grammar, a set of sentences having a perplexity of 150, using visemes, trisemes (triplets of visemes), and generalized trisemes (clustered trisemes). The system achieved recognition rates of 2 percent, 12.7 percent, and 25.3 percent using viseme HMMs, triseme HMMs, and generalized-triseme HMMs, respectively.
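The phoneme-to-viseme merging described in the second component could, for example, be carried out by agglomerative clustering over a matrix of pairwise distances between trained phoneme models. The sketch below assumes such a distance matrix is already available and uses simple average linkage with an illustrative stopping threshold; the actual HMM-based distance measure used in the study is not reproduced here.

```python
import numpy as np

def cluster_visemes(dist, phonemes, threshold=0.5):
    """Average-linkage agglomerative clustering of phonemes into visemes.

    dist: symmetric (n x n) matrix of pairwise distances between phoneme
    models (assumed precomputed, e.g. from model confusions); threshold
    is an illustrative stopping criterion, not the study's.
    """
    clusters = [[i] for i in range(len(phonemes))]
    while len(clusters) > 1:
        # Find the closest pair of clusters under average linkage.
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.mean([dist[i, j] for i in clusters[a] for j in clusters[b]])
                if d < best:
                    best, pair = d, (a, b)
        if best > threshold:
            break  # remaining clusters are optically distinct
        a, b = pair
        clusters[a] += clusters.pop(b)
    return [[phonemes[i] for i in c] for c in clusters]
```

Each surviving cluster then plays the role of one viseme, and the same grouping can be extended to trisemes by clustering viseme triplets.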
The study concludes that the methodologies used in this investigation demonstrate the need for further research on continuous OASR and on the integration of optical information with other recognition methods. While this study focuses on the feasibility, validity, and isolated contribution of purely continuous OASR, future highly robust recognition systems should combine optical and acoustic information with syntactic, semantic, and pragmatic aids.
There is evidence that information from the oral-cavity region of a speaker's face can enhance the robustness of classical acoustic automatic speech recognition systems. We describe experimental data and research to determine the less correlated, yet discriminating, features of the oral-cavity region of a speaker for optical automatic speech recognition. We reduced our feature space from 35 to 13 features using a correlation matrix, a principal component analysis, and heuristics. We include a description of the database and describe previous research that helped us determine our initial features. This investigation demonstrates the importance of the dynamic aspects of the optical perception of certain facial speech-articulation features for speech recognition by humans and machines. These results should be of significant value for the design of more robust speech recognizers that use both optical and acoustic information, and for the teaching of lipreading to the hearing impaired.
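A minimal sketch of this kind of feature reduction, assuming a matrix of frame-by-frame measurements with one column per candidate feature: drop one of each highly correlated pair, then rank the survivors by their loadings on the leading principal components. The thresholds, component count, and function name below are placeholders, not the exact heuristics used in the study.

```python
import numpy as np

def reduce_features(X, corr_threshold=0.9, n_components=13):
    """X: (frames x features) matrix of oral-cavity measurements.

    Step 1: discard one feature from each pair whose absolute
    correlation exceeds corr_threshold (threshold is illustrative).
    Step 2: rank the remaining features by their weights on the
    leading principal components of the correlation matrix.
    """
    corr = np.corrcoef(X, rowvar=False)
    n = corr.shape[0]
    keep = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if i in keep and j in keep and abs(corr[i, j]) > corr_threshold:
                keep.remove(j)  # drop the later of the redundant pair
    sub = corr[np.ix_(keep, keep)]
    # Principal components of the surviving features' correlation matrix.
    eigvals, eigvecs = np.linalg.eigh(sub)
    order = np.argsort(eigvals)[::-1]
    # Score each surviving feature by its weight in the top components.
    k = min(n_components, len(keep))
    scores = np.abs(eigvecs[:, order[:k]]).sum(axis=1)
    ranked = [keep[i] for i in np.argsort(scores)[::-1]]
    return ranked[:n_components]
```

Under these assumptions, the returned indices would identify the 13 retained oral-cavity features from the original 35 candidates.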
Index Terms -- feature extraction, feature analysis, lipreading, facial expression, speech recognition, optical speech recognition, multimodal speech recognition.