Bregler
In this paper we show how recognition performance in automated speech perception can be significantly improved by additional lipreading, so-called "speech-reading". We show this on an extension of an existing state-of-the-art speech recognition system, a modular MS-TDNN. The acoustic and visual speech data are preclassified in two separate front-end phoneme TDNNs and combined into acoustic-visual hypotheses for the Dynamic Time Warping algorithm. This is shown on a connected word recognition problem, the notoriously difficult letter spelling task. With speech-reading we could reduce the error rate to as little as half that of purely acoustic recognition.
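The combination step described in this abstract can be illustrated with a minimal sketch. It is not the MS-TDNN implementation: it assumes the two front ends have already produced per-frame phoneme scores, merges them with a fixed weighted sum (the `weight` parameter is a hypothetical choice), and scores each word with a simple dynamic-time-warping alignment. All function names and the toy lexicon are illustrative assumptions.

```python
import numpy as np


def combine_scores(acoustic, visual, weight=0.7):
    """Weighted sum of per-frame phoneme scores (weight is a hypothetical choice)."""
    return weight * acoustic + (1.0 - weight) * visual


def dtw_word_score(frame_scores, phoneme_sequence):
    """Best monotonic alignment score of all frames to a word's phoneme sequence.

    frame_scores has shape (frames, phonemes). Each frame is assigned to one
    phoneme of the word; the path may stay on the current phoneme or advance
    to the next one, and must end on the last phoneme.
    """
    n_frames = frame_scores.shape[0]
    n_states = len(phoneme_sequence)
    best = np.full((n_frames, n_states), -np.inf)
    best[0, 0] = frame_scores[0, phoneme_sequence[0]]
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = best[t - 1, s]
            advance = best[t - 1, s - 1] if s > 0 else -np.inf
            best[t, s] = max(stay, advance) + frame_scores[t, phoneme_sequence[s]]
    return best[n_frames - 1, n_states - 1]


def recognize(acoustic, visual, lexicon, weight=0.7):
    """Return the lexicon word whose phoneme sequence aligns best."""
    combined = combine_scores(acoustic, visual, weight)
    return max(lexicon, key=lambda word: dtw_word_score(combined, lexicon[word]))


# Toy data (invented for illustration): 4 frames, 3 phoneme classes.
acoustic = np.array([[0.9, 0.05, 0.05],
                     [0.8, 0.1, 0.1],
                     [0.1, 0.5, 0.4],
                     [0.1, 0.5, 0.4]])
visual = np.array([[0.8, 0.1, 0.1],
                   [0.8, 0.1, 0.1],
                   [0.1, 0.1, 0.8],
                   [0.1, 0.1, 0.8]])
lexicon = {"a": [0, 1], "b": [0, 2]}

acoustic_only = recognize(acoustic, visual, lexicon, weight=1.0)
acoustic_visual = recognize(acoustic, visual, lexicon, weight=0.7)
```

In this toy case the acoustic scores alone slightly favor the wrong second phoneme, while the visual scores disambiguate it, so the combined decision differs from the acoustic-only one.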
Most connectionist research has focused on learning mappings from one space to another (e.g., classification and regression). This paper introduces the more general task of learning constraint surfaces. It describes a simple but powerful architecture for learning and manipulating nonlinear surfaces from data. We demonstrate the technique on low-dimensional synthetic surfaces and compare it to nearest-neighbor approaches. We then show its utility in learning the space of lip images in a system for improving speech recognition by lip reading. This learned surface is used to improve the visual tracking performance during recognition.
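To make the idea of a constraint surface concrete, the following sketch shows only the nearest-neighbor baseline mentioned in this abstract, not the proposed architecture. It assumes a synthetic surface (the unit circle in the plane) and "projects" a query point onto it by returning the closest training sample. The function names and the choice of surface are illustrative assumptions.

```python
import numpy as np


def sample_circle(n_samples):
    """Sample points from a synthetic constraint surface: the unit circle."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False)
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)


def nearest_neighbor_projection(samples, query):
    """Approximate the projection onto the surface by the closest stored sample."""
    distances = np.linalg.norm(samples - query, axis=1)
    return samples[np.argmin(distances)]


samples = sample_circle(360)
query = np.array([2.0, 0.0])  # a point off the surface
projected = nearest_neighbor_projection(samples, query)
```

The accuracy of this baseline depends directly on how densely the surface is sampled, which is one reason a learned surface model can be preferable.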
In this study we improve the performance of a hybrid connectionist speech recognition system by incorporating visual information about the corresponding lip movements. Specifically, we investigate the benefits of adding visual features in the presence of additive noise and crosstalk (the cocktail party effect). Our study extends previous experiments by using a new visual front end and an alternative architecture for combining the visual and acoustic information. Furthermore, we have extended our recognizer to a multi-speaker, connected-letters recognizer. Our results show a significant improvement for the combined architecture (acoustic and visual information) over the acoustic-only system in the presence of additive noise and crosstalk.
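As an illustration of what "additive noise" means for such an experiment, the sketch below mixes a noise signal into a speech signal at a chosen signal-to-noise ratio. The abstract does not state how the noise was mixed or at which levels, so the SNR-based scaling, the function name `mix_at_snr`, and the synthetic signals are all assumptions made for this example.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the scale so that speech_power / (scale**2 * noise_power) = 10**(snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise


rng = np.random.default_rng(0)
speech = np.sin(2.0 * np.pi * 5.0 * np.linspace(0.0, 1.0, 8000))  # stand-in for a speech signal
noise = rng.standard_normal(8000)                                 # stand-in for additive noise
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

Crosstalk can be simulated the same way by passing a second speaker's signal as `noise`.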