Summerfield
The strategy for measuring speech-reception thresholds for sentences in noise advocated by Plomp and Mimpen (Audiology, 18, 43-52, 1979) was modified to create a reliable test for measuring the difficulty which listeners have in speech reception, both auditorily and audio-visually. The test materials consist of 10 lists of 15 short sentences of homogeneous intelligibility when presented acoustically, and of different, but still homogeneous, intelligibility when presented audio-visually, in white noise. Homogeneity was achieved by applying phonetic and linguistic principles at the stage of compilation, followed by pilot testing and balancing of properties. To run the test, lists are presented at signal-to-noise ratios (SNRs) determined by an up-down psychophysical rule so as to estimate auditory and audio-visual speech-reception thresholds, defined as the SNRs at which the three content words in each sentence are identified correctly on 50% of trials. These thresholds provide measures of a subject's speech-reception abilities. The difference between them provides a measure of the benefit received from vision. It is shown that this measure is closely related to the accuracy with which subjects lip-read words in sentences with no acoustical information. In data from normally hearing adults, the standard deviations (s.d.s) of estimates of auditory speech reception threshold in noise (SRTN), audio-visual SRTN, and visual benefit are 1.2, 2.0, and 2.3 dB, respectively. Graphs are provided with which to estimate the trade-off between reliability and the number of lists presented, and to assess the significance of deviant scores from individual subjects.
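The up-down rule described above can be illustrated with a minimal simulation. This is a sketch only, not the Plomp-and-Mimpen or Summerfield procedure itself: the logistic psychometric function, its slope, the step size, and all function names are assumptions introduced for illustration. A simple 1-up/1-down staircase converges on the SNR at which responses are correct on 50% of trials, which is how the speech-reception threshold is defined in the abstract.

```python
import math
import random

def logistic_pc(snr, threshold, slope=2.0):
    """Assumed psychometric function: probability of a correct
    response at a given SNR (dB), 50% correct at `threshold`."""
    return 1.0 / (1.0 + math.exp(-(snr - threshold) / slope))

def estimate_srtn(true_threshold=-6.0, start_snr=10.0, step=2.0,
                  n_trials=100, seed=7):
    """Simulate a 1-up/1-down adaptive staircase.

    SNR is made harder (lower) after a correct trial and easier
    (higher) after an incorrect one, so the track oscillates about
    the 50%-correct point; the threshold estimate is the mean SNR
    visited after an initial run-in.
    """
    rng = random.Random(seed)
    snr = start_snr
    visited = []
    for _ in range(n_trials):
        visited.append(snr)
        correct = rng.random() < logistic_pc(snr, true_threshold)
        snr += -step if correct else step
    run_in = n_trials // 4          # discard the descent toward threshold
    tracked = visited[run_in:]
    return sum(tracked) / len(tracked)
```

With a simulated true threshold of -6 dB SNR, the staircase estimate typically lands within a decibel or two of that value, mirroring the reported standard deviations of around 1-2 dB per list.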
Two signal-processing algorithms, derived from those described by Stubbs and Summerfield [R.J. Stubbs and A.Q. Summerfield, J. Acoust. Soc. Am. 84, 1236-1249 (1988)], were used to separate the voiced speech of two talkers speaking simultaneously, at similar intensities, in a single channel. Both algorithms use fundamental frequency (F0) as the basis for segregation. One attenuates the interfering voice by filtering the cepstrum of the signal. The other is a hybrid algorithm that combines cepstral filtering with the technique of harmonic selection [T.W. Parsons, J. Acoust. Soc. Am. 60, 911-918 (1976)]. The algorithms were evaluated and compared in perceptual experiments involving listeners with normal hearing and listeners with cochlear hearing impairments. In experiment 1 the processing was used to separate voiced sentences spoken on a monotone. Both algorithms gave significant increases in intelligibility to both groups of listeners. The improvements were equivalent to an increase of 3-4 dB in the effective signal-to-noise ratio (SNR). In experiment 2 the processing was used to separate voiced sentences spoken with time-varying intonation. For normal-hearing listeners, cepstral filtering gave a significant increase in intelligibility, while the hybrid algorithm gave an increase that was on the margins of significance (p = 0.06). The improvements were equivalent to an increase of 2-3 dB in the effective SNR. For impaired listeners, no intelligibility improvements were demonstrated with intoned sentences. The decrease in performance for intoned material is attributed to limitations of the algorithms when F0 is nonstationary.
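The cepstral-filtering idea can be sketched as follows. This is a simplified single-frame illustration under stated assumptions, not the Stubbs-and-Summerfield algorithm: the function names, the notch width, and the use of a single whole-signal frame are all hypothetical. A voice with fundamental frequency F0 produces a peak in the real cepstrum at the quefrency 1/F0; zeroing the cepstrum around the interferer's quefrency attenuates that voice's harmonic structure before resynthesis with the original phase.

```python
import numpy as np

def real_cepstrum(x):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    n = len(x)
    return np.fft.irfft(np.log(np.abs(np.fft.rfft(x)) + 1e-12), n)

def cepstral_notch(x, fs, f0_interferer, width=3):
    """Attenuate an interfering voice of known F0 (hypothetical sketch).

    Zeroes the cepstrum in a band around the interferer's quefrency,
    converts the filtered cepstrum back to a log magnitude spectrum,
    and resynthesises using the original phase spectrum.
    """
    n = len(x)
    spec = np.fft.rfft(x)
    cep = real_cepstrum(x)
    q = int(round(fs / f0_interferer))      # quefrency (samples) of interferer's F0
    for k in (q, n - q):                    # cepstrum is symmetric; notch both sides
        cep[max(0, k - width):k + width + 1] = 0.0
    filtered_log_mag = np.fft.rfft(cep).real
    mag = np.exp(filtered_log_mag)
    return np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n)
```

In practice the published algorithms operate frame by frame on short windows and require an F0 estimate for each frame, which is why, as the abstract notes, performance degrades when F0 is nonstationary.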
Procedures for enhancing the intelligibility of a target talker in the presence of a co-channel competing talker were evaluated in tests involving (i) continuously voiced sentences spoken on a monotone, (ii) continuously voiced sentences with time-varying intonation, and (iii) noncontinuously voiced sentences produced with natural intonation. The procedures were based on the methods of harmonic selection and cepstral filtering [R.J. Stubbs and A.Q. Summerfield, J. Acoust. Soc. Am. 87, 359-372 (1990)]. Target and competing voices were combined at signal-to-noise ratios (SNRs) between -10 dB and +10 dB. Subjects were a group with normal hearing and a heterogeneous group with mild-moderate cochlear hearing impairments. Processing enhanced the target voice over a range of SNRs for each type of sentence and for most listeners. Enhancement was greatest at negative SNRs. Among the impaired listeners, benefit was generally greater for those with milder losses. These results consolidate and extend previous demonstrations that voice-separation algorithms that exploit the harmonic structure of the voiced portions of speech can enhance intelligibility. However, practical application of such algorithms depends on a solution to the problem of tracking the fundamental-frequency contour of one voice in the presence of a competing voice.
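Harmonic selection, the second method named above, can be sketched in a similarly simplified form. This is an illustrative single-frame version under assumptions, not Parsons' or the authors' implementation; the function name, the pass-band half-width, and the assumption of a known, constant target F0 are all hypothetical. The idea is to pass only spectral energy lying near harmonics of the target talker's F0 and discard everything else, including the interferer's harmonics wherever they fall between those of the target.

```python
import numpy as np

def harmonic_select(x, fs, f0_target, half_width_hz=20.0):
    """Retain only spectral energy near harmonics of the target F0
    (hypothetical single-frame sketch; real systems work frame by frame)."""
    n = len(x)
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    # distance from each frequency bin to the nearest target harmonic
    nearest = f0_target * np.round(freqs / f0_target)
    keep = (np.abs(freqs - nearest) <= half_width_hz) & (freqs >= 0.5 * f0_target)
    return np.fft.irfft(spec * keep, n)
```

As the abstract's closing sentence emphasises, the hard part in practice is not this selection step but tracking the target's F0 contour reliably in the presence of the competing voice, since the mask above is only as good as the F0 estimate driving it.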
This paper reviews progress in understanding the psychology of lipreading and audio-visual speech perception. It considers four questions. What distinguishes better from poorer lipreaders? What are the effects of introducing a delay between the acoustical and optical speech signals? What have attempts to produce computer animations of talking faces contributed to our understanding of the visual cues that distinguish consonants and vowels? Finally, how should the process of audio-visual integration in speech perception be described; that is, how are the sights and sounds of talking faces represented at their conflux?
Two signal-processing procedures for separating the continuously voiced speech of competing talkers are described and evaluated. With competing sentences, each spoken on a monotone, the procedures improved the intelligibility of the target talker both for listeners with normal hearing and for listeners with moderate-to-severe hearing losses of cochlear origin. However, with intoned sentences, benefits were smaller for normal-hearing listeners and were inconsistent for impaired listeners. It is argued that smaller benefits arise with intoned sentences because harmonics of the two voices are blurred together during spectral analysis, limiting the extent to which spectral contrast can be recovered in the processed signal. This is particularly disadvantageous to impaired listeners who have reduced spectro-temporal resolution. This paper discusses other substantial problems to be overcome before the feasibility of the procedures as components of a speech-enhancement system for hearing-impaired listeners could be demonstrated.