Data Warehouse for Speech Perception and Model Testing

Massaro et al., 1993

Cross-linguistic influences

Paper:

Massaro, D.W., Cohen, M.M., Gesi, A., Heredia, R. & Tsuzaki, M. (1993). Bimodal Speech Perception: An Examination across Languages. Journal of Phonetics, 21, 445-478.

Description:

This archived research investigates the contribution of visible information in face-to-face communication and how it is combined with auditory information in bimodal speech perception. The experimental research methodology utilized a strategy of hypothesis testing, independent manipulation of multiple sources of information, and the testing of mathematical models against the results of individual participants. Synthetic speech allowed the auditory and visual signals to be manipulated directly, an experimental feature central to the study of psychophysics and perception. In addition, an expanded factorial design was used to study how auditory speech and visual speech are processed alone and in combination, and under different degrees of ambiguity. This design also provides a powerful test of quantitative models of perceptual recognition (Massaro, 1998). The paradigm allows a direct assessment of how several sources of information are used in pattern recognition. Experiments of this type have clarified the classic McGurk effect, assessed the contribution of segment frequency in the language, the psychophysical properties of the auditory and visual speech, and the relative influence of written text versus visible speech.

An informative manipulation in speech perception research is to systematically vary the ambiguity of each source of information in terms of how much it resembles each syllable. Synthetic speech (or at least a systematic modification of natural speech) is necessary to implement this manipulation. In several experiments on bimodal speech perception (Massaro et al., 1993; 1995), we used synthetic speech to cross five levels of audible speech varying between /ba/ and /da/ with five levels of visible speech varying between the same alternatives. We also included the unimodal test stimuli to implement the expanded factorial design, as shown in Figure 1.

Figure 1. Expansion of a typical factorial design to include auditory and visual conditions presented alone. The five levels along the auditory and visible continua represent auditory and visible speech syllables varying in equal physical steps between /ba/ and /da/.

Tokens of the first author's /ba/ and /da/ were analyzed by using linear prediction to derive a set of parameters for driving a software formant serial resonator speech synthesizer (Klatt, 1980). By altering the parametric information specifying the first 80 ms of the consonant-vowel syllable, a set of five 400-ms syllables covering the range from /ba/ to /da/ was created. The center and lower panels of Figure 2 show how some of the acoustic synthesis parameters changed over time for the most /ba/-like and /da/-like of the five auditory syllables. During the first 80 ms, the F1 went from 250 to 700 Hz following a negatively accelerated path. The F2 followed a negatively accelerated path to 1199 Hz, beginning with one of five values equally spaced between 1187 and 1437 Hz from most /ba/-like to most /da/-like, respectively. The F3 followed a linear transition to 2729 Hz from one of five values equally spaced between 2387 and 2637 Hz. All other stimulus characteristics were identical for the five auditory syllables.
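
Since the five auditory levels differ only in their F2 and F3 onset frequencies, and those onsets are equally spaced between the endpoints given above, the five values can be reproduced directly. A minimal sketch in Python (NumPy assumed; the linear spacing is implied by "equally spaced"):

```python
import numpy as np

# Five equally spaced onset values from the most /ba/-like (level 1) to the
# most /da/-like (level 5), using the endpoint frequencies quoted above.
f2_onsets = np.linspace(1187.0, 1437.0, 5)   # Hz; F2 then moves toward 1199 Hz
f3_onsets = np.linspace(2387.0, 2637.0, 5)   # Hz; F3 then moves toward 2729 Hz

for level, (f2, f3) in enumerate(zip(f2_onsets, f3_onsets), start=1):
    print(f"auditory level {level}: F2 onset {f2:.1f} Hz, F3 onset {f3:.1f} Hz")
```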

A parametrically controlled polygon topology was used to generate a fairly realistic animated facial display (Cohen & Massaro, 1990). The display was created by modeling the facial surface as a polyhedral object composed of about 900 small surfaces arranged in 3D and joined together at the edges. The face was animated by altering the location of various points in the grid under the control of 50 parameters, 11 of which were used for speech animation. The parameters used are jaw rotation, mouth x scale, mouth z offset, lip corner x width, mouth corner z offset, mouth corner x offset, mouth corner y offset, lower lip "f" tuck, upper lip raise, and x and z teeth offset. The top panel of Figure 2 shows how two of the visible synthesis parameters changed over time for the most /ba/-like and /da/-like of the five visible syllables. The animation was implemented on a Silicon Graphics Iris 3030 computer. To create an animation sequence, each frame was recorded with a broadcast-quality Betacam video recorder under control of the Iris.
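
The visible continuum was produced by varying these control parameters in equal physical steps between the /ba/ and /da/ targets (see the Figure 1 caption). As a rough illustration of that idea only, the sketch below blends a /ba/-like and a /da/-like parameter track to obtain an intermediate visible level; the jaw-rotation values are invented placeholders, not the tracks plotted in Figure 2.

```python
import numpy as np

def blend_tracks(ba_track, da_track, level, n_levels=5):
    """Linearly mix a /ba/-like and a /da/-like parameter track to obtain one of
    the visible levels (level 1 = most /ba/-like, level 5 = most /da/-like)."""
    w = (level - 1) / (n_levels - 1)
    return (1.0 - w) * np.asarray(ba_track) + w * np.asarray(da_track)

# Invented jaw-rotation tracks (in degrees) sampled at 12 video frames;
# the real parameter tracks are the ones shown in Figure 2.
jaw_ba = np.linspace(0.0, 10.0, 12)
jaw_da = np.linspace(0.0, 6.0, 12)

jaw_level3 = blend_tracks(jaw_ba, jaw_da, level=3)   # middle of the continuum
print(jaw_level3)
```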

The properties of the auditory stimulus were varied to give an auditory continuum between the syllables /ba/ and /da/. In an analogous fashion, properties of our animated face were varied to give a continuum between visual /ba/ and /da/. Five levels of audible speech varying between /ba/ and /da/ were crossed with five levels of visible speech varying between the same alternatives. In addition, the audible and visible speech were also presented alone. This gave a total of 25 + 5 + 5 = 35 independent stimulus conditions. Six random sequences were determined by sampling the 35 conditions without replacement, giving six different blocks of 35 trials. An experimental session consisted of these 6 blocks, preceded by 6 practice trials, with a short break between sessions. There were 4 sessions of testing for a total of 840 test trials (35 x 6 x 4). Thus there were 24 observations at each of the 35 unique experimental conditions. Participants were instructed to listen to and watch the speaker, and to identify the syllable as /ba/ or /da/. This experimental design was used with 24 English-speaking and 20 Spanish-speaking participants, and their results have served as a database for testing models of pattern recognition (Massaro, 1998). The trial-by-trial data from 3 of the English-speaking participants were omitted from the analyses and modeling.
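
The 35 stimulus conditions and the block structure described above can be written out directly. A minimal sketch of the design and randomization scheme in Python (the level coding, with 0 marking an absent modality, matches the data-file convention described below; the seed is taken from the sample header only for illustration):

```python
import random

# The 35 conditions of the expanded factorial design: 5 x 5 bimodal cells plus
# 5 auditory-alone and 5 visual-alone conditions (level 0 = modality absent).
bimodal   = [(v, a) for v in range(1, 6) for a in range(1, 6)]
aud_alone = [(0, a) for a in range(1, 6)]
vis_alone = [(v, 0) for v in range(1, 6)]
conditions = bimodal + aud_alone + vis_alone
assert len(conditions) == 35

def make_session(rng):
    """One session: 6 blocks, each a random permutation of the 35 conditions."""
    trials = []
    for _ in range(6):
        block = conditions[:]
        rng.shuffle(block)
        trials.extend(block)
    return trials

rng = random.Random(11111)
sessions = [make_session(rng) for _ in range(4)]
assert sum(len(s) for s in sessions) == 840   # 35 x 6 x 4 test trials
```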

Average results across individuals can distort the underlying pattern given by each individual (Massaro & Cohen, 1993; Massaro, 1998). Thus, our database archives individual participant results. Trial-by-trial results are given for each individual in the experiment. In our analyses, we computed the mean observed proportion of /da/ identifications for each of the 35 unimodal and bimodal conditions for each of the 41 participants.

Subjects:
62 subjects in 3 language groups (English, Spanish, Japanese); 2 responses; 35 conditions.

Design:

Expanded factorial design: 5 visual x 5 auditory bimodal conditions, plus 5 visual-alone and 5 auditory-alone conditions. Binary choice: /ba/ or /da/ response.

See the sample analysis program df8a2.f, which produced 6 ANOVA input files from the raw data files in this archive. The initial part of a typical data file is appended below.

Within a data file (from one experimental session): the header gives the subject numbers; up to 4 subjects were run at the same time. Each subject participated in 4 sessions, labeled RUN #1-4. Each line is one trial and contains the response and latency data.

The first 8 columns of each line are set aside for the independent variables; this experiment uses two. The first column gives the visual level and the second gives the auditory level, going from 1 for the most /ba/-like to 5 for the most /da/-like. For unimodal trials, either the visual or the auditory level is set to 0. The responses are 1 for /ba/ and 2 for /da/, and latencies are given in ms. There were 6 blocks of 35 = 210 trials per session, and each subject participated in 4 sessions.

Initial part of a typical data file, df8001.dat, which contains the first of 4 sessions for subjects 1-3:


 @DATA FILE : DF8001  # RT11A 
 @DF8 : b.6 2resp                         
 @EXP11    14-FEB-91  00:14:08
 @GROUP #  1  RUN #  1  SUBJECTS    1   2   3   0 SEEDS  11111 15151
   3  3  0  0  0  0  0  0   2   2   2   0   945  1335  1069     0
   1  5  0  0  0  0  0  0   2   2   2   0  1353  1107  1709     0
   2  4  0  0  0  0  0  0   2   2   2   0  1573  1142  1410     0
   1  3  0  0  0  0  0  0   1   2   1   0   958  1023   894     0

 -Independent Variables--   ---Response--  -----latency----------
 Vis Aud                   S1  S2  S3  S4    S1    S2    S3    S4
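
A sketch of how one such session file might be read and reduced to per-condition response proportions, written in Python rather than the archive's f77 code; the column layout follows the description above, and the function names are only illustrative:

```python
from collections import defaultdict

def read_session(path, n_subjects=3):
    """Parse one session file (e.g. df8001.dat) into a list of trial records."""
    trials = []
    with open(path) as f:
        for line in f:
            if not line.strip() or line.lstrip().startswith('@'):
                continue                                 # skip header lines
            fields = [int(x) for x in line.split()]
            vis, aud = fields[0], fields[1]              # 0 = modality absent
            responses = fields[8:12][:n_subjects]        # 1 = /ba/, 2 = /da/
            latencies = fields[12:16][:n_subjects]       # in ms
            trials.append((vis, aud, responses, latencies))
    return trials

def prop_da(trials, subject_index):
    """Proportion of /da/ responses per (visual, auditory) condition for one subject."""
    counts = defaultdict(lambda: [0, 0])                 # condition -> [n_da, n_total]
    for vis, aud, responses, _ in trials:
        counts[(vis, aud)][0] += (responses[subject_index] == 2)
        counts[(vis, aud)][1] += 1
    return {cond: n_da / n for cond, (n_da, n) in counts.items()}
```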

Output files from program df8a2.f:

aurs unimodal auditory responses
virs unimodal visual responses
birs bimodal responses

ANOVA outputs (using Pearlman's anova program):

aurso unimodal auditory responses
virso unimodal visual responses
birso bimodal responses

Data Files:
Click to download

Subject trial by trial data:
English subjects: mass93a_eng.zip
Spanish subjects: mass93a_spa.zip
The ZIP files include the f77 analysis program and the ANOVA input and output files.

Subject mean data:

Stimuli:
Click to download
QuickTime: 01_4.mov

Model Fitting:
Model fitting is done with STEPIT (Chandler, 1969). Click on mass93amods.zip for a ZIP file containing: analysis software for converting the trial-by-trial data to subject means (separately for each language group: df8a3e.f, df8a3s.f), the resulting data files (df8emd.dat, df8smd.dat), fuzzy logical model programs (mass93aem1.f, mass93asm1.f), weighted averaging model programs (mass93aem2.f, mass93asm2.f), the *.fit (observed and predicted) data files, and the *.lst files giving the best-fitting parameters, RMSD errors, and statistics (correlation and t-test) for the fit of each individual subject. In addition to the fits for each subject, an (N+1)st mean subject is fit, followed by the mean of the N individual-subject fits.
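
For readers without a Fortran toolchain, the core of the model fit can be sketched in a few lines. The fuzzy logical model of perception (FLMP) represents each auditory level i and each visual level j by a /da/-support value, predicts the unimodal /da/ probability as that support value itself, and predicts the bimodal probability as a_i v_j / (a_i v_j + (1 - a_i)(1 - v_j)). The sketch below fits those 10 parameters by minimizing RMSD, using scipy.optimize as a stand-in for STEPIT; the parameter layout follows the model's usual formulation rather than the exact structure of mass93aem1.f.

```python
import numpy as np
from scipy.optimize import minimize

def flmp_predict(params, conditions):
    """FLMP predictions; params = 5 auditory + 5 visual /da/-support values in (0, 1)."""
    a, v = params[:5], params[5:]
    preds = []
    for vis, aud in conditions:                        # levels 1-5, 0 = absent
        if vis == 0:                                   # auditory-alone trial
            preds.append(a[aud - 1])
        elif aud == 0:                                 # visual-alone trial
            preds.append(v[vis - 1])
        else:                                          # bimodal trial
            ai, vj = a[aud - 1], v[vis - 1]
            preds.append(ai * vj / (ai * vj + (1 - ai) * (1 - vj)))
    return np.array(preds)

def rmsd(params, conditions, observed):
    return np.sqrt(np.mean((flmp_predict(params, conditions) - observed) ** 2))

def fit_flmp(conditions, observed):
    """Minimize RMSD over the 10 free parameters (scipy stands in for STEPIT)."""
    x0 = np.full(10, 0.5)
    bounds = [(0.001, 0.999)] * 10
    res = minimize(rmsd, x0, args=(conditions, observed),
                   bounds=bounds, method="L-BFGS-B")
    return res.x, res.fun
```

A weighted-averaging variant of the same fit would instead predict w a_i + (1 - w) v_j for the bimodal cells, with the unimodal predictions unchanged.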