From: malcolm@interval.com (Malcolm Slaney)
Date: Mon, 20 Mar 1995 11:28:56 -0800
Subject: Speech and Hands at CCRMA
Message-Id: <v02110100ab93720d0cce@[192.203.7.70]>


Speech continues to be in the forefront of the Hearing Seminar.  This week
Sid Fels will be presenting his work on GloveTalk, a system that translates
hand gestures into speech.

Sid gave one of the most interesting and entertaining talks at the NIPS
(Neural Information Processing Systems) meeting last fall.  This talk
describes his PhD thesis work with Geoffrey Hinton.  He uses neural nets to
map hand gestures into speech, and neural nets were also used in the original
training (mapping what was heard into hand gestures).  Hopefully we'll hear
about both this week at CCRMA.

This week we'll be meeting in the Ballroom (the AV facilities are better).

        Who:    Sid Fels (Virtex and formerly U. of Toronto)
        What:   GloveTalk (Neural Nets, Gesture Recognition, Speech Synthesis)
        When:   Thursday March 23 at 11AM
        Where:  CCRMA Ballroom (Main floor of the Knoll at Stanford)

See you at CCRMA.

-- Malcolm
P.S.  Seeing GloveTalk sing is pretty wonderful.  Sheila (Perry Cook's
system) doesn't have to worry about the aural competition but GloveTalk is
a wonderful sight.




Glove-TalkII is a system which translates hand gestures to speech through an
adaptive interface.  Hand gestures are mapped continuously to 10 control
parameters of a parallel formant speech synthesizer. The mapping allows the
hand to act as an artificial vocal tract that produces speech in real time.
This gives an unlimited vocabulary in addition to direct control of fundamental
frequency and volume. Currently, the best version of Glove-TalkII uses several
input devices (including a Cyberglove, a ContactGlove, a 3-space tracker, and a
foot-pedal), a parallel formant speech synthesizer and 3 neural networks.  The
gesture-to-speech task is divided into vowel and consonant production by using
a gating network to weight the outputs of a vowel and a consonant neural
network. The gating network and the consonant network are trained with
examples from the user. The vowel network implements a fixed, user-defined
relationship between hand-position and vowel sound and does not require any
training examples from the user. Volume, fundamental frequency and stop
consonants are produced with a fixed mapping from the input devices. One
subject (an accomplished pianist) has trained to speak intelligibly with
Glove-TalkII.  He speaks slowly but with far more natural-sounding pitch
variations than a text-to-speech synthesizer.
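
As a rough illustration of the gating idea described above, here is a minimal
Python sketch.  It is not the Glove-TalkII implementation.  It assumes the
gating network yields a single weight between 0 and 1 and that the vowel and
consonant outputs are blended as a convex combination; the abstract says only
that the gate "weights" the two outputs.  The function names and the stand-in
networks in the usage note are hypothetical.

```python
from typing import Callable, List, Sequence

# The abstract states that gestures map to 10 synthesizer control parameters.
NUM_PARAMS = 10

Network = Callable[[Sequence[float]], List[float]]
GatingNetwork = Callable[[Sequence[float]], float]


def blend_outputs(gate: float,
                  vowel_out: Sequence[float],
                  consonant_out: Sequence[float]) -> List[float]:
    """Weight the two network outputs by a gate value in [0, 1].

    ASSUMPTION: a convex combination (gate = 1 means pure vowel output,
    gate = 0 means pure consonant output).  The source does not give the
    exact weighting formula.
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    if len(vowel_out) != NUM_PARAMS or len(consonant_out) != NUM_PARAMS:
        raise ValueError("each network must produce NUM_PARAMS values")
    return [gate * v + (1.0 - gate) * c
            for v, c in zip(vowel_out, consonant_out)]


def gesture_to_params(hand_features: Sequence[float],
                      gating_net: GatingNetwork,
                      vowel_net: Network,
                      consonant_net: Network) -> List[float]:
    """Map one frame of hand features to blended control parameters.

    The three callables are placeholders for the gating, vowel, and
    consonant networks; their internals are not modeled here.
    """
    gate = gating_net(hand_features)
    return blend_outputs(gate,
                         vowel_net(hand_features),
                         consonant_net(hand_features))
```

For example, with stand-in networks such as gating_net = lambda x: 0.5,
vowel_net = lambda x: [1.0] * 10 and consonant_net = lambda x: [0.0] * 10,
gesture_to_params returns ten values of 0.5.  Volume, fundamental frequency
and stop consonants are left out because the abstract says they use a fixed
mapping from the input devices.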

In my talk I will show the subject speaking with Glove-TalkII.  If time
permits, I'll also discuss some of the processes involved in learning to
speak with Glove-TalkII.  This should fit in with Ben Gold's talk from
last week.