
BibTeX entry

@PHDTHESIS{201603Helen_L._Bear,
  AUTHOR={Helen L. Bear},
  TITLE={Decoding visemes: improving machine lip-reading},
  SCHOOL={University of East Anglia},
  MONTH=Mar,
  YEAR=2016,
  URL={http://www.bmva.org/theses/2016/2016-bear.pdf},
}

Abstract

This thesis is about improving machine lip-reading, that is, the classification of speech from only visual cues of a speaker. Machine lip-reading is a niche research problem spanning both speech processing and computer vision.

Current challenges for machine lip-reading fall into two groups: the content of the video, such as the rate at which a person is speaking; and the parameters of the video recording, for example the video resolution. We begin our work with a literature review to understand how current technology restricts machine lip-reading recognition, and conduct an experiment into the effects of video resolution. We show that high-definition video is not needed to lip-read successfully with a computer.

The term “viseme” is used in machine lip-reading to denote a visual cue or gesture which corresponds to a subgroup of phonemes that are indistinguishable in the visual speech signal. Whilst a viseme is yet to be formally defined, we use the common working definition “a viseme is a group of phonemes with identical appearance on the lips”. A phoneme is the smallest acoustic unit a human can utter. Because several phonemes map to each viseme, the mapping between the two units is many-to-one. Many such mappings have been presented, and we conduct an experiment to determine which mapping produces the most accurate classification. Our results show Lee’s [82] is best. Lee’s classification also outperforms machine lip-reading systems which use the popular Fisher phoneme-to-viseme map.
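
As a concrete illustration of this many-to-one relationship, the minimal Python sketch below builds a small phoneme-to-viseme map; the groupings are illustrative examples only (e.g. the visually identical bilabials /p/, /b/, /m/), not Lee's actual map:

# Illustrative many-to-one phoneme-to-viseme map. The groupings are
# examples only (the bilabials /p/, /b/, /m/ look alike on the lips);
# they are not the map from the thesis.
phoneme_to_viseme = {
    "p": "V1", "b": "V1", "m": "V1",            # bilabial closure
    "f": "V2", "v": "V2",                        # labiodental
    "th": "V3", "dh": "V3",                      # dental
    "t": "V4", "d": "V4", "s": "V4", "z": "V4",
}

def visemes_from_phonemes(phonemes):
    # Forward direction: each phoneme maps to exactly one viseme.
    return [phoneme_to_viseme[p] for p in phonemes]

def phonemes_from_viseme(viseme):
    # Inverse direction: each viseme expands to several candidate phonemes,
    # which is the source of ambiguity when decoding visual speech.
    return [p for p, v in phoneme_to_viseme.items() if v == viseme]

print(visemes_from_phonemes(["b", "t", "m"]))   # ['V1', 'V4', 'V1']
print(phonemes_from_viseme("V1"))               # ['p', 'b', 'm']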

Further to this, we propose three methods of deriving speaker-dependent phoneme-to-viseme maps and compare our new approaches to Lee’s. Our results show how sensitive phoneme clustering is, and we use this new knowledge for our first suggested augmentation to the conventional lip-reading system.
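
One plausible way to derive such a speaker-dependent map is to cluster a speaker's phoneme confusion matrix, so that phonemes the recogniser cannot tell apart for that speaker share a viseme. The greedy sketch below assumes this approach and a toy confusion matrix; it is illustrative only and differs in detail from the three methods proposed in the thesis:

import numpy as np

def derive_visemes(confusion, phonemes, n_visemes):
    # Greedy agglomerative clustering of a phoneme confusion matrix:
    # repeatedly merge the two clusters whose members are most confused
    # with one another, until n_visemes clusters remain.
    sim = confusion + confusion.T                 # confusion in either direction
    clusters = [{i} for i in range(len(phonemes))]
    while len(clusters) > n_visemes:
        best, pair = -1.0, (0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = sum(sim[i, j] for i in clusters[a] for j in clusters[b])
                if score > best:
                    best, pair = score, (a, b)
        a, b = pair
        clusters[a] |= clusters[b]
        del clusters[b]
    return [{phonemes[i] for i in c} for c in clusters]

# Toy confusion matrix over four phonemes for one speaker
# (rows = ground truth, columns = recognised); the values are made up.
phons = ["p", "b", "f", "v"]
C = np.array([[10, 6, 1, 0],
              [5, 12, 0, 1],
              [0, 1, 9, 7],
              [1, 0, 6, 11]])
print(derive_visemes(C, phons, 2))   # [{'p', 'b'}, {'f', 'v'}]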

Speaker independence in machine lip-reading classification is another unsolved obstacle. It has been observed, in the visual domain, that classifiers need to be trained on the test subject to achieve the best classification; machine lip-reading is therefore highly dependent upon the speaker. Speaker independence is the opposite: the classification of a speaker who is not present in the classifier’s training data. We investigate the dependence of phoneme-to-viseme maps between speakers. Our results show there is not high variability in the visual cues themselves, but there is high variability in trajectory between the visual cues of an individual speaker with the same ground truth. This implies a dependency upon the number of visemes within each set for each individual speaker.

Finally, we investigate the optimum number of visemes within a set. We show that the phoneme-to-viseme maps in the literature rarely have enough visemes, and that the optimal number, which varies by speaker, ranges from 11 to 35. The last difficulty we address is decoding from visemes back to phonemes and into words. Traditionally this is completed using a language model whose unit is either the same as the classifier’s, e.g. visemes or phonemes, or is words. In a novel approach we use these optimum-range viseme sets within hierarchical training of phoneme-labelled classifiers. This new method of classifier training demonstrates a significant increase in classification with a word language network.
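
The sketch below illustrates why such decoding needs a language model: several words collapse to the same viseme string, so the model must rank the candidates. The phoneme-to-viseme map, the lexicon and the unigram priors are toy stand-ins, not the word network used in the thesis:

# Toy decoding example: words whose pronunciations collapse to the same
# viseme sequence are visually identical, so a language model is needed
# to choose between them. All values below are illustrative.
phoneme_to_viseme = {"p": "V1", "b": "V1", "m": "V1",
                     "ae": "V5", "t": "V4", "d": "V4"}

lexicon = {"bat": ["b", "ae", "t"],              # toy pronunciation dictionary
           "pat": ["p", "ae", "t"],
           "mad": ["m", "ae", "d"]}

unigram = {"bat": 0.2, "pat": 0.5, "mad": 0.3}   # toy word priors

def word_to_visemes(word):
    return [phoneme_to_viseme[p] for p in lexicon[word]]

def decode(viseme_seq):
    # Every word whose pronunciation maps onto this viseme string is a
    # candidate; the language model ranks the candidates.
    candidates = [w for w in lexicon if word_to_visemes(w) == viseme_seq]
    return sorted(candidates, key=lambda w: unigram[w], reverse=True)

print(decode(["V1", "V5", "V4"]))   # ['pat', 'mad', 'bat']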