Lipreading across multiple views

Lucey, Patrick Joseph (2007) Lipreading across multiple views. PhD thesis, Queensland University of Technology.


Visual information from a speaker's mouth region is known to improve automatic speech recognition (ASR) robustness, especially in the presence of acoustic noise. Currently, the vast majority of audio-visual ASR (AVASR) studies assume frontal images of the speaker's face, which is a rather restrictive human-computer interaction (HCI) scenario. The lack of research into AVASR across multiple views has been dictated by the lack of large corpora that contains varying pose/viewpoint speech data. Recently, research has concentrated on recognising human be- haviours within &quotmeeting " or &quotlecture " type scenarios via &quotsmart-rooms ". This has resulted in the collection of audio-visual speech data which allows for the recognition of visual speech from both frontal and non-frontal views to occur. Using this data, the main focus of this thesis was to investigate and develop various methods within the confines of a lipreading system which can recognise visual speech across multiple views. This reseach constitutes the first published work within the field which looks at this particular aspect of AVASR.

The task of recognising visual speech from non-frontal views (i.e. profile) is in principle very similar to that of frontal views, requiring the lipreading system to initially locate and track the mouth region and subsequently extract visual features. However, this task is far more complicated than the frontal case, because the facial features required to locate and track the mouth lie in a much more limited spatial plane. Nevertheless, accurate mouth region tracking can be achieved by employing techniques similar to frontal facial feature localisation. Once the mouth region has been extracted, the same visual feature extraction process can take place to the frontal view. A novel contribution of this thesis, is to quantify the degradation in lipreading performance between the frontal and profile views. In addition to this, novel patch-based analysis of the various views is conducted, and as a result a novel multi-stream patch-based representation is formulated. Having a lipreading system which can recognise visual speech from both frontal and profile views is a novel contribution to the field of AVASR. How- ever, given both the frontal and profile viewpoints, this begs the question, is there any benefit of having the additional viewpoint? Another major contribution of this thesis, is an exploration of a novel multi-view lipreading system. This system shows that there does exist complimentary information in the additional viewpoint (possibly that of lip protrusion), with superior performance achieved in the multi-view system compared to the frontal-only system.

Even though having a multi-view lipreading system which can recognise visual speech from both front and profile views is very beneficial, it can hardly considered to be realistic, as each particular viewpoint is dedicated to a single pose (i.e. front or profile). In an effort to make the lipreading system more realistic, a unified system based on a single camera was developed which enables a lipreading system to recognise visual speech from both frontal and profile poses. This is called pose-invariant lipreading. Pose-invariant lipreading can be performed on either stationary or continuous tasks. Methods which effectively normalise the various poses into a single pose were investigated for the stationary scenario and in another contribution of this thesis, an algorithm based on regularised linear regression was employed to project all the visual speech features into a uniform pose. This particular method is shown to be beneficial when the lipreading system was biased towards the dominant pose (i.e. frontal). The final contribution of this thesis is the formulation of a continuous pose-invariant lipreading system which contains a pose-estimator at the start of the visual front-end. This system highlights the complexity of developing such a system, as introducing more flexibility within the lipreading system invariability means the introduction of more error.

All the works contained in this thesis present novel and innovative contributions to the field of AVASR, and hopefully this will aid in the future deployment of an AVASR system in realistic scenarios.

Impact and interest:

Search Google Scholar™

Citation counts are sourced monthly from Scopus and Web of Science® citation databases.

These databases contain citations from different subsets of available publications and different time periods and thus the citation count from each is usually different. Some works are not in either database and no count is displayed. Scopus includes citations from articles published in 1996 onwards, and Web of Science® generally from 1980 onwards.

Citations counts from the Google Scholar™ indexing service can be viewed at the linked Google Scholar™ search.

Full-text downloads:

752 since deposited on 03 Dec 2008
26 in the past twelve months

Full-text downloads displays the total number of times this work’s files (e.g., a PDF) have been downloaded from QUT ePrints as well as the number of downloads in the previous 365 days. The count includes downloads for all files if a work has more than one.

ID Code: 16676
Item Type: QUT Thesis (PhD)
Supervisor: Sridharan, Subramanian & Chandran, Vinod
Keywords: audio-visual automatic speech recognition, lipreading, frontal pose, profile pose, multi-view, visual front-end, visual feature extraction, pose-invariance, multi-stream fusion
Divisions: Past > QUT Faculties & Divisions > Faculty of Built Environment and Engineering
Past > Schools > School of Engineering Systems
Department: Faculty of Built Environment and Engineering
Institution: Queensland University of Technology
Copyright Owner: Copyright Patrick Joseph Lucey
Deposited On: 03 Dec 2008 04:08
Last Modified: 28 Oct 2011 19:50

Export: EndNote | Dublin Core | BibTeX

Repository Staff Only: item control page