Patch-Based Analysis of Visual Speech From Multiple Views
Lucey, Patrick J., Potamianos, Gerasimos, & Sridharan, Sridha (2008) Patch-Based Analysis of Visual Speech From Multiple Views. In Goecke, Roland, Lucey, Patrick J., & Lucey, Simon (Eds.) International Conference on Auditory-Visual Speech Processing, 26-29 September, Tangalooma, Australia.
Obtaining a robust feature representation of visual speech is of crucial importance in the design of audio-visual automatic speech recognition systems. In the literature, when visual appearance based features are employed for this purpose, they are typically extracted using a "holistic" approach. Namely, a transformation of the pixel values of the entire region-of-interest (ROI) is obtained, with the ROI covering the speaker's mouth and often surrounding facial area. In this paper, we instead consider a "patch" based visual feature extraction approach, within the appearance based framework. In particular, we conduct a novel analysis to determine which areas (patches) of the mouth ROI are the most informative for visual speech. Furthermore, we extend this analysis beyond the traditional frontal views, by investigating profile views as well. Not surprisingly, and for both frontal and profile views, we conclude that the central mouth patches are the most informative, but less so than the holistic features of the entire ROI. Nevertheless, fusion of holistic and the best patch based features further improves visual speech recognition performance, compared to either feature set alone. Finally, we discuss scenarios where the patch based approach may be preferable to holistic features.
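To make the distinction between holistic and patch based extraction concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not name the transformation applied to the pixel values, the patch grid, or the fusion scheme, so the 2D DCT, the 2 x 3 grid, the number of retained coefficients, the concatenation step, and all function names below are illustrative assumptions.

```python
import math


def split_into_patches(roi, rows, cols):
    """Split a 2D ROI (list of equal-length rows) into a rows x cols grid.

    Leftover pixels that do not fill a whole patch are dropped.
    """
    height, width = len(roi), len(roi[0])
    ph, pw = height // rows, width // cols
    patches = {}
    for r in range(rows):
        for c in range(cols):
            patches[(r, c)] = [
                row[c * pw:(c + 1) * pw] for row in roi[r * ph:(r + 1) * ph]
            ]
    return patches


def dct2_lowfreq(block, k):
    """Return the k x k lowest-frequency, unnormalised 2D DCT-II coefficients.

    The DCT is an assumed stand-in for the unspecified transformation.
    """
    n_rows, n_cols = len(block), len(block[0])
    coeffs = []
    for u in range(k):
        for v in range(k):
            total = 0.0
            for y in range(n_rows):
                for x in range(n_cols):
                    total += (block[y][x]
                              * math.cos(math.pi * (y + 0.5) * u / n_rows)
                              * math.cos(math.pi * (x + 0.5) * v / n_cols))
            coeffs.append(total)
    return coeffs


def holistic_features(roi, k=4):
    """Holistic approach: one transform over the entire ROI."""
    return dct2_lowfreq(roi, k)


def patch_features(roi, rows=2, cols=3, k=2):
    """Patch based approach: one transform per patch, keyed by grid position."""
    grid = split_into_patches(roi, rows, cols)
    return {pos: dct2_lowfreq(patch, k) for pos, patch in grid.items()}


# Synthetic 8 x 12 "mouth ROI" of grey levels (made-up data, not from the paper).
roi = [[(x + y) % 7 for x in range(12)] for y in range(8)]

holistic = holistic_features(roi)
patches = patch_features(roi)

# Fusion by simple concatenation with one central-column patch. This is an
# assumption; the abstract does not say how fusion is performed.
fused = holistic + patches[(1, 1)]
```

Keeping the per-patch features in a dictionary keyed by grid position makes it straightforward to evaluate each patch separately, which is the kind of per-area comparison the abstract describes.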
|Item Type:|Conference Paper|
|Subjects:|Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > ARTIFICIAL INTELLIGENCE AND IMAGE PROCESSING (080100) > Image Processing (080106); Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > ARTIFICIAL INTELLIGENCE AND IMAGE PROCESSING (080100) > Natural Language Processing (080107)|
|Divisions:|Past > QUT Faculties & Divisions > Faculty of Built Environment and Engineering; Past > Institutes > Information Security Institute|
|Copyright Owner:|Copyright 2008 AVISA (the Auditory-VIsual Speech Association)|
|Deposited On:|20 Oct 2008|
|Last Modified:|29 Feb 2012 23:46|