Weighting and normalisation of synchronous HMMs for audio-visual speech recognition
Dean, David B., Lucey, Patrick J., Sridharan, Sridha, & Wark, Timothy J. (2007) Weighting and normalisation of synchronous HMMs for audio-visual speech recognition. In International Conference on Auditory-Visual Speech Processing 2007 (AVSP2007), August 31 - September 3, 2007, Kasteel Groenendaal, Hilvarenbeek, The Netherlands.
In this paper, we examine the effect of varying the stream weights in synchronous multi-stream hidden Markov models (HMMs) for audio-visual speech recognition. Rather than assuming that the stream weights are the same for training and testing, we examine how using different stream weights for each task affects the final speech-recognition performance. Evaluating our system under varying levels of audio and video degradation on the XM2VTS database, we show that the final performance is primarily a function of the stream weight used in testing, and that the stream weight used in training has only a very minor effect. By varying the testing stream weights, we show that the best average speech-recognition performance occurs with the streams weighted at around 80% audio and 20% video. However, by examining the distribution of frame-by-frame scores for each stream on a left-out section of the database, we show that the chosen testing weights primarily serve to normalise the two stream score distributions, rather than indicating the dependence of the final performance on either stream. By using a novel adaptation of zero-normalisation to normalise each stream's models before performing the weighted fusion, we show that the actual contribution of the audio and video scores to the best-performing speech system is closer to equal than appears to be indicated by the un-normalised stream weighting parameters alone.
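The interaction between stream weights and score normalisation described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the linear weighted sum of per-frame scores, and the toy score values are all assumptions made for illustration, and the paper's exact zero-normalisation variant is not specified on this page.

```python
import statistics


def znorm_params(dev_scores):
    """Estimate mean and standard deviation of one stream's frame scores
    on a held-out set (hypothetical helper, not from the paper)."""
    return statistics.mean(dev_scores), statistics.pstdev(dev_scores)


def znorm(score, mean, std):
    """Zero-normalise a single frame score with held-out statistics."""
    if std == 0:
        raise ValueError("standard deviation must be non-zero")
    return (score - mean) / std


def fuse(audio_score, video_score, audio_weight):
    """Weighted sum of two stream scores; the weights sum to one."""
    return audio_weight * audio_score + (1.0 - audio_weight) * video_score


def effective_weights(audio_weight, audio_std, video_std):
    """Relative contribution of each stream once the spread of its raw
    scores is taken into account (an illustrative measure, assumed here)."""
    a = audio_weight * audio_std
    v = (1.0 - audio_weight) * video_std
    return a / (a + v), v / (a + v)


# Toy held-out scores, invented for illustration only: the video stream's
# scores are spread four times as widely as the audio stream's.
audio_dev = [-10, -12, -8, -11, -9]
video_dev = [-40, -48, -32, -44, -36]

a_mean, a_std = znorm_params(audio_dev)
v_mean, v_std = znorm_params(video_dev)

# An 80/20 weighting on the raw scores...
print(effective_weights(0.8, a_std, v_std))

# ...versus fusing zero-normalised scores with equal weights.
print(fuse(znorm(-9, a_mean, a_std), znorm(-36, v_mean, v_std), 0.5))
```

In this toy setting, an 80% audio weight applied to raw scores largely compensates for the wider spread of the video scores, so the two streams end up contributing about equally. That is the kind of effect the abstract attributes to the un-normalised testing weights.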
|Item Type:||Conference Paper|
|Keywords:||audio-visual speech recognition, multi-stream hidden Markov models, normalisation|
|Subjects:||Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > ARTIFICIAL INTELLIGENCE AND IMAGE PROCESSING (080100) > Computer Vision (080104)
Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > ARTIFICIAL INTELLIGENCE AND IMAGE PROCESSING (080100) > Natural Language Processing (080107)
|Divisions:||Past > QUT Faculties & Divisions > Faculty of Built Environment and Engineering
Past > Institutes > Information Security Institute
|Copyright Owner:||Copyright 2007 (please consult author)|
|Deposited On:||22 Apr 2008 00:00|
|Last Modified:||22 Feb 2013 06:43|