Incorporating visual information for spoken term detection

Kalantari, Shahram, Dean, David, & Sridharan, Sridha (2015) Incorporating visual information for spoken term detection. In Proceedings of the 16th Annual Conference of the International Speech Communication Association, Interspeech 2015, International Speech Communication Association, Maritim International Congress Center, Dresden, Germany, pp. 558-562.



Spoken term detection (STD) is the task of looking up a spoken term in a large volume of speech segments. To provide fast search, speech segments are first indexed into an intermediate representation by speech recognition engines that provide multiple hypotheses for each segment. Approximate matching techniques are usually applied at the search stage to compensate for the poor performance of automatic speech recognition engines during indexing. Recently, using visual information in addition to audio information has been shown to improve phone recognition performance, particularly in noisy environments. In this paper, we make use of visual information, in the form of the speaker's lip movements, at the indexing stage and investigate its effect on STD performance. In particular, we investigate whether gains in phone recognition accuracy carry through the approximate matching stage to provide similar gains in the final audio-visual STD system over a traditional audio-only approach. We also investigate the effect of using visual information on STD performance in different noise environments.
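To illustrate the approximate-matching stage the abstract describes, the sketch below matches a query phone sequence against an indexed phone sequence under a bounded edit distance, so a term can still be detected when the recogniser hypothesised some phones incorrectly. This is a minimal, hypothetical illustration (plain edit-distance substring search over a single 1-best phone string), not the DMLS system or the multi-hypothesis indexing used in the paper; all phone labels and thresholds are invented for the example.

```python
# Sketch of approximate matching for spoken term detection (STD).
# The "index" is a phone sequence hypothesised by a (possibly noisy)
# speech recogniser; the query is the phone sequence of the search term.

def min_edit_substring_cost(query, index):
    """Minimum edit distance between `query` and any substring of `index`.

    Standard dynamic programme in which alignments may start and end
    anywhere in `index` (free leading/trailing deletions from the index),
    so the query can be located anywhere in the indexed phone stream.
    """
    m = len(index)
    prev = [0] * (m + 1)            # empty query matches anywhere at cost 0
    for q in query:
        curr = [prev[0] + 1]        # deleting a query phone costs 1
        for j in range(1, m + 1):
            sub = prev[j - 1] + (q != index[j - 1])   # match/substitute
            curr.append(min(sub, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return min(prev)                # best alignment may end at any position

def detect(query, index, max_cost=1):
    """Report a detection if the query aligns within `max_cost` edits."""
    return min_edit_substring_cost(query, index) <= max_cost

# Example: a recogniser mis-hypothesises "cat" (k ae t) as "k ae p".
# Exact lookup would miss the term; approximate matching still finds it.
index = "sil k ae p sil dh ax sil".split()
query = "k ae t".split()
print(detect(query, index))   # True: one substitution (t -> p) is allowed
```

In a real system the index would hold multiple recogniser hypotheses (e.g. a phone lattice) rather than a single 1-best string, and the edit costs would typically be weighted by phone confusability, but the search principle is the same.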

Impact and interest:

2 citations in Scopus


Full-text downloads:

20 since deposited on 27 Jul 2015
6 in the past twelve months


ID Code: 86034
Item Type: Conference Paper
Refereed: Yes
Keywords: Spoken term detection, keyword spotting, audio visual phone recognition, DMLS system
ISSN: 1990-9770
Divisions: Current > Schools > School of Electrical Engineering & Computer Science
Current > QUT Faculties and Divisions > Science & Engineering Faculty
Copyright Owner: Copyright 2015 [Please consult the author]
Deposited On: 27 Jul 2015 22:36
Last Modified: 24 Sep 2015 14:02

