Speech enhancement and recognition in meetings with an audio–visual sensor array

Maganti, Hari Krishna, Gatica-Perez, Daniel, & McCowan, Iain A. (2007) Speech enhancement and recognition in meetings with an audio–visual sensor array. IEEE Transactions on Audio, Speech and Language Processing, 15(8), 2257 -2269.

View at publisher


This paper addresses the problem of distant speech acquisition in multiparty meetings, using multiple microphones and cameras. Microphone array beamforming techniques present a potential alternative to close-talking microphones by providing speech enhancement through spatial filtering. Beamforming techniques, however, rely on knowledge of the speaker location. In this paper, we present an integrated approach, in which an audio-visual multiperson tracker is used to track active speakers with high accuracy. Speech enhancement is then achieved using microphone array beamforming followed by a novel postfiltering stage. Finally, speech recognition is performed to evaluate the quality of the enhanced speech signal. The approach is evaluated on data recorded in a real meeting room for stationary speaker, moving speaker, and overlapping speech scenarios. The results show that the speech enhancement and recognition performance achieved using our approach are significantly better than a single table-top microphone and are comparable to a lapel microphone for some of the scenarios. The results also indicate that the audio-visual-based system performs significantly better than audio-only system, both in terms of enhancement and recognition. This reveals that the accurate speaker tracking provided by the audio-visual sensor array proved beneficial to improve the recognition performance in a microphone array-based speech recognition system.

Impact and interest:

33 citations in Scopus
24 citations in Web of Science®
Search Google Scholar™

Citation counts are sourced monthly from Scopus and Web of Science® citation databases.

These databases contain citations from different subsets of available publications and different time periods and thus the citation count from each is usually different. Some works are not in either database and no count is displayed. Scopus includes citations from articles published in 1996 onwards, and Web of Science® generally from 1980 onwards.

Citations counts from the Google Scholar™ indexing service can be viewed at the linked Google Scholar™ search.

Full-text downloads:

428 since deposited on 23 Jul 2008
24 in the past twelve months

Full-text downloads displays the total number of times this work’s files (e.g., a PDF) have been downloaded from QUT ePrints as well as the number of downloads in the previous 365 days. The count includes downloads for all files if a work has more than one.

ID Code: 14132
Item Type: Journal Article
Refereed: Yes
Additional URLs:
Keywords: array signal processing, audio signal processing, audio, visual systems, filtering theory, microphone arrays, speaker recognition, speech enhancement, tracking filters
DOI: 10.1109/TASL.2007.906197
ISSN: 1558-7916
Subjects: Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > ARTIFICIAL INTELLIGENCE AND IMAGE PROCESSING (080100) > Natural Language Processing (080107)
Australian and New Zealand Standard Research Classification > ENGINEERING (090000) > ELECTRICAL AND ELECTRONIC ENGINEERING (090600) > Signal Processing (090609)
Australian and New Zealand Standard Research Classification > LANGUAGES COMMUNICATION AND CULTURE (200000) > LINGUISTICS (200400)
Divisions: Past > QUT Faculties & Divisions > Faculty of Built Environment and Engineering
Past > Institutes > Information Security Institute
Copyright Owner: Copyright 2007 IEEE
Copyright Statement: Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Deposited On: 23 Jul 2008 00:00
Last Modified: 25 Jun 2017 14:39

Export: EndNote | Dublin Core | BibTeX

Repository Staff Only: item control page