Speaker verification incorporating high-level linguistic features
Baker, Brendan J. (2008) Speaker verification incorporating high-level linguistic features. PhD thesis, Queensland University of Technology.
Speaker verification is the process of verifying or disputing the claimed identity of a speaker based on a recorded sample of their speech. Automatic speaker verification technology can be applied to a variety of person authentication and identification applications including forensics, surveillance, national security measures for combating terrorism, credit card and transaction verification, automation and indexing of speakers in audio data, voice-based signatures, and over-the-phone security access. The ubiquitous nature of modern telephony systems allows for the easy acquisition and delivery of speech signals for processing by an automated speaker recognition system.
Traditionally, approaches to automatic speaker verification have involved holistic modelling of low-level acoustic-based features in order to characterise physiological aspects of a speaker such as the length and shape of the vocal tract. Although the use of these low-level features has proved highly successful, there are numerous other sources of speaker-specific information in the speech signal that have largely been ignored. In spontaneous and conversational speech, perceptually higher levels of information such as the linguistic content, pronunciation idiosyncrasies, idiolectal word usage, speaking rates and prosody can also provide useful cues as to the identity of a speaker.
The main aim of this work is to explore the incorporation of higher levels of information into the verification process. Specifically, linguistic constructs such as words, syllables and phones are examined for their usefulness as features for text-independent speaker verification. Two main approaches to incorporating these linguistic features are explored.
Firstly, the direct modelling of linguistic feature sequences is examined. Stochastic language models are used to model word and phonetic sequences drawn from automatically generated transcripts. Experimentation indicates that significant speaker-characterising information is indeed contained in both word and phone-level transcripts. It is shown, however, that model estimation issues arise when limited speech is available for training. This speaker model estimation problem is addressed by employing an adaptive model training strategy that significantly improves the performance and extends the usefulness of both lexical and phonetic techniques to short training length situations.
An alternative approach to incorporating linguistic information is also examined. Rather than modelling the high-level features independently of acoustic information, linguistic information is instead used to constrain and aid acoustic-based speaker verification techniques. It is hypothesised that a "text-constrained" approach provides direct benefits by facilitating more detailed modelling, as well as providing useful insight into which articulatory events provide the most useful speaker-characterising information. A novel framework for text-constrained speaker verification is developed. This technique is presented as a generalised framework capable of using different feature sets and modelling paradigms, and is based upon the use of a newly defined pseudo-syllabic segmentation unit. A detailed exploration of the speaker-characterising power of both broad phonetic and syllabic events is performed and used to optimise the system configuration. An evaluation of the proposed text-constrained framework using cepstral features demonstrates the benefits of such an approach over holistic approaches, particularly in extended training length scenarios.
Finally, a complete evaluation of the developed techniques on the NIST 2005 speaker recognition evaluation database is presented. The benefit of including high-level linguistic information is demonstrated when a fusion of both high- and low-level techniques is performed.
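As an illustration of the first approach described in the abstract (stochastic language models over transcribed token sequences, with adaptive training for short enrolment data), the following is a minimal sketch only. The abstract does not specify the model order, the adaptation rule, or the scoring function, so everything here is an assumption: bigram relative-frequency models, speaker adaptation by linear interpolation with a background model (the weight `alpha` is hypothetical), and a count-weighted average log-likelihood ratio as the verification score. All function names are illustrative and not taken from the thesis.

```python
from collections import Counter
import math


def ngram_counts(sequences, n=2):
    """Count n-grams over a list of token sequences (phones or words)."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return counts


def relative_freqs(counts):
    """Turn n-gram counts into relative frequencies."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def adapt(speaker_counts, background_probs, alpha=0.3):
    """Assumed adaptation rule: interpolate the speaker's relative
    frequencies with the background model, so that n-grams unseen in a
    short enrolment sample fall back towards background statistics."""
    spk = relative_freqs(speaker_counts)
    grams = set(background_probs) | set(spk)
    return {
        g: alpha * spk.get(g, 0.0) + (1 - alpha) * background_probs.get(g, 0.0)
        for g in grams
    }


def llr_score(test_counts, speaker_probs, background_probs, floor=1e-6):
    """Count-weighted average log-likelihood ratio, speaker vs background."""
    total = sum(test_counts.values())
    if total == 0:
        return 0.0
    score = 0.0
    for gram, c in test_counts.items():
        p_spk = max(speaker_probs.get(gram, 0.0), floor)
        p_bg = max(background_probs.get(gram, 0.0), floor)
        score += c * (math.log(p_spk) - math.log(p_bg))
    return score / total


# Toy example: single characters stand in for phone labels.
background = relative_freqs(
    ngram_counts([list("abcabcabc"), list("cbacbacba"), list("aabbcc")])
)
speaker = adapt(ngram_counts([list("ababababab")]), background)

target_score = llr_score(ngram_counts([list("ababab")]), speaker, background)
impostor_score = llr_score(ngram_counts([list("cbcbcb")]), speaker, background)
print(target_score > impostor_score)
```

In this toy setup the test sequence that resembles the enrolment data scores above zero and the dissimilar one scores below zero. A real system would compare the score against a threshold tuned on development data, and could fuse it with an acoustic system's score as the abstract's final evaluation describes.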
|Item Type:||QUT Thesis (PhD)|
|Supervisor:||Sridharan, Subramanian, Mason, Michael, & Vogt, Robert|
|Keywords:||speaker recognition, speaker verification, high-level features, idiolect, phonetic speaker recognition, session variability, text-constrained speaker recognition|
|Divisions:||Past > QUT Faculties & Divisions > Faculty of Built Environment and Engineering; Past > Schools > School of Engineering Systems|
|Institution:||Queensland University of Technology|
|Deposited On:||10 Feb 2009 03:19|
|Last Modified:||28 Oct 2011 19:51|