NOTE: The following materials are presented for timely
dissemination of academic and technical work. Copyright and all other rights
therein are reserved by the authors and/or other copyright holders. Personal
use of the following materials is permitted; however, anyone using
the materials or information is expected to adhere to the terms and
constraints invoked by the relevant copyright.
Extracting Speaker-Specific Information with a Regularized Siamese Deep Network
ABSTRACT
Speech conveys distinct yet mixed information components, ranging from linguistic content to speaker-specific
characteristics, and each component should be used exclusively in its corresponding task. However, extracting
a specific information component is extremely difficult because nearly all existing acoustic representations
carry all types of speech information. Using the same representation in both speech and speaker
recognition therefore hinders a system from achieving better performance, due to interference from irrelevant information.
In this paper, we present a deep neural architecture to extract speaker-specific information from MFCCs.
To this end, we propose a multi-objective loss function that learns speaker-specific characteristics while providing
regularization: it suppresses interference from non-speaker information and avoids information loss.
Using LDC benchmark corpora and a Chinese speech corpus, we demonstrate that the resulting speaker-specific
representation is insensitive to the text/language spoken and to environmental mismatches, and hence outperforms MFCCs
and other state-of-the-art techniques in speaker recognition. We discuss relevant issues and
relate our approach to previous work.
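The paper itself defines the exact architecture and loss; as a rough illustration of the idea described above (a shared "Siamese" encoder applied to pairs of MFCC frames, with a contrastive speaker-discrimination term plus a reconstruction regularizer to avoid information loss), here is a minimal NumPy sketch. All names, dimensions, and weightings are hypothetical, not the authors' actual formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 39-dim MFCC frames, 20-dim speaker embedding.
D_IN, D_EMB = 39, 20
W = rng.normal(scale=0.1, size=(D_EMB, D_IN))  # shared encoder weights
V = rng.normal(scale=0.1, size=(D_IN, D_EMB))  # decoder weights (for the regularizer)


def encode(x):
    """Shared ("Siamese") encoder: the same weights process both inputs of a pair."""
    return np.tanh(W @ x)


def multi_objective_loss(x1, x2, same_speaker, margin=1.0, lam=0.1):
    """Toy multi-objective loss combining two terms:

    1. A contrastive speaker term: pull same-speaker embeddings together,
       push different-speaker embeddings at least `margin` apart.
    2. A reconstruction regularizer: penalize reconstruction error so the
       embedding does not simply discard all information.
    """
    z1, z2 = encode(x1), encode(x2)
    d = np.linalg.norm(z1 - z2)
    if same_speaker:
        speaker_term = d ** 2
    else:
        speaker_term = max(0.0, margin - d) ** 2
    recon_term = sum(np.sum((V @ encode(x) - x) ** 2) for x in (x1, x2))
    return speaker_term + lam * recon_term


x_a = rng.normal(size=D_IN)
x_b = rng.normal(size=D_IN)
loss_same = multi_objective_loss(x_a, x_a, same_speaker=True)   # identical inputs
loss_diff = multi_objective_loss(x_a, x_b, same_speaker=False)
```

In a real system the encoder would be a deep network trained by gradient descent over many frame pairs; the sketch only shows how a speaker-discrimination objective and a regularization objective can be combined in one loss.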
Click
nips2011.pdf for the full text and
Appendix for the supplementary material.