Project

Voice Privacy Preservation

Wonjune Kang

Spoken language is an information-rich medium: beyond the words themselves, modulations in tone and pitch convey emotion, feeling, and excitement. In discourse, this preserves a human element that many other channels, such as writing or social media, lack. However, voice is also a distinctive biomarker, and in many settings a speaker’s voice may need to be anonymized to protect their identity. These include audio recordings in which people share sensitive information, media interviews, and more general situations in which a speaker would like to protect personal information such as geographical background or ethnicity.

Some modern methods for speaker anonymization treat the task as a voice conversion (VC) problem: converting the vocal identity of an utterance to sound like another person without changing its linguistic content. Anonymization is then achieved by using a VC method to change the speaker’s identity to that of a different individual. However, even for state-of-the-art VC models, the naturalness of converted voices still lags behind that of real speech, and audio quality often degrades significantly.

In this project, we aim to develop a novel approach to speaker anonymization based on voice conversion. Unlike most recent VC methods, our model can be trained end-to-end and produces audio directly. This helps maintain high audio quality and keeps the model usable for anonymizing speech recorded under diverse background noise conditions.