Speech Representation Disentangling with Adversarial Mutual Information Learning for One-shot Voice Conversion

Submitted to INTERSPEECH 2022


Sicheng Yang*, Methawee Tantrawenith*, Haolin Zhuang*, Zhiyong Wu†, Aolan Sun, Jianzong Wang, Ning Cheng, Huaizhen Tang, Xintao Zhao, Jie Wang, Helen Meng

* Equal contribution.

† Corresponding author.

Abstract

One-shot voice conversion (VC), which uses only a single utterance from the target speaker as reference, has become a new research direction. Existing works generally disentangle timbre, while information about pitch, rhythm and content remains mixed together. To perform one-shot VC effectively while further disentangling these speech components, we employ random resampling for the pitch and content encoders, and use the variational contrastive log-ratio upper bound (vCLUB) of mutual information together with gradient reversal layer based adversarial mutual information learning to ensure that each part of the latent space contains only the desired representation during training. Experiments on the VCTK dataset show that the model achieves state-of-the-art one-shot VC performance in terms of naturalness and intelligibility of converted speech. In addition, through speech representation disentanglement, we can transfer the style of one-shot VC on timbre, pitch and rhythm separately.
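To make the mutual information term in the abstract more concrete, the sketch below computes a sample-based estimate of the variational contrastive log-ratio upper bound (vCLUB) for a batch of paired representations. It is only an illustration under stated assumptions, not the code released with this project: the function names `gaussian_log_prob` and `vclub_estimate` are hypothetical, the variational distribution q(y | x) is assumed to be a diagonal Gaussian whose mean and log-variance are given directly, and the encoders, the variational network and the gradient reversal layer are all left out.

```python
import math
import random


def gaussian_log_prob(y, mu, logvar):
    """Log-density of vector y under a diagonal Gaussian N(mu, exp(logvar))."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi) + lv + (yi - m) ** 2 / math.exp(lv))
        for yi, m, lv in zip(y, mu, logvar)
    )


def vclub_estimate(ys, mus, logvars):
    """Sample-based vCLUB estimate of an upper bound on I(X; Y).

    mus[i] and logvars[i] parameterize q(y | x_i), an assumed Gaussian
    approximation of p(y | x_i). The estimate is
        mean_i log q(y_i | x_i)  -  mean_{i,j} log q(y_j | x_i).
    """
    n = len(ys)
    # Log-likelihood of matched pairs (samples from the joint distribution).
    positive = sum(
        gaussian_log_prob(ys[i], mus[i], logvars[i]) for i in range(n)
    ) / n
    # Log-likelihood of all pairings (samples from the product of marginals).
    negative = sum(
        gaussian_log_prob(ys[j], mus[i], logvars[i])
        for i in range(n)
        for j in range(n)
    ) / (n * n)
    return positive - negative


# Toy data (assumption): y is a noisy copy of x, so the two are dependent.
random.seed(0)
xs = [[random.gauss(0.0, 1.0)] for _ in range(50)]
ys = [[x[0] + 0.1 * random.gauss(0.0, 1.0)] for x in xs]
logvars = [[math.log(0.01)] for _ in xs]

# q(y | x) centered on x: the estimate is large for strongly dependent pairs.
print("dependent pairs:", vclub_estimate(ys, xs, logvars))
```

In a training setup of the kind the abstract describes, a quantity like this would be driven down for pairs of representations that should not share information (for example pitch and content), while the variational network is fitted to the matched pairs; how the two objectives are combined through gradient reversal is not shown here.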

Conversion Task

I. One-shot VC (Timbre Conversion)

Speakers are unseen during training.

Source Audio | Target Audio | Model | Converted Audio

"We felt at the moment, it was not possible."


"The prognosis is excellent."

SkipVQVC
VQMIVC
AutoVC
AdaIN-VC
ClsVC
Proposed Model

"Not one person spoke badly."


"Gerhard Schroeder was the victor."

SkipVQVC
VQMIVC
AutoVC
AdaIN-VC
ClsVC
Proposed Model

"It was some time before she found out he was safe."


"The issues are very intense."

SkipVQVC
VQMIVC
AutoVC
AdaIN-VC
ClsVC
Proposed Model

"They are very clever in their use of words."

"The task force continues to be in discussion with both companies."

SkipVQVC
VQMIVC
AutoVC
AdaIN-VC
ClsVC
Proposed Model

II. Different Representation Conversion (Not Only Timbre Conversion)

Speakers are seen during training.

We pick speakers p252 and p232, both saying “Please call Stella.”; their utterances clearly differ in rhythm and pitch.

Source Audio | Target Audio | Type | Converted Audio

"Please call Stella."


"Please call Stella."

Pitch
Rhythm
Timbre
Pitch+Rhythm
Pitch+Timbre
Rhythm+Timbre
Pitch+Rhythm+Timbre

Speakers are unseen during training.

Source Audio | Target Audio | Model | Type | Converted Audio

"I love Human Computer Speech Interaction Lab."


"I love Human Computer Speech Interaction Lab."

Proposed Model:
Pitch
Rhythm
Timbre
Pitch+Rhythm
Pitch+Timbre
Rhythm+Timbre
Pitch+Rhythm+Timbre
SkipVQVC
VQMIVC
AutoVC
AdaIN-VC
ClsVC

Code and Pre-trained Model

I. Code

Click to find the code.

II. Pre-trained Model

Click to find the pre-trained model.