Sicheng Yang*, Methawee Tantrawenith*, Haolin Zhuang*, Zhiyong Wu†, Aolan Sun, Jianzong Wang, Ning Cheng, Huaizhen Tang, Xintao Zhao, Jie Wang, Helen Meng
* Equal contribution.
† Corresponding author.
Abstract
One-shot voice conversion (VC), which uses only a single utterance from the target speaker as reference, has become a new research direction. Existing works generally disentangle timbre, while information about pitch, rhythm and content remains mixed together. To perform one-shot VC effectively while further disentangling these speech components, we apply random resampling to the pitch and content encoders, and we use the variational contrastive log-ratio upper bound of mutual information together with gradient reversal layer based adversarial mutual information learning, so that during training each part of the latent space contains only its intended representation. Experiments on the VCTK dataset show that the model achieves state-of-the-art one-shot VC performance in terms of naturalness and intelligibility of the converted speech. In addition, through speech representation disentanglement, we can transfer the style of one-shot VC on timbre, pitch and rhythm separately.
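To make the random resampling idea in the abstract concrete, here is a minimal sketch under stated assumptions, not the authors' implementation. It cuts a 1-D sequence (for example a pitch contour) into segments of random length and stretches or squeezes each segment in time by a random factor using linear interpolation, which perturbs rhythm while keeping the local shape of the contour. The function names, the segment length range and the factor range are all hypothetical choices made for illustration.

```python
import random


def resample_segment(segment, factor):
    """Linearly interpolate a list of floats to round(len * factor) points (at least 1)."""
    n_in = len(segment)
    n_out = max(1, round(n_in * factor))
    if n_in == 1 or n_out == 1:
        return [segment[0]] * n_out
    out = []
    for i in range(n_out):
        # Map output index i onto the input time axis.
        pos = i * (n_in - 1) / (n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append((1 - frac) * segment[lo] + frac * segment[hi])
    return out


def random_resample(frames, min_len=19, max_len=32,
                    min_factor=0.5, max_factor=1.5, rng=None):
    """Split `frames` into random-length segments and resample each one.

    All default ranges are illustrative assumptions, not values from the paper.
    """
    rng = rng or random.Random()
    out, start = [], 0
    while start < len(frames):
        seg_len = rng.randint(min_len, max_len)      # random segment length
        factor = rng.uniform(min_factor, max_factor)  # random stretch/squeeze
        out.extend(resample_segment(frames[start:start + seg_len], factor))
        start += seg_len
    return out
```

Calling `random_resample(contour, rng=random.Random(0))` on a list of floats returns a time-warped copy whose length generally differs from the input. Applying such a warp before an encoder is one way to discourage that encoder from relying on rhythm information.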
Conversion Task
I. One-shot VC (Timbre Conversion)
Speakers are unseen during training.
| Source Audio | Target Audio | Model | Converted Audio |
|---|---|---|---|
| "We felt at the moment, it was not possible." | "The prognosis is excellent." | SkipVQVC | |
| | | VQMIVC | |
| | | AutoVC | |
| | | AdaIN-VC | |
| | | ClsVC | |
| | | Proposed Model | |
| "Not one person spoke badly." | "Gerhard Schroeder was the victor." | SkipVQVC | |
| | | VQMIVC | |
| | | AutoVC | |
| | | AdaIN-VC | |
| | | ClsVC | |
| | | Proposed Model | |
| "It was some time before she found out he was safe." | "The issues are very intense." | SkipVQVC | |
| | | VQMIVC | |
| | | AutoVC | |
| | | AdaIN-VC | |
| | | ClsVC | |
| | | Proposed Model | |
| "They are very clever in their use of words." | "The task force continues to be in discussion with both companies." | SkipVQVC | |
| | | VQMIVC | |
| | | AutoVC | |
| | | AdaIN-VC | |
| | | ClsVC | |
| | | Proposed Model | |
II. Different Representation Conversion (Not Only Timbre Conversion)
Speakers are seen during training.
We pick speakers p252 and p232, both saying “Please call Stella.”; the two utterances are clearly different in rhythm and pitch.
| Source Audio | Target Audio | Type | Converted Audio |
|---|---|---|---|
| "Please call Stella." | "Please call Stella." | Pitch | |
| | | Rhythm | |
| | | Timbre | |
| | | Pitch+Rhythm | |
| | | Pitch+Timbre | |
| | | Rhythm+Timbre | |
| | | Pitch+Rhythm+Timbre | |
Speakers are unseen during training.
| Source Audio | Target Audio | Model | Type | Converted Audio |
|---|---|---|---|---|
| "I love Human Computer Speech Interaction Lab." | "I love Human Computer Speech Interaction Lab." | Proposed Model | Pitch | |
| | | | Rhythm | |
| | | | Timbre | |
| | | | Pitch+Rhythm | |
| | | | Pitch+Timbre | |
| | | | Rhythm+Timbre | |
| | | | Pitch+Rhythm+Timbre | |
| | | SkipVQVC | | |
| | | VQMIVC | | |
| | | AutoVC | | |
| | | AdaIN-VC | | |
| | | ClsVC | | |