Implementation code of non-parallel sequence-to-sequence VC

kimjj-geek/nonparaSeq2seqVC_code

 
 

Non-parallel Seq2seq Voice Conversion

Implementation code of Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations.

For audio samples, please visit our demo page.

The structure overview of the model

Dependencies

  • Python 3.6
  • PyTorch 1.0.1
  • CUDA 10.0

Data

We recommend downloading the VCTK and CMU-ARCTIC datasets.

Usage

Installation

Install Python dependencies.

$ pip install -r requirements.txt

Feature Extraction

Extract Mel-Spectrograms

Install and use deepvoice3_pytorch for extracting audio features.

For VCTK, you can use the following:

deepvoice$ python preprocess.py --preset=presets/deepvoice3_vctk.json vctk VCTK-Corpus/ VCTK-processed/

Extract Phonemes

It's suggested to use the grapheme-to-phoneme module of Festival to obtain the inputs for the text encoder. An easy way to do this is with the phonemizer tool, with Festival as a backend:

$ phonemize -b festival -l en-us transcripts.txt -o transcripts.phones --strip
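Once the transcripts are phonemized, the phoneme strings still have to be mapped to integer IDs before a text encoder can consume them. The following is a minimal sketch of that step, not the repository's own code. It assumes that each line of `transcripts.phones` holds one utterance with tokens separated by whitespace; `build_symbol_table`, `encode`, and the `<pad>`/`<unk>` symbols are hypothetical names chosen for illustration.

```python
from typing import Dict, Iterable, List

PAD, UNK = "<pad>", "<unk>"  # hypothetical special symbols (assumption)


def build_symbol_table(lines: Iterable[str]) -> Dict[str, int]:
    """Map every distinct whitespace-separated token to an integer ID."""
    symbols = sorted({tok for line in lines for tok in line.split()})
    # ID 0 is reserved for padding and ID 1 for unknown tokens.
    return {sym: idx for idx, sym in enumerate([PAD, UNK] + symbols)}


def encode(line: str, table: Dict[str, int]) -> List[int]:
    """Convert one phonemized utterance into a list of IDs."""
    return [table.get(tok, table[UNK]) for tok in line.split()]


# Toy data standing in for the contents of transcripts.phones.
phones = ["hh ax l ow", "w er l d"]
table = build_symbol_table(phones)
ids = encode("hh ax l ow", table)
```

Sorting the symbols keeps the mapping stable across runs, which matters if the same table has to be reused between pre-training and fine-tuning.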

Customize data reader

If you use data other than VCTK or CMU-ARCTIC, you will need to modify the data reader so that it can read your training data. The scripts you will need to modify are listed below.

For pre-training:

For fine-tuning:
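As a rough illustration of what a customized reader has to do, the sketch below parses a file list into per-utterance records. The `mel_path|phonemes|speaker` line format, the `Utterance` type, and `read_file_list` are assumptions made for this example; the repository's actual reader scripts may expect a different layout.

```python
import io
from typing import List, NamedTuple, TextIO


class Utterance(NamedTuple):
    mel_path: str        # path to a saved Mel-spectrogram file
    phonemes: List[str]  # phoneme sequence for the text encoder
    speaker: str         # speaker label


def read_file_list(handle: TextIO) -> List[Utterance]:
    """Parse lines of the (assumed) form: mel_path|phonemes|speaker."""
    items = []
    for line_no, line in enumerate(handle, start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("|")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 fields, got {len(fields)}"
            )
        mel_path, phonemes, speaker = fields
        items.append(Utterance(mel_path, phonemes.split(), speaker))
    return items


# In-memory example; a real list would be opened with
# open(path, encoding="utf-8") instead.
sample = io.StringIO(
    "mels/a_001.npy|hh ax l ow|spk_a\n"
    "\n"
    "mels/b_001.npy|w er l d|spk_b\n"
)
utterances = read_file_list(sample)
```

Failing loudly on malformed lines is deliberate: a silently skipped utterance is much harder to notice later than an error at load time.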

Pre-train the model

Add correct paths to your local data, and run the bash script:

$ cd pre-train
$ bash run.sh

Run the inference code to generate audio samples on the multi-speaker dataset. During inference, the model can run in either TTS mode (text inputs) or VC mode (Mel-spectrogram inputs).

$ python inference.py
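The two inference modes differ only in which input feeds the model: text for TTS, a source Mel-spectrogram for VC. The sketch below illustrates that dispatch in isolation. It is not the interface of `inference.py`; `select_input`, the mode strings, and the argument names are hypothetical.

```python
from typing import List, Optional, Tuple, Union

TextIds = List[int]      # phoneme IDs for TTS mode
Mel = List[List[float]]  # frames x Mel bins, for VC mode


def select_input(
    mode: str,
    text_ids: Optional[TextIds] = None,
    source_mel: Optional[Mel] = None,
) -> Tuple[str, Union[TextIds, Mel]]:
    """Return which input kind to use and the matching input for a mode."""
    if mode == "tts":
        if text_ids is None:
            raise ValueError("TTS mode needs text_ids")
        return "text", text_ids
    if mode == "vc":
        if source_mel is None:
            raise ValueError("VC mode needs source_mel")
        return "mel", source_mel
    raise ValueError(f"unknown mode: {mode!r}")


# VC mode: a tiny two-frame, two-bin spectrogram as placeholder data.
kind, payload = select_input("vc", source_mel=[[0.1, 0.2], [0.3, 0.4]])
```

Validating the mode and its required input up front avoids running a long synthesis job with the wrong kind of input.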

Fine-tune the model

Fine-tune the model and generate audio samples for a conversion pair. As in pre-training, inference can run in either TTS mode (text inputs) or VC mode (Mel-spectrogram inputs).

$ cd fine-tune
$ bash run.sh

References

  • "Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations", Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai, accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing, 2019.
  • "Sequence-to-Sequence Acoustic Modeling for Voice Conversion", Jing-Xuan Zhang, Zhen-Hua Ling, Li-Juan Liu, Yuan Jiang, Li-Rong Dai, IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 27, no. 3, pp. 631-644, March 2019.
  • "Forward Attention in Sequence-to-sequence Acoustic Modelling for Speech Synthesis", Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai, ICASSP, pp. 4789–4793, 2018.

Acknowledgements

Part of the code was adapted from the following project:
