Implementation of "Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations".
For audio samples, please visit our demo page.
- Python 3.6
- PyTorch 1.0.1
- CUDA 10.0
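A quick sanity check that your environment matches these versions (newer versions may also work, but are untested here):

import torch

print(torch.__version__)          # expected: 1.0.1
print(torch.version.cuda)         # expected: 10.0
print(torch.cuda.is_available())  # True if a CUDA device is visible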
It is recommended that you download the VCTK and CMU ARCTIC datasets.
Install Python dependencies.
$ pip install -r requirements.txt
Install deepvoice3_pytorch and use it to extract audio features.
For VCTK, you can use the following:
deepvoice$ python preprocess.py --preset=presets/deepvoice3_vctk.json vctk VCTK-Corpus/ VCTK-processed/
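After preprocessing, each utterance is stored as numpy arrays. As a rough sanity check you can inspect a mel-spectrogram; the file name below is hypothetical, so check VCTK-processed/ for the actual naming scheme deepvoice3_pytorch uses:

import numpy as np

# Illustrative file name; see VCTK-processed/ for the real layout.
mel = np.load('VCTK-processed/vctk-mel-00001.npy')
print(mel.shape)  # (num_frames, num_mel_channels), e.g. (T, 80)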
It is suggested that you use the grapheme-to-phoneme module of Festival to obtain the inputs for the text encoder. An easy way to do this is with the phonemizer tool, using Festival as the backend:
$ phonemize -b festival -l en-us transcripts.txt -o transcripts.phones --strip
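The same conversion is available from Python through phonemizer's API, which can be handy when preparing transcripts programmatically (the sample sentence is illustrative):

from phonemizer import phonemize

# Convert a raw transcript into Festival phoneme strings.
text = 'Please call Stella.'
phones = phonemize(text, language='en-us', backend='festival', strip=True)
print(phones)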
If you use data other than VCTK or CMU ARCTIC, you will need to modify the data reader to load your training data. The scripts you will need to modify are listed below; a minimal reader sketch follows the list.
For pre-training:
For fine-tuning:
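A minimal sketch of what such a reader typically provides, assuming one mel-spectrogram .npy file and one phoneme transcript per utterance (class and field names here are illustrative, not the repo's actual API):

import numpy as np
import torch
from torch.utils.data import Dataset

class MyCorpusReader(Dataset):
    """Illustrative reader: adapt paths and parsing to your own corpus."""
    def __init__(self, metadata):
        # metadata: list of (mel_path, phoneme_string, speaker_id) tuples
        self.items = metadata

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        mel_path, phones, speaker_id = self.items[idx]
        mel = torch.from_numpy(np.load(mel_path))  # (T, n_mels)
        # Placeholder encoding: replace with a lookup into your
        # phoneme symbol table.
        text = torch.LongTensor([ord(c) for c in phones])
        return text, mel, speaker_id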
Add the correct paths to your local data, then run the bash script:
$ cd pre-train
$ bash run.sh
Run the inference code to generate audio samples on the multi-speaker dataset. During inference, our model can run in either TTS mode (text inputs) or VC mode (mel-spectrogram inputs).
$ python inference.py
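If you want to listen to generated mel-spectrograms outside the provided pipeline, they can be roughly inverted with Griffin-Lim. This is only a sketch: the output path and the STFT parameters below are assumptions and must match those used during feature extraction:

import numpy as np
import librosa
import soundfile as sf

# Hypothetical output file; librosa expects shape (n_mels, T).
mel = np.load('outputs/sample-mel.npy').T
# NOTE: if the model outputs normalized log-mels, denormalize and
# exponentiate before inversion.
wav = librosa.feature.inverse.mel_to_audio(
    mel, sr=22050, n_fft=1024, hop_length=256)  # assumed STFT settings
sf.write('outputs/sample.wav', wav, 22050)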
Fine-tune the model and generate audio samples for a conversion pair. As in pre-training, inference can run in either TTS mode (text inputs) or VC mode (mel-spectrogram inputs).
$ cd fine-tune
$ bash run.sh
- "Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations", Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai, accepted by IEEE/ACM Transactions on Aduio, Speech and Language Processing, 2019.
- "Sequence-to-Sequence Acoustic Modeling for Voice Conversion", Jing-Xuan Zhang, Zhen-Hua Ling, Li-Juan Liu, Yuan Jiang, Li-Rong Dai, IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 27, no. 3, pp. 631-644, March 2019.
- "Forward Attention in Sequence-to-sequence Acoustic Modelling for Speech Synthesis", Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai, ICASSP, pp. 4789–4793, 2018.
Part of the code was adapted from the following project: