- HifiTTS: a high-resolution multi-speaker English dataset, used here as the baseline. Can be downloaded here.
- Generate phonetic alignments using GlowTTS:
  a) Download the GlowTTS model checkpoint.
  b) Update `GLOW_TTS_CKPT_PATH` in the `compute_glowtts_alignments.py` script.
  c) Prepare a GlowTTS filelist, or use this example for the HiFiTTS dataset (you need to download the dataset first).
  d) Prepare a GlowTTS config, changing:
     - `"training_files"` to your filelist,
     - `"cmudict_path"` to `<nansypp_path>/static/tts/cmu_dictionary`.
  e) Run the alignment script:
     ```bash
     python src/data/preprocessing/compute_glowtts_alignments.py <config_file> <input_dir> <output_dir>
     ```
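The layout of the alignment files in `<output_dir>` is not specified above, so as a hypothetical illustration: if GlowTTS-style alignments are stored as per-phoneme integer durations (in mel frames), they can be expanded into the frame-level phoneme sequence a TTS decoder consumes. All names and values below are invented for the sketch:

```python
import numpy as np

# Hypothetical example: GlowTTS-style alignments as per-phoneme integer
# durations (in mel frames); the exact on-disk format may differ.
durations = np.array([3, 1, 4, 2])      # frames assigned to each phoneme
phoneme_ids = np.array([10, 7, 23, 5])  # token ids for the utterance

# Expand to a frame-level phoneme sequence.
frame_phonemes = np.repeat(phoneme_ids, durations)

# The alignment must cover every mel frame exactly once.
assert frame_phonemes.shape[0] == durations.sum()
print(frame_phonemes)  # [10 10 10  7 23 23 23 23  5  5]
```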
- Decode audio using:
  ```bash
  python src/data/preprocessing/decode.py -i <input_dir> -o <output_dir> -sr 44100
  ```
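The `-sr 44100` flag brings all audio to a uniform 44.1 kHz sample rate. As a rough sketch of what resampling does (the actual script presumably uses a proper polyphase/sinc resampler, not this toy linear interpolation):

```python
import numpy as np

def resample_linear(audio: np.ndarray, sr_in: int, sr_out: int) -> np.ndarray:
    """Crude linear-interpolation resampler, for illustration only."""
    n_out = int(round(len(audio) * sr_out / sr_in))
    t_in = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    t_out = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(t_out, t_in, audio)

# One second of a 440 Hz tone at 22.05 kHz, upsampled to 44.1 kHz.
sr_in, sr_out = 22050, 44100
tone = np.sin(2 * np.pi * 440 * np.arange(sr_in) / sr_in)
upsampled = resample_linear(tone, sr_in, sr_out)
print(len(upsampled))  # 44100
```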
- Compute TTS targets using:
  ```bash
  python -m src.data.preprocessing.precompute_tts_targets \
      <decoded_output_dir>/dataset.csv \
      <sample_rate> \
      <tts_targets_dir> \
      <backbone_exp_dir> \
      <backbone_ckpt_name>
  ```
- Create the train/validation split (the header row is kept in both files; the first 1000 data rows go to validation, the rest to train):
  ```bash
  head -n 1001 <tts_targets_dir>/dataset.csv > <tts_targets_dir>/validation_dataset.csv
  head -n 1 <tts_targets_dir>/dataset.csv > <tts_targets_dir>/train_dataset.csv
  sed -n '1002,$p' <tts_targets_dir>/dataset.csv >> <tts_targets_dir>/train_dataset.csv
  ```
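The shell split above copies the header row into both output files, sends the first 1000 data rows to validation, and appends the remainder to train. A minimal Python sketch of the same logic on a synthetic `dataset.csv` (the column names are invented for the example):

```python
import csv
import io

def split_dataset(csv_text: str, n_val: int = 1000):
    """Replicate the shell split: header row copied into both outputs,
    first n_val data rows to validation, the rest to train."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    return [header] + data[:n_val], [header] + data[n_val:]

# Tiny synthetic dataset.csv with 5 data rows, split 2 / 3.
text = "path,text\n" + "\n".join(f"a{i}.wav,hello {i}" for i in range(5)) + "\n"
val, train = split_dataset(text, n_val=2)
print(len(val), len(train))  # 3 4
```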
- Edit the TTS training config: specify `<tts_targets_dir>` and `<alignment_dir>`.
- Run the training script:
  ```bash
  python src/train/tts.py --config-name=hifitts +trainer.devices=<list_of_gpu_ids>
  ```
Run the checkpoint-download script: it downloads a checkpoint we trained with this repository for 200k training steps and places it in the right directory, so that the inference and app steps below work smoothly:

```bash
python src/utilities/download_checkpoints.py
```
An inferencer class is provided in the source code and can be invoked from the command line as follows:

```bash
python src/inference/tts.py \
    <experiment_directory> \
    <checkpoint_filename> \
    <audio_path> \
    <text> \
    <output_path> \
    -d <device>
```
Example:

```bash
python src/inference/tts.py \
    "static/runs/runs_tts/hifitts/2023-10-03_18-23-00" \
    "steps=step=15000.ckpt" \
    "static/samples/vctk/p238_001.wav" \
    "To be or not to be that is the question" \
    "static/tmp/to_be.wav"
```
Launch the demo app with Streamlit:

```bash
streamlit run app/text_to_speech.py --server.port <port_number>
```
During training, you can visualize logs with TensorBoard:

```bash
tensorboard --logdir=static/runs/runs_tts --bind_all --port <port_number>
```
Observations and key R&D results are detailed here.
Results from checkpoints trained with this repo are showcased on this Notion page.