
Commit

Merge branch 'open-mmlab:main' into eval_refactor
VocodexElysium authored Feb 27, 2024
2 parents 7109a7c + 5b71bcf commit 8d7155e
Showing 74 changed files with 3,933 additions and 262 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -47,6 +47,7 @@ ckpts
*.wav
*.flac
pretrained/wenet/*conformer_exp
!egs/tts/VALLE/prompt_examples/*.wav

# Runtime data dirs
processed_data
64 changes: 64 additions & 0 deletions Dockerfile
@@ -0,0 +1,64 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

# Other version: https://hub.docker.com/r/nvidia/cuda/tags
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu18.04

ARG DEBIAN_FRONTEND=noninteractive
ARG PYTORCH='2.0.0'
ARG CUDA='cu118'
ARG SHELL='/bin/bash'
ARG MINICONDA='Miniconda3-py39_23.3.1-0-Linux-x86_64.sh'

ENV LANG=en_US.UTF-8 PYTHONIOENCODING=utf-8 PYTHONDONTWRITEBYTECODE=1 CUDA_HOME=/usr/local/cuda CONDA_HOME=/opt/conda SHELL=${SHELL}
ENV PATH=$CONDA_HOME/bin:$CUDA_HOME/bin:$PATH \
LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH \
LIBRARY_PATH=$CUDA_HOME/lib64:$LIBRARY_PATH \
CONDA_PREFIX=$CONDA_HOME \
NCCL_HOME=$CUDA_HOME

# Install ubuntu packages
RUN sed -i 's/archive.ubuntu.com/mirrors.cloud.tencent.com/g' /etc/apt/sources.list \
&& sed -i 's/security.ubuntu.com/mirrors.cloud.tencent.com/g' /etc/apt/sources.list \
&& rm /etc/apt/sources.list.d/cuda.list \
&& apt-get update \
&& apt-get -y install \
python3-pip ffmpeg git less wget libsm6 libxext6 libxrender-dev \
build-essential cmake pkg-config libx11-dev libatlas-base-dev \
libgtk-3-dev libboost-python-dev vim libgl1-mesa-glx \
libaio-dev software-properties-common tmux \
espeak-ng

# Install miniconda with python 3.9
USER root
# COPY Miniconda3-py39_23.3.1-0-Linux-x86_64.sh /root/anaconda.sh
RUN wget -t 0 -c -O /tmp/anaconda.sh https://repo.anaconda.com/miniconda/${MINICONDA} \
&& mv /tmp/anaconda.sh /root/anaconda.sh \
&& ${SHELL} /root/anaconda.sh -b -p $CONDA_HOME \
&& rm /root/anaconda.sh

RUN conda create -y --name amphion python=3.9.15

WORKDIR /app
COPY env.sh env.sh
RUN chmod +x ./env.sh

RUN ["conda", "run", "-n", "amphion", "-vvv", "--no-capture-output", "./env.sh"]

RUN conda init \
&& echo "\nconda activate amphion\n" >> ~/.bashrc

CMD ["/bin/bash"]

# *** Build ***
# docker build -t realamphion/amphion .

# *** Run ***
# cd Amphion
# docker run --runtime=nvidia --gpus all -it -v .:/app -v /mnt:/mnt_host realamphion/amphion

# *** Push and release ***
# docker login
# docker push realamphion/amphion
33 changes: 30 additions & 3 deletions README.md
@@ -33,7 +33,7 @@ Here is the Amphion v0.1 demo, whose voice, audio effects, and singing voice are
)

## 🚀 News

- **2024/02/22**: The first Amphion visualization tool, **SingVisio**, is released. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2402.12660) [![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/SingVisio) [![Video](https://img.shields.io/badge/Video-Demo-orange)](https://drive.google.com/file/d/1w5xgsfaLxBcUvzq3rgejZ6jfgu6hwC0c/view) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/visualization/SingVisio/README.md)
- **2023/12/18**: Amphion v0.1 release. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2312.09911) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Amphion-pink)](https://huggingface.co/amphion) [![youtube](https://img.shields.io/badge/YouTube-Demo-red)](https://www.youtube.com/watch?v=1aw0HhcggvQ) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](https://github.com/open-mmlab/Amphion/pull/39)
- **2023/11/28**: Amphion alpha release. [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](https://github.com/open-mmlab/Amphion/pull/2)

@@ -79,8 +79,19 @@ Amphion provides a comprehensive objective evaluation of the generated audio. Th

Amphion unifies the data preprocessing of open-source datasets including [AudioCaps](https://audiocaps.github.io/), [LibriTTS](https://www.openslr.org/60/), [LJSpeech](https://keithito.com/LJ-Speech-Dataset/), [M4Singer](https://github.com/M4Singer/M4Singer), [Opencpop](https://wenet.org.cn/opencpop/), [OpenSinger](https://github.com/Multi-Singer/Multi-Singer.github.io), [SVCC](http://vc-challenge.org/), [VCTK](https://datashare.ed.ac.uk/handle/10283/3443), and more. The supported dataset list can be seen [here](egs/datasets/README.md) (updating).

### Visualization

Amphion provides visualization tools that interactively illustrate the internal processing mechanisms of classic models, serving as an invaluable resource for education and for making research more interpretable.

Currently, Amphion supports [SingVisio](egs/visualization/SingVisio/README.md), a visualization tool of the diffusion model for singing voice conversion. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2402.12660) [![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/SingVisio) [![Video](https://img.shields.io/badge/Video-Demo-orange)](https://drive.google.com/file/d/1w5xgsfaLxBcUvzq3rgejZ6jfgu6hwC0c/view)


## 📀 Installation

Amphion can be installed through either the Setup Installer or the Docker Image.

### Setup Installer

```bash
git clone https://github.com/open-mmlab/Amphion.git
cd Amphion
@@ -93,6 +104,21 @@ conda activate amphion
sh env.sh
```

### Docker Image

1. Install [Docker](https://docs.docker.com/get-docker/), [NVIDIA Driver](https://www.nvidia.com/download/index.aspx), [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html), and [CUDA](https://developer.nvidia.com/cuda-downloads).

2. Run the following commands:
```bash
git clone https://github.com/open-mmlab/Amphion.git
cd Amphion

docker pull realamphion/amphion
docker run --runtime=nvidia --gpus all -it -v .:/app realamphion/amphion
```
Mounting the dataset with the `-v` argument is necessary when using Docker; an example is shown below. Please refer to [Mount dataset in Docker container](egs/datasets/docker.md) and [Docker Docs](https://docs.docker.com/engine/reference/commandline/container_run/#volume) for more details.
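
For example, to also mount the LJSpeech dataset (the host path below is a placeholder; replace it with the actual location on your machine):

```bash
# Mount the repository to /app and the dataset to /mnt/LJSpeech (example paths)
docker run --runtime=nvidia --gpus all -it \
    -v .:/app \
    -v /home/username/datasets/LJSpeech:/mnt/LJSpeech \
    realamphion/amphion
```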


## 🐍 Usage in Python

We detail the instructions of different tasks in the following recipes:
@@ -102,6 +128,7 @@ We detail the instructions of different tasks in the following recipes:
- [Text to Audio (TTA)](egs/tta/README.md)
- [Vocoder](egs/vocoder/README.md)
- [Evaluation](egs/metrics/README.md)
- [Visualization](egs/visualization/README.md)

## 👨‍💻 Contributing
We appreciate all contributions to improve Amphion. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guideline.
@@ -127,9 +154,9 @@ Amphion is under the [MIT License](LICENSE). It is free for both research and co
```bibtex
@article{zhang2023amphion,
title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
author={Xueyao Zhang and Liumeng Xue and Yuancheng Wang and Yicheng Gu and Xi Chen and Zihao Fang and Haopeng Chen and Lexiao Zou and Chaoren Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
author={Xueyao Zhang and Liumeng Xue and Yicheng Gu and Yuancheng Wang and Haorui He and Chaoren Wang and Xi Chen and Zihao Fang and Haopeng Chen and Junan Zhang and Tze Ying Tang and Lexiao Zou and Mingxuan Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
journal={arXiv},
year={2023},
year={2024},
volume={abs/2312.09911}
}
```
8 changes: 5 additions & 3 deletions bins/svc/train.py
@@ -87,9 +87,11 @@ def main():
for dataset in cfg.preprocess.data_augment:
new_datasets = [
f"{dataset}_pitch_shift" if cfg.preprocess.use_pitch_shift else None,
f"{dataset}_formant_shift"
if cfg.preprocess.use_formant_shift
else None,
(
f"{dataset}_formant_shift"
if cfg.preprocess.use_formant_shift
else None
),
f"{dataset}_equalizer" if cfg.preprocess.use_equalizer else None,
f"{dataset}_time_stretch" if cfg.preprocess.use_time_stretch else None,
]
4 changes: 2 additions & 2 deletions bins/tts/preprocess.py
@@ -88,11 +88,11 @@ def extract_phonme_sequences(dataset, output_path, cfg, dataset_types):
dataset_file = os.path.join(dataset_output, "{}.json".format(dataset_type))
with open(dataset_file, "r") as f:
metadata.extend(json.load(f))
phone_extractor.extract_utt_phone_sequence(cfg, metadata)
phone_extractor.extract_utt_phone_sequence(dataset, cfg, metadata)


def preprocess(cfg, args):
"""Proprocess raw data of single or multiple datasets (in cfg.dataset)
"""Preprocess raw data of single or multiple datasets (in cfg.dataset)
Args:
cfg (dict): dictionary that stores configurations
8 changes: 5 additions & 3 deletions bins/tts/train.py
@@ -86,9 +86,11 @@ def main():
for dataset in cfg.preprocess.data_augment:
new_datasets = [
f"{dataset}_pitch_shift" if cfg.preprocess.use_pitch_shift else None,
f"{dataset}_formant_shift"
if cfg.preprocess.use_formant_shift
else None,
(
f"{dataset}_formant_shift"
if cfg.preprocess.use_formant_shift
else None
),
f"{dataset}_equalizer" if cfg.preprocess.use_equalizer else None,
f"{dataset}_time_stretch" if cfg.preprocess.use_time_stretch else None,
]
2 changes: 1 addition & 1 deletion config/base.json
@@ -122,7 +122,7 @@
"align_mel_duration": false
},
"train": {
"ddp": true,
"ddp": false,
"random_seed": 970227,
"batch_size": 16,
"max_steps": 1000000,
8 changes: 4 additions & 4 deletions config/comosvc.json
@@ -127,7 +127,7 @@
"sigma_min": 0.002,
"sigma_max": 80,
"rho": 7,
"n_timesteps": 40,
"n_timesteps": 18,
},
"diffusion": {
// Diffusion steps encoder
@@ -154,7 +154,7 @@
"train": {
// Basic settings
"fast_steps": 0,
"batch_size": 32,
"batch_size": 64,
"gradient_accumulation_step": 1,
"max_epoch": -1,
// -1 means no limit
@@ -195,7 +195,7 @@
// Optimizer
"optimizer": "AdamW",
"adamw": {
"lr": 4.0e-4
"lr": 5.0e-5
// nn model lr
},
// LR Scheduler
@@ -204,7 +204,7 @@
"factor": 0.8,
"patience": 10,
// unit is epoch
"min_lr": 1.0e-4
"min_lr": 5.0e-6
}
},
"inference": {
1 change: 1 addition & 0 deletions config/fs2.json
@@ -93,6 +93,7 @@
},
"train":{
"batch_size": 16,
"max_epoch": 100,
"sort_sample": true,
"drop_last": true,
"group_size": 4,
3 changes: 2 additions & 1 deletion config/tts.json
@@ -8,14 +8,15 @@
],
"task_type": "tts",
"preprocess": {
"language": "en-us",
"language": "en-us", // espeak supports 100 languages https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md
// linguistic features
"extract_phone": true,
"phone_extractor": "espeak", // "espeak, pypinyin, pypinyin_initials_finals, lexicon (only for language=en-us right now)"
"lexicon_path": "./text/lexicon/librispeech-lexicon.txt",
// Directory names of processed data or extracted features
"phone_dir": "phones",
"use_phone": true,
"add_blank": true
},
"model": {
"text_token_num": 512,
33 changes: 33 additions & 0 deletions egs/datasets/README.md
@@ -6,6 +6,7 @@ Amphion support the following academic datasets (sort alphabetically):
- [AudioCaps](#audiocaps)
- [CSD](#csd)
- [CustomSVCDataset](#customsvcdataset)
- [Hi-Fi TTS](#hifitts)
- [KiSing](#kising)
- [LibriLight](#librilight)
- [LibriTTS](#libritts)
@@ -23,6 +24,8 @@

The downloading link and the file structure tree of each dataset is displayed as follows.

> **Note:** When using Docker to run Amphion, mounting the dataset into the container after downloading is necessary. Check [Mount dataset in Docker container](./docker.md) for more details.
## AudioCaps

AudioCaps is a dataset of around 44K audio-caption pairs, where each audio clip corresponds to a caption with rich semantic information.
@@ -73,6 +76,36 @@ We support custom dataset for Singing Voice Conversion. Organize your data in th
┣ ...
```


## Hi-Fi TTS

Download the official Hi-Fi TTS dataset [here](https://www.openslr.org/109/). The file structure looks like below:

```plaintext
[Hi-Fi TTS dataset path]
┣ audio
┃ ┣ 11614_other {Speaker_ID}_{SNR_subset}
┃ ┃ ┣ 10547 {Book_ID}
┃ ┃ ┃ ┣ thousandnights8_04_anonymous_0001.flac
┃ ┃ ┃ ┣ thousandnights8_04_anonymous_0003.flac
┃ ┃ ┃ ┣ thousandnights8_04_anonymous_0004.flac
┃ ┃ ┃ ┣ ...
┃ ┃ ┣ ...
┃ ┣ ...
┣ 92_manifest_clean_dev.json
┣ 92_manifest_clean_test.json
┣ 92_manifest_clean_train.json
┣ ...
┣ {Speaker_ID}_manifest_{SNR_subset}_{dataset_split}.json
┣ ...
┣ books_bandwidth.tsv
┣ LICENSE.txt
┣ readers_books_clean.txt
┣ readers_books_other.txt
┣ README.txt
```

## KiSing

Download the official KiSing dataset [here](http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/). The file structure looks like below:
19 changes: 19 additions & 0 deletions egs/datasets/docker.md
@@ -0,0 +1,19 @@
# Mount dataset in Docker container

When using Docker to run Amphion, the dataset needs to be mounted into the container first. It is recommended to mount the dataset to `/mnt/<dataset_name>` in the container, where `<dataset_name>` is the name of the dataset.

When configuring the dataset in `exp_config.json`, you should use the path `/mnt/<dataset_name>` as the dataset path instead of the actual path on your host machine. Otherwise, the dataset will not be found in the container.

## Mount Example

```bash
docker run --runtime=nvidia --gpus all -it -v .:/app -v <dataset_path1>:/mnt/<dataset_name1> -v <dataset_path2>:/mnt/<dataset_name2> amphion
```

For example, if you want to use the `LJSpeech` dataset, you can mount the dataset to `/mnt/LJSpeech` in the container.

```bash
docker run --runtime=nvidia --gpus all -it -v .:/app -v /home/username/datasets/LJSpeech:/mnt/LJSpeech amphion
```

If you want to use multiple datasets, you can mount them to different directories in the container by adding more `-v` options.
32 changes: 29 additions & 3 deletions egs/tts/FastSpeech2/README.md
@@ -83,6 +83,11 @@ sh egs/tts/FastSpeech2/run.sh --stage 2 --name [YourExptName]

## 4. Inference

### Pre-trained Fastspeech 2 and HiFi-GAN Download

We have released a pre-trained Amphion [Fastspeech 2](https://huggingface.co/amphion/fastspeech2_ljspeech) model and a [HiFi-GAN](https://huggingface.co/amphion/hifigan_ljspeech) vocoder, both trained on LJSpeech. You can download them and generate speech following the inference instructions below.
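
One possible way to fetch both checkpoints is to clone the Hugging Face repositories; the target directories below are an assumed layout, not a requirement of the recipe, chosen to match the `--vocoder_dir` used in the commands later in this section:

```bash
# Assumed local directories; any location works as long as the inference flags point to it
git lfs install
git clone https://huggingface.co/amphion/fastspeech2_ljspeech ckpts/tts/fastspeech2_ljspeech
git clone https://huggingface.co/amphion/hifigan_ljspeech ckpts/vocoder/hifigan_ljspeech
```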


### Configuration

For inference, you need to specify the following configurations when running `run.sh`:
@@ -96,6 +101,8 @@ For inference, you need to specify the following configurations when running `ru
| `--infer_dataset` | The dataset used for inference. | For LJSpeech dataset, the inference dataset would be `LJSpeech`. |
| `--infer_testing_set` | The subset of the inference dataset used for inference, e.g., train, test, golden_test | For LJSpeech dataset, the testing set would be the "`test`" split created from LJSpeech at feature extraction, or "`golden_test`", cherry-picked from the test set as a template testing set. |
| `--infer_text` | The text to be synthesized. | "`This is a clip of generated speech with the given text from a TTS model.`" |
| `--vocoder_dir` | The directory for the vocoder. | "`ckpts/vocoder/hifigan_ljspeech`" |


### Run
For example, if you want to generate speech of all testing set split from LJSpeech, just run:
@@ -106,7 +113,8 @@ sh egs/tts/FastSpeech2/run.sh --stage 3 \
--infer_output_dir ckpts/tts/[YourExptName]/result \
--infer_mode "batch" \
--infer_dataset "LJSpeech" \
--infer_testing_set "test"
--infer_testing_set "test" \
--vocoder_dir ckpts/vocoder/hifigan_ljspeech/checkpoints
```

Or, if you want to generate a single clip of speech from a given text, just run:
@@ -116,10 +124,28 @@ sh egs/tts/FastSpeech2/run.sh --stage 3 \
--infer_expt_dir ckpts/tts/[YourExptName] \
--infer_output_dir ckpts/tts/[YourExptName]/result \
--infer_mode "single" \
--infer_text "This is a clip of generated speech with the given text from a TTS model."
--infer_text "This is a clip of generated speech with the given text from a TTS model." \
--vocoder_dir ckpts/vocoder/hifigan_ljspeech
```

### Issues and Solutions

```
NotImplementedError: Using RTX 3090 or 4000 series doesn't support faster communication broadband via P2P or IB. Please set `NCCL_P2P_DISABLE="1"` and `NCCL_IB_DISABLE="1" or use `accelerate launch` which will do this automatically.
2024-02-24 10:57:49 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
```
The error message indicates an incompatibility of NVIDIA RTX 3090 and 4000 series GPUs with peer-to-peer (P2P) communication and InfiniBand (IB) for faster multi-GPU communication. This incompatibility is surfaced by the `accelerate` library, which facilitates distributed training and inference.

To fix this issue, before running your script, you can set the environment variables in your terminal:
```
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
```
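
Alternatively, the same variables can be prefixed to a single command so they apply to that run only, e.g. with the batch-inference command from above:

```bash
# One-off form: the variables are set only for this invocation
NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 sh egs/tts/FastSpeech2/run.sh --stage 3 --name [YourExptName] ...
```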

### Note
Extensive logging messages related to `torch._subclasses.fake_tensor` and `torch._dynamo.output_graph` may be observed during inference. No effective way to suppress these logs has been found yet, but they do not affect the inference results.


A pre-trained FastSpeech2 model trained on LJSpeech has been released (see the download links above), so you can download it and generate speech following the above inference instructions.


```bibtex
