
Commit

Merge branch 'open-mmlab:main' into eval_refactor
VocodexElysium authored Feb 27, 2024
2 parents 7109a7c + 5b71bcf commit 8d7155e
Showing 74 changed files with 3,933 additions and 262 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -47,6 +47,7 @@ ckpts
*.wav
*.flac
pretrained/wenet/*conformer_exp
!egs/tts/VALLE/prompt_examples/*.wav

# Runtime data dirs
processed_data
64 changes: 64 additions & 0 deletions Dockerfile
@@ -0,0 +1,64 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

# Other version: https://hub.docker.com/r/nvidia/cuda/tags
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu18.04

ARG DEBIAN_FRONTEND=noninteractive
ARG PYTORCH='2.0.0'
ARG CUDA='cu118'
ARG SHELL='/bin/bash'
ARG MINICONDA='Miniconda3-py39_23.3.1-0-Linux-x86_64.sh'

ENV LANG=en_US.UTF-8 PYTHONIOENCODING=utf-8 PYTHONDONTWRITEBYTECODE=1 CUDA_HOME=/usr/local/cuda CONDA_HOME=/opt/conda SHELL=${SHELL}
ENV PATH=$CONDA_HOME/bin:$CUDA_HOME/bin:$PATH \
LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH \
LIBRARY_PATH=$CUDA_HOME/lib64:$LIBRARY_PATH \
CONDA_PREFIX=$CONDA_HOME \
NCCL_HOME=$CUDA_HOME

# Install ubuntu packages
RUN sed -i 's/archive.ubuntu.com/mirrors.cloud.tencent.com/g' /etc/apt/sources.list \
&& sed -i 's/security.ubuntu.com/mirrors.cloud.tencent.com/g' /etc/apt/sources.list \
&& rm /etc/apt/sources.list.d/cuda.list \
&& apt-get update \
&& apt-get -y install \
python3-pip ffmpeg git less wget libsm6 libxext6 libxrender-dev \
build-essential cmake pkg-config libx11-dev libatlas-base-dev \
libgtk-3-dev libboost-python-dev vim libgl1-mesa-glx \
libaio-dev software-properties-common tmux \
espeak-ng

# Install miniconda with python 3.9
USER root
# COPY Miniconda3-py39_23.3.1-0-Linux-x86_64.sh /root/anaconda.sh
RUN wget -t 0 -c -O /tmp/anaconda.sh https://repo.anaconda.com/miniconda/${MINICONDA} \
&& mv /tmp/anaconda.sh /root/anaconda.sh \
&& ${SHELL} /root/anaconda.sh -b -p $CONDA_HOME \
&& rm /root/anaconda.sh

RUN conda create -y --name amphion python=3.9.15

WORKDIR /app
COPY env.sh env.sh
RUN chmod +x ./env.sh

RUN ["conda", "run", "-n", "amphion", "-vvv", "--no-capture-output", "./env.sh"]

RUN conda init \
&& echo "\nconda activate amphion\n" >> ~/.bashrc

CMD ["/bin/bash"]

# *** Build ***
# docker build -t realamphion/amphion .

# *** Run ***
# cd Amphion
# docker run --runtime=nvidia --gpus all -it -v .:/app -v /mnt:/mnt_host realamphion/amphion

# *** Push and release ***
# docker login
# docker push realamphion/amphion
33 changes: 30 additions & 3 deletions README.md
@@ -33,7 +33,7 @@ Here is the Amphion v0.1 demo, whose voice, audio effects, and singing voice are
)

## 🚀 News

- **2024/02/22**: The first Amphion visualization tool, **SingVisio**, is released. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2402.12660) [![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/SingVisio) [![Video](https://img.shields.io/badge/Video-Demo-orange)](https://drive.google.com/file/d/1w5xgsfaLxBcUvzq3rgejZ6jfgu6hwC0c/view) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/visualization/SingVisio/README.md)
- **2023/12/18**: Amphion v0.1 release. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2312.09911) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Amphion-pink)](https://huggingface.co/amphion) [![youtube](https://img.shields.io/badge/YouTube-Demo-red)](https://www.youtube.com/watch?v=1aw0HhcggvQ) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](https://github.com/open-mmlab/Amphion/pull/39)
- **2023/11/28**: Amphion alpha release. [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](https://github.com/open-mmlab/Amphion/pull/2)

@@ -79,8 +79,19 @@ Amphion provides a comprehensive objective evaluation of the generated audio. Th

Amphion unifies the data preprocessing of open-source datasets including [AudioCaps](https://audiocaps.github.io/), [LibriTTS](https://www.openslr.org/60/), [LJSpeech](https://keithito.com/LJ-Speech-Dataset/), [M4Singer](https://github.com/M4Singer/M4Singer), [Opencpop](https://wenet.org.cn/opencpop/), [OpenSinger](https://github.com/Multi-Singer/Multi-Singer.github.io), [SVCC](http://vc-challenge.org/), [VCTK](https://datashare.ed.ac.uk/handle/10283/3443), and more. The supported dataset list can be seen [here](egs/datasets/README.md) (updating).

### Visualization

Amphion provides visualization tools that interactively illustrate the internal processing mechanisms of classic models, serving as an invaluable resource for education and for making research more interpretable.

Currently, Amphion supports [SingVisio](egs/visualization/SingVisio/README.md), a visualization tool of the diffusion model for singing voice conversion. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2402.12660) [![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/SingVisio) [![Video](https://img.shields.io/badge/Video-Demo-orange)](https://drive.google.com/file/d/1w5xgsfaLxBcUvzq3rgejZ6jfgu6hwC0c/view)


## 📀 Installation

Amphion can be installed through either the Setup Installer or the Docker Image.

### Setup Installer

```bash
git clone https://github.com/open-mmlab/Amphion.git
cd Amphion
@@ -93,6 +104,21 @@ conda activate amphion
sh env.sh
```

### Docker Image

1. Install [Docker](https://docs.docker.com/get-docker/), [NVIDIA Driver](https://www.nvidia.com/download/index.aspx), [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html), and [CUDA](https://developer.nvidia.com/cuda-downloads).

2. Run the following commands:
```bash
git clone https://github.com/open-mmlab/Amphion.git
cd Amphion

docker pull realamphion/amphion
docker run --runtime=nvidia --gpus all -it -v .:/app realamphion/amphion
```
Mounting the dataset with the `-v` argument is necessary when using Docker; an example is shown below. Please refer to [Mount dataset in Docker container](egs/datasets/docker.md) and [Docker Docs](https://docs.docker.com/engine/reference/commandline/container_run/#volume) for more details.
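
For example, to also mount the LJSpeech dataset (the host path below is a placeholder; replace it with the actual location on your machine):

```bash
# Mount the repository to /app and the dataset to /mnt/LJSpeech (example paths)
docker run --runtime=nvidia --gpus all -it \
    -v .:/app \
    -v /home/username/datasets/LJSpeech:/mnt/LJSpeech \
    realamphion/amphion
```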


## 🐍 Usage in Python

We detail the instructions of different tasks in the following recipes:
@@ -102,6 +128,7 @@ We detail the instructions of different tasks in the following recipes:
- [Text to Audio (TTA)](egs/tta/README.md)
- [Vocoder](egs/vocoder/README.md)
- [Evaluation](egs/metrics/README.md)
- [Visualization](egs/visualization/README.md)

## 👨‍💻 Contributing
We appreciate all contributions to improve Amphion. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guideline.
@@ -127,9 +154,9 @@ Amphion is under the [MIT License](LICENSE). It is free for both research and co
```bibtex
@article{zhang2023amphion,
title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
author={Xueyao Zhang and Liumeng Xue and Yuancheng Wang and Yicheng Gu and Xi Chen and Zihao Fang and Haopeng Chen and Lexiao Zou and Chaoren Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
author={Xueyao Zhang and Liumeng Xue and Yicheng Gu and Yuancheng Wang and Haorui He and Chaoren Wang and Xi Chen and Zihao Fang and Haopeng Chen and Junan Zhang and Tze Ying Tang and Lexiao Zou and Mingxuan Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
journal={arXiv},
year={2023},
year={2024},
volume={abs/2312.09911}
}
```
8 changes: 5 additions & 3 deletions bins/svc/train.py
@@ -87,9 +87,11 @@ def main():
for dataset in cfg.preprocess.data_augment:
new_datasets = [
f"{dataset}_pitch_shift" if cfg.preprocess.use_pitch_shift else None,
f"{dataset}_formant_shift"
if cfg.preprocess.use_formant_shift
else None,
(
f"{dataset}_formant_shift"
if cfg.preprocess.use_formant_shift
else None
),
f"{dataset}_equalizer" if cfg.preprocess.use_equalizer else None,
f"{dataset}_time_stretch" if cfg.preprocess.use_time_stretch else None,
]
4 changes: 2 additions & 2 deletions bins/tts/preprocess.py
@@ -88,11 +88,11 @@ def extract_phonme_sequences(dataset, output_path, cfg, dataset_types):
dataset_file = os.path.join(dataset_output, "{}.json".format(dataset_type))
with open(dataset_file, "r") as f:
metadata.extend(json.load(f))
phone_extractor.extract_utt_phone_sequence(cfg, metadata)
phone_extractor.extract_utt_phone_sequence(dataset, cfg, metadata)


def preprocess(cfg, args):
"""Proprocess raw data of single or multiple datasets (in cfg.dataset)
"""Preprocess raw data of single or multiple datasets (in cfg.dataset)
Args:
cfg (dict): dictionary that stores configurations
8 changes: 5 additions & 3 deletions bins/tts/train.py
@@ -86,9 +86,11 @@ def main():
for dataset in cfg.preprocess.data_augment:
new_datasets = [
f"{dataset}_pitch_shift" if cfg.preprocess.use_pitch_shift else None,
f"{dataset}_formant_shift"
if cfg.preprocess.use_formant_shift
else None,
(
f"{dataset}_formant_shift"
if cfg.preprocess.use_formant_shift
else None
),
f"{dataset}_equalizer" if cfg.preprocess.use_equalizer else None,
f"{dataset}_time_stretch" if cfg.preprocess.use_time_stretch else None,
]
2 changes: 1 addition & 1 deletion config/base.json
@@ -122,7 +122,7 @@
"align_mel_duration": false
},
"train": {
"ddp": true,
"ddp": false,
"random_seed": 970227,
"batch_size": 16,
"max_steps": 1000000,
8 changes: 4 additions & 4 deletions config/comosvc.json
@@ -127,7 +127,7 @@
"sigma_min": 0.002,
"sigma_max": 80,
"rho": 7,
"n_timesteps": 40,
"n_timesteps": 18,
},
"diffusion": {
// Diffusion steps encoder
@@ -154,7 +154,7 @@
"train": {
// Basic settings
"fast_steps": 0,
"batch_size": 32,
"batch_size": 64,
"gradient_accumulation_step": 1,
"max_epoch": -1,
// -1 means no limit
@@ -195,7 +195,7 @@
// Optimizer
"optimizer": "AdamW",
"adamw": {
"lr": 4.0e-4
"lr": 5.0e-5
// nn model lr
},
// LR Scheduler
@@ -204,7 +204,7 @@
"factor": 0.8,
"patience": 10,
// unit is epoch
"min_lr": 1.0e-4
"min_lr": 5.0e-6
}
},
"inference": {
1 change: 1 addition & 0 deletions config/fs2.json
@@ -93,6 +93,7 @@
},
"train":{
"batch_size": 16,
"max_epoch": 100,
"sort_sample": true,
"drop_last": true,
"group_size": 4,
3 changes: 2 additions & 1 deletion config/tts.json
@@ -8,14 +8,15 @@
],
"task_type": "tts",
"preprocess": {
"language": "en-us",
"language": "en-us", // espeak supports 100 languages https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md
// linguistic features
"extract_phone": true,
"phone_extractor": "espeak", // "espeak, pypinyin, pypinyin_initials_finals, lexicon (only for language=en-us right now)"
"lexicon_path": "./text/lexicon/librispeech-lexicon.txt",
// Directory names of processed data or extracted features
"phone_dir": "phones",
"use_phone": true,
"add_blank": true
},
"model": {
"text_token_num": 512,
33 changes: 33 additions & 0 deletions egs/datasets/README.md
@@ -6,6 +6,7 @@ Amphion support the following academic datasets (sort alphabetically):
- [AudioCaps](#audiocaps)
- [CSD](#csd)
- [CustomSVCDataset](#customsvcdataset)
- [Hi-Fi TTS](#hifitts)
- [KiSing](#kising)
- [LibriLight](#librilight)
- [LibriTTS](#libritts)
@@ -23,6 +24,8 @@

The downloading link and the file structure tree of each dataset is displayed as follows.

> **Note:** When using Docker to run Amphion, mounting the dataset into the container after downloading is necessary. Check [Mount dataset in Docker container](./docker.md) for more details.
## AudioCaps

AudioCaps is a dataset of around 44K audio-caption pairs, where each audio clip corresponds to a caption with rich semantic information.
@@ -73,6 +76,36 @@ We support custom dataset for Singing Voice Conversion. Organize your data in th
┣ ...
```


## Hi-Fi TTS

Download the official Hi-Fi TTS dataset [here](https://www.openslr.org/109/). The file structure looks like below:

```plaintext
[Hi-Fi TTS dataset path]
┣ audio
┃ ┣ 11614_other {Speaker_ID}_{SNR_subset}
┃ ┃ ┣ 10547 {Book_ID}
┃ ┃ ┃ ┣ thousandnights8_04_anonymous_0001.flac
┃ ┃ ┃ ┣ thousandnights8_04_anonymous_0003.flac
┃ ┃ ┃ ┣ thousandnights8_04_anonymous_0004.flac
┃ ┃ ┃ ┣ ...
┃ ┃ ┣ ...
┃ ┣ ...
┣ 92_manifest_clean_dev.json
┣ 92_manifest_clean_test.json
┣ 92_manifest_clean_train.json
┣ ...
┣ {Speaker_ID}_manifest_{SNR_subset}_{dataset_split}.json
┣ ...
┣ books_bandwidth.tsv
┣ LICENSE.txt
┣ readers_books_clean.txt
┣ readers_books_other.txt
┣ README.txt
```

## KiSing

Download the official KiSing dataset [here](http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/). The file structure looks like below:
19 changes: 19 additions & 0 deletions egs/datasets/docker.md
@@ -0,0 +1,19 @@
# Mount dataset in Docker container

When using Docker to run Amphion, the dataset needs to be mounted into the container first. It is recommended to mount the dataset to `/mnt/<dataset_name>` in the container, where `<dataset_name>` is the name of the dataset.

When configuring the dataset in `exp_config.json`, you should use the path `/mnt/<dataset_name>` as the dataset path instead of the actual path on your host machine. Otherwise, the dataset will not be found in the container.

## Mount Example

```bash
docker run --runtime=nvidia --gpus all -it -v .:/app -v <dataset_path1>:/mnt/<dataset_name1> -v <dataset_path2>:/mnt/<dataset_name2> amphion
```

For example, if you want to use the `LJSpeech` dataset, you can mount the dataset to `/mnt/LJSpeech` in the container.

```bash
docker run --runtime=nvidia --gpus all -it -v .:/app -v /home/username/datasets/LJSpeech:/mnt/LJSpeech amphion
```

If you want to use multiple datasets, you can mount them to different directories in the container by adding more `-v` options.
32 changes: 29 additions & 3 deletions egs/tts/FastSpeech2/README.md
@@ -83,6 +83,11 @@ sh egs/tts/FastSpeech2/run.sh --stage 2 --name [YourExptName]

## 4. Inference

### Pre-trained Fastspeech 2 and HiFi-GAN Download

We have released a pre-trained Amphion [Fastspeech 2](https://huggingface.co/amphion/fastspeech2_ljspeech) model and a [HiFi-GAN](https://huggingface.co/amphion/hifigan_ljspeech) vocoder, both trained on LJSpeech. You can download them and generate speech following the inference instructions below.
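
One possible way to fetch both checkpoints is to clone the Hugging Face repositories; the target directories below are an assumed layout, not a requirement of the recipe, chosen to match the `--vocoder_dir` used in the commands later in this section:

```bash
# Assumed local directories; any location works as long as the inference flags point to it
git lfs install
git clone https://huggingface.co/amphion/fastspeech2_ljspeech ckpts/tts/fastspeech2_ljspeech
git clone https://huggingface.co/amphion/hifigan_ljspeech ckpts/vocoder/hifigan_ljspeech
```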


### Configuration

For inference, you need to specify the following configurations when running `run.sh`:
@@ -96,6 +101,8 @@ For inference, you need to specify the following configurations when running `ru
| `--infer_dataset` | The dataset used for inference. | For LJSpeech dataset, the inference dataset would be `LJSpeech`. |
| `--infer_testing_set` | The subset of the inference dataset used for inference, e.g., train, test, golden_test | For LJSpeech dataset, the testing set would be the "`test`" split created from LJSpeech at feature extraction, or "`golden_test`", cherry-picked from the test set as a template testing set. |
| `--infer_text` | The text to be synthesized. | "`This is a clip of generated speech with the given text from a TTS model.`" |
| `--vocoder_dir` | The directory for the vocoder. | "`ckpts/vocoder/hifigan_ljspeech`" |


### Run
For example, if you want to generate speech of all testing set split from LJSpeech, just run:
@@ -106,7 +113,8 @@ sh egs/tts/FastSpeech2/run.sh --stage 3 \
--infer_output_dir ckpts/tts/[YourExptName]/result \
--infer_mode "batch" \
--infer_dataset "LJSpeech" \
--infer_testing_set "test"
--infer_testing_set "test" \
--vocoder_dir ckpts/vocoder/hifigan_ljspeech/checkpoints
```

Or, if you want to generate a single clip of speech from a given text, just run:
@@ -116,10 +124,28 @@ sh egs/tts/FastSpeech2/run.sh --stage 3 \
--infer_expt_dir ckpts/tts/[YourExptName] \
--infer_output_dir ckpts/tts/[YourExptName]/result \
--infer_mode "single" \
--infer_text "This is a clip of generated speech with the given text from a TTS model."
--infer_text "This is a clip of generated speech with the given text from a TTS model." \
--vocoder_dir ckpts/vocoder/hifigan_ljspeech
```

### Issues and Solutions

```
NotImplementedError: Using RTX 3090 or 4000 series doesn't support faster communication broadband via P2P or IB. Please set `NCCL_P2P_DISABLE="1"` and `NCCL_IB_DISABLE="1" or use `accelerate launch` which will do this automatically.
2024-02-24 10:57:49 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
```
The error message indicates an incompatibility of NVIDIA RTX 3090 and 4000 series GPUs with peer-to-peer (P2P) communication and InfiniBand (IB) for faster multi-GPU communication. This incompatibility is surfaced by the `accelerate` library, which facilitates distributed training and inference.

To fix this issue, before running your script, you can set the environment variables in your terminal:
```
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
```
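
Alternatively, the same variables can be prefixed to a single command so they apply to that run only, e.g. with the batch-inference command from above:

```bash
# One-off form: the variables are set only for this invocation
NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 sh egs/tts/FastSpeech2/run.sh --stage 3 --name [YourExptName] ...
```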

### Note
Extensive logging messages related to `torch._subclasses.fake_tensor` and `torch._dynamo.output_graph` may be observed during inference. No effective way to suppress these logs has been found yet, but they do not affect the inference results.


A pre-trained FastSpeech2 model trained on LJSpeech has been released (see the download links above), so you can download it and generate speech following the above inference instructions.


```bibtex
