Add HuggingFace and Openxlab gradio demos for SVC, TTA and TTS (#42)

* update demo badges and links * add svc and tta pretrained models badges
open-mmlab · Dec 19, 2023 · d76cf2d · d76cf2d
1 parent c0a1958
commit d76cf2d
Show file tree

Hide file tree

Showing 5 changed files with 18 additions and 4 deletions.
diff --git a/README.md b/README.md
@@ -3,6 +3,7 @@
 <div>
     <a href="https://arxiv.org/abs/2312.09911"><img src="https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg"></a>
     <a href="https://huggingface.co/amphion"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Amphion-pink"></a>
+    <a href="https://openxlab.org.cn/usercenter/Amphion"><img src="https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg"></a>
     <a href="egs/tts/README.md"><img src="https://img.shields.io/badge/README-TTS-blue"></a>
     <a href="egs/svc/README.md"><img src="https://img.shields.io/badge/README-SVC-blue"></a>
     <a href="egs/tta/README.md"><img src="https://img.shields.io/badge/README-TTA-blue"></a>
@@ -62,7 +63,7 @@ Here is the Amphion v0.1 demo, whose voice, audio effects, and singing voice are
     - Flow-based vocoders: [WaveGlow](https://arxiv.org/abs/1811.00002).
     - Diffusion-based vocoders: [Diffwave](https://arxiv.org/abs/2009.09761).
     - Auto-regressive based vocoders: [WaveNet](https://arxiv.org/abs/1609.03499), [WaveRNN](https://arxiv.org/abs/1802.08435v1).
-- Amphion provides the official implementation of [Multi-Scale Constant-Q Transform Discriminator](https://arxiv.org/abs/2311.14957). It can be used to enhance any architecture GAN-based vocoders during training, and keep the inference stage (such as memory or speed) unchanged. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2311.14957) [![code](https://img.shields.io/badge/README-Code-red)](egs/vocoder/gan/tfr_enhanced_hifigan)
+- Amphion provides the official implementation of [Multi-Scale Constant-Q Transform Discriminator](https://arxiv.org/abs/2311.14957) (our ICASSP 2024 paper). It can be used to enhance any architecture GAN-based vocoders during training, and keep the inference stage (such as memory or speed) unchanged. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2311.14957) [![code](https://img.shields.io/badge/README-Code-red)](egs/vocoder/gan/tfr_enhanced_hifigan)
 
 ### Evaluation
 

diff --git a/egs/svc/MultipleContentsSVC/README.md b/egs/svc/MultipleContentsSVC/README.md
@@ -2,6 +2,9 @@
 
 [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2310.11160)
 [![demo](https://img.shields.io/badge/SVC-Demo-red)](https://www.zhangxueyao.com/data/MultipleContentsSVC/index.html)
+[![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Models-pink)](https://huggingface.co/amphion/singing_voice_conversion)
+[![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Spaces-yellow)](https://huggingface.co/spaces/amphion/singing_voice_conversion)
+[![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/singing_voice_conversion)
 
 <br>
 <div align="center">

diff --git a/egs/tta/RECIPE.md b/egs/tta/RECIPE.md
@@ -1,5 +1,11 @@
 # Text-to-Audio with Latent Diffusion Model
 
+[![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2304.00830)
+[![demo](https://img.shields.io/badge/SVC-Demo-red)](https://audit-demo.github.io/)
+[![model](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Models-pink)](https://huggingface.co/amphion/text_to_audio)
+[![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Spaces-yellow)](https://huggingface.co/spaces/amphion/Text-to-Audio)
+[![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/Text-to-Audio)
+
 This is the quicktour for training a text-to-audio model with the popular and powerful generative model: [Latent Diffusion Model](https://arxiv.org/abs/2112.10752). Specially, this recipe is also the official implementation of the text-to-audio generation part of our NeurIPS 2023 paper "[AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models](https://arxiv.org/abs/2304.00830)". You can check the last part of [AUDIT demos](https://audit-demo.github.io/) to see same text-to-audio examples.
 
 <br>

diff --git a/egs/tts/NaturalSpeech2/README.md b/egs/tts/NaturalSpeech2/README.md
@@ -1,12 +1,14 @@
 # NaturalSpeech2 Recipe
 
+[![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Spaces-yellow)](https://huggingface.co/spaces/amphion/NaturalSpeech2)
+
 In this recipe, we will show how to train [NaturalSpeech2](https://arxiv.org/abs/2304.09116) using Amphion's infrastructure. NaturalSpeech2 is a zero-shot TTS architecture that predicts latent representations of a neural audio codec.
 
 There are three stages in total:
 
 1. Data processing
-3. Training
-4. Inference
+2. Training
+3. Inference
 
 > **NOTE:** You need to run every command of this recipe in the `Amphion` root path:
 > ```bash

diff --git a/egs/tts/VITS/README.md b/egs/tts/VITS/README.md
@@ -1,6 +1,8 @@
-
 # VITS Recipe
 
+[![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Spaces-yellow)](https://huggingface.co/spaces/amphion/Text-to-Speech)
+[![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/Text-to-Speech)
+
 In this recipe, we will show how to train [VITS](https://arxiv.org/abs/2106.06103) using Amphion's infrastructure. VITS is an end-to-end TTS architecture that utilizes conditional variational autoencoder with adversarial learning.
 
 There are four stages in total: