Update readme and more
Plachtaa committed Sep 14, 2024
1 parent 1573815 commit f4c03d2
Showing 31 changed files with 3,164 additions and 94 deletions.
46 changes: 44 additions & 2 deletions README.md
@@ -1,13 +1,55 @@
# Seed-VC
Zero-shot voice conversion trained according to the scheme described in SEED-TTS.
The VC quality is surprisingly good in terms of both audio quality and timbre similarity. We have decided to continue along this path to see how far it can go.

## Installation
```bash
pip install -r requirements.txt
```
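Inference uses the GPU when one is available and otherwise falls back to CPU (see the device selection in `app.py`). A quick way to confirm that PyTorch can see your GPU after installing the requirements:
```bash
python -c "import torch; print(torch.cuda.is_available())"
```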

## Usage
Checkpoints of the latest model release are downloaded automatically the first time you run inference.

Command line inference:
```bash
python inference.py --source <source-wav> \
--target <reference-wav> \
--output <output-dir> \
--diffusion-steps 10 \
--length-adjust 1.0 \
--inference-cfg-rate 0.7 \
--n-quantizers 3
```
where:
- `source` is the path to the speech file to convert to the reference voice
- `target` is the path to the speech file used as the voice reference
- `output` is the path to the output directory
- `diffusion-steps` is the number of diffusion steps to use; the default is 10, use 50~100 for best quality
- `length-adjust` is the length adjustment factor; the default is 1.0, set <1.0 to speed up speech and >1.0 to slow it down
- `inference-cfg-rate` has a subtle influence on the output; the default is 0.7
- `n-quantizers` is the number of FAcodec quantizers to use; the default is 3, and the fewer quantizers used, the less of the source audio's prosody is preserved
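For example, a full-quality run on the example clips referenced in `app.py` might look like the following (the output directory here is arbitrary):
```bash
python inference.py --source examples/source/yae_0.wav \
--target examples/reference/dingzhen_0.wav \
--output ./output \
--diffusion-steps 50 \
--length-adjust 1.0 \
--inference-cfg-rate 0.7 \
--n-quantizers 3
```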

Gradio web interface:
```bash
python app.py
```
Then open your browser and go to `http://localhost:7860/` to use the web interface.

## TODO
- [x] Release code
- [x] Release v0.1 pretrained model: [![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-SeedVC-blue)](https://huggingface.co/Plachta/Seed-VC)
- [x] Huggingface space demo: [![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-blue)](https://huggingface.co/spaces/Plachta/Seed-VC)
- [x] HTML demo page (maybe with comparisons to other VC models): [Demo](https://plachtaa.github.io/seed-vc/)
- [ ] Code for training on custom data
- [ ] Streaming inference
- [ ] Singing voice conversion
- [ ] Noise resiliency for source & reference audio
- [ ] Potential architecture improvements
- [x] U-ViT style skip connections
- [x] Changed input to [FAcodec](https://github.com/Plachtaa/FAcodec) tokens
- [ ] More to be added

## CHANGELOGS
- 2024-09-14:
    - Updated the pretrained model to v0.2, with a smaller size and fewer diffusion steps needed to achieve the same quality, plus an added ability to control prosody preservation
    - Added a command line inference script
    - Added installation and usage instructions
115 changes: 86 additions & 29 deletions app.py
@@ -10,8 +10,8 @@
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

dit_checkpoint_path, dit_config_path = load_custom_model_from_hf("Plachta/Seed-VC",
"DiT_step_315000_seed_v2_online_pruned.pth",
"config_dit_mel_seed.yml")
"DiT_step_298000_seed_uvit_facodec_small_wavenet_pruned.pth",
"config_dit_mel_seed_facodec_small_wavenet.yml")

config = yaml.safe_load(open(dit_config_path, 'r'))
model_params = recursive_munch(config['model_params'])
@@ -31,7 +31,7 @@
from modules.campplus.DTDNN import CAMPPlus

campplus_model = CAMPPlus(feat_dim=80, embedding_size=192)
campplus_model.load_state_dict(torch.load(config['model_params']['style_encoder']['campplus_path']))
campplus_model.load_state_dict(torch.load(config['model_params']['style_encoder']['campplus_path'], map_location='cpu'))
campplus_model.eval()
campplus_model.to(device)

@@ -47,12 +47,25 @@
hift_gen.eval()
hift_gen.to(device)

from modules.cosyvoice_tokenizer.frontend import CosyVoiceFrontEnd

speech_tokenizer_path = load_custom_model_from_hf("Plachta/Seed-VC", "speech_tokenizer_v1.onnx", None)

cosyvoice_frontend = CosyVoiceFrontEnd(speech_tokenizer_model=speech_tokenizer_path,
device='cuda', device_id=0)
speech_tokenizer_type = config['model_params']['speech_tokenizer'].get('type', 'cosyvoice')
if speech_tokenizer_type == 'cosyvoice':
from modules.cosyvoice_tokenizer.frontend import CosyVoiceFrontEnd
speech_tokenizer_path = load_custom_model_from_hf("Plachta/Seed-VC", "speech_tokenizer_v1.onnx", None)
cosyvoice_frontend = CosyVoiceFrontEnd(speech_tokenizer_model=speech_tokenizer_path,
device='cuda', device_id=0)
elif speech_tokenizer_type == 'facodec':
ckpt_path, config_path = load_custom_model_from_hf("Plachta/FAcodec", 'pytorch_model.bin', 'config.yml')

codec_config = yaml.safe_load(open(config_path))
codec_model_params = recursive_munch(codec_config['model_params'])
codec_encoder = build_model(codec_model_params, stage="codec")

ckpt_params = torch.load(ckpt_path, map_location="cpu")

for key in codec_encoder:
codec_encoder[key].load_state_dict(ckpt_params[key], strict=False)
_ = [codec_encoder[key].eval() for key in codec_encoder]
_ = [codec_encoder[key].to(device) for key in codec_encoder]
# Generate mel spectrograms
mel_fn_args = {
"n_fft": config['preprocess_params']['spect_params']['n_fft'],
@@ -70,10 +83,25 @@

@torch.no_grad()
@torch.inference_mode()
def voice_conversion(source, target, diffusion_steps, length_adjust, inference_cfg_rate):
def voice_conversion(source, target, diffusion_steps, length_adjust, inference_cfg_rate, n_quantizers):
# Load audio
source_audio = librosa.load(source, sr=sr)[0]
ref_audio = librosa.load(target, sr=sr)[0]
# source_sr, source_audio = source
# ref_sr, ref_audio = target
# # if any of the inputs has 2 channels, take the first only
# if source_audio.ndim == 2:
# source_audio = source_audio[:, 0]
# if ref_audio.ndim == 2:
# ref_audio = ref_audio[:, 0]
#
# source_audio, ref_audio = source_audio / 32768.0, ref_audio / 32768.0
#
# # if source or audio sr not equal to default sr, resample
# if source_sr != sr:
# source_audio = librosa.resample(source_audio, source_sr, sr)
# if ref_sr != sr:
# ref_audio = librosa.resample(ref_audio, ref_sr, sr)

# Process audio
source_audio = torch.tensor(source_audio[:sr * 30]).unsqueeze(0).float().to(device)
@@ -84,23 +112,42 @@ def voice_conversion(source, target, diffusion_steps, length_adjust, inference_c
ref_waves_16k = torchaudio.functional.resample(ref_audio, sr, 16000)

# Extract features
S_alt = cosyvoice_frontend.extract_speech_token(source_waves_16k)[0]
S_ori = cosyvoice_frontend.extract_speech_token(ref_waves_16k)[0]
if speech_tokenizer_type == 'cosyvoice':
S_alt = cosyvoice_frontend.extract_speech_token(source_waves_16k)[0]
S_ori = cosyvoice_frontend.extract_speech_token(ref_waves_16k)[0]
elif speech_tokenizer_type == 'facodec':
converted_waves_24k = torchaudio.functional.resample(source_audio, sr, 24000)
wave_lengths_24k = torch.LongTensor([converted_waves_24k.size(1)]).to(converted_waves_24k.device)
waves_input = converted_waves_24k.unsqueeze(1)
z = codec_encoder.encoder(waves_input)
(
quantized,
codes
) = codec_encoder.quantizer(
z,
waves_input,
)
S_alt = torch.cat([codes[1], codes[0]], dim=1)

# S_ori should be extracted in the same way
waves_24k = torchaudio.functional.resample(ref_audio, sr, 24000)
waves_input = waves_24k.unsqueeze(1)
z = codec_encoder.encoder(waves_input)
(
quantized,
codes
) = codec_encoder.quantizer(
z,
waves_input,
)
S_ori = torch.cat([codes[1], codes[0]], dim=1)

mel = to_mel(source_audio.to(device).float())
mel2 = to_mel(ref_audio.to(device).float())

target_lengths = torch.LongTensor([int(mel.size(2) * length_adjust)]).to(mel.device)
target2_lengths = torch.LongTensor([mel2.size(2)]).to(mel2.device)

# Style encoding
feat = torchaudio.compliance.kaldi.fbank(source_waves_16k,
num_mel_bins=80,
dither=0,
sample_frequency=16000)
feat = feat - feat.mean(dim=0, keepdim=True)
style1 = campplus_model(feat.unsqueeze(0))

feat2 = torchaudio.compliance.kaldi.fbank(ref_waves_16k,
num_mel_bins=80,
dither=0,
@@ -109,8 +156,8 @@ def voice_conversion(source, target, diffusion_steps, length_adjust, inference_c
style2 = campplus_model(feat2.unsqueeze(0))

# Length regulation
cond = model.length_regulator(S_alt, ylens=target_lengths)[0]
prompt_condition = model.length_regulator(S_ori, ylens=target2_lengths)[0]
cond = model.length_regulator(S_alt, ylens=target_lengths, n_quantizers=int(n_quantizers))[0]
prompt_condition = model.length_regulator(S_ori, ylens=target2_lengths, n_quantizers=int(n_quantizers))[0]
cat_condition = torch.cat([prompt_condition, cond], dim=1)

# Voice Conversion
@@ -121,19 +168,29 @@ def voice_conversion(source, target, diffusion_steps, length_adjust, inference_c
# Convert to waveform
vc_wave = hift_gen.inference(vc_target)

return (sr, vc_wave.squeeze(0).cpu().numpy())
return sr, vc_wave.squeeze(0).cpu().numpy()


if __name__ == "__main__":
description = "Zero-shot voice conversion with in-context learning. Check out our [GitHub repository](https://github.com/Plachtaa/seed-vc) for details and updates."
inputs = [
gr.Audio(source="upload", type="filepath", label="Source Audio"),
gr.Audio(source="upload", type="filepath", label="Reference Audio"),
gr.Slider(minimum=1, maximum=200, value=100, step=1, label="Diffusion Steps"),
gr.Slider(minimum=0.5, maximum=2.0, step=0.1, value=1.0, label="Length Adjust"),
gr.Slider(minimum=0.0, maximum=1.0, step=0.1, value=0.7, label="Inference CFG Rate"),
gr.Audio(type="filepath", label="Source Audio"),
gr.Audio(type="filepath", label="Reference Audio"),
gr.Slider(minimum=1, maximum=200, value=10, step=1, label="Diffusion Steps", info="10 by default, 50~100 for best quality"),
gr.Slider(minimum=0.5, maximum=2.0, step=0.1, value=1.0, label="Length Adjust", info="<1.0 for speed-up speech, >1.0 for slow-down speech"),
gr.Slider(minimum=0.0, maximum=1.0, step=0.1, value=0.7, label="Inference CFG Rate", info="has subtle influence"),
gr.Slider(minimum=1, maximum=3, step=1, value=3, label="N Quantizers", info="the less quantizer used, the less prosody of source audio is preserved"),
]

examples = [["examples/source/yae_0.wav", "examples/reference/dingzhen_0.wav", 50, 1.0, 0.7, 1]]

outputs = gr.Audio(label="Output Audio")

gr.Interface(fn=voice_conversion, description=description, inputs=inputs, outputs=outputs, title="Seed Voice Conversion").launch()
gr.Interface(fn=voice_conversion,
description=description,
inputs=inputs,
outputs=outputs,
title="Seed Voice Conversion",
examples=examples,
cache_examples=False,
).launch()
Binary file removed campplus_cn_common.bin
97 changes: 97 additions & 0 deletions configs/config_dit_mel_seed_facodec_small.yml
@@ -0,0 +1,97 @@
log_dir: "./runs/run_dit_mel_seed_facodec_small"
save_freq: 1
log_interval: 10
save_interval: 1000
device: "cuda"
epochs: 1000 # number of epochs for first stage training (pre-training)
batch_size: 2
batch_length: 100 # maximum duration of audio in a batch (in seconds)
max_len: 80 # maximum number of frames
pretrained_model: ""
pretrained_encoder: ""
load_only_params: False # set to true if you do not want to load epoch numbers and optimizer parameters

F0_path: "modules/JDC/bst.t7"

data_params:
train_data: "./data/train.txt"
val_data: "./data/val.txt"
root_path: "./data/"

preprocess_params:
sr: 22050
spect_params:
n_fft: 1024
win_length: 1024
hop_length: 256
n_mels: 80

model_params:
dit_type: "DiT" # uDiT or DiT
reg_loss_type: "l1" # l1 or l2

speech_tokenizer:
type: 'facodec' # facodec or cosyvoice
path: "checkpoints/speech_tokenizer_v1.onnx"

style_encoder:
dim: 192
campplus_path: "checkpoints/campplus_cn_common.bin"

DAC:
encoder_dim: 64
encoder_rates: [2, 5, 5, 6]
decoder_dim: 1536
decoder_rates: [ 6, 5, 5, 2 ]
sr: 24000

length_regulator:
channels: 512
is_discrete: true
content_codebook_size: 1024
in_frame_rate: 80
out_frame_rate: 80
sampling_ratios: [1, 1, 1, 1]
token_dropout_prob: 0.3 # probability of performing token dropout
token_dropout_range: 1.0 # maximum percentage of tokens to drop out
n_codebooks: 3
quantizer_dropout: 0.5
f0_condition: false
n_f0_bins: 512

DiT:
hidden_dim: 512
num_heads: 8
depth: 13
class_dropout_prob: 0.1
block_size: 8192
in_channels: 80
style_condition: true
final_layer_type: 'wavenet'
target: 'mel' # mel or codec
content_dim: 512
content_codebook_size: 1024
content_type: 'discrete'
f0_condition: true
n_f0_bins: 512
content_codebooks: 1
is_causal: false
long_skip_connection: true
zero_prompt_speech_token: false # for prompt component, do not input corresponding speech token
time_as_token: false
style_as_token: false
uvit_skip_connection: true
add_resblock_in_transformer: false

wavenet:
hidden_dim: 512
num_layers: 8
kernel_size: 5
dilation_rate: 1
p_dropout: 0.2
style_condition: true

loss_params:
base_lr: 0.0001
lambda_mel: 45
lambda_kl: 1.0
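As a rough sketch of how a config like this is consumed (mirroring the `yaml.safe_load` pattern in `app.py`, where the result is additionally wrapped with `recursive_munch` for attribute-style access):
```python
import yaml

# Load the DiT config and inspect a few model hyperparameters,
# following the same loading pattern app.py uses for the downloaded config.
with open("configs/config_dit_mel_seed_facodec_small.yml", "r") as f:
    config = yaml.safe_load(f)

model_params = config["model_params"]
print(model_params["speech_tokenizer"]["type"])               # 'facodec'
print(model_params["DiT"]["depth"])                           # 13
print(config["preprocess_params"]["spect_params"]["n_mels"])  # 80
```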
File renamed without changes.
16 changes: 16 additions & 0 deletions dac/__init__.py
@@ -0,0 +1,16 @@
__version__ = "1.0.0"

# preserved here for legacy reasons
__model_version__ = "latest"

import audiotools

audiotools.ml.BaseModel.INTERN += ["dac.**"]
audiotools.ml.BaseModel.EXTERN += ["einops"]


from . import nn
from . import model
from . import utils
from .model import DAC
from .model import DACFile
36 changes: 36 additions & 0 deletions dac/__main__.py
@@ -0,0 +1,36 @@
import sys

import argbind

from dac.utils import download
from dac.utils.decode import decode
from dac.utils.encode import encode

STAGES = ["encode", "decode", "download"]


def run(stage: str):
"""Run stages.
Parameters
----------
stage : str
Stage to run
"""
if stage not in STAGES:
raise ValueError(f"Unknown command: {stage}. Allowed commands are {STAGES}")
stage_fn = globals()[stage]

if stage == "download":
stage_fn()
return

stage_fn()


if __name__ == "__main__":
group = sys.argv.pop(1)
args = argbind.parse_args(group=group)

with argbind.scope(args):
run(group)
4 changes: 4 additions & 0 deletions dac/model/__init__.py
@@ -0,0 +1,4 @@
from .base import CodecMixin
from .base import DACFile
from .dac import DAC
from .discriminator import Discriminator