
Architecture of Supervoice #1

Open
rishikksh20 opened this issue Feb 12, 2024 · 23 comments

@rishikksh20

Hi, I just saw your repo and I'm a bit confused about the architecture and philosophy behind your TTS model. Could you please add a little detail about the architecture? It looks like you are training an LLM for TTS, but you also train a separate duration model, which seems new, since most large TTS models rely on the autoregressive model itself for duration.

That said, I will go through your code and try to figure it out myself.

@ex3ndr

ex3ndr commented Feb 12, 2024

Hey, I am mostly reproducing the VoiceBox paper from Meta. There is no LLM right now; it is just a transformer that translates phonemes to sound, plus a duration model that predicts the number of audio frames for each phoneme. An LLM might appear later to emit phonemes and durations.
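Roughly, the split looks like this (just a sketch to illustrate the idea; the names, shapes and layer counts are illustrative, not the actual supervoice code):

```python
import torch.nn as nn

class DurationPredictor(nn.Module):
    # Predicts how many audio frames each phoneme should span.
    def __init__(self, n_phonemes: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.proj = nn.Linear(dim, 1)

    def forward(self, phonemes):               # (B, N) phoneme ids
        x = self.encoder(self.embed(phonemes))
        return self.proj(x).squeeze(-1)         # (B, N) predicted durations in frames

class AcousticModel(nn.Module):
    # Maps duration-expanded phonemes to mel frames. VoiceBox does this with
    # flow matching; a plain regression head stands in for it here.
    def __init__(self, n_phonemes: int, dim: int = 256, n_mels: int = 100):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=4)
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, expanded_phonemes):       # (B, T) one phoneme id per frame
        return self.to_mel(self.encoder(self.embed(expanded_phonemes)))  # (B, T, n_mels)
```

At inference the duration model runs first, each phoneme id is repeated for its predicted number of frames, and the expanded sequence goes through the acoustic model and then a vocoder.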

@rishikksh20

rishikksh20 commented Feb 13, 2024

I am interested in the duration predictor. The FastSpeech duration predictor is quite naive and cannot model expressive prosody; I would prefer an autoregressive duration predictor with Gaussian upsampling for expressive, natural-sounding speech.
Do you have any thoughts on duration prediction, or have you done any experiments on it?
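For reference, Gaussian upsampling spreads each phoneme's hidden vector over the output frames with a Gaussian centred at the phoneme's midpoint, so length regulation stays soft and differentiable. A minimal sketch (the shapes and the fixed sigma are assumptions):

```python
import torch

def gaussian_upsample(h, durations, sigma=1.0):
    """
    h:         (B, N, D) phoneme-level hidden states
    durations: (B, N)    per-phoneme durations in frames
    returns:   (B, T, D) frame-level features, T = max total duration in the batch
    """
    ends = torch.cumsum(durations, dim=1)                # (B, N) cumulative end frame
    centers = ends - durations / 2                       # (B, N) phoneme midpoints
    T = int(durations.sum(dim=1).max().item())
    t = torch.arange(T, device=h.device).float() + 0.5   # (T,) output frame positions
    # Gaussian weight of each phoneme for each output frame, normalized over phonemes
    dist = (t[None, :, None] - centers[:, None, :]) ** 2 # (B, T, N)
    w = torch.softmax(-dist / (2 * sigma ** 2), dim=-1)  # (B, T, N)
    return torch.bmm(w, h)                               # (B, T, D)
```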

rishikksh20 reopened this Feb 13, 2024
@ex3ndr

ex3ndr commented Feb 13, 2024

I also wasn't happy with the duration predictor, but I suspect my dataset is to blame, or the data is too simple to train on. I feel that some kind of context is needed to properly train a duration network.

@rishikksh20

rishikksh20 commented Feb 13, 2024

Completely agreed. I think NaturalSpeech 2's duration predictor, which takes a prompt and applies cross-attention from text features to prompt features, is one of the better ways to predict duration, since it considers the input voice and prosody from the prompt together with the linguistic features from the text.
A new paper from Microsoft, https://arxiv.org/pdf/2402.07383.pdf, is also based on a VoiceBox-like architecture and uses the same duration predictor.
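A minimal sketch of that idea, where each text position cross-attends into the prompt features (the dimensions and layer counts are my assumptions, not NaturalSpeech 2's exact configuration):

```python
import torch.nn as nn

class PromptedDurationPredictor(nn.Module):
    # Duration predictor that conditions text features on a speech prompt
    # via cross-attention, roughly in the spirit of NaturalSpeech 2.
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, 1)

    def forward(self, text_feats, prompt_feats):
        # text_feats:   (B, N, D) phoneme/text features
        # prompt_feats: (B, M, D) features of the reference speech prompt
        x = self.text_encoder(text_feats)
        # each text position attends into the prompt to pick up prosody/speaker cues
        ctx, _ = self.cross_attn(query=x, key=prompt_feats, value=prompt_feats)
        return self.proj(x + ctx).squeeze(-1)   # (B, N) predicted durations
```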

@ex3ndr

ex3ndr commented Feb 13, 2024

Nice paper! It confirms my feeling that these models are the future, though we keep wanting to adjust more and more features. Honestly, I am playing around with vocoders right now: I have tested Vocos and HiFi-GAN (training both from scratch) and only HiFi-GAN works well for me. I am also trying to upsample from 16 kHz to 24 kHz inside such vocoders. The papers are confusing, since they all claim to outperform HiFi-GAN, but in my tests HiFi-GAN converges reliably and outperforms the other models.

@rishikksh20

rishikksh20 commented Feb 13, 2024

I have a great deal of experience with vocoders; I have implemented roughly every good GAN-based vocoder, and HiFi-GAN v1 and UnivNet are the best I have encountered. Another vocoder, Fre-GAN, sometimes performs equal to or better than HiFi-GAN, but it depends on the data. Some vocoders are noise-robust, some generalize better, some perform well with a lot of data, some with little data, and some are good for fine-tuning. Overall, on average, HiFi-GAN v1 and UnivNet are the best; Vocos is good, but only when trained on a high volume of diverse data.
I prefer to use this: https://github.com/rishikksh20/iSTFTNet-pytorch as it converges easily and trains on a small amount of data, though it sometimes produces mid-frequency tonal lines that hurt quality; otherwise it is as good as HiFi-GAN v1 and 2x faster. For your use case, HiFi-GAN v1 will be best.

@ex3ndr

ex3ndr commented Feb 13, 2024

I just tried Vocos and it turns a crisp voice into a dull one, which is exactly the effect I am trying to avoid. My current goal is to raise the bar for quality, and I think the first low-hanging fruit is to make the voice crisp first, then natural.

Have you tried this one? https://github.com/sony/bigvsan Their demo page is weird, but they trained it further, and when I tested it just now it performed really well.

@rishikksh20

rishikksh20 commented Feb 13, 2024

BigVSAN and BigVGAN are both good, but I'm not sure whether they are crisper; I have also struggled a lot to find crisp vocoders.

@ex3ndr

ex3ndr commented Feb 14, 2024

I have tested BigVSAN and I am really impressed. They are also the only team that published weights trained for 10M iterations instead of 1M, so I am using it now, and I have prepared a repo that makes it easier to use: https://github.com/ex3ndr/supervoice-vocoder

You can hear how good its quality is:
Source: https://github.com/ex3ndr/supervoice-vocoder/blob/master/sample.wav
Resynthesized: https://github.com/ex3ndr/supervoice-vocoder/blob/master/resynth.wav

@rishikksh20

rishikksh20 commented Feb 19, 2024

https://github.com/ex3ndr/supervoice/blob/1bb4a32f0628afd57e909257bb0be29362c9fdc2/supervoice/model.py#L24
Please update this line to use your vocoder, as the model_vocoder file is not in the repo.

@rishikksh20

Hi @ex3ndr, I checked your latest commit on the duration predictor. Have you trained it yet?

@ex3ndr

ex3ndr commented Feb 26, 2024

It is in progress here: https://github.com/ex3ndr/supervoice-gpt
It is a phonemizer and duration model in one.

@rishikksh20

I am also planning to implement the same thing.

@rishikksh20

Are you treating phoneme duration as a classification task? Phoneme durations are discrete values, not continuous, and they more or less range from 0 to about 50 at most.

@ex3ndr

ex3ndr commented Feb 26, 2024

I treat them as normal tokens, with durations from 0 to 100.
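Concretely, something like this (a sketch; the bucket range and the cross-entropy loss are what I would assume for 0-100 duration tokens):

```python
import torch
import torch.nn.functional as F

MAX_DUR = 100   # duration buckets 0..100, one class per possible frame count

def duration_to_token(frames: torch.Tensor) -> torch.Tensor:
    # Clamp frame counts into the bucket range so each duration is a single class.
    return frames.clamp(0, MAX_DUR).long()

def duration_loss(logits: torch.Tensor, target_frames: torch.Tensor) -> torch.Tensor:
    # logits: (B, N, MAX_DUR + 1) per-phoneme class scores from the model
    # target_frames: (B, N) ground-truth durations in frames
    targets = duration_to_token(target_frames)
    return F.cross_entropy(logits.transpose(1, 2), targets)
```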

For some reason the output feels too fast somehow, and I don't understand why. Have you had a similar experience?

@rishikksh20

If you use standard tokens and predict a token, you are treating it as a classification task, which I also support. It should be fast, because I don't think it is a complicated task for the model.
My thought is that we should pass a voice prompt along with the text for prosody modeling, since duration is part of prosody.

@ex3ndr

ex3ndr commented Feb 27, 2024

No, I mean the predicted phonemes feel too fast (short) compared to human-generated ones. I feel that something is missing here.

@rishikksh20

Yes, when you predict duration with a duration predictor it almost always comes out fast no matter what; only in some cases does it come out normal. One way to tackle this problem is to use an MoE-based duration predictor, as in this paper: https://arxiv.org/pdf/2107.02530.pdf.
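Roughly, an MoE duration head mixes several small duration experts with a learned gate, so different speaking rates or styles can be handled by different experts. A sketch of the general idea (not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class MoEDurationHead(nn.Module):
    # Several duration experts mixed by a learned gate; each expert can
    # specialize in a different speaking rate or style.
    def __init__(self, dim: int = 256, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, 1) for _ in range(n_experts)])
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, x):                                         # x: (B, N, D)
        weights = torch.softmax(self.gate(x), dim=-1)             # (B, N, E) gate weights
        preds = torch.cat([e(x) for e in self.experts], dim=-1)   # (B, N, E) expert outputs
        return (weights * preds).sum(dim=-1)                      # (B, N) mixed durations
```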

@ex3ndr

ex3ndr commented Feb 27, 2024

Interesting, but I am not convinced:

  1. The GPT learns the full distribution, not only the optimal one.
  2. The GPT samples durations rather than predicting a single value.
  3. The GPT also inserts durations between words, and those are sampled too.

The output is just weirdly fast: I multiply the durations by 1.1 to 1.2 and it works better, which is doubly weird because the audio model is trained on 12.5 ms tokens while the GPT uses 10 ms ones, so the GPT's durations should already come out longer.

I might need to avoid generating the two sequences in parallel and instead alternate between duration and phoneme prediction, so that each duration depends on its phoneme...
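Something like this interleaving, so each duration token is predicted right after, and conditioned on, its phoneme token (the token id layout is just an assumption, not the actual supervoice-gpt format):

```python
# Interleave phonemes and durations into one token stream, so each duration
# is conditioned on the phoneme it belongs to.
PHONEME_VOCAB = 256                          # phoneme token ids: 0..255

def duration_token(frames: int) -> int:
    return PHONEME_VOCAB + min(frames, 100)  # duration ids start after phoneme ids

def interleave(phonemes: list[int], durations: list[int]) -> list[int]:
    seq = []
    for p, d in zip(phonemes, durations):
        seq.append(p)                        # emit the phoneme first...
        seq.append(duration_token(d))        # ...then its duration right after it
    return seq
```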

@rishikksh20

I might need to avoid generating the two sequences in parallel and instead alternate between duration and phoneme prediction, so that each duration depends on its phoneme...

Yes.

@rishikksh20

@ex3ndr The samples sound decent 👍🏽

@rishikksh20

Some initial feedback:

  • There is an issue with special characters like "-": for example, it takes a long pause between "open" and "source" when pronouncing "open-source".
  • There is an issue pronouncing abbreviations like HTML, CEO, etc.

Otherwise, the voice sounds exactly like a human, with a very natural flow. Amazing job 👍🏽.
Maybe training a bigger model on a greater variety of data will help overcome the above issues.
