Architecture of Supervoice #1
Hey, I am mostly reproducing Meta's VoiceBox paper. There is no LLM yet; it is just a transformer that translates phonemes to sound, plus a duration model that predicts the number of audio frames per phoneme. An LLM might appear later to emit phonemes and durations.
I am interested in the duration predictor. FastSpeech's duration predictor is quite naive and cannot model expressive prosody. I would prefer an autoregressive duration predictor with Gaussian upsampling for expressive, natural-sounding speech.
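For anyone unfamiliar with the idea, here is a minimal numpy sketch of Gaussian upsampling: instead of hard-repeating each phoneme vector `d_i` times, every output frame is a softmax-weighted mix of all phonemes, weighted by distance to each phoneme's center. The function name and the fixed `sigma` are illustrative, not from any specific codebase:

```python
import numpy as np

def gaussian_upsample(phoneme_feats, durations, sigma=1.0):
    """Expand per-phoneme features to frame level with soft Gaussian weights.

    phoneme_feats: (N, D) array of phoneme encodings.
    durations: (N,) predicted frame counts per phoneme.
    sigma: spread of each Gaussian; larger values blur phoneme boundaries.
    """
    ends = np.cumsum(durations)            # cumulative end frame of each phoneme
    centers = ends - durations / 2.0       # center frame of each phoneme
    total_frames = int(ends[-1])
    t = np.arange(total_frames) + 0.5      # timestamp of each output frame
    # (T, N) squared distance from each frame to each phoneme center
    dist2 = (t[:, None] - centers[None, :]) ** 2
    logits = -dist2 / (2.0 * sigma ** 2)
    # softmax over phonemes: each frame is a convex mix of phoneme features
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ phoneme_feats               # (T, D) frame-level features
```

Because the weights are differentiable in the durations, this expansion can be trained end-to-end, unlike hard repetition.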
I also didn't like the duration predictor, but I blame my dataset, or the data being too simple to train on. I feel that some kind of context is needed to properly train the duration network.
Completely agreed. I think NaturalSpeech 2's duration predictor, which takes a prompt and does cross-attention from text features to prompt features, is one of the good ways to predict duration, since it considers both the voice and prosody from the prompt and the linguistic features from the text.
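To make the mechanism concrete, here is a single-head cross-attention sketch in numpy: text positions act as queries and attend over prompt frames, so each phoneme can pick up prosody cues from the reference audio. The projection matrices stand in for learned weights and are purely illustrative (this is not NaturalSpeech 2's actual code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prompt_cross_attention(text_feats, prompt_feats, wq, wk, wv):
    """Single-head cross-attention: text queries attend over prompt frames.

    text_feats:   (T_text, D) linguistic features (queries).
    prompt_feats: (T_prompt, D) acoustic features of the reference prompt.
    wq, wk, wv:   (D, D) projection matrices (stand-ins for learned weights).
    Returns (T_text, D) text features enriched with prompt prosody.
    """
    q = text_feats @ wq
    k = prompt_feats @ wk
    v = prompt_feats @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product attention
    return softmax(scores, axis=-1) @ v
```

A duration head on top of the returned features then sees both linguistic content and the prompt's speaking style.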
Nice paper! It confirms my feeling that these models are the future, but we want to adjust more and more features. Honestly, I am playing around with vocoders right now. I have tested Vocos and HiFi-GAN (training both from scratch), and only HiFi-GAN works well for me; I am also trying to upsample from 16 kHz to 24 kHz in such vocoders. All the papers are confusing, since they claim to outperform HiFi-GAN, but in my tests HiFi-GAN converges reliably and outperforms the other models.
I have a lot of experience with vocoders; I have implemented approximately all of the good GAN-based vocoders, and HiFi-GAN-v1 and UnivNet are the best I have encountered. Another vocoder, Fre-GAN, sometimes performed equal to or better than HiFi-GAN, but it depends on the data. Some vocoders are noise-robust, some generalize better, some perform well with large datasets, some with small ones, and some are good for fine-tuning. Overall, on average, HiFi-GAN-v1 and UnivNet are the best; Vocos is good, but only when trained on a high volume of diverse data.
I just tried Vocos and it turns a crisp voice into a dull one. This is exactly the effect I am trying to avoid. My current goal is to raise the bar for quality, and I think the first low-hanging fruit is to make the voice crisp first, then natural. Have you tried this one? https://github.com/sony/bigvsan Their demo page is weird, but they trained it further, and I just tested it and it performed really well.
BigVSAN and BigVGAN are both good, but I'm not sure whether they are crisper, because I have also struggled a lot to find crisp vocoders.
I have tested BigVSAN and I am really impressed. They are also the only team that published weights trained for 10M iterations instead of 1M, so I am using them now, and I have prepared a nice repo to make them easier to use: https://github.com/ex3ndr/supervoice-vocoder You can see how nice its quality is:
https://github.com/ex3ndr/supervoice/blob/1bb4a32f0628afd57e909257bb0be29362c9fdc2/supervoice/model.py#L24 |
Hi @ex3ndr, I checked your latest commit on the duration predictor. Have you trained the duration predictor?
In the process here: https://github.com/ex3ndr/supervoice-gpt |
I am also planning to implement the same.
Are you treating phoneme duration as a classification task? Phoneme durations are discrete values, not continuous, and more or less they range between 0 and 50 at most.
I treat them as normal tokens, with durations 0-100. For some reason it feels too fast somehow; I don't understand why. Do you have a similar experience?
If you use standard tokens and predict a token, you treat it as a classification task, which I also support. It should be fast to learn, because I don't think it is a complicated task for the model.
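For clarity, the classification framing discussed above boils down to one token class per frame count, trained with cross-entropy. A minimal numpy sketch, using the 0-100 range mentioned earlier (names like `MAX_DUR` and `duration_cross_entropy` are made up for illustration):

```python
import numpy as np

MAX_DUR = 100  # duration vocabulary: one class per frame count 0..100

def duration_to_token(duration_frames):
    """Clamp a frame count into the duration vocabulary."""
    return int(min(max(duration_frames, 0), MAX_DUR))

def duration_cross_entropy(logits, target_frames):
    """Classification loss over duration tokens.

    logits: (N, MAX_DUR + 1) per-phoneme scores from the predictor.
    target_frames: length-N sequence of ground-truth durations in frames.
    """
    targets = np.array([duration_to_token(d) for d in target_frames])
    # numerically stable log-softmax over the duration vocabulary
    m = logits.max(axis=1, keepdims=True)
    log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    # mean negative log-likelihood of the correct duration token
    return float(-log_probs[np.arange(len(targets)), targets].mean())
```

At inference you would take the argmax (or sample) over the 101 classes to get an integer frame count directly, with no rounding step as in regression-based predictors.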
No, I mean the phonemes feel too fast (short) compared to human-generated ones. I feel that something is missing here.
Yes, when you predict duration with a duration predictor, it almost always comes out fast; only in some cases does it come out normal. One way to tackle this problem is to use an MoE-based duration predictor, as in this paper: https://arxiv.org/pdf/2107.02530.pdf
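To show the general shape of the idea, here is a generic mixture-of-experts duration regression in numpy: a gating network picks a soft mixture over several expert heads, so different experts can specialize in different speaking rates. This is only a sketch of the MoE pattern under assumed names and shapes, not the exact formulation of the linked paper:

```python
import numpy as np

def moe_duration(phoneme_feat, gate_w, expert_ws):
    """Mixture-of-experts duration regression for one phoneme (illustrative).

    phoneme_feat: (D,) encoded feature for one phoneme.
    gate_w:       (D, K) gating weights over K experts.
    expert_ws:    list of K (D,) linear regression weights, one per expert.
    Returns the gate-weighted predicted duration in frames.
    """
    logits = phoneme_feat @ gate_w
    g = np.exp(logits - logits.max())
    g /= g.sum()                               # softmax gate over experts
    preds = np.array([phoneme_feat @ w for w in expert_ws])
    return float(g @ preds)                    # mixture of expert predictions
```

The intuition for the "too fast" problem: a single regression head tends to predict the mean duration, which skews short, while a gated mixture can commit to a slower expert when the context calls for it.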
Interesting, but I am not convinced:
It is just weirdly slow. I might need to avoid running the two sequences in parallel and instead switch between duration and phoneme prediction, to make duration dependent on the phoneme...
Yes. |
@ex3ndr Samples sound decent 👍🏽
Some initial feedback:
Otherwise, the voice sounds exactly like a human, with a very natural flow. Amazing job 👍🏽.
Hi, I just saw your repo, and I am a bit confused about the architecture and philosophy behind your TTS model. Could you please add a little detail about the architecture? It looks like you are training an LLM for TTS but also training a separate duration model, which seems new, since most large TTS models rely on the autoregressive model itself for duration.
Although, I will go through your code and try to figure it out myself.