
Architecture of Supervoice #1

Open
rishikksh20 opened this issue Feb 12, 2024 · 23 comments

@rishikksh20

Hi, I just saw your repo and I'm a bit confused about the architecture and philosophy behind your TTS model. Could you please add a little detail about the architecture? It looks like you are training an LLM for TTS, but you also train a separate duration model, which seems new, since most large TTS models rely on the autoregressive model itself for duration.

That said, I will go through your code and try to figure it out myself.

@ex3ndr

ex3ndr commented Feb 12, 2024

Hey, I am mostly reproducing the VoiceBox paper from Meta. There is no LLM right now; it is just a transformer that translates phonemes to sound, plus a duration model that predicts the number of audio frames for each phoneme. An LLM might appear later to emit phonemes and durations.
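Roughly, the split looks like this (just a sketch to illustrate the idea; the names, shapes and layer counts are illustrative, not the actual supervoice code):

```python
import torch.nn as nn

class DurationPredictor(nn.Module):
    # Predicts how many audio frames each phoneme should span.
    def __init__(self, n_phonemes: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.proj = nn.Linear(dim, 1)

    def forward(self, phonemes):               # (B, N) phoneme ids
        x = self.encoder(self.embed(phonemes))
        return self.proj(x).squeeze(-1)         # (B, N) predicted durations in frames

class AcousticModel(nn.Module):
    # Maps duration-expanded phonemes to mel frames. VoiceBox does this with
    # flow matching; a plain regression head stands in for it here.
    def __init__(self, n_phonemes: int, dim: int = 256, n_mels: int = 100):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=4)
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, expanded_phonemes):       # (B, T) one phoneme id per frame
        return self.to_mel(self.encoder(self.embed(expanded_phonemes)))  # (B, T, n_mels)
```

At inference the duration model runs first, each phoneme id is repeated for its predicted number of frames, and the expanded sequence goes through the acoustic model and then a vocoder.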

@rishikksh20

rishikksh20 commented Feb 13, 2024

I am interested in the duration predictor. The FastSpeech duration predictor is quite naive and cannot model expressive prosody; I would prefer an autoregressive duration predictor with Gaussian upsampling for expressive, natural-sounding speech.
Do you have any thoughts on duration prediction, or have you done any experiments on it?
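For reference, Gaussian upsampling spreads each phoneme's hidden vector over the output frames with a Gaussian centred at the phoneme's midpoint, so length regulation stays soft and differentiable. A minimal sketch (the shapes and the fixed sigma are assumptions):

```python
import torch

def gaussian_upsample(h, durations, sigma=1.0):
    """
    h:         (B, N, D) phoneme-level hidden states
    durations: (B, N)    per-phoneme durations in frames
    returns:   (B, T, D) frame-level features, T = max total duration in the batch
    """
    ends = torch.cumsum(durations, dim=1)                # (B, N) cumulative end frame
    centers = ends - durations / 2                       # (B, N) phoneme midpoints
    T = int(durations.sum(dim=1).max().item())
    t = torch.arange(T, device=h.device).float() + 0.5   # (T,) output frame positions
    # Gaussian weight of each phoneme for each output frame, normalized over phonemes
    dist = (t[None, :, None] - centers[:, None, :]) ** 2 # (B, T, N)
    w = torch.softmax(-dist / (2 * sigma ** 2), dim=-1)  # (B, T, N)
    return torch.bmm(w, h)                               # (B, T, D)
```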

rishikksh20 reopened this Feb 13, 2024
@ex3ndr

ex3ndr commented Feb 13, 2024

I also wasn't happy with the duration predictor, but I suspect my dataset is to blame, or the data is too simple to train on. I feel that some kind of context is needed to properly train a duration network.

@rishikksh20

rishikksh20 commented Feb 13, 2024

Completely agreed. I think NaturalSpeech 2's duration predictor, which takes a prompt and applies cross-attention from text features to prompt features, is one of the better ways to predict duration, since it considers the input voice and prosody from the prompt together with the linguistic features from the text.
A new paper from Microsoft, https://arxiv.org/pdf/2402.07383.pdf, is also based on a VoiceBox-like architecture and uses the same duration predictor.
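A minimal sketch of that idea, where each text position cross-attends into the prompt features (the dimensions and layer counts are my assumptions, not NaturalSpeech 2's exact configuration):

```python
import torch.nn as nn

class PromptedDurationPredictor(nn.Module):
    # Duration predictor that conditions text features on a speech prompt
    # via cross-attention, roughly in the spirit of NaturalSpeech 2.
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, 1)

    def forward(self, text_feats, prompt_feats):
        # text_feats:   (B, N, D) phoneme/text features
        # prompt_feats: (B, M, D) features of the reference speech prompt
        x = self.text_encoder(text_feats)
        # each text position attends into the prompt to pick up prosody/speaker cues
        ctx, _ = self.cross_attn(query=x, key=prompt_feats, value=prompt_feats)
        return self.proj(x + ctx).squeeze(-1)   # (B, N) predicted durations
```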

@ex3ndr

ex3ndr commented Feb 13, 2024

Nice paper! It confirms my feeling that these models are the future, though we keep wanting to adjust more and more features. Honestly, I am playing around with vocoders right now: I have tested Vocos and HiFi-GAN (training both from scratch) and only HiFi-GAN works well for me. I am also trying to upsample from 16 kHz to 24 kHz inside such vocoders. The papers are confusing, since they all claim to outperform HiFi-GAN, but in my tests HiFi-GAN converges reliably and outperforms the other models.

@rishikksh20

rishikksh20 commented Feb 13, 2024

I have a great deal of experience with vocoders; I have implemented roughly every good GAN-based vocoder, and HiFi-GAN v1 and UnivNet are the best I have encountered. Another vocoder, Fre-GAN, sometimes performs equal to or better than HiFi-GAN, but it depends on the data. Some vocoders are noise-robust, some generalize better, some perform well with a lot of data, some with little data, and some are good for fine-tuning. Overall, on average, HiFi-GAN v1 and UnivNet are the best; Vocos is good, but only when trained on a high volume of diverse data.
I prefer to use this: https://github.com/rishikksh20/iSTFTNet-pytorch as it converges easily and trains on a small amount of data, though it sometimes produces mid-frequency tonal lines that hurt quality; otherwise it is as good as HiFi-GAN v1 and 2x faster. For your use case, HiFi-GAN v1 will be best.

@ex3ndr

ex3ndr commented Feb 13, 2024

I just tried Vocos and it turns a crisp voice into a dull one, which is exactly the effect I am trying to avoid. My current goal is to raise the bar for quality, and I think the first low-hanging fruit is to make the voice crisp first, then natural.

Have you tried this one? https://github.com/sony/bigvsan Their demo page is weird, but they trained it further, and when I tested it just now it performed really well.

@rishikksh20

rishikksh20 commented Feb 13, 2024

BigVSAN and BigVGAN are both good, but I'm not sure whether they are crisper; I have also struggled a lot to find crisp vocoders.

@ex3ndr

ex3ndr commented Feb 14, 2024

I have tested BigVSAN and I am really impressed. They are also the only team that published weights trained for 10M iterations instead of 1M, so I am using it now, and I have prepared a repo that makes it easier to use: https://github.com/ex3ndr/supervoice-vocoder

You can hear how good its quality is:
Source: https://github.com/ex3ndr/supervoice-vocoder/blob/master/sample.wav
Resynthesized: https://github.com/ex3ndr/supervoice-vocoder/blob/master/resynth.wav

@rishikksh20

rishikksh20 commented Feb 19, 2024

https://github.com/ex3ndr/supervoice/blob/1bb4a32f0628afd57e909257bb0be29362c9fdc2/supervoice/model.py#L24
Please update this line to use your vocoder, as the model_vocoder file is not in the repo.

@rishikksh20

Hi @ex3ndr, I checked your latest commit on the duration predictor. Have you trained it yet?

@ex3ndr

ex3ndr commented Feb 26, 2024

It is in progress here: https://github.com/ex3ndr/supervoice-gpt
It is a phonemizer and duration model in one.

@rishikksh20

I am also planning to implement the same thing.

@rishikksh20

Are you treating phoneme duration as a classification task? Phoneme durations are discrete values, not continuous, and they more or less range from 0 to about 50 at most.

@ex3ndr

ex3ndr commented Feb 26, 2024

I treat them as normal tokens, with durations from 0 to 100.
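Concretely, something like this (a sketch; the bucket range and the cross-entropy loss are what I would assume for 0-100 duration tokens):

```python
import torch
import torch.nn.functional as F

MAX_DUR = 100   # duration buckets 0..100, one class per possible frame count

def duration_to_token(frames: torch.Tensor) -> torch.Tensor:
    # Clamp frame counts into the bucket range so each duration is a single class.
    return frames.clamp(0, MAX_DUR).long()

def duration_loss(logits: torch.Tensor, target_frames: torch.Tensor) -> torch.Tensor:
    # logits: (B, N, MAX_DUR + 1) per-phoneme class scores from the model
    # target_frames: (B, N) ground-truth durations in frames
    targets = duration_to_token(target_frames)
    return F.cross_entropy(logits.transpose(1, 2), targets)
```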

For some reason the output feels too fast somehow, and I don't understand why. Have you had a similar experience?

@rishikksh20

If you use standard tokens and predict a token, you are treating it as a classification task, which I also support. It should be fast, because I don't think it is a complicated task for the model.
My thought is that we should pass a voice prompt along with the text for prosody modeling, since duration is part of prosody.

@ex3ndr

ex3ndr commented Feb 27, 2024

No, I mean the predicted phonemes feel too fast (short) compared to human-generated ones. I feel that something is missing here.

@rishikksh20

Yes, when you predict duration with a duration predictor it almost always comes out fast no matter what; only in some cases does it come out normal. One way to tackle this problem is to use an MoE-based duration predictor, as in this paper: https://arxiv.org/pdf/2107.02530.pdf.
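Roughly, an MoE duration head mixes several small duration experts with a learned gate, so different speaking rates or styles can be handled by different experts. A sketch of the general idea (not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class MoEDurationHead(nn.Module):
    # Several duration experts mixed by a learned gate; each expert can
    # specialize in a different speaking rate or style.
    def __init__(self, dim: int = 256, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, 1) for _ in range(n_experts)])
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, x):                                         # x: (B, N, D)
        weights = torch.softmax(self.gate(x), dim=-1)             # (B, N, E) gate weights
        preds = torch.cat([e(x) for e in self.experts], dim=-1)   # (B, N, E) expert outputs
        return (weights * preds).sum(dim=-1)                      # (B, N) mixed durations
```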

@ex3ndr

ex3ndr commented Feb 27, 2024

Interesting, but I am not convinced:

  1. The GPT learns the full distribution, not only the optimal one.
  2. The GPT samples durations rather than predicting a single value.
  3. The GPT also inserts durations between words, and those are sampled too.

The output is just weirdly fast: I multiply the durations by 1.1 to 1.2 and it works better, which is doubly weird because the audio model is trained on 12.5 ms tokens while the GPT uses 10 ms ones, so the GPT's durations should already come out longer.

I might need to avoid generating the two sequences in parallel and instead alternate between duration and phoneme prediction, so that each duration depends on its phoneme...
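Something like this interleaving, so each duration token is predicted right after, and conditioned on, its phoneme token (the token id layout is just an assumption, not the actual supervoice-gpt format):

```python
# Interleave phonemes and durations into one token stream, so each duration
# is conditioned on the phoneme it belongs to.
PHONEME_VOCAB = 256                          # phoneme token ids: 0..255

def duration_token(frames: int) -> int:
    return PHONEME_VOCAB + min(frames, 100)  # duration ids start after phoneme ids

def interleave(phonemes: list[int], durations: list[int]) -> list[int]:
    seq = []
    for p, d in zip(phonemes, durations):
        seq.append(p)                        # emit the phoneme first...
        seq.append(duration_token(d))        # ...then its duration right after it
    return seq
```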

@rishikksh20

I might need to avoid generating the two sequences in parallel and instead alternate between duration and phoneme prediction, so that each duration depends on its phoneme...

Yes.

@rishikksh20

@ex3ndr The samples sound decent 👍🏽

@rishikksh20

Some initial feedback:

  • There is an issue with special characters like "-": for example, it takes a long pause between "open" and "source" when pronouncing "open-source".
  • There is an issue pronouncing abbreviations like HTML, CEO, etc.

Otherwise, the voice sounds exactly like a human, with a very natural flow. Amazing job 👍🏽.
Maybe training a bigger model on a greater variety of data will help overcome the above issues.
