Hello community! I’m an ML engineer, but I have no experience with audio/speech synthesis. Could somebody please explain how this works as if I were five? (Please, no papers.)
Is there a vocoder (a mel-to-audio converter) and a synthesizer (a mel generator, e.g., a GAN or diffusion model)? Is converting audio to a mel spectrogram just a simple Python script, or does that step also involve an AI model? I understand the project is based on Bert-VITS2, but I'm unfamiliar with that architecture.
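To clarify what I mean by "just a simple Python script": my assumption is that mel extraction is a deterministic signal-processing step, roughly like the librosa sketch below. The parameter values (n_fft, hop_length, n_mels) are placeholders, not this repo's actual settings.

```python
import librosa

# Assumption: mel extraction is plain DSP with no learned model involved.
# All parameter values here are illustrative placeholders.
y, sr = librosa.load("sample.wav", sr=22050)            # waveform at a fixed sample rate
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80   # STFT + mel filterbank
)
log_mel = librosa.power_to_db(mel)                      # log-compress for the model
print(log_mel.shape)                                    # (n_mels, n_frames)
```

Is that roughly what happens here, or is an AI model also involved in that step?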
LLMs generate tokens one by one (autoregressively), somewhat like a person thinking out loud, so why is this a non-autoregressive model? Does it generate everything at once?
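To show where my confusion is, here is my rough mental model of the two approaches. The method names (predict_next, predict_durations, predict_all_frames) are hypothetical placeholders, not this repo's API.

```python
# Hypothetical sketch of my mental model; none of these methods come from this repo.

def generate_autoregressive(model, prompt, max_steps):
    # Each new token is conditioned on everything generated so far.
    tokens = list(prompt)
    for _ in range(max_steps):
        next_token = model.predict_next(tokens)            # one token per step
        tokens.append(next_token)
    return tokens

def generate_non_autoregressive(model, text):
    # The whole output is predicted in parallel, typically after first
    # predicting how many frames each input unit should occupy.
    durations = model.predict_durations(text)               # e.g., frames per phoneme
    frames = model.predict_all_frames(text, durations)      # all mel frames in one pass
    return frames
```

Is the second sketch closer to what this model does?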
Thank you!