
Fix and enable XTTS streaming #478

Open
wants to merge 9 commits into alltalkbeta
Conversation


@SilyNoMeta SilyNoMeta commented Jan 4, 2025

  • Adds the ability to enable streaming in the XTTS settings (disabled on other engines depending on their capabilities)
  • Uses the state of the streaming flag when calling the OpenAI-compatible Speech API
  • Fixes streaming mode

SilyNoMeta and others added 9 commits January 3, 2025 18:32
* adds langdetect as requirement for colab, standalone and textgen
* adds "auto" to the language dropdown in the Advanced Engine/Model Settings panel
* replace the hardcoded "en" by "auto" when called by the OpenAI compatible Speech API
Add initial support for pickletensor models to F5-TTS
@erew123
Owner

erew123 commented Jan 7, 2025

Hi @SilyNoMeta

As you may have noticed, there is a GitHub merge/sequencing issue going on here with the next 4 PRs you sent, all of which seemingly touch tts_server.py. It should be easy enough to sort out, but I am looking more closely at the code changes before I pull anything in. I push everything to a staging area first and then up to alltalkbeta.

That aside, I have two questions for you on this update:

  1. Is there any reason you set the central generate function to pass `None` for the output file name? It doesn't matter much either way; the original code should behave exactly the same as your changes, just in fewer lines (it will add a file name if one exists, but that won't matter for streaming), so it's just more compact. I was wondering whether there was a specific issue you encountered?

text, voice, language, temperature, repetition_penalty, speed, pitch, output_file=None, streaming=True
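For illustration only (the function and names below are hypothetical stand-ins, not AllTalk's actual API), the point about the signature is that a default of `output_file=None` already covers both call paths, so the streaming caller can simply omit it:

```python
# Hypothetical sketch: with output_file defaulting to None, the streaming
# path never needs to pass it explicitly, and the file path still behaves
# exactly as before when a name is supplied.
def generate(text, output_file=None, streaming=False):
    if streaming:
        return f"streaming:{text}"  # output_file is simply unused here
    return f"wrote '{text}' to {output_file or 'outputs/output.wav'}"

print(generate("hello", streaming=True))         # streaming:hello
print(generate("hello", output_file="out.wav"))  # wrote 'hello' to out.wav
```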

  2. I see you are defining a new variable, current_model_engine, for the model_engine class:

alltalk_tts/tts_server.py

Lines 1129 to 1130 in c83faf9

# Load current model engine configuration
current_model_engine = tts_class()

but it's already pulled in as model_engine:

model_engine = tts_class()

Is this because you are attempting to re-load the variables from the actual underlying engine on each run, in case the mapped voice changed? If so, I'm probably going to move this back to an update of model_engine, just to keep the variable names consistent throughout the script. I'm just checking that this is what I think you are doing, or whether there was some other reason/issue you encountered.

Sorry to have to ask, but I like to make sure I understand why the code does certain things. I also have a huge update about 80% done that I will have to merge in after all these new PRs, with quite a few changes to TTS generation and a new RVC pipeline, so I need to be certain about the core functionality of the generate functions.

Thanks

@SilyNoMeta
Author

Hey!
I don't have a lot of time at the moment, but I'll do my best to explain why I made these changes a few days ago.

Most of the changes in tts_server.py and model_engine.py are, as you said, not necessary, so you shouldn't bother merging them if they conflict with your working branches.

What actually happened is that, while trying to enable streaming support through my new settings, I got errors. So I started looking at the code and refactored it in a way that was more readable for me (and perhaps me only 😆). As I had a few more ideas in mind once it was working, I didn't think about reverting it and just carried on, creating a new branch for something else.

The "true" fix ended up being the addition of the StreamingResponse when the new flag was set on the OpenAI Speech API compatible webservice :


alltalk_tts/tts_server.py

Lines 1150 to 1156 in c83faf9

if current_model_engine.streaming_enabled:
    audio_stream = await generate_audio(
        cleaned_string, mapped_voice, "auto", current_model_engine.temperature_set,
        float(str(current_model_engine.repetitionpenalty_set).replace(',', '.')),
        speed, current_model_engine.pitch_set,
        output_file=None, streaming=True
    )
    return StreamingResponse(audio_stream, media_type="audio/wav")
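As a rough, self-contained sketch of what this path does (the generator and its byte contents are stand-ins, not AllTalk's real internals): generate_audio hands back an async generator of WAV byte chunks, and StreamingResponse iterates it, forwarding each chunk to the client as it is produced instead of waiting for the whole file:

```python
# Minimal stand-alone illustration of the streaming pattern, using only
# asyncio. A real StreamingResponse consumes the generator the same way
# the consume() helper below does.
import asyncio

async def generate_audio_stream(text: str):
    # Stand-in for the real engine: yield fixed-size byte chunks.
    fake_wav = b"RIFF" + text.encode() * 4
    for i in range(0, len(fake_wav), 8):
        yield fake_wav[i:i + 8]
        await asyncio.sleep(0)  # hand control back to the event loop

async def consume():
    # StreamingResponse does essentially this internally.
    chunks = []
    async for chunk in generate_audio_stream("hello"):
        chunks.append(chunk)
    return b"".join(chunks)

result = asyncio.run(consume())
print(result[:4])  # b'RIFF'
```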

As for the model engine redefinition: what happened was that when I was playing with the GUI, the newly saved settings were not used when I tested the API.
So I debugged what was in the model_engine variable, and it contained the settings from when I launched the app, not the newly saved ones (I don't know if I'm being clear...).
Honestly, I didn't understand why and didn't look into it very much. Perhaps when we save the new configuration in the GUI, this variable is not updated correctly?
The easy solution for me was to re-load it, as you saw here, but I can't deny it might not be a "production-worthy" fix!


alltalk_tts/tts_server.py

Lines 1129 to 1130 in c83faf9

# Load current model engine configuration
current_model_engine = tts_class()
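A minimal sketch of the stale-settings behaviour described above (the config file, its format, and TtsEngine are all invented for illustration; the stand-in simply reads its settings once, at construction time, the way a module-level instance would):

```python
# Why a module-level engine instance can hold stale settings: if the class
# reads the saved config in __init__, an instance created at app launch
# never sees values saved later (e.g. from the GUI). Re-instantiating
# re-reads the file, which is what current_model_engine = tts_class() does.
import json
import os
import tempfile

fd, CONFIG_PATH = tempfile.mkstemp(suffix=".json")
os.close(fd)

class TtsEngine:
    """Stand-in for tts_class: reads the saved config once, at construction."""
    def __init__(self):
        with open(CONFIG_PATH) as f:
            self.streaming_enabled = json.load(f)["streaming_enabled"]

# At app launch the config says streaming is off.
with open(CONFIG_PATH, "w") as f:
    json.dump({"streaming_enabled": False}, f)
model_engine = TtsEngine()  # module-level instance, created once

# Later, the GUI saves new settings to disk.
with open(CONFIG_PATH, "w") as f:
    json.dump({"streaming_enabled": True}, f)

print(model_engine.streaming_enabled)          # False: instance is stale
current_model_engine = TtsEngine()             # re-instantiating re-reads the file
print(current_model_engine.streaming_enabled)  # True: fresh settings
os.remove(CONFIG_PATH)
```

The cleaner long-term fix is probably what erew123 suggests: refresh the fields on the existing model_engine instance rather than constructing a second one.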

Good luck with your work! I'm hyped now!! 🍿
