TGUI: add support for XTTSv2 local streaming (including sentences streaming) #208
base: main
Conversation
- Three modes for streaming: "off" (behaves as before), "whole" (waits for the full LLM output before TTS streaming), and "sentences" (streams sentences as the LLM output is being generated)
- Sentence streaming depends on TGUI `output_modifier_stream` (PR under review at the moment)
- Streaming is not available (not implemented) when the narrator is enabled
- Streaming is only available for XTTSv2 local (as before)
- TGUI "stream" mode is enabled when sentence streaming is activated
- TTS server:
  - API extended with a "streaming_set" endpoint (off/whole/sentences)
  - Extended with a "stream" endpoint, which renders the WAV chunks as they are generated
  - When streaming is on, a background task streams the WAV chunks (see also tts_stream.py); when streaming finalizes, the audio is written to an output WAV file, as before
- An XTTSv2 PR will soon follow on HF for the `inference_stream_text` API (implemented locally), but the extension still works without it (falls back to `inference_stream`)
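The "sentences" mode above hinges on turning an incremental token stream into complete sentences that can be handed to TTS early. A minimal sketch of that idea follows; the PR itself uses pysbd for segmentation, so the naive regex splitter and the `SentenceStreamer` name here are purely illustrative stand-ins, not the extension's actual code:

```python
import re

# Accumulate LLM output deltas and emit complete sentences as soon as they
# appear. A naive regex boundary stands in for pysbd so the sketch stays
# self-contained; all names are illustrative.
class SentenceStreamer:
    _BOUNDARY = re.compile(r'(?<=[.!?])\s+')

    def __init__(self):
        self._buffer = ""

    def feed(self, delta: str):
        """Add newly generated text; return any sentences completed so far."""
        self._buffer += delta
        parts = self._BOUNDARY.split(self._buffer)
        # The last part may still be mid-sentence, so keep it buffered.
        self._buffer = parts.pop()
        return parts

    def flush(self):
        """Return whatever is left once the LLM finishes generating."""
        leftover, self._buffer = self._buffer.strip(), ""
        return [leftover] if leftover else []
```

Each completed sentence returned by `feed` could then be queued for XTTSv2 streaming while the LLM keeps generating, which is what makes the mode feel real-time.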
TTS PR added meanwhile: coqui-ai/TTS#3724
Hi @czuzu

That's pretty awesome! So if it's just merged into AT right now, it would just work? (Once Text-gen-webui have merged in their changes, I'm guessing.) I may, however, wait for TGWUI to import the PR on their side before I push this live.

I'm currently somewhere around 60% through a re-write of AT to make it a version 2. There are quite a lot of new features in it, such as more control over the API settings, extra TTS engines, etc. I'm trying to make it more modular in some ways, ensuring the code is clear to work with. At the moment I'm still working on features and also cleaning up code so that, across the board, if you have to work in different bits of script, the variables all have the same names.

As it goes, working on some aspects of streaming was on my list of things to look into! It's possible you've bumped me further along the route to getting a beta version out there, which is great! :)

So I'll be happy to pull in this PR if it's just going to be a case of it working out of the box. I can then have a play and try to figure out what's doing what. Though, if I have a beta of AllTalk out in a few weeks' time and I can't quite get things integrated into it, would you be willing to take a look?

Thanks for the PR and your work on this! It's really great! :)
Hello @erew123,

Thanks for the props, it was quite the adventure 🍺

To test it locally: once you cherry-pick the TGUI PR it should just work 🤞 (note that I've also added pysbd as a requirement). Regarding the merge here, I agree it makes more sense to first wait for the TGUI part to be accepted.

Regarding the beta: hit me up if you need help integrating this there. I can't promise I'll be able to, depending also on when it's out, but I'll try. Hopefully it will work out of the box on your side; if not and I need to fix something, let me know.

Later edit: I've also pushed the fix I mentioned for not correctly handling the WAV files, so make sure to take that one too. With that, I think the PR is open for review.
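A likely reason WAV handling is fiddly when streaming (the fix mentioned above) is that a WAV header records the total frame count, so a well-formed file can only be finalized once the stream ends. The sketch below illustrates that constraint using Python's standard `wave` module; the function name, sample rate, and sample format are assumptions for illustration, not the PR's actual implementation:

```python
import io
import wave

# While streaming, raw PCM chunks go to the client as they are generated;
# only once the stream finalizes are they assembled into a well-formed WAV
# (the header needs the total frame count, so it cannot be written up front).
# 24 kHz mono 16-bit is an illustrative choice of output format.
def finalize_wav(pcm_chunks, sample_rate=24000):
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(sample_rate)
        for chunk in pcm_chunks:
            wav.writeframes(chunk)
    return buf.getvalue()
```

Writing to a `BytesIO` first and dumping the result to disk at the end sidesteps having to patch the header of a partially written file.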
@czuzu Awesome! Thanks! I'll wait on TGWUI and see what happens there.

As for AT v2, it's not like the new code base is all suddenly different, so your code should be a near-enough simple import, perhaps with a few variable name changes. The API within AT v2 will deal with "is this TTS engine capable of streaming or not?", so your code wouldn't have to worry about that type of issue. For any engines that aren't streaming-capable, the API calls will tell people "I'm not doing that", and that will link all the way back to things like the Text-gen-webui interface, which would remove the streaming selection option when someone selects a model that isn't streaming-capable. So you wouldn't have to worry about those kinds of outlier situations.

As I say though, there's a lot of rough code in there at the moment; I'm working my way through it bit by bit and hope to post a BETA sometime soon.
That's looking good!
Hi @czuzu Just to say I'm keeping an eye on the PR at TGWUI... I've not forgotten your PR.
Hello,
Very nice work with this project! 🔥
(eager to use the fine-tuning part soon 🤓)
I've set up TGUI with the extension locally and modified the setup to allow passing LLM replies to TTS as they're being generated (chat "stream" mode), and to subsequently do real-time TTS (aka "incremental" TTS).
(TGUI PR for reference: oobabooga/text-generation-webui#5994)
This allows for real-time TTS, i.e. XTTSv2 streaming starts as soon as the LLM has generated its first sentence.
Three streaming options in TGUI chat mode: "off", "whole", and "sentences" (as summarized at the top of this conversation).
Changes were made so that the extension remains functional until the TGUI PR is merged; sentence streaming simply won't work until that happens, but the rest should work as before.
Still a bit of a work in progress, as I'm having some slight issues with handling the WAV files, but I'm creating the PR for early review and in case someone is eager to set this up 😎
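The background task that streams WAV chunks while the model is still generating (roughly the role of tts_stream.py in the summary above) can be sketched as a simple producer/consumer. Everything below is an illustrative approximation, not the PR's actual code; the function and helper names are made up:

```python
import queue
import threading

# A worker thread pulls audio chunks from the model's generator and puts
# them on a queue; the HTTP "stream" endpoint would drain the queue and
# yield each chunk to the client as soon as it arrives.
_SENTINEL = None

def stream_in_background(chunk_generator):
    q = queue.Queue()

    def worker():
        for chunk in chunk_generator:
            q.put(chunk)
        q.put(_SENTINEL)  # signal end of stream

    threading.Thread(target=worker, daemon=True).start()
    # Yield chunks as they arrive, blocking until the producer finishes.
    while (chunk := q.get()) is not _SENTINEL:
        yield chunk
```

Decoupling generation from delivery like this is what lets the endpoint start responding before synthesis of the whole reply has finished.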
Extra note: locally I've also modified XTTSv2 a bit to add an `inference_stream_text` method, which is slightly friendlier to text streaming; I'll be creating a PR there too. Nevertheless, the logic checks whether it's available and falls back to the existing API until that PR is merged (hopefully).

Cheers! 🍺🍺