TGUI: add support for XTTSv2 local streaming (including sentences streaming) #208
base: main
Conversation
- Three modes for streaming: "off" (behaves as before), "whole" (waits for the full LLM output before TTS streaming), and "sentences" (streams sentences as the LLM output is being generated)
- Sentence streaming depends on TGUI `output_modifier_stream` (PR under review at the moment)
- Streaming is not available (not implemented) when the narrator is enabled
- Streaming is only available for XTTSv2 local (as before)
- TGUI "stream" mode is enabled when sentence streaming is activated
- TTS server:
  - API extended with a "streaming_set" endpoint (off/whole/sentences)
  - Extended with a "stream" endpoint, which renders the WAV chunks as they are generated
  - When streaming is on, a background task streams the WAV chunks (see also tts_stream.py); when streaming finalizes, the audio is written to an output WAV file, as before
- An XTTSv2 PR will soon follow on HF for the `inference_stream_text` API (implemented locally), but the extension still works without it (falls back to `inference_stream`)
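The "sentences" mode above hinges on turning an incremental token stream into complete sentences that can be handed to TTS early. A minimal sketch of that idea follows; the PR itself uses pysbd for segmentation, so the naive regex splitter and the `SentenceStreamer` name here are purely illustrative stand-ins, not the extension's actual code:

```python
import re

# Accumulate LLM output deltas and emit complete sentences as soon as they
# appear. A naive regex boundary stands in for pysbd so the sketch stays
# self-contained; all names are illustrative.
class SentenceStreamer:
    _BOUNDARY = re.compile(r'(?<=[.!?])\s+')

    def __init__(self):
        self._buffer = ""

    def feed(self, delta: str):
        """Add newly generated text; return any sentences completed so far."""
        self._buffer += delta
        parts = self._BOUNDARY.split(self._buffer)
        # The last part may still be mid-sentence, so keep it buffered.
        self._buffer = parts.pop()
        return parts

    def flush(self):
        """Return whatever is left once the LLM finishes generating."""
        leftover, self._buffer = self._buffer.strip(), ""
        return [leftover] if leftover else []
```

Each completed sentence returned by `feed` could then be queued for XTTSv2 streaming while the LLM keeps generating, which is what makes the mode feel real-time.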
TTS PR added meanwhile: coqui-ai/TTS#3724
Hi @czuzu

That's pretty awesome! So if it's just merged into AT right now, it would just work? (Once Text-gen-webui have merged in their changes, I'm guessing.) I may, however, wait for TGWUI to import the PR on their side before I push this live.

I'm currently somewhere around 60% through a re-write of AT to make it a version 2. There are quite a lot of new features in it, such as more control over the API settings, extra TTS engines, etc. I'm trying to make it more modular in some ways, ensuring the code is clear to work with. At the moment I'm still working on features and also cleaning up code so that, across the board, if you have to work in different bits of script, the variables all have the same names.

As it goes, working on some aspects of streaming was on my list of things to look into! It's possible you've bumped me further along the route to getting a beta version out there, which is great! :)

So I'll be happy to pull in this PR if it's just going to be a case of it working out of the box. I can then have a play and try to figure out what's doing what. Though, if I have a beta of AllTalk out in a few weeks' time and I can't quite get things integrated into it, would you be willing to take a look?

Thanks for the PR and your work on this! It's really great! :)
Hello @erew123,

Thanks for the props, it was quite the adventure 🍺

To test it locally: once you cherry-pick the TGUI PR it should just work 🤞 (note that I've also added pysbd as a requirement). Regarding the merge here, I agree it makes more sense to first wait for the TGUI part to be accepted.

Regarding the beta: hit me up if you need help integrating this there. I can't promise I'll be able to, depending also on when it's out, but I'll try. Hopefully it will work out of the box on your side; if not and I need to fix something, let me know.

Later edit: I've also pushed the fix I mentioned for not correctly handling the WAV files, so make sure to take that one too. With that, I think the PR is open for review.
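A likely reason WAV handling is fiddly when streaming (the fix mentioned above) is that a WAV header records the total frame count, so a well-formed file can only be finalized once the stream ends. The sketch below illustrates that constraint using Python's standard `wave` module; the function name, sample rate, and sample format are assumptions for illustration, not the PR's actual implementation:

```python
import io
import wave

# While streaming, raw PCM chunks go to the client as they are generated;
# only once the stream finalizes are they assembled into a well-formed WAV
# (the header needs the total frame count, so it cannot be written up front).
# 24 kHz mono 16-bit is an illustrative choice of output format.
def finalize_wav(pcm_chunks, sample_rate=24000):
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(sample_rate)
        for chunk in pcm_chunks:
            wav.writeframes(chunk)
    return buf.getvalue()
```

Writing to a `BytesIO` first and dumping the result to disk at the end sidesteps having to patch the header of a partially written file.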
@czuzu Awesome! Thanks! I'll wait on TGWUI and see what happens there.

As for AT v2, it's not like the new code base is all suddenly different, so your code should be a near-enough simple import, perhaps with a few variable name changes. The API within AT v2 will deal with "is this TTS engine capable of streaming or not?", so your code wouldn't have to worry about that type of issue. For any engines that aren't streaming-capable, the API calls will tell people "I'm not doing that", and that will link all the way back to things like the Text-gen-webui interface, which would remove the streaming selection option when someone selects a model that isn't streaming-capable. So you wouldn't have to worry about those kinds of outlier situations.

As I say though, there's a lot of rough code in there at the moment; I'm working my way through it bit by bit and hope to post a BETA sometime soon.
That's looking good!
Hi @czuzu Just to say I'm keeping an eye on the PR at TGWUI... I've not forgotten your PR.
Hello,
Very nice work with this project! 🔥
(eager to use the fine-tuning part soon 🤓)
I've set up TGUI with the extension locally and modified the setup to allow passing LLM replies to TTS as they're being generated (chat "stream" mode), and to subsequently do real-time TTS (aka "incremental" TTS).
(TGUI PR for reference: oobabooga/text-generation-webui#5994)
This allows for real-time TTS, i.e. XTTSv2 streaming starts as soon as the LLM has generated its first sentence.
Three streaming options in TGUI chat mode: "off", "whole", and "sentences" (as summarized at the top of this conversation).
Changes were made so that the extension remains functional until the TGUI PR is merged; sentence streaming simply won't work until that happens, but the rest should work as before.
Still a bit of a work in progress, as I'm having some slight issues with handling the WAV files, but I'm creating the PR for early review and in case someone is eager to set this up 😎
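The background task that streams WAV chunks while the model is still generating (roughly the role of tts_stream.py in the summary above) can be sketched as a simple producer/consumer. Everything below is an illustrative approximation, not the PR's actual code; the function and helper names are made up:

```python
import queue
import threading

# A worker thread pulls audio chunks from the model's generator and puts
# them on a queue; the HTTP "stream" endpoint would drain the queue and
# yield each chunk to the client as soon as it arrives.
_SENTINEL = None

def stream_in_background(chunk_generator):
    q = queue.Queue()

    def worker():
        for chunk in chunk_generator:
            q.put(chunk)
        q.put(_SENTINEL)  # signal end of stream

    threading.Thread(target=worker, daemon=True).start()
    # Yield chunks as they arrive, blocking until the producer finishes.
    while (chunk := q.get()) is not _SENTINEL:
        yield chunk
```

Decoupling generation from delivery like this is what lets the endpoint start responding before synthesis of the whole reply has finished.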
Extra note: locally I've also modified XTTSv2 a bit to add an `inference_stream_text` method, which is slightly friendlier to text streaming; I'll be creating a PR there too. Nevertheless, the logic checks whether it's available and falls back to the existing API until that PR is merged (hopefully).

Cheers! 🍺🍺