TGUI: add support for XTTSv2 local streaming (including sentences streaming) #208

Open
wants to merge 4 commits into
base: main


czuzu

@czuzu czuzu commented May 8, 2024

Hello,

Very nice work with this project! 🔥
(eager to use the fine-tuning part soon 🤓)

I've set up TGUI with the extension locally and modified the setup to pass LLM replies along as they're being generated (chat "stream" mode), and subsequently to do real-time TTS (aka "incremental" TTS).

(TGUI PR for reference: oobabooga/text-generation-webui#5994)

This allows for real-time TTS, i.e. XTTSv2 streaming starts as soon as the LLM has generated its first sentence.

3 streaming options in TGUI chat mode:

  • "off" - TTS streaming is off, as before
  • "whole" - first wait for whole LLM text, then generate TTS with XTTSv2 streaming mode (inference_stream)
  • "sentences" - as soon as the first LLM sentence is complete, XTTSv2 starts TTS generation in streaming mode, and subsequent sentences are added incrementally; e.g. if the LLM's reply is 1000 words long, you'll start hearing TTS (almost) as soon as the first sentence is complete
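
To illustrate the "sentences" mode, here is a minimal sketch of turning streamed LLM text into complete sentences as they become available. It is not the PR's code: the regex boundary is a naive stand-in for a real segmenter (the PR lists pysbd as a requirement), and `stream_sentences` and `llm_chunks` are hypothetical names used only for this example.

```python
import re
from typing import Iterable, Iterator

# Naive sentence boundary: '.', '!' or '?' followed by whitespace.
# Assumption: a stand-in only; a real segmenter handles abbreviations etc.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def stream_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as the streamed text completes it."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _BOUNDARY.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever is left once the LLM stops generating.
    if buffer.strip():
        yield buffer.strip()


# Hypothetical LLM output arriving in arbitrary pieces.
llm_chunks = ["Hello th", "ere. How are", " you today? I am", " fine"]
sentences = list(stream_sentences(llm_chunks))
print(sentences)  # ['Hello there.', 'How are you today?', 'I am fine']
```

Each yielded sentence could be handed to the TTS engine immediately, which is why audio can start well before a long reply has finished generating.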

The changes were made so that the extension remains functional until the TGUI PR is merged: sentence streaming simply won't work until then, but everything else should work as before.

Still a bit of a work in progress, with some slight issues handling the WAV files, but I'm creating the PR for early review and in case someone is eager to set this up 😎

Extra note: locally I've also modified XTTSv2 a bit to add an inference_stream_text method, which is slightly friendlier to text streaming; I'll be creating a PR there too. Nevertheless, the logic checks whether it's available and falls back to the existing API until that PR is merged (hopefully).
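
The availability check and fallback described above might look roughly like the sketch below. The two model classes are hypothetical stubs, not the XTTSv2 classes, and their methods take only text here; the real methods take additional arguments that are omitted for brevity.

```python
from typing import Iterable, Iterator


class OldModel:
    """Hypothetical stub exposing only inference_stream."""

    def inference_stream(self, text: str) -> Iterator[bytes]:
        yield f"audio<{text}>".encode()


class NewModel(OldModel):
    """Hypothetical stub that also offers inference_stream_text."""

    def inference_stream_text(self, sentences: Iterable[str]) -> Iterator[bytes]:
        for sentence in sentences:
            yield f"audio-text<{sentence}>".encode()


def synthesize(model: object, sentences: Iterable[str]) -> Iterator[bytes]:
    """Prefer inference_stream_text; otherwise call inference_stream per sentence."""
    stream_text = getattr(model, "inference_stream_text", None)
    if callable(stream_text):
        # Newer API: hand over the whole sentence stream at once.
        yield from stream_text(sentences)
        return
    # Fallback: the existing API, invoked once for every sentence.
    for sentence in sentences:
        yield from model.inference_stream(sentence)


print(list(synthesize(OldModel(), ["Hi.", "Bye."])))
print(list(synthesize(NewModel(), ["Hi."])))
```

Checking with `getattr` at call time means the same extension code runs against either version of the model without a version pin.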

Cheers! 🍺🍺

- 3 modes for streaming: "off" (behaves as before), "whole" (waits for full
  LLM output before TTS streaming) and "sentences" (streams sentences as
  LLM output is generating)
- Sentences streaming depends on TGUI `output_modifier_stream` (PR
  under review atm)
- Streaming not available (not implemented) for when narrator is enabled
- Streaming only available for XTTSv2 local (as before)
- TGUI "stream" mode enabled when sentences-streaming is activated
- TTS server:
    - API extended with "streaming_set" endpoint (off/whole/sentences)
    - Extended with "stream" endpoint - this renders the WAV chunks as
      they're generated
    - When streaming is on, starts a background task to stream the WAV
      chunks - see also tts_stream.py; when streaming finalizes, it's
      written to an output WAV file, as before
    - XTTSv2 PR soon to follow on HF for inference_stream_text API
      (implemented locally), but still works without it (falls back to
      inference_stream)
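
As a rough illustration of the background-task idea in the notes above (not the contents of tts_stream.py), the sketch below pushes audio chunks through a queue from a worker thread and writes them out as a single WAV once the stream ends. The chunk source, the 24 kHz rate, the 16-bit mono format and all function names are assumptions made for this example.

```python
import io
import queue
import threading
import wave

SAMPLE_RATE = 24000  # assumed output rate; adjust to the model's real rate
_DONE = object()     # sentinel marking the end of the stream


def fake_tts_chunks():
    """Hypothetical chunk source: three blocks of 16-bit mono silence."""
    for _ in range(3):
        yield b"\x00\x00" * 1200


def producer(chunks, out_queue):
    """Background task: push PCM chunks to the queue as they are generated."""
    for chunk in chunks:
        out_queue.put(chunk)
    out_queue.put(_DONE)


def consume_and_save(in_queue, target):
    """Collect chunks as they arrive, then write one WAV when the stream ends."""
    collected = []
    while True:
        chunk = in_queue.get()
        if chunk is _DONE:
            break
        # A real server would also forward the chunk to the client here.
        collected.append(chunk)
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)   # mono (assumed)
        wav.setsampwidth(2)   # 16-bit samples (assumed)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(b"".join(collected))
    return len(collected)


chunk_queue = queue.Queue()
worker = threading.Thread(target=producer, args=(fake_tts_chunks(), chunk_queue))
worker.start()
wav_buffer = io.BytesIO()  # stands in for the output WAV file
chunk_count = consume_and_save(chunk_queue, wav_buffer)
worker.join()
print(chunk_count, "chunks written")
```

Writing the file only after the sentinel arrives keeps the WAV header consistent with the final audio length, since the total size is unknown while chunks are still being produced.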
@czuzu czuzu marked this pull request as draft May 8, 2024 13:33
@czuzu
Author

czuzu commented May 8, 2024

TTS PR added meanwhile: coqui-ai/TTS#3724

@erew123
Owner

erew123 commented May 8, 2024

Hi @czuzu

That's pretty awesome! So if it's just merged into AT right now, it would just work (once Text-gen-webui have merged in their changes, I'm guessing)? I may, however, wait for TGWUI to merge the PR on their side before I push this live.

I'm currently somewhere around 60% of the way through a rewrite of AT to make a version 2 of it. There are quite a lot of new features in it, such as more control over the API settings, extra TTS engines, etc. I'm trying to make it more modular in some ways and ensure the code is clear to work with. At the moment I'm still working on features and also cleaning up code so that, across the board, if you have to work in different bits of script, the variables all have the same names, etc.

As it happens, working on some aspects of the streaming was on my list of things to look into! It's possible that you might have helped bump me further along the route to getting a beta version out there, which is great! :)

So I'll be happy to pull in this PR if it's just going to be a case of it just working. I can then have a play and try to figure out what's doing what. Though, if I have a beta of AllTalk out in a few weeks' time and I can't quite get things integrated into the beta version, would you be willing to take a look?

Thanks for the PR and your work on this! It's really great! :)

@czuzu czuzu changed the title from "[WIP] TGUI: add support for XTTSv2 local streaming (including sentences streaming)" to "TGUI: add support for XTTSv2 local streaming (including sentences streaming)" May 8, 2024
@czuzu czuzu marked this pull request as ready for review May 8, 2024 16:06
@czuzu
Author

czuzu commented May 8, 2024

Hello @erew123,

Thanks for the props, it was quite the adventure 🍺

To test it locally: once you cherry-pick the TGUI PR, it should just work 🤞 (note that I also added pysbd as a requirement). Regarding the merge here, I agree it makes more sense to first wait for the TGUI part to be accepted.

Regarding the beta: hit me up if you need help integrating this there. I can't promise I'll be able to, depending also on when it's out, but I'll try.

Hopefully it will work out of the box on your side, if not and I need to fix something, let me know.

Later edit: I've also pushed the fix I mentioned for the incorrect handling of the WAV files, so make sure to take that one too. With that, I think the PR is open for review.

@erew123
Owner

erew123 commented May 8, 2024

@czuzu Awesome! Thanks! I'll wait on TGWUI and see what happens there.

As far as AT v2 goes, it's not like the new code base is suddenly all different, so your code should be a near enough simple import, perhaps with a few variable name changes.

The API within AT v2 will deal with "Is this TTS engine capable of streaming or not?", so your code won't have to worry about that type of issue. For any engines that aren't streaming-capable, the API calls will tell people "I'm not doing that", and that will link all the way back to things like the Text-Gen-webui interface, which would remove the streaming option when someone selects a model that isn't streaming-capable. So you wouldn't have to worry about those kinds of outlier situations.
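
A capability check of this kind could be sketched as follows. This is purely illustrative and is not AT v2's API: `EngineInfo`, `available_streaming_modes`, `set_streaming` and the engine names are all hypothetical, and only the three mode names come from this PR.

```python
from dataclasses import dataclass
from typing import Tuple

# The three modes proposed in this PR.
STREAMING_MODES = ("off", "whole", "sentences")


@dataclass(frozen=True)
class EngineInfo:
    """Hypothetical description of a TTS engine and what it supports."""
    name: str
    streaming_capable: bool


def available_streaming_modes(engine: EngineInfo) -> Tuple[str, ...]:
    """Modes a UI could offer: everything if capable, otherwise only 'off'."""
    if engine.streaming_capable:
        return STREAMING_MODES
    return ("off",)


def set_streaming(engine: EngineInfo, mode: str) -> str:
    """Reject a mode the selected engine cannot serve."""
    if mode not in available_streaming_modes(engine):
        raise ValueError(f"{engine.name} does not support streaming mode {mode!r}")
    return mode


engine_a = EngineInfo("engine_a", streaming_capable=True)   # hypothetical engine
engine_b = EngineInfo("engine_b", streaming_capable=False)  # hypothetical engine
print(available_streaming_modes(engine_a))
print(available_streaming_modes(engine_b))
```

A front end can build its streaming selector from `available_streaming_modes`, so the option disappears for engines that cannot stream, while `set_streaming` guards direct API calls.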

image

As I say though, there's a lot of rough code in there at the moment and I'm working my way through it bit by bit and hope to post a BETA sometime soon.

@czuzu
Author

czuzu commented May 8, 2024

That's looking good!

@erew123
Owner

erew123 commented Jun 14, 2024

Hi @czuzu, just to say I'm keeping an eye on the PR at TGWUI... I've not forgotten your PR.

@erew123 erew123 added the "Awaiting 3rd party change" label (awaiting a change in a codebase that isn't AllTalk's code) Jun 19, 2024