Instant availability of speech input after opening #151

Open
nebkrid opened this issue Jan 15, 2023 · 9 comments
Labels
discussion Discussions or plans for the future tts&stt Speech-to-text and text-to-speech requests or bugs, including Vosk

Comments

@nebkrid
Contributor

nebkrid commented Jan 15, 2023

Hi,
Now that #109 (speech input recognition for other apps) is implemented (@Stypox, thanks for your positive feedback :D), I have come to realize that reloading the speech model every time is annoying (my hardware needs about 10 s). This also applies to the main app, at least on first launch and after the system has shut it down for being in the background too long, but it is even more critical for the "external" speech service.
To improve this, I have been wondering whether a continuously running background service that keeps the model loaded might be an option; Dicio would then call it for speech processing. I assume this could be connected with #126 and #54, referenced from your roadmap #129. Implementing this as a background service may even be the best approach, because a service defined as a speech recognition service will hopefully not be stopped so easily by the system.
But before I try this: does anyone have hints or thoughts about it?

@Stypox Stypox added discussion Discussions or plans for the future tts&stt Speech-to-text and text-to-speech requests or bugs, including Vosk labels Jan 15, 2023
@Stypox
Owner

Stypox commented Jan 15, 2023

A possible solution would be to start listening as soon as the app starts, and then feed the input from the microphone to Vosk at maximum speed as soon as it is ready.

@AyoungDukie

AyoungDukie commented Jan 15, 2023

Out of curiosity, does the responsiveness change based on enabling/disabling battery optimization? E.g. setting to unrestricted in Android settings?

I'm just curious whether being optimized for Doze affects this or not.

@nebkrid
Contributor Author

nebkrid commented Jan 16, 2023

A possible solution would be to start listening as soon as the app starts, and then feed the input from the microphone to Vosk at maximum speed as soon as it is ready.

Sounds like a good option, not just as a workaround but even for the first start of the service. I didn't expect that increasing the speed would be possible, or that it would not affect Vosk's voice recognition.

Out of curiosity, does the responsiveness change based on enabling/disabling battery optimization? E.g. setting to unrestricted in Android settings?

I tried it, but it seems to make no difference. However, I think the initialization of the Vosk model has more to do with which thread/context it is running in. For example, I observed the following behaviour:

  1. Opening Dicio for the first time (or after a "longer" time unused) -> model is initialized
  2. Switching to any other app
  3. Switching back to Dicio -> instantly available
  4. Choosing any third app for a voice input intent -> starting voice input -> Dicio overlay initializes
  5. Making another voice input directly afterwards -> Dicio overlay initializes again
  6. Switching back to the Dicio main app (still in background) -> still instantly available

=> So I think there needs to be one main instance of a Dicio service / process / background thread that is accessed for voice input by the Dicio main app, the app overlay and a system-registered speech recognizer service. But I am not sure whether defining a service in the manifest means that it will be instantiated only once or anew for each requesting app (as seems to happen with the overlay, which runs in a different context than the Dicio main app, although it is defined and provided by Dicio).
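
Independent of how Android instantiates services, the "one main instance" idea can be pictured as a process-wide holder that loads the model at most once and hands the same instance to every caller. The following is only a minimal sketch under that assumption: `SharedModel` and its `loader` are hypothetical names, not part of Dicio or Vosk, and the loader stands in for whatever actually loads the Vosk model.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Hypothetical holder: loads an expensive model at most once and shares it. */
final class SharedModel<M> {
    private final Supplier<M> loader;    // assumption: stands in for the real model loading
    private CompletableFuture<M> future; // null until the first request

    SharedModel(Supplier<M> loader) {
        this.loader = loader;
    }

    /** Every caller gets the same future; only the first call starts loading. */
    synchronized CompletableFuture<M> get() {
        if (future == null) {
            // Load on a background thread so the caller is not blocked.
            future = CompletableFuture.supplyAsync(loader);
        }
        return future;
    }
}
```

A holder like this only helps callers that live in the same process; if the overlay or the recognizer service ran in a separate process, each process would still load its own copy.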

@Stypox
Owner

Stypox commented Jan 16, 2023

I didn't expect that increasing speed is possible

I don't actually mean increasing the speed of the audio, but rather just feeding audio samples to the Vosk recognizer as fast as possible. The recognizer will still interpret each sample as if it were some small number of milliseconds long, so feeding it faster than real time should not change what it recognizes.
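
To make the idea concrete, here is a minimal sketch of such a pre-roll buffer: microphone chunks are queued while the model is still loading and replayed in order, without any delay, once the recognizer is ready. `PreRollBuffer`, `onAudio` and `onRecognizerReady` are hypothetical names, and the `Consumer<short[]>` is only a stand-in for the call that passes audio to the Vosk recognizer; none of this is taken from Dicio's code.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

/** Hypothetical buffer: holds audio until the recognizer is ready, then replays it. */
class PreRollBuffer {
    private final Queue<short[]> pending = new ArrayDeque<>();
    private Consumer<short[]> sink; // null while the model is still loading

    /** Called for every chunk read from the microphone. */
    synchronized void onAudio(short[] chunk) {
        if (sink == null) {
            pending.add(chunk.clone()); // copy, since the caller may reuse its array
        } else {
            sink.accept(chunk);         // recognizer is ready: pass audio straight through
        }
    }

    /** Called once, when the model has finished loading. */
    synchronized void onRecognizerReady(Consumer<short[]> recognizer) {
        while (!pending.isEmpty()) {
            recognizer.accept(pending.poll()); // no sleeping: drain as fast as possible
        }
        sink = recognizer;
    }
}
```

The queue in this sketch is unbounded, which is acceptable for a load time of a few seconds; a real implementation would probably cap it and would avoid draining while holding the lock the audio thread needs.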

@nebkrid
Contributor Author

nebkrid commented Jan 24, 2023

I have uploaded a first draft to my fork. It is too early for merging, but maybe you can test how it behaves on your phones?
The main thing I have done so far is to split VoskInputDevice.java into three parts: the Dicio recognition service SttService.java, which uses Vosk; SpeechRecogServiceInputDevice.java as a more generalized input for Dicio; and VoskInputDevice.java, which handles downloading of Vosk models.
With this, Dicio is shown as a system-wide STT option on my phone (however, I am not sure whether it always works. Does someone know another app that uses this feature, for testing?)
Additionally, I added a preference option to use the system-provided STT service for Dicio instead of Vosk. It is mainly for testing, but I think I read that someone already asked for a feature to use the pre-installed STT service.

Regarding instant availability: switching between the Dicio main app and the Dicio overlay now works instantly for some time, until the background service is shut down by the system due to inactivity. (@AyoungDukie unfortunately, battery optimization does not seem to influence this.) The initialization, however, still takes its time.

@Stypox I looked at the way Vosk is started and at what other options for Vosk initialization may be possible, but I still don't really have a starting point for how to implement your idea of "just feeding audio samples as fast as possible to the Vosk recognizer" with Vosk. Does this mean that some kind of buffering would have to be implemented in the SttService?

@Stypox
Owner

Stypox commented Jan 28, 2023

I built an APK if you want to test it: app-debug.zip. I will test it a bit tomorrow.

@Stypox
Owner

Stypox commented Jan 29, 2023

Overall it looks promising, thanks for working on this. I can finally see the service appear under "Voice input", though I don't know how to test whether it actually works in other apps, too. Feel free to open a PR from that branch; you will still be able to push commits without issues.

I found these problems:

  • when the model has not been downloaded yet, Dicio just crashes when trying to listen
  • Dicio also crashes if a non-English model was downloaded
  • If you first download the English model and then switch to another language, the model used by the background service is still the English one
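
For the last point, one possible approach (a sketch only, not how the branch handles it) is to remember which language the cached model was loaded for and reload when a different language is requested. `LanguageAwareModelCache` and its `loader` function are hypothetical; the loader stands in for loading the downloaded Vosk model of a given language.

```java
import java.util.Objects;
import java.util.function.Function;

/** Hypothetical cache: keeps one loaded model and reloads it when the language changes. */
final class LanguageAwareModelCache<M> {
    private final Function<String, M> loader; // assumption: loads the model for a language code
    private String loadedLanguage;            // null until something has been loaded
    private M model;

    LanguageAwareModelCache(Function<String, M> loader) {
        this.loader = loader;
    }

    /** Returns the cached model, reloading only if the requested language differs. */
    synchronized M forLanguage(String language) {
        if (!Objects.equals(language, loadedLanguage)) {
            model = loader.apply(language);
            loadedLanguage = language;
        }
        return model;
    }
}
```

Repeated requests for the same language reuse the loaded model, so responsiveness is preserved, while a language switch in the settings triggers exactly one reload.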

Also, maybe it would be a good idea to completely extract STT from Dicio and build a separate app Dicio can interact with through SttService? That way people who want to use a STT different than Vosk can just install it. Let me know your thoughts.

Does this mean that some kind of buffering would have to be implemented in the SttService?

I think so, yeah. But for now, don't worry about that. The background service is already more responsive than before.

@nebkrid
Contributor Author

nebkrid commented Feb 1, 2023

OK, then I will open a PR so that it shows up directly in the main repository here.
Thanks for testing. I will have a look at the issues you have already found.

Also, maybe it would be a good idea to completely extract STT from Dicio and build a separate app Dicio can interact with through SttService? That way people who want to use a STT different than Vosk can just install it. Let me know your thoughts.

Indeed, a simple STT service was actually what I was originally looking for when I found Dicio, in which the whole business of downloading the STT service etc. was already implemented. However, I now think there are multiple points of view on this:

  • For someone who doesn't want or need an assistant, only STT: yes, separating would definitely be the best thing. Especially as Dicio grows and gains more functionality, it is less a matter of disk usage etc. than a matter of the permissions the user would have to grant without actually using them.
  • The developer view: keeping it together simplifies things, especially in the "transition time" of developing the STT service while still keeping Dicio working reliably. Additionally, I am unsure about the VoiceInteractionService: neither what exactly it is for (though it sounds useful for an assistant), nor whether Dicio can implement it without having its own STT service.
  • Dicio user view: if someone wants not only the assistant interface but also offline STT, it would be easier to install only one app instead of two and configure the appropriate system settings. => not separating. However, there are probably also users who are not satisfied with Vosk in their language, so they would not actually need the STT module. => separating.

To sum up, I also think that once the STT is reliably separated within Dicio, separating it completely would be the most flexible option for everyone. But Dicio users would then need a good step-by-step guide to install and activate the standalone Vosk/Dicio STT service app on their system.

@paolo-caroni

Especially in the "transition time" of developing the STT service but still keep dicio reliable working.

I don't think that developing an STT service on its own is really needed. Maintaining and developing both would be a big job for a developer. There is a lot of work in progress on offline STT and TTS engines, which I have listed here. None of them is actually useful yet, but many are in active development.

4 participants