Is speech SFT feasible for a new domain (e.g. a new language), so that all "linked" parts (Whisper, LLM, and ChatTTS) learn in an e2e fashion? Or should one first try some sort of CPT on the individual parts to improve them for the new domain?
Hello! Generally speaking, for any field where each module has had no prior exposure, such as Spanish Speech Q&A, it is necessary to first ensure that each module possesses the corresponding foundational capabilities: Whisper should be able to extract Spanish features, the LLM should be able to understand Spanish and reply in Spanish, and ChatTTS should be able to read Spanish aloud. For training these capabilities, end-to-end training is the most efficient optimization strategy and should yield the best results. There's no need for stage-wise training, as it would significantly reduce your data utilization efficiency. Moreover, training new foundational capabilities requires a lot of data. If you are constrained by GPU resources, I recommend applying LoRA to MiniCPM-o 2.6's LLM while keeping both Whisper and ChatTTS fully tunable.
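As a concrete illustration of that setup, here is a minimal sketch using `transformers` and `peft`: LoRA on the LLM, full fine-tuning for the audio encoder and TTS head. The submodule attribute names (`model.llm`, `model.apm`, `model.tts`) and the LoRA `target_modules` are assumptions and should be checked against the actual MiniCPM-o 2.6 model code; this is not the official training recipe.

```python
# Sketch only: freeze everything, add LoRA to the LLM, and re-enable
# full training for Whisper (audio encoder) and ChatTTS (TTS head).
# Submodule names below are ASSUMPTIONS, not verified against the repo.
import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-o-2_6",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

# Freeze all parameters first, then selectively re-enable below.
for p in model.parameters():
    p.requires_grad = False

# LoRA on the LLM's attention projections (module names are assumptions).
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model.llm = get_peft_model(model.llm, lora_cfg)  # assumes the LLM lives at model.llm

# Full fine-tuning for the Whisper-based audio encoder and the ChatTTS head;
# the attribute names model.apm / model.tts are assumptions.
for p in model.apm.parameters():
    p.requires_grad = True
for p in model.tts.parameters():
    p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable / 1e6:.1f}M")
```

From here, the resulting model can be dropped into a standard training loop or `Trainer`; only the LoRA adapters and the unfrozen encoder/decoder weights receive gradients.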
Generally speaking, I agree. Can you provide a training example for such SFT? ChatTTS will need to adapt the most, since Whisper and Qwen have seen decent multilingual PT data, and images are multilingual ;)