Is speech SFT feasible for a new domain (e.g. a new language), so that all "linked" parts (Whisper, LLM, and ChatTTS) learn in an e2e fashion? Or should one first try some sort of CPT on the individual parts to improve them for the new domain?
Hello! Generally speaking, for any field where each module has had no prior exposure, such as Spanish Speech Q&A, it is necessary to first ensure that each module possesses the corresponding foundational capabilities: Whisper should be able to extract Spanish features, the LLM should be able to understand Spanish and reply in Spanish, and ChatTTS should be able to read Spanish aloud. For training these capabilities, end-to-end training is the most efficient optimization strategy and should yield the best results. There's no need for stage-wise training, as it would significantly reduce your data utilization efficiency. Moreover, training new foundational capabilities requires a lot of data. If you are constrained by GPU resources, I recommend applying LoRA to MiniCPM-o 2.6's LLM while keeping both Whisper and ChatTTS fully tunable.
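As a concrete illustration of that setup, here is a minimal sketch using `transformers` and `peft`: LoRA on the LLM, full fine-tuning for the audio encoder and TTS head. The submodule attribute names (`model.llm`, `model.apm`, `model.tts`) and the LoRA `target_modules` are assumptions and should be checked against the actual MiniCPM-o 2.6 model code; this is not the official training recipe.

```python
# Sketch only: freeze everything, add LoRA to the LLM, and re-enable
# full training for Whisper (audio encoder) and ChatTTS (TTS head).
# Submodule names below are ASSUMPTIONS, not verified against the repo.
import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-o-2_6",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

# Freeze all parameters first, then selectively re-enable below.
for p in model.parameters():
    p.requires_grad = False

# LoRA on the LLM's attention projections (module names are assumptions).
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model.llm = get_peft_model(model.llm, lora_cfg)  # assumes the LLM lives at model.llm

# Full fine-tuning for the Whisper-based audio encoder and the ChatTTS head;
# the attribute names model.apm / model.tts are assumptions.
for p in model.apm.parameters():
    p.requires_grad = True
for p in model.tts.parameters():
    p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable / 1e6:.1f}M")
```

From here, the resulting model can be dropped into a standard training loop or `Trainer`; only the LoRA adapters and the unfrozen encoder/decoder weights receive gradients.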
Generally speaking, I agree. Can you provide a training example for such SFT? ChatTTS will need to adapt the most, since Whisper and Qwen have seen decent multilingual PT data, and images are multilingual ;)