Support conversion for distil-whisper model #1529
Conversation
please rebase or fix your PR the same way as here: 5258816 |
Previous code was assuming the same number of encoder and decoder layers. Removed this assumption and obtained the number of layers separately.
a255174 to cd26b3e
Can you share the code to convert it from HF to CTranslate2? |
@vince62s I have rebased as requested. |
If it helps, I converted them to CTranslate2 float16 with this PR here: |
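For reference, not the exact command used above, but a minimal sketch of what a float16 conversion might look like with this PR's changes, using the Python converter API (the output directory name and the copied file list are only examples; the CLI equivalent is `ct2-transformers-converter`):

```python
import ctranslate2

# Convert the Hugging Face distil-whisper checkpoint to CTranslate2 format,
# quantizing the saved weights to float16. The copied files are the ones
# commonly kept alongside Whisper conversions; adjust as needed.
converter = ctranslate2.converters.TransformersConverter(
    "distil-whisper/distil-large-v2",
    copy_files=["tokenizer.json", "preprocessor_config.json"],
)
converter.convert("distil-large-v2-ct2", quantization="float16")
```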
Thank you for your work! I tried these converted models and I'm getting an error:
Can you confirm that it's working in your case so I can be sure it's something on my end? |
I've observed this with the medium.en model, but not with large-v2. For my applications I've never used the English-only models, so I haven't taken the time (yet?) to dig into this more. |
Thanks! Is there a way to convert into CTranslate2 while preserving the original float32 precision? |
This is also something I haven't experimented with. For a variety of reasons, most CTranslate2 users don't use float32, which is why you see most conversion scripts, examples, etc. targeting float16. Many (if not most) CTranslate2 users also end up running int8, int8_float16, etc., so converting to float16 for distributing models makes the most sense for most use cases. CTranslate2 also has the very useful ability to load float16-converted models in the user's configured precision automatically and efficiently at runtime. |
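As a sketch of keeping full precision, assuming the same Python converter API and the distil-whisper/distil-large-v2 checkpoint as an example (the output directory name is hypothetical):

```python
import ctranslate2

converter = ctranslate2.converters.TransformersConverter(
    "distil-whisper/distil-large-v2"
)

# With no `quantization` argument, the weights are saved as loaded from the
# checkpoint, i.e. their original precision rather than float16 or int8.
converter.convert("distil-large-v2-ct2-float32")
```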
I love float32. I've done tests with small.en and other models, and most of them actually run faster in float32 than in int8_float32, for example. The trade-off is that float32 requires more VRAM. Float32 also runs faster than bfloat16 in my testing (RTX 4090 using CUDA), though bfloat16 uses less VRAM. Interestingly, float16 is significantly faster than bfloat16 in my tests... not sure why that is.

At any rate, having ALL the CTranslate2 quants for the distil-whisper models would be helpful so people can choose. You'll notice I've converted all the Whisper models to all permutations of CTranslate2 quants here: https://huggingface.co/ctranslate2-4you I would love to do the same for the distil-whisper models, but I need the conversion command as well as the updated version of the ctranslate2 library that actually supports converting distil-whisper models. Any help would be much appreciated, and I'd be happy to convert any and all quants for people to use.

I also want to add: converting at "runtime" is not always the most efficient way of running the models, which is why additional converted quants are preferable. There's the overhead of the conversion itself (albeit small), but more importantly, it's my understanding that quality can suffer if you convert at runtime. Let me explain. Scenario 1: the model was in float32 and was converted to int8_float32, so no conversion is needed at runtime. Scenario 2: the model was converted to float16 first and is then converted to int8_float32 at runtime. It's my understanding that even though both run in int8_float32, Scenario 1 is higher quality, because certain data is lost when converting to float16 first and then to int8_float32. This comes down to how the exponent and fraction bits of the floating-point format are truncated or zeroed out; you can't get them back. Float16 has already truncated those bits, so when you then use int8_float32 you don't get the full float32 portion, because you're basing the runtime conversion on float16. Thus, having the quants you want natively is helpful. I'm willing to do the work converting if I only knew how. Thanks. |
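For context, the runtime conversion mentioned above corresponds to the `compute_type` argument when loading a converted model. A minimal sketch, assuming a locally converted distil-whisper directory (the path name is a placeholder):

```python
import ctranslate2

# Load a converted model and let CTranslate2 cast the stored weights to the
# requested compute type at load time ("int8_float32" here as an example).
model = ctranslate2.models.Whisper(
    "distil-large-v2-ct2",
    device="cuda",
    compute_type="int8_float32",
)
```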
Previous code was assuming the same number of encoder and decoder layers, which is no longer true with the distil-whisper model. Removed this assumption and obtained the number of layers separately.
distil-whisper: https://github.com/huggingface/distil-whisper
Fixes this issue: huggingface/distil-whisper#3
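As an illustration of the asymmetry this change accounts for (not the converter's actual code), the Hugging Face Whisper config exposes the two layer counts separately, and for distil-whisper checkpoints they differ:

```python
from transformers import WhisperConfig

# distil-whisper keeps the full encoder but prunes the decoder, so the two
# layer counts must be read independently rather than assumed equal.
config = WhisperConfig.from_pretrained("distil-whisper/distil-large-v2")
print(config.encoder_layers)  # e.g. 32 for distil-large-v2
print(config.decoder_layers)  # e.g. 2 for distil-large-v2
```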