update cuda 11.2 to cuda 12.2 #1590

Merged: 10 commits merged into OpenNMT:master on Feb 7, 2024

Conversation

@minhthuc2502
Collaborator

No description provided.

@minhthuc2502 changed the title from "update cuda 11.2 to cuda 12.2" to "update cuda 11.2 to cuda 12.2 WIP" on Dec 20, 2023
@minhthuc2502 changed the title from "update cuda 11.2 to cuda 12.2 WIP" to "update cuda 11.2 to cuda 12.2" on Dec 21, 2023
@Purfview

Purfview commented Dec 22, 2023

Thx for the PR, but I got disappointed by CUDA 12...
Would be nice if you could keep future CUDA 11 builds.

RTX 3050 GPU [diff vs CUDA 11]:

float16:      -10% drop in speed 
bfloat16:      -8% drop in speed
int8_bfloat16:  0% same

@BBC-Esq

BBC-Esq commented Dec 22, 2023

I'm all for keeping backwards compatibility with CUDA 11, if feasible. I'm not sure if Purfview was suggesting not to pursue CUDA 12 builds altogether...but if that's the case I'd definitely recommend a lot more testing. In terms of backwards compatibility, though, it's virtually always a good thing IMHO...coming from a non-professional, hobbyist developer, so...

(apologies in advance if "backwards compatibility" wasn't the correct phrase...)

@nguyendc-systran
Contributor

> Thx for the PR, but I got disappointed by CUDA 12... Would be nice if you could keep future CUDA 11 builds.
>
> RTX 3050 GPU [diff vs CUDA 11]:
>
>     float16:      -10% drop in speed
>     bfloat16:      -8% drop in speed
>     int8_bfloat16:  0% same

Thanks @Purfview for sharing information.
We will try to run some benchmarking on our side at the beginning of 2024. If nothing is blocking, I think we will merge to support CUDA 12 and still keep CUDA 11 support for a while (if that support is of interest to the community).

@Purfview

Purfview commented Dec 26, 2023

Diff from the tests in a new environment [various optimizations for performance]:

float16:       -1% drop in speed 
bfloat16:      -5% drop in speed
int8_float16: -21% drop in speed

EDIT:
This and the previous tests were done on Windows [Nvidia Driver Version: 546.33].
Every compute type ran 3 times and the speed was averaged.
"new environment" had a hyper option enabled -> Hardware-accelerated GPU scheduling: ON

@BBC-Esq

BBC-Esq commented Dec 26, 2023

> Thx for the PR, but I got disappointed by CUDA 12... Would be nice if you could keep future CUDA 11 builds.
>
> RTX 3050 GPU [diff vs CUDA 11]:
>
>     float16:      -10% drop in speed
>     bfloat16:      -8% drop in speed
>     int8_bfloat16:  0% same

> Thanks @Purfview for sharing information. We will try to run some benchmarking on our side at the beginning of 2024. If nothing is blocking, I think we will merge to support CUDA 12 and still keep CUDA 11 support for a while (if that support is of interest to the community).

On my end (and I'm just one amateur developer among professionals), I know that my user base (however small) would appreciate CUDA 11 support for a while longer, at least. Not all computer setups support CUDA 12, the matching Python libraries, etc., so having ctranslate2 work with only one version of CUDA at a time would be harsh. I've noticed that PyTorch's policy is generally to advertise support for two major CUDA versions... maybe that could be a policy for ctranslate2 too?

And then ctranslate2 could have a repository of older builds that's easy for users to understand, showing which versions of CUDA are supported up to which version of CTranslate2... like PyTorch has an "old builds" page.

Anyway, I'm excited! Just saw this... IMHO, a year is a pretty long time to go without CUDA 12 support, short of compiling from source...

@BBC-Esq

BBC-Esq commented Jan 25, 2024

Does anyone know if this is still being worked on? It was on the verge of incorporating CUDA 12+, but it's been a while.

@Qubitium

Qubitium commented Feb 2, 2024

Get this merged asap! I see no regression on my end.

@BBC-Esq

BBC-Esq commented Feb 2, 2024

> Get this merged asap! I see no regression on my end.

Yes, please merge. I don't even think they're talking about removing support for CUDA 11.8, but just adding CUDA 12 support!

@minhthuc2502 merged commit 8c6715e into OpenNMT:master on Feb 7, 2024
17 checks passed
@BBC-Esq

BBC-Esq commented Feb 15, 2024

@minhthuc2502 Is it possible to upload this to pypi.org now so that I can "pip install" the newer version that supports CUDA 12?

@ozancaglayan
Contributor

ozancaglayan commented Feb 16, 2024

Do you have an idea why I get a nice speedup with the small Whisper model with bfloat16 compared to auto (which selects int8_float16), but a horrible slowdown for the medium model? The GPU is an A10G. With bfloat16, runtime also fluctuates a lot, with weird outlier runs that are super slow. What's also interesting is that it does not seem to happen with the large-v2 model.

Whisper config is the following:

temperature: 0
beam_size: 1
condition_on_previous_text: false
vad_filter: false
  • Each experiment is run 5 times and the median time is reported. The audio file has 5 minutes of content. I also report the number of words generated at each trial; it is consistently the same across the 5 runs, so there is no stochasticity in the generation.

  • You can see that the *bfloat16 dtypes (bfloat16 and int8_bfloat16) always generate much more text than the other data types, though for large-v2 the situation is much better. For those runs, the actual word count is shown in parentheses in the table. However, it never gets near ~120 words (the closest is 130-138 for large-v2), which is the golden transcription produced by the other dtypes.

  • The number of words generated does not change across CUDA/CTranslate2 versions.

  • I see no significant speed differences between CTranslate2 v3 CUDA 11 and CTranslate2 v4 CUDA 12, which is good.

| Compute type  | #Words (5 runs) | CTv3-CUDA11 small | CTv3-CUDA11 medium | CTv3-CUDA11 large-v2 | CTv4-CUDA12 small | CTv4-CUDA12 medium | CTv4-CUDA12 large-v2 |
|---------------|-----------------|-------------------|--------------------|----------------------|-------------------|--------------------|----------------------|
| float32       | ~120            | 1.72              | 3.45               | 6.54                 | 1.71              | 3.41               | 6.50                 |
| float16       | ~120            | 1.17              | 2.18               | 3.88                 | 1.20              | 2.27               | 3.82                 |
| bfloat16      | >500            | 2.26              | 6.18               | 5.20 (258w)          | 2.37              | 6.20               | 5.30 (250w)          |
| int8_float32  | ~120            | 1.47              | 2.80               | 4.47                 | 1.50              | 2.93               | 4.41                 |
| int8_float16  | ~120            | 1.29              | 2.34               | 3.73                 | 1.34              | 2.78               | 3.56                 |
| int8_bfloat16 | >500            | 2.70              | 6.48               | 3.50 (137w)          | 2.30 (400w)       | 6.92               | 3.75 (130w)          |

Conclusion: the main problem is bfloat16 over-generating. Maybe it's a quantization issue, or a float conversion issue in the faster-whisper -> CTranslate2 model conversion.
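
For context, here is a rough sketch of how the word-count check behind this table can be reproduced with faster-whisper, using the config quoted above; the model size and audio path are placeholders, and the exact timing harness that produced the table may differ:

```python
from faster_whisper import WhisperModel

AUDIO_FILE = "five_minutes.wav"  # placeholder: ~5 minutes of clear speech

def transcribe_word_count(model_size: str, compute_type: str) -> int:
    model = WhisperModel(model_size, device="cuda", compute_type=compute_type)
    segments, _info = model.transcribe(
        AUDIO_FILE,
        temperature=0,
        beam_size=1,
        condition_on_previous_text=False,
        vad_filter=False,
    )
    # Sum the words across all segments; the bfloat16 runs above produced
    # far more than the ~120-word reference transcription.
    return sum(len(segment.text.split()) for segment in segments)

for ct in ("float32", "float16", "bfloat16", "int8_float16", "int8_bfloat16"):
    print(ct, transcribe_word_count("medium", ct))
```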

@Purfview

@ozancaglayan Use temperature=0 so that the benchmark tests are consistent.

@ozancaglayan
Contributor

Yes, I just noticed that, but it's again weird that bfloat16 runs tend to be affected by temperature whereas the other ones are not.

@Purfview

It's not weird at all.

@ozancaglayan
Contributor

OK, I updated the table; bfloat16 is still inconsistent across model types.

@Purfview

Purfview commented Feb 16, 2024

Try much longer tests, not just a few seconds.

Btw, are you saying that this inconsistency appeared with CUDA12?

@ozancaglayan
Contributor

ozancaglayan commented Feb 16, 2024

I'm now repeating the tests with CTranslate2 < 4 using CUDA 11. The inconsistencies are there as well. I'm running each test 5 times on the same 5-minute audio file, so I think it's good enough.

I'm now counting the length of the texts generated at each run, and bfloat16 is definitely over-generating the same contents again and again. Applying VAD beforehand seems to cut the number of segments/words generated for bfloat16, but it's still over-generating.
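
For illustration, enabling the VAD filter in faster-whisper looks roughly like this; the model size, audio path, and silence threshold are example values, not necessarily the ones used in these tests:

```python
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="bfloat16")
segments, info = model.transcribe(
    "five_minutes.wav",                                # placeholder audio path
    vad_filter=True,                                   # drop non-speech before decoding
    vad_parameters={"min_silence_duration_ms": 500},   # example threshold only
)
print(sum(len(s.text.split()) for s in segments))      # word count after VAD
```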

@Purfview

> definitely over-generating the same contents again and again

Are you using clear speech audio, without noise and silence?

@ozancaglayan
Contributor

Very clear speech, with completely silent blocks. But even if I apply VAD, bfloat16 still seems to over-generate.

@ozancaglayan
Contributor

I updated my previous table with the final results, btw; see #1590 (comment).

@Purfview

I'm sure it's something wrong with your test rather than anything else.

@ozancaglayan
Contributor

Did you try doing similar benchmarking with and without bfloat16 on a supported device? Why would I see this consistently only with the *bfloat16 types?
