Latest commit on cudarc seems to have broken running the examples codes #2175
Comments
We recently updated the cudarc crate to a version that changes how CUDA is discovered; I guess that is what is causing the issue. Maybe you can adjust your |
OK, I have added /usr/local/cuda/targets/x86_64-linux/lib/stubs to my LD_LIBRARY_PATH and confirmed that libcuda.so exists in that directory. However, I am still getting an error like this: $ RUST_BACKTRACE=full cargo run --example quantized --release --features cuda -- --which 7b-open-chat-3.5 --prompt interactive $ nvcc --version |
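To double-check which directories on LD_LIBRARY_PATH actually contain a libcuda file, a small standalone check along these lines may help. It is only a sketch: `find_in_path_list` is a hypothetical helper, not part of cudarc or candle, and the candidate name `libcuda.so.1` is an assumption about how the driver is usually installed. The dynamic loader also consults the ldconfig cache and default system directories, which this sketch does not inspect.

```rust
use std::env;
use std::path::PathBuf;

/// Hypothetical helper: returns every existing `dir/name` for each directory
/// in a platform-style path list (colon-separated on Linux) and each name.
fn find_in_path_list(path_list: &str, names: &[&str]) -> Vec<PathBuf> {
    let mut hits = Vec::new();
    for dir in env::split_paths(path_list) {
        // Skip empty entries so the current directory is not searched by accident.
        if dir.as_os_str().is_empty() {
            continue;
        }
        for name in names {
            let candidate = dir.join(name);
            if candidate.is_file() {
                hits.push(candidate);
            }
        }
    }
    hits
}

fn main() {
    // An unset variable is treated as an empty list.
    let path_list = env::var("LD_LIBRARY_PATH").unwrap_or_default();
    // Assumed candidate names; adjust for your system.
    let hits = find_in_path_list(&path_list, &["libcuda.so", "libcuda.so.1"]);
    if hits.is_empty() {
        println!("no libcuda candidates found in LD_LIBRARY_PATH");
    }
    for hit in &hits {
        println!("{}", hit.display());
    }
}
```

A hit inside a `stubs` directory only shows that the link-time stub is present; it does not by itself show that the real driver library is installed.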
Ok, I've actually tweaked candle to be back at using dynamic linking rather than dynamic loading for cudarc in #2176 as I was able to reproduce the issue in some setups. Could you try it again with this? |
I just tested it, and it still fails to load |
@Cifko do you know where cuda.dll is located on your machine? And also is it called cuda.dll, or is there something called nvcuda.dll on your machine? @hololite are the cuda shared libraries located anywhere else on your machine (e.g. maybe there are some sym links that the stubs directory you shared links to)? |
@coreylowman @LaurentMazare $ find /usr -name libcuda.so The first directory, which is the one I tried (and where running the example code failed), belongs to the CUDA toolkit. Using the latest commit (d9bc5ec), I am still getting the error. $ cargo run --example quantized --release --features cuda -- --which phi3 --prompt interactive |
I had to take a couple of files and rename them (they are in the path) |
I get the same error when I try to run the MNIST sample code from the tutorial with CUDA enabled:
My CUDA version is 12.4. |
I also have this issue on an EC2 instance with an A10. $ RUST_BACKTRACE=1 cargo run --example quantized --release --features cuda -- --which 7b-open-chat-3.5 --prompt chat Caused by: --- stderr $ nvcc --version |
@hololite Huh, it looks like the header file (cuda.h) on that system has a CUDA_VERSION of 11.5 (11050) even though nvcc supports 12.4? The issue here is that cudarc only includes headers from 11.7 onwards at the moment (we could generate bindings for pretty much any version, though). We could either add the 11.5/11.6 headers, and/or address the difference between the header version and the CUDA version that nvcc reports as supported. |
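For readers unfamiliar with the 11050 value: cuda.h encodes CUDA_VERSION as major * 1000 + minor * 10. The sketch below decodes it and compares it against the 11.7 minimum mentioned in the comment above. Both function names are hypothetical and only illustrate the arithmetic; this is not how cudarc itself selects bindings.

```rust
/// Splits a `CUDA_VERSION` value from cuda.h into (major, minor).
/// The header encodes the version as major * 1000 + minor * 10.
fn decode_cuda_version(raw: u32) -> (u32, u32) {
    (raw / 1000, (raw % 1000) / 10)
}

/// Hypothetical check: true when the header version is at least 11.7,
/// the oldest version the comment above says cudarc bundles headers for.
fn has_bundled_bindings(raw: u32) -> bool {
    // Tuples compare lexicographically: major first, then minor.
    decode_cuda_version(raw) >= (11, 7)
}

fn main() {
    for raw in [11050, 11070, 12040] {
        let (major, minor) = decode_cuda_version(raw);
        println!(
            "CUDA_VERSION {raw} is {major}.{minor}, bundled bindings: {}",
            has_bundled_bindings(raw)
        );
    }
}
```

With this encoding, 11050 decodes to 11.5, which falls below the 11.7 minimum, while 12040 decodes to 12.4.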
I was able to build a sample with CUDA support, but it looks like there is an issue on Windows with using the CUDA libraries. The system is unable to find cuda.dll, which is not available on my machine. It performed lookup operations, but all lookups failed for this cuda.dll file. On Windows, CUDA libraries have names following this pattern: Has anyone been able to run Candle on Windows with CUDA support? |
@stalek71 Yes, I explained it above, and I also added it to the FAQ in candle. You basically need to rename the libraries. I copied them to the project folder so that I don't pollute my Windows or Program Files directories. They will eventually fix it. |
Thanks! I just did it the same way :) This is my simple code (I run it on a Quadro RTX 4000, compute capability 7.5):
use candle_core::{Device, Tensor};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Uncomment to run on the first CUDA device instead of the CPU.
    //let device = Device::new_cuda(0)?;
    let device = Device::Cpu;
    // Two random matrices drawn from a normal distribution (mean 0, std 1).
    let a = Tensor::randn(0f32, 1., (2, 3), &device)?;
    let b = Tensor::randn(0f32, 1., (3, 4), &device)?;
    let c = a.matmul(&b)?;
    println!("{c}");
    Ok(())
}
This is the result for CPU: This is the result for GPU: |
@stalek71 this question shouldn't be in this ticket. But if you look at the source code, there is a fixed seed set. |
@Cifko, thanks for the quick response. |
We have a tracking issue in cudarc for this (coreylowman/cudarc#219). It's not fully clear to me which naming patterns are possible on Windows. The original issue reporting this mentioned |
On Ubuntu under WSL on Windows, this problem still persists. It also occurs on an AWS EC2 instance with an A10G, although with different error messages. The example code used to work fine on all 3 of these machines prior to the recent cudarc changes. |
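While the set of possible names is unclear, one defensive approach is to try several candidate names in order and report everything that was attempted on failure. The sketch below shows only that control flow: `first_loadable` is a hypothetical helper, the closure stands in for a real dynamic-loading attempt, and the two names are simply the ones mentioned earlier in this thread.

```rust
/// Hypothetical helper: returns the first candidate accepted by `can_load`,
/// or the full list of names that were tried if none is accepted.
fn first_loadable<'a>(
    candidates: &[&'a str],
    can_load: impl Fn(&str) -> bool,
) -> Result<&'a str, Vec<&'a str>> {
    candidates
        .iter()
        .copied()
        .find(|name| can_load(*name))
        .ok_or_else(|| candidates.to_vec())
}

fn main() {
    // Names mentioned in this thread; a real system may use others.
    let candidates = ["cuda.dll", "nvcuda.dll"];
    // Stand-in for a real load attempt: pretend only nvcuda.dll can be loaded.
    let result = first_loadable(&candidates, |name| name == "nvcuda.dll");
    println!("{result:?}");
}
```

Returning the attempted names in the error case makes a failed lookup easier to diagnose than a single "cuda.dll not found" message.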
Same issue on a native Ubuntu install. --- stderr this is with I have tried both independently as well |
I think coreylowman/cudarc#240 may fix this? |
Yes, I just tested the most recent changes. It works now! |
This is finally fixed in a recent commit. |
Thanks for confirming, #2229 will update to include the latest changes. |
I just tested and I still get the same issue. I noticed that libcuda.so exists only in the stubs folder of my cuda-12.3 install. The main lib64 folder contains only libcudart.so. |
Which OS are you on?
|
I still get this issue after trying to update today:
Last working commit for me is 01794dc. I'm on WSL with cuda:
Edit: Running on RTX 4090 |
Actually, I was incorrect in saying this was fixed. The problem is still there.
Here is what I found (all machines running Ubuntu 22):
- On the A100 gpu, this problem does not reproduce (works fine)
- On the RTX 4090 and A10, this problem exists.
On Friday, June 7, 2024, tgmichel wrote:
I still get this issue after trying to update today:
Error: Candle(Cuda(Cuda(thread panicked while processing panic. aborting.
Aborted
Last working commit for me is 01794dc. I'm on WSL with cuda:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
|
Any updates on this? I tried to catch up again today (3c815b1) and I still get the same issue on my RTX 4090 WSL setup. |
Same here |
$ cargo run --example quantized --release --features cuda -- --which 7b-open-chat-3.5 --prompt interactive
Updating crates.io index
Finished `release` profile [optimized] target(s) in 1.83s
Running `target/release/examples/quantized --which 7b-open-chat-3.5 --prompt interactive`
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
thread 'main' panicked at /home/hikari/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.11.1/src/driver/sys/mod.rs:38:71:
called `Result::unwrap()` on an `Err` value: DlOpen { desc: "libcuda.so: cannot open shared object file: No such file or directory" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
If you checkout the branch at "01794dc1 - 2024-05-05 - Laurent Mazare - Use write rather than try-write on the metal rw-locks. (#2162)", then you can run the example fine.