Latest commit on cudarc seems to have broken running the examples codes #2175

hololite · 2024-05-08T20:59:33Z

$ cargo run --example quantized --release --features cuda -- --which 7b-open-chat-3.5 --prompt interactive
Updating crates.io index
Finished release profile [optimized] target(s) in 1.83s
Running target/release/examples/quantized --which 7b-open-chat-3.5 --prompt interactive
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
thread 'main' panicked at /home/hikari/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.11.1/src/driver/sys/mod.rs:38:71:
called Result::unwrap() on an Err value: DlOpen { desc: "libcuda.so: cannot open shared object file: No such file or directory" }
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

If you checkout the branch at "01794dc1 - 2024-05-05 - Laurent Mazare - Use write rather than try-write on the metal rw-locks. (#2162)", then you can run the example fine.

LaurentMazare · 2024-05-08T21:17:37Z

We recently updated the cudarc crate to a version that changes how CUDA is discovered; I guess that is what is causing the issue. Maybe you can adjust your LD_LIBRARY_PATH to include the directory in which your libcuda.so is located? Fwiw this works well on my setup.
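As a quick way to check the suggestion above, here is a minimal sketch (not part of candle or cudarc) that lists which LD_LIBRARY_PATH directories actually contain libcuda.so. The helper name `dirs_containing` is hypothetical. The dynamic loader also consults its cache and default directories, so an empty result here does not by itself prove the library cannot be loaded.

```rust
use std::env;
use std::path::PathBuf;

/// Returns every directory listed in a colon-separated search path
/// (the LD_LIBRARY_PATH format) that contains a file called `lib_name`.
/// Hypothetical helper for illustration only.
fn dirs_containing(search_path: &str, lib_name: &str) -> Vec<PathBuf> {
    search_path
        .split(':')
        .filter(|dir| !dir.is_empty())
        .map(PathBuf::from)
        .filter(|dir| dir.join(lib_name).exists())
        .collect()
}

fn main() {
    // An unset variable is treated as an empty search path.
    let search_path = env::var("LD_LIBRARY_PATH").unwrap_or_default();
    let hits = dirs_containing(&search_path, "libcuda.so");
    if hits.is_empty() {
        println!("libcuda.so not found in any LD_LIBRARY_PATH directory");
    }
    for dir in hits {
        println!("found libcuda.so in {}", dir.display());
    }
}
```

Running it in the same shell as `cargo run` shows what the process will inherit, which helps rule out a variable that was exported in a different session.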

hololite · 2024-05-08T21:27:53Z

OK, I have added /usr/local/cuda/targets/x86_64-linux/lib/stubs to my LD_LIBRARY_PATH and confirmed that libcuda.so exists in that directory.

However, I am still getting an error:

$ RUST_BACKTRACE=full cargo run --example quantized --release --features cuda -- --which 7b-open-chat-3.5 --prompt interactive
Finished release profile [optimized] target(s) in 0.10s
Running target/release/examples/quantized --which 7b-open-chat-3.5 --prompt interactive
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
Error: thread panicked while processing panic. aborting.
Aborted

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

LaurentMazare · 2024-05-09T08:39:15Z

Ok, I've actually tweaked candle to go back to using dynamic linking rather than dynamic loading for cudarc in #2176, as I was able to reproduce the issue in some setups. Could you try again with this?
Cudarc recently changed its default to dynamic loading, as detailed in coreylowman/cudarc#197. Candle might have to do the same at some point, but it seems a bit early for that.

Cifko · 2024-05-09T10:01:23Z

I just tested it, and it still fails to load cuda.dll on my Windows machine.
The error is the same: called 'Result::unwrap()' on an 'Err' value: LoadLibraryExW { source: Os { code: 126, kind: Uncategorized, message: "The specified module could not be found." } }

coreylowman · 2024-05-09T14:01:27Z

@Cifko do you know where cuda.dll is located on your machine? And also is it called cuda.dll, or is there something called nvcuda.dll on your machine?

@hololite are the cuda shared libraries located anywhere else on your machine (e.g. maybe there are some sym links that the stubs directory you shared links to)?

hololite · 2024-05-09T14:52:16Z

@coreylowman @LaurentMazare
Per @coreylowman's suggestions, I have checked my system, it appears that there are 2 locations in my WSL ubuntu that have the libcuda.so:

$ find /usr -name libcuda.so
/usr/local/cuda-12.4/targets/x86_64-linux/lib/stubs/libcuda.so
/usr/lib/wsl/lib/libcuda.so

The first directory, which is the one I tried (and which failed to run the example code), belongs to the CUDA toolkit.
The second directory belongs to the CUDA drivers, which are installed as part of Windows.
Both are on the same CUDA version: 12.4

Using the latest commit (d9bc5ec), I am still getting the error.
I tried setting either cuda dir in my LD_LIBRARY_PATH, but both failed with the same error:

$ cargo run --example quantized --release --features cuda -- --which phi3 --prompt interactive
Finished release profile [optimized] target(s) in 0.14s
Running target/release/examples/quantized --which phi3 --prompt interactive
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
Error: thread panicked while processing panic. aborting.
Aborted

Cifko · 2024-05-10T06:42:56Z

I had to take a couple of files and rename them (they are on the PATH):
c:\Windows\System32\nvcuda.dll -> cuda.dll
c:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin\cublas64_12.dll -> cublas.dll
c:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin\curand64_10.dll -> curand.dll
Now I can run the example.

Nie-Tianyi · 2024-05-10T12:16:54Z

I got the same error when I tried to run the MNIST sample code from the tutorial with CUDA enabled:

called `Result::unwrap()` on an `Err` value: LoadLibraryExW { source: Os { code: 126, kind: Uncategorized, message: "The specified module could not be found." } }

My CUDA version is 12.4.

hololite · 2024-05-13T17:40:51Z

I also have an issue running this on an EC2 instance with an A10.
The error is different though; it is related to building cudarc:

$ RUST_BACKTRACE=1 cargo run --example quantized --release --features cuda -- --which 7b-open-chat-3.5 --prompt chat
Compiling cudarc v0.11.1
error: failed to run custom build command for cudarc v0.11.1
note: To improve backtraces for build dependencies, set the CARGO_PROFILE_RELEASE_BUILD_OVERRIDE_DEBUG=true environment variable to enable debug information generation.

Caused by:
process didn't exit successfully: /home/ubuntu/candle/target/release/build/cudarc-12c4a369cb2cb021/build-script-build (exit status: 101)
--- stdout
cargo:rerun-if-changed=build.rs
cargo:rerun-if-env-changed=CUDA_ROOT
cargo:rerun-if-env-changed=CUDA_PATH
cargo:rerun-if-env-changed=CUDA_TOOLKIT_ROOT_DIR

--- stderr
thread 'main' panicked at /home/ubuntu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.11.1/build.rs:54:14:
Unsupported cuda toolkit version: 11050. Please raise a github issue.
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: build_script_build::cuda_version_from_build_system
3: build_script_build::main
4: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with RUST_BACKTRACE=full for a verbose backtrace.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

coreylowman · 2024-05-13T19:48:29Z

@hololite Huh, looks like the header file (cuda.h) on that system has a CUDA_VERSION of 11.5 (11050) even though nvcc supports 12.4? The issue here is that cudarc only includes headers for 11.7 onwards atm (we could generate bindings for pretty much any version though).

We could either add in 11.5/11.6 headers, and/or address this difference in what nvcc reports the supported cuda version as.
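For readers decoding the number in the build error: CUDA_VERSION in cuda.h is defined as major * 1000 + minor * 10, so 11050 reads as 11.5. A minimal sketch of that arithmetic follows; the function name `decode_cuda_version` is hypothetical and is not how cudarc's build script is written.

```rust
/// Splits a CUDA_VERSION-style integer (major * 1000 + minor * 10)
/// into (major, minor). Hypothetical helper for illustration only.
fn decode_cuda_version(raw: u32) -> (u32, u32) {
    (raw / 1000, (raw % 1000) / 10)
}

fn main() {
    // 11050 is the value from the build error; 11070 corresponds to the
    // oldest headers mentioned as supported; 12040 matches nvcc 12.4.
    for raw in [11050, 11070, 12040] {
        let (major, minor) = decode_cuda_version(raw);
        println!("{raw} -> CUDA {major}.{minor}");
    }
}
```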

stalek71 · 2024-05-16T13:30:06Z

I was able to build a sample with CUDA support, but it looks like there is an issue on Windows with using the CUDA libraries. The system is unable to find cuda.dll, which is not available on my machine. It performed lookup operations, but all lookups failed for this cuda.dll file.

On Windows, CUDA libraries have names following this pattern:

Has anyone been able to run Candle on Windows with CUDA support?

Cifko · 2024-05-16T13:50:19Z

@stalek71 yes, I explained it above, and I also added it to the FAQ in candle. You basically need to rename the libraries. I copied them to the project folder so that I don't pollute my Windows or Program Files directories. They will eventually fix it.
The cuda.dll you are looking for is nvcuda.dll in your c:\windows\system32.

stalek71 · 2024-05-16T14:22:21Z

Thanks! I just did it the same way :)
There is still one issue. I ran a very simple program a few times, building it for both GPU and CPU.
When I run it on the CPU, I get proper random values every time.
When I run it on the GPU, the numbers appear to be the same every time.
Both tensors are allocated on the proper device (or at least I expect they are)...

This is my simple code (I run it on a Quadro RTX 4000, compute capability 7.5):

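use candle_core::{Device, Tensor};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Swap the two lines below to run on the GPU instead of the CPU.
    //let device = Device::new_cuda(0)?;
    let device = Device::Cpu;

    // Two tensors sampled from a normal distribution (mean 0, std 1).
    let a = Tensor::randn(0f32, 1., (2, 3), &device)?;
    let b = Tensor::randn(0f32, 1., (3, 4), &device)?;

    let c = a.matmul(&b)?;

    println!("{c}");

    Ok(())
}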

This is the result for cpu:

This is the result for Gpu:

Cifko · 2024-05-16T15:06:36Z

@stalek71 this question shouldn't be in this ticket. But if you look at the source code, there is a fixed seed set.
https://github.com/huggingface/candle/blob/main/candle-core/src/cuda_backend/device.rs#L153
Just set some random seed by yourself.
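One possible way to get a different seed on each run, sketched with the standard library only, is to derive it from the wall clock. The helper `seed_from_clock` is hypothetical. How the value is handed to the device depends on the candle version in use (the comment below assumes a `Device::set_seed` method, which this thread does not show).

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Derives a seed from the wall clock so that each run starts differently.
/// Hypothetical helper; not suitable where unpredictability matters.
fn seed_from_clock() -> u64 {
    let elapsed = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is set before the Unix epoch");
    // Nanosecond resolution; the cast to u64 keeps the fast-changing low bits.
    elapsed.as_nanos() as u64
}

fn main() {
    let seed = seed_from_clock();
    println!("seed for this run: {seed}");
    // Assumption: pass `seed` to the CUDA device before sampling, for example
    // via candle's `Device::set_seed(seed)`. Verify the exact call against
    // the candle version you are using.
}
```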

stalek71 · 2024-05-16T15:29:31Z

@Cifko, thanks for the quick response.
Now everything is clear.

coreylowman · 2024-05-16T16:03:54Z

> I was able to build a sample with CUDA support, but it looks like there is an issue on Windows with using the CUDA libraries. The system is unable to find cuda.dll, which is not available on my machine. It performed lookup operations, but all lookups failed for this cuda.dll file.
>
> On Windows, CUDA libraries have names following this pattern:
>
> Has anyone been able to run Candle on Windows with CUDA support?

We have a tracking issue in cudarc for this (coreylowman/cudarc#219). It's not fully clear to me which patterns are possible on Windows. The original issue reporting this mentioned nvrtc64_120_0, which is different from the names you report. It seems like we could fall back to <lib name><32 or 64>_<11 or 12>.dll and then maybe <lib name><32 or 64>_<110_0 or 120_0>.dll. I'll work on getting this added for the 11.2 release of cudarc.
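To make the proposed fallback order concrete, here is a minimal sketch that enumerates the candidate file names for one library. It only illustrates the two patterns described above; the function name `candidate_dll_names` is hypothetical and this is not cudarc's actual lookup code.

```rust
/// Builds the candidate Windows file names for a CUDA library, following the
/// patterns `<lib><32|64>_<11|12>.dll` and `<lib><32|64>_<110_0|120_0>.dll`,
/// with the bare `<lib>.dll` tried first. Hypothetical helper.
fn candidate_dll_names(lib: &str) -> Vec<String> {
    let mut names = vec![format!("{lib}.dll")];
    for suffix in ["11", "12", "110_0", "120_0"] {
        for bits in ["32", "64"] {
            names.push(format!("{lib}{bits}_{suffix}.dll"));
        }
    }
    names
}

fn main() {
    for name in candidate_dll_names("nvrtc") {
        println!("{name}");
    }
}
```

This covers names such as nvrtc64_120_0.dll and cublas64_12.dll, but not curand64_10.dll or nvcuda.dll as reported earlier in the thread, so a real fallback list would need more cases than these two patterns.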

hololite · 2024-05-16T16:21:13Z

On Windows's WSL Ubuntu/linux, this problem still persists.
Adding the dir containing the libcuda.so to the LD_LIBRARY_PATH did not help.
I have 2 different laptops with WSL/Ubuntu having similar problems.

Also on AWS EC2 instance with A10G, the problem also exists although with different error messages.

The example codes used to work fine on all of these 3 machines prior to the recent cudarc changes.

Helios113 · 2024-05-17T08:40:43Z

Same issue on a native Ubuntu install.
Cargo finds the oldest version installed on the system.

--- stderr
thread 'main' panicked at /nfs-share/pa511/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.11.1/build.rs:54:14:
Unsupported cuda toolkit version: 10010. Please raise a github issue.
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

This is with
RUSTFLAGS='-L /usr/local/cuda-12.1/targets/x86_64-linux/lib/stubs', and
/usr/local/cuda-12.1/targets/x86_64-linux/lib/stubs added to LD_LIBRARY_PATH.

I have tried each of them independently as well.

EricLBuehler · 2024-05-30T15:39:22Z

I think coreylowman/cudarc#240 may fix this?

hololite · 2024-05-30T19:43:17Z

> I think coreylowman/cudarc#240 may fix this?

Yes, I just tested the most recent changes. It works now!

hololite · 2024-05-30T19:44:00Z

This is finally fixed in a recent commit.

LaurentMazare · 2024-05-30T20:02:16Z

Thanks for confirming, #2229 will update to include the latest changes.

Helios113 · 2024-05-31T13:10:00Z

I just tested and I still get the same issue.

I have noticed that I have libcuda.so only in the stubs folder of my cuda-12.3 install. In the main lib64 I only have libcudart.so

hololite · 2024-05-31T13:53:06Z

Which OS are you on?


tgmichel · 2024-06-07T10:00:51Z

I still get this issue after trying to update today:

Error: Candle(Cuda(Cuda(thread panicked while processing panic. aborting.
Aborted

Last working commit for me is 01794dc. I'm on WSL with cuda:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

Edit: Running on RTX 4090

hololite · 2024-06-07T10:40:24Z

Actually, I was incorrect in saying this was fixed. Yes, it is still there. What I found is the following (all running Ubuntu 22):
- On the A100 GPU, this problem does not reproduce (works fine).
- On the RTX 4090 and A10, this problem exists.


tgmichel · 2024-07-21T08:07:39Z

Any updates on this?

Tried to catch up again today (3c815b1) and I still get the same issue over RTX4090 wsl setup.

npuichigo · 2024-08-28T04:06:01Z

Same here

LaurentMazare mentioned this issue May 9, 2024

Discovering the cuda libraries with dynamic loading coreylowman/cudarc#232

Closed

EricLBuehler mentioned this issue May 14, 2024

Update containers to cuda 12.4 + Fix missing libraries EricLBuehler/mistral.rs#302

Merged

EricLBuehler mentioned this issue May 29, 2024

Running model from a GGUF file, only EricLBuehler/mistral.rs#326

Closed

coreylowman mentioned this issue Jun 12, 2024

Unable to find cuda lib under the names ["cuda", "nvcuda"] on WSL coreylowman/cudarc#255

Closed
