Multi-GPU failed in eigh kernel #316

Open
cvsik opened this issue Jan 23, 2025 · 6 comments

@cvsik

cvsik commented Jan 23, 2025

Hi everyone,

I'm trying to run gpu4pyscf 1.3.0 (installed from PyPI) on 2 GPUs, but I get an error when the initial Fock matrix is formed:

Traceback (most recent call last):
  File "/fs/home/cvsik/Projects/gpu4pyscf/examples/00-h2o.py", line 53, in <module>
    e_dft = mf_GPU.kernel()
            ^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/miniforge3/envs/tstoolkit/lib/python3.11/site-packages/gpu4pyscf/scf/hf.py", line 272, in scf
    _kernel(mf, mf.conv_tol, mf.conv_tol_grad,
  File "/fs/home/cvsik/miniforge3/envs/tstoolkit/lib/python3.11/site-packages/gpu4pyscf/scf/hf.py", line 192, in _kernel
    _, mf_diis.Corth = mf.eig(fock, s1e)
                       ^^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/miniforge3/envs/tstoolkit/lib/python3.11/site-packages/pyscf/scf/hf.py", line 1810, in eig
    return self._eigh(h, s)
           ^^^^^^^^^^^^^^^^
  File "/fs/home/cvsik/miniforge3/envs/tstoolkit/lib/python3.11/site-packages/gpu4pyscf/lib/cusolver.py", line 167, in eigh
    raise RuntimeError("failed in eigh kernel")
RuntimeError: failed in eigh kernel

The calculation runs fine on a single GPU (same node and environment).

Here is the output of lib.utils.format_sys_info() in case it helps:

System Information
System: uname_result(system='Linux', node='nvsa001', release='5.14.0-427.13.1.el9_4.x86_64', version='#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024', machine='x86_64')  Threads 24
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
numpy 1.26.4  scipy 1.12.0  h5py 3.10.0
Date: Thu Jan 23 09:23:37 2025
PySCF version 2.7.0
PySCF path  /fs/home/cvsik/miniforge3/envs/tstoolkit/lib/python3.11/site-packages/pyscf/__init__.py
CUDA Environment
    CuPy 13.1.0
    CUDA Path /opt/shared/apps/core/tools/cuda/11.3.1
    CUDA Build Version 11080
    CUDA Driver Version 12040
    CUDA Runtime Version 11080
CUDA toolkit
    cuSolver (11, 1, 2)
    cuBLAS 11501
    cuTENSOR 20001
Device info
    Device name b'Tesla V100S-PCIE-32GB'
    Device global memory 31.73 GB
GPU4PySCF 1.3.0
GPU4PySCF path  /fs/home/cvsik/miniforge3/envs/tstoolkit/lib/python3.11/site-packages/gpu4pyscf

Any help on how to debug or fix this is greatly appreciated!
Thanks a lot in advance 🚀

Best,
Max

@sunqm
Collaborator

sunqm commented Feb 7, 2025

There may be multiple libcusolver.so libraries on the system. I suspect an incorrect version is loaded. Could you run

LD_DEBUG=libs python -c 'import ctypes; ctypes.CDLL("libcusolver.so")'

to check which libcusolver is used.

@cvsik
Author

cvsik commented Feb 7, 2025

Here's the output of the command above. I cannot decipher anything, but maybe you can 😁
I'm inside a Conda environment, and Conda also provides all the cu* libraries. I would have expected a "wrong cusolver version" problem to show up on single-GPU runs as well...

    565750:     find library=libpthread.so.0 [0]; searching
    565750:      search path=/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/glibc-hwcaps/x86-64-v3:/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/glibc-hwcaps/x86-64-v2:/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/tls/x86_64/x86_64:/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/tls/x86_64:/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/tls/x86_64:/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/tls:/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/x86_64/x86_64:/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/x86_64:/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/x86_64:/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib              (RPATH from file python)
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/glibc-hwcaps/x86-64-v3/libpthread.so.0
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/glibc-hwcaps/x86-64-v2/libpthread.so.0
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/tls/x86_64/x86_64/libpthread.so.0
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/tls/x86_64/libpthread.so.0
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/tls/x86_64/libpthread.so.0
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/tls/libpthread.so.0
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/x86_64/x86_64/libpthread.so.0
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/x86_64/libpthread.so.0
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/x86_64/libpthread.so.0
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/libpthread.so.0
    565750:      search cache=/etc/ld.so.cache
    565750:       trying file=/lib64/libpthread.so.0
    565750:
    565750:     find library=libdl.so.2 [0]; searching
    565750:      search path=/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib           (RPATH from file python)
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/libdl.so.2
    565750:      search cache=/etc/ld.so.cache
    565750:       trying file=/lib64/libdl.so.2
    565750:
    565750:     find library=libutil.so.1 [0]; searching
    565750:      search path=/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib           (RPATH from file python)
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/libutil.so.1
    565750:      search cache=/etc/ld.so.cache
    565750:       trying file=/lib64/libutil.so.1
    565750:
    565750:     find library=libm.so.6 [0]; searching
    565750:      search path=/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib           (RPATH from file python)
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/libm.so.6
    565750:      search cache=/etc/ld.so.cache
    565750:       trying file=/lib64/libm.so.6
    565750:
    565750:     find library=libc.so.6 [0]; searching
    565750:      search path=/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib           (RPATH from file python)
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/bin/../lib/libc.so.6
    565750:      search cache=/etc/ld.so.cache
    565750:       trying file=/lib64/libc.so.6
    565750:
    565750:
    565750:     calling init: /lib64/ld-linux-x86-64.so.2
    565750:
    565750:
    565750:     calling init: /lib64/libc.so.6
    565750:
    565750:
    565750:     calling init: /lib64/libm.so.6
    565750:
    565750:
    565750:     calling init: /lib64/libutil.so.1
    565750:
    565750:
    565750:     calling init: /lib64/libdl.so.2
    565750:
    565750:
    565750:     calling init: /lib64/libpthread.so.0
    565750:
    565750:
    565750:     initialize program: python
    565750:
    565750:
    565750:     transferring control: python
    565750:
    565750:     find library=libffi.so.8 [0]; searching
    565750:      search path=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../glibc-hwcaps/x86-64-v3:/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../glibc-hwcaps/x86-64-v2:/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../tls/x86_64/x86_64:/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../tls/x86_64:/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../tls/x86_64:/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../tls:/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../x86_64/x86_64:/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../x86_64:/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../x86_64:/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../..          (RPATH from file /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so)
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../glibc-hwcaps/x86-64-v3/libffi.so.8
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../glibc-hwcaps/x86-64-v2/libffi.so.8
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../tls/x86_64/x86_64/libffi.so.8
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../tls/x86_64/libffi.so.8
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../tls/x86_64/libffi.so.8
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../tls/libffi.so.8
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../x86_64/x86_64/libffi.so.8
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../x86_64/libffi.so.8
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../x86_64/libffi.so.8
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../libffi.so.8
    565750:
    565750:
    565750:     calling init: /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../libffi.so.8
    565750:
    565750:
    565750:     calling init: /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so
    565750:
    565750:
    565750:     calling init: /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/_struct.cpython-311-x86_64-linux-gnu.so
    565750:
    565750:     find library=libcusolver.so [0]; searching
    565750:      search path=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../..             (RPATH from file /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so)
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../libcusolver.so
    565750:
    565750:     find library=libcublas.so.11 [0]; searching
    565750:      search path=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../..             (RPATH from file /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so)
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../libcublas.so.11
    565750:
    565750:     find library=libcublasLt.so.11 [0]; searching
    565750:      search path=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../..             (RPATH from file /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so)
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../libcublasLt.so.11
    565750:
    565750:     find library=librt.so.1 [0]; searching
    565750:      search path=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../..             (RPATH from file /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so)
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../librt.so.1
    565750:      search cache=/etc/ld.so.cache
    565750:       trying file=/lib64/librt.so.1
    565750:
    565750:     find library=libgcc_s.so.1 [0]; searching
    565750:      search path=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../..             (RPATH from file /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so)
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../libgcc_s.so.1
    565750:
    565750:
    565750:     calling init: /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../libgcc_s.so.1
    565750:
    565750:
    565750:     calling init: /lib64/librt.so.1
    565750:
    565750:
    565750:     calling init: /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../libcublasLt.so.11
    565750:
    565750:     find library=libcuda.so.1 [0]; searching
    565750:      search path=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../..             (RPATH from file /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so)
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../libcuda.so.1
    565750:      search cache=/etc/ld.so.cache
    565750:       trying file=/lib64/libcuda.so.1
    565750:
    565750:
    565750:     calling init: /lib64/libcuda.so.1
    565750:
    565750:     find library=libnvrtc.so [0]; searching
    565750:      search path=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../..             (RPATH from file /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so)
    565750:       trying file=/opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../libnvrtc.so
    565750:
    565750:
    565750:     calling init: /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../libnvrtc.so
    565750:
    565750:
    565750:     calling init: /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../libcublas.so.11
    565750:
    565750:
    565750:     calling init: /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../libcusolver.so
    565750:
    565750:
    565750:     calling fini: /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../libnvrtc.so [0]
    565750:
    565750:
    565750:     calling fini: /lib64/libcuda.so.1 [0]
    565750:
    565750:
    565750:     calling fini: python [0]
    565750:
    565750:
    565750:     calling fini: /lib64/libutil.so.1 [0]
    565750:
    565750:
    565750:     calling fini: /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so [0]
    565750:
    565750:
    565750:     calling fini: /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../libffi.so.8 [0]
    565750:
    565750:
    565750:     calling fini: /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/_struct.cpython-311-x86_64-linux-gnu.so [0]
    565750:
    565750:
    565750:     calling fini: /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../libcusolver.so [0]
    565750:
    565750:
    565750:     calling fini: /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../libcublas.so.11 [0]
    565750:
    565750:
    565750:     calling fini: /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../libcublasLt.so.11 [0]
    565750:
    565750:
    565750:     calling fini: /lib64/libm.so.6 [0]
    565750:
    565750:
    565750:     calling fini: /lib64/libdl.so.2 [0]
    565750:
    565750:
    565750:     calling fini: /lib64/libpthread.so.0 [0]
    565750:
    565750:
    565750:     calling fini: /lib64/librt.so.1 [0]
    565750:
    565750:
    565750:     calling fini: /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../libgcc_s.so.1 [0]
    565750:
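
For anyone who finds the trace above hard to read, here is a minimal sketch of one way to condense it. It assumes the LD_DEBUG output (which glibc writes to stderr by default) has been saved as text, and it relies only on the `calling init:` lines visible in the trace, which name each shared object as its initializer runs. The names `loaded_libraries` and `find_loaded` are hypothetical helpers, not part of gpu4pyscf or PySCF.

```python
import posixpath
import re

# Matches lines such as "    565750:     calling init: /lib64/libcuda.so.1"
INIT_RE = re.compile(r"^\s*\d+:\s+calling init:\s+(\S+)")

def loaded_libraries(ld_debug_text):
    """Return normalized paths of objects whose initializers ran, in order."""
    paths = []
    for line in ld_debug_text.splitlines():
        match = INIT_RE.match(line)
        if match:
            # Collapse "lib-dynload/../.." style segments into a plain path.
            paths.append(posixpath.normpath(match.group(1)))
    return paths

def find_loaded(ld_debug_text, name):
    """Return the loaded paths whose file name starts with `name`."""
    return [p for p in loaded_libraries(ld_debug_text)
            if posixpath.basename(p).startswith(name)]

# Three lines copied from the trace above.
sample = """\
    565750:     calling init: /lib64/libcuda.so.1
    565750:
    565750:     calling init: /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../libcublas.so.11
    565750:     calling init: /opt/shared/conda/envs/tstoolkit-dev3/lib/python3.11/lib-dynload/../../libcusolver.so
"""
print(find_loaded(sample, "libcusolver"))
# ['/opt/shared/conda/envs/tstoolkit-dev3/lib/libcusolver.so']
```

Applied to the trace in this comment, this reports the copy inside the tstoolkit-dev3 Conda environment as the only libcusolver that was initialized. The traceback in the first comment points at a different environment path (miniforge3/envs/tstoolkit), so it may be worth repeating the check from that environment.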

@wxj6000
Collaborator

wxj6000 commented Feb 7, 2025

@cvsik Does the following minimal script work on your side?

from gpu4pyscf.lib.cusolver import eigh

import cupy

h = cupy.random.rand(4,4)
h += h.T

s = cupy.random.rand(4,4)
s += s.T

e, v = eigh(h, s)
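
A caveat when reusing this script as a standalone check, offered as a note on the math rather than on gpu4pyscf internals: the generalized symmetric eigenproblem is normally posed with a positive-definite second matrix (the overlap matrix in SCF), and a random matrix added to its own transpose is symmetric but not guaranteed to be positive definite. The pure-Python sketch below needs no GPU and shows the difference, along with one common way to build a positive-definite test matrix. The helpers `is_positive_definite` and `make_spd` are hypothetical names.

```python
import math

def is_positive_definite(m, tol=1e-12):
    """Return True if the symmetric matrix m admits a Cholesky factorization."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if acc <= tol:  # non-positive pivot: not positive definite
                    return False
                low[i][j] = math.sqrt(acc)
            else:
                low[i][j] = acc / low[j][j]
    return True

def make_spd(a):
    """Build a @ a.T + n * I, symmetric positive definite for any real square a."""
    n = len(a)
    return [[sum(a[i][k] * a[j][k] for k in range(n)) + (n if i == j else 0.0)
             for j in range(n)] for i in range(n)]

a = [[0.0, 1.0], [0.0, 0.0]]
sym = [[a[i][j] + a[j][i] for j in range(2)] for i in range(2)]  # like s += s.T
print(is_positive_definite(sym))          # False: [[0, 1], [1, 0]] is indefinite
print(is_positive_definite(make_spd(a)))  # True
```

With CuPy, the analogous construction would be `a = cupy.random.rand(4, 4)` followed by `s = a @ a.T + 4 * cupy.eye(4)`.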

@cvsik
Author

cvsik commented Feb 10, 2025

@wxj6000 Yes, the minimal script works fine on both a single GPU and two GPUs.

@cvsik
Author

cvsik commented Feb 10, 2025

Interestingly, sometimes there's no eigh error, just NaN energies all over the place (on 2 GPUs; a single GPU is always fine) 😢

Here's the pyscf.log file from the 00-h2o.py run; maybe that also helps with debugging this problem.

pyscf.log

@wxj6000
Collaborator

wxj6000 commented Feb 10, 2025

@cvsik Thanks for the log file. So the eigh solver itself works fine. We have seen the NaN issue before. There is a PR that tries to fix the issue (#309). The fix has been released in v1.3.1. Please let me know if the issue persists.
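
Since a matrix that already contains NaN can make an eigensolver fail or return NaN energies, one way to tell the two symptoms apart is to test the inputs for non-finite values before the solve. The sketch below is a pure-Python illustration on nested lists. The names `all_finite` and `checked_solve` are hypothetical, and the solver passed in is a stand-in, not the gpu4pyscf `eigh`.

```python
import math

def all_finite(matrix):
    """Return True if every entry of a nested list of floats is finite."""
    return all(math.isfinite(x) for row in matrix for x in row)

def checked_solve(solver, h, s):
    """Call solver(h, s) only after confirming both inputs are free of NaN/inf."""
    for name, mat in (("h", h), ("s", s)):
        if not all_finite(mat):
            raise ValueError(f"matrix {name} contains NaN or inf before the solve")
    return solver(h, s)

def diag_solver(h, s):
    """Stand-in solver: for diagonal h and identity s, eigenvalues are the diagonal."""
    return sorted(h[i][i] for i in range(len(h)))

good = [[1.0, 0.0], [0.0, 2.0]]
bad = [[1.0, float("nan")], [float("nan"), 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

print(checked_solve(diag_solver, good, identity))  # [1.0, 2.0]
try:
    checked_solve(diag_solver, bad, identity)
except ValueError as err:
    print(err)  # matrix h contains NaN or inf before the solve
```

For CuPy arrays the same test can be written as `bool(cupy.isfinite(fock).all())`, where `fock` stands for whichever array is about to be diagonalized.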
