Does opa-psm2 support GDRDMA on the ROCm platform, and what do I have to do to enable GDRDMA?
I use opa-psm2-PSM2_11.2.NCCL and the psm2-nccl plugin on the ROCm platform, with the environment set like this:
export PSM2_GPUDIRECT=1
export PSM2_CUDA=1
Running rccl-tests gives this error:
node37.219242 Unhandled error in TID Update: Bad address
Debug the core file:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/home/fd/rccl-tests-master/build/all_gather_perf --minbytes=2621'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00002b7bcdacd207 in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x2b7f14a00700 (LWP 219289))]
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.172-2.el7.x86_64 elfutils-libs-0.172-2.el7.x86_64 glibc-2.17-260.el7.x86_64 infinipath-psm-3.3-26_g604758e_open.2.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-9.el7.x86_64 libevent-2.0.21-4.el7.x86_64 libgcc-4.8.5-36.el7.x86_64 libibverbs-17.2-3.el7.x86_64 libnl3-3.2.28-4.el7.x86_64 librdmacm-17.2-3.el7.x86_64 libstdc++-4.8.5-36.el7.x86_64 libuuid-2.23.2-59.el7.x86_64 numactl-libs-2.0.9-7.el7.x86_64 sqlite-3.7.17-8.el7.x86_64 systemd-libs-219-62.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0 0x00002b7bcdacd207 in raise () from /lib64/libc.so.6
#1 0x00002b7bcdace8f8 in abort () from /lib64/libc.so.6
#2 0x00002b7decdf3054 in psmi_errhandler_psm (ep=ep@entry=0x0, err=err@entry=PSM2_INTERNAL_ERR, error_string=error_string@entry=0x2b7f149f9acc " Unhandled error in TID Update: Bad address\n", token=token@entry=0x2b7f149f9ac0) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm_error.c:96
#3 0x00002b7decdf360d in psmi_handle_error (ep=0xfffffffffffffffe, error=PSM2_INTERNAL_ERR, buf=) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm_error.c:183
#4 0x00002b7dece14e7c in ips_tidcache_register (tidc=tidc@entry=0x2b7e303bf458, start=start@entry=47820958728192, length=131072, firstidx=firstidx@entry=0x2b7f149f9e4c) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_tidcache.c:221
#5 0x00002b7dece1524c in ips_tidcache_acquire (tidc=tidc@entry=0x2b7e303bf458, buf=0x2b7e2f420000, length=length@entry=0x2b7f149f9ef0, tid_array=tid_array@entry=0x2b7e303bf734, tidcnt=tidcnt@entry=0x2b7f149f9ef4, tidoff=tidoff@entry=0x2b7f149f9eec) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_tidcache.c:471
#6 0x00002b7dece11331 in ips_tid_recv_alloc_frag (nbytes_this=131072, tidrecvc=0x2b7e303bf650, protoexp=0x2b7e303bf440) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_expected.c:1969
#7 ips_tid_recv_alloc (ptidrecvc=, nbytes_this=131072, getreq=, ipsaddr=0x2b7e30853210, protoexp=0x2b7e303bf440) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_expected.c:2135
#8 ips_tid_pendtids_timer_callback (timer=timer@entry=0x2b7e303bf610, current=current@entry=0) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_expected.c:2379
#9 0x00002b7dece11898 in ips_protoexp_tid_get_from_token (protoexp=0x2b7e303bf440, buf=0x2b7e2f420000, length=2097152, epaddr=0x2b7e30853210, remote_tok=1023, flags=, callback=0x2b7dece16b50 <ips_proto_mq_rv_complete_exp>, context=0x2b7e301b9920) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_expected.c:587
#10 0x00002b7dece18793 in ips_proto_mq_rts_match_callback (req=0x2b7e301b9920, was_posted=was_posted@entry=1) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_mq.c:1152
#11 0x00002b7dece19690 in ips_proto_mq_handle_rts (rcv_ev=0x2b7f149fa200) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_mq.c:1536
#12 0x00002b7dece05c8f in ips_proto_process_packet (rcv_ev=0x2b7f149fa200) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_help.h:555
#13 ips_recvhdrq_progress (recvq=0x2b7e301bfb98) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_recvhdrq.c:543
#14 0x00002b7dece0353c in ips_ptl_poll (ptl_gen=0x2b7e301b9e80, _ignored=) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ptl.c:541
#15 0x00002b7dece00f8f in __psmi_poll_internal (ep=0x2b7e301b9ac0, poll_amsh=) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm.c:1071
#16 0x00002b7decdfaeec in psmi_mq_ipeek_inner (status_copy=, status=0x0, oreq=0x2b7f149fa438, mq=0x2b7e3010bf80) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm_mq.c:1135
#17 _psm2_mq_ipeek (mq=0x2b7e3010bf80, oreq=0x2b7f149fa438, status=0x0) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm_mq.c:1174
#18 0x00002b7e23a027b3 in psm2_nccl_test () from /home/fd/psm2-nccl-master/librccl-net.so
#19 0x00002b7bc88fee2d in ncclNetTest (request=0x3586a, done=0x2b7f149fa4f4, size=0x2b7f149fa4cc) at /home/fd/rccl-dtk-21.10/src/include/net.h:29
#20 netRecvProxy (args=) at /home/fd/rccl-dtk-21.10/src/transport/net.cc:516
#21 0x00002b7bc8916de4 in progressOps (state=, opsPtr=, idle=, comm=) at /home/fd/rccl-dtk-21.10/src/proxy.cc:342
#22 persistentThread (comm=0x2b7e30000c00) at /home/fd/rccl-dtk-21.10/src/proxy.cc:440
#23 0x00002b7bc839edd5 in start_thread () from /lib64/libpthread.so.0
#24 0x00002b7bcdb94ead in clone () from /lib64/libc.so.6
(gdb)