How to use the GDRDMA on the ROCm platform? #63

Open
flyingdown opened this issue Dec 16, 2021 · 2 comments

Comments

@flyingdown

flyingdown commented Dec 16, 2021

Does opa-psm2 support GDRDMA on the ROCm platform, and what do I have to do to enable it?
I am using opa-psm2-PSM2_11.2.NCCL with the psm2-nccl plugin on the ROCm platform, with the environment set like this:
export PSM2_GPUDIRECT=1
export PSM2_CUDA=1

Running rccl-tests produces this error:

node37.219242 Unhandled error in TID Update: Bad address

[node37:219242] *** Process received signal ***
[node37:219242] Signal: Aborted (6)
[node37:219242] Signal code: (-6)
[node37:219242] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2b7bc83a65d0]
[node37:219242] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2b7bcdacd207]
[node37:219242] [ 2] /lib64/libc.so.6(abort+0x148)[0x2b7bcdace8f8]
[node37:219242] [ 3] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x16054)[0x2b7decdf3054]
[node37:219242] [ 4] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x1660d)[0x2b7decdf360d]
[node37:219242] [ 5] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x37e7c)[0x2b7dece14e7c]
[node37:219242] [ 6] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x3824c)[0x2b7dece1524c]
[node37:219242] [ 7] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x34331)[0x2b7dece11331]
[node37:219242] [ 8] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x34898)[0x2b7dece11898]
[node37:219242] [ 9] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x3b793)[0x2b7dece18793]
[node37:219242] [10] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x3c690)[0x2b7dece19690]
[node37:219242] [11] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x28c8f)[0x2b7dece05c8f]
[node37:219242] [12] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x2653c)[0x2b7dece0353c]
[node37:219242] [13] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x23f8f)[0x2b7dece00f8f]
[node37:219242] [14] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(psm2_mq_ipeek+0x7c)[0x2b7decdfaeec]
[node37:219242] [15] /home/fd/psm2-nccl-master/librccl-net.so(psm2_nccl_test+0xb3)[0x2b7e23a027b3]

Debugging the core file gives:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/home/fd/rccl-tests-master/build/all_gather_perf --minbytes=2621'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00002b7bcdacd207 in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x2b7f14a00700 (LWP 219289))]
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.172-2.el7.x86_64 elfutils-libs-0.172-2.el7.x86_64 glibc-2.17-260.el7.x86_64 infinipath-psm-3.3-26_g604758e_open.2.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-9.el7.x86_64 libevent-2.0.21-4.el7.x86_64 libgcc-4.8.5-36.el7.x86_64 libibverbs-17.2-3.el7.x86_64 libnl3-3.2.28-4.el7.x86_64 librdmacm-17.2-3.el7.x86_64 libstdc++-4.8.5-36.el7.x86_64 libuuid-2.23.2-59.el7.x86_64 numactl-libs-2.0.9-7.el7.x86_64 sqlite-3.7.17-8.el7.x86_64 systemd-libs-219-62.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0 0x00002b7bcdacd207 in raise () from /lib64/libc.so.6
#1 0x00002b7bcdace8f8 in abort () from /lib64/libc.so.6
#2 0x00002b7decdf3054 in psmi_errhandler_psm (ep=ep@entry=0x0, err=err@entry=PSM2_INTERNAL_ERR, error_string=error_string@entry=0x2b7f149f9acc " Unhandled error in TID Update: Bad address\n", token=token@entry=0x2b7f149f9ac0)
at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm_error.c:96
#3 0x00002b7decdf360d in psmi_handle_error (ep=0xfffffffffffffffe, error=PSM2_INTERNAL_ERR, buf=) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm_error.c:183
#4 0x00002b7dece14e7c in ips_tidcache_register (tidc=tidc@entry=0x2b7e303bf458, start=start@entry=47820958728192, length=131072, firstidx=firstidx@entry=0x2b7f149f9e4c)
at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_tidcache.c:221
#5 0x00002b7dece1524c in ips_tidcache_acquire (tidc=tidc@entry=0x2b7e303bf458, buf=0x2b7e2f420000, length=length@entry=0x2b7f149f9ef0, tid_array=tid_array@entry=0x2b7e303bf734, tidcnt=tidcnt@entry=0x2b7f149f9ef4,
tidoff=tidoff@entry=0x2b7f149f9eec) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_tidcache.c:471
#6 0x00002b7dece11331 in ips_tid_recv_alloc_frag (nbytes_this=131072, tidrecvc=0x2b7e303bf650, protoexp=0x2b7e303bf440) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_expected.c:1969
#7 ips_tid_recv_alloc (ptidrecvc=, nbytes_this=131072, getreq=, ipsaddr=0x2b7e30853210, protoexp=0x2b7e303bf440) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_expected.c:2135
#8 ips_tid_pendtids_timer_callback (timer=timer@entry=0x2b7e303bf610, current=current@entry=0) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_expected.c:2379
#9 0x00002b7dece11898 in ips_protoexp_tid_get_from_token (protoexp=0x2b7e303bf440, buf=0x2b7e2f420000, length=2097152, epaddr=0x2b7e30853210, remote_tok=1023, flags=, callback=0x2b7dece16b50 <ips_proto_mq_rv_complete_exp>,
context=0x2b7e301b9920) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_expected.c:587
#10 0x00002b7dece18793 in ips_proto_mq_rts_match_callback (req=0x2b7e301b9920, was_posted=was_posted@entry=1) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_mq.c:1152
#11 0x00002b7dece19690 in ips_proto_mq_handle_rts (rcv_ev=0x2b7f149fa200) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_mq.c:1536
#12 0x00002b7dece05c8f in ips_proto_process_packet (rcv_ev=0x2b7f149fa200) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_help.h:555
#13 ips_recvhdrq_progress (recvq=0x2b7e301bfb98) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_recvhdrq.c:543
#14 0x00002b7dece0353c in ips_ptl_poll (ptl_gen=0x2b7e301b9e80, _ignored=) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ptl.c:541
#15 0x00002b7dece00f8f in __psmi_poll_internal (ep=0x2b7e301b9ac0, poll_amsh=) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm.c:1071
#16 0x00002b7decdfaeec in psmi_mq_ipeek_inner (status_copy=, status=0x0, oreq=0x2b7f149fa438, mq=0x2b7e3010bf80) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm_mq.c:1135
#17 _psm2_mq_ipeek (mq=0x2b7e3010bf80, oreq=0x2b7f149fa438, status=0x0) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm_mq.c:1174
#18 0x00002b7e23a027b3 in psm2_nccl_test () from /home/fd/psm2-nccl-master/librccl-net.so
#19 0x00002b7bc88fee2d in ncclNetTest (request=0x3586a, done=0x2b7f149fa4f4, size=0x2b7f149fa4cc) at /home/fd/rccl-dtk-21.10/src/include/net.h:29
#20 netRecvProxy (args=) at /home/fd/rccl-dtk-21.10/src/transport/net.cc:516
#21 0x00002b7bc8916de4 in progressOps (state=, opsPtr=, idle=, comm=) at /home/fd/rccl-dtk-21.10/src/proxy.cc:342
#22 persistentThread (comm=0x2b7e30000c00) at /home/fd/rccl-dtk-21.10/src/proxy.cc:440
#23 0x00002b7bc839edd5 in start_thread () from /lib64/libpthread.so.0
#24 0x00002b7bcdb94ead in clone () from /lib64/libc.so.6
(gdb)

@BrendanCunningham
Contributor

@flyingdown Unfortunately, opa-psm2 does not support GDRDMA on ROCm.

@flyingdown
Author

Thanks for your reply. I am not sure whether the adaptation work should be done on the OPA side or the ROCm side. If it falls to OPA, is there any plan for it in the future?
