GPU-Aware Communication #208

ZwFink · 2021-06-24T14:22:15Z

Goal: Use UCX for both intra-node and inter-node GPU communication.

Building and Running on OLCF Summit

Prerequisites: OpenPMIx, CUDA-enabled UCX

Building OpenPMIx

- Requires libevent
    - wget https://github.com/libevent/libevent/releases/download/release-2.1.12-stable/libevent-2.1.12-stable.tar.gz
    - tar -xf libevent-2.1.12-stable.tar.gz
    - cd libevent-2.1.12-stable
    - mkdir build && mkdir install
    - cd build
    - ../configure --prefix=$HOME/libevent-2.1.12-stable/install
    - make -j && make install
- wget https://github.com/openpmix/openpmix/releases/download/v3.1.5/pmix-3.1.5.tar.gz
- tar -xf pmix-3.1.5.tar.gz
- cd pmix-3.1.5
- mkdir build install
- cd build
- ../configure --prefix=$HOME/work/pmix-3.1.5/install --with-libevent=$HOME/libevent-2.1.12-stable/install
- make -j && make install

Building CUDA-enabled UCX

Commit 971aad12d142341770c8f918cb91727cd180cb31 of master branch recommended, v1.9.0 has issues with ucx_perftest on Summit, latest commit breaks CUDA linkage somehow.

- git clone [email protected]:openucx/ucx.git
- cd ucx
- git checkout 971aad12d142341770c8f918cb91727cd180cb31
- ./autogen.sh
- mkdir build install
- cd build
- ../contrib/configure-release --prefix=$HOME/ucx/install --with-cuda=$CUDA_DIR --with-gdrcopy=/sw/summit/gdrcopy/2.0
- make -j
- make install

Building Charm4Py with UCX

The following diff should be applied to the Charm++ repository, and the paths specified should be changed to the local installation: location

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 724e6d8d7..70703c450 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -865,7 +865,7 @@ if(${TARGET} STREQUAL "charm4py")
   set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib)
   add_library(charm SHARED empty.cpp)
   set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib_so)
-  target_link_libraries(charm ck converse memory-default threads-default ldb-rand "-Llib/ -standalone -whole-archive -c++stl -shared")
+  target_link_libraries(charm ck converse memory-default threads-default ldb-rand "-L/ccs/home/jchoi/work/ucx-1.9.0/install/lib -L/ccs/home/jchoi/work/pmix-3.1.5/install/lib -L/sw/summit/gdrcopy/2.0/lib64 -Llib/ -lpmix -lucp -lucm -lucs -luct -lgdrapi -standalone -whole-archive -c++stl -shared")
 
   add_dependencies(charm hwloc)
 endif()
(charm4py) [zanef2@login1]~/charms/charm% git diff > diff
(charm4py) [zanef2@login1]~/charms/charm% cat diff
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 724e6d8d7..70703c450 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -865,7 +865,7 @@ if(${TARGET} STREQUAL "charm4py")
   set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib)
   add_library(charm SHARED empty.cpp)
   set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib_so)
-  target_link_libraries(charm ck converse memory-default threads-default ldb-rand "-Llib/ -standalone -whole-archive -c++stl -shared")
+  target_link_libraries(charm ck converse memory-default threads-default ldb-rand "-L/ccs/home/jchoi/work/ucx-1.9.0/install/lib -L/ccs/home/jchoi/work/pmix-3.1.5/install/lib -L/sw/summit/gdrcopy/2.0/lib64 -Llib/ -lpmix -lucp -lucm -lucs -luct -lgdrapi -standalone -whole-archive -c++stl -shared")
 
   add_dependencies(charm hwloc)
 endif()

The install directories of OpenPMIx and UCX should be passed in with --basedir.

./build charm4py ucx-linux-ppc64le cuda openpmix -j -g --with-production --basedir=$HOME/work/pmix-3.1.5/install --basedir=$HOME/work/ucx-master/install

Then, Charm4Py can be installed normally:

python3 -m pip install --user .

Running Charm4Py with UCX

You can check if UCX is picking up CUDA and GDRCOPY modules properly on the compute nodes by running jsrun -n1 ucx_info -d | grep cuda and jsrun -n1 ucx_info -d | grep gdr.

You may need to pass in --smpiargs="-disable_gpu_hooks" to jsrun if you observe any CUDA hook library failure messages.

Running the Charm4Py GPU latency benchmark (between 2 GPUs, intra-socket): jsrun -n2 -a1 -c2 -g1 -K2 -r2 --smpiargs="-disable_gpu_hooks" ./latency +ppn 1 +pemap L0,8 +commap L4,12

You can change the rendezvous threshold by using the UCX_RNDV_THRESH environment variable. The values that I found to work best for the OSU benchmarks are 131072 for intra-socket, 65536 for inter-socket, and 524288 for inter-node. Note that a too small value (less than 64 in my tests) will cause hangs, probably due to the UCX layer implementation in Charm++.

Charm4Py API

The Charm4Py implementation uses the Channels API. When a channel has been created between chares, there are two options for sending GPU-direct messages: by passing the buffers themselves and by passing arrays containing the pointers/sizes of the arrays. The latter is an optimization when the same buffers are used for communication multiple times, as the cost of determining the address/size of the buffers is paid only once; this optimiziation saves ~20us for each message.

Direct Access

Assume that partner_channel is a channel between two chares, and that d_data_send and d_data_recv are arrays implementing the CUDA Array Interface. To send these arrays through the channel, the following can be used:

# Called on the sender:
partner_channel.send(d_data_send)

# Called on the receiver:
partner_channel.recv(d_data_recv)

Note that multiple arrays can be sent, and that combinations of GPU and host parameters are allowed.

Persistent Communication Optimization

The Direct Access method extracts the address and size of each array using the CUDA Array Interface. Many applications use the same buffer for communication many times, and using the Direct Access the address and size must be extracted each time the Array is used. While we plan to implement a cache to optimize for these situations, we currently offer a workaround that allows to provide this information to the runtime system.

d_data_recv_addr = array.array('L', [0])
d_data_recv_size = array.array('i', [0])
d_data_send_addr = array.array('L', [0])
d_data_send_size = array.array('i', [0])

partner_channel.send(src_ptrs = d_data_send_addr, src_sizes = d_data_send_size)
partner_channel.recv(post_addresses = d_data_recv_addr,
                                   post_sizes = d_data_recv_size
                                  )

References

https://github.com/openucx/ucx/wiki/NVIDIA-GPU-Support
https://openucx.readthedocs.io/en/master/faq.html#working-with-gpu

…emory

…m4py into feature-gpu_direct

…bytes

…ture

…er types

… message

lgtm-com · 2021-06-24T14:33:07Z

This pull request introduces 9 alerts when merging ba3e95c into d95c29d - view on LGTM.com

new alerts:

4 for Unused import
2 for Unused local variable
2 for 'import *' may pollute namespace
1 for Unreachable code

ZwFink and others added 30 commits December 28, 2020 13:41

Start editing benchmark

d8ef62c

Creation of standalone GPU pingpong file

6804a96

Added CPU-only pingpong

28203e1

Added correct print format, pinned memory now used for staging host m…

681129c

…emory

Stage script

5fdfee3

Reset changes to pingpong

5412634

Unify API between the CPU/GPU benchmarks

2d4b2c4

Add macro test

76c50c7

Merge branch 'feature-gpu_direct' of https://github.com/UIUC-PPL/char…

805db5f

…m4py into feature-gpu_direct

add methods for Charm++ CUDA interface

ba4a9f1

sender-side GPU direct

3f34671

add method to get GPU data

4048015

add methods to support receiver-side GPU Direct

fc535bb

hook up to Charm++ GPUDirect functionality

67ed75b

hooks into lib

0f8bf70

fix syntax error

18b4e13

fix libcharm call

22297d2

gpu_recv_bufs now correctly retrieved

0b316d6

fix incorrect check for CUDA data

01a31d0

add more API calls

683a1c0

add call to register future deposit, add size of ck device buffer in …

91a1737

…bytes

WIP

e012722

fix datatype passed to CkGetGPUDirectData

fa20725

Temporary fix for requiring libmpi.so

a952d25

Remove break from GPU pingpong benchmark

266b19f

fix datatype passed to CkGetGPUDirectData

25d3cce

Debugging, fixed future ID and passing pointers to CkGetGPUDirectData

663ac5a

call local charm object, not charm remote when depositing GPU recv fu…

9000ec3

…ture

Remove debugging print statements

baa180b

put the pong back in the benchmark

1061a11

ZwFink added 29 commits March 17, 2021 10:21

Add iwait_map function to fix the issue

581dae9

include np and greenlet

bab3d20

update to use charm.iwait_map

65ad611

Fix charm.iwait_map bug

69dd6f5

include bug fix for charm.iwait_map

304a452

no need for channel assertion

afeaa6b

Merge branch 'bug-iwait_apifix' into feature-gpu_direct

959b038

correctly differentiate between instances when GPU data sent with oth…

0f34c1a

…er types

numpy arrays can now be sent with GPU-direct data

92f2cba

removed debug print

760e716

no need for different function when buffers + gpu arrays sent in same…

c254008

… message

tests sending one device array

75f358d

flake8 compliance

7ee684e

fixed hangup when multiple device arrays are sent

4f84baa

remove comment

49c82a1

change != None to is not None

f9ffaf6

tests for multiple arrays

46851e6

sizes should be int to match charm++ side

e730942

streams supported at charm4py layer

491de28

use streams when provided

fbcef04

device send function array-specific

f7d9083

let CkArraySendWithDeviceDataFromPointers determine number of buffers

aae2ff1

add general slower case for getting gpu pointers/sizes from memoryviews

abf27e3

use Charm++ functionality for GPU-direct group support

8188d29

Groups can now use GPU-direct functionality

895e619

test both arrays and groups

0abd443

make post/src buffers more kwargs more general

53d35f2

update tests with new kwarg names

52856ad

update calls to match new API

ba3e95c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU-Aware Communication #208

GPU-Aware Communication #208

ZwFink commented Jun 24, 2021 •

edited

Loading

lgtm-com bot commented Jun 24, 2021

GPU-Aware Communication #208

Are you sure you want to change the base?

GPU-Aware Communication #208

Conversation

ZwFink commented Jun 24, 2021 • edited Loading

Building and Running on OLCF Summit

Building OpenPMIx

Building CUDA-enabled UCX

Building Charm4Py with UCX

Running Charm4Py with UCX

Charm4Py API

Direct Access

Persistent Communication Optimization

References

lgtm-com bot commented Jun 24, 2021

ZwFink commented Jun 24, 2021 •

edited

Loading