TL/MLX5: a2a various optimizations #1067
What
This PR contains various optimizations for TL/MLX5/a2a, leading to significant performance gains.
Support rectangular blocks
This is a critical optimization that brings immediate performance benefits, since it gives more flexibility in choosing the block dimensions so as to better saturate the transpose unit. To complete this feature, we expose two (independent) options for determining the block dimensions h and w (a sketch of such a shape selection follows the list):
- `FORCE_WIDER`, imposing h <= w
- `FORCE_LONGER`, imposing h >= w
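As an illustration only (not the PR's implementation), the sketch below shows how such a constrained shape selection could work; the scoring criterion (maximize h * w under an element budget per device-memory chunk) and all names (`pick_block_dims`, `max_elems`, `block_shape_policy`) are invented for this example:

```c
/* Minimal sketch, assuming a simple "maximize block area under a budget"
 * heuristic; the real selection logic in TL/MLX5 may differ. */
#include <stdio.h>

enum block_shape_policy { SHAPE_ANY, SHAPE_FORCE_WIDER, SHAPE_FORCE_LONGER };

static void pick_block_dims(int ppn, int max_elems, enum block_shape_policy pol,
                            int *best_h, int *best_w)
{
    *best_h = *best_w = 1;
    for (int h = 1; h <= ppn; h++) {
        for (int w = 1; w <= ppn; w++) {
            if (h * w > max_elems)
                continue;                         /* block must fit the budget  */
            if (pol == SHAPE_FORCE_WIDER && h > w)
                continue;                         /* enforce h <= w             */
            if (pol == SHAPE_FORCE_LONGER && h < w)
                continue;                         /* enforce h >= w             */
            if (h * w > (*best_h) * (*best_w)) {  /* keep the largest block     */
                *best_h = h;
                *best_w = w;
            }
        }
    }
}

int main(void)
{
    int h, w;
    pick_block_dims(7 /* ppn */, 32 /* element budget */, SHAPE_FORCE_WIDER, &h, &w);
    printf("chosen block: %d x %d\n", h, w);
    return 0;
}
```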
Reuse device memory chunks for several blocks
As long as two blocks need to be sent to the same remote peer, the WQEs dealing with those blocks can 1) be enqueued on the same QP and 2) use the same device memory chunks. This allows a single dm chunk to be used to post (and offload to the NIC) the processing of a whole batch of blocks, so the algorithm spends less time waiting for free device memory chunks. This behavior is controlled by the option `NBR_SERIALIZED_BATCHES`.
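A minimal sketch of the idea, with invented names (`dm_chunk_get`/`dm_chunk_put`, `blocks_per_chunk`) and plain counters standing in for the real dm-chunk pool; it only illustrates why sharing one chunk across a batch of same-peer blocks reduces the number of chunk acquisitions the algorithm may have to wait on:

```c
/* Illustrative only: counts dm-chunk acquisitions with and without reuse. */
#include <stdio.h>

static int dm_acquisitions = 0;

static int  dm_chunk_get(void)     { return dm_acquisitions++; }
static void dm_chunk_put(int chunk) { (void)chunk; }

static void post_blocks_to_peer(int nblocks, int blocks_per_chunk)
{
    for (int b = 0; b < nblocks; b += blocks_per_chunk) {
        int chunk = dm_chunk_get();       /* one chunk per batch of blocks...  */
        int batch = nblocks - b < blocks_per_chunk ? nblocks - b
                                                   : blocks_per_chunk;
        for (int i = 0; i < batch; i++) {
            /* enqueue transpose + send WQEs for block b+i on this peer's QP,
             * all going through the same dm chunk */
        }
        dm_chunk_put(chunk);              /* ...instead of one chunk per block */
    }
}

int main(void)
{
    post_blocks_to_peer(12, 1);
    printf("no reuse:   %d dm-chunk acquisitions\n", dm_acquisitions);
    dm_acquisitions = 0;
    post_blocks_to_peer(12, 4);
    printf("with reuse: %d dm-chunk acquisitions\n", dm_acquisitions);
    return 0;
}
```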
Batch the inter-node RDMA sends
We allow successive results of the transpose WQEs to be batched before being sent to a remote peer. This better saturates the network by aggregating messages. The batch size is controlled by `SEND_BATCH_SIZE`.
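An illustrative sketch (not the actual code path) of the aggregation effect: `rdma_write` is just a stand-in counter here, and the point is simply that a larger `SEND_BATCH_SIZE` means fewer, larger messages on the wire:

```c
/* Illustrative only: counts RDMA writes for a given send batch size. */
#include <stdio.h>

static int rdma_sends = 0;

static void rdma_write(int nbytes)
{
    rdma_sends++;
    (void)nbytes;
}

static void send_transposed_blocks(int nblocks, int block_bytes, int send_batch_size)
{
    int pending = 0;
    for (int b = 0; b < nblocks; b++) {
        pending++;                              /* transpose result kept locally */
        if (pending == send_batch_size) {       /* flush a full batch            */
            rdma_write(pending * block_bytes);
            pending = 0;
        }
    }
    if (pending)                                /* flush the tail                */
        rdma_write(pending * block_bytes);
}

int main(void)
{
    send_transposed_blocks(16, 4096, 1);
    printf("batch size 1: %d RDMA writes\n", rdma_sends);
    rdma_sends = 0;
    send_transposed_blocks(16, 4096, 4);
    printf("batch size 4: %d RDMA writes\n", rdma_sends);
    return 0;
}
```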
Iterate across nodes before blocks when posting the WQEs
This allows better saturation of the network. It is controlled by the option `NBR_BATCHES_PER_PASSAGE`, which sets the number of batches to send to a remote peer before moving on to the next one. The old behavior corresponds to large values of this parameter, i.e., `NBR_BATCHES_PER_PASSAGE >> 1`.
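A toy sketch of the posting order, using plain counters rather than real WQE posting: with a small `NBR_BATCHES_PER_PASSAGE` the loop interleaves peers, while a very large value degenerates to the old behavior of finishing one peer before moving to the next:

```c
/* Illustrative only: prints the order in which batches would be posted. */
#include <stdio.h>

static void post_batches(int npeers, int batches_per_peer, int nbr_batches_per_passage)
{
    int posted[16] = {0};                /* batches already posted per peer (npeers <= 16) */
    int remaining  = npeers * batches_per_peer;

    while (remaining > 0) {
        for (int peer = 0; peer < npeers; peer++) {          /* iterate across nodes first */
            for (int k = 0; k < nbr_batches_per_passage &&
                            posted[peer] < batches_per_peer; k++) {
                printf("post batch %d to peer %d\n", posted[peer], peer);
                posted[peer]++;
                remaining--;
            }
        }
    }
}

int main(void)
{
    post_batches(3 /* peers */, 4 /* batches each */, 1 /* batches per passage */);
    return 0;
}
```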
Option to force the regular case
Through the TL/MLX5 env option `FORCE_REGULAR`, we force the chosen block dimensions to divide ppn (for example, with ppn = 16, only dimensions of 1, 2, 4, 8, or 16 are allowed). This option is useful 1) for debugging, and 2) because the regular case is expected to perform better than the irregular case.

All these optimizations are independent, but we introduce them in a single PR to avoid resolving many conflicts.