TL/MLX5: a2a various optimizations #1067
What
This PR contains various optimizations for TL/MLX5/a2a, leading to significant performance gains.
Support rectangular blocks
This is a critical optimization that brings immediate performance benefits, since it gives more flexibility in choosing the block dimensions so as to better saturate the transpose unit. To complete this feature, we expose two (independent) options for determining the block dimensions h and w (a sketch of such a shape selection follows the list):
- `FORCE_WIDER`, imposing h <= w
- `FORCE_LONGER`, imposing h >= w
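As an illustration only (not the PR's implementation), the sketch below shows how such a constrained shape selection could work; the scoring criterion (maximize h * w under an element budget per device-memory chunk) and all names (`pick_block_dims`, `max_elems`, `block_shape_policy`) are invented for this example:

```c
/* Minimal sketch, assuming a simple "maximize block area under a budget"
 * heuristic; the real selection logic in TL/MLX5 may differ. */
#include <stdio.h>

enum block_shape_policy { SHAPE_ANY, SHAPE_FORCE_WIDER, SHAPE_FORCE_LONGER };

static void pick_block_dims(int ppn, int max_elems, enum block_shape_policy pol,
                            int *best_h, int *best_w)
{
    *best_h = *best_w = 1;
    for (int h = 1; h <= ppn; h++) {
        for (int w = 1; w <= ppn; w++) {
            if (h * w > max_elems)
                continue;                         /* block must fit the budget  */
            if (pol == SHAPE_FORCE_WIDER && h > w)
                continue;                         /* enforce h <= w             */
            if (pol == SHAPE_FORCE_LONGER && h < w)
                continue;                         /* enforce h >= w             */
            if (h * w > (*best_h) * (*best_w)) {  /* keep the largest block     */
                *best_h = h;
                *best_w = w;
            }
        }
    }
}

int main(void)
{
    int h, w;
    pick_block_dims(7 /* ppn */, 32 /* element budget */, SHAPE_FORCE_WIDER, &h, &w);
    printf("chosen block: %d x %d\n", h, w);
    return 0;
}
```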
Reuse device memory chunks for several blocks
As long as two blocks need to be sent to the same remote peer, the WQEs dealing with those blocks can 1) be enqueued on the same QP and 2) use the same device memory chunks. This allows a single dm chunk to be used to post (and offload to the NIC) the processing of a whole batch of blocks, so the algorithm spends less time waiting for free device memory chunks. This behavior is controlled by the option `NBR_SERIALIZED_BATCHES`.
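A minimal sketch of the idea, with invented names (`dm_chunk_get`/`dm_chunk_put`, `blocks_per_chunk`) and plain counters standing in for the real dm-chunk pool; it only illustrates why sharing one chunk across a batch of same-peer blocks reduces the number of chunk acquisitions the algorithm may have to wait on:

```c
/* Illustrative only: counts dm-chunk acquisitions with and without reuse. */
#include <stdio.h>

static int dm_acquisitions = 0;

static int  dm_chunk_get(void)     { return dm_acquisitions++; }
static void dm_chunk_put(int chunk) { (void)chunk; }

static void post_blocks_to_peer(int nblocks, int blocks_per_chunk)
{
    for (int b = 0; b < nblocks; b += blocks_per_chunk) {
        int chunk = dm_chunk_get();       /* one chunk per batch of blocks...  */
        int batch = nblocks - b < blocks_per_chunk ? nblocks - b
                                                   : blocks_per_chunk;
        for (int i = 0; i < batch; i++) {
            /* enqueue transpose + send WQEs for block b+i on this peer's QP,
             * all going through the same dm chunk */
        }
        dm_chunk_put(chunk);              /* ...instead of one chunk per block */
    }
}

int main(void)
{
    post_blocks_to_peer(12, 1);
    printf("no reuse:   %d dm-chunk acquisitions\n", dm_acquisitions);
    dm_acquisitions = 0;
    post_blocks_to_peer(12, 4);
    printf("with reuse: %d dm-chunk acquisitions\n", dm_acquisitions);
    return 0;
}
```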
Batch the inter-node RDMA sends
We allow successive results of the transpose WQEs to be batched before being sent to a remote peer. This better saturates the network by aggregating messages. The batch size is controlled by `SEND_BATCH_SIZE`.
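An illustrative sketch (not the actual code path) of the aggregation effect: `rdma_write` is just a stand-in counter here, and the point is simply that a larger `SEND_BATCH_SIZE` means fewer, larger messages on the wire:

```c
/* Illustrative only: counts RDMA writes for a given send batch size. */
#include <stdio.h>

static int rdma_sends = 0;

static void rdma_write(int nbytes)
{
    rdma_sends++;
    (void)nbytes;
}

static void send_transposed_blocks(int nblocks, int block_bytes, int send_batch_size)
{
    int pending = 0;
    for (int b = 0; b < nblocks; b++) {
        pending++;                              /* transpose result kept locally */
        if (pending == send_batch_size) {       /* flush a full batch            */
            rdma_write(pending * block_bytes);
            pending = 0;
        }
    }
    if (pending)                                /* flush the tail                */
        rdma_write(pending * block_bytes);
}

int main(void)
{
    send_transposed_blocks(16, 4096, 1);
    printf("batch size 1: %d RDMA writes\n", rdma_sends);
    rdma_sends = 0;
    send_transposed_blocks(16, 4096, 4);
    printf("batch size 4: %d RDMA writes\n", rdma_sends);
    return 0;
}
```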
Iterate across nodes before blocks when posting the WQEs
This allows better saturation of the network. It is controlled by the option `NBR_BATCHES_PER_PASSAGE`, which sets the number of batches to send to a remote peer before moving on to the next one. The old behavior corresponds to large values of this parameter, i.e., `NBR_BATCHES_PER_PASSAGE >> 1`.
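A toy sketch of the posting order, using plain counters rather than real WQE posting: with a small `NBR_BATCHES_PER_PASSAGE` the loop interleaves peers, while a very large value degenerates to the old behavior of finishing one peer before moving to the next:

```c
/* Illustrative only: prints the order in which batches would be posted. */
#include <stdio.h>

static void post_batches(int npeers, int batches_per_peer, int nbr_batches_per_passage)
{
    int posted[16] = {0};                /* batches already posted per peer (npeers <= 16) */
    int remaining  = npeers * batches_per_peer;

    while (remaining > 0) {
        for (int peer = 0; peer < npeers; peer++) {          /* iterate across nodes first */
            for (int k = 0; k < nbr_batches_per_passage &&
                            posted[peer] < batches_per_peer; k++) {
                printf("post batch %d to peer %d\n", posted[peer], peer);
                posted[peer]++;
                remaining--;
            }
        }
    }
}

int main(void)
{
    post_batches(3 /* peers */, 4 /* batches each */, 1 /* batches per passage */);
    return 0;
}
```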
Option to force the regular case
Through the TL/MLX5 env option `FORCE_REGULAR`, we force the chosen block dimensions to divide ppn (for example, with ppn = 16, only dimensions of 1, 2, 4, 8, or 16 are allowed). This option is useful 1) for debugging, and 2) because the regular case is expected to perform better than the irregular case.

All these optimizations are independent, but we introduce them in a single PR to avoid resolving many conflicts.