TL/MLX5: a2a various optimizations #1067

Open
wants to merge 3 commits into
base: master

samnordmann
Collaborator

@samnordmann samnordmann commented Jan 2, 2025

What

This PR contains various optimizations for TL/MLX5/a2a, leading to a significant performance gain.
[figure: before_after]

Support rectangular blocks

This is a critical optimization that brings immediate performance benefits: it gives more flexibility in choosing the block dimensions, so the transpose unit can be better saturated. To complete this feature, we expose two (independent) options for determining the block dimensions h and w:

  • FORCE_WIDER, imposing h <= w
  • FORCE_LONGER, imposing h >= w
    [figure: rect_blocks]
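
The two constraints can be read as filters on candidate (h, w) pairs. The sketch below is only an illustration under stated assumptions: the struct, the function names, and the "largest area" selection criterion are hypothetical and are not taken from the PR's code. It does show that, because the options are independent, enabling both leaves only square blocks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags mirroring the two options described above. */
typedef struct {
    bool force_wider;  /* impose h <= w */
    bool force_longer; /* impose h >= w */
} block_shape_opts_t;

/* Returns true if an h x w block satisfies the enabled constraints.
 * With both flags set, only square blocks (h == w) pass. */
static bool block_shape_allowed(int h, int w, block_shape_opts_t opts)
{
    assert(h > 0 && w > 0);
    if (opts.force_wider && h > w) {
        return false;
    }
    if (opts.force_longer && h < w) {
        return false;
    }
    return true;
}

/* Picks the allowed shape with the largest area h * w, subject to
 * h <= max_h and w <= max_w. The "largest area" criterion is an
 * assumption made for illustration only. Returns false if nothing fits. */
static bool pick_block_shape(int max_h, int max_w, block_shape_opts_t opts,
                             int *h_out, int *w_out)
{
    int best_area = 0;

    for (int h = 1; h <= max_h; h++) {
        for (int w = 1; w <= max_w; w++) {
            if (block_shape_allowed(h, w, opts) && h * w > best_area) {
                best_area = h * w;
                *h_out    = h;
                *w_out    = w;
            }
        }
    }
    return best_area > 0;
}
```

For example, with limits of 4 rows and 8 columns, no constraint yields a 4 x 8 block, while FORCE_LONGER (or both options together) yields 4 x 4.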

Reuse device memory chunks for several blocks

As long as two blocks need to be sent to the same remote peer, the WQEs dealing with those blocks can 1) be enqueued on the same QPE and 2) use the same device memory chunks. This makes it possible to use one device memory (dm) chunk to post, and offload to the NIC, the processing of a whole batch of blocks, so the algorithm waits less for free device memory chunks. This behavior is controlled by the option NBR_SERIALIZED_BATCHES.

[figure: output5]

Batch the inter-node RDMA sends

We allow successive results of the transpose WQE to be batched before being sent to a remote peer. Aggregating the messages this way better saturates the network. The batch size is controlled by SEND_BATCH_SIZE.

[figure: batch_size]
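
To make the effect of the batch size concrete, here is a minimal sketch of how aggregating transposed blocks reduces the number of sends to one peer. The function names and the flush-on-full policy are assumptions for illustration; the PR's actual posting logic is not shown in this description.

```c
#include <assert.h>
#include <stddef.h>

/* Number of sends needed to ship nblocks transposed blocks to one peer
 * when up to send_batch_size blocks are aggregated per send. */
static size_t num_sends(size_t nblocks, size_t send_batch_size)
{
    assert(send_batch_size > 0);
    return (nblocks + send_batch_size - 1) / send_batch_size;
}

/* Walks the blocks in order and records how many blocks each send carries
 * in sizes[] (capacity cap). Returns the number of sends recorded. */
static size_t plan_sends(size_t nblocks, size_t send_batch_size,
                         size_t *sizes, size_t cap)
{
    size_t pending = 0;
    size_t nsends  = 0;

    assert(send_batch_size > 0);
    for (size_t b = 0; b < nblocks; b++) {
        pending++;
        /* Flush when the batch is full or this is the last block. */
        if (pending == send_batch_size || b + 1 == nblocks) {
            assert(nsends < cap);
            sizes[nsends++] = pending;
            pending         = 0;
        }
    }
    return nsends;
}
```

With 10 blocks and a batch size of 4, this plan produces three sends carrying 4, 4, and 2 blocks, instead of 10 single-block sends.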

Iterate across nodes before blocks when posting the WQEs

This better saturates the network. The behavior is controlled by NBR_BATCHES_PER_PASSAGE, which sets the number of batches to send to a remote peer before moving on to the next one. The old behavior corresponds to large values of this parameter, i.e., NBR_BATCHES_PER_PASSAGE >> 1.

[figure: batch_per_passage2]
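
The change in loop order can be sketched as follows. This is an illustration under assumptions: `post_t`, `posting_order`, and the flat "batches per node" model are hypothetical names and simplifications, not the PR's code. A passage visits every node and posts at most `per_passage` batches to each before moving on.

```c
#include <assert.h>
#include <stddef.h>

/* One posted batch: which remote node it targets and its index there. */
typedef struct {
    size_t node;
    size_t batch;
} post_t;

/* Fills out[] (capacity cap) with the order in which batches are posted:
 * each passage visits every node and posts at most per_passage batches
 * to it. Returns the number of entries written. */
static size_t posting_order(size_t nnodes, size_t batches_per_node,
                            size_t per_passage, post_t *out, size_t cap)
{
    size_t n     = 0;
    size_t start = 0;

    assert(per_passage > 0);
    while (start < batches_per_node) {
        /* End of this passage, clamped to avoid overflow for huge values. */
        size_t end = (batches_per_node - start < per_passage)
                         ? batches_per_node
                         : start + per_passage;

        for (size_t node = 0; node < nnodes; node++) {
            for (size_t b = start; b < end; b++) {
                assert(n < cap);
                out[n].node  = node;
                out[n].batch = b;
                n++;
            }
        }
        start = end;
    }
    return n;
}
```

With 2 nodes and 4 batches each, `per_passage = 1` alternates between the nodes on every batch, whereas any `per_passage >= 4` posts all of node 0's batches before touching node 1, which matches the old behavior described above.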

Option to force the regular case

The TL/MLX5 env variable FORCE_REGULAR forces the chosen block dimensions to divide ppn. This option is useful 1) for debugging and 2) because the regular case is expected to perform better than the irregular case.
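
As a small illustration of the divisibility condition, the sketch below checks whether a block shape is regular and, as one possible way to satisfy the constraint, finds the largest divisor of ppn under a cap. Both helper names are hypothetical, and the "largest divisor" strategy is an assumption; the PR description does not say how the dimensions are adjusted.

```c
#include <assert.h>
#include <stdbool.h>

/* A block shape is "regular" here when both dimensions divide ppn. */
static bool block_is_regular(int ppn, int h, int w)
{
    assert(ppn > 0 && h > 0 && w > 0);
    return ppn % h == 0 && ppn % w == 0;
}

/* Largest divisor of ppn that does not exceed cap (at least 1).
 * Assumed strategy for illustration, not taken from the PR. */
static int largest_divisor_at_most(int ppn, int cap)
{
    assert(ppn > 0 && cap > 0);
    for (int d = (cap < ppn) ? cap : ppn; d > 1; d--) {
        if (ppn % d == 0) {
            return d;
        }
    }
    return 1;
}
```

For ppn = 8, a 4 x 2 block is regular and a 3 x 2 block is not; for ppn = 12 and a cap of 5, the largest admissible dimension under this strategy is 4.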

All these optimizations are independent, but we introduce them in a single PR to avoid resolving many conflicts.

size_t t1 = power2(ucc_max(msgsize, 8));
size_t tsize = height * ucc_max(power2(width) * t1, MAX_MSG_SIZE);

return tsize <= MAX_TRANSPOSE_SIZE && msgsize <= 128 && height <= 64 &&
Collaborator Author

Define named constants for these literals instead of hardcoding them.
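
One possible shape for that change, covering only the two literal comparisons visible in the excerpt (128 and 64). The macro and function names are hypothetical, and the rest of the condition is truncated in the excerpt, so it is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for the two literals in the reviewed condition. */
#define A2A_BLOCK_MAX_MSG_SIZE 128
#define A2A_BLOCK_MAX_HEIGHT   64

/* Checks only the two literal bounds from the excerpt; the real
 * condition has further terms that are not shown in the diff. */
static bool block_within_literal_limits(size_t msgsize, size_t height)
{
    assert(height > 0);
    return msgsize <= A2A_BLOCK_MAX_MSG_SIZE &&
           height <= A2A_BLOCK_MAX_HEIGHT;
}
```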
