
fix dist attn reshape error #5366

Closed
wants to merge 5 commits

Conversation

tkdcjf159

By default, DeepSpeed's DistributedAttention uses scatter_idx = 2 and gather_idx = 0. However, if I set gather_idx to 1 and use a batch size greater than 1, an error occurs during the output all-to-all operation, as illustrated below. To fix this, replace seq_world_size with -1 in the reshape.

import torch
# deepspeed.comm is assumed here (the surrounding module imports it as dist); its API mirrors torch.distributed.
import deepspeed.comm as dist

def single_all_to_all(input, scatter_idx, gather_idx, group):
    # Example: input shape [2, 1024, 8, 16], scatter_idx = 1, gather_idx = 2, seq_world_size = 8
    seq_world_size = dist.get_world_size(group)
    inp_shape = list(input.shape)  # inp_shape = [2, 1024, 8, 16]
    inp_shape[scatter_idx] = inp_shape[scatter_idx] // seq_world_size  # inp_shape = [2, 128, 8, 16]
    if scatter_idx < 2:
        # Old code reshaped [2, 1024, 8, 16] to [seq_world_size, 128, 8, 16] = [8, 128, 8, 16]:
        # ERROR, since 2 * 1024 * 8 * 16 != 8 * 128 * 8 * 16.
        # Use -1 so the leading dimension absorbs the batch as well.
        input_t = input.reshape(
            [-1, inp_shape[scatter_idx]] +
            # [seq_world_size, inp_shape[scatter_idx]] +
            inp_shape[scatter_idx + 1:]
        ).contiguous()
    else:
        # Transpose groups of heads with the seq-len parallel dimension to scatter them
        input_t = input.reshape(
            [-1, seq_world_size, inp_shape[scatter_idx]] +
            inp_shape[scatter_idx + 1:]
        ).transpose(0, 1).contiguous()

    output = torch.empty_like(input_t)
    dist.all_to_all_single(output, input_t, group=group)

    # If scattering the seq-dim, transpose the heads back to the original dimension
    if scatter_idx < 2:
        output = output.transpose(0, 1).contiguous()

    return output.reshape(
        inp_shape[:gather_idx] +
        [inp_shape[gather_idx] * seq_world_size] +
        inp_shape[gather_idx + 1:]).contiguous()
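
For reference, here is a minimal single-process sketch (shapes taken from the comment above; no distributed setup needed, and the tensor values are hypothetical) showing why the original reshape fails once the batch size exceeds 1, and why -1 resolves it:

import torch

# Hypothetical example tensor: [batch=2, seq=1024, heads=8, head_dim=16],
# with scatter_idx = 1 and a sequence-parallel world size of 8.
x = torch.randn(2, 1024, 8, 16)
seq_world_size = 8
scatter_idx = 1

inp_shape = list(x.shape)
inp_shape[scatter_idx] //= seq_world_size  # -> [2, 128, 8, 16]

# Original code: target shape [seq_world_size, 128, 8, 16] = [8, 128, 8, 16]
# holds 8 * 128 * 8 * 16 = 131,072 elements, but x has 2 * 1024 * 8 * 16 = 262,144,
# so reshape raises a RuntimeError whenever batch > 1.
try:
    x.reshape([seq_world_size, inp_shape[scatter_idx]] + inp_shape[scatter_idx + 1:])
except RuntimeError as e:
    print("old reshape fails:", e)

# Proposed fix: -1 lets the leading dimension absorb the batch as well,
# giving [batch * seq_world_size, 128, 8, 16] = [16, 128, 8, 16].
x_fixed = x.reshape([-1, inp_shape[scatter_idx]] + inp_shape[scatter_idx + 1:])
print(x_fixed.shape)  # torch.Size([16, 128, 8, 16])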

@tkdcjf159 tkdcjf159 requested a review from mrwyattii as a code owner April 5, 2024 00:12
@tkdcjf159 tkdcjf159 requested review from awan-10 and arashb as code owners April 5, 2024 00:41
@tkdcjf159
Author

@microsoft-github-policy-service agree company="Upstage"

@loadams
Collaborator

loadams commented Jan 7, 2025

@tkdcjf159 - if you're still interested in this PR, could you resolve the conflicts so we can get it reviewed?

@loadams loadams self-assigned this Jan 7, 2025
@loadams loadams requested review from loadams and removed request for arashb, awan-10 and mrwyattii January 7, 2025 16:51
@loadams
Collaborator

loadams commented Jan 21, 2025

@tkdcjf159 - closing this PR as this code has been refactored. If you believe this bug still remains, please comment and we can re-open it.

@loadams loadams closed this Jan 21, 2025