
[Bug Report] Matmul Hangs when in0 is block sharded and in1 is width sharded #17482

Open
transfact opened this issue Feb 3, 2025 · 3 comments

@transfact

Describe the bug
Matmul Hangs when in0 is block sharded and in1 is width sharded

In0's shape is (1, 8, 224, 768) and In1's shape is (1, 1, 768, 3072).

I added dprint calls at cb_wait_front(cb0), cb_wait_front(cb1), and at the start of the blocks to find where it hangs.

When input1 is interleaved, everything works well:

(screenshot: dprint output with in1 interleaved; the run completes)

But when I change the configs so that in1 is width sharded, it hangs: cb_wait_front(cb1) never returns.

(screenshot: dprint output with in1 width sharded; the run hangs at cb_wait_front(cb1))

To Reproduce

Code
import pytest
import torch
import ttnn


@pytest.mark.parametrize(
    "batch_size, channel_a, channel_b, m_size, k_size, n_size, has_bias",
    [
        (1, 8, 1, 224, 768, 3072, False),
    ],
)
@pytest.mark.parametrize("dtype", [ttnn.bfloat8_b])
@pytest.mark.parametrize("pcc", [0.94])
def test_ff1(device, batch_size, channel_a, channel_b, m_size, k_size, n_size, has_bias, dtype, pcc):
    torch.manual_seed(0)

    torch_input_tensor_a = torch.randn((batch_size, channel_a, m_size, k_size), dtype=torch.float32)
    torch_input_tensor_b = torch.randn((batch_size, channel_b, k_size, n_size), dtype=torch.float32)
    torch_output_tensor = torch_input_tensor_a @ torch_input_tensor_b

    reuse_config = ttnn.MatmulMultiCoreReuseMultiCastProgramConfig(
        compute_with_storage_grid_size=ttnn.CoreCoord(6, 8),
        in0_block_w=4,
        out_subblock_h=1,
        out_subblock_w=8,
        per_core_M=7,
        per_core_N=16,
        transpose_mcast=False,
        fused_activation=None,
    )

    input_tensor_a = ttnn.from_torch(
        torch_input_tensor_a, layout=ttnn.TILE_LAYOUT, device=device, dtype=dtype, memory_config=ttnn.L1_MEMORY_CONFIG
    )
    a_memory_config = ttnn.create_sharded_memory_config(
        (1, 8, m_size, k_size),
        core_grid=ttnn.CoreGrid(x=6, y=8),
        strategy=ttnn.ShardStrategy.BLOCK,
    )
    input_tensor_a = ttnn.to_memory_config(input_tensor_a, a_memory_config)
    b_memory_config = ttnn.create_sharded_memory_config(
        (1, 1, k_size, n_size),
        core_grid=ttnn.CoreGrid(x=6, y=1),
        orientation=ttnn.ShardOrientation.ROW_MAJOR,
        strategy=ttnn.ShardStrategy.WIDTH,
    )

    input_tensor_b = ttnn.from_torch(
        torch_input_tensor_b, layout=ttnn.TILE_LAYOUT, device=device, dtype=dtype, memory_config=ttnn.DRAM_MEMORY_CONFIG
    )
    input_tensor_b = ttnn.to_memory_config(input_tensor_b, b_memory_config)
    compute_kernel_config = ttnn.WormholeComputeKernelConfig(
        math_fidelity=ttnn.MathFidelity.LoFi,
    )
    # mem_config = input_tensor_a.memory_config()
    print(f"-- inputA config: {input_tensor_a.memory_config()}")
    print(f"-- inputB config: {input_tensor_b.memory_config()}")
    output_tensor = ttnn.matmul(
        input_tensor_a,
        input_tensor_b,
        # core_grid=core_grid,
        program_config=reuse_config,
        # compute_kernel_config= compute_kernel_config,
        # memory_config = a_memory_config
    )
    print(f"-- output config: {output_tensor.memory_config()}")
    # output_tensor = ttnn.to_torch(output_tensor)
    # _, msg = assert_with_pcc(torch_output_tensor, output_tensor, pcc=pcc)

    # print(msg)

If dprint output is needed, this is where I added the dprint calls:
(screenshot: dprint placement in the compute kernel)

tt-metal/ttnn/cpp/ttnn/operations/matmul/device/kernels/compute/bmm_large_block_zm_fused_bias_activation.cpp

Expected behavior
I am on version v53; if this is fixed in v55, please let me know.

I have seen other issues reporting that the ViT matmul causes problems. Could you take a look at this one?
Thanks.

transfact added the bug label on Feb 3, 2025
@bbradelTT
Contributor

@transfact this is not a supported combination of inputs. We will add validation to fail in this case and update the doc strings.

In terms of performant matmuls, the first input should be interleaved or sharded while the second input should be interleaved.
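For reference, here is a minimal sketch of the supported combination described above, adapted from the reproduction test: in0 stays block sharded in L1 and in1 is left interleaved in DRAM. The shapes, grid, and program config values are copied from the report; the function name is made up for illustration, and this is a sketch rather than a verified configuration.

import torch
import ttnn


def run_supported_matmul(device):
    # Sketch only: in0 block sharded (L1), in1 interleaved (DRAM), as recommended above.
    # Shapes and program config values come from the reproduction test in this issue;
    # everything else is an assumption.
    m_size, k_size, n_size = 224, 768, 3072

    torch_a = torch.randn((1, 8, m_size, k_size), dtype=torch.float32)
    torch_b = torch.randn((1, 1, k_size, n_size), dtype=torch.float32)

    # in0: block sharded over the 6x8 compute grid in L1
    input_a = ttnn.from_torch(
        torch_a, layout=ttnn.TILE_LAYOUT, device=device,
        dtype=ttnn.bfloat8_b, memory_config=ttnn.L1_MEMORY_CONFIG,
    )
    a_memory_config = ttnn.create_sharded_memory_config(
        (1, 8, m_size, k_size),
        core_grid=ttnn.CoreGrid(x=6, y=8),
        strategy=ttnn.ShardStrategy.BLOCK,
    )
    input_a = ttnn.to_memory_config(input_a, a_memory_config)

    # in1: left interleaved in DRAM instead of width sharded
    input_b = ttnn.from_torch(
        torch_b, layout=ttnn.TILE_LAYOUT, device=device,
        dtype=ttnn.bfloat8_b, memory_config=ttnn.DRAM_MEMORY_CONFIG,
    )

    program_config = ttnn.MatmulMultiCoreReuseMultiCastProgramConfig(
        compute_with_storage_grid_size=ttnn.CoreCoord(6, 8),
        in0_block_w=4,
        out_subblock_h=1,
        out_subblock_w=8,
        per_core_M=7,
        per_core_N=16,
        transpose_mcast=False,
        fused_activation=None,
    )
    return ttnn.matmul(input_a, input_b, program_config=program_config)

This is the same setup as the test above with only the in1 memory config changed, which is the variant the report says runs without hanging.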

bbradelTT assigned edwinleeTT and unassigned bbradelTT on Feb 3, 2025
@bbradelTT
Contributor

@edwinleeTT will work on this issue.

@transfact
Author

I had assumed 2D matmul supported a combination where in1 is sharded, since validation passes for a width-sharded in1 (the 1D config can be executed with in0 height-sharded and in1 height-sharded; see the sketch below). It might be a leftover from DRAM sharding.

Thanks for checking and updating docs.
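For context on the 1D height-sharded path mentioned above, here is a hedged sketch of how those memory configs could be created with ttnn.create_sharded_memory_config, using the shapes from this report. The grid sizes are illustrative assumptions only; this is not a validated 1D matmul configuration.

import ttnn

m_size, k_size, n_size = 224, 768, 3072

# in0 height sharded: the 1 * 8 * 224 = 1792 rows split evenly across 56 cores (32 rows each).
# The 8x7 grid is an assumption chosen only so the shard height divides evenly.
in0_height_sharded = ttnn.create_sharded_memory_config(
    (1, 8, m_size, k_size),
    core_grid=ttnn.CoreGrid(x=8, y=7),
    strategy=ttnn.ShardStrategy.HEIGHT,
)

# in1 height sharded over a smaller grid (768 rows / 6 cores = 128 rows, i.e. 4 tile rows, per core).
in1_height_sharded = ttnn.create_sharded_memory_config(
    (1, 1, k_size, n_size),
    core_grid=ttnn.CoreGrid(x=6, y=1),
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
    strategy=ttnn.ShardStrategy.HEIGHT,
)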
