
[Bug Report] Matmul Hangs when in0 is block sharded and in1 is width sharded #17482

Open
transfact opened this issue Feb 3, 2025 · 3 comments

@transfact

Describe the bug
Matmul Hangs when in0 is block sharded and in1 is width sharded

In0's shape is (1, 8, 224, 768) and In1's shape is (1, 1, 768, 3072).

I added dprint calls at cb_wait_front(cb0), cb_wait_front(cb1), and at the start of the blocks to find where it hangs.

When input1 is interleaved, everything works well:

(screenshot: dprint output with in1 interleaved; the run completes)

But when I change the configs so that in1 is width sharded, it hangs: cb_wait_front(cb1) never returns.

(screenshot: dprint output with in1 width sharded; the run hangs at cb_wait_front(cb1))

To Reproduce

Code
import pytest
import torch
import ttnn


@pytest.mark.parametrize(
    "batch_size, channel_a, channel_b, m_size, k_size, n_size, has_bias",
    [
        (1, 8, 1, 224, 768, 3072, False),
    ],
)
@pytest.mark.parametrize("dtype", [ttnn.bfloat8_b])
@pytest.mark.parametrize("pcc", [0.94])
def test_ff1(device, batch_size, channel_a, channel_b, m_size, k_size, n_size, has_bias, dtype, pcc):
    torch.manual_seed(0)

    torch_input_tensor_a = torch.randn((batch_size, channel_a, m_size, k_size), dtype=torch.float32)
    torch_input_tensor_b = torch.randn((batch_size, channel_b, k_size, n_size), dtype=torch.float32)
    torch_output_tensor = torch_input_tensor_a @ torch_input_tensor_b

    reuse_config = ttnn.MatmulMultiCoreReuseMultiCastProgramConfig(
        compute_with_storage_grid_size=ttnn.CoreCoord(6, 8),
        in0_block_w=4,
        out_subblock_h=1,
        out_subblock_w=8,
        per_core_M=7,
        per_core_N=16,
        transpose_mcast=False,
        fused_activation=None,
    )

    input_tensor_a = ttnn.from_torch(
        torch_input_tensor_a, layout=ttnn.TILE_LAYOUT, device=device, dtype=dtype, memory_config=ttnn.L1_MEMORY_CONFIG
    )
    a_memory_config = ttnn.create_sharded_memory_config(
        (1, 8, m_size, k_size),
        core_grid=ttnn.CoreGrid(x=6, y=8),
        strategy=ttnn.ShardStrategy.BLOCK,
    )
    input_tensor_a = ttnn.to_memory_config(input_tensor_a, a_memory_config)
    b_memory_config = ttnn.create_sharded_memory_config(
        (1, 1, k_size, n_size),
        core_grid=ttnn.CoreGrid(x=6, y=1),
        orientation=ttnn.ShardOrientation.ROW_MAJOR,
        strategy=ttnn.ShardStrategy.WIDTH,
    )

    input_tensor_b = ttnn.from_torch(
        torch_input_tensor_b, layout=ttnn.TILE_LAYOUT, device=device, dtype=dtype, memory_config=ttnn.DRAM_MEMORY_CONFIG
    )
    input_tensor_b = ttnn.to_memory_config(input_tensor_b, b_memory_config)
    compute_kernel_config = ttnn.WormholeComputeKernelConfig(
        math_fidelity=ttnn.MathFidelity.LoFi,
    )
    # mem_config = input_tensor_a.memory_config()
    print(f"-- inputA config: {input_tensor_a.memory_config()}")
    print(f"-- inputB config: {input_tensor_b.memory_config()}")
    output_tensor = ttnn.matmul(
        input_tensor_a,
        input_tensor_b,
        # core_grid=core_grid,
        program_config=reuse_config,
        # compute_kernel_config= compute_kernel_config,
        # memory_config = a_memory_config
    )
    print(f"-- output config: {output_tensor.memory_config()}")
    # output_tensor = ttnn.to_torch(output_tensor)
    # _, msg = assert_with_pcc(torch_output_tensor, output_tensor, pcc=pcc)

    # print(msg)

If dprint output is needed, this is where I added the dprint calls:
(screenshot: dprint placement in the compute kernel)

tt-metal/ttnn/cpp/ttnn/operations/matmul/device/kernels/compute/bmm_large_block_zm_fused_bias_activation.cpp

Expected behavior
I am on version v53; if this is fixed in v55, please let me know.

I have seen other issues reporting that the ViT matmul causes problems. Could you take a look at this one?
Thanks.

transfact added the bug label on Feb 3, 2025
@bbradelTT
Contributor

@transfact this is not a supported combination of inputs. We will add validation to fail in this case and update the doc strings.

In terms of performant matmuls, the first input should be interleaved or sharded while the second input should be interleaved.
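For reference, here is a minimal sketch of the supported combination described above, adapted from the reproduction test: in0 stays block sharded in L1 and in1 is left interleaved in DRAM. The shapes, grid, and program config values are copied from the report; the function name is made up for illustration, and this is a sketch rather than a verified configuration.

import torch
import ttnn


def run_supported_matmul(device):
    # Sketch only: in0 block sharded (L1), in1 interleaved (DRAM), as recommended above.
    # Shapes and program config values come from the reproduction test in this issue;
    # everything else is an assumption.
    m_size, k_size, n_size = 224, 768, 3072

    torch_a = torch.randn((1, 8, m_size, k_size), dtype=torch.float32)
    torch_b = torch.randn((1, 1, k_size, n_size), dtype=torch.float32)

    # in0: block sharded over the 6x8 compute grid in L1
    input_a = ttnn.from_torch(
        torch_a, layout=ttnn.TILE_LAYOUT, device=device,
        dtype=ttnn.bfloat8_b, memory_config=ttnn.L1_MEMORY_CONFIG,
    )
    a_memory_config = ttnn.create_sharded_memory_config(
        (1, 8, m_size, k_size),
        core_grid=ttnn.CoreGrid(x=6, y=8),
        strategy=ttnn.ShardStrategy.BLOCK,
    )
    input_a = ttnn.to_memory_config(input_a, a_memory_config)

    # in1: left interleaved in DRAM instead of width sharded
    input_b = ttnn.from_torch(
        torch_b, layout=ttnn.TILE_LAYOUT, device=device,
        dtype=ttnn.bfloat8_b, memory_config=ttnn.DRAM_MEMORY_CONFIG,
    )

    program_config = ttnn.MatmulMultiCoreReuseMultiCastProgramConfig(
        compute_with_storage_grid_size=ttnn.CoreCoord(6, 8),
        in0_block_w=4,
        out_subblock_h=1,
        out_subblock_w=8,
        per_core_M=7,
        per_core_N=16,
        transpose_mcast=False,
        fused_activation=None,
    )
    return ttnn.matmul(input_a, input_b, program_config=program_config)

This is the same setup as the test above with only the in1 memory config changed, which is the variant the report says runs without hanging.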

bbradelTT assigned edwinleeTT and unassigned bbradelTT on Feb 3, 2025
@bbradelTT
Contributor

@edwinleeTT will work on this issue.

@transfact
Author

I had assumed 2D matmul supported a combination where in1 is sharded, since validation passes for a width-sharded in1 (the 1D config can be executed with in0 height-sharded and in1 height-sharded; see the sketch below). It might be a leftover from DRAM sharding.

Thanks for checking and updating docs.
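For context on the 1D height-sharded path mentioned above, here is a hedged sketch of how those memory configs could be created with ttnn.create_sharded_memory_config, using the shapes from this report. The grid sizes are illustrative assumptions only; this is not a validated 1D matmul configuration.

import ttnn

m_size, k_size, n_size = 224, 768, 3072

# in0 height sharded: the 1 * 8 * 224 = 1792 rows split evenly across 56 cores (32 rows each).
# The 8x7 grid is an assumption chosen only so the shard height divides evenly.
in0_height_sharded = ttnn.create_sharded_memory_config(
    (1, 8, m_size, k_size),
    core_grid=ttnn.CoreGrid(x=8, y=7),
    strategy=ttnn.ShardStrategy.HEIGHT,
)

# in1 height sharded over a smaller grid (768 rows / 6 cores = 128 rows, i.e. 4 tile rows, per core).
in1_height_sharded = ttnn.create_sharded_memory_config(
    (1, 1, k_size, n_size),
    core_grid=ttnn.CoreGrid(x=6, y=1),
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
    strategy=ttnn.ShardStrategy.HEIGHT,
)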
