
Fix non-deterministic hangs caused by MeshDevice trace replay #17696

Merged
merged 3 commits into main from jchu/fix-meshdevice-trace-replay on Feb 11, 2025

Conversation

@cfjchu (Collaborator) commented Feb 7, 2025

Ticket

Link to Github Issue

Problem description

There are non-deterministic hangs in llama model tests that use trace functionality. @tt-aho pointed out that some changes to the blocking behavior were incorrectly introduced in d2ba114.

What's changed

This change modifies the IDevice::replay_trace method to add more granular control over blocking behavior. Previously, we used a single boolean blocking to denote both stalls that happen on the device and stalls that happen on the worker thread. Since we've moved the push_work API underneath the device APIs, we need this fine-grained control for orchestrating trace replay from MeshDevice.
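As a rough sketch of the intended call pattern (parameter names follow the diffs further down; none of this is verbatim code from the PR), a single-device caller maps the old flag onto both new parameters, while MeshDevice can stall on the device without blocking each worker thread:

    // Sketch only; parameter names follow the diffs in this PR.
    // Before: one flag covered both the device stall and the worker-thread stall.
    device->replay_trace(cq_id, tid, /*blocking=*/true);

    // After: the two stalls are controlled independently. A single-device caller
    // simply passes the old flag for both ...
    device->replay_trace(cq_id, tid, /*block_on_device=*/blocking, /*block_on_worker_thread=*/blocking);

    // ... while MeshDevice can stall each device on-device without blocking its
    // worker thread, and wait for the worker threads itself afterwards.
    device->replay_trace(cq_id, tid, /*block_on_device=*/true, /*block_on_worker_thread=*/false);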

Checklist

@cfjchu changed the title from "Jchu/fix meshdevice trace replay" to "Fix non-deterministic hangs caused by MeshDevice trace replay" on Feb 7, 2025
@cfjchu marked this pull request as ready for review on February 7, 2025 04:36
tt_metal/impl/device/device.cpp (resolved review thread)
@@ -141,7 +141,8 @@ class IDevice {
     // Metal trace device capture mode
     virtual void begin_trace(const uint8_t cq_id, const uint32_t tid) = 0;
     virtual void end_trace(const uint8_t cq_id, const uint32_t tid) = 0;
-    virtual void replay_trace(const uint8_t cq_id, const uint32_t tid, const bool blocking) = 0;
+    virtual void replay_trace(
+        const uint8_t cq_id, const uint32_t tid, const bool block_on_device, const bool block_on_worker_thread) = 0;
Contributor:
Outside of this PR... const T in function declarations doesn't do anything; it is equivalent to just T. (This doesn't apply to function definitions, where const T makes sure the parameter doesn't get mutated in the function body, nor to const T& or const T*.)
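As a quick illustration of this point (a hypothetical function, not one from tt-metal): top-level const is not part of the function's type, so both declarations below declare the same function, and the const only matters inside a definition's body.

    // Hypothetical example; both lines declare the exact same function.
    void set_id(int id);
    void set_id(const int id);   // top-level const on a parameter is ignored in a declaration

    // In a definition, the const only prevents mutation inside the body:
    void set_id(const int id) {
        // id = 42;   // would not compile: id is const within this body
    }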

@cfjchu (Collaborator, Author):
Yes, I just used the existing precedent there but I agree.

Comment on lines +562 to +592
// If blocking, wait until worker threads have completed
if (block_on_worker_thread) {
for (auto& device : scoped_devices_->get_devices()) {
device->synchronize();
}
Contributor:
Is it not going to work if you call device->synchronize() but each device->replay_trace() call is non-blocking?

Is it possible to use the event APIs for this, perhaps in the long term? I see the TODO, but this problem will essentially remain; it's just that now you won't hop over push_work and will instead perform a blocking EnqueueTrace. How would EnqueueMeshTrace work from this perspective?

@cfjchu (Collaborator, Author) replied on Feb 7, 2025:
> Is it not going to work if you call device->synchronize() but each device->replay_trace() call is non-blocking?

At a high level, the behavior we want is for the main thread to block on all devices. With a worker thread per device, this breaks down into each worker thread blocking on its device and the main thread blocking on every worker thread. Your proposal would not guarantee the same behavior as before. Right now, I'm just reverting to the old behavior to resolve the llama hangs.

> Is it possible to use the event APIs for this, perhaps in the long term? I see the TODO, but this problem will essentially remain; it's just that now you won't hop over push_work and will instead perform a blocking EnqueueTrace. How would EnqueueMeshTrace work from this perspective?

Long term we won't have blocking on worker threads with TT-Mesh. @tt-asaigal will be implementing this as part of the Trace functionality.

@@ -1326,7 +1326,7 @@ void EndTraceCapture(IDevice* device, const uint8_t cq_id, const uint32_t tid) {
 void ReplayTrace(IDevice* device, const uint8_t cq_id, const uint32_t tid, const bool blocking) {
     LIGHT_METAL_TRACE_FUNCTION_ENTRY();
     LIGHT_METAL_TRACE_FUNCTION_CALL(CaptureReplayTrace, device, cq_id, tid, blocking);
-    device->replay_trace(cq_id, tid, blocking);
+    device->replay_trace(cq_id, tid, blocking /* block_on_device */, blocking /* block_on_worker_thread */);
Contributor:
Both places where replay is called use the same boolean for both though? Does this fix the hang?

@cfjchu (Collaborator, Author):
Take a look at the MeshDevice::replay implementation. On each worker replay, block_on_worker_thread is set to false.
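Putting the pieces in this thread together, a minimal sketch of that orchestration (the names follow the quoted diffs; the body is illustrative, not the exact implementation) might look like this: each per-device replay stalls on the device but not on its worker thread, and the mesh-level call then waits for every worker thread if asked to.

    // Sketch only; follows the names quoted in this PR, not the exact tt-metal source.
    void MeshDevice::replay_trace(
        const uint8_t cq_id, const uint32_t tid, const bool block_on_device, const bool block_on_worker_thread) {
        // Each worker replay may stall on its device, but the main thread never blocks
        // on the worker thread at this point.
        for (auto& device : scoped_devices_->get_devices()) {
            device->replay_trace(cq_id, tid, block_on_device, /*block_on_worker_thread=*/false);
        }
        // If blocking, wait until worker threads have completed.
        if (block_on_worker_thread) {
            for (auto& device : scoped_devices_->get_devices()) {
                device->synchronize();
            }
        }
    }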

@cfjchu (Collaborator, Author):
Passes locally, waiting on our backlogged CI.

@mtairum (Contributor) commented Feb 7, 2025

We've tested this PR with a new update to the Llama3 models locally and the demo tests are passing.
Relevant PR: #17421

Pushed a branch that's basically #17421 rebased to this PR #17696 and kicked off the T3K demo pipelines: https://github.com/tenstorrent/tt-metal/actions/runs/13198728419

If these pass, this PR is good for merge from the models team side.
FYI @yieldthought

@cfjchu force-pushed the jchu/fix-meshdevice-trace-replay branch from 68f1409 to eb5e550 on February 7, 2025 19:15
@yieldthought (Contributor):
I cherry-picked these onto a new branch and still see a local hang - not convinced this is a complete fix.

@yieldthought (Contributor):

We noticed this broke the CI tests over a week ago. For all of that time main has been broken for our team and main has been completely unprotected against other commits breaking our code further.

The breaking change should have been reverted from main immediately and this fixed on branch. Can we do this today? We need a working main for our workshop tomorrow.

@cfjchu force-pushed the jchu/fix-meshdevice-trace-replay branch from eb5e550 to 61072d1 on February 11, 2025 04:29
@cfjchu (Collaborator, Author) commented Feb 11, 2025

I had a chat with @mtairum and we agree that other commits are responsible for the failures seen in regression. This commit does solve the issues originally reported.

Looking back at the original regression introduced by my changes:

The original change caused a regression only in the t3k_llama3_tests test suite. The changes in this PR revert to the original behavior and fix that regression in t3k_llama3_tests.

There seem to be three categories of failures in CI today:

  1. failures that already existed before my commit: (A)
  2. failures caused by my commit: (B)-(A)
  3. new failures not related to my commit

@cfjchu merged commit 05b16aa into main on Feb 11, 2025
222 of 223 checks passed
@cfjchu deleted the jchu/fix-meshdevice-trace-replay branch on February 11, 2025 06:10
@mtairum mentioned this pull request on Feb 12, 2025