[skip ci] #0: Add failure signature for runner shutdown (tenstorrent#…
…17439)

### Ticket

### Problem description

### What's changed
Added a failure signature that classifies unexpected runner shutdowns, e.g.
https://github.com/tenstorrent/tt-metal/actions/runs/13077375477/job/36493228333

### Checklist
- [ ] Post commit CI passes
- [ ] Blackhole Post commit (if applicable)
- [ ] Model regression CI testing passes (if applicable)
- [ ] Device performance regression CI testing passes (if applicable)
- [ ] **(For models and ops writers)** Full [new models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml) tests pass
- [ ] New/Existing tests provide coverage for changes
williamlyTT authored and nikileshx committed Feb 3, 2025
1 parent 2c5f154 commit 46c1581
Showing 2 changed files with 2 additions and 0 deletions.
1 change: 1 addition & 0 deletions infra/data_collection/github/utils.py
@@ -95,6 +95,7 @@ def get_job_failure_signature_(github_job, failure_description) -> Optional[Unio
 "timed out": str(InfraErrorV1.JOB_UNIT_TIMEOUT_FAILURE),
 "exceeded the maximum execution time": str(InfraErrorV1.JOB_CUMULATIVE_TIMEOUT_FAILURE),
 "lost communication with the server": str(InfraErrorV1.RUNNER_COMM_FAILURE),
+"runner has received a shutdown signal": str(InfraErrorV1.RUNNER_SHUTDOWN_FAILURE),
 "No space left on device": str(InfraErrorV1.DISK_SPACE_FAILURE),
 }

1 change: 1 addition & 0 deletions infra/data_collection/models.py
@@ -8,3 +8,4 @@ class InfraErrorV1(enum.Enum):
 GENERIC_FAILURE = enum.auto()
 DISK_SPACE_FAILURE = enum.auto()
 RUNNER_COMM_FAILURE = enum.auto()
+RUNNER_SHUTDOWN_FAILURE = enum.auto()
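
For readers without the full source, the sketch below shows one way the new mapping entry could take effect. It is a minimal illustration, not the repository's implementation: the body of `get_job_failure_signature_` is not visible in this diff, so the case-sensitive substring match, the `SIGNATURES` name, the `classify_failure` helper, and the sample log message are all assumptions. The enum is trimmed to the members shown in the hunk.

```python
import enum
from typing import Optional


class InfraErrorV1(enum.Enum):
    # Only the members visible in the diff hunk; the real enum defines more.
    GENERIC_FAILURE = enum.auto()
    DISK_SPACE_FAILURE = enum.auto()
    RUNNER_COMM_FAILURE = enum.auto()
    RUNNER_SHUTDOWN_FAILURE = enum.auto()


# Hypothetical name: the diff does not show what the real dict is called.
SIGNATURES = {
    "lost communication with the server": str(InfraErrorV1.RUNNER_COMM_FAILURE),
    "runner has received a shutdown signal": str(InfraErrorV1.RUNNER_SHUTDOWN_FAILURE),
    "No space left on device": str(InfraErrorV1.DISK_SPACE_FAILURE),
}


def classify_failure(failure_description: str) -> Optional[str]:
    """Return the first signature whose key occurs in the description.

    Assumption: matching is a plain, case-sensitive substring test.
    """
    for needle, signature in SIGNATURES.items():
        if needle in failure_description:
            return signature
    return None


# Illustrative message only; not copied from the linked job log.
sample = "The runner has received a shutdown signal."
print(classify_failure(sample))
```

Under these assumptions, a description that matches none of the keys yields `None`, which a caller could then map to `GENERIC_FAILURE` or leave unclassified.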
