Add query to see jobs that need runners scaled up (#5944)
This will be used by the autoscaler lambdas to figure out which
instances may not have properly received the scale-up command (which
happens fairly regularly). A sketch of how a lambda might consume the
query's output follows the change list below.

The query is based on the "queued_jobs" query, with a few changes:
- Queries jobs across the pytorch and pytorch-labs organizations,
which our runners support
- Checks only a specific time window for queued jobs, looking for jobs
that have been queued long enough to warrant intervention, but not so
old that GitHub has already cancelled them
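
Not part of the commit: a minimal Python sketch, assuming the lambda receives the query's rows as dicts, of how the output could be turned into per-runner-type scale-up counts. The column names match the query below; the cap constant and function name are hypothetical.

from collections import defaultdict

# Hypothetical safety cap on how many runners to request per invocation.
MAX_SCALE_UP_PER_RUN = 20

def runners_to_provision(rows):
    """rows: dicts with the query's output columns, e.g.
    {"runner_label": "linux.4xlarge", "org": "pytorch", "full_repo": "pytorch/pytorch",
     "num_queued_jobs": 3, "min_queue_time_min": 31, "max_queue_time_min": 55}."""
    needed = defaultdict(int)
    for row in rows:
        # Each job queued for 30+ minutes is treated as one missing runner of that type.
        needed[row["runner_label"]] += row["num_queued_jobs"]
    # Cap the request so a burst of stuck jobs can't trigger a runaway scale-up.
    return {label: min(count, MAX_SCALE_UP_PER_RUN) for label, count in needed.items()}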
ZainRizvi authored Nov 19, 2024
1 parent 8232497 commit ea7177b
Showing 2 changed files with 72 additions and 0 deletions.
@@ -0,0 +1 @@
{}
71 changes: 71 additions & 0 deletions torchci/clickhouse_queries/queued_jobs_aggregate/query.sql
@@ -0,0 +1,71 @@
--- This query is used by the AWS autoscalers to scale up runner types that
--- have had jobs waiting for them for a significant period of time.
---
--- This query returns the number of jobs per runner type that have been
--- queued for too long, which the autoscalers use to determine how many
--- additional runners to spin up.

with possible_queued_jobs as (
  select id, run_id
  from default.workflow_job
  where
    status = 'queued'
    AND created_at < (
      -- Only consider jobs that have been queued for a significant period of time
      CURRENT_TIMESTAMP() - INTERVAL 30 MINUTE
    )
    AND created_at > (
      -- Queued jobs are automatically cancelled after this long. Any allegedly pending
      -- jobs older than this are actually bad data
      CURRENT_TIMESTAMP() - INTERVAL 3 DAY
    )
),
queued_jobs as (
  SELECT
    DATE_DIFF(
      'minute',
      job.created_at,
      CURRENT_TIMESTAMP()
    ) AS queue_m,
    workflow.repository.owner.login as org,
    workflow.repository.full_name as full_repo,
    CONCAT(workflow.name, ' / ', job.name) AS name,
    job.html_url,
    -- Pick the runner label: if a job has more than one label, use the second
    -- one (the first is usually a generic label such as 'self-hosted')
    IF(
      LENGTH(job.labels) = 0,
      'N/A',
      IF(
        LENGTH(job.labels) > 1,
        job.labels[2],
        job.labels[1]
      )
    ) AS runner_label
  FROM
    default.workflow_job job final
    JOIN default.workflow_run workflow final ON workflow.id = job.run_id
  WHERE
    job.id in (select id from possible_queued_jobs)
    and workflow.id in (select run_id from possible_queued_jobs)
    and workflow.repository.owner.login in ('pytorch', 'pytorch-labs')
    AND job.status = 'queued'
    /* These two conditions are workarounds for GitHub's broken API. Sometimes */
    /* jobs get stuck in a permanently "queued" state but definitely ran. We can */
    /* detect this by looking at whether any steps executed (if there were, */
    /* obviously the job started running), and whether the workflow was marked as */
    /* complete (somehow more reliable than the job-level API) */
    AND LENGTH(job.steps) = 0
    AND workflow.status != 'completed'
  ORDER BY
    queue_m DESC
)
select
  runner_label,
  org,
  full_repo,
  count(*) as num_queued_jobs,
  min(queue_m) as min_queue_time_min,
  max(queue_m) as max_queue_time_min
from queued_jobs
group by runner_label, org, full_repo
order by max_queue_time_min desc
settings allow_experimental_analyzer = 1;
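
For reference, a hedged way to inspect the query's output with the clickhouse_connect Python client (this is not how the autoscaler lambdas invoke it; host and credentials are placeholders, not values from this commit):

import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", username="default", password="")
with open("torchci/clickhouse_queries/queued_jobs_aggregate/query.sql") as f:
    sql = f.read()
# Each row is one (runner_label, org, repo) group with its queue statistics.
for row in client.query(sql).named_results():
    print(row["runner_label"], row["num_queued_jobs"], row["max_queue_time_min"])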
