Avoid handling stale long-running messages on scheduler #8991

Open
wants to merge 3 commits into main

Conversation

hendrikmakait (Member)

  • Tests added / passed
  • Passes pre-commit run --all-files

Comment on lines +6067 to +6072
if worker not in self.workers:
    logger.debug(
        "Received long-running signal from unknown worker %s. Ignoring.", worker
    )
    return

Member Author (hendrikmakait):

This is mostly for good measure; I think the code should also work without it.

Comment on lines +6092 to +6095
steal = self.extensions.get("stealing")
if steal is not None:
    steal.remove_key_from_stealable(ts)

Member Author (hendrikmakait):

I haven't tested the move of this code, but I'm certain that we should deal with staleness before taking any meaningful actions.
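
For context, a minimal sketch of the ordering being discussed, assuming a simplified handler signature (the names follow the snippets above; surrounding scheduler code is elided):

def handle_long_running(self, key, worker, compute_duration, stimulus_id):
    # Staleness checks first: bail out before mutating any scheduler state.
    if worker not in self.workers:
        return  # unknown (e.g. already-removed) worker
    ts = self.tasks.get(key)
    if ts is None or ts.processing_on is None:
        return  # task no longer known or not processing anywhere
    if ts.processing_on.address != worker:
        return  # signal from a worker that is no longer running the task

    # Only now take meaningful actions, e.g. updating the stealing extension.
    steal = self.extensions.get("stealing")
    if steal is not None:
        steal.remove_key_from_stealable(ts)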

Member:

yes, absolutely

Contributor:

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    27 files  ±0       27 suites  ±0    11h 37m 24s ⏱️ +7m 42s
 4 117 tests  +1    4 000 ✅  −2      111 💤 ±0   6 ❌ +3
51 628 runs  +13   49 325 ✅ +10    2 296 💤 ±0   7 ❌ +3

For more details on these failures, see this check.

Results for commit 88c42f7. ± Comparison against base commit 49f5e74.


ws = ts.processing_on
if ws is None:
    logger.debug("Received long-running signal from duplicate task. Ignoring.")
    return

if ws.address != worker:

Member:

Ideally there would be a more reliable way to verify the request's integrity.

A chain like this

processing -> long running -> released -> processing (without a long running transition)

that happens on the same worker would still allow a stale event to be treated as valid. However, I doubt this is a relevant scenario in practice.
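
To illustrate the concern with a hypothetical timeline (purely illustrative; not taken from the implementation):

# 1. Task "x" transitions processing -> long-running on worker A; the
#    long-running message is sent but delayed in transit.
# 2. "x" is released and later rescheduled, again on worker A, without a
#    new long-running transition.
# 3. The delayed message from step 1 arrives. Both guards pass even though
#    the message is stale:
assert ts.processing_on is not None          # "x" is processing again...
assert ts.processing_on.address == worker    # ...on the same worker A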

Comment on lines +1388 to +1391
# Assert that the handler did not fail and no state was corrupted
logs = caplog.getvalue()
assert not logs
assert not wsB.task_prefix_count

Member:

I would prefer a test that does not rely on logging. Is this corruption detectable with validate? (If not, can validate be extended to detect it?)

Member Author (hendrikmakait):

Good point, let me check.
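
As a rough idea of a log-free variant (a sketch only; assumes the gen_cluster scheduler is s and wsB = s.workers[b.address] as in the test, and that validate_state() detects, or is extended to detect, this kind of corruption):

# Let the scheduler's own invariant checks flag the corruption instead of
# asserting on captured log output:
s.validate_state()                 # raises if scheduler invariants are violated
assert not wsB.task_prefix_count   # no leftover bookkeeping for the stale task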

Comment on lines +1321 to +1332
# Submit task and wait until it executes on a
x = c.submit(
    f,
    block_secede,
    block_long_running,
    key="x",
    workers=[a.address],
)
await wait_for_state("x", "executing", a)

with captured_logger("distributed.scheduler", logging.ERROR) as caplog:
    with freeze_batched_send(a.batched_stream):

Member:

For review (and future maintainability) it might be helpful to briefly document in a sentence or two what the code below is constructing and asserting.

key="x",
workers=[a.address],
)
await wait_for_state("x", "executing", a)

Member:

Since you're already dealing with so many events above, why not use an event for this as well? Is it important to interrupt as soon as the task is in this state, i.e. before it's executed on the TPE?

Member Author (hendrikmakait):

Works for me, no strong preference between polling and adding yet another event. I felt this was a bit easier to read, but I guess YMMV.
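
For reference, a sketch of the event-based variant being suggested (the block_start event is hypothetical; the other names follow the test snippet below):

from distributed import Event

def f(block_start, block_secede, block_long_running):
    block_start.set()        # signal that the task has started executing
    block_secede.wait()
    distributed.secede()
    block_long_running.wait()

# In the test, the polling
#     await wait_for_state("x", "executing", a)
# would then become
#     await block_start.wait()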


def f(block_secede, block_long_running):
    block_secede.wait()
    distributed.secede()

Member:

Does this also trigger when using worker_client? secede is an API I typically discourage using, mostly because its counterpart rejoin is quite broken.

Member Author (hendrikmakait):

I strongly suspect it does. The original workload where this popped up had many clients connected to the scheduler.
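
For reference, a worker_client-based variant would look roughly like this (a sketch; worker_client() secedes on entry and rejoins on exit, so it should emit the same long-running signal):

from distributed import worker_client

def f(block_secede, block_long_running):
    block_secede.wait()
    # Entering worker_client() secedes from the worker's thread pool,
    # triggering the same long-running transition as distributed.secede().
    with worker_client():
        block_long_running.wait()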

distributed/tests/test_cancelled_state.py (resolved review threads)

@fjetter (Member) left a comment:

LGTM. If we can rewrite the test to use validate (or extend validate), that'd be great, but it's not a blocker.
