Avoid handling stale long-running messages on scheduler #8991

Open
wants to merge 3 commits into main

Conversation

hendrikmakait (Member)

  • Tests added / passed
  • Passes pre-commit run --all-files

Comment on lines +6067 to +6072
if worker not in self.workers:
    logger.debug(
        "Received long-running signal from unknown worker %s. Ignoring.", worker
    )
    return

Member Author (hendrikmakait):

This is mostly for good measure; I think the code should also work without it.

Comment on lines +6092 to +6095
steal = self.extensions.get("stealing")
if steal is not None:
    steal.remove_key_from_stealable(ts)

Member Author (hendrikmakait):

I haven't tested the move of this code, but I'm certain that we should deal with staleness before taking any meaningful actions.
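
For context, a minimal sketch of the ordering being discussed, assuming a simplified handler signature (the names follow the snippets above; surrounding scheduler code is elided):

def handle_long_running(self, key, worker, compute_duration, stimulus_id):
    # Staleness checks first: bail out before mutating any scheduler state.
    if worker not in self.workers:
        return  # unknown (e.g. already-removed) worker
    ts = self.tasks.get(key)
    if ts is None or ts.processing_on is None:
        return  # task no longer known or not processing anywhere
    if ts.processing_on.address != worker:
        return  # signal from a worker that is no longer running the task

    # Only now take meaningful actions, e.g. updating the stealing extension.
    steal = self.extensions.get("stealing")
    if steal is not None:
        steal.remove_key_from_stealable(ts)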

Member:

yes, absolutely

Contributor:

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    27 files  ±0       27 suites  ±0    11h 37m 24s ⏱️ +7m 42s
 4 117 tests  +1    4 000 ✅  −2      111 💤 ±0   6 ❌ +3
51 628 runs  +13   49 325 ✅ +10    2 296 💤 ±0   7 ❌ +3

For more details on these failures, see this check.

Results for commit 88c42f7. ± Comparison against base commit 49f5e74.


ws = ts.processing_on
if ws is None:
    logger.debug("Received long-running signal from duplicate task. Ignoring.")
    return

if ws.address != worker:

Member:

Ideally there would be a more reliable way to verify the request's integrity.

A chain like this

processing -> long running -> released -> processing (without a long running transition)

that happens on the same worker would still allow a stale event to be treated as valid. However, I doubt this is a relevant scenario in practice.
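
To illustrate the concern with a hypothetical timeline (purely illustrative; not taken from the implementation):

# 1. Task "x" transitions processing -> long-running on worker A; the
#    long-running message is sent but delayed in transit.
# 2. "x" is released and later rescheduled, again on worker A, without a
#    new long-running transition.
# 3. The delayed message from step 1 arrives. Both guards pass even though
#    the message is stale:
assert ts.processing_on is not None          # "x" is processing again...
assert ts.processing_on.address == worker    # ...on the same worker A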

Comment on lines +1388 to +1391
# Assert that the handler did not fail and no state was corrupted
logs = caplog.getvalue()
assert not logs
assert not wsB.task_prefix_count

Member:

I would prefer a test that does not rely on logging. Is this corruption detectable with validate? (If not, can validate be extended to detect it?)

Member Author (hendrikmakait):

Good point, let me check.
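
As a rough idea of a log-free variant (a sketch only; assumes the gen_cluster scheduler is s and wsB = s.workers[b.address] as in the test, and that validate_state() detects, or is extended to detect, this kind of corruption):

# Let the scheduler's own invariant checks flag the corruption instead of
# asserting on captured log output:
s.validate_state()                 # raises if scheduler invariants are violated
assert not wsB.task_prefix_count   # no leftover bookkeeping for the stale task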

Comment on lines +1321 to +1332
# Submit task and wait until it executes on a
x = c.submit(
    f,
    block_secede,
    block_long_running,
    key="x",
    workers=[a.address],
)
await wait_for_state("x", "executing", a)

with captured_logger("distributed.scheduler", logging.ERROR) as caplog:
    with freeze_batched_send(a.batched_stream):

Member:

For review (and future maintainability) it might be helpful to briefly document in a sentence or two what the code below is constructing and asserting.

key="x",
workers=[a.address],
)
await wait_for_state("x", "executing", a)

Member:

Since you're already dealing with so many events above, why not use an event for this as well? Is it important to interrupt as soon as the task is in this state, i.e. before it's executed on the TPE?

Member Author (hendrikmakait):

Works for me, no strong preference between polling and adding yet another event. I felt this was a bit easier to read, but I guess YMMV.
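
For reference, a sketch of the event-based variant being suggested (the block_start event is hypothetical; the other names follow the test snippet below):

from distributed import Event

def f(block_start, block_secede, block_long_running):
    block_start.set()        # signal that the task has started executing
    block_secede.wait()
    distributed.secede()
    block_long_running.wait()

# In the test, the polling
#     await wait_for_state("x", "executing", a)
# would then become
#     await block_start.wait()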


def f(block_secede, block_long_running):
    block_secede.wait()
    distributed.secede()

Member:

Does this also trigger when using worker_client? secede is an API I typically discourage using, mostly because its counterpart rejoin is quite broken.

Member Author (hendrikmakait):

I strongly suspect it does. The original workload where this popped up had many clients connected to the scheduler.
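
For reference, a worker_client-based variant would look roughly like this (a sketch; worker_client() secedes on entry and rejoins on exit, so it should emit the same long-running signal):

from distributed import worker_client

def f(block_secede, block_long_running):
    block_secede.wait()
    # Entering worker_client() secedes from the worker's thread pool,
    # triggering the same long-running transition as distributed.secede().
    with worker_client():
        block_long_running.wait()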

distributed/tests/test_cancelled_state.py (resolved review threads)

@fjetter (Member) left a comment:

LGTM. If we can rewrite the test to use validate (or extend validate), that'd be great, but it's not a blocker.
