Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

move dagster dev tests to core #27654

Merged
merged 1 commit into from
Feb 10, 2025

Conversation

smackesey
Copy link
Collaborator

@smackesey smackesey commented Feb 7, 2025

Summary & Motivation

Our only dagster dev tests currently live in the daemon integration test suite, which isn't run on windows. This moves them to dagster, which is tested on windows.

The tests are mostly moved over as they were, with a few adjustments:

  • Instead of loading the toys repo, we now just load the simple test repo loaded in the rest of the tests. The toys repo adds a bunch of dependencies but loading it specifically doesn't really test anything relevant.
  • All tests have been augmented to ensure all child processes are shut down (on Unix at least-- see below)
  • Some refactoring to make the tests simpler to read.

Previously we weren't testing this on Windows at all, and I went in thinking we would end with exactly the same tests passing on windows and unix. Unfortunately, that's not quite true-- with our current set up I could not find a way to reliably and cleanly shut down the dagster dev process tree on Windows in a CI environment. This has to do with the differences in signal handling and process groups in Windows vs unix. I will quote ChatGPT-o1:

On Windows, CTRL_BREAK_EVENT (and similarly CTRL_C_EVENT) only works when the calling process and the target processes share a console. By default, Python’s subprocess.Popen with CREATE_NEW_PROCESS_GROUP spawns the child into its own new process group but does not automatically attach it to the same console as the parent (and in many CI environments, there may be no real console at all). As a result, sending CTRL_BREAK_EVENT to that child process group essentially does nothing—Windows never delivers the interrupt to those child processes, and they keep running.

The lack of console in the CI environment means that CTRL_BREAK_EVENT, which we use to shut down the child processes, does not propagate directly from dev to the webserver/daemon (but it does appear to work when dagster dev is used interactively, due to sharing a console). So in the Windows tests I ended up explicitly killing all child processes. We could maybe make this more robust in the future by having the dev process send a shutdown signal via another form of IPC, but I couldn't see how to do that here.

How I Tested These Changes

Ran moved dagster dev tests on windows and unix.

Copy link
Collaborator Author

smackesey commented Feb 7, 2025

This stack of pull requests is managed by Graphite. Learn more about stacking.

@smackesey smackesey force-pushed the windows/sean/move-dagster-dev-tests-to-core branch 5 times, most recently from 20c3b36 to 6250a23 Compare February 8, 2025 16:52
@smackesey smackesey force-pushed the windows/sean/move-dagster-dev-tests-to-core branch 17 times, most recently from 8841273 to e2d78a6 Compare February 10, 2025 04:24
@smackesey smackesey marked this pull request as ready for review February 10, 2025 05:11
@smackesey smackesey force-pushed the windows/sean/move-dagster-dev-tests-to-core branch from e2d78a6 to d3c450d Compare February 10, 2025 13:37
Copy link
Member

@alangenfeld alangenfeld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does moving these tests do to the runtime for the associated step in BK?

thanks for digging in to this - unfortunate that there is so much divergence for windows but im not sure theres any better way to deal with it

@@ -211,6 +211,14 @@ def interrupt_ipc_subprocess(proc: "Popen[Any]") -> None:
proc.send_signal(signal.SIGINT)


def interrupt_or_kill_ipc_subprocess(proc: "Popen[Any]", wait_time: int = 10) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: _then_ instead of _or_ maybe

@@ -35,7 +40,7 @@ commands =

api_tests: pytest -c ../../pyproject.toml -vv ./dagster_tests/api_tests {env:COVERAGE_ARGS} --durations 10 {posargs}
asset_defs_tests: pytest -c ../../pyproject.toml -vv ./dagster_tests/asset_defs_tests {env:COVERAGE_ARGS} --durations 10 {posargs}
cli_tests: pytest -c ../../pyproject.toml -vv ./dagster_tests/cli_tests {env:COVERAGE_ARGS} --durations 10 {posargs}
cli_tests: pytest -x -c ../../pyproject.toml -vv ./dagster_tests/cli_tests/ {env:COVERAGE_ARGS} --durations 10 {posargs}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clean this up i assume

Comment on lines 284 to 285
# Skip windows here-- it behaves strangely with respect to child processes and cmdline(). But we
# are still checking that all child processes shut down later.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the issue is we just cant exclude the tracker processes? whats up with cmdline? I think its worth being more specific here

Copy link
Collaborator Author

@smackesey smackesey Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I forgot to fully clean this up and had some questions for you about it:

  • I called these "tracker processes" as a kind of guess just to have a name but I'm not totally sure where they're coming from. I thought maybe from the watch threads of the workspace but those are threads and not separate processes. I think they're from the sensor that issues a RunRequest. They look like this and the number of them varies per test run (I've seen 0, 2, 3):
PID [69896] PPID [69883]: ['/Users/smackesey/stm/code/elementl/oss/.venv/bin/python3', '-c', 'from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=13, pipe_handle=22)', '--multiprocessing-fork']
  • I will definitely elaborate the error message. There are two issues on windows:
    • The transient nature of the "tracker processes" leads to situations where they exist at the time the child processes are gathered but are dead when we check the cmdline(). Checking cmdline() of a dead process throws an error on Windows (only).
    • For some reason there are ~2x the processes on Windows when running through tox. I think each process we're launching directly goes through a tox shim, which is it's own persistent process that invokes the "real" target process.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the multiprocessing module is used extra processes are spawned to track certain cross process data structures, so i think you named it those well. No idea whats up with the tox shim thing.

Having to skip out on some of this validation on windows is unfortunate but getting the functionality under tests is a big improvement I am not sure its worth the time it would take to try to improve past what you've cooked up here.

Best thing to do is just leave as much context as you can for the next poor soul who finds them self in this abyss.

@smackesey
Copy link
Collaborator Author

what does moving these tests do to the runtime for the associated step in BK?

Increases by a few minutes, but it's still not the slowest dagster BK step

unfortunate that there is so much divergence for windows but im not sure theres any better way to deal with it

One way would be to use dedicated IPC (pipe or socket) to send the shutdown signal to the child processes instead of signals. But I don't think there's an existing channel for this.

@alangenfeld
Copy link
Member

@smackesey smackesey merged commit c43dd21 into master Feb 10, 2025
13 of 14 checks passed
@smackesey smackesey deleted the windows/sean/move-dagster-dev-tests-to-core branch February 10, 2025 22:52
LoHertel pushed a commit to LoHertel/dagster that referenced this pull request Feb 11, 2025
## Summary & Motivation

Our only `dagster dev` tests currently live in the daemon integration
test suite, which isn't run on windows. This moves them to `dagster`,
which is tested on windows.

The tests are mostly moved over as they were, with a few adjustments:

- Instead of loading the toys repo, we now just load the simple test
repo loaded in the rest of the tests. The toys repo adds a bunch of
dependencies but loading it specifically doesn't really test anything
relevant.
- All tests have been augmented to ensure all child processes are shut
down (on Unix at least-- see below)
- Some refactoring to make the tests simpler to read.

Previously we weren't testing this on Windows at all, and I went in
thinking we would end with exactly the same tests passing on windows and
unix. Unfortunately, that's not _quite_ true-- with our current set up I
could not find a way to reliably and cleanly shut down the `dagster dev`
process tree on Windows in a CI environment. This has to do with the
differences in signal handling and process groups in Windows vs unix. I
will quote ChatGPT-o1:

>On Windows, CTRL_BREAK_EVENT (and similarly CTRL_C_EVENT) only works
when the calling process and the target processes share a console. By
default, Python’s `subprocess.Popen` with CREATE_NEW_PROCESS_GROUP
spawns the child into its own new process group but does not
automatically attach it to the same console as the parent (and in many
CI environments, there may be no real console at all). As a result,
sending CTRL_BREAK_EVENT to that child process group essentially does
nothing—Windows never delivers the interrupt to those child processes,
and they keep running.

The lack of console in the CI environment means that `CTRL_BREAK_EVENT`,
which we use to shut down the child processes, does not propagate
directly from `dev` to the webserver/daemon (but it does appear to work
when `dagster dev` is used interactively, due to sharing a console). So
in the Windows tests I ended up explicitly killing all child processes.
We could maybe make this more robust in the future by having the `dev`
process send a shutdown signal via another form of IPC, but I couldn't
see how to do that here.

## How I Tested These Changes

Ran moved `dagster dev` tests on windows and unix.
braunjj pushed a commit that referenced this pull request Feb 14, 2025
## Summary & Motivation

Our only `dagster dev` tests currently live in the daemon integration
test suite, which isn't run on windows. This moves them to `dagster`,
which is tested on windows.

The tests are mostly moved over as they were, with a few adjustments:

- Instead of loading the toys repo, we now just load the simple test
repo loaded in the rest of the tests. The toys repo adds a bunch of
dependencies but loading it specifically doesn't really test anything
relevant.
- All tests have been augmented to ensure all child processes are shut
down (on Unix at least-- see below)
- Some refactoring to make the tests simpler to read.

Previously we weren't testing this on Windows at all, and I went in
thinking we would end with exactly the same tests passing on windows and
unix. Unfortunately, that's not _quite_ true-- with our current set up I
could not find a way to reliably and cleanly shut down the `dagster dev`
process tree on Windows in a CI environment. This has to do with the
differences in signal handling and process groups in Windows vs unix. I
will quote ChatGPT-o1:

>On Windows, CTRL_BREAK_EVENT (and similarly CTRL_C_EVENT) only works
when the calling process and the target processes share a console. By
default, Python’s `subprocess.Popen` with CREATE_NEW_PROCESS_GROUP
spawns the child into its own new process group but does not
automatically attach it to the same console as the parent (and in many
CI environments, there may be no real console at all). As a result,
sending CTRL_BREAK_EVENT to that child process group essentially does
nothing—Windows never delivers the interrupt to those child processes,
and they keep running.

The lack of console in the CI environment means that `CTRL_BREAK_EVENT`,
which we use to shut down the child processes, does not propagate
directly from `dev` to the webserver/daemon (but it does appear to work
when `dagster dev` is used interactively, due to sharing a console). So
in the Windows tests I ended up explicitly killing all child processes.
We could maybe make this more robust in the future by having the `dev`
process send a shutdown signal via another form of IPC, but I couldn't
see how to do that here.

## How I Tested These Changes

Ran moved `dagster dev` tests on windows and unix.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants