Distributed.jl worker processes crash on Windows when importing ReverseDiff.jl #46441

Open
MilesCranmer opened this issue Aug 22, 2022 · 6 comments

@MilesCranmer
Member

MilesCranmer commented Aug 22, 2022

Distributed.jl worker processes crash when importing ReverseDiff.jl, and only on Windows. I haven't been able to find any explanation in ReverseDiff.jl itself, so I was hoping I could receive some assistance or clues here.

This bug has been observed on Julia 1.5 through 1.8. It only occurs on Windows (including windows-2019, windows-2022, and windows-latest). Ubuntu and macOS are unaffected, although see $^{[1]}$.

Other posts on this issue:

This bug can be reproduced with the following code, which dynamically allocates some worker processes, activates the current environment on each, and then imports a given package on each worker:

using Pkg, Distributed

"""Try to dynamically create workers, and import the package."""
function test(package_name)
    procs = addprocs(4)
    project_path = splitdir(Pkg.project().path)[1]
    # Import package on head worker:
    Base.MainInclude.eval(
        quote
            import $(Symbol(package_name))
        end
    )
    # Import package on worker:
    @everywhere procs begin
        Base.MainInclude.eval(
            quote
                using Pkg
                Pkg.activate($$project_path)
                import $(Symbol($package_name))
            end,
        )
    end
    rmprocs(procs)
end

packages_to_test = [
    "Distributed",  "JSON3", "LineSearches", "LinearAlgebra",
    "LossFunctions", "Optim", "Printf", "Random",
    "Reexport", "SpecialFunctions", "Zygote", "ReverseDiff",
]
for package_name in packages_to_test
    println("Testing $(package_name)...")
    test(package_name)
    println("Success!")
end

The only reliable $^{[1]}$ failure case here is Windows + ReverseDiff.jl. All other combinations of packages and operating systems work fine. Note also that the import on the head worker is required to trigger the crash: if the package is imported only on the worker processes, and not on the head worker, the error does not occur.
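For contrast, here is a minimal sketch of the worker-only variant described above, which reportedly does not crash. It is the same reproduction with the head-process import removed. The function name `test_workers_only`, the `nworkers` keyword, and the return value are hypothetical additions for illustration, not part of the original test.

```julia
using Pkg, Distributed

"""
Import `package_name` on freshly added workers only, never on the head process.
Returns the number of workers the import ran on without throwing.
(Hypothetical helper: name, keyword, and return value are illustrative.)
"""
function test_workers_only(package_name; nworkers=4)
    procs = addprocs(nworkers)
    project_path = splitdir(Pkg.project().path)[1]
    try
        # No import on the head process here: only the workers load the package.
        @everywhere procs begin
            Base.MainInclude.eval(
                quote
                    using Pkg
                    Pkg.activate($$project_path)
                    import $(Symbol($package_name))
                end,
            )
        end
        return length(procs)
    finally
        # Always clean up the workers, even if the import threw.
        rmprocs(procs)
    end
end
```

Calling `test_workers_only("ReverseDiff")` in an environment where ReverseDiff.jl is installed would exercise this path; comparing it against the full reproduction on the same Windows runner should show whether the head-process import is what makes the difference.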

You can see an example of this error here: https://github.com/MilesCranmer/SymbolicRegression.jl/runs/7957291344?check_suite_focus=true#step:6:296. All packages are successfully imported, except when it comes to ReverseDiff.jl, and only on Windows.

cc @rikhuijzer @ChrisRackauckas @mohamed82008

$^{[1]}$ For the first time, I also saw this occur on an Ubuntu test, again for ReverseDiff.jl. That failure is not consistent, though.

@mkitti
Contributor

mkitti commented Aug 22, 2022

The GitHub Actions output is likely to disappear after some time. Could you copy or otherwise attach the relevant output here?

@MilesCranmer
Member Author

Here is the copied output (Windows test):
Precompiling project...
  ✓ SafeTestsets
  1 dependency successfully precompiled in 1 seconds. 113 already precompiled.
     Testing Running tests...
Testing Distributed...
      From worker 3:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 4:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 5:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 2:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
Success!
Testing JSON3...
      From worker 9:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 7:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 6:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 8:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
Success!
Testing LineSearches...
      From worker 13:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 12:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 11:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 10:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
Success!
Testing LinearAlgebra...
      From worker 17:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 16:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 14:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 15:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
Success!
Testing LossFunctions...
      From worker 21:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 20:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 18:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 19:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
Success!
Testing Optim...
      From worker 22:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 24:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 25:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 23:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
Success!
Testing Printf...
      From worker 28:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 26:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 27:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 29:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
Success!
Testing Random...
      From worker 30:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 31:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 33:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 32:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
Success!
Testing Reexport...
      From worker 35:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 36:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 34:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 37:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
Success!
Testing SpecialFunctions...
      From worker 38:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 41:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 40:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 39:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
Success!
Testing Zygote...
      From worker 43:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 42:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 44:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
      From worker 45:	  Activating project at `C:\Users\runneradmin\AppData\Local\Temp\jl_rgb2uf`
Success!
Testing ReverseDiff...
Worker 48 terminated.
Worker 46 terminated.Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base .\stream.jl:410
  [2] (::Base.var"#wait_locked#679")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base .\stream.jl:944
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base .\stream.jl:950
  [4] unsafe_read
    @ .\io.jl:759 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base .\io.jl:758
  [6] read!
    @ .\io.jl:760 [inlined]
  [7] deserialize_hdr_raw
    @ C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed .\task.jl:484
│ Error: Error during package callback
│   exception =
│    1-element ExceptionStack:
│    ProcessExitedException(46)
│    
│    ...and 3 more exceptions.
│    
│    Stacktrace:
│      [1] sync_end(c::Channel{Any})
│        @ Base .\task.jl:436
│      [2] macro expansion
│        @ .\task.jl:455 [inlined]
│      [3] _require_callback(mod::Base.PkgId)
│        @ Distributed C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\Distributed.jl:77
│      [4] #invokelatest#2
│        @ .\essentials.jl:729 [inlined]
│      [5] invokelatest
│        @ .\essentials.jl:726 [inlined]
│      [6] run_package_callbacks(modkey::Base.PkgId)
│        @ Base .\loading.jl:869
│      [7] _require_prelocked(uuidkey::Base.PkgId)
│        @ Base .\loading.jl:1206
│      [8] macro expansion
│        @ .\loading.jl:1180 [inlined]
│      [9] macro expansion
│        @ .\lock.jl:223 [inlined]
│     [10] require(into::Module, mod::Symbol)
│        @ Base .\loading.jl:1144
│     [11] top-level scope
│        @ D:\a\SymbolicRegression.jl\SymbolicRegression.jl\test\runtests.jl:11
│     [12] eval
│        @ .\boot.jl:368 [inlined]
│     [13] eval
│        @ .\client.jl:478 [inlined]
│     [14] test(package_name::String)
│        @ Main D:\a\SymbolicRegression.jl\SymbolicRegression.jl\test\runtests.jl:9
│     [15] top-level scope
│        @ D:\a\SymbolicRegression.jl\SymbolicRegression.jl\test\runtests.jl:43
│     [16] include(fname::String)
│        @ Base.MainInclude .\client.jl:476
│     [17] top-level scope
│        @ none:6
│     [18] eval
│        @ .\boot.jl:368 [inlined]
│     [19] exec_options(opts::Base.JLOptions)
│        @ Base .\client.jl:276
│     [20] _start()
│        @ Base .\client.jl:522
└ @ Base loading.jl:874
Worker 47 terminated.Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base .\stream.jl:410
  [2] (::Base.var"#wait_locked#679")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base .\stream.jl:944
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base .\stream.jl:950
  [4] unsafe_read
    @ .\io.jl:759 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base .\io.jl:758
  [6] read!
    @ .\io.jl:760 [inlined]
  [7] deserialize_hdr_raw
    @ C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed .\task.jl:484
ERROR: 
LoadError: Worker 49 terminated.Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base .\stream.jl:410
  [2] (::Base.var"#wait_locked#679")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base .\stream.jl:944
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base .\stream.jl:950
  [4] unsafe_read
    @ .\io.jl:759 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base .\io.jl:758
  [6] read!
    @ .\io.jl:760 [inlined]
  [7] deserialize_hdr_raw
    @ C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed .\task.jl:484
ProcessExitedException
(Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base .\stream.jl:410
  [2] (::Base.var"#wait_locked#679")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base .\stream.jl:944
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base .\stream.jl:950
  [4] unsafe_read
    @ .\io.jl:759 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base .\io.jl:758
  [6] read!
    @ .\io.jl:760 [inlined]
  [7] deserialize_hdr_raw
    @ C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed .\task.jl:484
46)

...and 3 more exceptions.

Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base .\task.jl:436
 [2] macro expansion
   @ .\task.jl:455 [inlined]
 [3] remotecall_eval(m::Module, procs::Vector{Int64}, ex::Expr)
   @ Distributed C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\macros.jl:219
 [4] macro expansion
   @ C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Distributed\src\macros.jl:203 [inlined]
 [5] test(package_name::String)
   @ Main D:\a\SymbolicRegression.jl\SymbolicRegression.jl\test\runtests.jl:15
 [6] top-level scope
   @ D:\a\SymbolicRegression.jl\SymbolicRegression.jl\test\runtests.jl:43
 [7] include(fname::String)
   @ Base.MainInclude .\client.jl:476
 [8] top-level scope
   @ none:6
in expression starting at D:\a\SymbolicRegression.jl\SymbolicRegression.jl\test\runtests.jl:41
ERROR: Package SymbolicRegression errored during testing
Stacktrace:
 [1] pkgerror(msg::String)
   @ Pkg.Types C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Pkg\src\Types.jl:67
 [2] test(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; coverage::Bool, julia_args::Cmd, test_args::Cmd, test_fn::Nothing, force_latest_compatible_version::Bool, allow_earlier_backwards_compatible_versions::Bool, allow_reresolve::Bool)
   @ Pkg.Operations C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Pkg\src\Operations.jl:1813
 [3] test(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; coverage::Bool, test_fn::Nothing, julia_args::Cmd, test_args::Cmd, force_latest_compatible_version::Bool, allow_earlier_backwards_compatible_versions::Bool, allow_reresolve::Bool, kwargs::Base.Pairs{Symbol, IOContext{Base.PipeEndpoint}, Tuple{Symbol}, NamedTuple{(:io,), Tuple{IOContext{Base.PipeEndpoint}}}})
   @ Pkg.API C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Pkg\src\API.jl:431
 [4] test(pkgs::Vector{Pkg.Types.PackageSpec}; io::IOContext{Base.PipeEndpoint}, kwargs::Base.Pairs{Symbol, Bool, Tuple{Symbol}, NamedTuple{(:coverage,), Tuple{Bool}}})
   @ Pkg.API C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Pkg\src\API.jl:156
 [5] test(; name::Nothing, uuid::Nothing, version::Nothing, url::Nothing, rev::Nothing, path::Nothing, mode::Pkg.Types.PackageMode, subdir::Nothing, kwargs::Base.Pairs{Symbol, Bool, Tuple{Symbol}, NamedTuple{(:coverage,), Tuple{Bool}}})
   @ Pkg.API C:\hostedtoolcache\windows\julia\1.8.0\x64\share\julia\stdlib\v1.8\Pkg\src\API.jl:171
 [6] top-level scope
   @ none:1
Error: Process completed with exit code 1.

@MilesCranmer
Member Author

A more minimal example if one just wants to test ReverseDiff.jl alone is:

using Pkg, Distributed
import ReverseDiff

procs = addprocs(4)
project_path = splitdir(Pkg.project().path)[1]
@everywhere procs begin
    Base.MainInclude.eval(
        quote
            using Pkg
            Pkg.activate($$project_path)
            import ReverseDiff
        end,
    )
end

@mkitti
Contributor

mkitti commented Aug 22, 2022

Since the issue is currently specific to ReverseDiff, should that be mentioned in the title?

@MilesCranmer MilesCranmer changed the title from "Distributed.jl worker processes crash on Windows when importing package" to "Distributed.jl worker processes crash on Windows when importing ReverseDiff.jl" on Aug 22, 2022
@MilesCranmer
Member Author

Done.

Do you have access to a Windows machine? If so, it would be interesting to run a git bisect on the history of ReverseDiff.jl to see what change is breaking things.
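A rough sketch of a pass/fail check that such a bisect could use is below. It assumes the active project has a local ReverseDiff.jl checkout added via `Pkg.develop`, so that each bisect step loads the checked-out commit. The function name `workers_survive_import` and the `nworkers` keyword are hypothetical; the body just wraps the reproduction from this issue in a `try`/`catch`.

```julia
using Pkg, Distributed

"""
Return `true` if `package_name` imports on the head process and then on
`nworkers` fresh workers, and `false` if any step throws (for example a
`ProcessExitedException` from a crashed worker).
(Hypothetical helper, written for illustration only.)
"""
function workers_survive_import(package_name; nworkers=4)
    procs = addprocs(nworkers)
    project_path = splitdir(Pkg.project().path)[1]
    try
        # Import on the head process first, as in the reproduction above.
        Base.MainInclude.eval(
            quote
                import $(Symbol(package_name))
            end
        )
        # Then activate the project and import on each worker.
        @everywhere procs begin
            Base.MainInclude.eval(
                quote
                    using Pkg
                    Pkg.activate($$project_path)
                    import $(Symbol($package_name))
                end,
            )
        end
        return true
    catch err
        # Note: any failure counts as "bad", including a missing package.
        @warn "Import failed" exception = err
        return false
    finally
        rmprocs(procs)
    end
end
```

A script ending in `exit(workers_survive_import("ReverseDiff") ? 0 : 1)` could then be handed to `git bisect run`, which treats exit code 0 as good and 1 as bad.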

@mkitti
Contributor

mkitti commented Aug 22, 2022

I do have a Windows machine. Let's handle this on the package issue.
