Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Release loop lock before waiting for it to do work #369

Merged
merged 1 commit into from
Jan 17, 2020

Conversation

sloretz
Copy link
Contributor

@sloretz sloretz commented Jan 15, 2020

The main thread can be blocked trying to acquire __loop_from_run_thread_lock, but emit_event() holds that lock while waiting for a future that can only be completed by the main thread. This change releases the lock before blocking when emitting an event.

I expect this to fix ros2/build_farmer#248 . This is the situation in the python3 setup.py pytest process that is hung on this CI job. There are 5 threads. The main thread is blocked as above, and four threads are all simultaneously trying to call emit_event(). One of them is holding the lock and blocked wating for the future, while the other 3 are blocked trying to acquire __loop_from_run_thread_lock. I have no idea why the hang started occurring so regularly since it looks like it's a race condition. The fact that the CI job seems to require the December 10th fastrtps and rmw_connext changes plus all other tests to be run beforehand seems to be coincidence.

edit: gdb py-bt output from all threads https://gist.github.com/sloretz/6091ef621b9244579ec557ecb95b7a39

Main thread can be blocked trying to acquire
__loop_from_run_thread_lock while emit_event() in another thread is
holding that lock and waiting for the main thread to emit the event.
This change releases the lock before blocking.

Signed-off-by: Shane Loretz <[email protected]>
@sloretz sloretz added the bug Something isn't working label Jan 15, 2020
@sloretz sloretz self-assigned this Jan 15, 2020
@sloretz
Copy link
Contributor Author

sloretz commented Jan 15, 2020

CI (build all test all, because that's what is most likely to hang if this does not fix the issue)

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

Copy link
Member

@nuclearsandwich nuclearsandwich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether or not this turns out to resolve the issue I am admiring and grateful for @sloretz's determination. Based on the PR description and my own experience investigating these hangs this looks very promising. I'd like to get an additional review from either @hidmic or @wjwwood as the two folks who I think are otherwise closest to this code right now and am waiting with fingers and toes crossed for CI results.

Copy link
Member

@ivanpauno ivanpauno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great finding!
One of the CI runs hung, but the patch looks good.

Copy link
Contributor

@hidmic hidmic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hats off @sloretz, this is definitely a step forward. LGTM!

@sloretz
Copy link
Contributor Author

sloretz commented Jan 16, 2020

I reran windows CI with build type Debug, and it didn't hang :( https://ci.ros2.org/job/ci_windows/8990

It will be hard to tell what's going on without debug symbols. In the 8984 there were 5 threads. One seemed to be created by WinDbg itself. 3 threads had very very short stack traces blocked in openblas_read_env leading to NtWaitForMultipleObjects, and the main thread had a long stack trace in cpython leading to PyMethodDef_RawFastCallKeywords to _overlapped to GetQueuedCompletionStatus to NtRemoveIoCompletion.

In short, I have no idea why windows is hanging. I think this patch is an improvement, so unless there are objections by end of day I'll merge this and look at the windows hang separately.

@sloretz sloretz merged commit 4333910 into master Jan 17, 2020
@delete-merged-branch delete-merged-branch bot deleted the sloretz/async_emit_event branch January 17, 2020 17:06
ivanpauno pushed a commit that referenced this pull request Jan 17, 2020
Main thread can be blocked trying to acquire
__loop_from_run_thread_lock while emit_event() in another thread is
holding that lock and waiting for the main thread to emit the event.
This change releases the lock before blocking.

Signed-off-by: Shane Loretz <[email protected]>
ivanpauno pushed a commit that referenced this pull request Jan 20, 2020
Main thread can be blocked trying to acquire
__loop_from_run_thread_lock while emit_event() in another thread is
holding that lock and waiting for the main thread to emit the event.
This change releases the lock before blocking.

Signed-off-by: Shane Loretz <[email protected]>
Signed-off-by: Ivan Santiago Paunovic <[email protected]>
ivanpauno added a commit that referenced this pull request Jan 21, 2020
* Handle case where output buffer is closed during shutdown (#365)

* Handle case where output buffer is closed during shutdown

  - Prevent crash during launch shutdown when a process IO event happens
    after the buffers have been closed
  - Use unbuffered output in that case so IO still has a chance of being seen

Signed-off-by: Pete Baughman <[email protected]>

* Address MR feedback

Signed-off-by: Pete Baughman <[email protected]>
Signed-off-by: Ivan Santiago Paunovic <[email protected]>

* Import test file without contaminating sys.modules (#360)

Signed-off-by: Pete Baughman <[email protected]>
Signed-off-by: Ivan Santiago Paunovic <[email protected]>

* Release loop lock before waiting for it to do work (#369)

Main thread can be blocked trying to acquire
__loop_from_run_thread_lock while emit_event() in another thread is
holding that lock and waiting for the main thread to emit the event.
This change releases the lock before blocking.

Signed-off-by: Shane Loretz <[email protected]>
Signed-off-by: Ivan Santiago Paunovic <[email protected]>

Co-authored-by: Peter Baughman <[email protected]>
Co-authored-by: Shane Loretz <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tests hanging during ROS 2 CI tests
5 participants