Release loop lock before waiting for it to do work #369
Main thread can be blocked trying to acquire __loop_from_run_thread_lock while emit_event() in another thread is holding that lock and waiting for the main thread to emit the event. This change releases the lock before blocking. Signed-off-by: Shane Loretz <[email protected]>
Whether or not this turns out to resolve the issue, I admire and am grateful for @sloretz's determination. Based on the PR description and my own experience investigating these hangs, this looks very promising. I'd like to get an additional review from either @hidmic or @wjwwood, the two folks who I think are otherwise closest to this code right now, and I am waiting with fingers and toes crossed for CI results.
Great finding!
One of the CI runs hung, but the patch looks good.
Hats off @sloretz, this is definitely a step forward. LGTM!
I reran Windows CI with build type Debug, and it didn't hang :( https://ci.ros2.org/job/ci_windows/8990
It will be hard to tell what's going on without debug symbols. In job 8984 there were 5 threads. One seemed to be created by WinDbg itself. 3 threads had very short stack traces blocked in
In short, I have no idea why Windows is hanging. I think this patch is an improvement, so unless there are objections by end of day I'll merge this and look at the Windows hang separately.
Main thread can be blocked trying to acquire __loop_from_run_thread_lock while emit_event() in another thread is holding that lock and waiting for the main thread to emit the event. This change releases the lock before blocking. Signed-off-by: Shane Loretz <[email protected]> Signed-off-by: Ivan Santiago Paunovic <[email protected]>
* Handle case where output buffer is closed during shutdown (#365)
  * Handle case where output buffer is closed during shutdown
    - Prevent crash during launch shutdown when a process IO event happens after the buffers have been closed
    - Use unbuffered output in that case so IO still has a chance of being seen
    Signed-off-by: Pete Baughman <[email protected]>
  * Address MR feedback
    Signed-off-by: Pete Baughman <[email protected]>
  Signed-off-by: Ivan Santiago Paunovic <[email protected]>
* Import test file without contaminating sys.modules (#360)
  Signed-off-by: Pete Baughman <[email protected]>
  Signed-off-by: Ivan Santiago Paunovic <[email protected]>
* Release loop lock before waiting for it to do work (#369)
  Main thread can be blocked trying to acquire __loop_from_run_thread_lock while emit_event() in another thread is holding that lock and waiting for the main thread to emit the event. This change releases the lock before blocking.
  Signed-off-by: Shane Loretz <[email protected]>
  Signed-off-by: Ivan Santiago Paunovic <[email protected]>
Co-authored-by: Peter Baughman <[email protected]>
Co-authored-by: Shane Loretz <[email protected]>
The main thread can be blocked trying to acquire `__loop_from_run_thread_lock`, but `emit_event()` holds that lock while waiting for a future that can only be completed by the main thread. This change releases the lock before blocking when emitting an event.

I expect this to fix ros2/build_farmer#248. This is the situation in the `python3 setup.py pytest` process that is hung on this CI job. There are 5 threads. The main thread is blocked as above, and four threads are all simultaneously trying to call `emit_event()`. One of them is holding the lock and blocked waiting for the future, while the other 3 are blocked trying to acquire `__loop_from_run_thread_lock`. I have no idea why the hang started occurring so regularly, since it looks like a race condition. The fact that the CI job seems to require the December 10th fastrtps and rmw_connext changes plus all other tests to be run beforehand seems to be a coincidence.

edit: gdb `py-bt` output from all threads https://gist.github.com/sloretz/6091ef621b9244579ec557ecb95b7a39
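
For readers who want to see the pattern in isolation, here is a minimal sketch of the idea described above. It is not the code from this PR: `MiniService`, `_loop_lock`, `_loop`, and `demo()` are hypothetical names made up for illustration. It assumes a loop owned by the main thread and worker threads that hand events to it with `asyncio.run_coroutine_threadsafe()`. The worker only holds the lock long enough to schedule the coroutine, then releases it before blocking on the future.

```python
import asyncio
import threading


class MiniService:
    """Hypothetical stand-in for a service whose event loop runs in the main thread."""

    def __init__(self):
        self._loop_lock = threading.Lock()  # guards self._loop
        self._loop = None
        self.events = []

    async def _emit(self, event):
        # Runs in the loop (main) thread.
        self.events.append(event)

    def emit_event(self, event):
        """Called from worker threads."""
        future = None
        with self._loop_lock:
            if self._loop is not None:
                # Scheduling does not block, so the lock is held only briefly.
                future = asyncio.run_coroutine_threadsafe(self._emit(event), self._loop)
            else:
                # Loop is not running: record the event synchronously.
                self.events.append(event)
        # The lock is released here, BEFORE blocking on the future.
        # Calling future.result() inside the `with` block above would deadlock
        # as soon as the loop thread tries to take the same lock.
        if future is not None:
            future.result(timeout=5)

    async def _run(self, workers):
        with self._loop_lock:
            self._loop = asyncio.get_running_loop()
        for worker in workers:
            worker.start()
        while any(worker.is_alive() for worker in workers):
            # The loop thread also needs the lock from time to time.
            with self._loop_lock:
                pass
            # Yield so the scheduled _emit() coroutines can run.
            await asyncio.sleep(0.01)
        with self._loop_lock:
            self._loop = None

    def run(self, workers):
        asyncio.run(self._run(workers))


def demo():
    service = MiniService()
    # Four workers emitting at once, mirroring the four threads in the hung process.
    workers = [
        threading.Thread(target=service.emit_event, args=(i,)) for i in range(4)
    ]
    service.run(workers)
    return sorted(service.events)


if __name__ == "__main__":
    print(demo())
```

If `future.result()` were moved inside the `with self._loop_lock:` block, the loop thread could block on the lock while the worker waits for a future that only the loop thread can complete, which is the hang described in this PR.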