Expressly destroy a node's objects before the node. #456
Conversation
This seems to reduce hangs during test runs described in ros2/build_farmer#248. The handles corresponding to the destroyed objects *should* be getting destroyed explicitly when self.handle.destroy() is called below. It seems however that when running with Fast-RTPS it's possible to get into a state where multiple threads are waiting on futexes and none can move forward. The rclpy context of this apparent deadlock is while clearing a node's list of publishers or services (possibly others, although publishers and services were the only ones observed).

I consider this patch to be a workaround rather than a fix. I think there may either be a race condition between the rcl/rmw layer and the rmw implementation layer which is being tripped by the haphazardness of Python's garbage collector, or there is a logical problem with the handle destruction ordering in rclpy that only Fast-RTPS is sensitive to.

Signed-off-by: Steven! Ragnarök <[email protected]>
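To make the intended ordering concrete, here is a minimal sketch (not the patch itself) assuming the standard rclpy Node API (create_publisher/create_service, destroy_publisher/destroy_service, destroy_node) and the std_msgs/std_srvs interface packages; the node, topic, and service names are made up for illustration:

```python
# Illustrative sketch only: finalize the node's objects explicitly, then
# destroy the node, so the underlying rcl/rmw handles are not left to the
# garbage collector's timing.
import rclpy
from rclpy.node import Node
from std_msgs.msg import String
from std_srvs.srv import Empty

rclpy.init()
node = Node('destruction_order_example')
pub = node.create_publisher(String, 'chatter', 10)
srv = node.create_service(Empty, 'noop', lambda request, response: response)

# Destroy the node's objects first ...
node.destroy_publisher(pub)
node.destroy_service(srv)

# ... and only then the node itself, which ends with self.handle.destroy().
node.destroy_node()
rclpy.shutdown()
```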
force-pushed from 253f18e to 30600a1
This doesn't seem outrageous. It does seem like there may be a more elegant solution, but it probably requires some arcane knowledge of Python garbage collection's interactions with […]
The new logic seems to remove items from the collections while iterating over them, which shouldn't be done. The loops should be changed into […]
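For illustration, a small Python sketch (the names are made up, not rclpy's) of why mutating a list while iterating over it misbehaves, and a drain-style loop that avoids it:

```python
# Hypothetical example: removing elements while iterating skips items,
# because the iterator's index keeps advancing over the shrinking list.
items = ['a', 'b', 'c', 'd']
for item in items:
    items.remove(item)
print(items)  # ['b', 'd'] -- half the items were never visited

# Draining the list instead keeps iteration and mutation separate.
items = ['a', 'b', 'c', 'd']
while items:
    item = items.pop()
    ...  # destroy / clean up item here
print(items)  # []
```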
force-pushed from 97d3ccf to 0337867
Don't pre-emptively remove items from Node lists. As pointed out by Shane, pop()ing each item from the list before passing it to the .destroy_ITEM() method prevents it from being destroyed, as the individual methods first check that the item is present in the list, then remove it, before continuing to destroy it.
Signed-off-by: Steven! Ragnarök <[email protected]>
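A hedged sketch of the interaction described in that commit message; the method and attribute names echo rclpy's public API, but the class bodies are simplified stand-ins, not the real implementation:

```python
# Simplified stand-in for rclpy's Node destruction methods (illustrative only).
class FakeNode:
    def __init__(self):
        self.publishers = []

    def destroy_publisher(self, publisher):
        # The real methods first check membership, then remove, then destroy.
        if publisher not in self.publishers:
            return False            # unknown publisher: nothing is destroyed
        self.publishers.remove(publisher)
        publisher.destroyed = True  # stands in for destroying the rcl handle
        return True

class FakePublisher:
    destroyed = False

node = FakeNode()
pub = FakePublisher()
node.publishers.append(pub)

# Pop()ing first defeats the membership check, so nothing would be destroyed:
#   stale = node.publishers.pop()
#   node.destroy_publisher(stale)   # returns False; stale.destroyed stays False
# Passing the still-registered object lets the method remove *and* destroy it:
assert node.destroy_publisher(pub) is True and pub.destroyed
```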
LGTM as a temporary workaround, though I don't know what causes the hang or how this works around it.
I've also been dissatisfied with this solution. After discussion with @sloretz I've started another experiment running with Python's GC disabled before 1461 and enabled after self.handle.destroy(). That has also seemingly stopped the hang from occurring over the last ~50 test runs, which means we might need to look more closely at the handle destruction logic.
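A rough sketch of that experiment as described (the wrapper function is made up for illustration; the actual change was presumably inline in rclpy rather than a helper like this):

```python
# Illustrative only: keep Python's cyclic garbage collector quiet while the
# node's rcl handles are being finalized, then re-enable it and collect at
# an explicit, known-safe point.
import gc

def destroy_node_with_gc_paused(node):
    gc.disable()                # no GC passes while handles are torn down
    try:
        node.destroy_node()     # internally ends with self.handle.destroy()
    finally:
        gc.enable()
        gc.collect()            # trigger collection at an explicit point
```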
I ran experimental tests most of the day Saturday and was able to reproduce a hang using this branch of rclpy with rmw_cyclonedds_cpp as well, although it's in a slightly different place. In this gdb log, thread 2 is stuck during capsule destruction, waiting on a futex down in cyclonedds.
@hidmic and I dove into this today and we couldn't pinpoint a reason for the hang. It is still possible to hang even with this PR, although the incidence rate is lower. We did a couple of experiments trying to reproduce the hang outside of launch tests using multiple threads and rclpy, but were unable to do so. I do have a bit more detail from the current reproduction, since the destruction doesn't occur as part of a garbage collection. In Python, we're inside Handle.__destroy_self,
and in the C context we can see the pycapsule destructor calling rcl_publisher_fini, which is carrying us into Fast-RTPS internals where we're stuck waiting for a lock:
This PR still improves matters, but it's not a full solution.
In discussion yesterday we've decided to merge this as an incremental improvement. I'll leave ros2/build_farmer#248 open or create a more focused issue to continue investigation.
This seems to reduce hangs during test runs described in
ros2/build_farmer#248.
Another few hours of testing will convince me to strengthen that statement.
The handles corresponding to the destroyed objects should be getting
destroyed explicitly when self.handle.destroy() is called below. It
seems however that when running with Fast-RTPS it's possible to get into
a state where multiple threads are waiting on futexes and none can move
forward. The rclpy context of this apparent deadlock is while clearing
a node's list of publishers or services (possibly others, although
publishers and services were the only ones observed).
I consider this patch to be a workaround rather than a fix, as I'm not particularly proud of how little I understand why the existing destruction handling is insufficient.
I think there may either be a race condition between the rcl/rmw layer
and the rmw implementation layer which is being tripped by the
haphazardness of Python's garbage collector or there is a logical
problem with the handle destruction ordering in rclpy that only
Fast-RTPS is sensitive to.

A further possibility is to play around with pausing the garbage collector and triggering it at explicit points to see if that changes results at all.