Expressly destroy a node's objects before the node. #456
Conversation
This seems to reduce hangs during test runs described in ros2/build_farmer#248. The handles corresponding to the destroyed objects *should* be getting destroyed explicitly when self.handle.destroy() is called below. It seems however that when running with Fast-RTPS it's possible to get into a state where multiple threads are waiting on futexes and none can move forward. The rclpy context of this apparent deadlock is while clearing a node's list of publishers or services (possibly others, although publishers and services were the only ones observed).

I consider this patch to be a workaround rather than a fix. I think there may either be a race condition between the rcl/rmw layer and the rmw implementation layer which is being tripped by the haphazardness of Python's garbage collector, or there is a logical problem with the handle destruction ordering in rclpy that only Fast-RTPS is sensitive to.

Signed-off-by: Steven! Ragnarök <[email protected]>
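To make the intended ordering concrete, here is a minimal sketch (not the patch itself) assuming the standard rclpy Node API (create_publisher/create_service, destroy_publisher/destroy_service, destroy_node) and the std_msgs/std_srvs interface packages; the node, topic, and service names are made up for illustration:

```python
# Illustrative sketch only: finalize the node's objects explicitly, then
# destroy the node, so the underlying rcl/rmw handles are not left to the
# garbage collector's timing.
import rclpy
from rclpy.node import Node
from std_msgs.msg import String
from std_srvs.srv import Empty

rclpy.init()
node = Node('destruction_order_example')
pub = node.create_publisher(String, 'chatter', 10)
srv = node.create_service(Empty, 'noop', lambda request, response: response)

# Destroy the node's objects first ...
node.destroy_publisher(pub)
node.destroy_service(srv)

# ... and only then the node itself, which ends with self.handle.destroy().
node.destroy_node()
rclpy.shutdown()
```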
force-pushed from 253f18e to 30600a1
This doesn't seem outrageous. It does seem like there may be a more elegant solution, but it probably requires some arcane knowledge of Python garbage collection's interactions with […]
The new logic seems to remove items from the collections while iterating over them, which shouldn't be done. The loops should be changed into […]
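For illustration, a small Python sketch (the names are made up, not rclpy's) of why mutating a list while iterating over it misbehaves, and a drain-style loop that avoids it:

```python
# Hypothetical example: removing elements while iterating skips items,
# because the iterator's index keeps advancing over the shrinking list.
items = ['a', 'b', 'c', 'd']
for item in items:
    items.remove(item)
print(items)  # ['b', 'd'] -- half the items were never visited

# Draining the list instead keeps iteration and mutation separate.
items = ['a', 'b', 'c', 'd']
while items:
    item = items.pop()
    ...  # destroy / clean up item here
print(items)  # []
```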
force-pushed from 97d3ccf to 0337867
Don't pre-emptively remove items from Node lists. As pointed out by Shane, pop()ing each item from the list before passing it to the .destroy_ITEM() method prevents it from being destroyed, as the individual methods first check that the item is present in the list, then remove it, before continuing to destroy it.
Signed-off-by: Steven! Ragnarök <[email protected]>
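A hedged sketch of the interaction described in that commit message; the method and attribute names echo rclpy's public API, but the class bodies are simplified stand-ins, not the real implementation:

```python
# Simplified stand-in for rclpy's Node destruction methods (illustrative only).
class FakeNode:
    def __init__(self):
        self.publishers = []

    def destroy_publisher(self, publisher):
        # The real methods first check membership, then remove, then destroy.
        if publisher not in self.publishers:
            return False            # unknown publisher: nothing is destroyed
        self.publishers.remove(publisher)
        publisher.destroyed = True  # stands in for destroying the rcl handle
        return True

class FakePublisher:
    destroyed = False

node = FakeNode()
pub = FakePublisher()
node.publishers.append(pub)

# Pop()ing first defeats the membership check, so nothing would be destroyed:
#   stale = node.publishers.pop()
#   node.destroy_publisher(stale)   # returns False; stale.destroyed stays False
# Passing the still-registered object lets the method remove *and* destroy it:
assert node.destroy_publisher(pub) is True and pub.destroyed
```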
LGTM as a temporary workaround, though I don't know what causes the hang or how this works around it.
I've also been dissatisfied with this solution. After discussion with @sloretz I've started another experiment running with Python's GC disabled before 1461 and enabled after self.handle.destroy(). That has also seemingly stopped the hang from occurring over the last ~50 test runs, which means we might need to look more closely at the handle destruction logic.
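A rough sketch of that experiment as described (the wrapper function is made up for illustration; the actual change was presumably inline in rclpy rather than a helper like this):

```python
# Illustrative only: keep Python's cyclic garbage collector quiet while the
# node's rcl handles are being finalized, then re-enable it and collect at
# an explicit, known-safe point.
import gc

def destroy_node_with_gc_paused(node):
    gc.disable()                # no GC passes while handles are torn down
    try:
        node.destroy_node()     # internally ends with self.handle.destroy()
    finally:
        gc.enable()
        gc.collect()            # trigger collection at an explicit point
```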
I ran experimental tests most of the day Saturday and was able to reproduce a hang using this branch of rclpy with rmw_cyclonedds_cpp as well, although it's in a slightly different place. In this gdb log, thread 2 is stuck during capsule destruction, waiting on a futex down in cyclonedds.
@hidmic and I dove into this today and we couldn't pinpoint a reason for the hang. It is still possible to hang even with this PR, although the incidence rate is lower. We did a couple of experiments trying to reproduce the hang outside of launch tests using multiple threads and rclpy, but were unable to do so. I do have a bit more detail from the current reproduction, since the destruction doesn't occur as part of a garbage collection. In Python, we're inside Handle.__destroy_self,
and in the C context we can see the pycapsule destructor calling rcl_publisher_fini, which is carrying us into Fast-RTPS internals where we're stuck waiting for a lock:
This PR still improves matters, but it's not a full solution.
In discussion yesterday we've decided to merge this as an incremental improvement. I'll leave ros2/build_farmer#248 open or create a more focused issue to continue investigation.
This seems to reduce hangs during test runs described in
ros2/build_farmer#248.
Another few hours of testing will convince me to strengthen that statement.
The handles corresponding to the destroyed objects should be getting
destroyed explicitly when self.handle.destroy() is called below. It
seems however that when running with Fast-RTPS it's possible to get into
a state where multiple threads are waiting on futexes and none can move
forward. The rclpy context of this apparent deadlock is while clearing
a node's list of publishers or services (possibly others, although
publishers and services were the only ones observed).
I consider this patch to be a workaround rather than a fix, as I'm not particularly proud of how little I understand why the existing destruction handling is insufficient.
I think there may either be a race condition between the rcl/rmw layer
and the rmw implementation layer which is being tripped by the
haphazardness of Python's garbage collector or there is a logical
problem with the handle destruction ordering in rclpy that only
Fast-RTPS is sensitive to.

A further possibility is to play around with pausing the garbage collector and triggering it at explicit points to see if that changes results at all.