Tests hanging during ROS 2 CI tests #248
More hanging builds came in over the weekend.
On Linux I was able to strace the pytest process, which isn't reaping its children. I didn't see anything else that yielded solid breadcrumbs. It's possible that ros2/launch#279 is not playing well with an update to pytest (5.2.2 was released 24 October). Now that we've seen the problem on Linux I can try to reproduce it locally.
I was able to reproduce the issue in a local container and attach gdb to one stuck process. I captured some gdb output in this gist: https://gist.github.com/nuclearsandwich/c845d073fa854ce020c8d42453d00faa. I don't yet have a good instinct for how to reliably trigger and minimize the reproduction, but it does look like the hang is in rclpy when used with Fast-RTPS, and the test usage is just hitting it.
Yesterday was light on failures. Today is heavy.
I've had a few more reproductions locally and all of them have backtraces nearly identical to the one I captured. I have been running the tests in a loop for half a day and have been unable to reproduce them with cyclonedds, the only rmw implementation besides fastrtps I have available in the repro container. I don't know whether there is an issue with Fast-RTPS itself or with how we're using it. It's starting to remind me a lot of ros2/launch#89. Could it be that we need to explicitly destroy publishers before clearing them and destroying the node like this?
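A rough sketch of that idea, assuming only the standard rclpy entry points (create_publisher, destroy_publisher, destroy_node) and illustrative names, would be to tear down dependent entities explicitly before the node:

```python
# Sketch only: explicit teardown order for an rclpy node; the node and
# topic names are illustrative, not taken from the failing tests.
import rclpy
from std_msgs.msg import String

rclpy.init()
node = rclpy.create_node('teardown_demo')
pub = node.create_publisher(String, 'chatter', 10)

# Destroy dependent entities explicitly before the node itself instead of
# relying on Python's garbage collector to do it in an arbitrary order.
node.destroy_publisher(pub)
node.destroy_node()
rclpy.shutdown()
```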
* Expressly destroy a node's objects before the node. This seems to reduce hangs during test runs described in ros2/build_farmer#248. The handles corresponding to the destroyed objects *should* be getting destroyed explicitly when self.handle.destroy() is called below. It seems however that when running with Fast-RTPS it's possible to get into a state where multiple threads are waiting on futexes and none can move forward. The rclpy context of this apparent deadlock is while clearing a node's list of publishers or services (possibly others, although publishers and services were the only ones observed). I consider this patch to be a workaround rather than a fix. I think there may either be a race condition between the rcl/rmw layer and the rmw implementation layer which is being tripped by the haphazardness of Python's garbage collector, or there is a logical problem with the handle destruction ordering in rclpy that only Fast-RTPS is sensitive to.

* Don't pre-emptively remove items from Node lists. As pointed out by Shane, pop()ing each item from the list before passing it to the .destroy_ITEM() method prevents it from being destroyed, as the individual methods first check that the item is present in the list, then remove it, before continuing to destroy it.

Signed-off-by: Steven! Ragnarök <[email protected]>
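To make the pop() pitfall in the second bullet concrete, here is an illustrative sketch (hypothetical names, not the actual rclpy source): the destroy method refuses to act on an item the node no longer tracks, so popping first turns the destroy call into a no-op.

```python
# Illustrative only; mimics the membership check described above, not the
# real rclpy.node.Node implementation.
class FakeNode:
    def __init__(self):
        self._publishers = []

    def destroy_publisher(self, pub):
        if pub not in self._publishers:
            return False            # already untracked: nothing is destroyed
        self._publishers.remove(pub)
        # ... real low-level destruction would happen here ...
        return True

    def destroy_all_wrong(self):
        # pop() untracks the item first, so destroy_publisher() bails out.
        while self._publishers:
            self.destroy_publisher(self._publishers.pop())

    def destroy_all_right(self):
        # Pass the still-tracked item and let destroy_publisher() remove it.
        while self._publishers:
            self.destroy_publisher(self._publishers[0])
```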
I had a look into the hang and managed to reproduce it. Here is the patch to rclpy I used to increase the hang frequency:
diff --git a/rclpy/rclpy/node.py b/rclpy/rclpy/node.py
index 94afaf4..17c2d03 100644
--- a/rclpy/rclpy/node.py
+++ b/rclpy/rclpy/node.py
@@ -1463,18 +1463,26 @@ class Node:
# Destroy dependent items eagerly to work around a possible hang
# https://github.com/ros2/build_cop/issues/248
- while self.__publishers:
- self.destroy_publisher(self.__publishers[0])
- while self.__subscriptions:
- self.destroy_subscription(self.__subscriptions[0])
- while self.__clients:
- self.destroy_client(self.__clients[0])
- while self.__services:
- self.destroy_service(self.__services[0])
- while self.__timers:
- self.destroy_timer(self.__timers[0])
- while self.__guards:
- self.destroy_guard_condition(self.__guards[0])
+ # while self.__publishers:
+ # self.destroy_publisher(self.__publishers[0])
+ # while self.__subscriptions:
+ # self.destroy_subscription(self.__subscriptions[0])
+ # while self.__clients:
+ # self.destroy_client(self.__clients[0])
+ # while self.__services:
+ # self.destroy_service(self.__services[0])
+ # while self.__timers:
+ # self.destroy_timer(self.__timers[0])
+ # while self.__guards:
+ # self.destroy_guard_condition(self.__guards[0])
+
+ self.__publishers.clear()
+ self.__subscriptions.clear()
+ self.__clients.clear()
+ self.__services.clear()
+ self.__timers.clear()
+ self.__guards.clear()
+
self.handle.destroy()
self._wake_executor()
This is the process tree at the time of the deadlock. The deadlock occurred in process 15469.
There were seven threads: two from Python (the main thread plus one extra thread). Beginning of main thread (thread 1) traceback (sadly this is the only traceback I have):
Here are the notes I took. In short, thread 3 holds the PDP mutex but wants the StatefulReader mutex, while thread 5 holds the StatefulReader mutex but wants the PDP mutex. Thread 1 is blocked because it needs thread 3 to set a flag to unblock it. I patched Fast-RTPS v1.9.2 as follows (a minimal Python sketch of the same lock-ordering cycle follows the patch):
diff --git a/src/cpp/rtps/builtin/discovery/participant/PDP.cpp b/src/cpp/rtps/builtin/discovery/participant/PDP.cpp
index f766c5ea0..e090a533f 100644
--- a/src/cpp/rtps/builtin/discovery/participant/PDP.cpp
+++ b/src/cpp/rtps/builtin/discovery/participant/PDP.cpp
@@ -996,7 +996,7 @@ CDRMessage_t PDP::get_participant_proxy_data_serialized(Endianness_t endian)
void PDP::check_remote_participant_liveliness(
ParticipantProxyData* remote_participant)
{
- std::lock_guard<std::recursive_mutex> guard(*this->mp_mutex);
+ std::unique_lock<std::recursive_mutex> guard(*this->mp_mutex);

if(GUID_t::unknown() != remote_participant->m_guid)
{
@@ -1007,7 +1007,9 @@ void PDP::check_remote_participant_liveliness(
std::chrono::microseconds(TimeConv::Duration_t2MicroSecondsInt64(remote_participant->m_leaseDuration));
if (now > real_lease_tm)
{
+ guard.unlock();
remove_remote_participant(remote_participant->m_guid, ParticipantDiscoveryInfo::DROPPED_PARTICIPANT);
+ guard.lock();
return;
}
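Translated out of Fast-RTPS into plain Python threading primitives, the cycle described above looks roughly like this (the names are illustrative, not Fast-RTPS symbols); both threads end up waiting forever:

```python
# Minimal sketch of the lock-ordering cycle: thread "3" holds the PDP mutex
# and wants the reader mutex, while thread "5" does the reverse.
import threading
import time

pdp_mutex = threading.RLock()     # stands in for PDP::mp_mutex
reader_mutex = threading.Lock()   # stands in for the StatefulReader mutex

def liveliness_check():           # plays the role of thread 3
    with pdp_mutex:
        time.sleep(0.1)
        with reader_mutex:
            pass

def incoming_data():              # plays the role of thread 5
    with reader_mutex:
        time.sleep(0.1)
        with pdp_mutex:
            pass

t3 = threading.Thread(target=liveliness_check, daemon=True)
t5 = threading.Thread(target=incoming_data, daemon=True)
t3.start()
t5.start()
t3.join(timeout=2)
t5.join(timeout=2)
print('deadlocked' if t3.is_alive() and t5.is_alive() else 'completed')
```

The patch above sidesteps the cycle by releasing the PDP mutex around the remove_remote_participant() call, which appears to be where the second lock is taken.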
Linux test failures
We were good for a while, but these hangs seem to be back with a vengeance today on Linux and Linux ARM64. Curiously, some of the builds exhibit the issue when Fast-RTPS is disabled (testing OpenSplice changes), but it did occur with Fast-RTPS on a debug nightly as well.
I am out of steam for the day. There's a private, internal thread with more info that I'll collate here when I come back online tomorrow.
The Linux nightlies have been very consistently hitting this issue. I've tried several times to reproduce it on a smaller scale but haven't been able to do so either on the buildfarm or locally. Looking into the hung processes on Linux, each of them that I've sampled looks similar to this one:
which has five threads blocked on a futex. Logs from GDB that include the Python backtraces and C backtraces for each thread are here.
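For future hangs it might be worth baking a traceback dumper into the test entry point (e.g. a conftest.py, an assumption on my part) so we don't depend on gdb being available. A sketch using the standard-library faulthandler module (the signal choice and timeout are arbitrary):

```python
# Sketch: make a potentially-hanging test process able to report Python
# tracebacks for all threads. POSIX-only as written (uses SIGUSR1).
import faulthandler
import signal
import sys

# On demand: `kill -USR1 <pid>` prints every thread's Python traceback.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Unattended: if the process is still alive after 15 minutes, dump the
# tracebacks automatically (and again every 15 minutes after that).
faulthandler.dump_traceback_later(15 * 60, repeat=True, file=sys.stderr)
```

This only covers the Python side of the picture; the futex-level state still needs gdb, as above.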
This issue seems to be very hard to reproduce, so I'm looking at code changes instead. Assuming the hang is caused by a ros2 code change and not a CI infrastructure change, these are the changes that were made between the last nightly_linux_debug that didn't hang, and the first one that did.
Trying to reproduce using
A couple more tests
There are a couple other issues with ros2cli tests. I would highly doubt they impact this issue, but they might.
I've opened ros2/ros2cli#489 to skip these tests on Windows until we can identify and mitigate the problem.
I noticed a similar hang/block coming from the test_launch_ros package today in the build here: nightly_windows_rel#1521 (internal discussion link). This is the first time I've seen the behavior without launch_testing but with launch.
Here's another hang from test_launch_ros: https://ci.ros2.org/job/ci_windows/10115
Other CLI tests are skipped on Windows since #489. To be reverted when ros2/build_farmer#248 is resolved. Signed-off-by: Jacob Perron <[email protected]>
I very much think https://ci.ros2.org/job/nightly_osx_debug/1501/ is multiprocessing-related and I suspect it's this little caveat:
Another part of this mystery is why does
Just noting
Fixes #480 The actual tests are the same, except with the use of launch_testing we ensure the CLI daemon is restarted between tests. This follows a similar pattern as the other ros2cli tests. In addition to converting to launch tests, this change also runs the tests for all RMW implementations. For now, we are skipping tests on Windows. Other CLI tests are skipped on Windows since #489. To be reverted when ros2/build_farmer#248 is resolved. Signed-off-by: Jacob Perron <[email protected]>
https://ci.ros2.org/job/nightly_osx_release/1626/ has been stuck in test_launch_ros for more than 24 hours. I'm going to abort it.
Another test_launch_ros hang: https://ci.ros2.org/job/nightly_osx_extra_rmw_release/752/console
🧑🌾 Another
Another one: https://ci.ros2.org/job/nightly_osx_debug/1616.
Any way to get a copy of
I wonder if it has something to do with the many resource leaks in
There may also be a leak of open file descriptors in launch_yaml.Parser: ros2/launch#415
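For reference, the kind of descriptor leak alluded to is, I assume, the classic open-without-close pattern; this sketch is illustrative only and not the actual launch_yaml code:

```python
# Illustrative only: the generic pattern that leaks file descriptors
# versus the one that closes them deterministically.
def parse_leaky(path):
    # The descriptor is closed only whenever the file object happens to be
    # garbage collected, which can pile up fds over a long test run.
    return open(path).read()

def parse_clean(path):
    with open(path) as f:   # descriptor closed deterministically on exit
        return f.read()
```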
This may also be a contributing cause: ros2/launch#416
We aren't using this repository anymore for buildfarm issues, so I'm going to archive it. Thus I'm closing out this issue. If you continue to have problems, please report another bug against https://github.com/ros2/ros2. Thank you.
Over the past week I've seen several instances of a CI run hanging during tests of ros2cli packages.
I didn't think to screengrab the process monitor output during the last Windows one I cleaned up, but I did grab the command line of a process which seemed to be endlessly spawning and killing python processes:
c:\python37\python.exe setup.py pytest egg_info --egg-base C:\J\workspace\ci_windows\ws\build\ros2topic
On macOS, where I'm a bit more comfortable with the tooling, there doesn't appear to be a cycle of spawning and dying processes but rather a static tree with one zombie process (37262 in the example pstree below). I couldn't get a command line off of the zombie process. Attempting to SIGCHLD its parent yielded no result, so I eventually escalated to SIGTERM before sending SIGTERM to the colcon process (84891), which was enough to de-hang the job (although it failed since colcon test was terminated), but the 35227 tree stayed behind, now with all child processes zombified. I eventually had to issue a SIGKILL to 35227, which removed it and allowed its children to be reaped by init. I haven't identified a patient zero, and tests are passing most of the time. I also haven't seen this happen on Linux platforms thus far.
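For anyone unfamiliar with the reaping behaviour involved here: a child that exits before its parent wait()s on it sits in the process table as a zombie until it is reaped. A minimal POSIX-only sketch, not specific to colcon or pytest:

```python
# Minimal demonstration of a zombie child and its reaping. POSIX-only;
# the sleep only leaves time to observe the zombie with `ps`.
import os
import time

pid = os.fork()
if pid == 0:
    os._exit(0)           # child exits immediately
else:
    time.sleep(5)         # during this window `ps -o pid,stat -p <pid>`
                          # shows the child in state Z (zombie)
    os.waitpid(pid, 0)    # reaping removes the zombie entry
```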