Skip to content
This repository has been archived by the owner on Feb 4, 2021. It is now read-only.

Tests hanging during ROS 2 CI tests #248

Closed
nuclearsandwich opened this issue Nov 1, 2019 · 44 comments · Fixed by ros2/launch#369
Closed

Tests hanging during ROS 2 CI tests #248

nuclearsandwich opened this issue Nov 1, 2019 · 44 comments · Fixed by ros2/launch#369
Assignees
Labels
bug Something isn't working

Comments

@nuclearsandwich
Copy link
Member

nuclearsandwich commented Nov 1, 2019

Over the past week I've seen several instances of a CI run hanging during tests of ros2cli packages.

I didn't think to screengrab the processmonitor output during the last Windows one I cleaned up but I did grab the procline of a process which seemed to be endlessly spawning and killing python processes: c:\python37\python.exe setup.py pytest egg_info --egg-base C:\J\workspace\ci_windows\ws\build\ros2topic. On macOS where I'm a bit more comfortable with the tooling there doesn't appear to be a cycling of spawing and dying processes but a static tree with one zombie process (37262 in the example pstree below).

I couldn't get a procline off of the zombie process. Attempting to SIGCHLD its parent yielded no result, I eventually escalated to SIGTERM before sending SIGTERM to the colcon process (84891) which was enough to de-hang the job (although it failed since colcon test was terminated) but the 35227 tree stayed behind now with all child processes zombiefied. I eventually had to issue a SIGKILL to 35227 which removed it and allowed its children to get reaped by init. I haven't identified a patient 0 and tests are passing most of the time. I also haven't seen this happen on Linux platforms thus far.

pstree 717
-+= 00717 jenkins /usr/bin/java -Djava.awt.headless=true -jar /Users/jenkins/jenkins-agent/steven-agent.jar -jnlpUrl https://ci.ros2.org/computer/mac_mojave_lore/slave-agent.jnlp -secret 6e1eb6416b70a634000f26198aacf432d597e2e95234eb2441de0b60e5827818
 \-+- 08348 jenkins /bin/sh -xe /var/folders/tm/5tw_fjn13737cx92skl8xjc00000gp/T/jenkins6149197211335807067.sh
   \-+- 08350 jenkins /usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/Resources/Python.app/Contents/MacOS/Python -u run_ros2_batch.py --do-venv --force-ansi-color --workspace-path /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release --ignore-rmw rmw_connext_cpp --repo-file-url https://raw.githubusercontent.com/ros2/ros2/mas
     \-+- 84888 jenkins /bin/sh -c . ../venv/bin/activate && /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/colcon test --base-paths "src" --build-base "build" --install-base "install" --event-handlers console_direct+ --executor sequential --retest-until-pass 10
       \-+- 84891 jenkins /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/python /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/colcon test --base-paths src --build-base build --install-base install --event-handlers console_direct+ --executor sequential --retest-until-pass 10
         \-+- 35227 jenkins /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/python setup.py pytest egg_info --egg-base /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/ws/build/ros2topic
           |--- 37166 jenkins /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/python /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/ws/src/ros2/ros2cli/ros2topic/test/fixtures/talker_node.py --ros-args
           |--- 37167 jenkins /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/python /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/ws/src/ros2/ros2cli/ros2topic/test/fixtures/listener_node.py --ros-args -r chatter:=chit_chatter
           |--- 37168 jenkins /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/python /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/ws/src/ros2/ros2cli/ros2topic/test/fixtures/talker_node.py --ros-args -r chatter:=_hidden_chatter
           |--- 37169 jenkins /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/python /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/ws/src/ros2/ros2cli/ros2topic/test/fixtures/repeater_node.py test_msgs/msg/Arrays --ros-args -r __node:=array_repeater -r /array_repeater/output:=/arrays
           |--- 37170 jenkins /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/python /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/ws/src/ros2/ros2cli/ros2topic/test/fixtures/repeater_node.py test_msgs/msg/Defaults --ros-args -r __node:=defaults_repeater -r /defaults_repeater/output:=/defaults
           |--- 37171 jenkins /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/python /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/ws/src/ros2/ros2cli/ros2topic/test/fixtures/repeater_node.py test_msgs/msg/BoundedSequences --ros-args -r __node:=bounded_sequences_repeater -r /bounded_sequences_repeater/output:=/bou
           |--- 37172 jenkins /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/python /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/ws/src/ros2/ros2cli/ros2topic/test/fixtures/repeater_node.py test_msgs/msg/UnboundedSequences --ros-args -r __node:=unbounded_sequences_repeater -r /unbounded_sequences_repeater/output
           |--- 37173 jenkins /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/venv/bin/python /Users/osrf/jenkins-agent/workspace/nightly_osx_extra_rmw_release/ws/src/ros2/ros2cli/ros2topic/test/fixtures/controller_node.py --ros-args
           \--- 37262 jenkins (python3.7)
@nuclearsandwich nuclearsandwich added the bug Something isn't working label Nov 1, 2019
@nuclearsandwich
Copy link
Member Author

nuclearsandwich commented Nov 4, 2019

More hanging builds came in over the weekend.

On Linux I was able to strace the pytest process which isn't reaping its children (/home/jenkins-agent/workspace/nightly_linux_extra_rmw_release/venv/bin/python setup.py pytest egg_info --egg-base /home/jenkins-agent/workspace/nightly_linux_extra_rmw_release/ws/build/ros2topic) and it was blocked on a futex wait.

I didn't see anything else that yielded solid breadcrumbs. It's possible that ros2/launch#279 is not playing well with an update to pytest (5.2.2 was released 24 October). Now that we've seen the problem on Linux I can try and reproduce it locally.

@nuclearsandwich
Copy link
Member Author

I was able to reproduce the issue in a local container and gdb one stuck process.

I did capture some gdb output in this gist: https://gist.github.com/nuclearsandwich/c845d073fa854ce020c8d42453d00faa

I don't have a good instinct on how to guarantee expression and minimize the reproduction but it does look like the hang is in rclpy when used with Fast-RTPS and the test usage is just hitting it.

@nuclearsandwich
Copy link
Member Author

Yesterday was light on failures. Today is heavy.

@nuclearsandwich
Copy link
Member Author

I've had a few more reproductions locally and all of them have backtraces near identical to the one I captured. I have been running the tests in a loop for half a day and been unable to reproduce them with cyclonedds, the only rmw implementation besides fastrtps I have available in the repro container.

I don't know whether there is an issue with Fast-RTPS itself or how we're using it. It's starting to remind me a lot of ros2/launch#89

Could it be that we need to explicitly destroy publishers before clearing them and destroying the node like this?

@nuclearsandwich nuclearsandwich self-assigned this Nov 6, 2019
@nuclearsandwich
Copy link
Member Author

nuclearsandwich added a commit to ros2/rclpy that referenced this issue Nov 7, 2019
This seems to reduce hangs during test runs described in
ros2/build_farmer#248.

The handles corresponding to the destroyed objects *should* be getting
destroyed explicitly when self.handle.destroy() is called below. It
seems however that when running with Fast-RTPS it's possible to get into
a state where multiple threads are waiting on futexes and none can move
forward. The rclpy context of this apparent deadlock is while clearing
a node's list of publishers or services (possibly others, although
publishers and services were the only ones observed).

I consider this patch to be a workaround rather than a fix.
I think there may either be a race condition between the rcl/rmw layer
and the rmw implementation layer which is being tripped by the
haphazardness of Python's garbage collector or there is a logical
problem with the handle destruction ordering in rclpy that only
Fast-RTPS is sensitive to.
nuclearsandwich added a commit to ros2/rclpy that referenced this issue Nov 7, 2019
This seems to reduce hangs during test runs described in
ros2/build_farmer#248.

The handles corresponding to the destroyed objects *should* be getting
destroyed explicitly when self.handle.destroy() is called below. It
seems however that when running with Fast-RTPS it's possible to get into
a state where multiple threads are waiting on futexes and none can move
forward. The rclpy context of this apparent deadlock is while clearing
a node's list of publishers or services (possibly others, although
publishers and services were the only ones observed).

I consider this patch to be a workaround rather than a fix.
I think there may either be a race condition between the rcl/rmw layer
and the rmw implementation layer which is being tripped by the
haphazardness of Python's garbage collector or there is a logical
problem with the handle destruction ordering in rclpy that only
Fast-RTPS is sensitive to.

Signed-off-by: Steven! Ragnarök <[email protected]>
nuclearsandwich added a commit to ros2/rclpy that referenced this issue Nov 13, 2019
* Expressly destroy a node's objects before the node.

This seems to reduce hangs during test runs described in
ros2/build_farmer#248.

The handles corresponding to the destroyed objects *should* be getting
destroyed explicitly when self.handle.destroy() is called below. It
seems however that when running with Fast-RTPS it's possible to get into
a state where multiple threads are waiting on futexes and none can move
forward. The rclpy context of this apparent deadlock is while clearing
a node's list of publishers or services (possibly others, although
publishers and services were the only ones observed).

I consider this patch to be a workaround rather than a fix.
I think there may either be a race condition between the rcl/rmw layer
and the rmw implementation layer which is being tripped by the
haphazardness of Python's garbage collector or there is a logical
problem with the handle destruction ordering in rclpy that only
Fast-RTPS is sensitive to.

* Don't pre-emptively remove items from Node lists.

As pointed out by Shane, pop()ing each item from the list before
passing it to the .destroy_ITEM() method prevents it from being
destroyed as the individual methods first check that the item is present
in the list then remove it before continuing to destroy it.

Signed-off-by: Steven! Ragnarök <[email protected]>
@sloretz
Copy link

sloretz commented Nov 19, 2019

I had a look into the hang and managed to reproduce it by testing just test_cli.py in ros2topic on Bionic with the latest ROS 2 sources and a patch to rclpy. It looks like some threads have deadlocked in Fast-RTPS and I made some notes, but unfortunately I accidentally killed the process I was debugging so I don't have all of the tracebacks.

patch to rclpy to increase hang frequency
diff --git a/rclpy/rclpy/node.py b/rclpy/rclpy/node.py
index 94afaf4..17c2d03 100644
--- a/rclpy/rclpy/node.py
+++ b/rclpy/rclpy/node.py
@@ -1463,18 +1463,26 @@ class Node:
 
         # Destroy dependent items eagerly to work around a possible hang
         # https://github.com/ros2/build_cop/issues/248
-        while self.__publishers:
-            self.destroy_publisher(self.__publishers[0])
-        while self.__subscriptions:
-            self.destroy_subscription(self.__subscriptions[0])
-        while self.__clients:
-            self.destroy_client(self.__clients[0])
-        while self.__services:
-            self.destroy_service(self.__services[0])
-        while self.__timers:
-            self.destroy_timer(self.__timers[0])
-        while self.__guards:
-            self.destroy_guard_condition(self.__guards[0])
+        # while self.__publishers:
+        #     self.destroy_publisher(self.__publishers[0])
+        # while self.__subscriptions:
+        #     self.destroy_subscription(self.__subscriptions[0])
+        # while self.__clients:
+        #     self.destroy_client(self.__clients[0])
+        # while self.__services:
+        #     self.destroy_service(self.__services[0])
+        # while self.__timers:
+        #     self.destroy_timer(self.__timers[0])
+        # while self.__guards:
+        #     self.destroy_guard_condition(self.__guards[0])
+
+        self.__publishers.clear()
+        self.__subscriptions.clear()
+        self.__clients.clear()
+        self.__services.clear()
+        self.__timers.clear()
+        self.__guards.clear()
+
         self.handle.destroy()
         self._wake_executor()

This is the process tree at the time of deadlock. The deadlock occured in 15469, the python3 setup.py pytest process.

sloretz  15217     1  0 14:29 pts/0    00:00:05 /usr/bin/python3 /usr/local/bin/colcon test --retest-until-fail 20 --event-handlers console_direct+ --packages-select ros2topic
  sloretz  15469 15217  0 14:29 pts/0    00:00:05 /usr/bin/python3 setup.py pytest egg_info --egg-base /home/sloretz/ws/ros2/build/ros2topic
    sloretz  15707 15469  0 14:30 pts/0    00:00:03 /usr/bin/python3 /home/sloretz/ws/ros2/src/ros2/ros2cli/ros2topic/test/fixtures/talker_node.py --ros-args
    sloretz  15708 15469  0 14:30 pts/0    00:00:02 /usr/bin/python3 /home/sloretz/ws/ros2/src/ros2/ros2cli/ros2topic/test/fixtures/listener_node.py --ros-args -r chatter:=chit_chatter
    sloretz  15709 15469  0 14:30 pts/0    00:00:03 /usr/bin/python3 /home/sloretz/ws/ros2/src/ros2/ros2cli/ros2topic/test/fixtures/talker_node.py --ros-args -r chatter:=_hidden_chatter
    sloretz  15710 15469  0 14:30 pts/0    00:00:04 /usr/bin/python3 /home/sloretz/ws/ros2/src/ros2/ros2cli/ros2topic/test/fixtures/repeater_node.py test_msgs/msg/Arrays --ros-args -r __node:=array_repeater -r /array_repeater/output:=/arrays
    sloretz  15711 15469  0 14:30 pts/0    00:00:03 /usr/bin/python3 /home/sloretz/ws/ros2/src/ros2/ros2cli/ros2topic/test/fixtures/repeater_node.py test_msgs/msg/Defaults --ros-args -r __node:=defaults_repeater -r /defaults_repeater/output:=/defaults
    sloretz  15712 15469  0 14:30 pts/0    00:00:03 /usr/bin/python3 /home/sloretz/ws/ros2/src/ros2/ros2cli/ros2topic/test/fixtures/repeater_node.py test_msgs/msg/BoundedSequences --ros-args -r __node:=bounded_sequences_repeater -r /bounded_sequences_repeater/output:=/bounded_seq
    sloretz  15713 15469  0 14:30 pts/0    00:00:03 /usr/bin/python3 /home/sloretz/ws/ros2/src/ros2/ros2cli/ros2topic/test/fixtures/repeater_node.py test_msgs/msg/UnboundedSequences --ros-args -r __node:=unbounded_sequences_repeater -r /unbounded_sequences_repeater/output:=/unbou
    sloretz  15714 15469  0 14:30 pts/0    00:00:03 /usr/bin/python3 /home/sloretz/ws/ros2/src/ros2/ros2cli/ros2topic/test/fixtures/controller_node.py --ros-args

sloretz  15681     1  0 14:30 pts/0    00:00:02 /usr/bin/python3 /home/sloretz/ws/ros2/install/ros2cli/bin/_ros2_daemon --rmw-implementation rmw_opensplice_cpp --ros-domain-id 101

There were seven threads: two from python (main thread + extra thread spawned by launch_testing) and 5 created by Fast-RTPS. The main thread was blocked as a consequence of a deadlock in two threads created by Fast-RTPS.

Beginning of main thread (thread 1) traceback

sadly this is the only traceback I have

  #0  0x00007fd4937d79f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x1c7d178) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  #1  __pthread_cond_wait_common (abstime=0x0, mutex=0x1c7d128, cond=0x1c7d150) at pthread_cond_wait.c:502
  #2  __pthread_cond_wait (cond=0x1c7d150, mutex=0x1c7d128) at pthread_cond_wait.c:655
  #3  0x00007fd486c5d573 in eprosima::fastrtps::TimedConditionVariable::wait<std::timed_mutex>(std::unique_lock<std::timed_mutex>&, std::function<bool ()>) (this=0x1c7d150, lock=..., predicate=...)
      at /home/sloretz/ws/ros2/src/eProsima/Fast-RTPS/include/fastrtps/rtps/resources/../../utils/TimedConditionVariable.hpp:79
  #4  0x00007fd486c57913 in eprosima::fastrtps::rtps::ResourceEvent::unregister_timer (this=0x1c7d118, event=0x138ff20) at /home/sloretz/ws/ros2/src/eProsima/Fast-RTPS/src/cpp/rtps/resources/ResourceEvent.cpp:101
  #5  0x00007fd486c60584 in eprosima::fastrtps::rtps::TimedEvent::~TimedEvent (this=0x1ec8e80, __in_chrg=<optimized out>) at /home/sloretz/ws/ros2/src/eProsima/Fast-RTPS/src/cpp/rtps/resources/TimedEvent.cpp:42
  #6  0x00007fd486c605c2 in eprosima::fastrtps::rtps::TimedEvent::~TimedEvent (this=0x1ec8e80, __in_chrg=<optimized out>) at /home/sloretz/ws/ros2/src/eProsima/Fast-RTPS/src/cpp/rtps/resources/TimedEvent.cpp:44
  #7  0x00007fd486d1f951 in eprosima::fastrtps::PublisherImpl::~PublisherImpl (this=0x1cf1910, __in_chrg=<optimized out>) at /home/sloretz/ws/ros2/src/eProsima/Fast-RTPS/src/cpp/publisher/PublisherImpl.cpp:105
  #8  0x00007fd486d1fa2c in eprosima::fastrtps::PublisherImpl::~PublisherImpl (this=0x1cf1910, __in_chrg=<optimized out>) at /home/sloretz/ws/ros2/src/eProsima/Fast-RTPS/src/cpp/publisher/PublisherImpl.cpp:115
  #9  0x00007fd486d1296b in eprosima::fastrtps::ParticipantImpl::removePublisher (this=0x1c7ca20, pub=0x1390000) at /home/sloretz/ws/ros2/src/eProsima/Fast-RTPS/src/cpp/participant/ParticipantImpl.cpp:91
  #10 0x00007fd486d06cfd in eprosima::fastrtps::Domain::removePublisher (pub=0x1390000) at /home/sloretz/ws/ros2/src/eProsima/Fast-RTPS/src/cpp/Domain.cpp:116
  #11 0x00007fd4874b156f in rmw_fastrtps_shared_cpp::__rmw_destroy_publisher (identifier=0x7fd487b35357 "rmw_fastrtps_cpp", node=0x17ab380, publisher=0x1860fe0) at /home/sloretz/ws/ros2/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_publisher.cpp:66
  #12 0x00007fd487b2aad9 in rmw_destroy_publisher (node=0x17ab380, publisher=0x1860fe0) at /home/sloretz/ws/ros2/src/ros2/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_publisher.cpp:295
  #13 0x00007fd48d903f44 in rmw_destroy_publisher (v2=0x17ab380, v1=0x1860fe0) at /home/sloretz/ws/ros2/src/ros2/rmw_implementation/rmw_implementation/src/functions.cpp:285
  #14 0x00007fd48db28ce2 in rcl_publisher_fini (publisher=0x7fd488356300, node=0x7fd4883562f0) at /home/sloretz/ws/ros2/src/ros2/rcl/rcl/src/rcl/publisher.c:231

Here are the notes I took. Thread 1 is blocked because allow_to_delete_ never becomes true. This should happen during ResourceEvent::run_io_service(), but that service is being handled by thread 3 which has deadlocked with thread 5. Thread 3 is stuck inside io_service_.poll() because it is in some callback that is waiting for StatefulReader(0x1cae400)->mp_mutex while trying to call matched_writer_remove(). Additionally thread 3 acquired a lock on PDP(0x1e6c6d0)->mp_mutex in check_remote_participant_liveliness(). The lock thread 3 is waiting on is held by thread 5 which acquired the lock in StatefulReader(0x1cae400)::processDataMessage. Thread 5 won't release it because it's trying to acquire PDP(0x1e6c6d0)->mp_mutex in PDP::removeWriterProxyData().

In short, thread 3 has the PDP mutex but wants the StatefulReader mutex, while thread 5 has the StatefulReader mutex but wants the PDP mutex. Thread 1 is blocked because it needs thread 3 to set a flag to unblock it.

I patched Fast-RTPS to avoid this deadlock and haven't seen the hang running the same test over and over all night. I don't know how the test got into this situation (maybe there's still a bug in the order rclpy destructs stuff), and I don't know much about the Fast-RTPS codebase, but the patch seems reasonable-ish to me.

Patch to Fast-RTPS v1.9.2
diff --git a/src/cpp/rtps/builtin/discovery/participant/PDP.cpp b/src/cpp/rtps/builtin/discovery/participant/PDP.cpp
index f766c5ea0..e090a533f 100644
--- a/src/cpp/rtps/builtin/discovery/participant/PDP.cpp
+++ b/src/cpp/rtps/builtin/discovery/participant/PDP.cpp
@@ -996,7 +996,7 @@ CDRMessage_t PDP::get_participant_proxy_data_serialized(Endianness_t endian)
 void PDP::check_remote_participant_liveliness(
         ParticipantProxyData* remote_participant)
 {
-    std::lock_guard<std::recursive_mutex> guard(*this->mp_mutex);
+    std::unique_lock<std::recursive_mutex> guard(*this->mp_mutex);     if(GUID_t::unknown() != remote_participant->m_guid)
     {
@@ -1007,7 +1007,9 @@ void PDP::check_remote_participant_liveliness(
                 std::chrono::microseconds(TimeConv::Duration_t2MicroSecondsInt64(remote_participant->m_leaseDuration));
         if (now > real_lease_tm)
         {
+            guard.unlock();
             remove_remote_participant(remote_participant->m_guid, ParticipantDiscoveryInfo::DROPPED_PARTICIPANT);
+            guard.lock();
             return;
         }

jaisontj pushed a commit to aws-ros-dev/rclpy that referenced this issue Nov 19, 2019
* Expressly destroy a node's objects before the node.

This seems to reduce hangs during test runs described in
ros2/build_farmer#248.

The handles corresponding to the destroyed objects *should* be getting
destroyed explicitly when self.handle.destroy() is called below. It
seems however that when running with Fast-RTPS it's possible to get into
a state where multiple threads are waiting on futexes and none can move
forward. The rclpy context of this apparent deadlock is while clearing
a node's list of publishers or services (possibly others, although
publishers and services were the only ones observed).

I consider this patch to be a workaround rather than a fix.
I think there may either be a race condition between the rcl/rmw layer
and the rmw implementation layer which is being tripped by the
haphazardness of Python's garbage collector or there is a logical
problem with the handle destruction ordering in rclpy that only
Fast-RTPS is sensitive to.

* Don't pre-emptively remove items from Node lists.

As pointed out by Shane, pop()ing each item from the list before
passing it to the .destroy_ITEM() method prevents it from being
destroyed as the individual methods first check that the item is present
in the list then remove it before continuing to destroy it.

Signed-off-by: Steven! Ragnarök <[email protected]>
@nuclearsandwich
Copy link
Member Author

I cherry plucked Shane's branch onto Fast-RTPS 1.9.x and kicked off the ROS 2 test suite since that change may be part of the Eloquent release and we should keep a sharp eye on regressions where lock code is changing.

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@sloretz
Copy link

sloretz commented Nov 21, 2019

Linux test failures test_communication.TestPublisherSubscriberAfterShutdown.test_processes_finished_gracefully[WStrings] are all connext + cyclonedds related, but it isn't obvious from their name

@nuclearsandwich
Copy link
Member Author

We were good for a while but these hangs seem back with a vengeance today on Linux and Linux ARM64. Curiously some of the builds exhibit the issue when Fast-RTPS is disabled (testing opensplice changes) but it did occur with fast-rtps on a debug nightly as well.

@nuclearsandwich
Copy link
Member Author

I am out of steam for the day. There's a private, internal thread with more info that when I come back online tomorrow I'll collate here.

@nuclearsandwich
Copy link
Member Author

The Linux nightlies have been very consistently hitting this issue. I've tried several times to reproduce it on a smaller scale but haven't been able to do so either on the buildfarm or locally.

Looking into the hung processes on Linux, each of them that I've sampled looks similar to this one:

(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7f29bf803f40 (LWP 7361) "python" 0x00007f29bf0026d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x362c7e0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  2    Thread 0x7f29b1003700 (LWP 7365) "python" 0x00007f29bf0026d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x362c7e0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  3    Thread 0x7f29b0802700 (LWP 7438) "python" 0x00007f29bf0026d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x362c7e0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  4    Thread 0x7f29aaffd700 (LWP 7518) "python" 0x00007f29bf0026d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x362c7e0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  5    Thread 0x7f29aa7fc700 (LWP 7604) "python" 0x00007f29bf0026d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x362c7e0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  6    Thread 0x7f29a889f700 (LWP 7691) "python" 0x00007f29bf0026d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f2990003bb0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205

which has five threads blocked on the futex 0x362c7e0 and one blocked on the futex 0x7f2990003bb0. I don't have any info on the relationship between the futexes. There are no active child processes at the time of debugging. Everything I've read about digging into the contended futexes has suggested that systemtap is the next step but the debugging info for kernels in AWS isn't provided so the next steps are either to reproduce the issue on a debuggable kernel or build the debug info for the kernel we're running in production. I've been continuing to iterate on the former process and will probably try the latter before the end of the day.

Logs from GDB that include the python backtraces and C backtraces for each thread are here.

@sloretz
Copy link

sloretz commented Jan 13, 2020

Trying to reproduce using ci_linux

  • https://ci.ros2.org/job/ci_linux/9012/
    • Control job, I'd expect this to hang. It's using the exact repos file from nightly_linux_debug 1403. If it doesn't hang then the job below means nothing
    • edit: hung 🎉
  • https://ci.ros2.org/job/ci_linux/9013/
    • Job with fastrtps version pinned to what it was before the nightlies started hanging. If this does not hang and 9012 does hang, then the fastrtps PRs should be investigated as a possible cause of the hang.
    • edit: Did not hang 👀

@sloretz
Copy link

sloretz commented Jan 14, 2020

A couple more tests

  • https://ci.ros2.org/job/ci_linux/9015/
    • This is using all repos the same version as nightly_linux_debug 1402 except fastrtps which is using the version used by nightly_linux_debug 1403. If the fastrtps changes are responsible for the ros2topic hang, then I would expect this to hang too.
    • edit: did not hang
  • https://ci.ros2.org/job/ci_linux/9016/
    • A control job with the repos file from 1402. If this hangs, then it's probably a CI infrastructure change rather than a code change. If this does not hang, and the job above does not hang, then the cause of the hang probably requires changes in multiple repos. If this does not hang, and the job above does hang, then the fastrtps changes are likely the only cause of the hang.
    • edit: did not hang

@sloretz
Copy link

sloretz commented Jan 14, 2020

@sloretz
Copy link

sloretz commented Jan 14, 2020

@sloretz
Copy link

sloretz commented Jan 15, 2020

@sloretz
Copy link

sloretz commented Jan 15, 2020

@sloretz
Copy link

sloretz commented Jan 15, 2020

@brawner
Copy link

brawner commented Feb 28, 2020

There are a couple other issues with ros2cli tests. I would highly doubt they impact this issue, but they might.

ros2/ros2cli#456
ros2/ros2cli#457

@nuclearsandwich
Copy link
Member Author

I've opened ros2/ros2cli#489 to skip these tests on Windows until we can identify and mitigate the problem.

@nuclearsandwich
Copy link
Member Author

I noticed a similar hang/block coming from the test_launch_ros package today in the build here: nightly_windows_rel#1521 (internal discussion link). This is the first time I've seen the behavior without launch_testing but with launch.

@nuclearsandwich
Copy link
Member Author

Here's another hang from test_launch_ros https://ci.ros2.org/job/ci_windows/10115

jacobperron added a commit to ros2/ros2cli that referenced this issue Apr 16, 2020
Other CLI tests are skipped on Windows since #489.

To be reverted when ros2/build_farmer#248 is resolved.

Signed-off-by: Jacob Perron <[email protected]>
jacobperron added a commit to ros2/ros2cli that referenced this issue Apr 16, 2020
Other CLI tests are skipped on Windows since #489.

To be reverted when ros2/build_farmer#248 is resolved.

Signed-off-by: Jacob Perron <[email protected]>
@rotu
Copy link

rotu commented Apr 17, 2020

I very much think https://ci.ros2.org/job/nightly_osx_debug/1501/ is multiprocessing-related and I suspect it's this little caveat:

Changed in version 3.8: On macOS, the spawn start method is now the default. The fork start method should be considered unsafe as it can lead to crashes of the subprocess. See bpo-33725.

Another part of this mystery is why does _wait_for_ready_callbacks get called? I think a crash when tokenizing triggered the atexit handler on shutdown, which was registered by a previous task on the worker thread (@dirk-thomas just removed it in ros2/launch#401 👍). Seeing as there is a fatal error "Fatal Python error: initsite: Failed to import the site module" reported many times in this run, I think that was the crash during tokenization.

@sloretz
Copy link

sloretz commented Apr 17, 2020

Seeing as there is a fatal error "Fatal Python error: initsite: Failed to import the site module" reported many times in this run, I think that was the crash during tokenization.

Just noting site is imported during initialization (unless the python process is invoked with -S), so a fatal error there might suggest something has gone wrong before any python code in our control has had a chance to execute.

jacobperron added a commit to ros2/ros2cli that referenced this issue Apr 18, 2020
Fixes #480

The actual tests are the same, except with the use of launch_testing we ensure the CLI daemon
is restarted between tests. This follows a similar pattern as the other ros2cli tests.

In addition to converting to launch tests, this change also runs the tests for all RMW implementations.

For now, we are skipping tests on Windows. Other CLI tests are skipped on Windows since #489. To be reverted when ros2/build_farmer#248 is resolved.

Signed-off-by: Jacob Perron <[email protected]>
@nuclearsandwich
Copy link
Member Author

https://ci.ros2.org/job/nightly_osx_release/1626/ has been stuck in test_launch_ros for more than 24 hours. I'm going to abort it.

@sloretz
Copy link

sloretz commented May 1, 2020

Another test_launch_ros hang https://ci.ros2.org/job/nightly_osx_extra_rmw_release/752/console

@chapulina
Copy link

🧑‍🌾 Another test_launch_ros hang: https://ci.ros2.org/job/ci_osx/8638/console

17:21:10 test/test_copyright.py .                                                 [  1%]
17:21:10 test/test_flake8.py .                                                    [  3%]
17:21:10 test/test_pep257.py .                                                    [  4%]
17:21:10 test/test_launch_ros/test_normalize_parameters.py .....................  [ 39%]
17:21:10 test/test_launch_ros/test_normalize_remap_rule.py .....                  [ 47%]
// hang

@ivanpauno
Copy link
Member

@rotu
Copy link

rotu commented May 6, 2020

Any way to get a copy of log/latest_test/events.log from that run?

@rotu
Copy link

rotu commented May 6, 2020

I wonder if it has something to do with the many resource leaks in osrf_pycommon osrf/osrf_pycommon#62

@rotu
Copy link

rotu commented May 6, 2020

May also be leaking of open file descriptors in launch_yaml.Parser ros2/launch#415

@rotu
Copy link

rotu commented May 7, 2020

This may also be a contributing cause: ros2/launch#416

@clalancette
Copy link

We aren't using this repository anymore for buildfarm issues, so I'm going to archive it. Thus I'm closing out this issue. If you continue to have problems, please report another bug against https://github.com/ros2/ros2. Thank you.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants