[lifecycle_manager] expose service_timeout #4838

doisyg · 2025-01-09T13:19:32Z

Basic Info

Info	Please fill out this column
Ticket(s) this addresses
Primary OS tested on	Ubuntu
Robotic platform tested on	Dexory robot
Does this PR contain AI generated software?	No

Description of contribution in a few bullet points

This adds a parameter service_timeout, default to 5s, to control explicitly the timeout applied to the change_state and get_state calls in the lifecycle_manager node. Which were functionally respectively 5s and 2s (LifecycleServiceClient default). This is useful to stabilize bringup in CPU limited systems (or in systems whith looooooots of nodes started at the same time) where a lifecycle_manager could failed to transition its managed nodes due to change_state or get_state timing out.
Slight refactor of LifecycleServiceClient to not crop the timeout to plain seconds in change_state and get_state. + replacing the 1 change_state overload by a default parameter.

Description of documentation updates required from your changes

Probably: https://docs.nav2.org/configuration/packages/configuring-lifecycle.html

For Maintainers:

Check that any new parameters added are updated in docs.nav2.org
Check that any significant change is added to the migration guide
Check that any new features OR changes to existing behaviors are reflected in the tuning guide
Check that any new functions have Doxygen added
Check that any new features have test coverage
Check that any new plugins is added to the plugins page
If BT Node, Additionally: add to BT's XML index of nodes for groot, BT package's readme table, and BT library lists

mergify · 2025-01-09T13:20:49Z

@doisyg, your PR has failed to build. Please check CI outputs and resolve issues.
You may need to rebase or pull in main due to API changes (or your contribution genuinely fails).

Signed-off-by: Guillaume Doisy <[email protected]>

mergify · 2025-01-09T13:37:08Z

@doisyg, your PR has failed to build. Please check CI outputs and resolve issues.
You may need to rebase or pull in main due to API changes (or your contribution genuinely fails).

mergify · 2025-01-09T19:23:16Z

@doisyg, your PR has failed to build. Please check CI outputs and resolve issues.
You may need to rebase or pull in main due to API changes (or your contribution genuinely fails).

SteveMacenski

I agree with these changes, however I recall trying to add this before to handle one of the few places that there aren't networking timeouts set and it causing issues/instability in the stack. I think related to service calls in ROS 2 with timeouts being alot more flaky to actually get sent/processed (even when that timeout is like 2000s) than if you sent it without the timeout -- for some weird RMW/DDS reason I'm sure. If memory serves, it made it flaky for the system to actually get brought up some non-trivial percentage of the time, so I was forced to revert it after a ton of user reports.

Maybe that's been fixed now? Do you see any issues if you launch it a couple dozen times?

I want to see what CI will say but the build was failing. I just re-triggered but if it has those permission issues again, please ping Ruffin to take a look

doisyg · 2025-01-10T11:23:24Z

Summoning @ruffsl ,)

Yes, services in ros2 have a quite a history of failing due to middleware issues, I still have PTSD. I am not ruling it out in our system. At the moment, when bringup all our nodes all together at the same time, we can have the get_state of one of our many lifecycle managers randomly taking more than 5s to get a reply. I am not sure yet if it is just taking more time (optimistic scenario) or failing completely (pessimistic scenario, middleware issue).

In terms of this PR impact, the change of behavior system wide should be limited: when the new service_timeout parameter is not explicitly set, it falls back to a default of 5s. Which was the value applied to get_change before this PR, but change_state had a default of 2s. So the only functional impact (when no using the new parameter) is that the timeout of the change_state call of the lifecycle manager is increased from 2 to 5s.

doisyg · 2025-01-10T17:56:03Z

Test re-run (thanks @ruffsl), now test_docking_server failing, related to this PR you think @SteveMacenski ?

SteveMacenski · 2025-01-10T20:02:49Z

The current job run has every system test failing :-) https://app.circleci.com/pipelines/github/ros-navigation/navigation2/13347/workflows/b2a18891-9bba-43c9-9316-dd05b976ee2e/jobs/40015/tests

we can have the get_state of one of our many lifecycle managers randomly taking more than 5s to get a reply.

Can we then update only that, exposing the hardcoded timeout as a parameter, without changing the other methods to using a timeout? This PR removes the timeout-less invoke version of these methods that we currently rely on, which is the main concern.

[lifecycle_manager] expose service_timeout

f644c6f

Signed-off-by: Guillaume Doisy <[email protected]>

doisyg force-pushed the lifecycle_manager_timeout_main branch from b3af4d0 to f644c6f Compare January 9, 2025 13:35

SteveMacenski reviewed Jan 9, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[lifecycle_manager] expose service_timeout #4838

[lifecycle_manager] expose service_timeout #4838

doisyg commented Jan 9, 2025

mergify bot commented Jan 9, 2025

mergify bot commented Jan 9, 2025

mergify bot commented Jan 9, 2025

SteveMacenski left a comment •

edited

Loading

doisyg commented Jan 10, 2025

doisyg commented Jan 10, 2025

SteveMacenski commented Jan 10, 2025 •

edited

Loading

[lifecycle_manager] expose service_timeout #4838

Are you sure you want to change the base?

[lifecycle_manager] expose service_timeout #4838

Conversation

doisyg commented Jan 9, 2025

Basic Info

Description of contribution in a few bullet points

Description of documentation updates required from your changes

For Maintainers:

mergify bot commented Jan 9, 2025

mergify bot commented Jan 9, 2025

mergify bot commented Jan 9, 2025

SteveMacenski left a comment • edited Loading

Choose a reason for hiding this comment

doisyg commented Jan 10, 2025

doisyg commented Jan 10, 2025

SteveMacenski commented Jan 10, 2025 • edited Loading

SteveMacenski left a comment •

edited

Loading

SteveMacenski commented Jan 10, 2025 •

edited

Loading