Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[lifecycle_manager] expose service_timeout #4838

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

doisyg
Copy link
Contributor

@doisyg doisyg commented Jan 9, 2025


Basic Info

Info Please fill out this column
Ticket(s) this addresses
Primary OS tested on Ubuntu
Robotic platform tested on Dexory robot
Does this PR contain AI generated software? No

Description of contribution in a few bullet points

  • This adds a parameter service_timeout, default to 5s, to control explicitly the timeout applied to the change_state and get_state calls in the lifecycle_manager node. Which were functionally respectively 5s and 2s (LifecycleServiceClient default). This is useful to stabilize bringup in CPU limited systems (or in systems whith looooooots of nodes started at the same time) where a lifecycle_manager could failed to transition its managed nodes due to change_state or get_state timing out.
  • Slight refactor of LifecycleServiceClient to not crop the timeout to plain seconds in change_state and get_state. + replacing the 1 change_state overload by a default parameter.

Description of documentation updates required from your changes

Probably: https://docs.nav2.org/configuration/packages/configuring-lifecycle.html

For Maintainers:

  • Check that any new parameters added are updated in docs.nav2.org
  • Check that any significant change is added to the migration guide
  • Check that any new features OR changes to existing behaviors are reflected in the tuning guide
  • Check that any new functions have Doxygen added
  • Check that any new features have test coverage
  • Check that any new plugins is added to the plugins page
  • If BT Node, Additionally: add to BT's XML index of nodes for groot, BT package's readme table, and BT library lists

Copy link
Contributor

mergify bot commented Jan 9, 2025

@doisyg, your PR has failed to build. Please check CI outputs and resolve issues.
You may need to rebase or pull in main due to API changes (or your contribution genuinely fails).

@doisyg doisyg force-pushed the lifecycle_manager_timeout_main branch from b3af4d0 to f644c6f Compare January 9, 2025 13:35
Copy link
Contributor

mergify bot commented Jan 9, 2025

@doisyg, your PR has failed to build. Please check CI outputs and resolve issues.
You may need to rebase or pull in main due to API changes (or your contribution genuinely fails).

1 similar comment
Copy link
Contributor

mergify bot commented Jan 9, 2025

@doisyg, your PR has failed to build. Please check CI outputs and resolve issues.
You may need to rebase or pull in main due to API changes (or your contribution genuinely fails).

Copy link
Member

@SteveMacenski SteveMacenski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with these changes, however I recall trying to add this before to handle one of the few places that there aren't networking timeouts set and it causing issues/instability in the stack. I think related to service calls in ROS 2 with timeouts being alot more flaky to actually get sent/processed (even when that timeout is like 2000s) than if you sent it without the timeout -- for some weird RMW/DDS reason I'm sure. If memory serves, it made it flaky for the system to actually get brought up some non-trivial percentage of the time, so I was forced to revert it after a ton of user reports.

Maybe that's been fixed now? Do you see any issues if you launch it a couple dozen times?

I want to see what CI will say but the build was failing. I just re-triggered but if it has those permission issues again, please ping Ruffin to take a look

@doisyg
Copy link
Contributor Author

doisyg commented Jan 10, 2025

Summoning @ruffsl ,)

Yes, services in ros2 have a quite a history of failing due to middleware issues, I still have PTSD. I am not ruling it out in our system. At the moment, when bringup all our nodes all together at the same time, we can have the get_state of one of our many lifecycle managers randomly taking more than 5s to get a reply. I am not sure yet if it is just taking more time (optimistic scenario) or failing completely (pessimistic scenario, middleware issue).

In terms of this PR impact, the change of behavior system wide should be limited: when the new service_timeout parameter is not explicitly set, it falls back to a default of 5s. Which was the value applied to get_change before this PR, but change_state had a default of 2s. So the only functional impact (when no using the new parameter) is that the timeout of the change_state call of the lifecycle manager is increased from 2 to 5s.

@doisyg
Copy link
Contributor Author

doisyg commented Jan 10, 2025

Test re-run (thanks @ruffsl), now test_docking_server failing, related to this PR you think @SteveMacenski ?

@SteveMacenski
Copy link
Member

SteveMacenski commented Jan 10, 2025

The current job run has every system test failing :-) https://app.circleci.com/pipelines/github/ros-navigation/navigation2/13347/workflows/b2a18891-9bba-43c9-9316-dd05b976ee2e/jobs/40015/tests

we can have the get_state of one of our many lifecycle managers randomly taking more than 5s to get a reply.

Can we then update only that, exposing the hardcoded timeout as a parameter, without changing the other methods to using a timeout? This PR removes the timeout-less invoke version of these methods that we currently rely on, which is the main concern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants