[lifecycle_manager] expose service_timeout #4838
base: main
Conversation
@doisyg, your PR has failed to build. Please check CI outputs and resolve issues.
Signed-off-by: Guillaume Doisy <[email protected]>
force-pushed from b3af4d0 to f644c6f
@doisyg, your PR has failed to build. Please check CI outputs and resolve issues.
I agree with these changes, however I recall trying to add this before to handle one of the few places where there aren't networking timeouts set, and it caused issues/instability in the stack. I think it's related to service calls in ROS 2 with timeouts being a lot more flaky to actually get sent/processed (even when that timeout is like 2000 s) than if you sent them without the timeout -- for some weird RMW/DDS reason, I'm sure. If memory serves, it made it flaky for the system to actually get brought up some non-trivial percentage of the time, so I was forced to revert it after a ton of user reports.
Maybe that's been fixed now? Do you see any issues if you launch it a couple dozen times?
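For context, here is a minimal, hypothetical rclcpp sketch (plain rclcpp, not Nav2 code; the node and service names are placeholders) of the two patterns being compared: waiting on a lifecycle `get_state` response with a client-side timeout versus blocking without one.

```cpp
// Hypothetical illustration only -- plain rclcpp, not Nav2 code.
// Node name, service name, and timeout values are placeholders.
#include <chrono>
#include <memory>

#include "lifecycle_msgs/srv/get_state.hpp"
#include "rclcpp/rclcpp.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("lifecycle_probe");
  auto client = node->create_client<lifecycle_msgs::srv::GetState>(
    "/managed_node/get_state");  // placeholder service name

  if (!client->wait_for_service(std::chrono::seconds(2))) {
    RCLCPP_ERROR(node->get_logger(), "get_state service not available");
    rclcpp::shutdown();
    return 1;
  }

  auto request = std::make_shared<lifecycle_msgs::srv::GetState::Request>();
  auto future = client->async_send_request(request);

  // With a client-side timeout: returns TIMEOUT if no response arrives in
  // time, so the caller can retry or report a failed transition.
  auto rc = rclcpp::spin_until_future_complete(
    node, future, std::chrono::seconds(5));
  if (rc == rclcpp::FutureReturnCode::SUCCESS) {
    RCLCPP_INFO(
      node->get_logger(), "current state: %s",
      future.get()->current_state.label.c_str());
  } else {
    RCLCPP_WARN(node->get_logger(), "get_state timed out or was interrupted");
  }

  // Timeout-less variant: the same call without a duration blocks until a
  // response arrives (or the context is shut down), e.g.:
  //   rclcpp::spin_until_future_complete(node, future);

  rclcpp::shutdown();
  return 0;
}
```

The timed variant lets bringup fail fast and recover instead of hanging, at the cost of the middleware flakiness described above.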
I want to see what CI will say, but the build was failing. I just re-triggered, but if it has those permission issues again, please ping Ruffin to take a look.
Summoning @ruffsl ;) Yes, services in ROS 2 have quite a history of failing due to middleware issues, I still have PTSD. I am not ruling it out in our system. At the moment, when bringing up all our nodes at the same time, we can have the […]. In terms of this PR's impact, the system-wide change of behavior should be limited: when the new […]
Test re-run (thanks @ruffsl), now […]
The current job run has every system test failing :-) https://app.circleci.com/pipelines/github/ros-navigation/navigation2/13347/workflows/b2a18891-9bba-43c9-9316-dd05b976ee2e/jobs/40015/tests
Can we then update only that, exposing the hardcoded timeout as a parameter, without changing the other methods to use a timeout? This PR removes the timeout-less […]
Basic Info
Description of contribution in a few bullet points
- Expose a new `service_timeout` parameter, defaulting to 5 s, to explicitly control the timeout applied to the `change_state` and `get_state` calls in the `lifecycle_manager` node, which were effectively 5 s and 2 s respectively (the `LifecycleServiceClient` defaults). This is useful to stabilize bringup on CPU-limited systems (or systems with lots of nodes started at the same time), where a `lifecycle_manager` could fail to transition its managed nodes due to `change_state` or `get_state` timing out.
- Update `LifecycleServiceClient` to no longer crop the timeout to whole seconds in `change_state` and `get_state`, and replace the extra `change_state` overload with a default timeout parameter (see the sketch below).
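For illustration, a minimal hypothetical sketch (assumed names, not the actual `lifecycle_manager` code) of declaring such a `service_timeout` parameter in seconds and converting it to milliseconds, so that sub-second values are not cropped away:

```cpp
// Hypothetical sketch -- assumed names, not the actual lifecycle_manager code.
// A 'service_timeout' parameter is declared in seconds and converted to
// std::chrono::milliseconds to preserve sub-second precision.
#include <chrono>
#include <memory>

#include "rclcpp/rclcpp.hpp"

class TimeoutDemo : public rclcpp::Node
{
public:
  TimeoutDemo()
  : rclcpp::Node("timeout_demo")
  {
    // 5 s default, matching the default described in this PR.
    declare_parameter("service_timeout", 5.0);
    const double timeout_s = get_parameter("service_timeout").as_double();
    service_timeout_ = std::chrono::milliseconds(
      static_cast<int64_t>(timeout_s * 1000.0));
    RCLCPP_INFO(
      get_logger(), "service_timeout: %lld ms",
      static_cast<long long>(service_timeout_.count()));
  }

private:
  // Would be handed to the change_state / get_state service calls.
  std::chrono::milliseconds service_timeout_{5000};
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<TimeoutDemo>());
  rclcpp::shutdown();
  return 0;
}
```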
Description of documentation updates required from your changes
Probably: https://docs.nav2.org/configuration/packages/configuring-lifecycle.html
For Maintainers: