Fast-DDS Service Reliability sometimes hangs lifecycle manager #3033
Comments
Ok, actually I think this is partially a duplicate of #2917. However, I still think the infinite timeout is problematic.
Do you see this with Cyclone? If not, then you should file this with Fast-DDS as a problem to resolve.
Try reordering the bringup sequence for lifecycle; my guess is it's not actually an issue with the controller server, it's just first in line. How is it a duplicate of #2917? That was on nested timer issues. Do you have a specific suggested change? From what I read in this, it seems like the service clients aren't being made properly, and without more specific information / lines / issue descriptions, I can only interpolate what I think you mean by that. I think you mean that they're just straight up not working, which is not a Nav2 issue, it's a DDS/RMW issue. Maybe you mean something we can change here, but I'm not sure what that is without more detail.
No, this never happens with CycloneDDS.
Please file a ticket with Fast-DDS and link back here so there's a thread connection. It sounds like that's really the problem -- unless you found something we can do in Nav2 to make it work? But it sounds like this is a communication issue that's out of this project's control. If there's something actionable for us here, I'd be happy to discuss it.
CC @MiguelCompany @EduPonz Another DDS-related issue has cropped up.
My suggestion is to not use
The fact that this is an issue in Fast-DDS should be reported to them, though. We can add a timeout so that in case of failure we don't hang forever, but it is problematic that we're failing in the first place when moving to Fast-DDS.
Well, I don't think they are the same thing -- since they are in different places. Unless you want the idea to be "throughout Nav2, don't have infinite timeouts waiting for action server / service servers".
Ah you're right, backed that out.
Feel free to propose a PR. 5 min seems a little nuts, 10 sec seems more reasonable even if respawned from a remote server, but yeah, I think a parameter would be good.
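To make that concrete, here is a minimal, hypothetical rclcpp sketch of a parameterized, bounded wait on a lifecycle service. The parameter name `service_wait_timeout` and the 10 s default are illustrative only, not an existing Nav2 parameter:

```cpp
// Hypothetical sketch only: bound the wait on a lifecycle service with a
// configurable timeout instead of blocking forever. Parameter name and
// default value are made up for illustration.
#include <chrono>

#include "rclcpp/rclcpp.hpp"
#include "lifecycle_msgs/srv/change_state.hpp"

bool wait_for_lifecycle_service(
  rclcpp::Node & node,
  const rclcpp::Client<lifecycle_msgs::srv::ChangeState>::SharedPtr & client)
{
  // Expose the timeout as a parameter (seconds), e.g. 10 s as a sane default.
  const double timeout_s = node.declare_parameter("service_wait_timeout", 10.0);

  if (!client->wait_for_service(std::chrono::duration<double>(timeout_s))) {
    RCLCPP_ERROR(
      node.get_logger(),
      "Service %s not available after %.1f s; failing instead of hanging.",
      client->get_service_name(), timeout_s);
    return false;
  }
  return true;
}
```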
Here is the controller server backtrace while it is in this state where the lifecycle manager is waiting for it:
Meanwhile, the lifecycle manager backtrace is here:
I can run
and if I try calling
And I can't query via lifecycle
It is important to note that this error condition does not happen every time. It is inconsistent.
Here is a first effort at a reproduction. It reproduces the problem only some of the time though. https://gist.github.com/Aposhian/043359e09a203900e8db55407a8b5e38
@Aposhian One thing I don't understand: in the issue description you say this happens on rolling, but on the gist you just shared it uses
So my question is: does the issue only reproduce on galactic? Does it also reproduce on humble/rolling?
Note that Rolling binaries on 20.04 are outdated, since Rolling has moved to 22.04, so there may be updates missing if not using 22.04 as the base or building from source.
Yes, I am using old rolling on focal, so closer to galactic than to humble.
Yes, I cannot reproduce this issue on Humble images.
Keep in mind that Galactic is EOL in 5 months. Since this is a galactic-specific problem, what do you want to do about it? Would it be possible to backport the patch @MiguelCompany? If not, we could try to patch something in just the galactic branch for you @Aposhian in the interim. If it's just for Galactic, to deal with just an issue in Fast-DDS, I'm OK being a bit more sloppy about it than I normally would, since its lifecycle is very limited and it would not be pushed into future distributions -- the maintenance cost and technical reach are limited.
@MiguelCompany suggested I try building FastDDS from source on the 2.3.x branch, which has unreleased fixes for galactic. I don't think any Nav2 action is required. I will update if that is a viable workaround for galactic.
OK - if that fixes things feel free to close when ready!
Ok, I tried building FastDDS from the 2.3.x branch, but it doesn't resolve the issue. https://gist.github.com/Aposhian/043359e09a203900e8db55407a8b5e38
@MiguelCompany while not specific to Nav2, lifecycle services failing is still happening intermittently on humble. Can we migrate/copy this issue to Fast-DDS?
Agreed, some private conversations I've had in recent weeks have also expressed issues with services in Humble. @MiguelCompany what's the next step / status on this issue?
Any updates? This is a pretty critical issue and I'm trying to find out if we need to take action to resolve it 😄
In the same theme as #3032, I want to get this off the queue, so I'm trying to find the actionable things we can do to resolve this. I know the main issue is Fast-DDS not handling lots of service calls well, but I suppose we can improve the way these failures occur. The lifecycle manager and the BT Action Node in BT Navigator are the only 2 places where we intentionally have waits without timeouts, since these are pretty critical operations.

For this (lifecycle manager), what should happen if a service for transitioning doesn't come up? Fail everything? I suppose we could also remove the checks for services being up in the constructors https://github.com/ros-planning/navigation2/blob/main/nav2_util/src/lifecycle_service_client.cpp#L39 knowing that we check if they're up before we submit requests in the change/get methods that were recently added to support respawning.

I'll say though that the current status of the lifecycle services and the manager is a stable but thoughtfully set up balance between supporting Bond, Respawn, and Lifecycle all at the same time. Removing that check could cause some real headaches elsewhere, but I won't be sure whether it's the best answer until I try. Like in the other ticket though, I think the priority should be on working with Fast-DDS to resolve this issue outright.
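As a rough illustration of the "check before each request" idea above, here is a hedged rclcpp sketch; the function name and signature are made up and do not mirror nav2_util::LifecycleServiceClient:

```cpp
// Hedged sketch of the "check before each request" idea; names and
// signatures are illustrative, not the actual Nav2 implementation.
#include <chrono>
#include <cstdint>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "lifecycle_msgs/srv/change_state.hpp"

bool change_state(
  const rclcpp::Node::SharedPtr & node,
  const rclcpp::Client<lifecycle_msgs::srv::ChangeState>::SharedPtr & client,
  uint8_t transition_id,
  std::chrono::seconds timeout)
{
  // Instead of waiting forever in a constructor, verify the server exists
  // right before the request and fail gracefully if it does not.
  if (!client->service_is_ready()) {
    RCLCPP_ERROR(node->get_logger(), "Lifecycle service is not up; aborting transition.");
    return false;
  }

  auto request = std::make_shared<lifecycle_msgs::srv::ChangeState::Request>();
  request->transition.id = transition_id;

  auto future = client->async_send_request(request);
  if (rclcpp::spin_until_future_complete(node, future, timeout) !=
    rclcpp::FutureReturnCode::SUCCESS)
  {
    RCLCPP_ERROR(node->get_logger(), "Transition request timed out.");
    return false;
  }
  return future.get()->success;
}
```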
Yeah, I'm torn as to whether this is something that Nav2 should consider fallible as part of normal operation and handle gracefully. But once you start to question the underlying RMW, lots of other assumptions go out the window.
Since you also said this is happening on Humble, we have moved our investigations there, with no luck so far. Using your reproducer on Humble yields no failures; we are able to run it hundreds of times without error. Do you have any other clue on how to reproduce this issue on Humble? We also have this PR open that aims to bring Fast DDS's Waitsets to the RMW. It's still under review, but you could try using it and see if the situation improves.
I think that sounds reasonable. However, we won't be much use for testing this at the moment, since we have decided to use CycloneDDS with Humble for our main operation.
OK, I can add that to my task list early next week to add the log / waits. I could actually do this in the BT Action Node too, so that in either case, if it isn't connecting while waiting, we're printing that for awareness. Do you think that's a good solution to #3032 as well?
Yes I think so.
Sweet, it's on my queue for this afternoon.
#3071 implements
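For reference, here is a rough rclcpp sketch of the kind of log-while-waiting pattern discussed above (not necessarily what #3071 actually merged): the wait stays unbounded, but it is chunked so a missing server shows up in the logs.

```cpp
// Rough sketch of the "log while waiting" pattern: keep the wait unbounded,
// but print periodically so a hung service is visible in the logs.
#include <chrono>

#include "rclcpp/rclcpp.hpp"
#include "lifecycle_msgs/srv/get_state.hpp"

using namespace std::chrono_literals;

void wait_and_warn(
  rclcpp::Node & node,
  const rclcpp::Client<lifecycle_msgs::srv::GetState>::SharedPtr & client)
{
  // Wait in 1 s slices; each failed slice emits a warning instead of
  // blocking silently forever.
  while (rclcpp::ok() && !client->wait_for_service(1s)) {
    RCLCPP_WARN(
      node.get_logger(),
      "Waiting for service %s... still not available.",
      client->get_service_name());
  }
}
```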
Sorry, this should stay open until the service reliability is handled in Fast-DDS.
@Aposhian that PR was merged that supposedly fixes this -- have you tested by chance?
No, but this is probably something we can get to testing in the next month.
@SteveMacenski can you please point us to the upstream PR that we should make sure we test with? Was it this one? ros2/rmw_fastrtps#619
yes!
@Aposhian I don't suppose you tested as part of your migration back to Fast-DDS to use the features you liked?
No, not yet.
Hi everyone, the new Humble sync is out, which contains the changes made to solve this issue.
Will report back if we end up having time to try it. Unfortunately, both of us seem to be a bit short on time for the rest of the year, so I cannot promise anything.
Hi everyone, on January 27th there was a Humble patch release which included additional fixes (mainly eProsima/Fast-DDS#3195). It'd be great if someone could give this a try and see whether this ticket can be closed.
Agreed!
Unfortunately, the problem was reproduced on Fast-DDS with the PR3195 patch: very rare, but it still appears typically 1 time per ~50-100 Nav2 stack runs. Checked on Ubuntu 20.04 with ROS 2 Rolling built from sources (20230110 version, PR3195 is there). Just for reference: for verification I used the Nav2 system tests, which deploy the full Nav2 stack on each run. The stress-test script is attached to this message: keepout_stress.sh.txt
I can confirm that we still encounter this issue with the
This is killing the processes with
I would check two things:
This hasn't received any updates in about a year from anyone noting an issue, and I also haven't seen it in my day-to-day development in some time. I'm closing this ticket as fixed by Fast-DDS since it no longer appears to be a problem of reportable frequency. Though, happy to reopen if anyone can show that this is still an issue plaguing their application.
Bug report
Steps to reproduce issue
Give lifecycle manager a node to manage that is erroring out, and for some reason doesn't create its lifecycle services properly.
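For illustration only, here is one hypothetical way to simulate such a node (this is not the code from the reproduction gist): a managed node that errors out during startup, so its lifecycle services never stay available for the manager.

```cpp
// Hypothetical reproduction helper (illustrative only): a managed node that
// throws during construction, so the process dies and its lifecycle services
// never stay up for the lifecycle manager to reach.
#include <memory>
#include <stdexcept>

#include "rclcpp/rclcpp.hpp"
#include "rclcpp_lifecycle/lifecycle_node.hpp"

class BrokenServer : public rclcpp_lifecycle::LifecycleNode
{
public:
  BrokenServer()
  : rclcpp_lifecycle::LifecycleNode("controller_server")
  {
    // Simulate a node that errors out before it is usable.
    throw std::runtime_error("simulated startup failure");
  }
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<BrokenServer>();  // throws; the process exits here
  rclcpp::spin(node->get_node_base_interface());
  rclcpp::shutdown();
  return 0;
}
```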
Expected behavior
Failure to connect to lifecycle state services should eventually timeout, or indicate that the lifecycle node it is trying to manage is not working.
Actual behavior
lifecycle manager blocks indefinitely.
Additional information
I have a hard time getting this to reproduce reliably. I think the lifecycle services failing to come up has something to do with FastDDS shared mem. This is happening with just nav2 controller_server for me.
This is resulting in behavior observed in #3027, since the Controller server is unable to configure to provide the FollowPath action.