Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BtActionServer should not create ActionServer until activated #3032

Closed
Aposhian opened this issue Jun 21, 2022 · 13 comments
Closed

BtActionServer should not create ActionServer until activated #3032

Aposhian opened this issue Jun 21, 2022 · 13 comments

Comments

@Aposhian
Copy link
Contributor

Bug report

Required Info:

  • Operating System:
    • Ubuntu 20.04
  • ROS2 Version:
    • galactic
  • Version or commit hash:
    • binaries
  • DDS implementation:
    • FastDDS

Steps to reproduce issue

BtActionServer is prevented from going into active by #3027. It creates a simple action server, but does not "activate" it. This means that any other clients that call wait_for_action_server including the ros2 CLI report that the action server is ready, when it is actually "inactive" which is not a true lifecycle state that can be queried.

Expected behavior

wait_for_action_server returning true on an action client should mean that the action server is ready to serve requests.

Actual behavior

wait_for_action_server returns true, but when goals are sent, nothing happens.

@SteveMacenski
Copy link
Member

SteveMacenski commented Jun 21, 2022

I'm not sure I understand what you mean. wait_for_action_server is in the BT node constructors, which is called when the BT is constructed. This is done on activation https://github.com/ros-planning/navigation2/blob/313caeea7ef7ff7763109134b33e7d4f1045b6a0/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp#L119-L127. As you can see, the Navigator-action server is immediately activated after the BT is constructed. That seems like the logical order: We cannot process navigator-action commands until the constituent BT node action servers are ready to process information. Once we load a tree, then we're ready to process commands, so the action server is then set to activate()

Is there a concrete suggestion perhaps that can help me understand your issue?

@Aposhian
Copy link
Contributor Author

Yes. This is a problem because of #3027. When calling loadBehaviorTree on a BT that loads a BTActionNode there is a potential to block indefinitely. That results in an intermediate state where the action server exists, but since the BT didn't finish loading, it can't actually serve requests, but elsewhere wait_for_action_server will say that the action server exists and is ready.

@Aposhian
Copy link
Contributor Author

My suggestion is something like this: don't create the action server in on_configure: just do it on_activate after the BT has been successfully loaded.

 bool BtActionServer<ActionT>::on_activate() 
 { 
   if (!loadBehaviorTree(default_bt_xml_filename_)) { 
     RCLCPP_ERROR(logger_, "Error loading XML file: %s", default_bt_xml_filename_.c_str()); 
     return false; 
   } 
   action_server_ = make_unique<SimpleActionServer>();
   action_server_->activate(); 
   return true; 
 } 

@SteveMacenski
Copy link
Member

SteveMacenski commented Jun 21, 2022

That wouldn't be in line with the memory allocation scheme setup by the lifecycle nodes. Dynamic memory allocation is supposed to be in on_configure as much as possible. Requests to the simple action server should be rejected if its not activated, right? So you should be seeing from the client side a rejection of the goal? Its not ideal, but at least you have knowledge that its not taking requests yet.

If you simply have your nav client / autonomy node be a lifecycle node that is to be transitioned after BT navigator, then you'd be guaranteed that this is active and ready to go by the time it's own on_activate() method is called. You can exploit lifecycle nodes to really trivially solve this issue 😄

If that doesn't work for you, we can think about another solution - but that seems like by far the best one

@Aposhian
Copy link
Contributor Author

Hm, maybe there was something else going on, because even though looking at the source I agree it should just reject the goal if it is not activated, in practice, it was accepting the goal, but then not calling the callbacks.

@Aposhian
Copy link
Contributor Author

Dynamic memory allocation is supposed to be in on_configure as much as possible.

Also, I am curious as to where the rationale for this is. I have not read anything about that.

@SteveMacenski
Copy link
Member

Hm, maybe there was something else going on, because even though looking at the source I agree it should just reject the goal if it is not activated, in practice, it was accepting the goal, but then not calling the callbacks.

OK, we can leave this ticket open for now then while you look into that. Feel free to re-name the ticket if you find another issue is actually the 'real' issue.

Also, I am curious as to where the rationale for this is. I have not read anything about that.

https://design.ros2.org/articles/node_lifecycle.html:

The configuration of a node will typically involve those tasks that must be performed once during the node’s life time, such as obtaining permanent memory buffers and setting up topic publications/subscriptions that do not change.

The node uses this to set up any resources it must hold throughout its life (irrespective of if it is active or inactive). As examples, such resources may include topic publications and subscriptions, memory that is held continuously, and initialising configuration parameters.

@SteveMacenski SteveMacenski added the question Further information is requested label Jun 21, 2022
@SteveMacenski
Copy link
Member

I doubt it, but I'd just quickly try Cyclone to see if this is another oddity from Fast-DDS. All the sudden I'm seeing a bunch of odd tickets and this might be related? I think its unlikely this is a Fast-DDS issue, but just good to sanity check.

@Aposhian
Copy link
Contributor Author

This is a cascading failure from #3033, which is a difference between FastDDS and CycloneDDS. This ticket is a proposition to change the error condition behavior to be easier to debug, and more clearly point to where the problem is.

@vinnnyr
Copy link
Contributor

vinnnyr commented Jun 21, 2022

If you simply have your nav client / autonomy node be a lifecycle node that is to be transitioned after BT navigator,

Sorry if this is verging to be off topic -- but can you just hook arbritary lifecycle nodes into the nav2 lifecycle manager, or do they have to implement the bond interface?

@SteveMacenski
Copy link
Member

but can you just hook arbritary lifecycle nodes into the nav2 lifecycle manager, or do they have to implement the bond interface?

There's a param to turn off the bond connections, so no, its not strictly required. Though for a given lifecycle manager, its all or nothing, but bond isn't strictly speaking required, its a nice to have feature mostly for people developing algorithms new to ROS 2 that might create lock-up conditions.

Though, there's no requirement that all nodes are in the same lifecycle manager. In fact, we use multiple lifecycle managers in Nav2 already. So you could have a lifecycle manager instance for BT navigator and your stuff above it with bond turned off. That way all of the Nav2 stuff that could lock up are using bond timers. That would solve your problem.

@Aposhian got it, understood. But the non-rejection of actions is a bit concerning and I can understand the weird position this is in. If the rejections can be done properly, then that would be enough so solve this issue, regardless of the lifecycle manager stuff, no?

Moving the action server around I view as a last resort since that's breaking the rules of lifecycle that we're trying to follow as closely as possible to be a best practices example

@SteveMacenski
Copy link
Member

SteveMacenski commented Jul 7, 2022

@Aposhian I spent some time on this to get it off the Nav2 ticket queue and could not replicate. Things appear to be correctly rejected by the action server as far as I can tell and I'm able to detect that failed goal send request on the client side. If I make the simple action server always reject goals, I see that properly on the client side. Then we know that active is false and only turns true when activate() is called after the BT XML is loaded, so its not possible for that to be true. So I don't see how its possible that you're seeing the behavior you are - unless potentially you're not correctly checking if a goal was rejected?

Even if we moved the action server creation until after the BT XML is loaded, it doesn't resolve the actual problem that Fast-DDS is not discovering / connecting the services / actions that are blocking in the BT Action Server. We could add a timeout to these - but then what's the expected outcome if it fails? Throw exception? Certainly not continue. If we throw exception and catch in on_activate to fail the transition, you'd have to re-transition the stack up again to try again.

I suppose we could do that if you think that's the right option. Its not alot of work, just having the BT Action Node and BT Cancel Node use a timeout and if found, throw exception. In loadBehaviorTree, catch it and return false (and update logging in goalReceived to indicate it failed to load, not necessarily they it didn't exist). I could add that to my task list for next week.

Would that be enough? That feels like just putting a bandaid on, given that it doesn't actually resolve the underlying issue, but at least it would fail more "nicely". You'd have to take charge of the lifecycle management of Nav2 / your nodes rather than relying on autostart in that situation, since you'd need to know if it failed to come up.

My personal recommendation would be to file tickets and follow up consistently about them with Fast-DDS so that this issue is resolved, this is a big problem.

@Aposhian
Copy link
Contributor Author

Aposhian commented Jul 7, 2022

Yeah, if you are having trouble reproducing the issue where goals are not accepted when the simple action server is not "active" then this can probably be at the least de-prioritized. And now that #3029 is merged, I don't think this is a big issue, since it won't remain in that intermediate state indefinitely like before. I think let's close.

@Aposhian Aposhian closed this as completed Jul 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants