Failures in lifecycle transitions are not correctly handled #1880

alsora · 2022-01-31T16:44:06Z

I recently realized that there is a bug in the current implementation of lifecycle transitions.
However, it's unclear to me what the expected behavior should be, so I'm opening a ticket to discuss it.

The problem is the following.
Let's assume that the node is currently in the active state and there is a "deactivate" transition request.

A lifecycle transition is requested, which calls the change_state function https://github.com/ros2/rclcpp/blob/master/rclcpp_lifecycle/src/lifecycle_node_interface_impl.hpp#L379-L448

The code will then execute rcl_lifecycle_trigger_transition_by_id, trying to transition to the deactivating state.

In the RCL function _trigger_transition https://github.com/ros2/rcl/blob/master/rcl_lifecycle/src/rcl_lifecycle.c#L349-L371 we update the current state with state_machine->current_state = transition->goal; before publishing notifications.
If the call to rcl_publish fails (and it can fail), the function returns immediately with an error code, which in turn aborts the change_state call https://github.com/ros2/rclcpp/blob/master/rclcpp_lifecycle/src/lifecycle_node_interface_impl.hpp#L396-L401 with an error along the lines of "Unable to start transition %u from current state %s: Failed to publish".

The problem is that, although the transition failed to start, we have already updated the state machine, so the node is now in the deactivating state.
The on_deactivate user callback has not been invoked.
Future requests either to activate or to deactivate will be rejected because they are not valid transitions from this state.
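
To make the sequence above concrete, here is a minimal toy model of the ordering problem. It is only a sketch: ToyStateMachine, State, and start_deactivate_commit_first are hypothetical names invented for illustration, and the publish callback merely stands in for rcl_publish. None of this is the actual rcl_lifecycle code.

```cpp
#include <cassert>
#include <functional>

// Toy model only: hypothetical names, not the rcl_lifecycle API.
enum class State { Inactive, Active, Deactivating };

struct ToyStateMachine {
  State current = State::Active;
  // Stands in for rcl_publish; returns false when publishing fails.
  std::function<bool()> publish;
};

// Mirrors the ordering described above: commit the new state, then publish.
bool start_deactivate_commit_first(ToyStateMachine & sm)
{
  if (sm.current != State::Active) {
    return false;  // not a valid transition from the current state
  }
  sm.current = State::Deactivating;  // state is committed first
  if (!sm.publish()) {
    return false;  // early return leaves the machine in Deactivating
  }
  return true;
}
```

With a publish callback that always fails, the first call returns false but leaves current at State::Deactivating, and every later call is rejected by the validity check, which is the stuck situation described above.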

The first easy solution I thought of was to just move the state_machine->current_state = transition->goal; line after publishing the notification.
In this way, if the transition really failed to start, the node would remain in the original state.
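
A rough sketch of that reordering, again on a toy model rather than the real rcl code. ToyStateMachine, State, and start_deactivate_publish_first are hypothetical names, and the publish callback is an assumed stand-in for rcl_publish.

```cpp
#include <cassert>
#include <functional>

// Toy model only: hypothetical names, not the rcl_lifecycle API.
enum class State { Inactive, Active, Deactivating };

struct ToyStateMachine {
  State current = State::Active;
  // Stands in for rcl_publish; returns false when publishing fails.
  std::function<bool()> publish;
};

// Publish the notification first and commit the state only on success.
bool start_deactivate_publish_first(ToyStateMachine & sm)
{
  if (sm.current != State::Active) {
    return false;  // not a valid transition from the current state
  }
  if (!sm.publish()) {
    return false;  // state untouched: the node is still Active
  }
  sm.current = State::Deactivating;
  return true;
}
```

If publishing fails, the machine stays in State::Active, so a later deactivate request is still a valid transition and can be retried.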

The same situation can also happen while finalizing the transition.
However, here we can't stay in the current state (which would be "deactivating"), and we can't go back to the initial one either, since the user callback has already been invoked.

What are your thoughts on this problem?
Given that transitions are made of multiple stages and that any of them can fail, how should a user deal with this?
Should we always bring the node to the error state?
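
For discussion only, the "go to the error state" option could look roughly like the sketch below when finalizing fails. This is one possible policy, not a decision and not the current behavior. All names (ToyStateMachine, State, finish_deactivate) are hypothetical, and ErrorProcessing here is just a toy stand-in for the lifecycle error-handling state.

```cpp
#include <cassert>
#include <functional>

// Toy model only: hypothetical names, not the rcl_lifecycle API.
enum class State { Inactive, Active, Deactivating, ErrorProcessing };

struct ToyStateMachine {
  State current = State::Deactivating;
  // Stands in for rcl_publish; returns false when publishing fails.
  std::function<bool()> publish;
};

// Finalize a deactivation whose user callback has already run.
// On a publish failure the node can neither stay in Deactivating nor
// return to Active, so this sketch routes it to ErrorProcessing.
bool finish_deactivate(ToyStateMachine & sm)
{
  if (sm.current != State::Deactivating) {
    return false;  // nothing to finalize
  }
  if (!sm.publish()) {
    sm.current = State::ErrorProcessing;  // assumed policy, see note above
    return false;
  }
  sm.current = State::Inactive;
  return true;
}
```

The trade-off is that a transient publish failure then forces the node through error handling even though the user callback itself succeeded.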

fujitatomoya mentioned this issue Feb 9, 2022

Lifecycle primary state error transitions ros2/design#283

Open

clalancette assigned wjwwood Feb 17, 2022

alsora mentioned this issue May 24, 2022

Lifecycle node trigger_transition() function caused transition publish failed #1941

Closed
