Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[controller_manager] Fix execution point of doSwitch #211

Closed
adolfo-rt opened this issue Sep 4, 2015 · 9 comments
Closed

[controller_manager] Fix execution point of doSwitch #211

adolfo-rt opened this issue Sep 4, 2015 · 9 comments
Labels

Comments

@adolfo-rt
Copy link
Member

When switching controllers, the robot hardware abstraction might require changing the control mode of certain resources (e.g. joints). This feature was contributed in #200. I recently realized that the current implementation has an important flaw with a straightforward fix.

When a controller switch request is made, a non real-time ROS callback checks that the switch is possible, and among other things calls RobotHW::canSwitch(...). This callback doesn't perform the actual switching, but delegates it to the [real-time] control thread. Because of this, the call to RobotHW::doSwitch(...) should also be done in the control thread. We currently do it in the ROS callback, which is wrong.

This misplaced call to RobotHW::doSwitch(...) can cause a robot to switch the control mode of its joints before the new controllers are actually running. The ensuing behavior is undefined, and depends on how your hardware abstraction is implemented. I could reproduce this on hardware, and confirm that the fix proposed below address the issue.

The fix consists of moving the execution point of RobotHW::doSwitch(...), which entails that the method implementation must now be real-time safe. A fix for indigo is proposed in #209.

Because the real-time safe requirements now imposed on RobotHW::doSwitch(...), it now becomes much likelier that RobotHW::canSwitch(...) will have side-effects, i.e., change the state of the RobotHW instance. For this reason, I additionally propose that in jade we remove the constness of the RobotHW::canSwitch(...) method. Since jade has not been officially released, we can allow to make this small API change. A fix for jade is porposed in #210.

@ipa-mdl

@adolfo-rt adolfo-rt added the bug label Sep 4, 2015
@adolfo-rt adolfo-rt mentioned this issue Sep 4, 2015
@mathias-luedtke
Copy link
Contributor

You are right it might lead to undefined behaviour if this case is not handled properly in RobotHW, which is rather cumbersome.
But mode switching might take longer than 1 update cycle, so I am not sure how to integrate it in update() properly.

For CANopen it takes 2 to 10 (rough estimation) cycles to switch the mode depending on the device, which might require to switch the motor power off and on again.
To overcome this, I stop the motor motion in ros_canopen until the controller was switched (i.e. the first command was sent).
In my current implementation doSwitch blocks until all requested motors were switched properly, but this does not effect unrelated motors and controllers.
I can rewrite it to non-blocking behaviour (to be used in update()), but the undefined state for the controllers would still be present, since it takes longer than 1 cycle to switch and ControllerManager does not know anything about the internal state.

Maybe it's time to add proper resource states that are used for switching and detection of emergency stops, something like

  • STOPPED: no (claiming) controller is assigned
  • ERROR: controller is assigned, but resource is in error state
  • SWITCHING: resource is not driven currently, needs time to switch
  • READY: everything is finde

In addition we would need a mechanism to propagate the state between ControllerManager and RobotHW.
But this needs changes throughout the code and a notable amount of work.

@adolfo-rt
Copy link
Member Author

Currently switch_controller is required to be an atomic operation that can take place in a single control cycle. I'm open to discuss alternative strategies, but would prefer to not change current behavior unless no sensible alternative solutions exist. Non-atomic switching opens many questions, such as, what happens while switching completes?, will my robot fall?, is there any control?.

We've also encountered devices that take longer than one control cycle to switch.The following is a list of possible solutions that can be made to work with the single-control-cycle restriction of switch_controller. In all cases doSwitch must be real-time safe, hence non-blocking. Not all might apply to your setup.

  • Keep last command from controller being stopped, or set it to an uninitialized value (e.g. NaN). If mode switching takes longer than one cycle, the controllers will switch (stop/start) before the new mode configuration kicks in. It might be acceptable to keep the last commands of the stopped controller (which are constant, as they are no longer being updated) for a few more cycles.
  • Disallow certain mode transitions in RobotHW::canSwitch. If mode switching is really slow, and there is no policy that can be applied at the RobotHW level (as proposed above), maybe it doesn't make sense to allow it at runtime, or at least make you first stop one controller, then start a new one from controlled conditions (but not really switch). Truly reactive behaviors need switching in a single control cycle.
  • Change your hardware (or at least the electronics) for something that has acceptable switching behavior. This is easier said than done, I know. Some hardware simply require you to start sending a different type of command, and the switching is done transparently to you (I think UR arms work like this).

You say you're using CANopen. IIRC, the CANopen state machine should be able to transition between any to modes in two control cycles, right? (mode A→idle, idle→mode B). This is what we're getting from CAN devices.

@adolfo-rt
Copy link
Member Author

@ipa-mdl, please let me know what strategy you propose to go about this. I don't want to break ros_canopen without reason, but would also not want to leave this bug open for long. I'll be offline for the next three weeks, so there's no immediate rush.

@mathias-luedtke
Copy link
Contributor

Thanks for waiting :)

The CANopen state machines can switch directly between two modes (there is no real idle mode), but

  • the timing is unspecified, the mode switch is confirmed by reading the new mode as current mode
  • the switching might be restricted to certain motor states (e.g. voltage off)

Your option 3 is obviously not applicable here.
Option 2 is reasonable, but the user would need to know which controllers can be switched directly and which needs the intermediate stop. In addition two service calls are needed for actual mode switching.

The current canopen_motor_node implements option 1, but enforces that the new controllers are not started before the hardware is ready to process their commands. The latter is the crucial part for starting the new motion cleanly. However, this is not possible if doSwitch if called from update.
(The same holds for the second service call in option 2.)

What do you think about adding a non-RT prepareSwitch, like old doSwitch, perhaps with boolean return type?
This hybrid API will allow reactive one-cycle switching and our non-RT switching (used at stillstand).
If you do want to bloat the API any further, I have to hijack canSwitch for the preparation, it's ugly but will work.
(a non-const version of canSwitch would be great then)

I am/have been out of the office as well. I will give the hybrid interface a try in two weeks.

@adolfo-rt
Copy link
Member Author

What do you think about adding a non-RT prepareSwitch, like old doSwitch, perhaps with boolean return type?

Controller switching guarantees completion in a single control cycle. I'd like this behavior to remain, as it's one of the strong points of the controller manager. The prepare_switch alternative breaks this guarantee. Also, it's brittle to rely on the robot being standstill to run prepareSwitch, as the mechanism might be moved by external factors and not remain still.

If you do want to bloat the API any further, I have to hijack canSwitch for the preparation, it's ugly but will work.

Yes, please.

I don't know what you mean by ugly, but option 1 in #211 (comment) is a pretty decent compromise, and has worked well for us in joints that take a few cycles to switch.

If switching is really slow, I'd advise to go for option 2 in #211 (comment), but I think this can likely be prevented using option 1.

I don't want to put pressure here, but I consider this bug a blocker for the Jade release.

@mathias-luedtke
Copy link
Contributor

Option 1 does not work if the controller starts before the hardware is ready to process the commands.
It might work with properly tuned controllers, but not with forward controllers since the first commands will get lost. This problem gets worse if you want to synchronize all your axes.
Long story short: The start of a controller execution has to be blocked until all motors are ready.

Hijacking is ugly because I have to run the doSwitch tasks in canSwitch, relying on the fact that nothing can prevent the controllers from starting/stopping within the ControllerManager logic. In addtion canSwitch is const in indigo.

A compromise would be to rename canSwitch to prepareSwitch, non-const with boolean return type.

If you insist on #209 (or the jade equivalent) then this is a blocker for all further ros_canopen releases.
I am not aware of other driver implementations that use the switching feature already.
(If there is any, please let us know, your feedback is highly appreciated!)
just moving the execution point drives this feature useless for ros_canopen and we would end up in the situation we had before (i.e. hijacking unrelated functions).

@adolfo-rt
Copy link
Member Author

Long answer...

Option 1 does not work if the controller starts before the hardware is ready to process the commands.

I don't know if it works for you, but this is the sequence of events I'm expecting when switching from ctrl_a, which runs in mode_a to ctrl_b, which runs in mode_b:

When switching is requested:

  • doSwitch is called to go from mode_a to mode_b
  • ctrl_a stops, and its commands (belonging to mode_a) stop changing over time
  • ctrl_b starts in mode_b

During a few control cycles:

  • hardware is busy switching mode. If it's still in mode_a, it will keep following the last command sent by ctrl_a, which is no longer updating, since ctrl_a is stopped. If the robot is standing still, then no observable change should take place.
  • ctrl_b is writing commands that don't reach hardware, as it hasn't completed the switch to mode_b.

Once hardware switches to mode_b:

  • hardware starts following commands sent by ctrl_b. Again, if the robot was standstill, no observable change should take place.

During switching, the first commands from ctrl_b will get lost, and the last written command of ctrl_a will be used. Do you see this as a show stopper?.

More importantly, how is this worse than what you currently have?. I expect that you have the opposite. You switch modes before controllers, hence start reading from a still stopped ctrl_b while ctrl_a's commands go nowhere.

Hijacking is ugly because I have to run the doSwitch tasks in canSwitch, relying on the fact that nothing can prevent the controllers from starting/stopping within the ControllerManager logic.

I'm not sure I understood the above phrase. But if we accept that controller switching cannot take longer than one control cycle, I see no big issues with offloading most of the work to canSwitch, and have doSwitch only commit the changes. More below...

In addtion canSwitch is const in indigo.

To go over this limitation in indigo, I had to make mutable the data structure that canSwitch prepares for doSwitch. This is why the const qualifier of canSwitch is being dropped in #210 (jade).

A compromise would be to rename canSwitch to prepareSwitch, non-const with boolean return type.

canSwitch already returns bool and is non-const in #210. The prepareSwitch name sounds appropriate. Since mode switching is not a widely used feature, I'm OK with considering renaming the method for jade.

If you insist on #209 (or the jade equivalent) then this is a blocker for all further ros_canopen releases.

I've been holding #209 and #210 because of your heads-up. A fix has to be pushed to indigo and jade, but it can be delayed. I'm all ears for feedback from affected downstream projects like ros_canopen to make the transition as smooth as possible.

I am not aware of other driver implementations that use the switching feature already.
(If there is any, please let us know, your feedback is highly appreciated!)

Our worst switching latency is two 10ms control cycles, and we observe the behavior described in the example above. It's more than acceptable. The code is not open.

@mathias-luedtke
Copy link
Contributor

Thank you for your detailed response!

During switching, the first commands from ctrl_b will get lost, and the last written command of ctrl_a will be used. Do you see this as a show stopper?.

For a single joint this might be acceptable, but it is not acceptable for synchronized multi-joint motions.
The CANopen standard does not list any timing bounds for switching a mode.

You switch modes before controllers, hence start reading from a still stopped ctrl_b while ctrl_a's commands go nowhere.

Yes and no: Commands from ctrl_a go nowhere, but I do not forward ctrl_b's commands (to be more specific: the command from the JointHandle) until all motors have switched their mode. Motion is halted in between. For this to work I have to postpone the start of the controller. (Logic for same mode switch is different)

I see no big issues with offloading most of the work to canSwitch, and have doSwitch only commit the changes.

I guess this is the crucial point. In CANopen committing the change takes some time.
Currently ros_canopen does not enforce to motor to stillstand be before switching (again synchronously for all affected joints), but that's already on my list.
And I want the controller to have a clean start for all it's resources.

Our worst switching latency is two 10ms control cycles

Two cycles is our best case. Typical latency will be 3-4 cycles for our hardware. Other hardware needs some additional state machine transitions in between and therefore 5 or more cycles just for the (synchronized) communication. If the motor needs to be stopped first (with moderate deceleration) the time increases of cause.

canSwitch already returns bool and is non-const in #210. The prepareSwitch name sounds appropriate. Since mode switching is not a widely used feature, I'm OK with considering renaming the method for jade.

Then it's fine for jade, a rename would be nice though.
Why can't we backport it to indigo?
A const canSwitch (marked deprecated) and a non-const version of whatever name that are both called. Or the latter might even call the first per default.
This way the HW interface developers can migrate easily. In addition the non-const implementation will work for indigo and jade.

IMHO breaking the behaviour (non-RT -> RT) is worse than breaking the API (const -> non-const).

@mathias-luedtke
Copy link
Contributor

This issue can be closed now, I guess.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants