Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Syslink requires flow control #703

Open
whoenig opened this issue Feb 22, 2021 · 16 comments
Open

Syslink requires flow control #703

whoenig opened this issue Feb 22, 2021 · 16 comments

Comments

@whoenig
Copy link
Contributor

whoenig commented Feb 22, 2021

The communication between NRF51 and STM32 requires flow control to avoid packet loss. Currently, there is hardware-supported flow control from STM->NRF51 to prevent the STM to send data too quickly, but not the other way around. However, for long-lasting operations on the STM side, packets might be lost. A simple example is using the CRTP Echo port with an echo delay. Since the HW is not prepared for duplex HW flow control, this will require some protocol changes between the NRF51 and STM32.

@whoenig
Copy link
Contributor Author

whoenig commented Jul 16, 2021

@ataffanel Has this been prioritized/planned on? I am currently hitting this problem when implementing the param/log TOC retrieval using crazyflie-link-cpp. The logic is to first send all requests for every single entry of the TOC in the first loop and then receiving all responses in the second loop. Sometimes I am missing responses. The workaround to have it in a single loop (i.e., send a request and then wait for the response) works, but is much less efficient.

From what I remember, one potential workaround was to somehow limit the number of packets that are in flight per CRTP port to be 1? Is there a less restrictive workaround? Also, this should likely be implemented directly in crazyflie-link-cpp then?

@ataffanel
Copy link
Member

I had actually decided, independently of your message, to dedicate my week to implementing Syslink flow control :-). No promises, we are a skeleton crew currently at Bitcraze, but all my non-support/production time will be on this.

I have had the same problem implemening TOC fetch in Rust. My solution is the same as yours, I fetch the element one by one. The few time fetching as a bulk worked, it reduces the fetch time from 2.4 seconds down to ~0.8 seconds. It is interesting to note that when fetching multiple tocs (log and param+param values) in parallel, the runtime is still 2.4 seconds since we are using the link very sparsely so the final win might not be that high. But I am happy that this is a very easily repeatable test case for the flow control.

I would not advise you to start on in-flight limiting right now, it would indeed be a solution but it sounds complex to implement and it will become mostly useless once flow control is implemented. I will update this ticket during the week with my progress.

One thing I am not sure about yet but that might happen is that we might eventually need a new version of Safelink (the radio packet counters). If I remember well, the current implementation of safelink both in the Crazyflie and in the libs does not allow for the Crazyflie to block the up-link even though it could. Without that, the only option is to stop ack-ing packet at radio level which is going to work but is not ideal so we should avoid it in the future.

@whoenig
Copy link
Contributor Author

whoenig commented Jul 19, 2021

Awesome - thanks! Let me know if you need any help/brainstorming. For the C++ side it sometimes works to fetch in bulk, but sometimes it doesn't. My code is at https://github.com/whoenig/crazyflie_cpp/blob/70beb44478c627b4ec7b9e50573e71d61d6c0dfe/src/Crazyflie.cpp#L735-L766 (including an assert that finds the situation in Debug builds, but I can imagine that release builds are even worse.) The timing is exactly the same for me in C++. Using the (optional) cache, protected by CRC and numEntries, brings it down to about 0.2s.

I don't think changing the safelink protocol is a big deal. We can always bump up the protocol version, use the "old" implementation to query the version first (potentially even without any use of safelink), and then switch to the newer version if the protocol number is high enough.

whoenig added a commit to whoenig/crazyflie_cpp that referenced this issue Sep 24, 2021
@whoenig
Copy link
Contributor Author

whoenig commented Sep 24, 2021

This one is pretty important to me. It will allow me to switch to crazyflie-link-cpp as backend for crazyflie_cpp and the Crazyswarm. At the same time, I expect fixing this will make all libraries/clients more robust especially for areas with a bad radio connection. In some places it might also allow us to simplify the library/client code (where we have to apply ugly workarounds otherwise, see discussion above).

@ataffanel
Copy link
Member

I am looking at this next week. I did not manage to unlock enough time during the summer and it ended up being a bit harder than I initially though (as usual ....). Now I have a good idea of what needs to be done, I will update this ticket next week with my progress.

@ataffanel
Copy link
Member

After some great talk with @tobbeanton I have a new easier strategy:

  • The actuall UART link does not seem to be the problem so I will not implement a low level flow control.
  • The problem main is caused by CRTP queues that can fill-up.

So the strategy is to implement a new syslink packet that tell the nRF to stop or restart sending radio packets to the STM. This is higher level flow control assuming the low level UART is fine.

We have asserts in the low level UART code so if those are a problem we will see it. We have some strategy like Non Maskable Interrupt, above FreeRTOS, and DMA that can improve the low level UART handling if needed.

I will make a PR so that you can test it with your use-case @whoenig

@whoenig
Copy link
Contributor Author

whoenig commented Oct 26, 2021

Cool! The main drawback I see is that it might require some tuning on when to send those packets to keep the risk of race conditions very low in practice.

@ataffanel
Copy link
Member

Cool! The main drawback I see is that it might require some tuning on when to send those packets to keep the risk of race conditions very low in practice.

Yes agreed, I plan on keeping a wide margin to account for delay in the stop packet arrival to the nRF.

On possible problem is that since Safelink on the nRF is not securing the uplink, the only way the nRF will have to stop the packet flow will be to not send any Ack back which will also stop the downlink, this is potential for a deadlock. If this is a problem that can be solved by implementing the uplink in Safelink both on the nRF and the Lib (it is an incompatible change so it will need some new packets at connection time).

@whoenig
Copy link
Contributor Author

whoenig commented Oct 26, 2021

it is an incompatible change so it will need some new packets at connection time

I would still prefer having that, as it might reduce lots of headache in the future. We already have to manually enable syslink at connection time. Couldn't we use this as a version rather than a flag, i.e., 0 - disabled, 1 - syslink v1, 2 - syslink v2? Eventually, after a year or so, we can remove support for the old mode entirely.

@whoenig
Copy link
Contributor Author

whoenig commented Jan 13, 2022

Hi @ataffanel: Is there any update on this issue? We are actively developing and using Crazyswarm2 with the native-link backend and do occasional issues that I think are related to this.

@knmcguire
Copy link
Contributor

As you probably noticed by now, no there is no update on this... I don't think @ataffanel has worked on this. Perhaps a good topic for the our next '3rd week' project?

Anyway I'll make a todo for us on our board

@knmcguire
Copy link
Contributor

This will be on hold for a while due to arnauds absence. Perhaps I should use an 'inactive' tag?

@amitx64
Copy link

amitx64 commented Aug 1, 2023

@knmcguire @krichardsson , I did not find TAG side code in repository. Do you know where it is?

@knmcguire
Copy link
Contributor

@amitx64 we don't know what you mean with TAG in the context of this issue.

Please ask any support questions on https://discussions.bitcraze.io/

@whoenig
Copy link
Contributor Author

whoenig commented Oct 12, 2023

Since Arnaud is back, perhaps this could actually be worked on:-)?

@knmcguire
Copy link
Contributor

not yet back :) but next week he'll come back. Next week is an async week and jetlag for him so perhaps the week after?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants