Fixes for concurrency bugs in start/stop operations #1029
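I've added a second commit to fix an orthogonal but closely related issue. After cancelling transfers, we need to wait for all cancellation handling to complete before trying to do anything new with the same transfers.

Calling libusb_cancel_transfer() only starts the cancellation of a transfer. The process is not complete until the transfer callback has been called with status LIBUSB_TRANSFER_CANCELLED.

If hackrf_start_rx() is called soon after hackrf_stop_rx(), prepare_transfers() may be called before the previous cancellations are completed, resulting in a LIBUSB_ERROR_BUSY when a transfer is reused with libusb_submit_transfer().

To prevent this happening, I've made the transfer thread keep track of which transfers have finished (either by completion, or cancellation), and made cancel_transfers() wait until all transfers are finished.

This is implemented using a pthread condition variable which is signalled from the transfer thread and waited on by cancel_transfers().

With this change, I can no longer reproduce the failure to restart RX seen in #883.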
There's a recent thread on the osmocom-sdr list from Jasper van den Eshof, who has been fixing more or less exactly the same issue in librtlsdr. Looking at the librtlsdr code suggests a simpler approach that could be used to orchestrate the cancellations, which would eliminate the need for the locking introduced in b346790.
Another useful observation from the discussion with Jasper - the individual I'll work on making both improvements.
Fixes bug greatscottgadgets#916.

Previously, there was a race which could lead to a transfer being left active after cancel_transfers() completed. This would then cause the next prepare_transfers() call to fail, because libusb_submit_transfer() would return an error due to the transfer already being in use.

The sequence of events that could cause this was:

1. Main thread calls hackrf_stop_rx(), which calls cancel_transfers(), which iterates through the 4 transfers in use and cancels them one by one with libusb_cancel_transfer().
2. During this time, a transfer is completed. The transfer thread calls hackrf_libusb_transfer_callback(), which handles the data and then calls libusb_submit_transfer() to resubmit that transfer.
3. Now, cancel_transfers() and hackrf_stop_rx() are completed but one transfer is still active.
4. The next hackrf_start_rx() call fails, because prepare_transfers() tries to submit a transfer which is already in use.

To fix this, we add a lock which must be held to either cancel transfers or restart them. This ensures that only one of these actions can happen for a given transfer; it's no longer possible for a transfer to be cancelled and then immediately restarted.
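The locking idea can be sketched roughly as below. This is a simplified model, not the libhackrf code: `fake_transfer`, `device`, `device_init()`, `on_transfer_complete()` and `cancel_all()` are hypothetical names, and setting or clearing an `active` flag stands in for libusb_submit_transfer() and libusb_cancel_transfer(). The `streaming` flag is also an assumption about how "stop was requested" might be recorded. The point illustrated is only that resubmission and cancellation take the same mutex, so a transfer cannot be resubmitted once cancellation has begun.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define TRANSFER_COUNT 4

/* Hypothetical stand-in for a libusb transfer. */
typedef struct {
    bool active; /* true while the transfer is submitted */
} fake_transfer;

typedef struct {
    fake_transfer transfers[TRANSFER_COUNT];
    pthread_mutex_t transfer_lock; /* held to cancel or to resubmit */
    bool streaming;                /* assumed flag: false once stop is requested */
} device;

void device_init(device* dev)
{
    pthread_mutex_init(&dev->transfer_lock, NULL);
    dev->streaming = true;
    for (int i = 0; i < TRANSFER_COUNT; i++) {
        dev->transfers[i].active = true;
    }
}

/* Transfer thread: called when transfer `index` completes.
 * Returns true if the transfer was resubmitted. */
bool on_transfer_complete(device* dev, int index)
{
    assert(index >= 0 && index < TRANSFER_COUNT);
    bool resubmitted = false;

    pthread_mutex_lock(&dev->transfer_lock);
    dev->transfers[index].active = false; /* the transfer has completed */
    if (dev->streaming) {
        /* Stands in for libusb_submit_transfer(). */
        dev->transfers[index].active = true;
        resubmitted = true;
    }
    pthread_mutex_unlock(&dev->transfer_lock);
    return resubmitted;
}

/* Main thread: cancel every transfer. Because the lock is held for the
 * whole loop, no callback can resubmit a transfer part-way through.
 * Real cancellation is asynchronous; clearing the flag immediately is a
 * simplification made for this sketch. */
void cancel_all(device* dev)
{
    pthread_mutex_lock(&dev->transfer_lock);
    dev->streaming = false;
    for (int i = 0; i < TRANSFER_COUNT; i++) {
        /* Stands in for libusb_cancel_transfer(). */
        dev->transfers[i].active = false;
    }
    pthread_mutex_unlock(&dev->transfer_lock);
}
```

In this model, a completion that arrives after cancel_all() finds `streaming` false and leaves the transfer idle, which is the outcome step 3 of the sequence above failed to guarantee.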
Calling libusb_cancel_transfer() only starts the cancellation of a transfer. The process is not complete until the transfer callback has been called with status LIBUSB_TRANSFER_CANCELLED.

If hackrf_start_rx() is called soon after hackrf_stop_rx(), prepare_transfers() may be called before the previous cancellations are completed, resulting in a LIBUSB_ERROR_BUSY when a transfer is reused with libusb_submit_transfer().

To prevent this happening, we keep track of which transfers have finished (either by completion, or cancellation), and make cancel_transfers() wait until all transfers are finished. This is implemented using a pthread condition variable which is signalled from the transfer thread.
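A minimal sketch of that wait, assuming a simple counter of unfinished transfers guarded by a mutex. The names `transfer_tracker`, `tracker_init()`, `tracker_transfer_finished()`, `tracker_wait_all_finished()` and `simulate_transfer_thread()` are hypothetical and do not come from libhackrf; only the pthread calls are real. The simulated thread stands in for the libusb event loop delivering one final callback per transfer.

```c
#include <assert.h>
#include <pthread.h>

#define TRANSFER_COUNT 4

/* Hypothetical bookkeeping for transfers that have not yet finished. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t all_finished;
    int active_transfers;
} transfer_tracker;

void tracker_init(transfer_tracker* t, int active)
{
    pthread_mutex_init(&t->lock, NULL);
    pthread_cond_init(&t->all_finished, NULL);
    t->active_transfers = active;
}

/* Transfer thread: called when a transfer has completed or been
 * cancelled and will not be resubmitted. */
void tracker_transfer_finished(transfer_tracker* t)
{
    pthread_mutex_lock(&t->lock);
    assert(t->active_transfers > 0);
    t->active_transfers--;
    if (t->active_transfers == 0) {
        pthread_cond_broadcast(&t->all_finished);
    }
    pthread_mutex_unlock(&t->lock);
}

/* Main thread: block until every transfer has finished. The loop
 * re-checks the counter to cope with spurious wakeups. */
void tracker_wait_all_finished(transfer_tracker* t)
{
    pthread_mutex_lock(&t->lock);
    while (t->active_transfers > 0) {
        pthread_cond_wait(&t->all_finished, &t->lock);
    }
    pthread_mutex_unlock(&t->lock);
}

int tracker_active(transfer_tracker* t)
{
    pthread_mutex_lock(&t->lock);
    int n = t->active_transfers;
    pthread_mutex_unlock(&t->lock);
    return n;
}

/* Simulated transfer thread: reports each transfer as finished, as the
 * callbacks with status LIBUSB_TRANSFER_CANCELLED would in practice. */
void* simulate_transfer_thread(void* arg)
{
    transfer_tracker* t = arg;
    for (int i = 0; i < TRANSFER_COUNT; i++) {
        tracker_transfer_finished(t);
    }
    return NULL;
}
```

With this shape, a cancel routine would request cancellation of each transfer and then call tracker_wait_all_finished() before returning, so a following start cannot reuse a transfer that is still in flight.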
This fixes bug greatscottgadgets#1042, which occurred when an RX->OFF->RX sequence happened quickly enough that the loop in rx_mode() did not see the change. As a result, the enable_baseband_streaming() call at the start of that function was not repeated for the new RX operation, so RX progress stalled.

To solve this, the vendor request handler now increments a sequence number when it changes the transceiver mode. Instead of the RX loop checking whether the transceiver mode is still RX, it now checks whether the current sequence number is the same as when it was started. If not, there must have been at least one mode change, so the loop exits, and the main loop starts the necessary loop for the new mode. The same behaviour is implemented for the TX and sweep loops.

For this approach to be reliable, we must ensure that when deciding which mode and sequence number to use, we take both values from the same set_transceiver_mode request. To achieve this, we briefly disable the USB0 interrupt to stop the vendor request handler from running whilst reading the mode and sequence number together. Then the loop dispatch proceeds using those pre-read values.
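The sequence-number check can be illustrated with a small host-side model. Everything named here is an assumption made for the sketch: `mode_request_t`, `request_mode()`, `snapshot_request()`, `mode_still_current()` and the empty `usb0_irq_disable()` / `usb0_irq_enable()` stubs are hypothetical, and real firmware would mask the USB0 interrupt where the stubs are called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { MODE_OFF, MODE_RX, MODE_TX } transceiver_mode_t;

typedef struct {
    transceiver_mode_t mode;
    uint32_t seq; /* incremented on every mode request */
} mode_request_t;

/* Written from the (simulated) ISR, read from the main loop. */
static volatile mode_request_t requested = { MODE_OFF, 0 };

/* Hypothetical stand-ins for masking and unmasking the USB0 interrupt. */
static void usb0_irq_disable(void) {}
static void usb0_irq_enable(void) {}

/* ISR context: record the new mode and bump the sequence number. */
void request_mode(transceiver_mode_t mode)
{
    requested.mode = mode;
    requested.seq = requested.seq + 1;
}

/* Main loop: read mode and sequence number as one consistent pair. */
mode_request_t snapshot_request(void)
{
    mode_request_t snap;
    usb0_irq_disable();
    snap.mode = requested.mode;
    snap.seq = requested.seq;
    usb0_irq_enable();
    return snap;
}

/* Loop condition: true until any further mode request has arrived. */
bool mode_still_current(const mode_request_t* snap)
{
    return requested.seq == snap->seq;
}
```

A quick RX->OFF->RX sequence leaves the mode equal to RX but advances the sequence number twice, so a loop testing mode_still_current() exits where a loop testing only the mode would not.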
I've added a third commit which fixes the firmware-side issue #1042. With the combination of all three changes on this PR, I can no longer reproduce any problems with repeated start/stop of HackRF. I plan to do some further work to simplify the solutions, but for anyone having problems this branch should work as-is.
Since this PR has been tested & confirmed to fix #883, #916 and #1042, I propose we get it merged as-is without further changes, and we can look at making the simplifications discussed in this comment and this comment in a separate PR later.
Outstanding troubleshooting and solutions!
This is a defensive change to make the transceiver code easier to reason about, and to avoid the possibility of races such as that seen in greatscottgadgets#1042.

Previously, set_transceiver_mode() was called in the vendor request handler for the SET_TRANSCEIVER_MODE request, as well as in the callback for a USB configuration change. Both these calls are made from the USB0 ISR, so could interrupt the rx_mode(), tx_mode() and sweep_mode() functions at any point. It was hard to tell if this was safe.

Instead, set_transceiver_mode() has been removed, and its work is split into three parts:

- request_transceiver_mode(), which is safe to call from ISR context. All this function does is update the requested mode and increment a sequence number. This builds on work already done in PR greatscottgadgets#1029, but the interface has been simplified to use a shared volatile structure.
- transceiver_startup(), which transitions the transceiver from an idle state to the configuration required for a specific mode, including setting up the RF path, configuring the M0, adjusting LEDs and UI etc.
- transceiver_shutdown(), which transitions the transceiver back to an idle state.

The *_mode() functions that implement the transceiver modes now call transceiver_startup() before starting work, and transceiver_shutdown() before returning, and all this happens in the main thread of execution. As such, it is now guaranteed that all the steps involved happen in a consistent order, with the transceiver starting from an idle state, and being returned to an idle state before control returns to the main loop.

For consistency of interface, an off_mode() function has been added to implement the behaviour of the OFF transceiver mode. Since the transceiver is already guaranteed to be in an idle state when this is called, the only work required is to set the UI mode and wait for a new mode request.
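The startup/shutdown bracketing might look something like the model below. The function names request_transceiver_mode(), transceiver_startup(), transceiver_shutdown() and rx_mode() are taken from the description above, but their signatures, the `xcvr_mode` and `xcvr_state` types, the `work` callback parameter and `simulated_work()` are all assumptions made for this sketch, which runs on a host rather than on the firmware.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { XCVR_OFF, XCVR_RX } xcvr_mode;
typedef enum { STATE_IDLE, STATE_CONFIGURED } xcvr_state;

/* Shared volatile structure: written in ISR context, read in main. */
static volatile struct {
    xcvr_mode mode;
    uint32_t seq;
} request = { XCVR_OFF, 0 };

static xcvr_state state = STATE_IDLE;

/* ISR-safe: only records the request, touches no hardware. */
void request_transceiver_mode(xcvr_mode mode)
{
    request.mode = mode;
    request.seq = request.seq + 1;
}

/* Main context: idle -> configured for `mode`. */
static void transceiver_startup(xcvr_mode mode)
{
    assert(state == STATE_IDLE); /* always starts from idle */
    (void) mode;                 /* real code would set up the RF path etc. */
    state = STATE_CONFIGURED;
}

/* Main context: back to idle. */
static void transceiver_shutdown(void)
{
    state = STATE_IDLE;
}

/* Runs until a new mode request changes the sequence number.
 * `work` stands in for one iteration of streaming work.
 * Returns the number of iterations performed. */
int rx_mode(uint32_t seq, void (*work)(void))
{
    int iterations = 0;
    transceiver_startup(XCVR_RX);
    while (request.seq == seq) {
        work();
        iterations++;
    }
    transceiver_shutdown();
    return iterations;
}

/* Example work function: on its third call it simulates the ISR
 * receiving a request to switch the transceiver off. */
static int work_calls = 0;
static void simulated_work(void)
{
    assert(state == STATE_CONFIGURED);
    if (++work_calls == 3) {
        request_transceiver_mode(XCVR_OFF);
    }
}
```

Because all hardware state changes happen inside rx_mode(), the simulated interrupt can arrive at any iteration and the transceiver is still idle when control returns to the caller.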
These were added in greatscottgadgets#805, as a workaround to prevent their parent functions from returning before transfer cancellations had completed. This has since been fixed properly in greatscottgadgets#1029.
The simplification discussed in this comment is now in PR #1071. I had a go at the other proposed change outlined in this comment, but decided not to go ahead with it. Although it would eliminate some locking, it would just add different complications instead. It would detract from the simplicity of the transfer thread, which is currently just a trivial libusb event handling loop.
Fixes bug #916.

Previously, there was a race which could lead to a transfer being left active after cancel_transfers() completed. This would then cause the next prepare_transfers() call to fail, because libusb_submit_transfer() would return an error due to the transfer already being in use.

The sequence of events that could cause this was:

1. Main thread calls hackrf_stop_rx(), which calls cancel_transfers(), which iterates through the 4 transfers in use and cancels them one by one with libusb_cancel_transfer().
2. During this time, a transfer is completed. The transfer thread calls hackrf_libusb_transfer_callback(), which handles the data and then calls libusb_submit_transfer() to resubmit that transfer.
3. Now, cancel_transfers() and hackrf_stop_rx() are completed but one transfer is still active.
4. The next hackrf_start_rx() call fails, because prepare_transfers() tries to submit a transfer which is already in use.

To fix this, we add a lock which must be held to either cancel transfers or restart them. This ensures that only one of these actions can happen for a given transfer; it's no longer possible for a transfer to be cancelled and then immediately restarted.

With this change, I can now run the test program from #916 without failures.