Fixes for concurrency bugs in start/stop operations #1029

martinling · 2021-12-31T20:25:47Z

Fixes bug #916.

Previously, there was a race which could lead to a transfer being left active after cancel_transfers() completed. This would then cause the next prepare_transfers() call to fail, because libusb_submit_transfer() would return an error due to the transfer already being in use.

The sequence of events that could cause this was:

1. Main thread calls hackrf_stop_rx(), which calls cancel_transfers(), which iterates through the 4 transfers in use and cancels them one by one with libusb_cancel_transfer().
2. During this time, a transfer is completed. The transfer thread calls hackrf_libusb_transfer_callback(), which handles the data and then calls libusb_submit_transfer() to resubmit that transfer.
3. Now, cancel_transfers() and hackrf_stop_rx() are completed but one transfer is still active.
4. The next hackrf_start_rx() call fails, because prepare_transfers() tries to submit a transfer which is already in use.

To fix this, we add a lock which must be held to either cancel transfers or restart them. This ensures that only one of these actions can happen for a given transfer; it's no longer possible for a transfer to be cancelled and then immediately restarted.

With this change, I can now run the test program from #916 without failures.
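
A minimal sketch of the locking idea described above, not the actual libhackrf code. The `struct device`, the `streaming` flag and the `active[]` array are hypothetical stand-ins; where the comments say so, real code would call libusb_cancel_transfer() or libusb_submit_transfer().

```c
#include <pthread.h>
#include <stdbool.h>

#define TRANSFER_COUNT 4

/* Hypothetical stand-in for the device state; not libhackrf's real struct. */
struct device {
    pthread_mutex_t transfer_lock;  /* held to cancel or to resubmit */
    bool streaming;                 /* assumed flag, cleared when stopping */
    bool active[TRANSFER_COUNT];    /* stand-in for "submitted to libusb" */
};

void device_init(struct device *dev)
{
    pthread_mutex_init(&dev->transfer_lock, NULL);
    dev->streaming = true;
    for (int i = 0; i < TRANSFER_COUNT; i++) {
        dev->active[i] = true;
    }
}

/* Main thread: cancel every transfer while holding the lock. */
void cancel_all(struct device *dev)
{
    pthread_mutex_lock(&dev->transfer_lock);
    dev->streaming = false;
    for (int i = 0; i < TRANSFER_COUNT; i++) {
        dev->active[i] = false;  /* real code: libusb_cancel_transfer() */
    }
    pthread_mutex_unlock(&dev->transfer_lock);
}

/* Transfer thread: called when transfer i completes.
 * Returns true if the transfer was resubmitted. */
bool on_transfer_complete(struct device *dev, int i)
{
    bool resubmitted = false;
    pthread_mutex_lock(&dev->transfer_lock);
    dev->active[i] = false;
    if (dev->streaming) {
        dev->active[i] = true;   /* real code: libusb_submit_transfer() */
        resubmitted = true;
    }
    pthread_mutex_unlock(&dev->transfer_lock);
    return resubmitted;
}

/* Number of transfers currently marked active. */
int active_count(struct device *dev)
{
    int n = 0;
    pthread_mutex_lock(&dev->transfer_lock);
    for (int i = 0; i < TRANSFER_COUNT; i++) {
        n += dev->active[i] ? 1 : 0;
    }
    pthread_mutex_unlock(&dev->transfer_lock);
    return n;
}
```

Because the callback takes the same lock as cancel_all(), a completion that arrives during cancellation either runs entirely before it (and its resubmitted transfer is then cancelled) or entirely after it (and sees that streaming has stopped).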

martinling · 2022-01-10T23:44:57Z

I've added a second commit to fix an orthogonal but closely related issue.

After cancelling transfers, we need to wait for all cancellation handling to complete before trying to do anything new with the same transfers.

Calling libusb_cancel_transfer() only starts the cancellation of a transfer. The process is not complete until the transfer callback has been called with status LIBUSB_TRANSFER_CANCELLED.

If hackrf_start_rx() is called soon after hackrf_stop_rx(), prepare_transfers() may be called before the previous cancellations are completed, resulting in a LIBUSB_ERROR_BUSY when a transfer is reused with libusb_submit_transfer().

To prevent this happening, I've made the transfer thread keep track of which transfers have finished (either by completion, or cancellation), and made cancel_transfers() wait until all transfers are finished before returning.

This is implemented using a pthread condition variable which is signalled from the transfer thread and waited on by cancel_transfers().

With this change, I can no longer reproduce the failure to restart RX seen in #883.
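
The wait on a condition variable described above could be sketched roughly as follows. This is an illustration only: `struct transfer_tracker` and the `tracker_*` names are hypothetical and do not come from libhackrf.

```c
#include <pthread.h>
#include <stdbool.h>

#define TRANSFER_COUNT 4

/* Hypothetical bookkeeping for which transfers have finished. */
struct transfer_tracker {
    pthread_mutex_t lock;
    pthread_cond_t all_finished;
    bool finished[TRANSFER_COUNT];
};

void tracker_init(struct transfer_tracker *t)
{
    pthread_mutex_init(&t->lock, NULL);
    pthread_cond_init(&t->all_finished, NULL);
    for (int i = 0; i < TRANSFER_COUNT; i++) {
        t->finished[i] = false;
    }
}

/* Caller must hold t->lock. */
static int pending_locked(const struct transfer_tracker *t)
{
    int n = 0;
    for (int i = 0; i < TRANSFER_COUNT; i++) {
        n += t->finished[i] ? 0 : 1;
    }
    return n;
}

/* Transfer thread: called from the callback when transfer i has
 * completed or been cancelled and will not be resubmitted. */
void tracker_mark_finished(struct transfer_tracker *t, int i)
{
    pthread_mutex_lock(&t->lock);
    t->finished[i] = true;
    if (pending_locked(t) == 0) {
        pthread_cond_broadcast(&t->all_finished);
    }
    pthread_mutex_unlock(&t->lock);
}

/* Main thread: block until every transfer has finished. The loop
 * guards against spurious wakeups. */
void tracker_wait_all_finished(struct transfer_tracker *t)
{
    pthread_mutex_lock(&t->lock);
    while (pending_locked(t) != 0) {
        pthread_cond_wait(&t->all_finished, &t->lock);
    }
    pthread_mutex_unlock(&t->lock);
}

/* Number of transfers that have not finished yet. */
int tracker_pending(struct transfer_tracker *t)
{
    pthread_mutex_lock(&t->lock);
    int n = pending_locked(t);
    pthread_mutex_unlock(&t->lock);
    return n;
}
```

In this sketch, the cancelling side would call tracker_wait_all_finished() after requesting cancellation of each transfer, so that it only returns once every callback has run.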

martinling · 2022-01-20T15:43:28Z

There's a recent thread on the osmocom-sdr list from Jasper van den Eshof, who has been fixing more or less exactly the same issue in librtlsdr.

Looking at the librtlsdr code suggests a simpler approach that could be used to orchestrate the cancellations, which would eliminate the need for the locking introduced in b346790.

1. In hackrf_stop_{rx|tx}(), set a flag to request cancellation, then call libusb_interrupt_event_handler().
2. In transfer_threadproc, use libusb_handle_events_timeout_completed() instead of libusb_handle_events_timeout(), and pass the flag set in hackrf_stop_{rx|tx} as the completed argument. This function will then return when requested.
3. Once libusb_handle_events_timeout_completed() has completed, do the cancel_transfers() work in the transfer thread. The mutex between cancel_transfers() and hackrf_libusb_transfer_callback() would then not be necessary.
4. After cancelling transfers, loop on libusb_handle_events_timeout_completed() until all cancellation callbacks have completed. Then signal hackrf_stop_{rx|tx} as before.

martinling · 2022-01-23T23:36:35Z

Another useful observation from the discussion with Jasper - the individual transfer_finished flags, added in my second commit, could be more simply replaced with a count of active transfers, which is incremented on submission and decremented on completion or cancellation.

I'll work on making both improvements.
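
For illustration, the active-transfer count mentioned above might look something like this. The `struct transfer_counter` type and the `counter_*` names are hypothetical, chosen only for this sketch.

```c
#include <pthread.h>

/* Hypothetical replacement for per-transfer "finished" flags:
 * a single count of transfers currently in flight. */
struct transfer_counter {
    pthread_mutex_t lock;
    pthread_cond_t idle;   /* signalled when the count reaches zero */
    int active;
};

void counter_init(struct transfer_counter *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->idle, NULL);
    c->active = 0;
}

/* Call after a transfer has been submitted successfully. */
void counter_on_submit(struct transfer_counter *c)
{
    pthread_mutex_lock(&c->lock);
    c->active++;
    pthread_mutex_unlock(&c->lock);
}

/* Call when a transfer completes or is cancelled and is not resubmitted. */
void counter_on_finish(struct transfer_counter *c)
{
    pthread_mutex_lock(&c->lock);
    c->active--;
    if (c->active == 0) {
        pthread_cond_broadcast(&c->idle);
    }
    pthread_mutex_unlock(&c->lock);
}

/* Block until no transfers remain in flight. */
void counter_wait_idle(struct transfer_counter *c)
{
    pthread_mutex_lock(&c->lock);
    while (c->active != 0) {
        pthread_cond_wait(&c->idle, &c->lock);
    }
    pthread_mutex_unlock(&c->lock);
}

/* Current number of transfers in flight. */
int counter_active(struct transfer_counter *c)
{
    pthread_mutex_lock(&c->lock);
    int n = c->active;
    pthread_mutex_unlock(&c->lock);
    return n;
}
```

Compared with one flag per transfer, the waiter only has to test a single integer, and no flags need resetting before the next start.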

This fixes bug greatscottgadgets#1042, which occurred when an RX->OFF->RX sequence happened quickly enough that the loop in rx_mode() did not see the change. As a result, the enable_baseband_streaming() call at the start of that function was not repeated for the new RX operation, so RX progress stalled.

To solve this, the vendor request handler now increments a sequence number when it changes the transceiver mode. Instead of the RX loop checking whether the transceiver mode is still RX, it now checks whether the current sequence number is the same as when it was started. If not, there must have been at least one mode change, so the loop exits, and the main loop starts the necessary loop for the new mode. The same behaviour is implemented for the TX and sweep loops.

For this approach to be reliable, we must ensure that when deciding which mode and sequence number to use, we take both values from the same set_transceiver_mode request. To achieve this, we briefly disable the USB0 interrupt to stop the vendor request handler from running whilst reading the mode and sequence number together. Then the loop dispatch proceeds using those pre-read values.
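
A rough, host-runnable sketch of the sequence number scheme, under stated assumptions: all names here are hypothetical, and the interrupt masking functions are empty stand-ins for whatever the firmware actually does to disable and re-enable the USB0 interrupt.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { MODE_OFF, MODE_RX, MODE_TX } transceiver_mode_t;

/* Written by the (simulated) vendor request handler in interrupt
 * context, read by the main loop. */
static volatile transceiver_mode_t requested_mode = MODE_OFF;
static volatile uint32_t mode_seq = 0;

/* Empty stand-ins for masking the USB0 interrupt on real hardware. */
static void usb_irq_disable(void) {}
static void usb_irq_enable(void) {}

/* Interrupt context: record the new mode and bump the sequence number. */
void on_set_transceiver_mode(transceiver_mode_t mode)
{
    requested_mode = mode;
    mode_seq++;
}

struct mode_snapshot {
    transceiver_mode_t mode;
    uint32_t seq;
};

/* Main loop: read mode and sequence number together, with the
 * interrupt masked so both come from the same request. */
struct mode_snapshot read_mode_snapshot(void)
{
    struct mode_snapshot s;
    usb_irq_disable();
    s.mode = requested_mode;
    s.seq = mode_seq;
    usb_irq_enable();
    return s;
}

/* Loop condition for a mode loop started with sequence number start_seq. */
bool mode_unchanged(uint32_t start_seq)
{
    return mode_seq == start_seq;
}
```

With this check, an RX->OFF->RX sequence leaves the mode equal to RX but advances the sequence number twice, so a loop started before the sequence would still notice the change and exit.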

martinling · 2022-02-03T08:30:49Z

I've added a third commit which fixes the firmware-side issue #1042.

With the combination of all three changes on this PR, I can no longer reproduce any problems with repeated start/stop of HackRF.

I plan to do some further work to simplify the solutions, but for anyone having problems this branch should work as-is.

gozu42 · 2022-02-03T12:41:11Z

can confirm no longer seeing any restart-hangs with 7057235 on pc+raspi with test app from #916 or gqrx ctrl-d stresstest.

martinling · 2022-02-07T15:23:39Z

Since this PR has been tested & confirmed to fix #883, #916 and #1042, I propose we get it merged as-is without further changes, and we can look at making the simplifications discussed in this comment and this comment in a separate PR later.

mossmann · 2022-02-08T04:28:39Z

Outstanding troubleshooting and solutions!

This is a defensive change to make the transceiver code easier to reason about, and to avoid the possibility of races such as that seen in greatscottgadgets#1042.

Previously, set_transceiver_mode() was called in the vendor request handler for the SET_TRANSCEIVER_MODE request, as well as in the callback for a USB configuration change. Both these calls are made from the USB0 ISR, so could interrupt the rx_mode(), tx_mode() and sweep_mode() functions at any point. It was hard to tell if this was safe.

Instead, set_transceiver_mode() has been removed, and its work is split into three parts:

- request_transceiver_mode(), which is safe to call from ISR context. All this function does is update the requested mode and increment a sequence number. This builds on work already done in PR greatscottgadgets#1029, but the interface has been simplified to use a shared volatile structure.
- transceiver_startup(), which transitions the transceiver from an idle state to the configuration required for a specific mode, including setting up the RF path, configuring the M0, adjusting LEDs and UI etc.
- transceiver_shutdown(), which transitions the transceiver back to an idle state.

The *_mode() functions that implement the transceiver modes now call transceiver_startup() before starting work, and transceiver_shutdown() before returning, and all this happens in the main thread of execution. As such, it is now guaranteed that all the steps involved happen in a consistent order, with the transceiver starting from an idle state, and being returned to an idle state before control returns to the main loop.

For consistency of interface, an off_mode() function has been added to implement the behaviour of the OFF transceiver mode. Since the transceiver is already guaranteed to be in an idle state when this is called, the only work required is to set the UI mode and wait for a new mode request.

These were added in greatscottgadgets#805, as a workaround to prevent their parent functions from returning before transfer cancellations had completed. This has since been fixed properly in greatscottgadgets#1029.

martinling · 2022-03-20T11:23:57Z

The simplification discussed in this comment is now in PR #1071.

I had a go at the other proposed change outlined in this comment, but decided not to go ahead with it. Although it would eliminate some locking, it would just add different complications instead. It would detract from the simplicity of the transfer thread, which is currently just a trivial libusb event handling loop.

martinling requested a review from mossmann December 31, 2021 20:51

martinling added the bug label Dec 31, 2021

martinling linked an issue Dec 31, 2021 that may be closed by this pull request

Repeated Start/Stop RX Produces Error #916

Closed

martinling mentioned this pull request Dec 31, 2021

Freeze in Gqrx with hackRF 2021-03-1 driver version #883

Closed

gozu42 mentioned this pull request Jan 2, 2022

Repeated Start/Stop RX Produces Error #916

Closed

martinling removed a link to an issue Jan 10, 2022

Repeated Start/Stop RX Produces Error #916

Closed

martinling changed the title ~~Use a lock to prevent transfers being restarted during cancellation.~~ Fixes for concurrency bugs in libhackrf start/stop operations Jan 10, 2022

martinling force-pushed the bug-916 branch from 38ecfa1 to 9182f1c Compare January 11, 2022 00:23

martinling mentioned this pull request Jan 29, 2022

Firmware USB streaming can stall after the first 16K USB transfer following RX restart #1042

Closed

This was linked to issues Jan 29, 2022

Freeze in Gqrx with hackRF 2021-03-1 driver version #883

Closed

Repeated Start/Stop RX Produces Error #916

Closed

martinling added 2 commits February 3, 2022 06:44

martinling force-pushed the bug-916 branch from 2f5428d to bb21675 Compare February 3, 2022 07:14

martinling linked an issue Feb 3, 2022 that may be closed by this pull request

Firmware USB streaming can stall after the first 16K USB transfer following RX restart #1042

Closed

martinling force-pushed the bug-916 branch from bb21675 to 7057235 Compare February 3, 2022 07:59

martinling changed the title ~~Fixes for concurrency bugs in libhackrf start/stop operations~~ Fixes for concurrency bugs in start/stop operations Feb 3, 2022

Potomac mentioned this pull request Feb 4, 2022

GQRX freezes with hackRF 2021.03.1 driver gqrx-sdr/gqrx#959

Closed

mossmann merged commit 2821cdc into greatscottgadgets:master Feb 8, 2022

martinling mentioned this pull request Feb 8, 2022

Move transceiver mode changes out of USB ISR. #1045

Merged

This was referenced Feb 13, 2022

Firmware sample buffer management overhaul, including safe handling of TX underruns #982

Merged

possible hardware defect #1048

Closed

martinling mentioned this pull request Feb 24, 2022

hackrf_sweep sometimes doesn't output result #1052

Closed

This was referenced Mar 18, 2022

libhackrf hangs if HackRF is disconnected #1068

Closed

Overhaul handling of transfer errors and use of streaming flag. #1069

Merged

HackRF freeze in transmission (TX) with computers in USB3 and not with one with USB2 #500

Closed

This was referenced Mar 18, 2022

Remove unnecessary delays on stop, and duplicated stop commands on close. #1070

Merged

Simplify logic for tracking finished transfers #1071

Merged

martinling deleted the bug-916 branch March 20, 2022 11:24

jvde-github mentioned this pull request May 31, 2022

fix hang issue when device is disconnected airspy/airspyhf#33

Merged

martinling mentioned this pull request Nov 16, 2022

[Question]: Looping hackrf_sweep x times #1226

Closed
