Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes for concurrency bugs in start/stop operations #1029

Merged
merged 3 commits into from
Feb 8, 2022

Conversation

martinling
Copy link
Member

@martinling martinling commented Dec 31, 2021

Fixes bug #916.

Previously, there was a race which could lead to a transfer being left active after cancel_transfers() completed. This would then cause the next prepare_transfers() call to fail, because libusb_submit_transfer() would return an error due to the transfer already being in use.

The sequence of events that could cause this was:

  1. Main thread calls hackrf_stop_rx(), which calls cancel_transfers(), which iterates through the 4 transfers in use and cancels them one by one with libusb_cancel_transfer().
  2. During this time, a transfer is completed. The transfer thread calls hackrf_libusb_transfer_callback(), which handles the data and then calls libusb_submit_transfer() to resubmit that transfer.
  3. Now, cancel_transfers() and hackrf_stop_rx() are completed but one transfer is still active.
  4. The next hackrf_start_rx() call fails, because prepare_transfers() tries to submit a transfer which is already in use.

To fix this, we add a lock which must be held to either cancel transfers or restart them. This ensures that only one of these actions can happen for a given transfer; it's no longer possible for a transfer to be cancelled and then immediately restarted.

With this change, I can now run the test program from #916 without failures.

@martinling martinling requested a review from mossmann December 31, 2021 20:51
@martinling martinling added the bug label Dec 31, 2021
@martinling martinling linked an issue Dec 31, 2021 that may be closed by this pull request
@martinling martinling changed the title Use a lock to prevent transfers being restarted during cancellation. Fixes for concurrency bugs in libhackrf start/stop operations Jan 10, 2022
@martinling
Copy link
Member Author

I've added a second commit to fix an orthogonal but closely related issue.

After cancelling transfers, we need to wait for all cancellation handling to complete before trying to do anything new with the same transfers.

Calling libusb_cancel_transfer() only starts the cancellation of a transfer. The process is not complete until the transfer callback has been called with status LIBUSB_TRANSFER_CANCELLED.

If hackrf_start_rx() is called soon after hackrf_stop_rx(), prepare_transfers() may be called before the previous cancellations are completed, resulting in a LIBUSB_ERROR_BUSY when a transfer is reused with libusb_submit_transfer().

To prevent this happening, I've made the transfer thread keep track of which transfers have finished (either by completion, or cancellation), and made cancel_transfers() wait until all transfers are finished before returning.

This is implemented using a pthread condition variable which is signalled from the transfer thread and waited on by cancel_transfers().

With this change, I can no longer reproduce the failure to restart RX seen in #883.

@martinling
Copy link
Member Author

There's a recent thread on the osmocom-sdr list from Jasper van den Eshof, who has been fixing more or less exactly the same issue in librtlsdr.

Looking at the librtlsdr code suggests a simpler approach that could be used to orchestrate the cancellations, which would eliminate the need for the locking introduced in b346790.

  • In hackrf_stop_{rx|tx}(), set a flag to request cancellation, then call libusb_interrupt_event_handler().
  • In transfer_threadproc, use libusb_handle_events_timeout_completed() instead of libusb_handle_events_timeout(), and pass the flag set in hackrf_stop_{rx|tx} as the completed argument. This function will then return when requested.
  • Once libusb_handle_events_timeout_completed() has completed, do the cancel_transfers() work in the transfer thread. The mutex between cancel_transfers() and hackrf_libusb_transfer_callback() would then not be necessary.
  • After cancelling transfers, loop on libusb_handle_events_timeout_completed() until all cancellation callbacks have completed. Then signal hackrf_stop_{rx|tx} as before.

@martinling
Copy link
Member Author

Another useful observation from the discussion with Jasper - the individual transfer_finished flags, added in my second commit, could be more simply replaced with a count of active transfers, which is incremented on submission and decremented on completion or cancellation.

I'll work on making both improvements.

Fixes bug greatscottgadgets#916.

Previously, there was a race which could lead to a transfer being left
active after cancel_transfers() completed. This would then cause the
next prepare_transfers() call to fail, because libusb_submit_transfer()
would return an error due to the transfer already being in use.

The sequence of events that could cause this was:

1. Main thread calls hackrf_stop_rx(), which calls cancel_transfers(),
   which iterates through the 4 transfers in use and cancels them one
   by one with libusb_cancel_transfer().

2. During this time, a transfer is completed. The transfer thread calls
   hackrf_libusb_transfer_callback(), which handles the data and then
   calls libusb_submit_transfer() to resubmit that transfer.

3. Now, cancel_transfers() and hackrf_stop_rx() are completed but one
   transfer is still active.

4. The next hackrf_start_rx() call fails, because prepare_transfers()
   tries to submit a transfer which is already in use.

To fix this, we add a lock which must be held to either cancel transfers
or restart them. This ensures that only one of these actions can happen
for a given transfer; it's no longer possible for a transfer to be
cancelled and then immediately restarted.
Calling libusb_cancel_transfer only starts the cancellation of a
transfer. The process is not complete until the transfer callback
has been called with status LIBUSB_TRANSFER_CANCELLED.

If hackrf_start_rx() is called soon after hackrf_stop_rx(),
prepare_transfers() may be called before the previous cancellations
are completed, resulting in a LIBUSB_ERROR_BUSY when a transfer is
reused with libusb_submit_transfer().

To prevent this happening, we keep track of which transfers have
finished (either by completion, or cancellation), and make
cancel_transfers() wait until all transfers are finished.

This is implemented using a pthread condition variable which is
signalled from the transfer thread.
This fixes bug greatscottgadgets#1042, which occured when an RX->OFF->RX sequence
happened quickly enough that the loop in rx_mode() did not see the
change. As a result, the enable_baseband_streaming() call at the start
of that function was not repeated for the new RX operation, so RX
progress stalled.

To solve this, the vendor request handler now increments a sequence
number when it changes the transceiver mode. Instead of the RX loop
checking whether the transceiver mode is still RX, it now checks whether
the current sequence number is the same as when it was started. If not,
there must have been at least one mode change, so the loop exits, and
the main loop starts the necessary loop for the new mode. The same
behaviour is implemented for the TX and sweep loops.

For this approach to be reliable, we must ensure that when deciding
which mode and sequence number to use, we take both values from the same
set_transceiver_mode request.

To achieve this, we briefly disable the USB0 interrupt to stop the
vendor request handler from running whilst reading the mode and sequence
number together. Then the loop dispatch proceeds using those pre-read
values.
@martinling
Copy link
Member Author

I've added a third commit which fixes the firmware-side issue #1042.

With the combination of all three changes on this PR, I can no longer reproduce any problems with repeated start/stop of HackRF.

I plan to do some further work to simplify the solutions, but for anyone having problems this branch should work as-is.

@martinling martinling changed the title Fixes for concurrency bugs in libhackrf start/stop operations Fixes for concurrency bugs in start/stop operations Feb 3, 2022
@gozu42
Copy link
Contributor

gozu42 commented Feb 3, 2022

can confirm no longer seeing any restart-hangs with 7057235 on pc+raspi with test app from #916 or gqrx ctrl-d stresstest.

@martinling
Copy link
Member Author

Since this PR has been tested & confirmed to fix #883, #916 and #1042, I propose we get it merged as-is without further changes, and we can look at making the simplifications discussed in this comment and this comment in a separate PR later.

@mossmann mossmann merged commit 2821cdc into greatscottgadgets:master Feb 8, 2022
@mossmann
Copy link
Member

mossmann commented Feb 8, 2022

Outstanding troubleshooting and solutions!

martinling added a commit to martinling/hackrf that referenced this pull request Feb 8, 2022
This is a defensive change to make the transceiver code easier to reason
about, and to avoid the possibility of races such as that seen in greatscottgadgets#1042.

Previously, set_transceiver_mode() was called in the vendor request
handler for the SET_TRANSCEIVER_MODE request, as well in the callback
for a USB configuration change. Both these calls are made from the USB0
ISR, so could interrupt the rx_mode(), tx_mode() and sweep_mode()
functions at any point. It was hard to tell if this was safe.

Instead, set_transceiver_mode() has been removed, and its work is split
into three parts:

- request_transceiver_mode(), which is safe to call from ISR context.
  All this function does is update the requested mode and increment a
  sequence number. This builds on work already done in PR greatscottgadgets#1029, but
  the interface has been simplified to use a shared volatile structure.

- transceiver_startup(), which transitions the transceiver from an idle
  state to the configuration required for a specific mode, including
  setting up the RF path, configuring the M0, adjusting LEDs and UI etc.

- transceiver_shutdown(), which transitions the transceiver back to an
  idle state.

The *_loop functions that implement the transceiver modes now call
transceiver_startup() before starting work, and transceiver_shutdown()
before returning, and all this happens in the main thread of execution.

As such, it is now guaranteed that all the steps involved happen in a
consistent order, with the transceiver starting from an idle state, and
being returned to an idle state before control returns to the main loop.

For consistency of interface, an off_mode() function has been added to
implement the behaviour of the OFF transceiver mode. Since the
transceiver is already guaranteed to be in an idle state when this is
called, the only work required is to set the UI mode and wait for a new
mode request.
martinling added a commit to martinling/hackrf that referenced this pull request Feb 8, 2022
This is a defensive change to make the transceiver code easier to reason
about, and to avoid the possibility of races such as that seen in greatscottgadgets#1042.

Previously, set_transceiver_mode() was called in the vendor request
handler for the SET_TRANSCEIVER_MODE request, as well in the callback
for a USB configuration change. Both these calls are made from the USB0
ISR, so could interrupt the rx_mode(), tx_mode() and sweep_mode()
functions at any point. It was hard to tell if this was safe.

Instead, set_transceiver_mode() has been removed, and its work is split
into three parts:

- request_transceiver_mode(), which is safe to call from ISR context.
  All this function does is update the requested mode and increment a
  sequence number. This builds on work already done in PR greatscottgadgets#1029, but
  the interface has been simplified to use a shared volatile structure.

- transceiver_startup(), which transitions the transceiver from an idle
  state to the configuration required for a specific mode, including
  setting up the RF path, configuring the M0, adjusting LEDs and UI etc.

- transceiver_shutdown(), which transitions the transceiver back to an
  idle state.

The *_mode() functions that implement the transceiver modes now call
transceiver_startup() before starting work, and transceiver_shutdown()
before returning, and all this happens in the main thread of execution.

As such, it is now guaranteed that all the steps involved happen in a
consistent order, with the transceiver starting from an idle state, and
being returned to an idle state before control returns to the main loop.

For consistency of interface, an off_mode() function has been added to
implement the behaviour of the OFF transceiver mode. Since the
transceiver is already guaranteed to be in an idle state when this is
called, the only work required is to set the UI mode and wait for a new
mode request.
@martinling
Copy link
Member Author

The simplification discussed in this comment is now in PR #1071.

I had a go at the the other proposed change outlined in this comment, but decided not to go ahead with it. Although it would eliminate some locking, it would just add different complications instead. It would detract from the simplicity of the transfer thread, which is currently just a trivial libusb event handling loop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants