
Repeated Start/Stop RX Produces Error #916

Closed
AlexFWulff opened this issue Jul 12, 2021 · 18 comments · Fixed by #1029

@AlexFWulff

AlexFWulff commented Jul 12, 2021

I think this is still a problem on the most recent release (2021.03.1). I wrote a C program that reproduces an error where, after a few repeated start_rx/stop_rx calls in quick succession, the TX light stays on and the RX callback is never triggered.

Here is my hackrf_info output (the platform is an RPi 3 running Raspbian):

hackrf_info version: 2021.03.1
libhackrf version: unknown (0.5)
Found HackRF
Index: 0
Serial number: 0000000000000000088869dc24397f1b
Board ID Number: 2 (HackRF One)
Firmware Version: 2021.03.1 (API:1.04)
Part ID Number: 0xa000cb3c 0x0041474f

libhackrf is not installed globally - instead I'm manually linking in the compiled libhackrf 2021.03.1 binary.

Check out this gist for the program I'm using.
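
Roughly, the program looks like this (a simplified sketch rather than the exact gist code; the sample rate and frequency here are placeholders):

    #include <libhackrf/hackrf.h>   /* or "hackrf.h" when linking against the source tree build */
    #include <stdio.h>
    #include <unistd.h>

    static volatile int got_data = 0;

    /* RX callback: report the transfer size and flag that data arrived. */
    static int rx_callback(hackrf_transfer* transfer)
    {
        printf("Valid length: %d\n", transfer->valid_length);
        got_data = 1;
        return 0;
    }

    int main(void)
    {
        hackrf_device* device = NULL;

        hackrf_init();
        hackrf_open(&device);
        hackrf_set_sample_rate(device, 20e6);
        hackrf_set_freq(device, 100000000);

        /* Repeatedly start RX, wait until the callback has fired, then stop. */
        for (;;) {
            printf("Start\n");
            got_data = 0;
            hackrf_start_rx(device, rx_callback, NULL);
            while (!got_data) {
                printf("Not started yet\n");
                usleep(100000); /* 100 ms */
            }
            printf("Finally started\n");
            hackrf_stop_rx(device);
        }
    }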

Below is the expected output. The RX should start, possibly take a few loop cycles before the RX callback is first called, and then stop and start again, repeating indefinitely.

Start
Not started yet
Valid length: 262144
Valid length: 262144
Valid length: 262144
Finally started
Start
Not started yet
Valid length: 262144
Valid length: 262144
Finally started
Start
Valid length: 262144
....

Below is the output I get after letting the program run for about 10 seconds (sometimes more, sometimes less). Eventually the callback is no longer triggered and the program gets stuck on "Not started yet" forever.

Finally started
Start
Not started yet
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
... [repeats a few times]
Finally started
Start
Not started yet
Not started yet
Not started yet
Not started yet
Not started yet
... [continues forever]

Is there a way to fix this to allow for start_rx and stop_rx to be called in quick succession?

Originally posted by AlexFWulff in #394 (comment)

@straithe added the technical support label on Jul 21, 2021
@straithe
Member

Have you tried reproducing this issue on a Linux desktop or laptop that is not a Raspberry Pi?

@straithe self-assigned this on Jul 21, 2021
@AlexFWulff
Author

AlexFWulff commented Jul 21, 2021

Yes, I have tried this on macOS natively and in a Debian VirtualBox instance with a macOS host.

@xaki23

xaki23 commented Jul 22, 2021

I can repro this with the supplied example code on both a RasPi 4 and a Fedora VM.
Tried some variants, but can't make much sense of it yet.
This is meant as a rough collection of observations.

Putting the device open/close inside the loop seems to reliably prevent the problem.
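
(Roughly like this, as a simplified sketch rather than my exact test variant:)

    for (;;) {
        hackrf_device* device = NULL;
        hackrf_open(&device);                        /* open fresh each iteration */
        hackrf_start_rx(device, rx_callback, NULL);
        /* ... wait for the callback, as in the gist ... */
        hackrf_stop_rx(device);
        hackrf_close(device);                        /* closing here avoids the stuck state */
    }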

Slowing down the loop (a sleep(1) between start/stop and/or after stop) reveals a start_rx error that is not visible in the unslowed loop.
I also noticed that when this happens I see data arriving (the rx callback fires) after the stop_rx.

Valid length: 262144
Valid length: 262144
Finally started
Stop
Valid length: 262144
Start
hackrf_start_rx() failed: Resource busy (-1000)
Valid length: 262144  
Valid length: 262144  

In this example, note the rx_callback firing ("Valid length") after the Stop, and still firing even though the start "failed".

So my guess is stop_rx is not fully cleaning up in-flight transfers, and that messes up some libhackrf or libusb internal state.
The "open/close inside the loop prevents it" observation agrees with that.

It also points at something timing-dependent.
When I tried it on the RasPi, it would happen MUCH more frequently when the system I SSHed from was under heavy load.
The only feedback mechanism I can imagine there is the print(s) inside the rx_callback blocking briefly.

Beware: the example code has some "odd" minor issues, like the "result |=" assignment (so an earlier error can be reported again in the wrong place) and some error reporting using the wrong label (the stop_rx error report claims it is start_rx).
Not knocking the original poster (supplying the example code really, really helped! 👍 for that!), just a warning for @straithe or anyone else trying it out.
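
For anyone adapting the gist, per-call error checking along these lines avoids that confusion (just a sketch; the variable names are mine, not the gist's):

    /* Check each call separately so errors are reported where they occur. */
    int result = hackrf_stop_rx(device);
    if (result != HACKRF_SUCCESS) {
        fprintf(stderr, "hackrf_stop_rx() failed: %s (%d)\n",
                hackrf_error_name(result), result);
    }

    result = hackrf_start_rx(device, rx_callback, NULL);
    if (result != HACKRF_SUCCESS) {
        fprintf(stderr, "hackrf_start_rx() failed: %s (%d)\n",
                hackrf_error_name(result), result);
    }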

@AlexFWulff
Author

AlexFWulff commented Jul 22, 2021

@xaki23 interesting, and I'm glad you were able to reproduce this too. My experience was that I got much more time before things started failing on the RPi as compared to my VM. I also noticed that putting open/close in the loop mitigates the error, but this is obviously not ideal as far as fast repeated sampling goes.

I noticed that with the 2018 HackRF firmware release things are quite a bit more stable, although the problem still persists. This program runs for about 3x longer with the 2018 firmware. Can you try re-flashing your HackRF with the older firmware and confirming?

@dmaltsiniotis
Contributor

dmaltsiniotis commented Jul 23, 2021

Using the gist as a starting point, I was also able to reproduce the behavior described: the device sometimes gets stuck when calling hackrf_start_rx() immediately after a call to hackrf_stop_rx(), failing with Resource busy (-1000). I don't have a full understanding of the underlying library and driver yet, so I'm still digging in.

Preliminary findings

I have found that adjusting the sample rate exacerbates or alleviates the issue. For example, I cannot get it to "get stuck" at sample rates at or below 14e6. I don't think this actually has anything to do with the sample rate itself; more likely, at high-to-maximum sample rates the buffers tend to be fuller, which triggers some kind of buffer issue under certain conditions.

I have also found another way of triggering this behavior consistently at any sample rate and setting, totally by accident. It requires a bit of explanation for non-Windows users though: if a right mouse click is issued in a command prompt window on Windows, the command prompt enters a "cursor highlight mode" of sorts, making it easy to select text from the console without it scrolling. This has another side effect: it pauses the running process, which is incredibly frustrating when you don't realize there's a small highlight somewhere and you're waiting ages for some command to finish... but I digress. What I found was that, using the given gist code, any amount of highlighting in the window (i.e. pausing the process), even for less than a second, triggers this issue.

These preliminary results are leading me down two paths of investigation:

  • The first path is checking for obvious or non-obvious race-type conditions or buffer issues in the libhackrf stop code.
  • The second path is much more mundane: the possibility that the gist code itself is causing this issue by not processing the callback fast enough, causing the receive buffer to back up, which in turn puts the device into a weird state when trying to stop and start again quickly.

Additionally, after digging a bit through libhackrf code, it seems someone encountered this very issue (or something similar to it) and attempted to mitigate it a little while back by adding a 10ms sleep on the call to hackrf_stop_rx(): b4ea51a

Regarding receiving data after stop, as @xaki23 mentions: while I too see the same console output, I'm not totally convinced the callback is being fired after calling stop. It's incredibly difficult for me to tell which thread is writing what and when with just console output. It could very well be that the receive callback is being triggered as expected while libhackrf is chugging away through its stop code. I have not encountered a single instance in my testing so far of the callback actually being triggered, honest-to-god, after I printed the result from hackrf_stop_rx().

Code for reference to make sense of the log below

    fprintf(stdout,"Calling hackrf_stop_rx()\n");                                       <- Callbacks observed triggering around this point.
    result = hackrf_stop_rx(device);                                                       <- Callbacks observed triggering around this point.
    fprintf(stdout,"hackrf_stop_rx result: %s (%d)\n", hackrf_error_name(result), result); <- No callbacks observed after this point.

My (slightly modified) logs: I've tried to redact some of the redundant output for readability.

...
Calling hackrf_stop_rx()
hackrf_stop_rx result: HACKRF_SUCCESS (0)
Calling hackrf_start_rx()
hackrf_start_rx result: HACKRF_SUCCESS (0)                     <- No issues starting again, but also didn't receive "stray" callbacks.
rx_callback() not called yet. Waiting 100ms.
rx_callback() callback triggered with valid_length: 262144
... (repeated 11x)
rx_callback() was triggered at least once.
Calling hackrf_stop_rx()
hackrf_stop_rx result: HACKRF_SUCCESS (0)
Calling hackrf_start_rx()
hackrf_start_rx result: HACKRF_SUCCESS (0)                     <- No issues starting again, but also didn't receive "stray" callbacks.
rx_callback() not called yet. Waiting 100ms.
rx_callback() callback triggered with valid_length: 262144
... (repeated 11x)
rx_callback() was triggered at least once.
Calling hackrf_stop_rx()
hackrf_stop_rx result: HACKRF_SUCCESS (0)                     
Calling hackrf_start_rx()
hackrf_start_rx result: HACKRF_SUCCESS (0)                     <- No issues starting again, but also didn't receive "stray" callbacks.
rx_callback() not called yet. Waiting 100ms.
rx_callback() callback triggered with valid_length: 262144
... (repeated 11x)
rx_callback() was triggered at least once.
Calling hackrf_stop_rx()
rx_callback() callback triggered with valid_length: 262144  <- These came in sometime during the stop process
rx_callback() callback triggered with valid_length: 262144  <- These came in sometime during the stop process
rx_callback() callback triggered with valid_length: 262144  <- These came in sometime during the stop process
hackrf_stop_rx result: HACKRF_SUCCESS (0)
Calling hackrf_start_rx()
hackrf_start_rx result: Resource busy (-1000)               <- Next attempt at restarting gets a busy signal, fails.
rx_callback() was triggered at least once.
Calling hackrf_stop_rx()
hackrf_stop_rx result: unspecified error (-9999)
Calling hackrf_start_rx()
hackrf_start_rx result: Resource busy (-1000)

So I guess that's a very verbose way of saying... ¯\_(ツ)_/¯

@straithe added the bug label and removed the technical support label on Jul 28, 2021
@straithe
Member

Thanks for reporting, reproducing, and investigating this, everyone! It seems we might have a bug on our hands and I've tagged this issue as such.

I don't think anyone here at GSG has the cycles to tackle this right now, but we will get to it when we can.

For now, if anyone keeps experimenting with this issue, please keep adding your findings as it may help us or someone in the community create a solution eventually!

@AlexFWulff
Author

@straithe I need this rapid start/stop functionality for an end application. Is there any initial guidance you can provide on how to troubleshoot this? Is there any reason you can think of why the 2018 firmware release is more stable?

@straithe
Member

No, sorry. Also, the person on the team I would get to look at this is on a much-needed vacation for a few weeks. I can try to get this higher up their stack once they get back.

@AlexFWulff
Author

No worries - thanks. I'll start poking around.

@straithe
Member

I've brought this up to the team! Do you have any other updates from poking around?

@AlexFWulff
Author

I've brought this up to the team! Do you have any other updates from poking around?

Not much, unfortunately. I've been mainly using the 2018 release because I can get my applications to run for a few minutes instead of seconds before locking up, but it still stops eventually.

@straithe added the needs validation label on Nov 15, 2021
@martinling
Member

I've started looking at this, but it's been taking about 30-40 minutes for the problem to show up on my system, so it's going to be a slow business.

It's definitely still an issue, though; I've reproduced it on both 2021.03.01 and current master.

@martinling
Member

martinling commented Dec 31, 2021

A few further notes, building on @dmaltsiniotis' observations above:

  • This definitely seems to be easier to reproduce on Windows.
  • It's influenced by USB throughput. As well as being more likely to happen at higher sample rates, I can actually trigger the failure just by moving a mouse that's plugged into the same USB hub as the HackRF.

Of these two proposed explanations:

  • The first path is checking for obvious or non-obvious race-type conditions or buffer issues in the libhackrf stop code.
  • The second path is much more mundane: the possibility that the gist code itself is causing this issue by not processing the callback fast enough, causing the receive buffer to back up, which in turn puts the device into a weird state when trying to stop and start again quickly.

...I'm inclined to rule out the second.

The firmware just doesn't care about the buffer being full. There isn't really any "backed up" state for the device to get into. The M0 core keeps dumping RX data into a circular buffer unconditionally. If it wraps, old data just gets overwritten. Meanwhile, the M4 core keeps doing bulk transfers from the buffer to the host. If the host doesn't request data quickly enough to keep up with the sampling rate, it just silently loses some data - this happens very easily for all sorts of reasons (slow callbacks, poor I/O bandwidth/scheduling, other devices on the bus, etc) and the device just doesn't care. Even with my work in #982, the situation isn't really different - all that changes is that there's now some stats available out-of-band which tell you how much you lost.
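
To make that concrete, the M0 side amounts to something like this (a toy model for illustration, not the actual firmware code):

    #include <stdint.h>

    /* Toy model: samples go into a fixed-size ring unconditionally,
       and old data is simply overwritten when the write pointer wraps. */
    #define RING_SIZE 32768

    static uint8_t ring[RING_SIZE];
    static unsigned int write_index = 0;

    static void write_sample(uint8_t sample)
    {
        ring[write_index] = sample;
        write_index = (write_index + 1) % RING_SIZE; /* wrap and overwrite */
    }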

And the firmware doesn't seem to be getting into any weird state or locking up. Things stall on the host because hackrf_start_rx() fails, but that doesn't seem to be due to any issue on the device. As soon as the device is closed and reopened by libhackrf, everything works normally again. That closing and opening doesn't really mean anything from the device's point of view. It's not going through a reset or anything.

I think this is purely a host-side issue, with things getting into some bad state due to the interactions between the main thread calling hackrf_stop_rx(), the transfer thread processing libusb events, and the bulk transfers that are still in flight around that time. Something, sometimes, isn't being cleaned up correctly as it all gets torn down, and that leaves things in a state where the next hackrf_start_rx() call fails.

The fact that we see multiple RX callbacks being made between hackrf_stop_rx() being called and that function returning should be a good clue as to what the problem is. Something in there isn't safe - my guess would be that some sort of locking is needed between the work done in cancel_transfers(), which happens in the main thread, and the completion handling being done in hackrf_libusb_transfer_callback(), which happens in the transfer thread.

@martinling
Member

I think I understand what's happening:

  1. The main thread calls hackrf_stop_rx(). That in turn calls cancel_transfers(), which iterates through the 4 transfers in use and cancels them one by one with libusb_cancel_transfer().

  2. At the same time though, one of those transfers is already in the process of completing. The transfer thread calls hackrf_libusb_transfer_callback(), which handles the data and then calls libusb_submit_transfer() to resubmit that transfer. This can happen after cancel_transfers() has cancelled that transfer, so the transfer is now started again.

  3. Now there's actually a transfer still active, despite cancel_transfers() having cancelled all transfers and hackrf_stop_rx() having completed. This is the bad state that shouldn't happen.

  4. Then, you call hackrf_start_rx() again. It calls prepare_transfers(), which tries to submit all 4 transfers, expecting none of them to be in use. But one of them is already submitted, so you get a Resource busy error.

So the solution is to make sure that no in-flight transfer can end up being restarted by the transfer thread after cancel_transfers() has been called. I think that may be as simple as setting some state at the start of cancel_transfers(), which is checked by hackrf_libusb_transfer_callback() before it resubmits a transfer.
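
As a sketch of that idea (the names here are illustrative, not the actual libhackrf code in the fix):

    #include <stdbool.h>
    #include <pthread.h>
    #include <libusb-1.0/libusb.h>

    /* Illustrative shared state guarding transfer resubmission. */
    static pthread_mutex_t transfer_lock = PTHREAD_MUTEX_INITIALIZER;
    static bool transfers_active = false;

    /* Runs in the transfer thread when a libusb transfer completes. */
    static void transfer_callback_sketch(struct libusb_transfer* transfer)
    {
        /* ... hand the received data to the user's RX callback ... */
        pthread_mutex_lock(&transfer_lock);
        if (transfers_active) {
            /* Only resubmit while streaming is still meant to be running. */
            libusb_submit_transfer(transfer);
        }
        pthread_mutex_unlock(&transfer_lock);
    }

    /* Runs in the main thread from hackrf_stop_rx(). */
    static void cancel_transfers_sketch(struct libusb_transfer** transfers, int count)
    {
        pthread_mutex_lock(&transfer_lock);
        transfers_active = false; /* no callback may resubmit after this point */
        for (int i = 0; i < count; i++)
            libusb_cancel_transfer(transfers[i]);
        pthread_mutex_unlock(&transfer_lock);
    }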

martinling added a commit to martinling/hackrf that referenced this issue Dec 31, 2021
Fixes bug greatscottgadgets#916.

Previously, there was a race which could lead to a transfer being left
active after cancel_transfers() completed. This would then cause the
next prepare_transfers() call to fail, because libusb_submit_transfer()
would return an error due to the transfer already being in use.

The sequence of events that could cause this was:

1. Main thread calls hackrf_stop_rx(), which calls cancel_transfers(),
   which iterates through the 4 transfers in use and cancels them one
   by one with libusb_cancel_transfer().

2. During this time, a transfer is completed. The transfer thread calls
   hackrf_libusb_transfer_callback(), which handles the data and then
   calls libusb_submit_transfer() to resubmit that transfer.

3. Now, cancel_transfers() and hackrf_stop_rx() are completed but one
   transfer is still active.

4. The next hackrf_start_rx() call fails, because prepare_transfers()
   tries to submit a transfer which is already in use.

To fix this, we add a lock which must be held to either cancel transfers
or restart them. This ensures that only one of these actions can happen
for a given transfer; it's no longer possible for a transfer to be
cancelled and then immediately restarted.
@martinling linked a pull request on Dec 31, 2021 that will close this issue
@martinling removed the needs validation label on Jan 1, 2022
martinling added a commit to martinling/hackrf that referenced this issue Jan 1, 2022
@gozu42
Contributor

gozu42 commented Jan 2, 2022

The test program still locks up here with the #1029 libhackrf.

$ ./hackrf_loop 
Start
Not started yet
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Finally started
Start
Not started yet
Not started yet
Not started yet
Not started yet
Not started yet
Not started yet
Not started yet

Sometimes there are multiple successful rounds, but it locks up within a few seconds.

libhackrf version: git-b346790e (0.6)
Firmware Version: git-a83bfc4b (API:1.05)

(edit: quote fix)

@martinling
Member

Oh, interesting. On my Windows test system (which was the only one I could reproduce this readily on) it was a complete fix.

The failure you see there looks different though - there's no error reported from hackrf_start_rx(). There might be a firmware-side element to it. Do you get the same with firmware from master, or 2021.03.01?

@gozu42
Contributor

gozu42 commented Jan 2, 2022

Sorry, I can't test with fw 2021.03.01 currently because that lacks the usb-addr-zero fix from #987.

@martinling
Member

I believe all the host-side causes of failing to stop/start RX are now fixed in #1029.

I have confirmed that there is also a firmware-side bug which can cause RX stop/start to appear to succeed, but stall after the first 16KB of data. I've started a separate issue for this in #1042.

martinling added a commit to martinling/hackrf that referenced this issue Feb 3, 2022