
Repeated Start/Stop RX Produces Error #916

Closed
AlexFWulff opened this issue Jul 12, 2021 · 18 comments · Fixed by #1029

@AlexFWulff

AlexFWulff commented Jul 12, 2021

I think this is still a problem on the most recent release (2021.03.1). I wrote a C program that reproduces an error where, after a few repeated start_rx/stop_rx calls in quick succession, the TX light stays on and the RX callback is never triggered.

Here is my hackrf_info output (the platform is an RPi 3 running Raspbian):

hackrf_info version: 2021.03.1
libhackrf version: unknown (0.5)
Found HackRF
Index: 0
Serial number: 0000000000000000088869dc24397f1b
Board ID Number: 2 (HackRF One)
Firmware Version: 2021.03.1 (API:1.04)
Part ID Number: 0xa000cb3c 0x0041474f

libhackrf is not installed globally - instead I'm manually linking in the compiled libhackrf 2021.03.1 binary.

Check out this gist for the program I'm using.
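
Roughly, the program looks like this (a simplified sketch rather than the exact gist code; the sample rate and frequency here are placeholders):

    #include <libhackrf/hackrf.h>   /* or "hackrf.h" when linking against the source tree build */
    #include <stdio.h>
    #include <unistd.h>

    static volatile int got_data = 0;

    /* RX callback: report the transfer size and flag that data arrived. */
    static int rx_callback(hackrf_transfer* transfer)
    {
        printf("Valid length: %d\n", transfer->valid_length);
        got_data = 1;
        return 0;
    }

    int main(void)
    {
        hackrf_device* device = NULL;

        hackrf_init();
        hackrf_open(&device);
        hackrf_set_sample_rate(device, 20e6);
        hackrf_set_freq(device, 100000000);

        /* Repeatedly start RX, wait until the callback has fired, then stop. */
        for (;;) {
            printf("Start\n");
            got_data = 0;
            hackrf_start_rx(device, rx_callback, NULL);
            while (!got_data) {
                printf("Not started yet\n");
                usleep(100000); /* 100 ms */
            }
            printf("Finally started\n");
            hackrf_stop_rx(device);
        }
    }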

Below is the expected output. The RX should start, possibly take a few loop cycles before the RX callback is first called, and then stop and start again, repeating indefinitely.

Start
Not started yet
Valid length: 262144
Valid length: 262144
Valid length: 262144
Finally started
Start
Not started yet
Valid length: 262144
Valid length: 262144
Finally started
Start
Valid length: 262144
....

Below is the output I get after letting the program run for about 10 seconds (sometimes more, sometimes less). Eventually the callback is no longer triggered and the program gets stuck on "Not started yet" forever.

Finally started
Start
Not started yet
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
... [repeats a few times]
Finally started
Start
Not started yet
Not started yet
Not started yet
Not started yet
Not started yet
... [continues forever]

Is there a way to fix this to allow for start_rx and stop_rx to be called in quick succession?

Originally posted by AlexFWulff in #394 (comment)

@straithe added the technical support label on Jul 21, 2021
@straithe
Member

Have you tried reproducing this issue on a Linux desktop or laptop that is not a Raspberry Pi?

@straithe self-assigned this on Jul 21, 2021
@AlexFWulff
Author

AlexFWulff commented Jul 21, 2021

Yes, I have tried this on macOS natively and in a Debian VirtualBox instance with a macOS host.

@xaki23

xaki23 commented Jul 22, 2021

I can repro this with the supplied example code on both a RasPi 4 and a Fedora VM.
Tried some variants, but can't make much sense of it yet.
This is meant as a rough collection of observations.

Putting the device open/close inside the loop seems to reliably prevent the problem.
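
(Roughly like this, as a simplified sketch rather than my exact test variant:)

    for (;;) {
        hackrf_device* device = NULL;
        hackrf_open(&device);                        /* open fresh each iteration */
        hackrf_start_rx(device, rx_callback, NULL);
        /* ... wait for the callback, as in the gist ... */
        hackrf_stop_rx(device);
        hackrf_close(device);                        /* closing here avoids the stuck state */
    }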

Slowing down the loop (a sleep(1) between start/stop and/or after stop) reveals a start_rx error that is not visible in the unslowed loop.
I also noticed that when this happens I see data arriving (the rx callback fires) after the stop_rx.

Valid length: 262144
Valid length: 262144
Finally started
Stop
Valid length: 262144
Start
hackrf_start_rx() failed: Resource busy (-1000)
Valid length: 262144  
Valid length: 262144  

In this example, note the rx_callback firing ("Valid length") after the Stop, and still firing even though the start "failed".

So my guess is stop_rx is not fully cleaning up in-flight transfers, and that messes up some libhackrf or libusb internal state.
The "open/close inside the loop prevents it" observation agrees with that.

It also points at something timing-dependent.
When I tried it on the RasPi, it would happen MUCH more frequently when the system I SSHed from was under heavy load.
The only feedback mechanism I can imagine there is the print(s) inside the rx_callback blocking briefly.

Beware: the example code has some "odd" minor issues, like the "result |=" assignment (so an earlier error can be reported again in the wrong place) and some error reporting using the wrong label (the stop_rx error report claims it is start_rx).
Not knocking the original poster (supplying the example code really, really helped! 👍 for that!), just a warning for @straithe or anyone else trying it out.
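
For anyone adapting the gist, per-call error checking along these lines avoids that confusion (just a sketch; the variable names are mine, not the gist's):

    /* Check each call separately so errors are reported where they occur. */
    int result = hackrf_stop_rx(device);
    if (result != HACKRF_SUCCESS) {
        fprintf(stderr, "hackrf_stop_rx() failed: %s (%d)\n",
                hackrf_error_name(result), result);
    }

    result = hackrf_start_rx(device, rx_callback, NULL);
    if (result != HACKRF_SUCCESS) {
        fprintf(stderr, "hackrf_start_rx() failed: %s (%d)\n",
                hackrf_error_name(result), result);
    }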

@AlexFWulff
Author

AlexFWulff commented Jul 22, 2021

@xaki23 interesting, and I'm glad you were able to reproduce this too. My experience was that I got much more time before things started failing on the RPi as compared to my VM. I also noticed that putting open/close in the loop mitigates the error, but this is obviously not ideal as far as fast repeated sampling goes.

I noticed that with the 2018 HackRF firmware release things are quite a bit more stable, although the problem still persists. This program runs for about 3x longer with the 2018 firmware. Can you try re-flashing your HackRF with the older firmware and confirming?

@dmaltsiniotis
Contributor

dmaltsiniotis commented Jul 23, 2021

Using the gist as a starting point, I was also able to reproduce the behavior described: the device sometimes gets stuck when calling hackrf_start_rx() immediately after a call to hackrf_stop_rx(), failing with Resource busy (-1000). I don't have a full understanding of the underlying library and driver yet, so I'm still digging in.

Preliminary findings

I have found that adjusting the sample rate exacerbates or alleviates the issue. For example, I cannot get it to "get stuck" at sample rates at or below 14e6. I don't think this actually has anything to do with the sample rate itself; more likely, at high-to-maximum sample rates the buffers tend to be fuller, which triggers some kind of buffer issue under certain conditions.

I have also found another way of triggering this behavior consistently at any sample rate and setting, totally by accident. It requires a bit of explanation for non-Windows users though: if a right mouse click is issued in a command prompt window on Windows, the command prompt enters a "cursor highlight mode" of sorts, making it easy to select text from the console without it scrolling. This has another side effect: it pauses the running process, which is incredibly frustrating when you don't realize there's a small highlight somewhere and you're waiting ages for some command to finish... but I digress. What I found was that, using the given gist code, any amount of highlighting in the window (i.e. pausing the process), even for less than a second, triggers this issue.

These preliminary results are leading me down two paths of investigation:

  • The first path is checking for obvious or non-obvious race-type conditions or buffer issues in the libhackrf stop code.
  • The second path is much more mundane: the possibility that the gist code itself is causing this issue by not processing the callback fast enough, causing the receive buffer to back up, which in turn puts the device into a weird state when trying to stop and start again quickly.

Additionally, after digging a bit through libhackrf code, it seems someone encountered this very issue (or something similar to it) and attempted to mitigate it a little while back by adding a 10ms sleep on the call to hackrf_stop_rx(): b4ea51a

Regarding receiving data after stop, as @xaki23 mentions: while I too see the same console output, I'm not totally convinced the callback is being fired after calling stop. It's incredibly difficult for me to tell which thread is writing what and when with just console output. It could very well be that the receive callback is being triggered as expected while libhackrf is chugging away through its stop code. I have not encountered a single instance in my testing so far of the callback actually being triggered, honest-to-god, after I printed the result from hackrf_stop_rx().

Code for reference to make sense of the log below

    fprintf(stdout,"Calling hackrf_stop_rx()\n");                                       <- Callbacks observed triggering around this point.
    result = hackrf_stop_rx(device);                                                       <- Callbacks observed triggering around this point.
    fprintf(stdout,"hackrf_stop_rx result: %s (%d)\n", hackrf_error_name(result), result); <- No callbacks observed after this point.

My (slightly modified) logs: I've tried to redact some of the redundant output for readability.

...
Calling hackrf_stop_rx()
hackrf_stop_rx result: HACKRF_SUCCESS (0)
Calling hackrf_start_rx()
hackrf_start_rx result: HACKRF_SUCCESS (0)                     <- No issues starting again, but also didn't receive "stray" callbacks.
rx_callback() not called yet. Waiting 100ms.
rx_callback() callback triggered with valid_length: 262144
... (repeated 11x)
rx_callback() was triggered at least once.
Calling hackrf_stop_rx()
hackrf_stop_rx result: HACKRF_SUCCESS (0)
Calling hackrf_start_rx()
hackrf_start_rx result: HACKRF_SUCCESS (0)                     <- No issues starting again, but also didn't receive "stray" callbacks.
rx_callback() not called yet. Waiting 100ms.
rx_callback() callback triggered with valid_length: 262144
... (repeated 11x)
rx_callback() was triggered at least once.
Calling hackrf_stop_rx()
hackrf_stop_rx result: HACKRF_SUCCESS (0)                     
Calling hackrf_start_rx()
hackrf_start_rx result: HACKRF_SUCCESS (0)                     <- No issues starting again, but also didn't receive "stray" callbacks.
rx_callback() not called yet. Waiting 100ms.
rx_callback() callback triggered with valid_length: 262144
... (repeated 11x)
rx_callback() was triggered at least once.
Calling hackrf_stop_rx()
rx_callback() callback triggered with valid_length: 262144  <- These came in sometime during the stop process
rx_callback() callback triggered with valid_length: 262144  <- These came in sometime during the stop process
rx_callback() callback triggered with valid_length: 262144  <- These came in sometime during the stop process
hackrf_stop_rx result: HACKRF_SUCCESS (0)
Calling hackrf_start_rx()
hackrf_start_rx result: Resource busy (-1000)               <- Next attempt at restarting gets a busy signal, fails.
rx_callback() was triggered at least once.
Calling hackrf_stop_rx()
hackrf_stop_rx result: unspecified error (-9999)
Calling hackrf_start_rx()
hackrf_start_rx result: Resource busy (-1000)

So I guess that's a very verbose way of saying... ¯\_(ツ)_/¯

@straithe added the bug label and removed the technical support label on Jul 28, 2021
@straithe
Member

Thanks for reporting, reproducing, and investigating this, everyone! It seems we might have a bug on our hands and I've tagged this issue as such.

I don't think anyone here at GSG has the cycles to tackle this right now, but we will get to it when we can.

For now, if anyone keeps experimenting with this issue, please keep adding your findings as it may help us or someone in the community create a solution eventually!

@AlexFWulff
Author

@straithe I need this rapid start/stop functionality for an end application. Is there any initial guidance you can provide on how to troubleshoot this? Is there any reason you can think of why the 2018 firmware release is more stable?

@straithe
Member

No, sorry. Also, the person on the team I would get to look at this is on a much-needed vacation for a few weeks. I can try to get this higher up their stack once they get back.

@AlexFWulff
Author

No worries - thanks. I'll start poking around.

@straithe
Member

I've brought this up to the team! Do you have any other updates from poking around?

@AlexFWulff
Author

I've brought this up to the team! Do you have any other updates from poking around?

Not much, unfortunately. I've been mainly using the 2018 release because I can get my applications to run for a few minutes instead of seconds before locking up, but it still stops eventually.

@straithe added the needs validation label on Nov 15, 2021
@martinling
Member

I've started looking at this, but it's been taking about 30-40 minutes for the problem to show up on my system, so it's going to be a slow business.

It's definitely still an issue, though; I've reproduced it on both 2021.03.01 and current master.

@martinling
Member

martinling commented Dec 31, 2021

A few further notes, building on @dmaltsiniotis' observations above:

  • This definitely seems to be easier to reproduce on Windows.
  • It's influenced by USB throughput. As well as being more likely to happen at higher sample rates, I can actually trigger the failure just by moving a mouse that's plugged into the same USB hub as the HackRF.

Of these two proposed explanations:

  • The first path is checking for obvious or non-obvious race-type conditions or buffer issues in the libhackrf stop code.
  • The second path is much more mundane: the possibility that the gist code itself is causing this issue by not processing the callback fast enough, causing the receive buffer to back up, which in turn puts the device into a weird state when trying to stop and start again quickly.

...I'm inclined to rule out the second.

The firmware just doesn't care about the buffer being full. There isn't really any "backed up" state for the device to get into. The M0 core keeps dumping RX data into a circular buffer unconditionally. If it wraps, old data just gets overwritten. Meanwhile, the M4 core keeps doing bulk transfers from the buffer to the host. If the host doesn't request data quickly enough to keep up with the sampling rate, it just silently loses some data - this happens very easily for all sorts of reasons (slow callbacks, poor I/O bandwidth/scheduling, other devices on the bus, etc) and the device just doesn't care. Even with my work in #982, the situation isn't really different - all that changes is that there's now some stats available out-of-band which tell you how much you lost.
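
To make that concrete, the M0 side amounts to something like this (a toy model for illustration, not the actual firmware code):

    #include <stdint.h>

    /* Toy model: samples go into a fixed-size ring unconditionally,
       and old data is simply overwritten when the write pointer wraps. */
    #define RING_SIZE 32768

    static uint8_t ring[RING_SIZE];
    static unsigned int write_index = 0;

    static void write_sample(uint8_t sample)
    {
        ring[write_index] = sample;
        write_index = (write_index + 1) % RING_SIZE; /* wrap and overwrite */
    }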

And the firmware doesn't seem to be getting into any weird state or locking up. Things stall on the host because hackrf_start_rx() fails, but that doesn't seem to be due to any issue on the device. As soon as the device is closed and reopened by libhackrf, everything works normally again. That closing and opening doesn't really mean anything from the device's point of view. It's not going through a reset or anything.

I think this is purely a host-side issue, with things getting into some bad state due to the interactions between the main thread calling hackrf_stop_rx(), the transfer thread processing libusb events, and the bulk transfers that are still in flight around that time. Something, sometimes, isn't being cleaned up correctly as it all gets torn down, and that leaves things in a state where the next hackrf_start_rx() call fails.

The fact that we see multiple RX callbacks being made between hackrf_stop_rx() being called and that function returning should be a good clue as to what the problem is. Something in there isn't safe - my guess would be that some sort of locking is needed between the work done in cancel_transfers(), which happens in the main thread, and the completion handling being done in hackrf_libusb_transfer_callback(), which happens in the transfer thread.

@martinling
Member

I think I understand what's happening:

  1. The main thread calls hackrf_stop_rx(). That in turn calls cancel_transfers(), which iterates through the 4 transfers in use and cancels them one by one with libusb_cancel_transfer().

  2. At the same time though, one of those transfers is already in the process of completing. The transfer thread calls hackrf_libusb_transfer_callback(), which handles the data and then calls libusb_submit_transfer() to resubmit that transfer. This can happen after cancel_transfers() has cancelled that transfer, so the transfer is now started again.

  3. Now there's actually a transfer still active, despite cancel_transfers() having cancelled all transfers and hackrf_stop_rx() having completed. This is the bad state that shouldn't happen.

  4. Then, you call hackrf_start_rx() again. It calls prepare_transfers(), which tries to submit all 4 transfers, expecting none of them to be in use. But one of them is already submitted, so you get a Resource busy error.

So the solution is to make sure that no in-flight transfer can end up being restarted by the transfer thread after cancel_transfers() has been called. I think that may be as simple as setting some state at the start of cancel_transfers(), which is checked by hackrf_libusb_transfer_callback() before it resubmits a transfer.
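
As a sketch of that idea (the names here are illustrative, not the actual libhackrf code in the fix):

    #include <stdbool.h>
    #include <pthread.h>
    #include <libusb-1.0/libusb.h>

    /* Illustrative shared state guarding transfer resubmission. */
    static pthread_mutex_t transfer_lock = PTHREAD_MUTEX_INITIALIZER;
    static bool transfers_active = false;

    /* Runs in the transfer thread when a libusb transfer completes. */
    static void transfer_callback_sketch(struct libusb_transfer* transfer)
    {
        /* ... hand the received data to the user's RX callback ... */
        pthread_mutex_lock(&transfer_lock);
        if (transfers_active) {
            /* Only resubmit while streaming is still meant to be running. */
            libusb_submit_transfer(transfer);
        }
        pthread_mutex_unlock(&transfer_lock);
    }

    /* Runs in the main thread from hackrf_stop_rx(). */
    static void cancel_transfers_sketch(struct libusb_transfer** transfers, int count)
    {
        pthread_mutex_lock(&transfer_lock);
        transfers_active = false; /* no callback may resubmit after this point */
        for (int i = 0; i < count; i++)
            libusb_cancel_transfer(transfers[i]);
        pthread_mutex_unlock(&transfer_lock);
    }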

martinling added a commit to martinling/hackrf that referenced this issue Dec 31, 2021
Fixes bug greatscottgadgets#916.

Previously, there was a race which could lead to a transfer being left
active after cancel_transfers() completed. This would then cause the
next prepare_transfers() call to fail, because libusb_submit_transfer()
would return an error due to the transfer already being in use.

The sequence of events that could cause this was:

1. Main thread calls hackrf_stop_rx(), which calls cancel_transfers(),
   which iterates through the 4 transfers in use and cancels them one
   by one with libusb_cancel_transfer().

2. During this time, a transfer is completed. The transfer thread calls
   hackrf_libusb_transfer_callback(), which handles the data and then
   calls libusb_submit_transfer() to resubmit that transfer.

3. Now, cancel_transfers() and hackrf_stop_rx() are completed but one
   transfer is still active.

4. The next hackrf_start_rx() call fails, because prepare_transfers()
   tries to submit a transfer which is already in use.

To fix this, we add a lock which must be held to either cancel transfers
or restart them. This ensures that only one of these actions can happen
for a given transfer; it's no longer possible for a transfer to be
cancelled and then immediately restarted.
@martinling linked a pull request on Dec 31, 2021 that will close this issue
@martinling removed the needs validation label on Jan 1, 2022
martinling added a commit to martinling/hackrf that referenced this issue Jan 1, 2022
@gozu42
Contributor

gozu42 commented Jan 2, 2022

The test program still locks up here with the #1029 libhackrf.

$ ./hackrf_loop 
Start
Not started yet
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Valid length: 262144
Finally started
Start
Not started yet
Not started yet
Not started yet
Not started yet
Not started yet
Not started yet
Not started yet

Sometimes there are multiple successful rounds, but it locks up within a few seconds.

libhackrf version: git-b346790e (0.6)
Firmware Version: git-a83bfc4b (API:1.05)

(edit: quote fix)

@martinling
Member

Oh, interesting. On my Windows test system (which was the only one I could reproduce this readily on) it was a complete fix.

The failure you see there looks different though - there's no error reported from hackrf_start_rx(). There might be a firmware-side element to it. Do you get the same with firmware from master, or 2021.03.01?

@gozu42
Contributor

gozu42 commented Jan 2, 2022

Sorry, I can't test with fw 2021.03.01 currently because that lacks the usb-addr-zero fix from #987.

@martinling
Member

I believe all the host-side causes of failing to stop/start RX are now fixed in #1029.

I have confirmed that there is also a firmware-side bug which can cause RX stop/start to appear to succeed, but stall after the first 16KB of data. I've started a separate issue for this in #1042.

martinling added a commit to martinling/hackrf that referenced this issue Feb 3, 2022