-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Repeated Start/Stop RX Produces Error #916
Comments
Have you tried reproducing this issue on a Linux desktop or laptop that is not a Raspberry Pi? |
Yes - I have tried this on macOS natively and in a Debian Virtualbox instance w/ macOS host. |
i can repro this with the supplied example code on both a raspi4 and fedora-vm. putting the device open/close inside the loop seems to reliably prevent the problem. slowing down the loop (sleep(1)s between start/stop and/or after stop) reveals a start_rx error that is not visible in the unsleeping loop.
in this example, note the rx_callback firing ("Valid length") after the Stop, and still firing even though the start "failed". so my guess is stop_rx is not fully cleaning up in-flight transfers, and that messes up some libhackrf or libusb internal state. it also shows something that seems to be timing-dependent. beware the example code has some "odd" minor issues, like "result |=" assignment (so an earlier error will be reported again in the wrong place) and some of the error reporting using the wrong thing to report (the stop_rx error report claims it is start_rx). |
@xaki23 interesting, and I'm glad you were able to reproduce this too. My experience was that I got much more time before things started failing on the RPi as compared to my VM. I also noticed that putting open/close in the loop mitigates the error, but this is obviously not ideal as far as fast repeated sampling goes. I noticed that with the 2018 HackRF firmware release things are quite a bit more stable, although the problem still persists. This program runs for about 3x longer with the 2018 firmware. Can you try re-flashing your HackRF with the older firmware and confirming? |
Using the GIST as a starting point, I was also able to reproduce the behavior described. The device sometimes getting stuck when calling hackrf_start_rx() with an error of Resource busy (-1000) immediately after a calling hackrf_stop_rx(). I don't have a full understanding of the underlying library and driver yet, so I'm still digging in. Preliminary findingsI have found that adjusting the sample rate exacerbates or alleviates the issue. For example, I cannot get it to "get stuck" at sample rates at or below 14e6. I don't think this has anything actually to do with the sample rate, but probably at high to max sample rates, buffers somewhere tend to be "more full" causing some kind of buffer issue somewhere under certain conditions. I have also found another way of triggering this behavior consistently on any sample rate and setting, totally by accident. It requires a bit of explanation for non-windows users though: If a right mouse click is issued to a command prompt window on Windows, the command prompt enters a "cursor highlight mode" of sorts, making it easy to select text from the console without it scrolling. This has another side-effect: It pauses the running process, which is incredibly frustrating when you don't realize there's a small highlight somewhere and you're waiting ages for some command to finish... but I digress. What I found was that using the given GIST code, any amount of highlighting the window (a.k.a. pausing the process) even for less than a second, triggers this issue. These preliminary results are leading me down two paths of investigation:
Additionally, after digging a bit through libhackrf code, it seems someone encountered this very issue (or something similar to it) and attempted to mitigate it a little while back by adding a 10ms sleep on the call to hackrf_stop_rx(): b4ea51a Regarding receiving data after stop as @xaki23 mentions, while I too see the same console outputs, I'm not totally convinced the callback is being fired after calling stop. It's incredibly difficult for me to tell what thread is writing what and when with just console output. It could very well be that the receive callback is being triggered as expected while libhackrf is chugging away through it's stop code. I have not encountered a single instance in my testing so far of actually having the callback triggered honest-to-god after I printed the result from hackrf_stop_rx(). Code for reference to make sense of the log below
My (slightly modified) logs: I've tried to redact some of the redundant output for readability.
So I guess that's a very verbose way of saying... ¯\(ツ)/¯ |
Thanks for reporting, reproducing, and investigating this, everyone! It seems we might have a bug on our hands and I've tagged this issue as such. I don't think anyone here at GSG has the cycles to tackle this right now, but we will get to it when we can. For now, if anyone keeps experimenting with this issue, please keep adding your findings as it may help us or someone in the community create a solution eventually! |
@straithe I need this rapid start/stop functionality for an end application. Is there any initial guidance you can provide on how to troubleshoot this? Is there any reason you can think of why the 2018 firmware release is more stable? |
No, sorry. Also, the person on the team I would get to look at this is on a much needed vacation for a few weeks. I can try to get this higher up their stack once they get back. |
No worries - thanks. I'll start poking around. |
I've brought this up to the team! Do you have any other updates from poking around? |
Not much unfortunately-I've been mainly using the 2018 release because I can get my applications to run for a few minutes instead of seconds before locking up, but it still stops eventually. |
I've started looking at this, but it's been taking about 30-40 minutes for the problem to show up on my system, so it's going to be a slow business. It's definitely still an issue though, I've reproduced it on both 2021.03.01 and current master. |
A few further notes, building on @dmaltsiniotis' observations above:
Of these two proposed explanations:
...I'm inclined to rule out the second. The firmware just doesn't care about the buffer being full. There isn't really any "backed up" state for the device to get into. The M0 core keeps dumping RX data into a circular buffer unconditionally. If it wraps, old data just gets overwritten. Meanwhile, the M4 core keeps doing bulk transfers from the buffer to the host. If the host doesn't request data quickly enough to keep up with the sampling rate, it just silently loses some data - this happens very easily for all sorts of reasons (slow callbacks, poor I/O bandwidth/scheduling, other devices on the bus, etc) and the device just doesn't care. Even with my work in #982, the situation isn't really different - all that changes is that there's now some stats available out-of-band which tell you how much you lost. And the firmware doesn't seem to be getting into any weird state or locking up. Things stall on the host because I think this is purely a host-side issue, with things getting into some bad state due to the interactions between the main thread calling The fact that we see multiple RX callbacks being made between |
I think I understand what's happening:
So the solution is to make sure that no in-flight transfer can end up being restarted by the transfer thread after |
Fixes bug greatscottgadgets#916. Previously, there was a race which could lead to a transfer being left active after cancel_transfers() completed. This would then cause the next prepare_transfers() call to fail, because libusb_submit_transfer() would return an error due to the transfer already being in use. The sequence of events that could cause this was: 1. Main thread calls hackrf_stop_rx(), which calls cancel_transfers(), which iterates through the 4 transfers in use and cancels them one by one with libusb_cancel_transfer(). 2. During this time, a transfer is completed. The transfer thread calls hackrf_libusb_transfer_callback(), which handles the data and then calls libusb_submit_transfer() to resubmit that transfer. 3. Now, cancel_transfers() and hackrf_stop_rx() are completed but one transfer is still active. 4. The next hackrf_start_rx() call fails, because prepare_transfers() tries to submit a transfer which is already in use. To fix this, we add a lock which must be held to either cancel transfers or restart them. This ensures that only one of these actions can happen for a given transfer; it's no longer possible for a transfer to be cancelled and then immediately restarted.
Fixes bug greatscottgadgets#916. Previously, there was a race which could lead to a transfer being left active after cancel_transfers() completed. This would then cause the next prepare_transfers() call to fail, because libusb_submit_transfer() would return an error due to the transfer already being in use. The sequence of events that could cause this was: 1. Main thread calls hackrf_stop_rx(), which calls cancel_transfers(), which iterates through the 4 transfers in use and cancels them one by one with libusb_cancel_transfer(). 2. During this time, a transfer is completed. The transfer thread calls hackrf_libusb_transfer_callback(), which handles the data and then calls libusb_submit_transfer() to resubmit that transfer. 3. Now, cancel_transfers() and hackrf_stop_rx() are completed but one transfer is still active. 4. The next hackrf_start_rx() call fails, because prepare_transfers() tries to submit a transfer which is already in use. To fix this, we add a lock which must be held to either cancel transfers or restart them. This ensures that only one of these actions can happen for a given transfer; it's no longer possible for a transfer to be cancelled and then immediately restarted.
the test program still locks up here with #1029 libhackrf.
sometimes with multiple successful rounds, but lockup within a few seconds.
(edit: quote fix) |
Oh, interesting. On my Windows test system (which was the only one I could reproduce this readily on) it was a complete fix. The failure you see there looks different though - there's no error reported from |
sorry, cant test with fw 2021.03.01 currently because that lacks the usb-addr-zero fix from #987 |
Fixes bug greatscottgadgets#916. Previously, there was a race which could lead to a transfer being left active after cancel_transfers() completed. This would then cause the next prepare_transfers() call to fail, because libusb_submit_transfer() would return an error due to the transfer already being in use. The sequence of events that could cause this was: 1. Main thread calls hackrf_stop_rx(), which calls cancel_transfers(), which iterates through the 4 transfers in use and cancels them one by one with libusb_cancel_transfer(). 2. During this time, a transfer is completed. The transfer thread calls hackrf_libusb_transfer_callback(), which handles the data and then calls libusb_submit_transfer() to resubmit that transfer. 3. Now, cancel_transfers() and hackrf_stop_rx() are completed but one transfer is still active. 4. The next hackrf_start_rx() call fails, because prepare_transfers() tries to submit a transfer which is already in use. To fix this, we add a lock which must be held to either cancel transfers or restart them. This ensures that only one of these actions can happen for a given transfer; it's no longer possible for a transfer to be cancelled and then immediately restarted.
I think that this is still a problem on the most recent release (2021.03.1). I wrote a C program that can reproduce an error where, after a few repeated start rx/stop rx calls in quick succession, the TX light stays on and the rx callback is never triggered.
Here is my hackrf_info output (platform is RPi 3 raspbian):
libhackrf is not installed globally - instead I'm manually linking in the compiled libhackrf 2021.03.1 binary.
Check out this gist for the program I'm using.
Below is the expected output. The rx should start, maybe get a few cycles before the rx callback is called, and then should restart again indefinitely.
Below is the output I get after letting the program run for about 10 seconds (sometimes more, sometimes less). Eventually, the callback is never triggered and the program gets stuck on "Not started yet" forever.
Is there a way to fix this to allow for start_rx and stop_rx to be called in quick succession?
Originally posted by AlexFWulff in #394 (comment)
The text was updated successfully, but these errors were encountered: