Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hackrf_sweep sometimes doesn't output result #1052

Closed
openedhardware opened this issue Feb 23, 2022 · 22 comments · Fixed by #1081
Closed

hackrf_sweep sometimes doesn't output result #1052

openedhardware opened this issue Feb 23, 2022 · 22 comments · Fixed by #1081
Assignees
Labels

Comments

@openedhardware
Copy link

Version information

Operating system:

$ cat /etc/os-release
PRETTY_NAME="Raspbian GNU/Linux 10 (buster)"
NAME="Raspbian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=raspbian
ID_LIKE=debian
HOME_URL="http://www.raspbian.org/"
SUPPORT_URL="http://www.raspbian.org/RaspbianForums"
BUG_REPORT_URL="http://www.raspbian.org/RaspbianBugs"

hackrf_info output:

$ hackrf_info
hackrf_info version: 2021.03.1
libhackrf version: 2021.03.1 (0.6)
Found HackRF
Index: 0
Serial number: 0000000000000000088869dc2a82261b
Board ID Number: 2 (HackRF One)
Firmware Version: 2021.03.1 (API:1.04)
Part ID Number: 0xa000cb3c 0x005d475a

Output

hackrf_sweep sometimes outputs results, and sometimes it doesn't:

$ hackrf_sweep -f 915:925 -1
call hackrf_sample_rate_set(20.000 MHz)
call hackrf_baseband_filter_bandwidth_set(15.000 MHz)
Sweeping from 915 MHz to 935 MHz
Stop with Ctrl-C
0 total sweeps completed, 0.00 sweeps/second
0 total sweeps completed, 0.00 sweeps/second
0 total sweeps completed, 0.00 sweeps/second
0 total sweeps completed, 0.00 sweeps/second
0 total sweeps completed, 0.00 sweeps/second
0 total sweeps completed, 0.00 sweeps/second
0 total sweeps completed, 0.00 sweeps/second
0 total sweeps completed, 0.00 sweeps/second
0 total sweeps completed, 0.00 sweeps/second
0 total sweeps completed, 0.00 sweeps/second
0 total sweeps completed, 0.00 sweeps/second
0 total sweeps completed, 0.00 sweeps/second
^CCaught signal 2

Exiting...
Total sweeps: 0 in 12.02455 seconds (0.00 sweeps/second)
hackrf_stop_rx() done
hackrf_close() done
hackrf_exit() done
exit

Hardware issue or something else?

Cheers!

@straithe straithe self-assigned this Feb 24, 2022
@straithe straithe added the technical support request for technical support label Feb 24, 2022
@martinling
Copy link
Member

I suggest trying again with current git master versions of the firmware & host code. I recently fixed several intermittent start/stop problems in PR #1029.

@metayan
Copy link

metayan commented Mar 24, 2022

Tested on Linux, where I didn't manage to reproduce the problem, even after 101018 iterations of
hackrf_sweep -f 770:970 -w 250000 -l 16 -g 24 -N 1

Used

while true; do
  time LIBUSB_DEBUG=4 timeout 1 \
  ./hackrf-tools/src/hackrf_sweep -f 770:970 -w 250000 -l 16 -g 24 -N 1 \
  2>&1 |ts -i "%.s" |ts -s "%.s" |tee $(echo |ts "%Y%m%d-%H%M%.S" |cut -c 1-22).log;
done

to have full logs with timing info saved in case the issue appears.
(The timeout 1 is needed to automate this, because the "no data" situation has the sweep go into an endless loop.)
Moved the resulting log files into a directory and did a simple
egrep -rL 800000000 .
in there to expose any files that didn't contain actual sweep data.
Nothing reported, so all good here on Linux.

These tests were done with
https://github.com/martinling/hackrf/tree/concurrency-cleanup

Preparing some Darwin tests.

@metayan
Copy link

metayan commented Mar 24, 2022

Found the problem.
On Darwin, running
https://github.com/martinling/hackrf/tree/concurrency-cleanup
https://github.com/martinling/hackrf/tree/fix-stack-variables
(Those changes are not the cause, though, because this issue has been present for a long time, at least on Darwin.)

Sometimes the buffer that rx_callback gets doesn't start with 0x7F 0x7F mark.
Instead the mark comes 512 bytes from the end of the buffer.
Well, maybe that's the beginning of the next block. At least it's 512 bytes off.

Have to go now, so can't analyse further at the moment, but attaching a couple of raw blocks.
Can supply massive log files if needed.

buffer-out-of-sync-part.log.txt

@martinling
Copy link
Member

@metayan it's possible this is the result of the RX buffer overrunning because the host isn't retrieving the data quickly enough. If you run hackrf_debug --state after this has happened, that should confirm if there was a shortfall.

@metayan
Copy link

metayan commented Mar 25, 2022

Thanks for the hackrf_debug --state tip.
Getting misaligned RX buffers without reported shortfalls on Darwin.

On Darwin libusb doesn't exit cleanly. Never has. Maybe related? New issue?

@metayan
Copy link

metayan commented Mar 25, 2022

With some logging added in rx_callback.c

for(j=0; j<BLOCKS_PER_TRANSFER; j++) {
	fprintf(stderr, "rx_callback block %d\n", j);
	ubuf = (uint8_t*) buf;
	int b;
	for (b=0; b<16; b++) fprintf(stderr, "%02X ", ubuf[b]);
	fprintf(stderr, "\n");
	for (b=BYTES_PER_BLOCK-512; b<BYTES_PER_BLOCK-512+16; b++) fprintf(stderr, "%02X ", ubuf[b]);
	fprintf(stderr, "\n");
	fflush(stderr);
	if(ubuf[0] == 0x7F && ubuf[1] == 0x7F) {
		fprintf(stderr, "rx_callback if ubuf[0]\n");
		frequency = ((uint64_t)(ubuf[9]) << 56) | ((uint64_t)(ubuf[8]) << 48) | ((uint64_t)(ubuf[7]) << 40)
				| ((uint64_t)(ubuf[6]) << 32) | ((uint64_t)(ubuf[5]) << 24) | ((uint64_t)(ubuf[4]) << 16)
				| ((uint64_t)(ubuf[3]) << 8) | ubuf[2];
	} else {
		fprintf(stderr, "rx_callback if ubuf[0] else continue\n");
		buf += BYTES_PER_BLOCK;
		continue;
	}

I get

Stop with Ctrl-C
[ 0.045140] [01199bc9] libusb: debug [darwin_async_io_callback] an async io operation has completed
[ 0.045185] [01199bca] libusb: debug [usbi_wait_for_events] poll() returned 1
[ 0.045196] [01199bca] libusb: debug [handle_event_trigger] event triggered
[ 0.045199] [01199bca] libusb: debug [darwin_handle_transfer_completion] handling transfer completion type bulk with kernel status 0
[ 0.045202] [01199bca] libusb: debug [usbi_handle_transfer_completion] transfer 0x7f92bf608850 has callback 0x107a45880
rx_callback 0
rx_callback 1
rx_callback block 0
FE FE FE FE FF FE FE FE FE FE FE FF FE FE FE FE
7F 7F 40 4B 4C 00 00 00 00 00 01 ED 05 0C ED FA
rx_callback if ubuf[0] else continue
rx_callback block 1
0C F4 FA 0F F5 F1 0E 00 F0 09 FF ED 07 0B EE FC
7F 7F 00 2D 31 01 00 00 00 00 FE FD FE FF FE FF
rx_callback if ubuf[0] else continue
rx_callback block 2
FD FE FD FE FE FD FD FF FD FE FE FE FD FE FD FE
7F 7F 40 78 7D 01 00 00 00 00 FD FE FD 00 FD FE
rx_callback if ubuf[0] else continue
rx_callback block 3
FF FE FC FF FD FD FF FF FE FF FE FE FE FF FE FE
7F 7F 00 5A 62 02 00 00 00 00 FE FE FE FF FF FF
rx_callback if ubuf[0] else continue
rx_callback block 4
FE FE FE FF FD FE FD FE FE FE FE FE FE FF FE FE
7F 7F 40 A5 AE 02 00 00 00 00 FE 00 FD FE FD FE
rx_callback if ubuf[0] else continue
rx_callback block 5
FF FD FE FE FE FE FE FF FE FE FE FE FF FF FE FE
7F 7F 00 87 93 03 00 00 00 00 FE FE FD FE FD FD
rx_callback if ubuf[0] else continue
rx_callback block 6
FE FE FD FE FE FE FE FE FE FE FF FE FE FE FE FD
7F 7F 40 D2 DF 03 00 00 00 00 FD FD FE FE FE FF
rx_callback if ubuf[0] else continue
rx_callback block 7
FD FC FE FE FD FE FE FE FF FF FD FE FE FE FD FF
7F 7F 00 B4 C4 04 00 00 00 00 FF FE FD FF FF FD
rx_callback if ubuf[0] else continue
rx_callback block 8
FE FE FD 00 FD FD FE FE FD FF FE FD FE FF FE FF
7F 7F 40 FF 10 05 00 00 00 00 FD FE FD FE FE FF
rx_callback if ubuf[0] else continue
rx_callback block 9
FD FE FE FE FD FE FE FD FF FD FE FF FE FF FE 00
7F 7F 00 E1 F5 05 00 00 00 00 FF FB FE FD FF FC
rx_callback if ubuf[0] else continue
rx_callback block 10
FB 01 FC 01 FD FF FC 01 FC 00 FD 02 FE 02 FF 01
7F 7F 40 2C 42 06 00 00 00 00 FC FD FD 01 01 FD
rx_callback if ubuf[0] else continue
rx_callback block 11
FE 01 FF FE FD FB FB 00 00 01 00 FB FB FE FD 00
7F 7F 00 0E 27 07 00 00 00 00 FE FE FE FE FE FE
rx_callback if ubuf[0] else continue
rx_callback block 12
FE FF FE FE FD FE FE FE FD FE FD FE FE FE FE FE
7F 7F 40 59 73 07 00 00 00 00 FE FE FE FF FE FE
rx_callback if ubuf[0] else continue
rx_callback block 13
FD FE FE FF FD FE FE FE FD FF FD FE FD FF FE FE
7F 7F 00 3B 58 08 00 00 00 00 FD FE FE FE FE FE
rx_callback if ubuf[0] else continue
rx_callback block 14
FD FE FE FE FE FE FE FE FD FF FE FE FE FD FE FD
7F 7F 40 86 A4 08 00 00 00 00 FD FE FD FE FE FE
rx_callback if ubuf[0] else continue
rx_callback block 15
FD FF FE FF FF FF FE FE FE FF FE FF FD FD FE FD
7F 7F 00 68 89 09 00 00 00 00 FD FE FD FE FE FE
rx_callback if ubuf[0] else continue
[ 0.045962] [01199bca] libusb: debug [libusb_submit_transfer] transfer 0x7f92bf608850
[ 0.045966] [01199bca] libusb: debug [ep_to_pipeRef] converting ep address 0x81 to pipeRef and interface
[ 0.045976] [01199bca] libusb: debug [ep_to_pipeRef] pipe 1 on interface 0 matches

and hackrf_debug --state reports

M0 state:
Requested mode: 0 (IDLE) [complete]
Active mode: 0 (IDLE)
M0 count: 17559712 bytes
M4 count: 17547264 bytes
Number of shortfalls: 0
Longest shortfall: 0 bytes
Shortfall limit: 0 bytes
Mode change threshold: 17563648 bytes
Next mode: 0 (IDLE)
Error: 0 (NONE)

The full log:
buffer-out-of-sync-02.log.txt

@martinling
Copy link
Member

Hi @metayan, I'm very confused by all this, but here's some thoughts I have:

  • I'm glad that there's no shortfalls reported, because the current sweep code was designed to make them impossible (if the buffer's full, the sweep should block on the device side until the host retrieves the current buffered data). But I thought it made sense to check.
  • 512 bytes is the payload size of one USB data packet. So the 0x7F 0x7F being offset by that much suggests that a transfer was off by one packet somewhere. But I don't know how or why that would show up at this layer. It's possible for libusb transfers to return early with less data than requested, but that should be handled correctly - libhackrf sets valid_length, and the callback in hackrf_sweep does use that value rather than assuming the length.

@martinling
Copy link
Member

@miek has pointed out that this bug came up previously:

We fixed a really similar bug a while ago that was affecting all platforms. On restarting sweep, a single packet would go missing throwing off the sync for everything else #368. What was happening was that the firmware was re-initialising the endpoints when it shouldn't, so sometimes we'd send a DATA0 instead of DATA1, and the OS would silently drop it. This was the fix: #702

Looking at the fix, I suspect now that I broke this in #1045.

@metayan
Copy link

metayan commented Mar 25, 2022

I find it a bit strange that the offset persists across invocations of hackrf_sweep.
Occasionally (very seldom) it does get out of the error by itself with further sweep tries, but usually a reset is needed.
I'll have a more thorough look - thought that you might go 'aha' and know immediately what's going on when seeing the previous log. ;)

Just so you know, I did get shortfalls when printing the whole buffer - I guess it took way too long.

@metayan
Copy link

metayan commented Mar 25, 2022

Oh, I see your comment regarding #702 and #1045 now.

@martinling
Copy link
Member

Just so you know, I did get shortfalls when printing the whole buffer - I guess it took way too long.

Ah, I was wrong about the current sweep code guarantees. It doesn't guarantee no shortfalls, but it should guarantee that if shortfalls happen, they won't result in lost or corrupted data (which could happen before #982).

When the host is too slow to process the data, the buffer will fill up, resulting in shortfalls as you saw when you were printing out the buffer contents. The only effect of those shortfalls should be on timing, though. The firmware should just wait for there to be buffer space, before continuing with RX.

Oh, I see your comment regarding #702 and #1045 now.

At the moment, #1045 looks like a possible culprit because it touches some of the same code as #702, but it's still not clear what the actual problem with it is. There was a mystery on that PR though that makes me extra suspicious: why we still needed the usb_endpoint_flush calls in the set_transceiver_mode vendor request handler (which runs from the USB0 ISR, so may happen in the middle of the rest of the sweep code).

If you're up for more testing, it might be interesting to test master from just before #1045 was merged, and perhaps also just before #982 as well.

@metayan
Copy link

metayan commented Mar 25, 2022

Happens with 51f1066 (just before #1045)
and also de76404 (just before #982).

@martinling
Copy link
Member

Oh, interesting. So this dates back to before all the relevant work I've done recently.

I guess the next question is: did it work correctly immediately after #702 was merged? If so, then we should be able to bisect from there to find where it broke. And if not, then there was always some other way for this to happen, which wasn't fixed by that change.

@martinling martinling self-assigned this Mar 25, 2022
@martinling martinling added bug and removed technical support request for technical support labels Mar 25, 2022
@metayan
Copy link

metayan commented Mar 26, 2022

Tested with 43e6f99 and 451873d and both can get into the state where no buffers are accepted.

While testing this, I encountered some weirdness with the flashed old firmware, for which I filed #1076 .
Managed to run the tests with the corresponding firmware in RAM.

@martinling
Copy link
Member

OK, seems like this isn't a regression then, but something that wasn't fixed by #702 in the first place.

We'd really like to get a low level USB capture of this happening. Could you share the details of the Darwin system you're testing on - i.e. model and MacOS version?

It's very odd that this seems to be host-dependent.

@metayan
Copy link

metayan commented Mar 26, 2022

Using macOS Darwin 10.15.7
MacBookPro11,3

I'm pretty certain that I have also encountered this issue on Linux - on devuan running on a Raspberry PI4, but that was quite a long time ago.
The original issue reporter was also using Raspberry PI.

There is some weirdness on Darwin during closing that I have just filed as issue #1079 .
I have a slight suspicion that it might be related, but maybe it's just wishful thinking.

@martinling
Copy link
Member

This seems like it could be related - from your debug output on #1079:

[darwin_abort_transfers] calling clear pipe stall to clear the data toggle bit

@martinling
Copy link
Member

I'm looking into finding a system to reproduce this on so that we can capture what's happening on the wire.

I'd be very interested if the missing 512 bytes can be reproduced on something other than MacOS. Just seeing hackrf_sweep stalled doesn't necessarily indicate the same underlying problem.

I'm pretty certain that I have also encountered this issue on Linux - on devuan running on a Raspberry PI4, but that was quite a long time ago. The original issue reporter was also using Raspberry PI.

The OP was running the 2021.03.1 release, and we fixed several other things that could have caused this symptom since then.

We had a lot of hackrf_sweep testing being done on RPi by @gozu42 quite recently, so if current firmware were failing as described then I would have expected to hear about it.

@metayan
Copy link

metayan commented Mar 28, 2022

The fact that this 'no output' issue persists over hackrf_sweep invocations - most often until the device is reset - to me means that some state is kept in the device. Occasionally the device "snaps out of it" by itself, but this happens very seldom.
This leads me to suspect that some variable is not initialised properly in the firmware.
Now, why this would only happen on Darwin is somewhat of a mystery, but then again, the USB handling on Darwin seems to be something else...

I have been tweaking the Darwin libusb code and can get it to close down somewhat cleanly now. Not completely, though. Still figuring it out.

I can build any firmware and host software needed and test on both Linux and Darwin.

@mossmann
Copy link
Member

This seems like it could be related - from your debug output on #1079:

[darwin_abort_transfers] calling clear pipe stall to clear the data toggle bit

On Darwin libusb calls ClearPipeStallBothEnds() when aborting transfers:

https://github.com/tenderlove/libusb/blob/7634714aa696175b08016b6f2185a75a2f55a113/libusb/os/darwin_usb.c#L1505

This sends a ClearFeature(ENDPOINT_HALT) to the device:

https://opensource.apple.com/source/IOUSBFamily/IOUSBFamily-198.4.3/Documentation/IOUSBLib/Functions/InterfaceInterface.html#ClearPipeStallBothEnds

According to the USB spec (sections 8.5.2 and 9.4.5), the data toggle should be initialized to DATA0 in two conditions: at configuration and when handling ClearFeature(ENDPOINT_HALT). Our firmware does this at configuration, but we have no handling of ClearFeature(ENDPOINT_HALT).

I think we need to add handling of USB_STANDARD_REQUEST_CLEAR_FEATURE to usb_standard_request() in usb_standard_request.c.

@martinling
Copy link
Member

We've reproduced this just now on a MacBookPro16,1 running macOS Monterey 12.3, and I have a LUNA analyzer capture of it.

The change in #1081 failed to fix it, and hung things worse. I have a capture of that happening too.

The captures are big, so I'll need to dig through and extract the interesting parts before posting here.

@martinling
Copy link
Member

Thanks to some corrections to #1081 from @miek, this is now fixed by that PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants