USB disconnects and reconnects cause kernel panics #1272
Comments
I am able to crash Raspbian Wheezy 3.18.11+ with the same script, although the trace is different |
I updated to the 4.4.0 development kernel and the problem still occurs. It's worth noting that having multiple USB devices plugged in seems to make it easier to reproduce (i.e. fewer cycles of USB reset are needed to crash), but it can occur with just a keyboard plugged in. |
Here's the syslog output after one run of my script:
This is on kernel 4.4.0+, with the same USB configuration as listed in the original post. This time it shows a warning, but this doesn't always happen. If I leave the script running in a loop, Linux eventually dies with one of the other error messages listed earlier.
|
The reason for the 4.4 backtrace seems to be the same issue that we had with the i2s driver: dma_free_coherent being called with interrupts disabled. dwc_otg_hcd_qh_free calls DWC_DMA_FREE inside a spinlock_irqsave block. It looks like preparation for using a DMA pool was started but never finished, and everything is "#if 0"ed out. |
Here's another backtrace from 4.1.13 |
Here's kern.log showing multiple oops and eventual panic on 4.4.1 |
Another log, different backtrace |
Updated the kernel using apt-get to 4.1.17+ #834, still getting a crash, but this time it says kernel BUG at drivers/usb/host/dwc_otg/dwc_otg_hcd_intr.c:2425! Full crash log: crash 4.1.17.txt Is anyone still maintaining the dwc_otg driver? I see there is a dwc2 driver in mainline, should I switch to that instead? |
It's possible that memory's being leaked when disconnecting/reconnecting as dwc_otg will stop and start the host driver. dwc2 is the upstream driver but lacks FIQ support. If you have no low- or full-speed devices then it will probably work OK. It is the go-to for Pi Zero in device mode, however. |
I found usb_modeswitch 2.3.0-1 (on archlinux) to be the cause of my kernel panics when I plug in/out my usb modem. When I downgrade the package to the previous version (2.2.6-1), all is fine. |
@tpfkanep This particular problem doesn't just happen with the USB modem. I can reproduce it with a USB audio card or other devices as well. Having multiple USB devices seems to make it easier to reproduce though. |
Seen again on 4.4.6+ #861:
|
A user on the usb_modeswitch forum posted the following solution (working great on my Pi for a couple of days now):
http://www.draisberghof.de/usb_modeswitch/bb/viewtopic.php?f=4&t=2498 |
@tpfkanep Thanks for the info. This might fix your particular issue, but as previously mentioned, I can reproduce the kernel panics even without usb_modeswitch and a USB modem, e.g. with a USB audio card. |
@jkuek has your issue been resolved? If so, please close this issue. Thanks. |
I experience what looks like the same crash. This Raspberry Pi B+ is headless, uses the onboard Ethernet adapter and has 3 USB devices plugged in. I connected another, otherwise unused, B+ to the failing Pi's on-board serial console to capture the crash, see below. This is with a pure Debian sid armel install plus
>>> import time
>>> time.ctime(246972.664501)
'Sat Jan 3 20:36:12 1970'
(i.e. the crash timestamp represents an offset of 2 days and 20 hours since kernel boot)
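The same conversion can be done without going through a calendar date, which avoids the off-by-one that comes from January 1 being day zero. A minimal sketch, assuming the number in the kernel log is seconds since boot (the function name `uptime_parts` is made up for this example):

```python
def uptime_parts(seconds_since_boot):
    """Split a kernel log timestamp (seconds since boot) into d/h/m/s."""
    whole = int(seconds_since_boot)          # drop the fractional part
    days, rem = divmod(whole, 86400)         # 86400 seconds per day
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return days, hours, minutes, seconds


# Timestamp taken from the crash log quoted above.
print(uptime_parts(246972.664501))  # (2, 20, 36, 12)
```

So the crash happened 2 days, 20 hours and 36 minutes after boot, which matches the 20:36:12 time of day in the ctime output.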
|
Please provide the output of
|
That Pi is used to stream TV on my LAN. The following are plugged in:
The workload at crash time is: In the case of this crash, a single host was connected. Crashes happen with a rather low probability compared to the number of transmitted packets (TCP bandwidth when streaming is peer*2MB/s, peer being 1 or 0 in practice). With about an hour of streaming per day, crashes happen once or twice a week. I expect the TV tuner to also send 2MB/s of data and the smartcard reader just a tiny amount (not measured, though). So, at the USB level, I expect the relevant pieces to be:
I rebooted from kdb, and have built a kernel with vmlinux image:
$ git checkout rpi-4.4.y
$ git rev-parse HEAD
5e46914b3417fe9ff42546dcacd0f41f9a0fb172
$ make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- bcmrpi_defconfig
$ nice -n 19 make -j16 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- zImage vmlinux modules dtbs
and just booted it. Next time, I should have an actually usable kgdb...
@Hexxeh: Could you build and make vmlinux available whenever you make a release? It would help debugging.
Here are the descriptors, minus serial numbers:
|
I haven't tried recently, but when I originally reported the issue, I was able to reproduce it fairly often by disconnecting/reconnecting USB devices from the RPi. The disconnects were either manual, or triggered by changing the buspower file as documented. It was definitely easier to reproduce with multiple USB devices plugged in, but the issue did still occur with a single USB device (I tried both USB audio and a USB 3G modem). My conclusion is that something goes wrong when multiple USB devices (or a composite device with many interfaces) are enumerating... |
I got a similar crash and have a working gdb this time. It turns out I needed more than just vmlinux to get full debug symbols. The crash is still in kdb output:
reassembled gdb session:
So the null pointer dereference happens when removing an entry from I'll keep reading the code and report if I find something - but if anyone is familiar with this code, please do not wait for me. [edit]: a bit more from the gdb session:
[edit2] Dumping details from
So
|
So, after more investigations... The The issue seems to revolve around |
For reference, I switched to the dwc2 driver (loading the proper devicetree overlay), and have not had a crash so far. If I understand correctly, the dwc2 module lacks some ARM-specific IRQ-ish mechanism that the out-of[-vanilla]-tree driver has. As the hc_ptr_array items are produced in the interrupt handler, it makes some sense. At a glance, I did not notice a system load increase, but this RPi is comfortably overspecced for what it does, so it would take a large one for an increase to be noticeable. |
Thanks for the update. Last time I tried the dwc2 driver, I had some problems with other USB full-speed devices, eg. keyboards. I'm not sure whether that is still the case. |
There have been several bugfixes that may have a bearing on this issue. Please retest with a Pi that has the latest rpi-update kernel. |
I re-ran the tests today, and could still cause the system to throw a kernel panic. I've attached the console output. For completeness, here is my USB set-up:
To trigger the panic, I am running this script:
which calls this script:
I ran it a few times; the issue appears to happen randomly. |
I'm running the same script (with a few modifications) with 3x full-speed devices and a keyboard in addition to a connected eth0 interface. I'm not seeing a crash, which implies that whatever is happening is a function of what devices you have connected (:() |
Ah. I have seen one crash after ~30 minutes of runtime. Investigating... |
I've also seen two crashes where in dwc_otg_hcd_handle_hc_fsm(), qh->qtd_list ends up being NULL. Unsure if this is related to having a corrupted free_hc_list. I have found that hcd->available_host_channels gets steadily incremented above the hardware maximum of 8 channels as we potentially increment it twice when dequeueing a transfer currently being performed by the FIQ. This may or may not be enough to trigger double-frees/double-adds in free_hc_list... |
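To make the counter drift concrete, here is a toy model in Python. It is not the dwc_otg code: the class `ChannelCounter`, the function `dequeue_in_flight` and the `checked` flag are all invented for this sketch, and the only number taken from the thread is the hardware maximum of 8 channels. It assumes that dequeueing an in-flight transfer ends up releasing the same channel on two code paths.

```python
MAX_CHANNELS = 8  # hardware maximum quoted in the comment above


class ChannelCounter:
    """Toy stand-in for a count of available host channels (hypothetical)."""

    def __init__(self, checked):
        self.available = MAX_CHANNELS
        self.checked = checked   # True: ignore a release of a channel not held
        self.held = set()

    def acquire(self, channel):
        self.held.add(channel)
        self.available -= 1

    def release(self, channel):
        if self.checked and channel not in self.held:
            return               # second release of the same channel is a no-op
        self.held.discard(channel)
        self.available += 1


def dequeue_in_flight(counter, channel):
    """Model a dequeue where two code paths both release the channel."""
    counter.release(channel)     # first release, from the dequeue path
    counter.release(channel)     # second release, from a later cleanup path


def run(checked, cycles=5):
    counter = ChannelCounter(checked)
    for _ in range(cycles):
        counter.acquire(0)
        dequeue_in_flight(counter, 0)
    return counter.available


print(run(checked=False))  # 13: the count has drifted above MAX_CHANNELS
print(run(checked=True))   # 8: the count stays at the hardware maximum
```

Each unchecked cycle subtracts one and adds two, so the count climbs by one per dequeue, which is the "steadily incremented above the hardware maximum" behaviour described above.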
Smoking gun located: we modify host channel lists, queue heads and URB lists in |
Sounds like great progress. As I've mentioned previously, different combinations of USB devices do seem to influence how often the crash occurs, but I've seen it fail with just a USB keyboard attached. |
@P33M There was a comment in this issue from HiassofT on 4 Feb 2016. I don't claim to understand it, but it might help get to the bottom of what's going on. |
The more I dig the more wrong-ness I keep finding...
|
As far as point 2) goes, I think I will swap away from root port disconnect crashes while I have a more reproducible bug pending (queueing transactions without a sane initial FIQ state). |
Point 3) is a small wild goose chase - because there are only two non-periodic transaction queues (inactive and active), on each scheduling call I've fixed several potential crash-inducing conditions due to the FIQ racing against |
I believe I have a fix for the list corruption - in The corruption happens because if the CPU is slow in executing the second unnecessary disable, a bit in HAINT is set as the channel already halted, which then triggers The fix is to just let the dequeue code path run to completion and delete the second cleanup. |
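A rough way to picture why a second cleanup is dangerous for a free list: if the same channel is returned twice, it can later be handed to two transfers at once. The sketch below is a toy model under that assumption only. It does not reproduce the driver's list handling or the HAINT timing, it uses a hypothetical pool of 4 channels, and `simulate` is a made-up name.

```python
from collections import deque


def simulate(double_cleanup):
    """Return the order in which channels are handed out after one dequeue."""
    free = deque(range(4))       # hypothetical pool of 4 free channels
    channel = free.popleft()     # channel 0 is assigned to a transfer
    free.append(channel)         # dequeue path returns it to the free list
    if double_cleanup:
        free.append(channel)     # a second cleanup returns it again
    # Hand out every entry currently on the free list.
    return [free.popleft() for _ in range(len(free))]


print(simulate(double_cleanup=False))  # [1, 2, 3, 0]
print(simulate(double_cleanup=True))   # [1, 2, 3, 0, 0]
```

With the second cleanup removed, every channel appears on the list exactly once, which is the property the fix described above restores.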
I have run the test script for a few hours - so far so good.
I've left the unit to run overnight, will report again tomorrow. As an aside, I've seen issues on current 4.4.x kernels where sometimes disconnecting/reconnecting my 3G modem can seemingly cause all other USB devices to disconnect and reconnect. I had put this down as a probable hardware problem (perhaps power-related?), but have you seen anything in your recent travels that might suggest a software reason? |
3G modems typically play fast and loose with the USB maximum Vbus current draw (500mA). Peak loading when the modem is transmitting can exceed several watts. I can well imagine that the power supplies draw a surge on hotplug that upsets Vbus on the Pi. I've run the disconnect/reconnect script overnight with 3 full-speed devices plugged in, no crashes resulted. Closing as fixed. |
Ah yes, but in my case the modem is externally powered, and it's just a USB connection to the Pi. I concur with your findings, I ran the script overnight and the unit is still running! Thanks. Will those syslog error messages be in the official release? The additional extra empty line could be removed. |
@P33M just noticed one issue: I stopped the script after running it for > 24 hours, but my USB network and 3G devices were no longer working. My USB devices were present:
However, neither eth0 nor wwan0 (3G modem) is accessible.
I checked the syslog, and dhcpcd.service failed about 2 hours after starting the test.
After restarting dhcpcd (systemctl restart dhcpcd), everything is ok again. |
So eth0 came back without a reboot, correct? |
As reported, eth0 seemed to be OK from a driver perspective (it was present in /sys/class/net) but did not show in the list when I ran ifconfig. |
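An interface that exists in /sys/class/net but is missing from plain ifconfig output is usually present but not up, since ifconfig without -a only lists active interfaces. A minimal sketch for checking this directly, assuming a Linux sysfs layout where each interface directory has an `operstate` file (the function name `interface_states` is made up for this example):

```python
import os


def interface_states(sys_net="/sys/class/net"):
    """Map each interface present in sysfs to its reported operstate."""
    states = {}
    for name in sorted(os.listdir(sys_net)):
        path = os.path.join(sys_net, name, "operstate")
        try:
            with open(path) as f:
                states[name] = f.read().strip()
        except OSError:
            # Entry exists but has no readable operstate attribute.
            states[name] = "unknown"
    return states


if os.path.isdir("/sys/class/net"):
    # Prints something like {'eth0': 'down', 'lo': 'unknown'} on a Linux host.
    print(interface_states())
```

An interface reported as "down" here would match the symptom described: the driver has registered the device, but nothing has brought the link up again.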
@P33M do you know if your fix (or some other fix) landed upstream? I'm having similar hangs on a desktop running stock 4.4.0 with a custom USB device that's externally powered. |
The fixes are specific to the dwc_otg driver, which is a downstream implementation. Also, unless your desktop hardware is powered by a mobile phone SoC using an OTG USB core, it's highly unlikely that this issue is the one you're seeing. |
Ugh, thanks, I'll keep digging! |
@P33M I'm getting a lot of issues similar to what you describe when running webcams over USB: I sporadically get dmesg messages, and eventually get the following, which requires a power cycle to recover from. I'm running, Is the fix you mention in this release?
|
I'm experiencing an infrequent crash, which has a number of reported errors:
Other common messages seen are
"Unable to handle kernel paging request at virtual address 0x......"
In normal use, this seems to occur because one or more USB devices have spontaneously disconnected and reconnected. The root cause of this is still unknown (but outside the scope of this issue).
However I am able to make the OS crash easily by running the following script to force USB devices to reconnect:
I have a second script which runs this in a loop (with a "sleep 2" in between each cycle), and this results in a kernel panic after a few minutes.
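The looping wrapper can be sketched as follows. This is not the original script: `run_cycles` is a made-up name, and the command that forces the USB reset is passed in by the caller because the reset script itself is not reproduced here. The 2 second default delay mirrors the "sleep 2" mentioned above.

```python
import subprocess
import time


def run_cycles(command, cycles, delay_s=2.0):
    """Run `command` repeatedly, sleeping `delay_s` seconds between cycles.

    Returns the list of exit codes, one per cycle.
    """
    exit_codes = []
    for i in range(cycles):
        completed = subprocess.run(command)
        exit_codes.append(completed.returncode)
        if i < cycles - 1:
            time.sleep(delay_s)   # pause between cycles, not after the last one
    return exit_codes


# Example call, with a hypothetical path to the reset script:
# run_cycles(["./usb_reset.sh"], cycles=100)
```

Collecting the exit codes makes it easier to tell a failing reset command apart from a kernel that has stopped responding.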
I'm not running any other applications at the time, and this is at the moment just a fresh install of Raspbian Jessie Lite.
I'm using an RPi B+, with the following USB connections:
Port 2 - keyboard
Port 4 - Sierra Wireless 3g modem
Port 5 - Audio card
I've attached the output of lsusb -v, lsusb.txt
I initially thought it might be related to one of the USB devices, but I tried unplugging the modem and audio card but the problem still occurred. Interestingly, I have never seen the crash if I don't have any USB devices connected.
I originally noticed the problem on Raspbian Jessie Lite, kernel 4.1.13+. I also tried running rpi-update to update the kernel to 4.1.16+, but the problem persisted.
I'm not sure how to get the output of the kernel dumps: after the problem occurs, I reset the unit and there's no kernel info in the logs. Some photos are here:
What other information can I provide to help resolve the issue?