Kernel OOPS with high CPU load, related to internal wired network adaptor on 3B+. LUKS volume also unmounts. #3843

Closed
hamishmb opened this issue Sep 9, 2020 · 24 comments

Comments

@hamishmb

hamishmb commented Sep 9, 2020

Describe the bug
Periodically (and apparently at random), the LUKS volume (formatted ext4) that I use as a NAS volume on my Pi 3B+ gets unmounted. I recently realised this happens at the same time as a kernel OOPS related to the network driver (output below). Nothing in the dmesg output suggests the drive is at fault, and the volume gets unmounted even with the mount option "errors=remount-ro". Sometimes the network also dies and the cable needs to be unplugged and replugged.

To reproduce
Unfortunately probably difficult to reproduce. Related bug report at https://bugs.launchpad.net/ubuntu/+source/linux-raspi2/+bug/1861936

Expected behaviour
No kernel OOPS, no network dropout, and the LUKS volume stays mounted.

Actual behaviour
Below are relevant kernel messages (apologies for slightly bizarre wrapping):

[983261.923799] ------------[ cut here ]------------
[983261.923836] NETDEV WATCHDOG: enxb827eb7194f1 (lan78xx): transmit queue 0 timed out
[983261.923915] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:466 dev_watchdog+0x2b0/0x2b8
[983261.923919] Modules linked in: xt_recent dm_crypt aes_neon_bs
aes_neon_blk crypto_simd cryptd aes_arm64 algif_skcipher af_alg 8021q
garp stp llc dm_mod brcmfmac brcmutil sg cfg80211 joydev evdev
bcm2835_codec(C) bcm2835_v4l2(C) rfkill v4l2_mem2mem
bcm2835_mmal_vchiq(C) v4l2_common videobuf2_vmalloc videobuf2_dma_contig
videobuf2_memops videobuf2_v4l2 videobuf2_common videodev
raspberrypi_hwmon media hwmon vc_sm_cma(C) uio_pdrv_genirq uio
nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 xt_hl ip6_tables ip6t_rt
nf_log_ipv4 nf_log_common ipt_REJECT nf_reject_ipv4 xt_multiport xt_LOG
nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nft_compat
nft_counter nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp
nf_nat nf_conntrack_ftp nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
nf_tables nfnetlink i2c_dev
[983261.924031] ip_tables x_tables ipv6
[983261.924051] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G C 4.19.118-v8+ #1311
[983261.924056] Hardware name: Raspberry Pi 3 Model B Plus Rev 1.3 (DT)
[983261.924061] pstate: 80000005 (Nzcv daif -PAN -UAO)
[983261.924067] pc : dev_watchdog+0x2b0/0x2b8
[983261.924071] lr : dev_watchdog+0x2b0/0x2b8
[983261.924074] sp : ffffff8008013d70
[983261.924077] x29: ffffff8008013d70 x28: ffffffc03c59ba00
[983261.924083] x27: ffffff8008cc5000 x26: 0000000000000140
[983261.924090] x25: 00000000ffffffff x24: ffffffc03b24d480
[983261.924102] x23: ffffffc03b24d45c x22: ffffffc03b306280
[983261.924114] x21: ffffff8008cc6000 x20: ffffffc03b24d000
[983261.924119] x19: 0000000000000000 x18: ffffff8008cc8688
[983261.924125] x17: 0000000000000000 x16: 0000000000000000
[983261.924130] x15: ffffff8008df8e78 x14: 74756f2064656d69
[983261.924135] x13: 7420302065756575 x12: 712074696d736e61
[983261.924140] x11: 7274203a29787838 x10: 0000000000000000
[983261.924145] x9 : 0000000000016644 x8 : 0000000000000000
[983261.924150] x7 : ffffff8008cc8688 x6 : ffffffc03e5a20a0
[983261.924155] x5 : 0000000000000000 x4 : fffffffffffffff8
[983261.924160] x3 : 0000000000000000 x2 : 0000000000000004
[983261.924167] x1 : f1fa254012d51300 x0 : 0000000000000000
[983261.924179] Call trace:
[983261.924187] dev_watchdog+0x2b0/0x2b8
[983261.924199] call_timer_fn+0x34/0x1c8
[983261.924207] expire_timers+0xbc/0x150
[983261.924215] run_timer_softirq+0xb4/0x1a8
[983261.924224] __do_softirq+0x17c/0x3cc
[983261.924232] irq_exit+0xe8/0xf8
[983261.924238] __handle_domain_irq+0x90/0x100
[983261.924243] bcm2836_arm_irqchip_handle_irq+0x68/0xd8
[983261.924248] el1_irq+0xb4/0x130
[983261.924254] arch_cpu_idle+0x30/0x1f0
[983261.924260] default_idle_call+0x24/0x40
[983261.924265] do_idle+0x224/0x240
[983261.924270] cpu_startup_entry+0x28/0x30
[983261.924276] secondary_start_kernel+0x188/0x1d8
[983261.924280] ---[ end trace b2a07353fff8d125 ]---

System
Copy and paste the results of the raspinfo command in to this section. Alternatively, copy and paste a pastebin link, or add answers to the following questions:

  • Which model of Raspberry Pi? Pi 3B+
  • Which OS and version (cat /etc/rpi-issue)?
    Raspberry Pi reference 2016-03-18
    Generated using Pi-gen, https://github.com/RPi-Distro/Pi-gen, stage4
  • Which firmware version (vcgencmd version)?
    Aug 19 2020 17:40:36
    Copyright (c) 2012 Broadcom
    version e90cba19a98a0d1f2ef086b9cafcbca00778f094 (clean) (release) (start_cd)
  • Which kernel version (uname -a)?
    Definitely happens on the previous stable 4.19 kernel. Unfortunately I don't have the uname output for that, but the OOPS says 4.19.118-v8+. This is my current kernel (from repos):
    Linux raspberrypi 5.4.51-v8+ #1333 SMP PREEMPT Mon Aug 10 16:58:35 BST 2020 aarch64 GNU/Linux

I'm happy to wait until this happens again to debug it if needed. It also happens in Ubuntu kernels (linked report above) that are similar to this version, though, so I suspect it is still present.
Logs
See above.

Additional context
Summarising above info:

  • LUKS volume (ext4) on external HDD gets unmounted without warning every time the OOPS happens.
  • Occurs on an IPv4-only network.
  • Network cable sometimes needs unplugging and replugging for any communication to work (but not always).
  • Seems to be most likely to happen under high CPU load.
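
For anyone wanting to correlate the unmounts with the network driver, the watchdog line from the trace can be picked out of a saved kernel log. The following is only a minimal sketch: the function name is made up for illustration, and the pattern is derived from the single message quoted in this report, so other kernels may word it differently.

```python
import re

# Pattern based on the one watchdog line quoted in this report.
# \s+ tolerates hard line wraps in a pasted log, where "transmit"
# and "queue" can end up on separate lines.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG:\s+(?P<iface>\S+)\s+\((?P<driver>[^)]+)\):\s+"
    r"transmit\s+queue\s+(?P<queue>\d+)\s+timed\s+out"
)


def find_watchdog_timeouts(log_text):
    """Return (interface, driver, queue) for each transmit-queue timeout."""
    return [
        (m.group("iface"), m.group("driver"), int(m.group("queue")))
        for m in WATCHDOG_RE.finditer(log_text)
    ]


# Sample text copied from the trace above, including the line wrap.
sample = (
    "[983261.923836] NETDEV WATCHDOG: enxb827eb7194f1 (lan78xx): transmit\n"
    "queue 0 timed out\n"
)
print(find_watchdog_timeouts(sample))
```

Comparing the timestamps of any matches with the time the volume disappeared would show whether the two events really coincide.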
@pelwell
Contributor

pelwell commented Sep 9, 2020

Can you try with dtoverlay=dwc2 in config.txt? That will select the upstream USB driver that behaves differently, and the result of the test will help to guide our investigation.
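
A quick way to confirm the line is really active in config.txt (and not commented out) is a small check like the one below. This is a sketch under assumptions: the helper name is hypothetical, the sample text is illustrative, and conditional section filters such as [pi4] are ignored for simplicity.

```python
def has_dtoverlay(config_text, overlay):
    """Return True if an uncommented dtoverlay=<overlay> line is present.

    Simplification: section filters like [pi4] or [all] are not evaluated.
    """
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line.startswith("dtoverlay="):
            continue
        value = line[len("dtoverlay="):]
        name = value.split(",", 1)[0].strip()  # overlay parameters follow a comma
        if name == overlay:
            return True
    return False


# Illustrative config text, not taken from the reporter's system.
sample_config = "# USB driver test\ndtoverlay=dwc2\n"
print(has_dtoverlay(sample_config, "dwc2"))
```

On a real system the text would be read from the boot partition's config.txt; the exact path depends on the OS release, so it is left out here.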

@hamishmb
Author

hamishmb commented Sep 9, 2020

Sure, but shall I first wait to be 1000% sure it still occurs with the 5.4 kernel? Might take a day or so to happen again.

Is there any chance of data corruption using that driver? I have backups but I'd really rather not have that - it runs Nextcloud so it might corrupt stuff on my other systems too.

@pelwell
Contributor

pelwell commented Sep 9, 2020

That driver is the only driver for the old Pi USB controller upstream. What it lacks in optimised FIQ-based interrupt handling it makes up for in bug fixes. In other words, the throughput might go down, but the reliability should go up.

@hamishmb
Author

hamishmb commented Sep 9, 2020

Okay, sounds good. I'm not too bothered about total throughput.

I'll make sure this still happens in the new kernel first though, otherwise we'll be chasing down a bug that might not exist anymore. Either way, if it doesn't happen within a few days I'll test the new driver for you anyway.

@hamishmb
Author

hamishmb commented Sep 9, 2020

Okay, just happened again with Linux 5.4. I'll try your suggestion now and report back.

@hamishmb
Author

hamishmb commented Sep 9, 2020

NB: No kernel OOPS last time, but USB devices all reset.

I've booted with the new overlay and all seems okay, USB is still working and so is the network. Any way I can verify that it booted with the new overlay instead of falling back to something else?

@pelwell
Contributor

pelwell commented Sep 9, 2020

Off the top of my head, dmesg | grep dwc should return different results in the two cases, and it should be obvious which is which.
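
The same check can be roughed out in a script. This is a heuristic sketch only: it assumes the upstream driver shows up in log lines under the name dwc2 and the downstream one under dwc_otg, the function name is hypothetical, and the sample lines are placeholders rather than real kernel output. Kernel command-line echoes are skipped because cmdline.txt may mention dwc_otg parameters regardless of which driver is bound.

```python
def usb_driver_hint(dmesg_text):
    """Guess which USB driver name appears in dmesg output.

    Returns "dwc2", "dwc_otg", or "unknown". A dwc2 mention wins, on the
    assumption that the upstream driver only logs when it is in use.
    """
    saw_dwc_otg = False
    for line in dmesg_text.splitlines():
        if "Kernel command line" in line:
            continue  # boot parameters are not evidence of a bound driver
        if "dwc2" in line:
            return "dwc2"
        if "dwc_otg" in line:
            saw_dwc_otg = True
    return "dwc_otg" if saw_dwc_otg else "unknown"


# Placeholder lines for illustration only, not real kernel messages.
with_overlay = "Kernel command line: dwc_otg.lpm_enable=0\ndwc2 <placeholder message>\n"
without_overlay = "dwc_otg <placeholder message>\n"
print(usb_driver_hint(with_overlay), usb_driver_hint(without_overlay))
```

If the output is ambiguous on a given kernel, reading the dmesg lines by eye as suggested above remains the more reliable test.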

@hamishmb
Author

hamishmb commented Sep 9, 2020

Yeah, I'm using the correct overlay then. Cheers! I'll get back to you on how well it works.

@hamishmb
Author

hamishmb commented Sep 9, 2020

Okay, no kernel oops/network issues (yet) but my LUKS storage still gets unmounted with no explanation (no USB reset this time either). This might be multiple different problems occurring at the same time though. I'll keep monitoring and see if I get another OOPS.
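
One way to monitor for the silent unmount is to poll the mount table and record when the volume disappears. Below is a minimal sketch of the check itself, assuming the usual /proc/mounts layout (device, mount point, filesystem type, options, two numeric fields); the device and mount point names are hypothetical.

```python
def is_mounted(mounts_text, mount_point):
    """Return True if mount_point appears as the second field of any line.

    Note: /proc/mounts escapes spaces in paths as \\040, so a mount point
    containing spaces would need the same escaping to match.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == mount_point:
            return True
    return False


# Hypothetical mount table entry; names are made up for illustration.
sample_mounts = "/dev/mapper/nas /mnt/nas ext4 rw,relatime,errors=remount-ro 0 0\n"
print(is_mounted(sample_mounts, "/mnt/nas"))
```

Run periodically against the contents of /proc/mounts and logged with a timestamp, this would give a time to compare against dmesg when the volume next goes away.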

@hamishmb
Author

hamishmb commented Sep 9, 2020

NB: Someone else had success (with the network side) by reverting to an older firmware revision: https://bugs.launchpad.net/ubuntu/+source/linux-raspi2/+bug/1861936/comments/66

@cfilipem

cfilipem commented Sep 9, 2020

The wired network issue seems to be like the one I reported on the following issue: raspberrypi/firmware#1444

I had those network issues on that 3B+ when I upgraded to 5.4 kernel (firmware 1.20200723-1). Since I reverted to 4.19.118-v7+ (firmware 1.20200601-1), had no more network issues. Going now with 33 days uptime.

@hamishmb
Author

Hmm, I was running the same kernel before and still had the issue. Granted, it was the (official) v8+ kernel, because my pi is also doing BOINC projects right now and I needed 64-bit process support, but I'm pretty sure the network also went down several times when I was running the default kernel.

@cfilipem

> Hmm, I was running the same kernel before and still had the issue. Granted, it was the (official) v8+ kernel, because my pi is also doing BOINC projects right now and I needed 64-bit process support, but I'm pretty sure the network also went down several times when I was running the default kernel.

Ok. In my case, I was using the 32-bit kernel, both on 5.4 and on 4.19. And I don’t have anything else connected to the USB ports.

@hamishmb
Author

Okay, no network or USB issues since I enabled this overlay, and nothing in dmesg either - I'm considering this a win. No noticeable slowness either compared to the old driver. The LUKS volume also seems more stable, but I think there are other interactions too that are unrelated to this bug.

It does look similar to the bug you reported @cfilipem. Perhaps this overlay could be useful for you as well.

Is this to become the default driver sometime soon? It seems great.

@pelwell
Contributor

pelwell commented Sep 14, 2020

2711 has a more capable XHCI USB2 controller, as well as a PCIe-attached USB3 controller, so over time the on-board OTG controller will become less important. Also, AArch64 Linux does not support FIQs, so we are forced to use the dwc2 there.

It would be nice to switch to the upstream driver to get rid of the support burden of the optimised downstream driver, but I don't think we're there yet. For some users of mass storage devices there would probably be no loss of performance, but other devices using (mumble) small packets (mumble) isochronous transfers (mumble) would be negatively affected, especially on slower Pis. However, every issue like this adds more weight to the argument to change over.

@P33M
Contributor

P33M commented Sep 14, 2020

The OP is using aarch64 which implicitly disables FIQ in dwc_otg, but still uses the FIQ code within a regular IRQ. I'd lean towards this being an unsupported configuration - you lose the benefits of a hard realtime FIQ handler while also adding in a layer of complexity by having two interrupts related to the same hardware happening in the same context.

Maybe the recommendation should be to disable dwc_otg if you're on a BCM2837 platform and are using aarch64.

@pelwell
Contributor

pelwell commented Sep 14, 2020

DTBs are common between aarch32 and aarch64; we have the ability to make arch-specific changes at build time but the firmware will use the same filename in both cases so it's of little use. This means that for practical purposes this change would have to take the form of a recommendation (and I wouldn't be surprised if that already exists somewhere), or persuading the firmware to apply the dwc2 overlay automatically.

@hamishmb
Author

So I'm okay to be using the overlay on aarch64? It definitely seems better than the old driver.

@pelwell
Contributor

pelwell commented Sep 14, 2020

Better than okay, you're recommended.

@hamishmb
Author

Okay thanks.

This is great, seems to work really well and far less crashy than the old driver. My LUKS storage does also seem better, but I think there are other causes for that behaviour.

If this is useful to push towards the upstream driver, should I leave this open? Otherwise happy to mark as solved.

@pelwell
Contributor

pelwell commented Sep 14, 2020

Close it - GitHub search works well enough.

@hamishmb
Author

Fixed with:
dtoverlay=dwc2

@hamishmb
Author

Unfortunately, the OOPS at the top just happened again :/ This time it needed a full reboot. Double checked I'm still using the new driver.

I still think it's better though, and of course it might not be this driver's fault that it fell over. Any other places to look?

@hamishmb hamishmb reopened this Sep 14, 2020
@hamishmb
Author

Hasn't happened again, seems fine now. Have also deployed in a bigger network of Pis and all seems fine.
