meta-lxatac-bsp: lxatac-net-switch-hacks: workarounds for connection losses #216

Merged
merged 2 commits into from
Nov 27, 2024

@hnez hnez commented Nov 18, 2024

This PR adds two workarounds that should make the network connection more reliable under high loads:

Increase atomic memory pool size

Under load, the kernel's default min_free_kbytes (around 2M) seems to be too little, and hot paths can run out of memory.
Increasing the limit to 8M seems to mitigate the problem.

This manifests as problems communicating with the Ethernet switch under high load, resulting in network connection losses.

This is only fighting symptoms of an underlying issue, which is why it is marked as a hack.
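
As a rough illustration of the first workaround, the following sketch reads the `vm.min_free_kbytes` sysctl through procfs and raises it if it is below a target. It is not the change made in this PR (the actual recipe is not reproduced here). The target of 8192 kB is an assumption about what "8M" means, and the helper names are hypothetical. Writing the value requires root.

```python
from pathlib import Path

# procfs file backing the vm.min_free_kbytes sysctl.
MIN_FREE_KBYTES = Path("/proc/sys/vm/min_free_kbytes")
# Assumption: "8M" in the description is taken to mean 8192 kB.
TARGET_KB = 8192


def read_min_free_kbytes(path=MIN_FREE_KBYTES):
    """Return the current value in kB."""
    return int(Path(path).read_text().strip())


def raise_min_free_kbytes(target_kb=TARGET_KB, path=MIN_FREE_KBYTES):
    """Raise the value to target_kb if it is lower.

    Returns True if a new value was written, False if the current
    value was already large enough.
    """
    if read_min_free_kbytes(path) >= target_kb:
        return False
    Path(path).write_text(f"{target_kb}\n")
    return True


if __name__ == "__main__":
    # Only inspect the live value here; changing it is left to the caller.
    if MIN_FREE_KBYTES.exists():
        print("current min_free_kbytes:", read_min_free_kbytes())
```

A value set this way does not survive a reboot; a persistent setup would more likely use a sysctl configuration file.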

Increase SPI kernel thread priority

When the system is under high load, some SPI transfers with the Ethernet switch time out before they are handled.

Increase the priority of the kernel thread that handles the SPI transfer to work around the issue.

It does not make a lot of sense for an SPI transfer that is 100% under the host's control (it does not and cannot wait for the device, for example) to time out in the first place.
This means we are only fighting symptoms here, which is why this change is also marked as a hack.
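
A minimal sketch of what raising a kernel thread's priority can look like at runtime, assuming the thread can be found by its name in `/proc/<pid>/comm`. The thread name prefix and the priority value used below are hypothetical; this PR's description does not name the thread or the priority it sets. `set_fifo_priority` does the same thing as `chrt -f -p <priority> <pid>` and needs root.

```python
import os
from pathlib import Path


def find_threads_by_comm(prefix, proc_root="/proc"):
    """Return the sorted PIDs whose comm name starts with prefix."""
    pids = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as "self"
        try:
            comm = (entry / "comm").read_text().strip()
        except OSError:
            continue  # the task exited or is not readable
        if comm.startswith(prefix):
            pids.append(int(entry.name))
    return sorted(pids)


def set_fifo_priority(pid, priority):
    """Switch pid to the SCHED_FIFO real-time policy (Linux only)."""
    os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(priority))


if __name__ == "__main__":
    # Hypothetical prefix: the real thread name depends on the SPI
    # controller and kernel configuration.
    HYPOTHETICAL_PREFIX = "spi0"
    if Path("/proc").is_dir():
        for pid in find_threads_by_comm(HYPOTHETICAL_PREFIX):
            print("candidate thread:", pid)
            # set_fifo_priority(pid, 50)  # hypothetical priority, needs root
```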

hnez commented Nov 18, 2024

I've added @marckleinebudde, who investigated the issue and came up with the workarounds, and @jluebbe, who has Yocto expertise, as reviewers for content (@marckleinebudde) and implementation (@jluebbe).

@marckleinebudde marckleinebudde left a comment

lgtm

Under load, the kernel's default min_free_kbytes (around 2M) seems to be
too little, and hot paths can run out of memory.
Increasing the limit to 8M seems to mitigate the problem.

This manifests as problems communicating with the Ethernet switch under
high load, resulting in network connection losses.

This is only fighting symptoms of an underlying issue,
which is why it is marked as a hack.

Signed-off-by: Leonard Göhrs <[email protected]>
…hread

When the system is under high load, some SPI transfers with the Ethernet
switch time out before they are handled.

Increase the priority of the kernel thread that handles the SPI transfer
to work around the issue.

It does not make a lot of sense for an SPI transfer that is 100% under the
host's control (it does not and cannot wait for the device, for example) to
time out in the first place.
This means we are only fighting symptoms here, which is why this change
is also marked as a hack.

Signed-off-by: Leonard Göhrs <[email protected]>
@hnez hnez merged commit 39027a4 into linux-automation:scarthgap Nov 27, 2024
4 checks passed
@hnez hnez deleted the spi-irq-prio branch November 27, 2024 09:16

hnez commented Nov 27, 2024

For completeness' sake, here is the kernel log of the issue this PR aims to work around:

Aug 08 05:30:13 lxatac-00009 kernel: ksz-switch spi0.0: SPI transfer timed out
Aug 08 05:30:13 lxatac-00009 kernel: lowmem_reserve[]: 0 0
Aug 08 05:30:13 lxatac-00009 kernel: spi_master spi0: failed to transfer one message from queue
Aug 08 05:30:13 lxatac-00009 kernel: Normal: 132*4kB (UMC) 12*8kB (UM) 18*16kB (UM) 47*32kB (M) 213*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB = 16048kB
Aug 08 05:30:13 lxatac-00009 kernel: spi_master spi0: noqueue transfer failed
Aug 08 05:30:13 lxatac-00009 kernel: 90760 total pagecache pages
Aug 08 05:30:13 lxatac-00009 kernel: 131072 pages RAM
Aug 08 05:30:13 lxatac-00009 kernel: 0 pages HighMem/MovableOnly
Aug 08 05:30:13 lxatac-00009 kernel: 5739 pages reserved
Aug 08 05:30:13 lxatac-00009 kernel: 16384 pages cma reserved
Aug 08 05:30:13 lxatac-00009 kernel: spi_stm32 44009000.spi: spurious IT (sr=0x00010002, ier=0x00000000)
Aug 08 05:30:13 lxatac-00009 kernel: ksz-switch spi0.0: can't read 16bit reg: 0x2100 -ETIMEDOUT
Aug 08 05:30:13 lxatac-00009 kernel: ------------[ cut here ]------------
Aug 08 05:30:13 lxatac-00009 kernel: WARNING: CPU: 1 PID: 14772 at /drivers/net/phy/phy.c:1262 _phy_state_machine+0x158/0x26c
Aug 08 05:30:13 lxatac-00009 kernel: phy_check_link_status+0x0/0xec: returned: -110
Aug 08 05:30:13 lxatac-00009 kernel: Modules linked in: sd_mod t10_pi crc64_rocksoft_generic crc64_rocksoft crc64 uas usb_storage ftdi_sio usbserial dm_mod
Aug 08 05:30:13 lxatac-00009 kernel: CPU: 1 PID: 14772 Comm: kworker/1:5 Not tainted 6.9.0-20240514-1 #1 60245d6b4b3ea67a7e30e624a34e435f92404f6f
Aug 08 05:30:13 lxatac-00009 kernel: Hardware name: STM32 (Device Tree Support)
Aug 08 05:30:13 lxatac-00009 kernel: Workqueue: events_power_efficient phy_state_machine
Aug 08 05:30:13 lxatac-00009 kernel: Call trace: 
Aug 08 05:30:13 lxatac-00009 kernel:  unwind_backtrace from show_stack+0x10/0x14
Aug 08 05:30:14 lxatac-00009 kernel:  show_stack from dump_stack_lvl+0x50/0x64
Aug 08 05:30:14 lxatac-00009 kernel:  dump_stack_lvl from __warn+0x94/0xc0
Aug 08 05:30:14 lxatac-00009 kernel:  __warn from warn_slowpath_fmt+0x120/0x1b4
Aug 08 05:30:14 lxatac-00009 kernel:  warn_slowpath_fmt from _phy_state_machine+0x158/0x26c
Aug 08 05:30:14 lxatac-00009 kernel:  _phy_state_machine from phy_state_machine+0x1c/0x3c
Aug 08 05:30:14 lxatac-00009 kernel:  phy_state_machine from process_one_work+0x148/0x2cc
Aug 08 05:30:14 lxatac-00009 kernel:  process_one_work from worker_thread+0x250/0x44c
Aug 08 05:30:14 lxatac-00009 kernel:  worker_thread from kthread+0x110/0x12c
Aug 08 05:30:14 lxatac-00009 kernel:  kthread from ret_from_fork+0x14/0x28
Aug 08 05:30:14 lxatac-00009 kernel: Exception stack(0xe0bd5fb0 to 0xe0bd5ff8)
Aug 08 05:30:14 lxatac-00009 kernel: 5fa0:                                     00000000 00000000 00000000 00000000
Aug 08 05:30:14 lxatac-00009 kernel: 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Aug 08 05:30:14 lxatac-00009 kernel: 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
Aug 08 05:30:14 lxatac-00009 kernel: ---[ end trace 0000000000000000 ]---
Aug 08 05:30:14 lxatac-00009 kernel: ksz-switch spi0.0 uplink: Link is Down
Aug 08 05:30:14 lxatac-00009 kernel: tac-bridge: port 1(uplink) entered disabled state
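
To make the `Normal:` line of the log above easier to read, here is a small sketch that parses the per-size free block counts and adds them up. The function names are hypothetical. The sum reproduces the 16048kB total printed by the kernel, and the largest block size with a non-zero count is 64kB, so no contiguous free block of 128kB or more was available in that zone at the time. That is consistent with the memory pressure described in this PR, though it does not by itself prove the cause.

```python
import re

# Matches entries such as "132*4kB" (count * block size in kB).
BUDDY_ENTRY = re.compile(r"(\d+)\*(\d+)kB")

# The "Normal:" line copied from the kernel log in this PR.
LOG_LINE = (
    "Normal: 132*4kB (UMC) 12*8kB (UM) 18*16kB (UM) 47*32kB (M) "
    "213*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB "
    "0*4096kB 0*8192kB = 16048kB"
)


def parse_buddy_line(line):
    """Return (count, block_size_kb) pairs from a free-area log line."""
    return [(int(count), int(size)) for count, size in BUDDY_ENTRY.findall(line)]


def summarize(line):
    """Return (total free kB, largest block size in kB that has free blocks)."""
    entries = parse_buddy_line(line)
    total_kb = sum(count * size for count, size in entries)
    largest_kb = max((size for count, size in entries if count > 0), default=0)
    return total_kb, largest_kb


if __name__ == "__main__":
    total_kb, largest_kb = summarize(LOG_LINE)
    print(f"free: {total_kb} kB, largest free block: {largest_kb} kB")
```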
