
RCU stalls on copy_to_user actions in kernel #11274

Closed

maxboone opened this issue Mar 8, 2024 · 26 comments

maxboone commented Mar 8, 2024

Windows Version

Microsoft Windows [Version 10.0.22631.3155]

WSL Version

2.1.1.0

Are you using WSL 1 or WSL 2?

  • [x] WSL 2
  • [ ] WSL 1

Kernel Version

5.15.146.1-2

Distro Version

Ubuntu 22.04

Other Software

No response

Repro Steps

  • Start a Linux distro under WSL 2
  • Work on the system for a while
  • It'll start dumping RCU stall warnings in the kernel logs

Expected Behavior

No RCU stalls; copy_to_user calls fault correctly in the kernel and free up the CPU again

Actual Behavior

System locks up.

Diagnostic Logs

When the kernel is built with CONFIG_RSEQ:

[ 1301.906019] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1301.907897] rcu:     1-....: (104816 ticks this GP) idle=9b9/1/0x4000000000000002 softirq=5231/5231 fqs=51500
[ 1301.909720]  (t=105011 jiffies g=8973 q=3510)
[ 1301.910689] Task dump for CPU 1:
[ 1301.911380] task:weston          state:R  running task     stack:    0 pid:  137 ppid:   133 flags:0x0000000a
[ 1301.913333] Call trace:
[ 1301.913785]  dump_backtrace+0x0/0x1b0
[ 1301.914694]  show_stack+0x1c/0x24
[ 1301.915494]  sched_show_task+0x164/0x190
[ 1301.916105]  dump_cpu_task+0x48/0x54
[ 1301.916837]  rcu_dump_cpu_stacks+0xec/0x130
[ 1301.917522]  rcu_sched_clock_irq+0x908/0xa40
[ 1301.918803]  update_process_times+0xa0/0x190
[ 1301.919619]  tick_sched_timer+0x5c/0xd0
[ 1301.920287]  __hrtimer_run_queues+0x140/0x32c
[ 1301.921390]  hrtimer_interrupt+0xf4/0x240
[ 1301.922079]  hv_stimer0_isr+0x28/0x30
[ 1301.922699]  hv_stimer0_percpu_isr+0x14/0x20
[ 1301.923514]  handle_percpu_devid_irq+0x8c/0x1b4
[ 1301.924360]  handle_domain_irq+0x64/0x90
[ 1301.924952]  gic_handle_irq+0x58/0x128
[ 1301.925609]  call_on_irq_stack+0x20/0x38
[ 1301.926268]  do_interrupt_handler+0x54/0x5c
[ 1301.926894]  el1_interrupt+0x2c/0x4c
[ 1301.927543]  el1h_64_irq_handler+0x14/0x20
[ 1301.928252]  el1h_64_irq+0x74/0x78
[ 1301.928911]  clear_rseq_cs.isra.0+0x4c/0x60
[ 1301.929527]  do_notify_resume+0xf8/0xeb0
[ 1301.931746]  el0_svc+0x3c/0x50
[ 1301.932741]  el0t_64_sync_handler+0x9c/0x120
[ 1301.933676]  el0t_64_sync+0x158/0x15c
[ 1468.089387] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1468.090681] rcu:     1-....: (149813 ticks this GP) idle=9b9/1/0x4000000000000002 softirq=5231/5231 fqs=73593
[ 1468.092491]  (t=150019 jiffies g=8973 q=4968)
[ 1468.093259] Task dump for CPU 1:
[ 1468.093982] task:weston          state:R  running task     stack:    0 pid:  137 ppid:   133 flags:0x0000000a
[ 1468.095831] Call trace:
[ 1468.096253]  dump_backtrace+0x0/0x1b0
[ 1468.097658]  show_stack+0x1c/0x24
[ 1468.098868]  sched_show_task+0x164/0x190
[ 1468.099487]  dump_cpu_task+0x48/0x54
[ 1468.100903]  rcu_dump_cpu_stacks+0xec/0x130
[ 1468.101611]  rcu_sched_clock_irq+0x908/0xa40
[ 1468.102838]  update_process_times+0xa0/0x190
[ 1468.103719]  tick_sched_timer+0x5c/0xd0
[ 1468.104406]  __hrtimer_run_queues+0x140/0x32c
[ 1468.105287]  hrtimer_interrupt+0xf4/0x240
[ 1468.105953]  hv_stimer0_isr+0x28/0x30
[ 1468.106588]  hv_stimer0_percpu_isr+0x14/0x20
[ 1468.107497]  handle_percpu_devid_irq+0x8c/0x1b4
[ 1468.109079]  handle_domain_irq+0x64/0x90
[ 1468.109684]  gic_handle_irq+0x58/0x128
[ 1468.110504]  call_on_irq_stack+0x20/0x38
[ 1468.111322]  do_interrupt_handler+0x54/0x5c
[ 1468.111945]  el1_interrupt+0x2c/0x4c
[ 1468.112568]  el1h_64_irq_handler+0x14/0x20
[ 1468.114268]  el1h_64_irq+0x74/0x78
[ 1468.115010]  clear_rseq_cs.isra.0+0x4c/0x60
[ 1468.115848]  do_notify_resume+0xf8/0xeb0
[ 1468.116700]  el0_svc+0x3c/0x50
[ 1468.117973]  el0t_64_sync_handler+0x9c/0x120
[ 1468.118875]  el0t_64_sync+0x158/0x15c
[ 1634.272774] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1634.273960] rcu:     1-....: (194688 ticks this GP) idle=9b9/1/0x4000000000000002 softirq=5231/5231 fqs=95152
[ 1634.275462]  (t=195027 jiffies g=8973 q=6426)
[ 1634.276232] Task dump for CPU 1:
[ 1634.276900] task:weston          state:R  running task     stack:    0 pid:  137 ppid:   133 flags:0x0000000a
[ 1634.278492] Call trace:
[ 1634.278904]  dump_backtrace+0x0/0x1b0
[ 1634.279530]  show_stack+0x1c/0x24
[ 1634.280323]  sched_show_task+0x164/0x190
[ 1634.281025]  dump_cpu_task+0x48/0x54
[ 1634.281731]  rcu_dump_cpu_stacks+0xec/0x130
[ 1634.282321]  rcu_sched_clock_irq+0x908/0xa40
[ 1634.283189]  update_process_times+0xa0/0x190
[ 1634.284083]  tick_sched_timer+0x5c/0xd0
[ 1634.284730]  __hrtimer_run_queues+0x140/0x32c
[ 1634.285560]  hrtimer_interrupt+0xf4/0x240
[ 1634.286247]  hv_stimer0_isr+0x28/0x30
[ 1634.286878]  hv_stimer0_percpu_isr+0x14/0x20
[ 1634.287682]  handle_percpu_devid_irq+0x8c/0x1b4
[ 1634.288558]  handle_domain_irq+0x64/0x90
[ 1634.289195]  gic_handle_irq+0x58/0x128
[ 1634.289787]  call_on_irq_stack+0x20/0x38
[ 1634.290395]  do_interrupt_handler+0x54/0x5c
[ 1634.291090]  el1_interrupt+0x2c/0x4c
[ 1634.291835]  el1h_64_irq_handler+0x14/0x20
[ 1634.292476]  el1h_64_irq+0x74/0x78
[ 1634.293174]  clear_rseq_cs.isra.0+0x4c/0x60
[ 1634.293958]  do_notify_resume+0xf8/0xeb0
[ 1634.294543]  el0_svc+0x3c/0x50
[ 1634.295654]  el0t_64_sync_handler+0x9c/0x120
[ 1634.297186]  el0t_64_sync+0x158/0x15c
[ 1800.463929] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1800.539655] rcu:     1-....: (238052 ticks this GP) idle=9b9/1/0x4000000000000002 softirq=5231/5231 fqs=115916
[ 1800.691505]  (t=240096 jiffies g=8973 q=7884)
[ 1800.692592] Task dump for CPU 1:
[ 1800.725924] task:weston          state:R  running task     stack:    0 pid:  137 ppid:   133 flags:0x0000000a
[ 1800.789528] Call trace:
[ 1800.791436]  dump_backtrace+0x0/0x1b0
[ 1800.899671]  show_stack+0x1c/0x24
[ 1800.913866]  sched_show_task+0x164/0x190
[ 1800.943585]  dump_cpu_task+0x48/0x54
[ 1800.960038]  rcu_dump_cpu_stacks+0xec/0x130
[ 1800.970630]  rcu_sched_clock_irq+0x908/0xa40
[ 1801.002321]  update_process_times+0xa0/0x190
[ 1801.010376]  tick_sched_timer+0x5c/0xd0
[ 1801.016018]  __hrtimer_run_queues+0x140/0x32c
[ 1801.039996]  hrtimer_interrupt+0xf4/0x240
[ 1801.132611]  hv_stimer0_isr+0x28/0x30
[ 1801.177669]  hv_stimer0_percpu_isr+0x14/0x20
[ 1801.233281]  handle_percpu_devid_irq+0x8c/0x1b4
[ 1801.234490]  handle_domain_irq+0x64/0x90
[ 1801.335334]  gic_handle_irq+0x58/0x128
[ 1801.336804]  call_on_irq_stack+0x20/0x38
[ 1801.338003]  do_interrupt_handler+0x54/0x5c
[ 1801.353157]  el1_interrupt+0x2c/0x4c
[ 1801.403627]  el1h_64_irq_handler+0x14/0x20
[ 1801.467756]  el1h_64_irq+0x74/0x78
[ 1801.491109]  clear_rseq_cs.isra.0+0x4c/0x60
[ 1801.519716]  do_notify_resume+0xf8/0xeb0
[ 1801.570939]  el0_svc+0x3c/0x50
[ 1801.587231]  el0t_64_sync_handler+0x9c/0x120
[ 1801.598718]  el0t_64_sync+0x158/0x15c
[ 1836.057995] hrtimer: interrupt took 44005651 ns
[ 1962.769498] Exception:
[ 1962.770161] Operation canceled @p9io.cpp:258 (AcceptAsync)
[ 1962.944713]
[ 1967.784952] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1967.856279] rcu:     1-....: (276088 ticks this GP) idle=9b9/1/0x4000000000000002 softirq=5231/5231 fqs=132893
[ 1968.000378]  (t=285406 jiffies g=8973 q=9331)
[ 1968.103228] rcu: rcu_sched kthread starved for 1379 jiffies! g8973 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=2
[ 1968.320482] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 1968.600515] rcu: RCU grace-period kthread stack dump:
[ 1968.683365] task:rcu_sched       state:R  running task     stack:    0 pid:   14 ppid:     2 flags:0x00000008
[ 1968.852308] Call trace:
[ 1968.888048]  __switch_to+0xb4/0xec
[ 1968.999212]  __schedule+0x2cc/0x800
[ 1969.061683]  schedule+0x64/0x100
[ 1969.144007]  schedule_timeout+0x9c/0x184
[ 1969.220361]  rcu_gp_fqs_loop+0x100/0x364
[ 1969.296220]  rcu_gp_kthread+0x108/0x140
[ 1969.388705]  kthread+0x124/0x130
[ 1969.468734]  ret_from_fork+0x10/0x20
[ 1969.557124] rcu: Stack dump where RCU GP kthread last ran:
[ 1969.689180] Task dump for CPU 2:
[ 1969.770725] task:weston          state:R  running task     stack:    0 pid:  145 ppid:   133 flags:0x0000000f
[ 1970.150921] Call trace:
[ 1970.246591]  __switch_to+0xb4/0xec
[ 1970.344106]  0xffff000002cb1d00
[ 1970.437906] Task dump for CPU 1:
[ 1970.528765] task:weston          state:R  running task     stack:    0 pid:  137 ppid:   133 flags:0x0000000b
[ 1971.000258] Call trace:
[ 1971.129089]  dump_backtrace+0x0/0x1b0
[ 1971.277596]  show_stack+0x1c/0x24
[ 1971.362667]  sched_show_task+0x164/0x190
[ 1971.434626]  dump_cpu_task+0x48/0x54
[ 1971.554021]  rcu_dump_cpu_stacks+0xec/0x130
[ 1971.691116]  rcu_sched_clock_irq+0x908/0xa40
[ 1971.826492]  update_process_times+0xa0/0x190
[ 1971.899642]  tick_sched_timer+0x5c/0xd0
[ 1971.964201]  __hrtimer_run_queues+0x140/0x32c
[ 1972.133174]  hrtimer_interrupt+0xf4/0x240
[ 1972.185289]  hv_stimer0_isr+0x28/0x30
[ 1972.278692]  hv_stimer0_percpu_isr+0x14/0x20
[ 1972.415058]  handle_percpu_devid_irq+0x8c/0x1b4
[ 1972.495565]  handle_domain_irq+0x64/0x90
[ 1972.577916]  gic_handle_irq+0x58/0x128
[ 1972.695828]  call_on_irq_stack+0x20/0x38
[ 1972.740457]  do_interrupt_handler+0x54/0x5c
[ 1972.783890]  el1_interrupt+0x2c/0x4c
[ 1972.887542]  el1h_64_irq_handler+0x14/0x20
[ 1972.939049]  el1h_64_irq+0x74/0x78
[ 1972.993798]  clear_rseq_cs.isra.0+0x4c/0x60
[ 1973.103571]  do_notify_resume+0xf8/0xeb0
[ 1973.162704]  el0_svc+0x3c/0x50
[ 1973.203463]  el0t_64_sync_handler+0x9c/0x120
[ 1973.282090]  el0t_64_sync+0x158/0x15c

When the kernel is built without CONFIG_RSEQ (performance and time until the first stall are significantly better):

[  675.812339] rcu: INFO: rcu_sched self-detected stall on CPU
[  675.814587] rcu:     3-....: (14893 ticks this GP) idle=762c/1/0x4000000000000000 softirq=6920/6920 fqs=6610
[  675.815606] rcu:     (t=15001 jiffies g=50497 q=1304 ncpus=8)
[  675.816520] CPU: 3 PID: 232 Comm: snapfuse Not tainted 6.7.7-WSL2-STABLE+ #2
[  675.816550] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  675.816553] pc : __arch_copy_to_user+0x1a0/0x240
[  675.817689] lr : _copy_to_iter+0xf0/0x560
[  675.818069] sp : ffff800082ceba80
[  675.818070] x29: ffff800082cebac0 x28: 0000000001b2c000 x27: 0000000000000005
[  675.818074] x26: 0000000000000000 x25: ffff00004c491000 x24: 0000000000000000
[  675.818076] x23: 0000000000001000 x22: 0000040000000000 x21: ffff800082cebd30
[  675.818079] x20: ffff800082cebd30 x19: 0000000000001000 x18: 0000000000000000
[  675.818081] x17: 0000000000000000 x16: 0000000000000000 x15: ffff00004c491000
[  675.818083] x14: 9887db4ae914c054 x13: 6bcd444ce14effe5 x12: 0b22b481c6001041
[  675.818086] x11: 7513c0250d7df247 x10: b85affa4063b12c7 x9 : 368beb85bc648557
[  675.818088] x8 : 217c88df9795370e x7 : a16d77942052b4ab x6 : 0000aaf844516fff
[  675.818090] x5 : 0000aaf844517e2f x4 : 0000000000000000 x3 : 0000000000003daf
[  675.818092] x2 : 0000000000000dc0 x1 : ffff00004c491210 x0 : 0000aaf844516e2f
[  675.818096] Call trace:
[  675.818143]  __arch_copy_to_user+0x1a0/0x240
[  675.818147]  copy_page_to_iter+0xbc/0x140
[  675.818150]  filemap_read+0x1b0/0x398
[  675.818427]  generic_file_read_iter+0x48/0x168
[  675.818429]  ext4_file_read_iter+0x58/0x288
[  675.818681]  vfs_read+0x1e8/0x280
[  675.818804]  ksys_pread64+0x90/0xf0
[  675.818806]  __arm64_sys_pread64+0x24/0x48
[  675.818807]  invoke_syscall.constprop.0+0x54/0x128
[  675.818912]  do_el0_svc+0x44/0xf0
[  675.818914]  el0_svc+0x24/0xb0
[  675.819041]  el0t_64_sync_handler+0x138/0x148
[  675.819043]  el0t_64_sync+0x14c/0x150
[  681.501178] block sda: the capability attribute has been deprecated.
[  741.700330] rcu: INFO: rcu_sched self-detected stall on CPU
[  741.701707] rcu:     4-....: (14940 ticks this GP) idle=8074/1/0x4000000000000000 softirq=13021/13037 fqs=6400
[  741.703152] rcu:     (t=15001 jiffies g=50713 q=5093 ncpus=8)
[  741.704017] CPU: 4 PID: 194 Comm: systemd-journal Not tainted 6.7.7-WSL2-STABLE+ #2
[  741.704047] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  741.704050] pc : __arch_copy_to_user+0x190/0x240
[  741.704424] lr : _copy_to_iter+0xf0/0x560
[  741.704565] sp : ffff800082cfb870
[  741.704566] x29: ffff800082cfb8b0 x28: ffff00000fe96b18 x27: 0000000000000085
[  741.704569] x26: 0000000000000000 x25: ffff0000061e1c00 x24: 0000000000000000
[  741.704571] x23: 0000000000000085 x22: ffff000117d2a600 x21: ffff800082cfbd90
[  741.704574] x20: ffff0000061e1c00 x19: 0000000000000085 x18: 0000000000000000
[  741.704608] x17: 0000000000000000 x16: 0000000000000000 x15: ffff0000061e1c00
[  741.704610] x14: 62616c6961766120 x13: 7365746164707520 x12: 6f6e207361682070
[  741.704612] x11: 616e73203a687365 x10: 7266657220746f6e x9 : 6e6163203a313937
[  741.704614] x8 : 3a6f672e73726570 x7 : 6c656865726f7473 x6 : 0000ab58c96fd6f0
[  741.704617] x5 : 0000ab58c96fd775 x4 : 0000000000000000 x3 : 0000000000000000
[  741.704619] x2 : 0000000000000005 x1 : ffff0000061e1c40 x0 : 0000ab58c96fd6f0
[  741.704621] Call trace:
[  741.704647]  __arch_copy_to_user+0x190/0x240
[  741.704651]  simple_copy_to_iter+0x48/0x98
[  741.704939]  __skb_datagram_iter+0x7c/0x280
[  741.704941]  skb_copy_datagram_iter+0x48/0xc8
[  741.704943]  unix_stream_read_actor+0x30/0x68
[  741.705137]  unix_stream_read_generic+0x304/0xb70
[  741.705139]  unix_stream_recvmsg+0xc0/0xd0
[  741.705140]  sock_recvmsg+0x88/0x108
[  741.705170]  ____sys_recvmsg+0x78/0x198
[  741.705171]  ___sys_recvmsg+0x80/0xf0
[  741.705173]  __sys_recvmsg+0x5c/0xd0
[  741.705175]  __arm64_sys_recvmsg+0x28/0x50
[  741.705177]  invoke_syscall.constprop.0+0x54/0x128
[  741.705316]  do_el0_svc+0xcc/0xf0
[  741.705317]  el0_svc+0x24/0xb0
[  741.705369]  el0t_64_sync_handler+0x138/0x148
[  741.705371]  el0t_64_sync+0x14c/0x150
[  743.232431] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 4-.... } 15347 jiffies s: 721 root: 0x10/.
[  743.234297] rcu: blocking rcu_node structures (internal RCU debug):
[  743.235477] Sending NMI from CPU 1 to CPUs 4:
[  743.235491] NMI backtrace for cpu 4
[  743.235531] CPU: 4 PID: 194 Comm: systemd-journal Not tainted 6.7.7-WSL2-STABLE+ #2
[  743.235535] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  743.235537] pc : __arch_copy_to_user+0x190/0x240
[  743.235598] lr : _copy_to_iter+0xf0/0x560
[  743.235603] sp : ffff800082cfb870
[  743.235604] x29: ffff800082cfb8b0 x28: ffff00000fe96b18 x27: 0000000000000085
[  743.235607] x26: 0000000000000000 x25: ffff0000061e1c00 x24: 0000000000000000
[  743.235610] x23: 0000000000000085 x22: ffff000117d2a600 x21: ffff800082cfbd90
[  743.235612] x20: ffff0000061e1c00 x19: 0000000000000085 x18: 0000000000000000
[  743.235614] x17: 0000000000000000 x16: 0000000000000000 x15: ffff0000061e1c00
[  743.235617] x14: 62616c6961766120 x13: 7365746164707520 x12: 6f6e207361682070
[  743.235619] x11: 616e73203a687365 x10: 7266657220746f6e x9 : 6e6163203a313937
[  743.235621] x8 : 3a6f672e73726570 x7 : 6c656865726f7473 x6 : 0000ab58c96fd6f0
[  743.235623] x5 : 0000ab58c96fd775 x4 : 0000000000000000 x3 : 0000000000000000
[  743.235626] x2 : 0000000000000005 x1 : ffff0000061e1c40 x0 : 0000ab58c96fd6f0
[  743.235628] Call trace:
[  743.235630]  __arch_copy_to_user+0x190/0x240
[  743.235632]  simple_copy_to_iter+0x48/0x98
[  743.235636]  __skb_datagram_iter+0x7c/0x280
[  743.235639]  skb_copy_datagram_iter+0x48/0xc8
[  743.235641]  unix_stream_read_actor+0x30/0x68
[  743.235644]  unix_stream_read_generic+0x304/0xb70
[  743.235646]  unix_stream_recvmsg+0xc0/0xd0
[  743.235647]  sock_recvmsg+0x88/0x108
[  743.235650]  ____sys_recvmsg+0x78/0x198
[  743.235651]  ___sys_recvmsg+0x80/0xf0
[  743.235653]  __sys_recvmsg+0x5c/0xd0
[  743.235655]  __arm64_sys_recvmsg+0x28/0x50
[  743.235657]  invoke_syscall.constprop.0+0x54/0x128
[  743.235661]  do_el0_svc+0xcc/0xf0
[  743.235663]  el0_svc+0x24/0xb0
[  743.235667]  el0t_64_sync_handler+0x138/0x148
[  743.235668]  el0t_64_sync+0x14c/0x150

And

[ 1559.425979] rcu:     7-....: (14977 ticks this GP) idle=d4ec/1/0x4000000000000000 softirq=18636/18636 fqs=5263
[ 1559.431367] rcu:     (t=15002 jiffies g=67965 q=36939 ncpus=8)
[ 1559.432083] rcu: rcu_sched kthread starved for 2866 jiffies! g67965 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[ 1559.433645] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 1559.434604] rcu: RCU grace-period kthread stack dump:
[ 1559.435549] rcu: Stack dump where RCU GP kthread last ran:
[ 1739.451891] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1739.452511] rcu:     7-....: (59616 ticks this GP) idle=d4ec/1/0x4000000000000000 softirq=18636/18636 fqs=5263
[ 1739.453498] rcu:     (t=60008 jiffies g=67965 q=36939 ncpus=8)
[ 1739.454053] rcu: rcu_sched kthread starved for 47871 jiffies! g67965 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[ 1739.455135] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 1739.456110] rcu: RCU grace-period kthread stack dump:
[ 1739.456687] rcu: Stack dump where RCU GP kthread last ran:
[ 1919.467822] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1919.468776] rcu:     7-....: (104010 ticks this GP) idle=d4ec/1/0x4000000000000000 softirq=18636/18636 fqs=5263
[ 1919.470405] rcu:     (t=105012 jiffies g=67965 q=36941 ncpus=8)
[ 1919.472599] rcu: rcu_sched kthread starved for 92875 jiffies! g67965 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[ 1919.474013] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 1919.474977] rcu: RCU grace-period kthread stack dump:
[ 1919.475641] rcu: Stack dump where RCU GP kthread last ran:
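
For anyone comparing the two configurations, whether the running kernel was built with CONFIG_RSEQ can be checked from inside WSL; a minimal sketch, assuming the kernel exposes /proc/config.gz (the stock WSL2 config typically enables CONFIG_IKCONFIG_PROC):

# Check whether the running kernel was built with rseq support
zcat /proc/config.gz | grep -E '^CONFIG_RSEQ'
# Prints "CONFIG_RSEQ=y" when enabled; no output when the symbol is unset.
# When building a custom kernel, the symbol can be disabled with the
# in-tree helper before compiling:
#   scripts/config --file .config --disable RSEQ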
maxboone (Author) commented Mar 8, 2024

Please refer to the other issues about this problem for logs, and to a thread in the RCU kernel discussion that points toward improper fault handling of copy_to_user. Also, note that this doesn't happen on regular Gen2 VMs in Hyper-V, though the CPU virtualization seems very different between a Hyper-V VM and the WSL VM.

@david-nordvall

An update after some more rounds of patch Tuesdays and more WSL pre-releases. After updating to WSL 2.2.2 today, the stack traces changed somewhat: instead of the previous copy_to_user errors, I now get this:

[ 2474.978293] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 1-.... 4-.... 7-.... } 15058 jiffies s: 497 root: 0x92/.
[ 2474.980247] rcu: blocking rcu_node structures (internal RCU debug):
[ 2474.980704] Sending NMI from CPU 2 to CPUs 1:
[ 2474.980711] NMI backtrace for cpu 1
[ 2474.980715] CPU: 1 PID: 4179 Comm: docker Not tainted 6.7.7-WSL2-STABLE+ #2
[ 2474.980718] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 2474.980721] pc : mm_release+0x48/0x110
[ 2474.980730] lr : mm_release+0x20/0x110
[ 2474.980732] sp : ffff800086733bb0
[ 2474.980733] x29: ffff800086733bb0 x28: ffff0000a81a0788 x27: 0000000000000009
[ 2474.980736] x26: 0000000000000000 x25: 0000ab760004f300 x24: 0000000000000008
[ 2474.980739] x23: 0000000000000000 x22: ffff000020828000 x21: ffff000000734840
[ 2474.980741] x20: ffff000000734840 x19: ffff0000a81a0000 x18: 0000000000000000
[ 2474.980744] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 2474.980746] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 2474.980748] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff80008003b750
[ 2474.980749] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[ 2474.980751] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 2474.980753] x2 : 0000ff29c740f1d0 x1 : 0000000000000000 x0 : 0000ff29c740f1d0
[ 2474.980755] Call trace:
[ 2474.980757]  mm_release+0x48/0x110
[ 2474.980759]  exit_mm_release+0x2c/0x50
[ 2474.980761]  do_exit+0x21c/0x9f0
[ 2474.980763]  do_group_exit+0x38/0xa0
[ 2474.980765]  get_signal+0x978/0x980
[ 2474.980767]  do_notify_resume+0x12c/0xe28
[ 2474.980770]  el0_svc+0x90/0xb0
[ 2474.980774]  el0t_64_sync_handler+0x138/0x148
[ 2474.980776]  el0t_64_sync+0x14c/0x150
.
.
.

The symptoms are still exactly the same as before, however, and just as crippling.

As you can see, I am not running the WSL stock kernel but a custom 6.7.7 kernel built with CONFIG_RSEQ.

PS C:\Users\DavidNordvall> wsl --version
WSL-version: 2.2.2.0
Kernelversion: 5.15.150.1-2
WSLg-version: 1.0.61
MSRDC-version: 1.2.5105
Direct3D-version: 1.611.1-81528511
DXCore-version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows-version: 10.0.22631.3374

@david-nordvall

And after another patch Tuesday upgrade I'm back to getting the same old copy_to_user stack traces.

PS C:\Users\DavidNordvall> wsl --version
WSL-version: 2.2.2.0
Kernelversion: 5.15.150.1-2
WSLg-version: 1.0.61
MSRDC-version: 1.2.5105
Direct3D-version: 1.611.1-81528511
DXCore-version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows-version: 10.0.22631.3447

@craigloewen-msft, @benhillis and @pmartincic, can any of you please provide an update on this (even if it is just to say that it won't be getting attention any time soon)? Has the bug fix mentioned in #10667 (comment) been released? Have you had a chance to look at the logs that have been provided in all the GitHub issues regarding this problem?

OneBlue (Collaborator) commented Apr 18, 2024

@maxboone: Do you see the same behavior with the stock WSL kernel?

If so, can you share /logs of a repro?
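
(For readers hitting the same issue: WSL logs are normally gathered with the collect-wsl-logs.ps1 script from the microsoft/WSL repository; a sketch of the usual invocation from an elevated PowerShell prompt, assuming the script's path in that repo is unchanged:)

# Download the log-collection script from the microsoft/WSL repo
Invoke-WebRequest -UseBasicParsing -Uri "https://raw.githubusercontent.com/microsoft/WSL/master/diagnostics/collect-wsl-logs.ps1" -OutFile collect-wsl-logs.ps1
# Allow the unsigned script for this session only, then run it;
# reproduce the stall while it records, then stop it to produce a .zip of logs
Set-ExecutionPolicy -Scope Process Bypass
.\collect-wsl-logs.ps1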

maxboone (Author) commented Apr 18, 2024

@maxboone: Do you see the same behavior with the stock WSL kernel?

If so, can you share /logs of a repro?

Yes, please refer to the other issues that went stale; there are logs in there.

Please refer to the other issues about this problem for logs, and to a thread in the RCU kernel discussion that points toward improper fault handling of copy_to_user. Also, note that this doesn't happen on regular Gen2 VMs in Hyper-V, though the CPU virtualization seems very different between a Hyper-V VM and the WSL VM.

I will collect logs again over the weekend. An exact repro is hard, as this just happens over time.

david-nordvall commented Apr 19, 2024

Here are some logs for you, @OneBlue.

As @maxboone says, it's really hard to collect logs. Not because it's hard to reproduce the problem (I can do it within one minute; see #10667 for a detailed description of my use case), but because it isn't a "specific event" but rather a case of intermittent stalls and slowdowns that get increasingly worse over time (for example, I get a bunch of hrtimer: interrupt took 16343583 ns events in my dmesg -w logs, which indicates to me that things just take longer than expected). So exactly "when" should I collect these logs?

I have, however, tried to collect logs at three points of slowdown/hang. I really hope they help!

EDIT: This is done using the WSL2 pre-release (but not with any custom kernel)

PS C:\Users\DavidNordvall> wsl --version
WSL-version: 2.2.2.0
Kernelversion: 5.15.150.1-2
WSLg-version: 1.0.61
MSRDC-version: 1.2.5105
Direct3D-version: 1.611.1-81528511
DXCore-version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows-version: 10.0.22631.3447
david@DavidsSrfcPro9:~$ uname -a
Linux DavidsSrfcPro9 5.15.150.1-microsoft-standard-WSL2 #1 SMP Thu Mar 7 03:23:44 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
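
(A simple way to pinpoint when to grab logs is to follow the kernel ring buffer with human-readable timestamps and filter for the stall and latency messages; a minimal sketch using standard util-linux dmesg options:)

# Follow the kernel log (-w) with readable timestamps (-T),
# keeping only the RCU-stall and hrtimer-latency lines
sudo dmesg -wT | grep -E --line-buffered 'rcu|hrtimer'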

maxboone (Author) commented Apr 24, 2024

@OneBlue can you confirm these logs (from @david-nordvall) are sufficient to continue?

PeronGH commented Apr 28, 2024

I recently encountered the RCU stall again, but the stack trace looks different. Here it is:

[ 6199.460642] rcu: INFO: rcu_sched self-detected stall on CPU
[ 6199.461362] rcu:     1-....: (13402 ticks this GP) idle=3a7/1/0x4000000000000002 softirq=19983/19983 fqs=6423
[ 6199.462090]  (t=15000 jiffies g=57053 q=1668)
[ 6199.462515] Task dump for CPU 1:
[ 6199.462818] task:weston          state:R  running task     stack:    0 pid:  857 ppid:   645 flags:0x0000080b
[ 6199.463667] Call trace:
[ 6199.463913]  dump_backtrace+0x0/0x1b0
[ 6199.464635]  show_stack+0x1c/0x24
[ 6199.464959]  sched_show_task+0x164/0x190
[ 6199.465521]  dump_cpu_task+0x48/0x54
[ 6199.466252]  rcu_dump_cpu_stacks+0xec/0x130
[ 6199.466542]  rcu_sched_clock_irq+0x914/0xa50
[ 6199.467061]  update_process_times+0xa0/0x190
[ 6199.467529]  tick_sched_timer+0x5c/0xd0
[ 6199.467869]  __hrtimer_run_queues+0x13c/0x330
[ 6199.468327]  hrtimer_interrupt+0xf4/0x240
[ 6199.468694]  hv_stimer0_isr+0x28/0x30
[ 6199.469128]  hv_stimer0_percpu_isr+0x14/0x20
[ 6199.469550]  handle_percpu_devid_irq+0x8c/0x1b4
[ 6199.469982]  handle_domain_irq+0x64/0x90
[ 6199.470382]  gic_handle_irq+0x58/0x128
[ 6199.470692]  call_on_irq_stack+0x20/0x38
[ 6199.471010]  do_interrupt_handler+0x54/0x5c
[ 6199.471349]  el1_interrupt+0x2c/0x4c
[ 6199.471699]  el1h_64_irq_handler+0x14/0x20
[ 6199.472039]  el1h_64_irq+0x74/0x78
[ 6199.472500]  mm_release+0xd8/0x140
[ 6199.473073]  exit_mm_release+0x2c/0x40
[ 6199.473630]  do_exit+0x19c/0xa60
[ 6199.474011]  do_group_exit+0x3c/0xa4
[ 6199.474326]  get_signal+0x1e4/0x9b0
[ 6199.474732]  do_notify_resume+0x138/0xe6c
[ 6199.475091]  el0_svc+0x3c/0x4c
[ 6199.475581]  el0t_64_sync_handler+0x9c/0x120
[ 6199.476007]  el0t_64_sync+0x158/0x15c
[ 6199.476323] Task dump for CPU 4:
[ 6199.476673] task:rdp-source      state:R  running task     stack:    0 pid:  861 ppid:   645 flags:0x0000080b
[ 6199.477631] Call trace:
[ 6199.478165]  __switch_to+0xb4/0xec
[ 6199.478831]  0xffff000048ae4880
[ 6200.364438] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 1-... 4-... } 15226 jiffies s: 357 root: 0x12/.
[ 6200.365645] rcu: blocking rcu_node structures (internal RCU debug):
[ 6200.366292] Task dump for CPU 1:
[ 6200.366673] task:weston          state:R  running task     stack:    0 pid:  857 ppid:   645 flags:0x0000080b
[ 6200.367605] Call trace:
[ 6200.367822]  __switch_to+0xb4/0xec
[ 6200.368293]  0xffff0000026a8e80
[ 6200.368629] Task dump for CPU 4:
[ 6200.368932] task:rdp-source      state:R  running task     stack:    0 pid:  861 ppid:   645 flags:0x0000080b
[ 6200.369669] Call trace:
[ 6200.369834]  __switch_to+0xb4/0xec
[ 6200.370124]  0xffff000048ae4880
[ 6208.996562] hrtimer: interrupt took 28039200 ns
[ 6379.492381] rcu: INFO: rcu_sched self-detected stall on CPU
[ 6379.493137] rcu:     4-....: (59969 ticks this GP) idle=e35/1/0x4000000000000002 softirq=20957/20957 fqs=23939
[ 6379.493761]  (t=60008 jiffies g=57053 q=1696)
[ 6379.494037] Task dump for CPU 1:
[ 6379.494290] task:weston          state:R  running task     stack:    0 pid:  857 ppid:   645 flags:0x0000080b
[ 6379.494897] Call trace:
[ 6379.495030]  __switch_to+0xb4/0xec
[ 6379.495236]  0xffff0000026a8e80
[ 6379.495426] Task dump for CPU 4:
[ 6379.495640] task:rdp-source      state:R  running task     stack:    0 pid:  861 ppid:   645 flags:0x0000080b
[ 6379.496227] Call trace:
[ 6379.496432]  dump_backtrace+0x0/0x1b0
[ 6379.496758]  show_stack+0x1c/0x24
[ 6379.497047]  sched_show_task+0x164/0x190
[ 6379.497323]  dump_cpu_task+0x48/0x54
[ 6379.497707]  rcu_dump_cpu_stacks+0xec/0x130
[ 6379.498010]  rcu_sched_clock_irq+0x914/0xa50
[ 6379.498318]  update_process_times+0xa0/0x190
[ 6379.498588]  tick_sched_timer+0x5c/0xd0
[ 6379.498856]  __hrtimer_run_queues+0x13c/0x330
[ 6379.499170]  hrtimer_interrupt+0xf4/0x240
[ 6379.499381]  hv_stimer0_isr+0x28/0x30
[ 6379.499589]  hv_stimer0_percpu_isr+0x14/0x20
[ 6379.499865]  handle_percpu_devid_irq+0x8c/0x1b4
[ 6379.500236]  handle_domain_irq+0x64/0x90
[ 6379.500536]  gic_handle_irq+0x58/0x128
[ 6379.500789]  call_on_irq_stack+0x20/0x38
[ 6379.501045]  do_interrupt_handler+0x54/0x5c
[ 6379.501352]  el1_interrupt+0x2c/0x4c
[ 6379.501636]  el1h_64_irq_handler+0x14/0x20
[ 6379.501892]  el1h_64_irq+0x74/0x78
[ 6379.502067]  mm_release+0xd8/0x140
[ 6379.502233]  exit_mm_release+0x2c/0x40
[ 6379.502398]  do_exit+0x19c/0xa60
[ 6379.502567]  do_group_exit+0x3c/0xa4
[ 6379.502730]  get_signal+0x1e4/0x9b0
[ 6379.502895]  do_notify_resume+0x138/0xe6c
[ 6379.503057]  el0_svc+0x3c/0x4c
[ 6379.503220]  el0t_64_sync_handler+0x9c/0x120
[ 6379.503430]  el0t_64_sync+0x158/0x15c
[ 6394.924424] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 1-... 4-... } 63866 jiffies s: 357 root: 0x12/.
[ 6394.925358] rcu: blocking rcu_node structures (internal RCU debug):
[ 6394.925924] Task dump for CPU 1:
[ 6394.926295] task:weston          state:R  running task     stack:    0 pid:  857 ppid:   645 flags:0x0000080b
[ 6394.927296] Call trace:
[ 6394.927445]  __switch_to+0xb4/0xec
[ 6394.927736]  0xffff0000026a8e80
[ 6394.927998] Task dump for CPU 4:
[ 6394.928223] task:rdp-source      state:R  running task     stack:    0 pid:  861 ppid:   645 flags:0x0000080b
[ 6394.928838] Call trace:
[ 6394.928969]  __switch_to+0xb4/0xec
[ 6394.929175]  0xffff000048ae4880
WSL version: 2.2.4.0
Kernel version: 5.15.153.1-2
WSLg version: 1.0.61
MSRDC version: 1.2.5326
Direct3D version: 1.611.1-81528511
DXCore version: 10.0.26091.1-240325-1447.ge-release
Windows version: 10.0.22635.3566

The kernel is the latest official one compiled with RSEQ disabled.

@david-nordvall

@OneBlue, just curious if you've had the chance to have a look at the logs I provided? Are they any help?

@maxboone (Author)

@OneBlue any update?

mazugrin commented May 15, 2024

Still happening with kernel v6.9 and WSL version 2.2.4.0:

[12689.226208] rcu: INFO: rcu_sched self-detected stall on CPU
[12689.226969] rcu:     2-....: (20995 ticks this GP) idle=1ae4/1/0x4000000000000000 softirq=717849/717849 fqs=9028
[12689.228228] rcu:     (t=21006 jiffies g=766217 q=139020 ncpus=8)
[12689.228885] CPU: 2 PID: 299 Comm: rsyslogd Not tainted 6.9.0 #1
[12689.228899] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[12689.228905] pc : clear_rseq_cs.isra.0+0x20/0x28
[12689.228921] lr : __rseq_handle_notify_resume+0x6c/0x348
[12689.228927] sp : ffff8000852a3db0
[12689.228930] x29: ffff8000852a3db0 x28: ffff00010ea45b80 x27: 0000000000000000
[12689.228941] x26: 0000000000000000 x25: 0000000000000000 x24: 0000ffffb1fdec2c
[12689.228949] x23: 0000000080000000 x22: ffff8000852a3eb0 x21: ffff8000852a3eb0
[12689.228958] x20: ffff00010ea45b80 x19: 0000000000000000 x18: 0000000000000000
[12689.228966] x17: 0000000000000000 x16: 0000000000000000 x15: ffff8000852a3d28
[12689.228974] x14: ffff00010ea45c00 x13: 0000000000000001 x12: ffff800081358cc8
[12689.228982] x11: ffff000102302180 x10: 0000000000000900 x9 : ffff8000852a3700
[12689.228991] x8 : 000000385994105c x7 : 000000006644e87c x6 : 0000ffffd2d94518
[12689.228999] x5 : 0000ffffd2d94518 x4 : 0000000000400100 x3 : 0000000000000000
[12689.229007] x2 : 0000000000000000 x1 : 0000ffffb22c7fe8 x0 : 0000000000000000
[12689.229016] Call trace:
[12689.229020]  clear_rseq_cs.isra.0+0x20/0x28
[12689.229027]  do_notify_resume+0xa8/0x138
[12689.229035]  el0_svc+0xb4/0x11c
[12689.229043]  el0t_64_sync_handler+0x134/0x150
[12689.229048]  el0t_64_sync+0x14c/0x150

@maxboone (Author)

@OneBlue @pmartincic with the announcement of the new Surface Copilot+ PC series, any update?

@david-nordvall

Not to mention the release of Docker Desktop for Windows on ARM. I'm traveling and won't be able to test the official Docker Desktop release on my Surface Pro 9 5G for a couple of weeks. Has anyone else tested it?

@mazugrin

Still happening on kernel 6.9.5 under WSL version 2.2.4.0. This thing is unusable for any real work. Will it ever be fixed, @OneBlue @pmartincic?

[  978.163963] rcu: INFO: rcu_sched self-detected stall on CPU
[  978.165483] rcu:     4-....: (5249 ticks this GP) idle=978c/1/0x4000000000000000 softirq=28554/28554 fqs=2221
[  978.165840] rcu:              hardirqs   softirqs   csw/system
[  978.166034] rcu:      number:     2623         84            0
[  978.166233] rcu:     cputime:        0          0        10488   ==> 10488(ms)
[  978.166469] rcu:     (t=5250 jiffies g=42345 q=334 ncpus=8)
[  978.166678] Sending NMI from CPU 4 to CPUs 0:
[  978.166924] NMI backtrace for cpu 0
[  978.167171] CPU: 0 PID: 3149 Comm: play-dev-mode-p Not tainted 6.9.5 #6
[  978.167459] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  978.167721] pc : clear_rseq_cs.isra.0+0x20/0x28
[  978.168664] lr : __rseq_handle_notify_resume+0x6c/0x348
[  978.168950] sp : ffff80008909bdb0
[  978.169151] x29: ffff80008909bdb0 x28: ffff00026fae4c40 x27: 0000000000000000
[  978.169536] x26: 0000000000000000 x25: 0000000000000000 x24: 0000ffffb0039dfc
[  978.169902] x23: 0000000080001000 x22: ffff80008909beb0 x21: ffff80008909beb0
[  978.170282] x20: ffff00026fae4c40 x19: 0000000000000000 x18: 0000000000000000
[  978.170660] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[  978.171051] x14: ffff00026fae4cc0 x13: 0000000000000001 x12: ffff800081328cc0
[  978.171449] x11: ffff000100d26400 x10: 0000000000000900 x9 : ffff80008909bb40
[  978.171879] x8 : ffff00026fae55a0 x7 : ffff0001019bf000 x6 : 0000000000000000
[  978.172225] x5 : 0000000000000000 x4 : 0000000000400040 x3 : 0000000000000000
[  978.172536] x2 : 0000000000000000 x1 : 0000fffe53bff8c8 x0 : 0000000000000000
[  978.172859] Call trace:
[  978.172998]  clear_rseq_cs.isra.0+0x20/0x28
[  978.173151]  do_notify_resume+0xa8/0x138
[  978.173325]  el0_svc+0xb4/0x11c
[  978.173841]  el0t_64_sync_handler+0x134/0x150
[  978.174151]  el0t_64_sync+0x14c/0x150
[  978.174924] CPU: 4 PID: 3148 Comm: play-dev-mode-p Not tainted 6.9.5 #6
[  978.175321] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  978.175703] pc : clear_rseq_cs.isra.0+0x20/0x28
[  978.176155] lr : __rseq_handle_notify_resume+0x6c/0x348
[  978.176354] sp : ffff800089093db0
[  978.176507] x29: ffff800089093db0 x28: ffff00020dd0adc0 x27: 0000000000000000
[  978.176814] x26: 0000000000000000 x25: 0000000000000000 x24: 0000ffffb0039dfc
[  978.177113] x23: 0000000080001000 x22: ffff800089093eb0 x21: ffff800089093eb0
[  978.177412] x20: ffff00020dd0adc0 x19: 0000000000000000 x18: 0000000000000000
[  978.177749] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[  978.178047] x14: ffff00020dd0ae40 x13: 0000000000000001 x12: ffff800081328cc0
[  978.178342] x11: ffff000100d26400 x10: 0000000000000900 x9 : ffff800089093b40
[  978.178641] x8 : ffff00020dd0b720 x7 : ffff0001019bf800 x6 : 0000000000000000
[  978.178941] x5 : 0000000000000000 x4 : 0000000000400040 x3 : 0000000000000000
[  978.179236] x2 : 0000000000000000 x1 : 0000fffe541ff8c8 x0 : 0000000000000000
[  978.179533] Call trace:
[  978.179636]  clear_rseq_cs.isra.0+0x20/0x28
[  978.179795]  do_notify_resume+0xa8/0x138
[  978.180002]  el0_svc+0xb4/0x11c
[  978.180151]  el0t_64_sync_handler+0x134/0x150
[  978.180358]  el0t_64_sync+0x14c/0x150

@maxboone (Author)

@jhovold does this problem look like anything you've come across building the kernel for the ThinkPads with SQ3?

maxboone (Author) commented Jul 1, 2024

@pmartincic @OneBlue checking in

maxboone (Author) commented Jul 3, 2024

@kelsey-steele tagging you as the releaser of the kernel sources - is there anything related to this on the roadmap or in the current out-of-tree patches?

@halfmanhalftaco

I was plagued by extremely frequent RCU stalls resulting in high CPU usage on Win11 23H2 arm64 (Lenovo X13s), but after updating to 24H2 (26100.1150) the problem has completely disappeared. WSL is finally usable on this machine now.

WSL/WSLg/kernel versions appear the same as on 23H2, so maybe some underlying OS/Hyper-V bugs were fixed?

WSL version: 2.2.4.0
Kernel version: 5.15.153.1-2
WSLg version: 1.0.61
MSRDC version: 1.2.5326
Direct3D version: 1.611.1-81528511
DXCore version: 10.0.26091.1-240325-1447.ge-release
Windows version: 10.0.26100.1150

@mazugrin

I'm starting to believe this is true! (Too many times I've been told it was fixed when it wasn't) I upgraded to 24H2 and am now able to use WSL for at least a few hours and have seen no instability at all. Could it be true that it's FINALLY fixed!!!??? I'm happy as a clam now with my Robo & Kala tablet.

@david-nordvall

How are you able to install 24H2? Have you joined an Insider channel (Release Preview)? I'd rather not, since my Surface Pro 9 is my daily driver. Really encouraging news either way!

@maxboone (Author)

I was plagued by extremely frequent RCU stalls resulting in high CPU usage on Win11 23H2 arm64 (Lenovo X13s), but after updating to 24H2 (26100.1150) the problem has completely disappeared. WSL is finally usable on this machine now.

I'm going to update today and will report back; I hope it'll work!

@craigloewen-msft @benhillis could you confirm whether something went into 24H2 that would fix this?

Nevuly added a commit to Nevuly/WSL2-Linux-Kernel-Rolling that referenced this issue Jul 15, 2024
@mazugrin

How are you able to install 24H2? Have you joined an Insider channel (Release Preview)? I'd rather not, since my Surface Pro 9 is my daily driver. Really encouraging news either way!

Yes, indeed you do have to follow Release Preview. For what it's worth, the language used to describe what that means makes it sound like a very low-risk channel to follow.

@maxboone (Author)

Same on Surface Pro X SQ2, since the 24H2 update I haven't run into any RSEQ / RCU related stalls.

It's a shame that I can't mount the Hyper-V disks directly now that I've switched back to WSL, but that works fine through qemu-nbd:

sudo -i
# Load the network block device driver, allowing up to 16 partitions per device
modprobe nbd max_part=16
# Attach the Hyper-V VHDX as /dev/nbd0
qemu-nbd -c /dev/nbd0 /mnt/c/ProgramData/Microsoft/Windows/Virtual\ Hard\ Disks/ubuntu0.vhdx
# Re-read the partition table
partprobe /dev/nbd0
# Mount the LVM root volume from the attached disk
mount /dev/ubuntu-vg/ubuntu-lv /mnt/ubuntu0
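
(Two footnotes to the snippet above, as a sketch rather than a tested recipe: if the LVM volume node is missing the volume group may need activating first, and the NBD device should be disconnected when done; the mount point and LVM names are taken from the commands above.)

# If /dev/ubuntu-vg/ubuntu-lv doesn't exist yet, activate the volume group:
vgchange -ay
mkdir -p /mnt/ubuntu0   # the mount point must exist before mounting
# When finished, unmount and disconnect the NBD device:
umount /mnt/ubuntu0
qemu-nbd -d /dev/nbd0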

I'll keep monitoring whether the stalls really stay away, but it looks like this issue has been fixed!

@maxboone (Author)

After running the 24H2 update for two days, I can gladly say that I am not running into these stalls anymore. Closing this issue.

f0o commented Jul 27, 2024

@maxboone what's the availability of 24H2? The latest preview-channel update I'm able to pull is the 2024-07 Cumulative Preview for 23H2.

@maxboone (Author)

@maxboone what's the availability of 24H2?

After I pulled that one, I got the 24H2 update over the Release Preview channel.
