Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deadlock in autotrim #15217

Open
clhedrick opened this issue Aug 28, 2023 · 1 comment
Open

deadlock in autotrim #15217

clhedrick opened this issue Aug 28, 2023 · 1 comment
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@clhedrick
Copy link

System information

Type | Version/Name
Ubuntu | 22.04 LTS
Distribution Name | Ubuntu
Distribution Version | 22.04
Kernel Version | 5.15.0-76-generic
Architecture | x86_4
OpenZFS Version | zfs-2.1.5-1ubuntu6~22.04.1

Describe the problem you're observing

We installed a new pool. 2 mirrored pairs (4 disks), Samsung MZ7L37T6HBLA-00A07 (7.68 TB SATA SSD).

The pool was set to autotrim.

Copied a bunch of data into it. While copying the pool hung

Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937841] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937854] task:txg_sync        state:D stack:    0 pid:811157 ppid:     2 flags:0x00004000
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937857] Call Trace:
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937858]  <TASK>
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937861]  __schedule+0x24e/0x590
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937866]  schedule+0x69/0x110
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937868]  schedule_timeout+0x87/0x140
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937872]  ? zio_issue_async+0x12/0x20 [zfs]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937942]  ? __bpf_trace_tick_stop+0x20/0x20
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937945]  io_schedule_timeout+0x51/0x80
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937947]  __cv_timedwait_common+0x12c/0x170 [spl]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937952]  ? wait_woken+0x70/0x70
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937954]  __cv_timedwait_io+0x19/0x20 [spl]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.937958]  zio_wait+0x116/0x220 [zfs]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938008]  dsl_pool_sync+0xb6/0x400 [zfs]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938052]  ? __mod_timer+0x214/0x400
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938054]  spa_sync_iterate_to_convergence+0xe0/0x1f0 [zfs]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938101]  spa_sync+0x2dc/0x5b0 [zfs]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938142]  txg_sync_thread+0x266/0x2f0 [zfs]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938190]  ? txg_dispatch_callbacks+0x100/0x100 [zfs]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938232]  thread_generic_wrapper+0x64/0x80 [spl]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938237]  ? __thread_exit+0x20/0x20 [spl]
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938242]  kthread+0x12a/0x150
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938244]  ? set_kthread_struct+0x50/0x50
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938245]  ret_from_fork+0x22/0x30
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938248]  </TASK>
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938254] INFO: task vdev_autotrim:1940159 blocked for more than 120 seconds.
Aug 22 15:51:39 eternal.lcsr.rutgers.edu kernel: [2527082.938266]       Tainted: P           O      5.15.0-76-generic #83-Ubuntu

and similar

Turned off autotrim. The problem went away.

I did "zpool trim" while doing the same copy. It finished without problem.

I've disabled autotrim, and will use Ubuntu's normal monthly trim.

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

@clhedrick clhedrick added the Type: Defect Incorrect behavior (e.g. crash, hang) label Aug 28, 2023
@hpannenb
Copy link

hpannenb commented Dec 7, 2023

Same issue here on a Proxmox 8.1 machine (Debian 12 based):

Turning on autotrim on two pools increases the system load due to vdev_autotrim processes in "blocked" status ( ref.to. e.g. https://forum.proxmox.com/threads/proxmox-ve-8-1-released.136960/page-5#post-609592 ).

root@pve:~# [ 8218.231630] INFO: task vdev_autotrim:133705 blocked for more than 362 seconds.
[ 8218.232450]       Tainted: P          IO       6.5.11-6-pve #1
[ 8218.233156] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 8218.233570] task:vdev_autotrim   state:D stack:0     pid:133705 ppid:2      flags:0x00004000
[ 8218.234422] Call Trace:
[ 8218.234558]  <TASK>
[ 8218.235053]  __schedule+0x3fd/0x1450
[ 8218.235275]  ? __wake_up_common_lock+0x8b/0xd0
[ 8218.235913]  schedule+0x63/0x110
[ 8218.236482]  cv_wait_common+0x109/0x140 [spl]
[ 8218.237144]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 8218.237437]  __cv_wait+0x15/0x30 [spl]
[ 8218.237649]  vdev_autotrim_thread+0x797/0x9a0 [zfs]
[ 8218.238532]  ? __pfx_vdev_autotrim_thread+0x10/0x10 [zfs]
[ 8218.239042]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[ 8218.239339]  thread_generic_wrapper+0x5f/0x70 [spl]
[ 8218.240000]  kthread+0xf2/0x120
[ 8218.240580]  ? __pfx_kthread+0x10/0x10
[ 8218.240844]  ret_from_fork+0x47/0x70
[ 8218.241050]  ? __pfx_kthread+0x10/0x10
[ 8218.241249]  ret_from_fork_asm+0x1b/0x30
[ 8218.241451]  </TASK>
[ 8218.241583] INFO: task vdev_autotrim:133957 blocked for more than 362 seconds.
[ 8218.242346]       Tainted: P          IO       6.5.11-6-pve #1
[ 8218.243038] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 8218.243458] task:vdev_autotrim   state:D stack:0     pid:133957 ppid:2      flags:0x00004000
[ 8218.244294] Call Trace:
[ 8218.244440]  <TASK>
[ 8218.244926]  __schedule+0x3fd/0x1450
[ 8218.245146]  ? __wake_up_common_lock+0x8b/0xd0
[ 8218.245772]  schedule+0x63/0x110
[ 8218.246504]  cv_wait_common+0x109/0x140 [spl]
[ 8218.247378]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 8218.247698]  __cv_wait+0x15/0x30 [spl]
[ 8218.247960]  vdev_autotrim_thread+0x797/0x9a0 [zfs]
[ 8218.248969]  ? __pfx_vdev_autotrim_thread+0x10/0x10 [zfs]
[ 8218.249616]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[ 8218.249938]  thread_generic_wrapper+0x5f/0x70 [spl]
[ 8218.250584]  kthread+0xf2/0x120
[ 8218.251172]  ? __pfx_kthread+0x10/0x10
[ 8218.251396]  ret_from_fork+0x47/0x70
[ 8218.251608]  ? __pfx_kthread+0x10/0x10
[ 8218.251808]  ret_from_fork_asm+0x1b/0x30
[ 8218.252014]  </TASK>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

No branches or pull requests

2 participants