Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hung task timeout reported by CI #3387

Closed
plbossart opened this issue Jan 21, 2022 · 37 comments
Closed

Hung task timeout reported by CI #3387

plbossart opened this issue Jan 21, 2022 · 37 comments
Labels
bug Something isn't working CML Applies to Comet Lake platform disk caching & write performance HDA Applies to HD-Audio bus for codec connection P1 Blocker bugs or important features TGL Applies to Tiger Lake platform

Comments

@plbossart
Copy link
Member

https://sof-ci.sh.intel.com/#/result/planresultdetail/9293?model=CML_SKU0955_HDA&testcase=multiple-pause-resume-50

https://sof-ci.sh.intel.com/#/result/planresultdetail/9293?model=TGLH_SKU0A70_HDA&testcase=volume-basic-test-50

.===========================>>
[ 3318.829119] kernel: INFO: task kworker/u16:0:17963 blocked for more than 122 seconds.
[ 3318.829227] kernel: Not tainted 5.16.0-rc1-daily-default-20220111-0 #55744127
[ 3318.829239] kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<<===========================

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 21, 2022

A few weeks ago I spent some time investigating a similar report for a volume-basic-test. My preliminary conclusion was that the storage was overwhelmed by the volume of kernel logs.

@kv2019i kv2019i added bug Something isn't working CML Applies to Comet Lake platform HDA Applies to HD-Audio bus for codec connection TGL Applies to Tiger Lake platform labels Jan 28, 2022
@kv2019i

This comment has been minimized.

@marc-hb
Copy link
Collaborator

marc-hb commented Feb 3, 2022

Just found by chance that journald.conf has some rate limit options, worth a try.

@marc-hb
Copy link
Collaborator

marc-hb commented Feb 11, 2022

My preliminary conclusion was that the storage was overwhelmed by the volume of kernel logs.

Another side effect: on some devices we have less than 1 week of logs:

jf-cml-hel-rt5682-02 $ journalctl --disk-usage
Archived and active journals take up 3.6G in the file system.

jf-cml-hel-rt5682-02 $  journalctl -b| head
-- Logs begin at Sat 2022-02-05 20:51:08 UTC, end at Fri 2022-02-11 23:31:14 UTC. --

Would some log level tuning make sense?

cc: @fredoh9 , @keqiaozhang , @aiChaoSONG , @kv2019i

@plbossart
Copy link
Member Author

new one: https://sof-ci.01.org/linuxpr/PR3338/build7152/devicetest/?model=CML_SKU0955_HDA&testcase=check-playback-3times

looks like file system issues again

[  246.845593] kernel: INFO: task kworker/u16:3:87 blocked for more than 122 seconds.
[  246.845793] kernel:       Not tainted 5.16.0-rc1-pr3338-7152-default #35fa8225
[  246.845825] kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  246.845858] kernel: task:kworker/u16:3   state:D stack:12752 pid:   87 ppid:     2 flags:0x00004000
[  246.845878] kernel: Workqueue: writeback wb_workfn (flush-259:0)
[  246.845897] kernel: Call Trace:
[  246.845901] kernel:  
[  246.845913] kernel:  __schedule+0x2ac/0x850
[  246.845936] kernel:  schedule+0x49/0xc0
[  246.845947] kernel:  io_schedule+0x36/0x60
[  246.845957] kernel:  blk_mq_get_tag+0x117/0x280
[  246.845970] kernel:  ? lock_release+0x148/0x3f0
[  246.845979] kernel:  ? wait_woken+0x70/0x70
[  246.845993] kernel:  __blk_mq_alloc_requests+0x15e/0x2a0
[  246.846010] kernel:  blk_mq_submit_bio+0x1f6/0x810
[  246.846033] kernel:  submit_bio_noacct+0x254/0x2a0
[  246.846049] kernel:  ext4_io_submit+0x44/0x50
[  246.846059] kernel:  ext4_bio_write_page+0x145/0x3e0
[  246.846073] kernel:  mpage_submit_page+0x49/0x60
[  246.846084] kernel:  mpage_process_page_bufs+0xfc/0x110
[  246.846139] kernel:  mpage_prepare_extent_to_map+0x1d3/0x2b0
[  246.846178] kernel:  ext4_writepages+0x30c/0xeb0
[  246.846202] kernel:  ? __lock_acquire.isra.0+0x125/0x550
[  246.846238] kernel:  do_writepages+0xba/0x180
[  246.846253] kernel:  ? find_held_lock+0x32/0x90
[  246.846267] kernel:  ? sched_clock_cpu+0xd/0xd0
[  246.846285] kernel:  __writeback_single_inode+0x66/0x420
[  246.846307] kernel:  writeback_sb_inodes+0x20f/0x4d0
[  246.846341] kernel:  __writeback_inodes_wb+0x47/0xe0
[  246.846360] kernel:  wb_writeback+0x2da/0x360
[  246.846386] kernel:  wb_workfn+0x38c/0x520
[  246.846423] kernel:  process_one_work+0x291/0x530
[  246.846453] kernel:  worker_thread+0x48/0x3d0
[  246.846470] kernel:  ? process_one_work+0x530/0x530
[  246.846483] kernel:  kthread+0x175/0x1a0
[  246.846500] kernel:  ? set_kthread_struct+0x40/0x40
[  246.846520] kernel:  ret_from_fork+0x1f/0x30
[  246.846560] kernel:  
[  246.846607] kernel: 
                       Showing all locks held in the system:

@keqiaozhang
Copy link
Collaborator

Another one:
http://sof-ci.sh.intel.com/#/result/planresultdetail/10270?model=CML_SKU0955_HDA&testcase=volume-basic-test-50

2022-02-16 23:32:04 UTC [ERROR] Caught kernel log error
===========================>>
[ 3564.574060] kernel: INFO: task systemd-journal:256 blocked for more than 368 seconds.
[ 3564.574106] kernel:       Not tainted 5.16.0-rc1-daily-default-20220216-0 #5eef692b
[ 3564.574139] kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3564.574528] kernel: INFO: task kworker/u16:5:355 blocked for more than 368 seconds.
[ 3564.574561] kernel:       Not tainted 5.16.0-rc1-daily-default-20220216-0 #5eef692b
[ 3564.574594] kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

@plbossart
Copy link
Member Author

Another one: http://sof-ci.sh.intel.com/#/result/planresultdetail/10270?model=CML_SKU0955_HDA&testcase=volume-basic-test-50

2022-02-16 23:32:04 UTC [ERROR] Caught kernel log error
===========================>>
[ 3564.574060] kernel: INFO: task systemd-journal:256 blocked for more than 368 seconds.
[ 3564.574106] kernel:       Not tainted 5.16.0-rc1-daily-default-20220216-0 #5eef692b
[ 3564.574139] kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3564.574528] kernel: INFO: task kworker/u16:5:355 blocked for more than 368 seconds.
[ 3564.574561] kernel:       Not tainted 5.16.0-rc1-daily-default-20220216-0 #5eef692b
[ 3564.574594] kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

yes, but look further and it's again ext4 locking issues. Nothing apparently to do with SOF.

@marc-hb
Copy link
Collaborator

marc-hb commented Apr 8, 2022

Here's another interesting issue that happened in daily 11687?model=TGLH_SKU0A70_HDA&testcase=volume-basic-test-50

  < huge volume of logs >
 250636 Apr 07 22:23:09 sh-tglh-0a70-hda-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc3_hda_dai_trigger: cmd=3
 250637 Apr 07 22:23:09 sh-tglh-0a70-hda-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0x80010000: GLB_DAI_MSG: CONFIG
 250638 Apr 07 22:23:09 sh-tglh-0a70-hda-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx succeeded: 0x80010000: GLB_DAI_MSG: CONFIG
 250639 Apr 07 22:23:09 sh-tglh-0a70-hda-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: hda_link_dma_trigger: cmd=4
 250640 Apr 07 22:23:09 sh-tglh-0a70-hda-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc3_hda_dai_trigger: cmd=4
 250641 Apr 07 22:59:03 sh-tglh-0a70-hda-01 systemd-journald[330]: Missed 77443 kernel messages
 250642 Apr 07 22:59:03 sh-tglh-0a70-hda-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
 250643 Apr 07 22:59:03 sh-tglh-0a70-hda-01 systemd-journald[330]: Missed 2 kernel messages
 250644 Apr 07 22:59:03 sh-tglh-0a70-hda-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
 250645 Apr 07 22:59:03 sh-tglh-0a70-hda-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
 250646 Apr 07 22:59:03 sh-tglh-0a70-hda-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
 250647 Apr 07 22:59:03 sh-tglh-0a70-hda-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
 250648 Apr 07 22:59:03 sh-tglh-0a70-hda-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
 250649 Apr 07 22:59:03 sh-tglh-0a70-hda-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
   < huge volume of logs >

Note the 35 minutes gap!

@plbossart is there an easy way to reduce the log volume of this test? sof-logger has the ability to fine-tune per component log levels at run time silly me these are kernel logs.

Tracepoints maybe? #3482

@plbossart
Copy link
Member Author

Indeed this GLB_TRACE_MSG: DMA_POSITION is not very useful, we should check if we can remove it.

@marc-hb
Copy link
Collaborator

marc-hb commented May 10, 2022

This recent one also has a ton of WARN dai_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch (logs also going to the disk)

https://sof-ci.01.org/softestpr/PR903/build21/devicetest/?model=CML_SKU0955_HDA&testcase=check-pause-resume-playback-10

[     4302536.547783] (        1238.593750) c0 dai          1.3            src/audio/dai.c:943  WARN dai_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[     4302548.110282] (          11.562500) c0 dai          1.3            src/audio/dai.c:951  WARN dai_copy(): nothing to copy
[     4303786.183150] (        1238.072876) c0 dai          1.3            src/audio/dai.c:943  WARN dai_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[     4303798.058149] (          11.875000) c0 dai          1.3            src/audio/dai.c:951  WARN dai_copy(): nothing to copy
[     4305036.443517] (        1238.385376) c0 dai          1.3            src/audio/dai.c:943  WARN dai_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[     4305048.058099] (          11.614583) c0 dai          1.3            src/audio/dai.c:951  WARN dai_copy(): nothing to copy
[     4306286.078884] (        1238.020752) c0 dai          1.3            src/audio/dai.c:943  WARN dai_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[     4306297.589300] (          11.510416) c0 dai          1.3            src/audio/dai.c:951  WARN dai_copy(): nothing to copy
[     4307536.130917] (        1238.541626) c0 dai          1.3            src/audio/dai.c:943  WARN dai_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[     4307547.745500] (          11.614583) c0 dai          1.3            src/audio/dai.c:951  WARN dai_copy(): nothing to copy
[     4308785.974618] (        1238.229126) c0 dai          1.3            src/audio/dai.c:943  WARN dai_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[     4308797.589200] (          11.614583) c0 dai          1.3            src/audio/dai.c:951  WARN dai_copy(): nothing to copy
[     4310036.391235] (        1238.802002) c0 dai          1.3            src/audio/dai.c:943  WARN dai_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[     4310048.162067] (          11.770833) c0 dai          1.3            src/audio/dai.c:951  WARN dai_copy(): nothing to copy
[     4311286.703685] (        1238.541626) c0 dai          1.3            src/audio/dai.c:943  WARN dai_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[     4311298.630768] (          11.927083) c0 dai          1.3            src/audio/dai.c:951  WARN dai_copy(): nothing to copy
[     4312536.547385] (        1237.916626) c0 dai          1.3            src/audio/dai.c:943  WARN dai_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[     4312548.214051] (          11.666666) c0 dai          1.3            src/audio/dai.c:951  WARN dai_copy(): nothing to copy
[     4313786.599419] (        1238.385376) c0 dai          1.3            src/audio/dai.c:943  WARN dai_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch

@marc-hb
Copy link
Collaborator

marc-hb commented May 10, 2022

Other interesting logs just found by chance.

Also linking to

Daily 12389?model=CML_SKU0955_HDA&testcase=check-ipc-flood

   3781 May 08 22:36:39 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0xb0010000: GLB_TEST_MSG: IPC_>
   3782 May 08 22:36:39 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
   3783 May 08 22:36:39 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0xb0010000: GLB_TEST_MSG: IPC_>
   3784 May 08 22:36:39 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
   3785 May 08 22:36:39 sh-cml-sku0955-hda-01 systemd-journald[249]: Missed 3 kernel messages
   3786 May 08 22:36:39 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
   3787 May 08 22:36:39 sh-cml-sku0955-hda-01 systemd-journald[249]: Missed 7 kernel messages
   3788 May 08 22:36:39 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
   3789 May 08 22:36:39 sh-cml-sku0955-hda-01 systemd-journald[249]: Missed 9 kernel messages
   3790 May 08 22:36:39 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
   3791 May 08 22:36:39 sh-cml-sku0955-hda-01 systemd-journald[249]: Missed 6 kernel messages
   3792 May 08 22:36:39 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0xb0010000: GLB_TEST_MSG: IPC_>
   3793 May 08 22:36:39 sh-cml-sku0955-hda-01 systemd-journald[249]: Missed 5 kernel messages
   3794 May 08 22:36:39 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0xb0010000: GLB_TEST_MSG: IPC_>
   3795 May 08 22:36:39 sh-cml-sku0955-hda-01 systemd-journald[249]: Missed 6 kernel messages
   3796 May 08 22:36:39 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
   3797 May 08 22:36:39 sh-cml-sku0955-hda-01 systemd-journald[249]: Missed 6 kernel messages
   3798 May 08 22:36:39 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0xb0010000: GLB_TEST_MSG: IPC_>
   3799 May 08 22:36:39 sh-cml-sku0955-hda-01 systemd-journald[249]: Missed 6 kernel messages
   3800 May 08 22:36:39 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
   3801 May 08 22:36:39 sh-cml-sku0955-hda-01 systemd-journald[249]: Missed 6 kernel messages
   3802 May 08 22:36:39 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0xb0010000: GLB_TEST_MSG: IPC_>
   3803 May 08 22:36:39 sh-cml-sku0955-hda-01 systemd-journald[249]: Missed 7 kernel messages
   3804 May 08 22:36:39 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0xb0010000: GLB_TEST_MSG: IPC_>
   3805 May 08 22:36:39 sh-cml-sku0955-hda-01 systemd-journald[249]: Missed 9 kernel messages
   3806 May 08 22:36:39 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0xb0010000: GLB_TEST_MSG: IPC_>

...

   9810 May 08 22:36:40 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
   9811 May 08 22:36:40 sh-cml-sku0955-hda-01 systemd-journald[249]: Missed 5 kernel messages
   9812 May 08 22:36:40 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
   9813 May 08 22:36:40 sh-cml-sku0955-hda-01 systemd-journald[249]: Missed 5 kernel messages
   9814 May 08 22:36:40 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
   9815 May 08 22:36:40 sh-cml-sku0955-hda-01 systemd-journald[249]: Missed 5 kernel messages
   9816 May 08 22:36:40 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
   9817 May 08 22:36:40 sh-cml-sku0955-hda-01 systemd-journald[249]: Missed 4 kernel messages
   9818 May 08 22:36:40 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0xb0010000: GLB_TEST_MSG: IPC_>
   9819 May 08 22:36:40 sh-cml-sku0955-hda-01 systemd-journald[249]: Missed 4 kernel messages
   9820 May 08 22:36:40 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
   9821 May 08 22:36:40 sh-cml-sku0955-hda-01 systemd-journald[249]: Missed 5 kernel messages
   9822 May 08 22:36:40 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
   9823 May 08 22:36:40 sh-cml-sku0955-hda-01 systemd-journald[249]: Missed 7 kernel messages
   9824 May 08 22:36:40 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
   9825 May 08 22:36:40 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0xb0010000: GLB_TEST_MSG: IPC_>
   9826 May 08 22:36:40 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
   9827 May 08 22:36:40 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0xb0010000: GLB_TEST_MSG: IPC_>
   9828 May 08 22:36:40 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
   9829 May 08 22:36:40 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0xb0010000: GLB_TEST_MSG: IPC_>
   9830 May 08 22:36:40 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
   9831 May 08 22:36:40 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0xb0010000: GLB_TEST_MSG: IPC_>
   9832 May 08 22:36:40 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD

...

  48692 May 08 22:54:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: hda_link_dma_trigger: cmd=1
  48693 May 08 22:54:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc3_hda_dai_trigger: cmd=1
  48694 May 08 22:54:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: hda_link_dma_trigger: cmd=1
  48695 May 08 22:54:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc3_hda_dai_trigger: cmd=1
  48696 May 08 22:54:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: trigger stream 4 dir 0 cmd 1
  48697 May 08 22:54:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x160]=0x2014001e successful
  48698 May 08 22:54:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60040000: GLB_STREAM_MSG: TRIG_START
  48699 May 08 22:54:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60040000: GLB_STREAM_MSG: TR>
  48700 May 08 23:09:53 sh-cml-sku0955-hda-01 systemd-journald[5678]: Journal started
  48701 May 08 23:09:53 sh-cml-sku0955-hda-01 systemd-journald[5678]: System Journal (/var/log/journal/bf4a5c878a224c3bb2fa2ed9e284a7e8) is >
  48702 May 08 23:09:53 sh-cml-sku0955-hda-01 systemd-journald[5678]: Missed 27564 kernel messages
  48703 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x30100000: GLB_TPLG_MSG: PIPE_NEW
  48704 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x30100000: GLB_TPLG_MSG: PIPE>
  48705 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: widget PIPELINE.11.DMIC1.IN setup complete
  48706 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x30010000: GLB_TPLG_MSG: COMP_NEW
  48707 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x30010000: GLB_TPLG_MSG: COMP>
  48708 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x80010000: GLB_DAI_MSG: CONFIG
  48709 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x80010000: GLB_DAI_MSG: CONFIG
  48710 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: widget DMIC1.IN setup complete
  48711 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x30200000: GLB_TPLG_MSG: BUFFER_NEW
  48712 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x30200000: GLB_TPLG_MSG: BUFF>
  48713 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: widget BUF11.2 setup complete
  48714 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x30010000: GLB_TPLG_MSG: COMP_NEW

   ....

  50234 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: trigger stream 7 dir 1 cmd 1
  50235 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0xa0]=0x24001e successful
  50236 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60040000: GLB_STREAM_MSG: TRIG_START
  50237 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60040000: GLB_STREAM_MSG: TR>
  50238 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
  50239 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSI>
  50240 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: trigger stream 7 dir 1 cmd 0
  50241 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60050000: GLB_STREAM_MSG: TRIG_STOP
  50242 May 08 23:09:53 sh-cml-sku0955-hda-01 systemd[1]: systemd-journald.service: start operation timed out. Terminating.
  50243 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60050000: GLB_STREAM_MSG: TR>
  50244 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0xa0]=0x20240000 successful
  50245 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60030000: GLB_STREAM_MSG: PCM_FREE
  50246 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60030000: GLB_STREAM_MSG: PC>
  50247 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: free stream 7 dir 1
  50248 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x30020000: GLB_TPLG_MSG: COMP_FREE
  50249 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x30020000: GLB_TPLG_MSG: COMP>
  50250 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: widget DMIC1.IN freed

   ...

  51063 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: stream_tag 2
  51064 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60010000: GLB_STREAM_MSG: PCM_PARAMS
  51065 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60010000: GLB_STREAM_MSG: PC>
  51066 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: stream dir 1, posn mailbox offset is 790528
  51067 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: trigger stream 7 dir 1 cmd 1
  51068 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0xa0]=0x2024001e successful
  51069 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60040000: GLB_STREAM_MSG: TRIG_START
  51070 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60040000: GLB_STREAM_MSG: TR>
  51071 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
  51072 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSI>
  51073 May 08 23:09:53 sh-cml-sku0955-hda-01 systemd[1]: systemd-journald.service: State 'stop-sigterm' timed out. Killing.
  51074 May 08 23:09:53 sh-cml-sku0955-hda-01 systemd[1]: systemd-journald.service: Killing process 5273 (systemd-journal) with signal SIGKI>
  51075 May 08 23:09:53 sh-cml-sku0955-hda-01 systemd[1]: systemd-journald.service: Killing process 249 (systemd-journal) with signal SIGKIL>
  51076 May 08 23:09:53 sh-cml-sku0955-hda-01 systemd[1]: systemd-journald.service: Killing process 4892 (systemd-journal) with signal SIGKI>
  51077 May 08 23:09:53 sh-cml-sku0955-hda-01 systemd[1]: systemd-journald.service: Killing process 4895 (systemd-journal) with signal SIGKI>
  51078 May 08 23:09:53 sh-cml-sku0955-hda-01 systemd[1]: systemd-journald.service: Killing process 4896 (systemd-journal) with signal SIGKI>
  51079 May 08 23:09:53 sh-cml-sku0955-hda-01 systemd[1]: systemd-journald.service: Killing process 4899 (systemd-journal) with signal SIGKI>
  51080 May 08 23:09:53 sh-cml-sku0955-hda-01 systemd[1]: systemd-journald.service: Killing process 4902 (systemd-journal) with signal SIGKI>
  51081 May 08 23:09:53 sh-cml-sku0955-hda-01 systemd[1]: systemd-journald.service: Killing process 4423 (journal-offline) with signal SIGKI>
  51082 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
  51083 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSI>
  51084 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: trigger stream 7 dir 1 cmd 0
  51085 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60050000: GLB_STREAM_MSG: TRIG_STOP
  51086 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60050000: GLB_STREAM_MSG: TR>
  51087 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0xa0]=0x20240000 successful
  51088 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60030000: GLB_STREAM_MSG: PCM_FREE
  51089 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60030000: GLB_STREAM_MSG: PC>
  51090 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: free stream 7 dir 1
  51091 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x30020000: GLB_TPLG_MSG: COMP_FREE
  51092 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x30020000: GLB_TPLG_MSG: COMP>

...

  51422 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x30020000: GLB_TPLG_MSG: COMP_FREE
  51423 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x30020000: GLB_TPLG_MSG: COMP>
  51424 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x30110000: GLB_TPLG_MSG: PIPE_FREE
  51425 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x30110000: GLB_TPLG_MSG: PIPE>
  51426 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: widget PIPELINE.11.DMIC1.IN freed
  51427 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: widget PCM7C freed
  51428 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: close stream 7 dir 1
  51429 May 08 23:09:53 sh-cml-sku0955-hda-01 systemd[1]: systemd-journald.service: Failed with result 'timeout'.
  51430 May 08 23:09:53 sh-cml-sku0955-hda-01 systemd[1]: Failed to start Journal Service.
  51431 May 08 23:09:53 sh-cml-sku0955-hda-01 systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 7.
  51432 May 08 23:09:53 sh-cml-sku0955-hda-01 systemd[1]: Stopped Journal Service.
  51433 May 08 23:09:53 sh-cml-sku0955-hda-01 systemd[1]: Starting Journal Service...
  51434 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: open stream 7 dir 1
  51435 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: period min 192 max 16384 bytes
  51436 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: period count 2 max 16
  51437 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: buffer max 65536 bytes
  51438 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: hw params stream 7 dir 1
  51439 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0xa0]=0x40000 successful
  51440 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0xa0]=0x40000 successful
  51441 May 08 23:09:53 sh-cml-sku0955-hda-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: hda_dsp_stream_setup_bdl: period_bytes:0x1f40


@keqiaozhang
Copy link
Collaborator

keqiaozhang commented May 18, 2022

I spent a couple days on this issue and made some progress. The error messages typically mean that the system is experiencing disk or memory congestion and processes are being starved of available resources. In other words, this issue is related to IO subsystem, it caused by high I/O load and file system failed to flush the caching data from memory to disk.

Some explanations from google:

By default Linux uses up to 40% of the available memory for file system caching.
After this mark has been reached the file system flushes all outstanding data to disk causing all following IOs going synchronous.
For flushing out this data to disk this there is a time limit of 120 seconds by default.
In the case here the IO subsystem is not fast enough to flush the data within 120 seconds.
This especially happens on systems with a lot of memory.

On our devices, the file system uses up to 10% of the available memory for system caching and the expire time is 432 seconds. These indicators can be configured in pagecache settings.
You can dump the default values by $ sysctl -a | grep "vm.dirty" and you can also configure these items in /etc/sysctl.conf

$ sysctl -a | grep  "vm.dirty"
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 43200

So to avoid such issue, we can reduce the value of dirty_background_ratio(contains as a percentage of total available memory that contains free pages and reclaimable pages, the number of pages at which the background kernel flusher threads will start writing out dirty data.) and dirty_ratio(contains: as a percentage of total available memory that contains free pages and reclaimable pages the number of pages at which a process which is generating disk writes will itself start writing out dirty data.)

I did a lot of tests to tune the values for dirty_background_ratio and vm.dirty_ratio. Finally I figured out a set of suitable values for our devices and passed my personal tests and daily test.
You can refer to internal issue 233 for more details.

As for why this issue has been exposed recently, my best guess is that there're some changes in kernel and increase the I/O load, since this issue only happens on some Dell devices, these devices may use the disk(nvme) with a low read-write speed. with these two factors, it easier for file systems to hit the I/O bottleneck.

I will pay close attention to the following test results, hope this issue will disappear forever on our devices after the pagecache setting changes

@plbossart
Copy link
Member Author

@keqiaozhang I didn't click on the fact that these errors only happen on Dell devices, is this really the case?

I vaguely recall that on some Dell devices we had to change the default for the disk in the BIOS, the default setting was RAID and we had to switch to AHCI or something.

@marc-hb
Copy link
Collaborator

marc-hb commented May 19, 2022

Daily 12694?modelSoc=TGLU&model=TGLU_SKU0A32_SDCA&testcase=check-pause-resume-playback-100

root@jf-tglu-sku0a32-sdca-03:~# sysctl -a | grep vm.dirty
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 5
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 15
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 43200

May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0x60060000: GLB_STREAM_MSG: TRIG_PAUSE
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x160]=0x20140000 successful
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: pcm: trigger stream 2 dir 0 cmd 4
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x160]=0x2014001e successful
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0x60070000: GLB_STREAM_MSG: TRIG_RELEASE
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: pcm: trigger stream 2 dir 0 cmd 3
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0x60060000: GLB_STREAM_MSG: TRIG_PAUSE
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x160]=0x20140000 successful
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: INFO: task systemd-journal:253 blocked for more than 122 seconds.
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:       Not tainted 5.18.0-rc2-daily-default-20220518-0-g79973093ab8c #dev
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: task:systemd-journal state:D stack:12864 pid:  253 ppid:     1 flags:0x00000000
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: Call Trace:
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  <TASK>
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  __schedule+0x2b4/0x8c0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  ? sched_clock_cpu+0xd/0xd0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  schedule+0x4e/0xb0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  ? __folio_lock+0x11e/0x1f0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  io_schedule+0x12/0x40
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  __folio_lock+0x11e/0x1f0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  ? filemap_invalidate_unlock_two+0x40/0x40
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  ? ext4_da_release_space+0x150/0x150
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  block_page_mkwrite+0x14a/0x170
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  ext4_page_mkwrite+0x54d/0x700
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  ? sched_clock_cpu+0xd/0xd0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  do_page_mkwrite+0x52/0xf0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  do_wp_page+0x3de/0x6b0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  __handle_mm_fault+0x6a5/0xcd0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  ? sched_clock_cpu+0xd/0xd0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  ? sched_clock_cpu+0xd/0xd0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  handle_mm_fault+0x7a/0x1c0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  do_user_addr_fault+0x1d7/0x660
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  exc_page_fault+0x64/0x1b0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  ? asm_exc_page_fault+0x8/0x30
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  asm_exc_page_fault+0x1e/0x30
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: RIP: 0033:0x7f6994183c51
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: RSP: 002b:00007ffc96344f00 EFLAGS: 00010246
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: RAX: 00007f698d2f2b38 RBX: 00007ffc96344f50 RCX: 0000000000000001
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: RDX: 0000000000000000 RSI: 00007f69942d81bc RDI: 0000558a67ae1f80
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: RBP: 0000558a67af1130 R08: 00000000040e9328 R09: 0000558a67ae1f80
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: R10: 00000000000dad20 R11: 0000000000000000 R12: 0000000000000026
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: R13: 00000000002f43c0 R14: 00007ffc96344f30 R15: 0000558a67b22900
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  </TASK>
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: INFO: task kworker/u16:0:3719 blocked for more than 122 seconds.
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:       Not tainted 5.18.0-rc2-daily-default-20220518-0-g79973093ab8c #dev
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: task:kworker/u16:0   state:D stack:12968 pid: 3719 ppid:     2 flags:0x00004000
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: Workqueue: writeback wb_workfn (flush-259:0)
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: Call Trace:
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  <TASK>
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  __schedule+0x2b4/0x8c0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  ? __blk_flush_plug+0xdf/0x120
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  schedule+0x4e/0xb0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  io_schedule+0x12/0x40
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  blk_mq_get_tag+0x11c/0x280
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  ? sched_clock_cpu+0xd/0xd0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  ? complete_all+0x60/0x60
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  __blk_mq_alloc_requests+0x188/0x2e0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  ? blk_attempt_plug_merge+0x76/0x90
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  blk_mq_submit_bio+0x421/0x700
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  submit_bio_noacct_nocheck+0x259/0x2b0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  ext4_io_submit+0x20/0x40
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  ext4_bio_write_page+0x16e/0x520
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  mpage_submit_page+0x4e/0x70
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  mpage_process_page_bufs+0xf8/0x110
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  mpage_prepare_extent_to_map+0x1f9/0x450
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  ext4_writepages+0x311/0xeb0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  do_writepages+0xbf/0x180
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  ? find_held_lock+0x32/0x90
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  ? sched_clock_cpu+0xd/0xd0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  __writeback_single_inode+0x6f/0x470
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  writeback_sb_inodes+0x214/0x4d0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  __writeback_inodes_wb+0x4c/0xe0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  wb_writeback+0x2df/0x360
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  wb_workfn+0x304/0x500
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  process_one_work+0x29c/0x530
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  worker_thread+0x4d/0x3c0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  ? process_one_work+0x530/0x530
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  kthread+0xf0/0x120
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  ? kthread_complete_and_exit+0x20/0x20
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  ret_from_fork+0x1f/0x30
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  </TASK>
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: 
                                                Showing all locks held in the system:
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: 1 lock held by khungtaskd/56:
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  #0: ffffffffa7d65cc0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0xf7
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: 3 locks held by systemd-journal/253:
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  #0: ffff9ffb0154b078 (&mm->mmap_lock#2){....}-{3:3}, at: do_user_addr_fault+0x136/0x660
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  #1: ffff9ffb0c310580 (sb_pagefaults){....}-{0:0}, at: do_page_mkwrite+0x52/0xf0
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  #2: ffff9ffb0f4405f0 (mapping.invalidate_lock#2){....}-{3:3}, at: ext4_page_mkwrite+0xff/0x700
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: 1 lock held by in:imklog/546:
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  #0: ffff9ffb06b9f6f8 (&f->f_pos_lock){....}-{3:3}, at: __fdget_pos+0x4a/0x60
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: 4 locks held by kworker/u16:0/3719:
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  #0: ffff9ffb0190d148 ((wq_completion)writeback){....}-{0:0}, at: process_one_work+0x208/0x530
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  #1: ffffb78a4358fe58 ((work_completion)(&(&wb->dwork)->work)){....}-{0:0}, at: process_one_work+0x208/0x530
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  #2: ffff9ffb0c3100e8 (&type->s_umount_key#35){....}-{3:3}, at: trylock_super+0x16/0x50
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel:  #3: ffff9ffb0c311c80 (&sbi->s_writepages_rwsem){....}-{0:0}, at: do_writepages+0xbf/0x180
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: 4 locks held by kworker/5:3/12355:
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: 
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: =============================================
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: pcm: trigger stream 2 dir 0 cmd 4
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x160]=0x2014001e successful
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0x60070000: GLB_STREAM_MSG: TRIG_RELEASE
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: pcm: trigger stream 2 dir 0 cmd 3
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0x60060000: GLB_STREAM_MSG: TRIG_PAUSE
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x160]=0x20140000 successful
May 18 23:15:27 jf-tglu-sku0a32-sdca-03 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: pcm: trigger stream 2 dir 0 cmd 4

@marc-hb
Copy link
Collaborator

marc-hb commented May 19, 2022

I tried booting jf-tglu-sku0a32-sdca-03 in both AHCI and RAID/VMD modes and I compared the boot logs and lsblk. I found almost no difference. The only difference I could find is these two additional ACPI entries. They seem completely ignored by our kernel.

May 19 01:08:00 jf-tglu-sku0a32-sdca-03 kernel: ACPI: UEFI 0x0000000063CE3000 00063A (v01 INTEL  RstVmdE  00000000 ??   00000000)
May 19 01:08:00 jf-tglu-sku0a32-sdca-03 kernel: ACPI: UEFI 0x0000000063CE2000 00005C (v01 INTEL  RstVmdV  00000000 ??   00000000)

BIOS_AHCI_vs_RAID

I will run some drive tests in both modes tomorrow.

@marc-hb
Copy link
Collaborator

marc-hb commented May 19, 2022

ACPI: UEFI 0x0000000063CE3000 00063A (v01 INTEL RstVmdE

Rst stands for Rapid Storage Technology.

From https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/rst-linux-paper.pdf

The primary benefit of using Intel RST is in the presence of an Intel RST option ROM where the system can boot directly from any Intel RST RAID volume type instead of creating a dedicated partition or using a RAID superblock partition to store the bootloader.
The following sections describe the different aspects of Intel RST support in Linux.

RST can cause problems when actually using RAID and/or dual booting with Windows. A typical mistake seems to be switching from RAID to AHCI without letting Windows know about that. This makes Windows unbootable.

I haven't found anyone reporting any performance difference.

@keqiaozhang
Copy link
Collaborator

@keqiaozhang I didn't click on the fact that these errors only happen on Dell devices, is this really the case?

Yes, from the issue history, it only happened on below 4 Dell platforms.
image

I vaguely recall that on some Dell devices we had to change the default for the disk in the BIOS, the default setting was RAID and we had to switch to AHCI or something.

Yes, but we chose AHCI mode for all Dell devices, at least for SH devices.

@marc-hb
Copy link
Collaborator

marc-hb commented May 20, 2022

jf-tglu-sku0a32-sdca-03 crashed again. Again, it crashed when logging was heavy.

This caused another "time warp". I took that opportunity to explain how journalctl works and what to do and not to do with journalctl when a time warp happen:

tl;dr: always use journalctl's -b option.

@keqiaozhang
Copy link
Collaborator

keqiaozhang commented May 20, 2022

It seems easier to hit this issue on jf-tglu-sku0a32-sdca-01, so I compared the hardware configuration of 3 TGLU_SKU0A32_SDCA devices and found that jf-tglu-sku0a32-sdca-01 only has 8G memory, it's a low-end Andrews. we may need special cache setting for this device. So I re-tuned them and extended the timeout from 120 seconds to 360 seconds(/proc/sys/kernel/hung_task_timeout_secs). Now the cache settings for these 3 devices are:

Hostname Memory Size Disk Size dirty_background_ratio vm.dirty_ratio hung_task_timeout_secs
jf-tglu-sku0a32-sdca-01 8G 256G 5 10 360
jf-tglu-sku0a32-sdca-02 16G 512G 3 6 360
jf-tglu-sku0a32-sdca-03 16G 512G 3 6 360

I triggered 6 rounds of daily tests on these 3 devices separately, but 2 rounds of tests failed. It seems that adjusting pagecache settings can only reduce the probability of occurrence, but it cannot completely avoid this issue.

@marc-hb
Copy link
Collaborator

marc-hb commented May 21, 2022

I will run some drive tests in both modes tomorrow.

I think jf-tglu-sku0a32-sdca-03 (and maybe other similar devices) have temporary storage issues. I ran bonnie++ in AHCI mode and I found write performance temporarily half as good as normally expected!

# sudo bonnie++ -u root -n 0

Version  1.98       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
jf-tglu-sku0a32 31G  370k  99  400m  54  276m  28 1034k  99  912m  42 +++++ +++
Latency             25593us     353ms     646ms   21095us    3211us   18886us

So I switched to "RAID On)" (Rst) and the write performace doubled:

Version  1.98       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
jf-tglu-sku0a32 31G  654k  99  753m  71  562m  45 3311k  99  1.3g  60 +++++ +++
Latency             16010us     207ms   93349us    3515us     596us    7776us

After switching back to AHCI, the performance stayed high!!

I tried jf-tglu-sku0a32-sdca-01 in both modes and the performance was always in the high range.

Flash storage and wear leveling are very complex. Maybe storage on those devices is starting to wear out which could cause hangs? Or the drive firmware could just be buggy.

https://en.wikipedia.org/wiki/Wear_leveling https://www.bunniestudios.com/blog/?p=3554

@marc-hb
Copy link
Collaborator

marc-hb commented May 21, 2022

To make things worse, I've narrowed down a graphical issue on these devices. It's unrelated but it does not help. They're stuck to VT1. In other words, chvt 3 works but it's "blind": you can login but can't see anything because the display is stuck to VT1. The stock Ubuntu kernel is fine, this happens only with our current kernel. This should be filed in a different bug.

EDIT, mystery solved:

@marc-hb
Copy link
Collaborator

marc-hb commented May 23, 2022

The storage on (some of) those systems is definitely unreliable. @keqiaozhang please restore all the default vm_dirty values, I think they only move some thresholds without actually avoiding the issue.

I reserved jf-tglu-sku0a32-sdca-03 for some testing and I experienced another hang while apt upgrade was merely performing an update-initramfs

                                                Showing all locks held in the system:
May 23 18:26:08 jf-tglu-sku0a32-sdca-03 kernel: 1 lock held by khungtaskd/56:
May 23 18:26:08 jf-tglu-sku0a32-sdca-03 kernel:  #0: ffffffff8a825560 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x28/0x14f
May 23 18:26:08 jf-tglu-sku0a32-sdca-03 kernel: 3 locks held by kworker/u16:1/7327:
May 23 18:26:08 jf-tglu-sku0a32-sdca-03 kernel:  #0: ffff8881033f2948 ((wq_completion)writeback){....}-{0:0}, at: process_one_work+0x524/0xa20
May 23 18:26:08 jf-tglu-sku0a32-sdca-03 kernel:  #1: ffff88824f53fdd0 ((work_completion)(&(&wb->dwork)->work)){....}-{0:0}, at: process_one_work+0x524/0xa20
May 23 18:26:08 jf-tglu-sku0a32-sdca-03 kernel:  #2: ffff888101ab10e8 (&type->s_umount_key#50){....}-{3:3}, at: trylock_super+0x16/0x70
May 23 18:26:08 jf-tglu-sku0a32-sdca-03 kernel: 2 locks held by in:imklog/14325:
May 23 18:26:08 jf-tglu-sku0a32-sdca-03 kernel: 3 locks held by rs:main Q:Reg/14326:
May 23 18:26:08 jf-tglu-sku0a32-sdca-03 kernel: 1 lock held by systemd-journal/16121:
May 23 18:26:08 jf-tglu-sku0a32-sdca-03 kernel: 1 lock held by sync/28706:
May 23 18:26:08 jf-tglu-sku0a32-sdca-03 kernel:  #0: ffff8881314510e8 (&type->s_umount_key#35){....}-{3:3}, at: __x64_sys_syncfs+0x5c/0xc0
May 23 18:26:08 jf-tglu-sku0a32-sdca-03 kernel: 2 locks held by pager/29360:
May 23 18:26:08 jf-tglu-sku0a32-sdca-03 kernel:  #0: ffff888112ccb8a0 (&tty->ldisc_sem){....}-{0:0}, at: tty_ldisc_ref_wait+0x24/0x60
May 23 18:26:08 jf-tglu-sku0a32-sdca-03 kernel:  #1: ffffc9000154b2f0 (&ldata->atomic_read_lock){....}-{3:3}, at: n_tty_read+0x2b5/0x940
May 23 18:26:08 jf-tglu-sku0a32-sdca-03 kernel: 
May 23 18:26:08 jf-tglu-sku0a32-sdca-03 kernel: =============================================

Immediately after I tried:

root@jf-tglu-sku0a32-sdca-03:~# time sync

real	0m27.269s
user	0m0.000s
sys	    0m0.010s

root@jf-tglu-sku0a32-sdca-03:~# time sync

real	0m0.010s
user	0m0.000s
sys	    0m0.008s

BTW the whole apt upgrade took a very unusually long time.

Another thing that took an unusually long time just now:

 bonnie++ -u root -n 0
Version  1.98       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
jf-tglu-sku0a32 27G   23k  98 51.2m  99 84.1m  95  934k  99  395m  99 +++++ +++
Latency               677ms     124ms   99074us   11850us   17575us   18769us

That's 30 times slower than above on the same system!!

@marc-hb
Copy link
Collaborator

marc-hb commented May 23, 2022

That's 30 times slower than above on the same system!!

Ignore the last comment sorry. I was very unlucky and happened to test our special weekly kernel (unfortunately called 5.18.0-rc2-DAILY-default-20220521-g8a9e9090617) which is made mostly unusable because of all these debug options: https://github.com/thesofproject/kconfig/blob/21c34631a8517/memory-debug-defconfig
Thanks @fredoh9 for solving this mystery!

EDIT: we should check whether some of the reports above happened with a weekly kernel.

marc-hb added a commit to marc-hb/kconfig that referenced this issue May 24, 2022
When graphics fail, a blank screen is not useful.

Even when graphics work, without a framebuffer console chvt still
works but it is "blind" / with a frozen GUI instead! Incredibly
confusing. Linux distributions expect this.

Also useful in case of hangs like
thesofproject/linux#3387: text consoles
are always more responsive and they may even have some logs

Signed-off-by: Marc Herbert <[email protected]>
@keqiaozhang
Copy link
Collaborator

@marc-hb I restored all vm.dirty values on all our Dell devices. But from test ID:
http://sof-ci.sh.intel.com/#/result/planresultdetail/12825(aborted by me).
Almost all Dell devices(TGLH_SKU0A70_HDA TGLU_SKU0A32_SDCA and CML_SKU0955_HDA) run to the timeout issue several times.

But before the reproduction rate is low after adjusting the vm_dirty. Especially for CML_SKU0955_HDA. Therefore, the hung task timeout issue can be effectively avoided on some platforms by applying the workaround. So I changed it back and triggered the daily test again:
https://sof-ci.sh.intel.com/#/result/planresultdetail/12825.
Only TGLU_SKU0A32_SDCA met such issue. So I think we can apply this workaround until we figure out a better solution. At least no side effects have been found.

marc-hb added a commit to marc-hb/sof-test that referenced this issue May 24, 2022
We've experienced a fair number of test failures that seem to point to
storage performance issues:

thesofproject/linux#3387
thesofproject/linux#3669

This (temporary?) addition runs a quick write test after each audio test
to monitor storage sanity.

As a bonus feature the "sync" could help us collect more logs.

On our (slowest) BYT devices the test adds 3s per test; much less on
newer devices.

Sample output:

```
2022-05-24 23:20:41 UTC [INFO] pkill -TERM sof-logger
2022-05-24 23:20:42 UTC [INFO] nlines=1132 /home/mherber2/SOF/sof-test/logs/BOGUS-check-playback/2022-05-24-16:20:35-3288/slogger.txt
+ timeout -s CONT 5 sudo sync

real	0m0.062s
user	0m0.005s
sys	0m0.019s
+ timeout -s CONT 10 dd if=/dev/zero of=/home/mherber2/HD_TEST_DELETE_ME bs=1M count=200 conv=fsync
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 2.0893 s, 100 MB/s
+ timeout -s CONT 5 sudo sync

real	0m0.037s
user	0m0.004s
sys	0m0.018s
2022-05-24 23:20:44 UTC [INFO] Test Result: PASS!
```

Signed-off-by: Marc Herbert <[email protected]>
marc-hb added a commit to marc-hb/sof-test that referenced this issue May 25, 2022
We've experienced a fair number of test failures that seem to point to
storage performance issues:

thesofproject/linux#3387
thesofproject/linux#3669

This (temporary?) addition runs a quick write test after each audio test
to monitor storage sanity.

As a bonus feature the "sync" could help us collect more logs.

On our (slowest) BYT devices the test adds 3s per test; much less on
newer devices.

Sample output:

```
2022-05-24 23:20:41 UTC [INFO] pkill -TERM sof-logger
2022-05-24 23:20:42 UTC [INFO] nlines=1132 /home/mherber2/SOF/sof-test/logs/BOGUS-check-playback/2022-05-24-16:20:35-3288/slogger.txt
+ timeout -s CONT 5 sudo sync

real	0m0.062s
user	0m0.005s
sys	0m0.019s
+ timeout -s CONT 10 dd if=/dev/zero of=/home/mherber2/HD_TEST_DELETE_ME bs=1M count=200 conv=fsync
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 2.0893 s, 100 MB/s
+ timeout -s CONT 5 sudo sync

real	0m0.037s
user	0m0.004s
sys	0m0.018s
2022-05-24 23:20:44 UTC [INFO] Test Result: PASS!
```

Signed-off-by: Marc Herbert <[email protected]>
@marc-hb
Copy link
Collaborator

marc-hb commented May 25, 2022

I'm adding a quick storage sanity check + sync in

Please help review.

marc-hb added a commit to thesofproject/sof-test that referenced this issue May 25, 2022
We've experienced a fair number of test failures that seem to point to
storage performance issues:

thesofproject/linux#3387
thesofproject/linux#3669

This (temporary?) addition runs a quick write test after each audio test
to monitor storage sanity.

As a bonus feature the "sync" could help us collect more logs.

On our (slowest) BYT devices the test adds 3s per test; much less on
newer devices.

Sample output:

```
2022-05-24 23:20:41 UTC [INFO] pkill -TERM sof-logger
2022-05-24 23:20:42 UTC [INFO] nlines=1132 /home/mherber2/SOF/sof-test/logs/BOGUS-check-playback/2022-05-24-16:20:35-3288/slogger.txt
+ timeout -s CONT 5 sudo sync

real	0m0.062s
user	0m0.005s
sys	0m0.019s
+ timeout -s CONT 10 dd if=/dev/zero of=/home/mherber2/HD_TEST_DELETE_ME bs=1M count=200 conv=fsync
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 2.0893 s, 100 MB/s
+ timeout -s CONT 5 sudo sync

real	0m0.037s
user	0m0.004s
sys	0m0.018s
2022-05-24 23:20:44 UTC [INFO] Test Result: PASS!
```

Signed-off-by: Marc Herbert <[email protected]>
@keqiaozhang
Copy link
Collaborator

Thanks for @ujfalusi suggestion.

@marc-hb and @plbossart
I verified this issue with latest linux-next kernel https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/ and this issue cannot be reproduced anymore.

Here is the first round of test result for 2 Dell platforms which are easy to hit this issue: https://sof-ci.sh.intel.com/#/result/planresultdetail/12875
There's only one failure and caused by the new debugging code for checking the write speed of storage.

So I triggered another round of test on all 4 Dell platforms and also remove the workaround(pagecache setting), the results are really good and consistent.
Please refer to: https://sof-ci.sh.intel.com/#/result/planresultdetail/12877

This is really a good news, it seems that this issue has been fixed in latest upstream kernel.

@plbossart
Copy link
Member Author

@keqiaozhang can you try with 5.18 to see if that already fixes the problem? linux-next includes stuff that will be provided later, or in 5.19-rc1 and that in itself will add other issues.

@marc-hb
Copy link
Collaborator

marc-hb commented May 25, 2022

This is really a good news, it seems that this issue has been fixed in latest upstream kernel.

Maybe a newer kernel has marginally better performance but that's IMHO only hiding the real, device-specific issue here.

As noticed by @plbossart , we've observed these hangs only on certain TGL devices with (normally!) very high performance NVMe storage. I just discovered that the storage performance on some of our old BYT devices is ONE HUNDRED times lower than on those recent TGL devices:

Yet we never ever noticed anything like this on BYT!? Something does not add up.

Anyway we're not in the storage business so whatever hides this issue is good enough for me.

EDIT: CML TIMEOUT in sync in https://sof-ci.01.org/linuxpr/PR3645/build166/devicetest

plbossart pushed a commit to thesofproject/kconfig that referenced this issue May 25, 2022
When graphics fail, a blank screen is not useful.

Even when graphics work, without a framebuffer console chvt still
works but it is "blind" / with a frozen GUI instead! Incredibly
confusing. Linux distributions expect this.

Also useful in case of hangs like
thesofproject/linux#3387: text consoles
are always more responsive and they may even have some logs

Signed-off-by: Marc Herbert <[email protected]>
@keqiaozhang
Copy link
Collaborator

Yet we never ever noticed anything like this on BYT!? Something does not add up.

Be noticed that some BYT devices use the SD card for storage, not SSD or NVME. please check the storage size for the low performance BYT devices. the storage size should be 16G or 32G if it uses SD card.

@marc-hb
Copy link
Collaborator

marc-hb commented May 26, 2022

Be noticed that some BYT devices use the SD card for storage, not SSD or NVME.

Very interesting but that's not my point. My point is: incredibly slow BYT storage never triggered any "hung task" or TIMEOUT. So this is not just a "slow storage" issue, something else is really wrong with these devices and we still have no idea what it is. Maybe storage hangs sometimes, or maybe it's not even storage. So neither the newer kernel versions or vm_dirty changes can really fix these issues, they only make it less likely. So these failures will be back.

@keqiaozhang
Copy link
Collaborator

keqiaozhang commented May 26, 2022

@keqiaozhang can you try with 5.18 to see if that already fixes the problem? linux-next includes stuff that will be provided later, or in 5.19-rc1 and that in itself will add other issues.

I tested with https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git/log/?h=for-5.18 kernel before, the hung task timeout issue is still reproducible.
The last update of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git/log/?h=for-5.18 is 14 days ago. Same commit ID as I tested before.

image

Then I tested with today's for-next, the hung task timeout issue remains.

I also checked with linux-next https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tag/?h=v5.18, the hung task timeout is still exists.

So it seems that the fix only landed on linux-next master branch.

@marc-hb marc-hb added the P1 Blocker bugs or important features label May 27, 2022
@marc-hb
Copy link
Collaborator

marc-hb commented May 27, 2022

Reproduced again in daily 12927?model=TGLU_SKU0A32_SDCA&testcase=multiple-pause-resume-50

I had a look at the results of the quick storage test I just added in thesofproject/sof-test@34ca191b92a32 and write speed was blazingly fast as usual before and after this failure, zero issue found with actual storage.

Start Time: 2022-05-26 22:28:02 UTC
End Time: 2022-05-27 04:34:14 UTC
Kernel Branch: topic/sof-dev
Kernel Commit: 46f61a08b40c
SOF Branch: main
SOF Commit: 13b659259de2

May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: 
                                                Showing all locks held in the system:
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: 1 lock held by khungtaskd/56:
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  #0: ffffffff94b65cc0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0xf7
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: 1 lock held by journal-offline/20653:
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  #0: ffff9cdb49d06c80 (&sbi->s_writepages_rwsem){....}-{0:0}, at: do_writepages+0xbf/0x1>
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: 1 lock held by in:imklog/581:
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  #0: ffff9cdb4bac9af8 (&f->f_pos_lock){....}-{3:3}, at: __fdget_pos+0x4a/0x60
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: 4 locks held by kworker/u16:1/7388:
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  #0: ffff9cdb41002548 ((wq_completion)writeback){....}-{0:0}, at: process_one_work+0x208>
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  #1: ffffa3a5c8053e58 ((work_completion)(&(&wb->dwork)->work)){....}-{0:0}, at: process_>
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  #2: ffff9cdb49d020e8 (&type->s_umount_key#35){....}-{3:3}, at: trylock_super+0x16/0x50
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  #3: ffff9cdb49d06c80 (&sbi->s_writepages_rwsem){....}-{0:0}, at: do_writepages+0xbf/0x1>
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: 3 locks held by kworker/0:2/18951:
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0x60070000: GLB_STREAM_MSG: TRIG_RELEASE
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: pcm: trigger stream 1 dir 1 cmd 3
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0x60060000: GLB_STREAM_MSG: TRIG_PAUSE
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0xa0]=0x20240000 successful
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: pcm: trigger stream 1 dir 1 cmd 4
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0xa0]=0x2024001e successful
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0x60070000: GLB_STREAM_MSG: TRIG_RELEASE
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: INFO: task journal-offline:20653 blocked for more than 245 seconds.
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:       Tainted: G          I       5.18.0-rc2-daily-default-20220526-0-g46f61a08b40c #dev
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: task:journal-offline state:D stack:13784 pid:20653 ppid:     1 flags:0x00004000
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: Call Trace:
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  <TASK>
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  __schedule+0x2b4/0x8c0
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  ? __blk_flush_plug+0xdf/0x120
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  schedule+0x4e/0xb0
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  ? folio_wait_bit+0xdd/0x1d0
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  io_schedule+0x12/0x40
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  folio_wait_bit+0xdd/0x1d0
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  ? filemap_invalidate_unlock_two+0x40/0x40
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  folio_wait_writeback+0x56/0xb0
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  mpage_prepare_extent_to_map+0x185/0x450
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  ext4_writepages+0x311/0xeb0
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  do_writepages+0xbf/0x180
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  ? __lock_acquire.isra.0+0x125/0x550
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  file_write_and_wait_range+0xa6/0xe0
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  ext4_sync_file+0x1a5/0x3e0
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  do_fsync+0x38/0x70
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  __x64_sys_fsync+0x10/0x20
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  do_syscall_64+0x38/0x90
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: RIP: 0033:0x7fc65e7771fb
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: RSP: 002b:00007fc6548fdc70 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: RAX: ffffffffffffffda RBX: 0000563bbce09990 RCX: 00007fc65e7771fb
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: RDX: 00007fc65e5c69de RSI: 00007fc65e5c69de RDI: 0000000000000012
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: RBP: 00007fc65e5c81a0 R08: 0000000000000000 R09: 00007fc6548fe700
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: R10: 000000000000001e R11: 0000000000000293 R12: 0000000000000002
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: R13: 00007ffc382eec6f R14: 00007ffc382eec70 R15: 00007fc6548fdd80
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  </TASK>
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: INFO: task kworker/u16:1:7388 blocked for more than 245 seconds.
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:       Tainted: G          I       5.18.0-rc2-daily-default-20220526-0-g46f61a08b40c #dev
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: task:kworker/u16:1   state:D stack:12920 pid: 7388 ppid:     2 flags:0x00004000
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: Workqueue: writeback wb_workfn (flush-259:0)
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: Call Trace:
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  <TASK>
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  __schedule+0x2b4/0x8c0
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  ? __blk_flush_plug+0xdf/0x120
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  schedule+0x4e/0xb0
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  io_schedule+0x12/0x40
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  blk_mq_get_tag+0x11c/0x280
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  ? sched_clock_cpu+0xd/0xd0
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  ? complete_all+0x60/0x60
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  __blk_mq_alloc_requests+0x188/0x2e0
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  ? blk_attempt_plug_merge+0x76/0x90
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  blk_mq_submit_bio+0x421/0x700
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  submit_bio_noacct_nocheck+0x259/0x2b0
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  ext4_io_submit+0x20/0x40
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  ext4_bio_write_page+0x16e/0x520
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  mpage_submit_page+0x4e/0x70
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  mpage_process_page_bufs+0xf8/0x110
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  mpage_prepare_extent_to_map+0x1f9/0x450
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  ext4_writepages+0x311/0xeb0
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  do_writepages+0xbf/0x180
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  ? find_held_lock+0x32/0x90
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  ? sched_clock_cpu+0xd/0xd0
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  __writeback_single_inode+0x6f/0x470
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  writeback_sb_inodes+0x214/0x4d0
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  __writeback_inodes_wb+0x4c/0xe0
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  wb_writeback+0x2df/0x360
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  wb_workfn+0x371/0x500
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  process_one_work+0x29c/0x530
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  worker_thread+0x4d/0x3c0
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  ? process_one_work+0x530/0x530
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  kthread+0xf0/0x120
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  ? kthread_complete_and_exit+0x20/0x20
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  ret_from_fork+0x1f/0x30
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  </TASK>
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: 
                                                Showing all locks held in the system:
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: 1 lock held by khungtaskd/56:
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  #0: ffffffff94b65cc0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0xf7
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: 1 lock held by journal-offline/20653:
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  #0: ffff9cdb49d06c80 (&sbi->s_writepages_rwsem){....}-{0:0}, at: do_writepages+0xbf/0x1>
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: 1 lock held by in:imklog/581:
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  #0: ffff9cdb4bac9af8 (&f->f_pos_lock){....}-{3:3}, at: __fdget_pos+0x4a/0x60
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: 4 locks held by kworker/u16:1/7388:
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  #0: ffff9cdb41002548 ((wq_completion)writeback){....}-{0:0}, at: process_one_work+0x208>
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  #1: ffffa3a5c8053e58 ((work_completion)(&(&wb->dwork)->work)){....}-{0:0}, at: process_>
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  #2: ffff9cdb49d020e8 (&type->s_umount_key#35){....}-{3:3}, at: trylock_super+0x16/0x50
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel:  #3: ffff9cdb49d06c80 (&sbi->s_writepages_rwsem){....}-{0:0}, at: do_writepages+0xbf/0x1>
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: 3 locks held by kworker/0:2/18951:
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: 
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: =============================================
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: pcm: trigger stream 7 dir 0 cmd 3
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0x60060000: GLB_STREAM_MSG: TRIG_PAUSE
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x160]=0x20140000 successful
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc3_hda_dai_trigger: cmd=3 dai iDisp3 Pin directi>
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0x80010000: GLB_DAI_MSG: CONFIG
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc3_hda_dai_trigger: cmd=4 dai iDisp3 Pin directi>
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: pcm: trigger stream 7 dir 0 cmd 4
May 26 23:32:52 jf-tglu-sku0a32-sdca-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x160]=0x2014001e successful

@marc-hb
Copy link
Collaborator

marc-hb commented May 27, 2022

So it seems that the fix only landed on linux-next master branch.

This issue is intermittent, so the lack of reproduction can be just luck.

@marc-hb
Copy link
Collaborator

marc-hb commented Jun 2, 2022

Whatever the problem is, it is definitely triggered by more intense logging. The sudo sync I recently added makes that obvious.

In daily test run 13008?model=TGLH_SKU0A70_HDA&testcase=check-ipc-flood

after 10,000 lines like this:

May 31 22:32:02 sh-tglh-0a70-hda-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
May 31 22:32:02 sh-tglh-0a70-hda-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
May 31 22:32:02 sh-tglh-0a70-hda-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
May 31 22:32:02 sh-tglh-0a70-hda-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
...

May 31 22:32:36 sh-tglh-0a70-hda-01 sudo[3291]:   ... ; USER=root ; COMMAND=/usr/bin/sync

May 31 22:34:22 sh-tglh-0a70-hda-01 kernel: INFO: task kworker/u32:2:113 blocked for more than 122 seconds.
May 31 22:34:22 sh-tglh-0a70-hda-01 kernel:       Not tainted 5.18.0-rc2-daily-default-20220531-0-g46f61a08b40c #dev
May 31 22:34:22 sh-tglh-0a70-hda-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 31 22:34:22 sh-tglh-0a70-hda-01 kernel: task:kworker/u32:2   state:D stack:13032 pid:  113 ppid:     2 flags:0x00004000
May 31 22:34:22 sh-tglh-0a70-hda-01 kernel: Workqueue: writeback wb_workfn (flush-259:0)
May 31 22:34:22 sh-tglh-0a70-hda-01 kernel: Call Trace:
May 31 22:34:22 sh-tglh-0a70-hda-01 kernel:  <TASK>
May 31 22:34:22 sh-tglh-0a70-hda-01 kernel:  __schedule+0x2b4/0x8c0
May 31 22:34:22 sh-tglh-0a70-hda-01 kernel:  ? __blk_flush_plug+0xdf/0x120
May 31 22:34:22 sh-tglh-0a70-hda-01 kernel:  schedule+0x4e/0xb0
May 31 22:34:22 sh-tglh-0a70-hda-01 kernel:  io_schedule+0x12/0x40
May 31 22:34:22 sh-tglh-0a70-hda-01 kernel: INFO: task kworker/u32:2:113 blocked for more than 122 seconds.
May 31 22:34:22 sh-tglh-0a70-hda-01 kernel: INFO: task sync:3213 blocked for more than 122 seconds.
May 31 22:48:09 sh-tglh-0a70-hda-01 kernel: INFO: task kworker/u32:2:113 blocked for more than 245 seconds.
May 31 22:48:09 sh-tglh-0a70-hda-01 kernel: INFO: task sync:3213 blocked for more than 245 seconds.
May 31 22:48:09 sh-tglh-0a70-hda-01 kernel: INFO: task sync:3292 blocked for more than 122 seconds.
May 31 22:48:09 sh-tglh-0a70-hda-01 kernel: INFO: task kworker/u32:2:113 blocked for more than 368 seconds.
May 31 22:48:09 sh-tglh-0a70-hda-01 kernel: INFO: task journal-offline:3349 blocked for more than 122 seconds.
May 31 22:48:09 sh-tglh-0a70-hda-01 kernel: INFO: task sync:3213 blocked for more than 368 seconds.
May 31 22:48:09 sh-tglh-0a70-hda-01 kernel: INFO: task sync:3292 blocked for more than 245 seconds.
May 31 22:48:09 sh-tglh-0a70-hda-01 kernel: INFO: task kworker/u32:2:113 blocked for more than 491 seconds.

Same on the day before; 10,000 lines like these in only 2 seconds and then hang after sync.

May 30 22:32:33 sh-tglh-0a70-hda-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
May 30 22:32:33 sh-tglh-0a70-hda-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
May 30 22:32:33 sh-tglh-0a70-hda-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
May 30 22:32:33 sh-tglh-0a70-hda-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD
May 30 22:32:33 sh-tglh-0a70-hda-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0xb0010000: GLB_TEST_MSG: IPC_FLOOD

....

May 30 22:32:33 sh-tglh-0a70-hda-01 sudo[3228]:   .... ; USER=root ; COMMAND=/usr/bin/sync

May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: INFO: task kworker/u32:0:8 blocked for more than 122 seconds.
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:       Not tainted 5.18.0-rc2-daily-default-20220530-0-g46f61a08b40c #dev
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: task:kworker/u32:0   state:D stack:13208 pid:    8 ppid:     2 flags:0x00004000
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: Workqueue: writeback wb_workfn (flush-259:0)
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: Call Trace:
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  <TASK>
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  __schedule+0x2b4/0x8c0
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  ? __blk_flush_plug+0xdf/0x120
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  schedule+0x4e/0xb0
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  io_schedule+0x12/0x40
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  blk_mq_get_tag+0x11c/0x280
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  ? sched_clock_cpu+0xd/0xd0
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  ? complete_all+0x60/0x60
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  __blk_mq_alloc_requests+0x188/0x2e0
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  ? blk_attempt_plug_merge+0x76/0x90
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  blk_mq_submit_bio+0x421/0x700
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  submit_bio_noacct_nocheck+0x259/0x2b0
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  ext4_io_submit+0x20/0x40
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  ext4_bio_write_page+0x16e/0x520
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  mpage_submit_page+0x4e/0x70
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  mpage_process_page_bufs+0xf8/0x110
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  mpage_prepare_extent_to_map+0x1f9/0x450
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  ext4_writepages+0x311/0xeb0
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  ? sched_clock_cpu+0xd/0xd0
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  ? __lock_acquire.isra.0+0x125/0x550
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  do_writepages+0xbf/0x180
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  ? find_held_lock+0x32/0x90
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  ? sched_clock_cpu+0xd/0xd0
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  __writeback_single_inode+0x6f/0x470
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  writeback_sb_inodes+0x214/0x4d0
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  wb_writeback+0xde/0x360
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  wb_workfn+0xda/0x500
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  ? lock_acquire+0xb3/0x180
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  ? process_one_work+0x208/0x530
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  process_one_work+0x29c/0x530
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  worker_thread+0x4d/0x3c0
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  ? process_one_work+0x530/0x530
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  kthread+0xf0/0x120
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  ? kthread_complete_and_exit+0x20/0x20
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  ret_from_fork+0x1f/0x30
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  </TASK>
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: INFO: task sync:3229 blocked for more than 122 seconds.
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:       Not tainted 5.18.0-rc2-daily-default-20220530-0-g46f61a08b40c #dev
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: task:sync            state:D stack:14520 pid: 3229 ppid:  3228 flags:0x00000000
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: Call Trace:
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  <TASK>
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  __schedule+0x2b4/0x8c0
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  ? __x64_sys_tee+0xd0/0xd0
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  schedule+0x4e/0xb0
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  wb_wait_for_completion+0x63/0x90
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  ? complete_all+0x60/0x60
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  sync_inodes_sb+0xe5/0x310
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  ? __x64_sys_tee+0xd0/0xd0
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  iterate_supers+0x8e/0xe0
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  ksys_sync+0x40/0xa0
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  __do_sys_sync+0xa/0x10
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  do_syscall_64+0x38/0x90
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: RIP: 0033:0x7f44a85eb25b
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: RSP: 002b:00007ffd6ff37b68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: RAX: ffffffffffffffda RBX: 00007ffd6ff37ca8 RCX: 00007f44a85eb25b
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: RDX: 00007f44a86c5501 RSI: 0000000000000000 RDI: 00007f44a868bbc0
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: R10: 00007f44a86c2be0 R11: 0000000000000246 R12: 0000000000000000
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  </TASK>
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: 
                                            Showing all locks held in the system:
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: 3 locks held by kworker/u32:0/8:
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  #0: ffff9134c1e7b948 ((wq_completion)writeback){....}-{0:0}, at: process_one_work+0x208/0x530
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  #1: ffff9e16000ffe58 ((work_completion)(&(&wb->dwork)->work)){....}-{0:0}, at: process_one_work>
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  #2: ffff9134c145ac80 (&sbi->s_writepages_rwsem){....}-{0:0}, at: do_writepages+0xbf/0x180
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: 1 lock held by khungtaskd/96:
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  #0: ffffffff82b65cc0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0xf7
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: 4 locks held by kworker/1:2/190:
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: 1 lock held by in:imklog/583:
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  #0: ffff9134c1830af8 (&f->f_pos_lock){....}-{3:3}, at: __fdget_pos+0x4a/0x60
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: 1 lock held by sync/3229:
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  #0: ffff9134c14590e8 (&type->s_umount_key#35){....}-{3:3}, at: iterate_supers+0x78/0xe0
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: 1 lock held by sync/3308:
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel:  #0: ffff9134c14590e8 (&type->s_umount_key#35){....}-{3:3}, at: iterate_supers+0x78/0xe0
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: 
May 30 22:34:42 sh-tglh-0a70-hda-01 kernel: =============================================

@plbossart
Copy link
Member Author

@marc-hb I don't recall having seen this in recent PR or daily tests, should we close or is this still current?

@marc-hb
Copy link
Collaborator

marc-hb commented Jun 29, 2022

It's been root-caused to failing hard drives, probably because of SSD firmware bugs, like for instance this one:

https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=754y5

Fixes an issue which can cause the SSD to occasionally not wake up from low power mode

(This is just random example; I'm not saying we use this particular model)

See internal issue 233 for more details.

Flash storage is complex, so it is buggy:
https://en.wikipedia.org/wiki/Write_amplification
https://www.bunniestudios.com/blog/?p=3554

It is still happening on some devices that have not been "serviced" yet, however the failure signature is usually very different since I added this quick storage test: thesofproject/sof-test#910

@marc-hb
Copy link
Collaborator

marc-hb commented Jun 30, 2022

And now for something different:
https://sof-ci.01.org/sofpr/PR5959/build785/devicetest/?model=TGLU_UP_HDA_ZEPHYR&testcase=check-suspend-resume-with-playback-5

2022-06-30 10:11:14 UTC [ERROR] Caught kernel log error
===========================>>
[ 1063.661934] kernel: nvme nvme0: Device shutdown incomplete; abort shutdown
<<===========================

Never a dull moment...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working CML Applies to Comet Lake platform disk caching & write performance HDA Applies to HD-Audio bus for codec connection P1 Blocker bugs or important features TGL Applies to Tiger Lake platform
Projects
None yet
Development

No branches or pull requests

5 participants