Hung task timeout reported by CI #3387
A few weeks ago I spent some time investigating a similar report for a volume-basic-test. My preliminary conclusion was that the storage was overwhelmed by the volume of kernel logs.
|
Just found by chance that journald.conf has some rate limit options, worth a try. |
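For reference, the journald rate-limit and disk-usage options live in /etc/systemd/journald.conf; a minimal sketch with placeholder values (these would still need tuning on our devices, and older systemd spells the first option RateLimitInterval):
```
[Journal]
# Drop messages from any service that logs more than this burst within
# the interval (the systemd defaults are 10000 messages per 30s):
RateLimitIntervalSec=30s
RateLimitBurst=10000
# Cap on-disk journal size so heavy logging cannot fill the rootfs:
SystemMaxUse=500M
```
Restarting the daemon applies the change: sudo systemctl restart systemd-journald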
Another side effect: on some devices we have less than 1 week of logs:
Would some log level tuning make sense? cc: @fredoh9 , @keqiaozhang , @aiChaoSONG , @kv2019i |
Looks like file system issues again
|
Another one:
|
Yes, but look further and it's again ext4 locking issues. Nothing apparently to do with SOF. |
Here's another interesting issue that happened in daily 11687?model=TGLH_SKU0A70_HDA&testcase=volume-basic-test-50
Note the 35-minute gap! @plbossart is there an easy way to reduce the log volume of this test? Tracepoints maybe? #3482 |
Indeed, this GLB_TRACE_MSG: DMA_POSITION is not very useful; we should check whether we can remove it. |
This recent one also has a ton of
|
Other interesting logs just found by chance. Also linking to
Daily 12389?model=CML_SKU0955_HDA&testcase=check-ipc-flood
|
I spent a couple of days on this issue and made some progress. The error messages typically mean that the system is experiencing disk or memory congestion and processes are being starved of available resources. In other words, this issue is related to the I/O subsystem: it is caused by high I/O load, and the file system fails to flush cached data from memory to disk in time. Some explanations from Google:
On our devices, the file system uses up to 10% of the available memory for system caching and the expire time is 432 seconds. These indicators can be configured in pagecache settings.
So to avoid such an issue, we can reduce the value of ... I did a lot of tests to tune the values for ... As for why this issue has been exposed recently, my best guess is that some recent kernel changes increased the I/O load. Since this issue only happens on some Dell devices, these devices may use disks (NVMe) with a low read/write speed; with these two factors combined, it is easier for the file system to hit an I/O bottleneck. I will pay close attention to the following test results, and hope this issue will disappear forever on our devices after the pagecache setting changes. |
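The exact sysctl names were elided above; assuming the usual Linux pagecache writeback knobs, the tuning would look something like this (values are illustrative, not the ones actually deployed):
```
# Illustrative pagecache writeback tuning. Lower ratios make the kernel
# flush dirty pages earlier, so writeback happens in smaller bursts
# instead of one giant, device-choking flush:
sudo sysctl -w vm.dirty_background_ratio=5    # start background flush at 5% of RAM
sudo sysctl -w vm.dirty_ratio=10              # throttle writers at 10% of RAM
sudo sysctl -w vm.dirty_expire_centisecs=3000 # flush dirty data older than 30s
```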
@keqiaozhang I didn't click on the fact that these errors only happen on Dell devices; is this really the case? I vaguely recall that on some Dell devices we had to change the default for the disk in the BIOS: the default setting was RAID and we had to switch to AHCI or something. |
Daily 12694?modelSoc=TGLU&model=TGLU_SKU0A32_SDCA&testcase=check-pause-resume-playback-100
|
I tried booting
I will run some drive tests in both modes tomorrow. |
RST stands for Rapid Storage Technology. From https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/rst-linux-paper.pdf
RST can cause problems when actually using RAID and/or dual booting with Windows. A typical mistake seems to be switching from RAID to AHCI without letting Windows know about that. This makes Windows unbootable. I haven't found anyone reporting any performance difference. |
Yes, from the issue history, it only happened on the 4 Dell platforms below.
Yes, but we chose AHCI mode for all Dell devices, at least for SH devices. |
jf-tglu-sku0a32-sdca-03 crashed again. Again, it crashed when logging was heavy. This caused another "time warp". I took that opportunity to explain how ... tl;dr: always use journalctl's ... |
It seems easier to hit this issue on
I triggered 6 rounds of daily tests on these 3 devices separately, but 2 rounds of tests failed. It seems that adjusting the pagecache settings can only reduce the probability of occurrence; it cannot completely avoid this issue. |
I think jf-tglu-sku0a32-sdca-03 (and maybe other similar devices) have temporary storage issues. I ran bonnie++ in AHCI mode and I found write performance temporarily half as good as normally expected!
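The exact bonnie++ invocation isn't recorded in this thread; a typical one for this kind of sequential-throughput check would be (directory, size and flags are assumptions):
```
# -d: directory on the drive under test; -s: test file size in MB, which
# should exceed RAM so the page cache cannot hide the drive's real speed;
# -n 0: skip the small-file tests; -u: user to run as.
bonnie++ -d /home/mherber2 -s 16384 -n 0 -u "$USER"
```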
So I switched to "RAID On" (
After switching back to AHCI, the performance stayed high!! I tried jf-tglu-sku0a32-sdca-01 in both modes and the performance was always in the high range. Flash storage and wear leveling are very complex. Maybe storage on those devices is starting to wear out which could cause hangs? Or the drive firmware could just be buggy. https://en.wikipedia.org/wiki/Wear_leveling https://www.bunniestudios.com/blog/?p=3554 |
To make things worse, I've narrowed down a graphical issue on these devices. It's unrelated, but it does not help: they're stuck on VT1. In other words, ... EDIT: mystery solved: |
The storage on (some of) those systems is definitely unreliable. I reserved jf-tglu-sku0a32-sdca-03 for some testing and I experienced another hang while
Immediately after, I tried:
BTW the whole ... Another thing that took an unusually long time just now:
That's 30 times slower than above on the same system!! |
Ignore the last comment sorry. I was very unlucky and happened to test our special weekly kernel (unfortunately called EDIT: we should check whether some of the reports above happened with a weekly kernel. |
When graphics fail, a blank screen is not useful. Even when graphics work, without a framebuffer console chvt still works but it is "blind", with a frozen GUI instead! Incredibly confusing. Linux distributions expect this.

Also useful in case of hangs like thesofproject/linux#3387: text consoles are always more responsive and they may even have some logs.

Signed-off-by: Marc Herbert <[email protected]>
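For context, recovering from a frozen GUI with a text console looks like this (a sketch; the VT number is arbitrary):
```
# Switch to a text console even when the GUI appears frozen. Without a
# framebuffer console this still works, just with no visual feedback:
sudo chvt 3
# From the text console, recent kernel messages may still be available:
sudo dmesg | tail -n 50
```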
@marc-hb I restored all ... But before ..., the reproduction rate is low after adjusting the ... |
We've experienced a fair number of test failures that seem to point to storage performance issues: thesofproject/linux#3387 thesofproject/linux#3669

This (temporary?) addition runs a quick write test after each audio test to monitor storage sanity. As a bonus feature the "sync" could help us collect more logs. On our (slowest) BYT devices the test adds 3s per test; much less on newer devices.

Sample output:
```
2022-05-24 23:20:41 UTC [INFO] pkill -TERM sof-logger
2022-05-24 23:20:42 UTC [INFO] nlines=1132 /home/mherber2/SOF/sof-test/logs/BOGUS-check-playback/2022-05-24-16:20:35-3288/slogger.txt
+ timeout -s CONT 5 sudo sync

real    0m0.062s
user    0m0.005s
sys     0m0.019s
+ timeout -s CONT 10 dd if=/dev/zero of=/home/mherber2/HD_TEST_DELETE_ME bs=1M count=200 conv=fsync
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 2.0893 s, 100 MB/s
+ timeout -s CONT 5 sudo sync

real    0m0.037s
user    0m0.004s
sys     0m0.018s
2022-05-24 23:20:44 UTC [INFO] Test Result: PASS!
```

Signed-off-by: Marc Herbert <[email protected]>
I'm adding a quick storage sanity check + ... Please help review. |
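A minimal sketch of what that per-test check does, reconstructed from the sample output above (the real script lives in the sof-test PR; the timeouts and dd parameters follow the trace):
```
#!/bin/bash
# Quick storage sanity check run after each audio test: the timed syncs
# bound how long flushing the page cache takes, and the dd measures raw
# sequential write throughput, fsync'ing at the end.
set -ex
time timeout -s CONT 5 sudo sync
timeout -s CONT 10 dd if=/dev/zero of="$HOME/HD_TEST_DELETE_ME" \
    bs=1M count=200 conv=fsync
time timeout -s CONT 5 sudo sync
rm -f "$HOME/HD_TEST_DELETE_ME"
```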
Thanks for @ujfalusi's suggestion. @marc-hb and @plbossart Here is the first round of test results for the 2 Dell platforms which most easily hit this issue: https://sof-ci.sh.intel.com/#/result/planresultdetail/12875 So I triggered another round of tests on all 4 Dell platforms and also removed the workaround (pagecache settings); the results are really good and consistent. This is really good news: it seems that this issue has been fixed in the latest upstream kernel. |
@keqiaozhang can you try with 5.18 to see if that already fixes the problem? linux-next includes stuff that will only be provided later, in 5.19-rc1 or after, and that in itself will add other issues. |
Maybe a newer kernel has marginally better performance but that's IMHO only hiding the real, device-specific issue here. As noticed by @plbossart , we've observed these hangs only on certain TGL devices with (normally!) very high performance NVMe storage. I just discovered that the storage performance on some of our old BYT devices is ONE HUNDRED times lower than on those recent TGL devices: Yet we never ever noticed anything like this on BYT!? Something does not add up. Anyway we're not in the storage business so whatever hides this issue is good enough for me. EDIT: CML TIMEOUT in |
Note that some BYT devices use an SD card for storage, not an SSD or NVMe. Please check the storage size of the low-performance BYT devices: it should be 16G or 32G if they use an SD card. |
Very interesting but that's not my point. My point is: incredibly slow BYT storage never triggered any "hung task" or TIMEOUT. So this is not just a "slow storage" issue; something else is really wrong with these devices and we still have no idea what it is. Maybe storage hangs sometimes, or maybe it's not even storage. So neither the newer kernel versions nor ... |
I tested with the https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git/log/?h=for-5.18 kernel before; the hung task timeout issue was still reproducible. Then I tested with today's ... I also checked with ... So it seems that the fix only landed on ... |
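One way to check which trees already contain a suspected fix (the commit hash below is a placeholder; the actual fix was not identified in this thread):
```
# List the remote branches containing a given commit, then filter for
# the trees we test (for-5.18, linux-next, ...):
git fetch --all
git branch -r --contains <commit-sha> | grep -E 'for-5.18|next'
```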
Reproduced again in daily 12927?model=TGLU_SKU0A32_SDCA&testcase=multiple-pause-resume-50 I had a look at the results of the quick storage test I just added in thesofproject/sof-test@34ca191b92a32 and write speed was blazingly fast as usual before and after this failure; zero issues found with the actual storage. Start Time: 2022-05-26 22:28:02 UTC
|
This issue is intermittent, so the lack of reproduction may just be luck. |
Whatever the problem is, it is definitely triggered by more intense logging. The ... In daily test run 13008?model=TGLH_SKU0A70_HDA&testcase=check-ipc-flood, after 10,000 lines like this:
Same on the day before: 10,000 lines like these in only 2 seconds, and then a hang after
|
@marc-hb I don't recall having seen this in recent PR or daily tests; should we close, or is this still current? |
It's been root-caused to failing hard drives, probably because of SSD firmware bugs, like for instance this one: https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=754y5
(This is just a random example; I'm not saying we use this particular model.) See internal issue 233 for more details. Flash storage is complex, so it is buggy. It is still happening on some devices that have not been "serviced" yet; however, the failure signature is usually very different since I added this quick storage test: thesofproject/sof-test#910 |
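For triaging suspected drive or firmware problems like this, smartmontools is one option (the device path is an example; adjust for the machine under test):
```
# Dump the drive model, firmware revision and SMART error log; a buggy
# firmware revision can then be matched against the vendor's advisories.
sudo smartctl -a /dev/nvme0
```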
And now for something different:
Never a dull moment... |
https://sof-ci.sh.intel.com/#/result/planresultdetail/9293?model=CML_SKU0955_HDA&testcase=multiple-pause-resume-50
https://sof-ci.sh.intel.com/#/result/planresultdetail/9293?model=TGLH_SKU0A70_HDA&testcase=volume-basic-test-50
.===========================>>
[ 3318.829119] kernel: INFO: task kworker/u16:0:17963 blocked for more than 122 seconds.
[ 3318.829227] kernel: Not tainted 5.16.0-rc1-daily-default-20220111-0 #55744127
[ 3318.829239] kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<<===========================
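For reference, the message in the excerpt above comes from the kernel's hung-task watchdog, which can be inspected and tuned via sysctl, as the log itself hints:
```
# khungtaskd prints "blocked for more than N seconds" based on:
sysctl kernel.hung_task_timeout_secs
# Setting it to 0 disables the warning entirely (it hides the stall, it
# does not fix it):
sudo sysctl -w kernel.hung_task_timeout_secs=0
```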