Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG][TGL][Chrome] DSP Panic with large config payload (was plugging/unplugging a headset while using "Online Voice Recorder") #4769

Closed
johnylin76 opened this issue Sep 15, 2021 · 58 comments
Assignees
Labels
bug Something isn't working as expected DSP panic DSP panic observed IGO NC Observed in IGO NC integration TGL Applies to Tiger Lake

Comments

@johnylin76
Copy link
Contributor

Describe the bug
When IGO Noise Reduction was enabled and releases/tgl/v13.0-hotfix3 is applied on firmware.
During recording with "Online Voice Recorder", there was DSP panic found when plugging/unplugging a headset.

To Reproduce

  1. Open "online voice recorder" on Chrome browser.
  2. Start recording and say something (input: internal mic)
  3. Plug in 3.5mm jack headset and say something (input: headset mic)
  4. Unplug headset and say something (input: internal mic)
  5. Repeat 3 and 4 again and again

Note that script is not reproducible.

Reproduction Rate
~1/20

Expected behavior
The recording is always valid.

Impact
DSP panic occurred and user needs reboot to recover.

Environment
Google private tgl-13 branch with the checkout of https://chrome-internal-review.googlesource.com/c/chromeos/third_party/sound-open-firmware-private/+/4116509/1
And with the latest IGO proprietary library

Screenshots or console output
sof-audio-pci 0000:00:1f.3: error : DSP panic!
sof-audio-pci 0000:00:1f.3: status: fw entered - code 00000005
sof-audio-pci 0000:00:1f.3: error: runtime exception
sof-audio-pci 0000:00:1f.3: error: trace point 00004000
sof-audio-pci 0000:00:1f.3: error: panic at :0
sof-audio-pci 0000:00:1f.3: error: DSP Firmware Oops
sof-audio-pci 0000:00:1f.3: error: Exception Cause: LoadStorePIFDataErrorCause, Synchronous PIF data error during LoadStore access
sof-audio-pci 0000:00:1f.3: EXCCAUSE 0x0000000d EXCVADDR 0x0000bb88 PS 0x00060b25 SAR 0x00000000
sof-audio-pci 0000:00:1f.3: EPC1 0xbe02e437 EPC2 0xbe02d7e6 EPC3 0x00000000 EPC4 0x00000000
sof-audio-pci 0000:00:1f.3: EPC5 0x00000000 EPC6 0x00000000 EPC7 0x00000000 DEPC 0x00000000
sof-audio-pci 0000:00:1f.3: EPS2 0x00060120 EPS3 0x00000000 EPS4 0x00000000 EPS5 0x00000000
sof-audio-pci 0000:00:1f.3: EPS6 0x00000000 EPS7 0x00000000 INTENABL 0x00000000 INTERRU 0x00000262
sof-audio-pci 0000:00:1f.3: stack dump from 0xbe143d40
sof-audio-pci 0000:00:1f.3: 0xbe143d40: 00000000 3ff00000 3fe1feb3 41b00000
sof-audio-pci 0000:00:1f.3: 0xbe143d44: 00000000 00000000 00000000 00000000
sof-audio-pci 0000:00:1f.3: 0xbe143d48: 00000000 00000000 00000000 00000000
sof-audio-pci 0000:00:1f.3: 0xbe143d4c: f66ba030 ffff9667 01cebdc0 ffffaed4
sof-audio-pci 0000:00:1f.3: 0xbe143d50: 00000000 00000000 b641b700 ffffffff
sof-audio-pci 0000:00:1f.3: 0xbe143d54: b05d7828 ffff9667 000c0800 00000000
sof-audio-pci 0000:00:1f.3: 0xbe143d58: 00000000 00000000 01cebe50 ffffaed4
sof-audio-pci 0000:00:1f.3: 0xbe143d5c: 01cebe78 ffffaed4 c05ac157 ffffffff

messages_panic_at_intl_mic.txt
messages_panic_at_headset.txt

@johnylin76 johnylin76 added the bug Something isn't working as expected label Sep 15, 2021
@sathya-nujella sathya-nujella added TGL Applies to Tiger Lake IGO NC Observed in IGO NC integration labels Sep 15, 2021
@keyonjie
Copy link
Contributor

@bkokoszx @mwasko @abonislawski FYI. unfortunately another one.

@keyonjie
Copy link
Contributor

@johnylin76 do you have chance to capture the FW log?
@1994lwz can you try to reproduce it?

@lgirdwood
Copy link
Member

Prior to the crash we have a lot of IPC errors and attempts at setting a large config to component 57. What is component 57 ?

2021-09-15T05:38:36.475685Z WARNING cras_server[1850]: No chmap queried! Skip chmap set
2021-09-15T05:38:36.581500Z ERR kernel: [  853.364841] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:38:36.581518Z ERR kernel: [  853.364845] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:38:40.095015Z ERR kernel: [  856.876570] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:38:40.095029Z ERR kernel: [  856.876574] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:38:47.076017Z ERR kernel: [  863.854180] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:38:47.076031Z ERR kernel: [  863.854184] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:38:50.020019Z ERR kernel: [  866.796418] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:38:50.020033Z ERR kernel: [  866.796422] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:38:52.253016Z ERR kernel: [  869.028375] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:38:52.253032Z ERR kernel: [  869.028379] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:38:55.737015Z ERR kernel: [  872.510302] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:38:55.737029Z ERR kernel: [  872.510306] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:38:57.770043Z ERR kernel: [  874.542636] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:38:57.770057Z ERR kernel: [  874.542640] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:01.463014Z ERR kernel: [  878.233582] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:01.463028Z ERR kernel: [  878.233585] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:03.458016Z ERR kernel: [  880.227898] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:03.458029Z ERR kernel: [  880.227903] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:03.372964Z WARNING cras_server[1850]: message repeated 5 times: [ No chmap queried! Skip chmap set]
2021-09-15T05:39:04.521931Z INFO ARCVM(32)[4802]: [  349.525556] healthd: battery l=100 v=13080 t=0.0 h=1 st=4 c=0 fc=4085000 cc=0 chg=a
2021-09-15T05:39:06.815017Z ERR kernel: [  883.582875] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:06.815035Z ERR kernel: [  883.582879] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:08.682024Z WARNING cras_server[1850]: No chmap queried! Skip chmap set
2021-09-15T05:39:08.766045Z ERR kernel: [  885.532714] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:08.766057Z ERR kernel: [  885.532719] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:11.680015Z ERR kernel: [  888.445989] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:11.680029Z ERR kernel: [  888.445994] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:13.603026Z ERR kernel: [  890.367241] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:13.603040Z ERR kernel: [  890.367245] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:16.561020Z ERR kernel: [  893.323743] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:16.561042Z ERR kernel: [  893.323747] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:13.516979Z WARNING cras_server[1850]: No chmap queried! Skip chmap set
2021-09-15T05:39:17.221101Z INFO ARCVM(32)[4802]: [  362.218463] healthd: battery l=100 v=13080 t=0.0 h=1 st=4 c=0 fc=4085000 cc=0 chg=a
2021-09-15T05:39:19.047926Z WARNING cras_server[1850]: No chmap queried! Skip chmap set
2021-09-15T05:39:19.131016Z ERR kernel: [  895.892874] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:19.131032Z ERR kernel: [  895.892879] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:22.043015Z ERR kernel: [  898.803244] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:22.043030Z ERR kernel: [  898.803248] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:24.585019Z ERR kernel: [  901.344545] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:24.585036Z ERR kernel: [  901.344549] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:24.499112Z WARNING cras_server[1850]: No chmap queried! Skip chmap set
2021-09-15T05:39:29.857847Z NOTICE temp_logger[6248]:  x86_pkg_temp:43C INT3400_Thermal:20C TSR0:35C TSR1:33C TSR2:37C TSR3:35C TCPU:45C iwlwifi_1:34C PL1:17.000W
2021-09-15T05:39:30.401768Z WARNING cras_server[1850]: No chmap queried! Skip chmap set
2021-09-15T05:39:30.488021Z ERR kernel: [  907.244012] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:30.488034Z ERR kernel: [  907.244016] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:33.362018Z ERR kernel: [  910.116573] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:33.362032Z ERR kernel: [  910.116578] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:35.593991Z ERR kernel: [  912.347471] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
2021-09-15T05:39:35.594012Z ERR kernel: [  912.347475] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:38.566053Z ERR kernel: [  915.318051] sof-audio-pci 0000:00:1f.3: error : DSP panic!
2021-09-15T05:39:38.566071Z ERR kernel: [  915.318070] sof-audio-pci 0000:00:1f.3: status: fw entered - code 00000005
2021-09-15T05:39:38.566071Z ERR kernel: [  915.318158] sof-audio-pci 0000:00:1f.3: error: runtime exception
2021-09-15T05:39:38.566072Z ERR kernel: [  915.318159] sof-audio-pci 0000:00:1f.3: error: trace point 00004000
2021-09-15T05:39:38.566073Z ERR kernel: [  915.318160] sof-audio-pci 0000:00:1f.3: error: panic at :0
2021-09-15T05:39:38.566074Z ERR kernel: [  915.318162] sof-audio-pci 0000:00:1f.3: error: DSP Firmware Oops
2021-09-15T05:39:38.566080Z ERR kernel: [  915.318163] sof-audio-pci 0000:00:1f.3: error: Exception Cause: LoadStorePIFDataErrorCause, Synchronous PIF data error during LoadStore access
2021-09-15T05:39:38.566081Z ERR kernel: [  915.318164] sof-audio-pci 0000:00:1f.3: EXCCAUSE 0x0000000d EXCVADDR 0x0000bb88 PS       0x00060b25 SAR     0x00000000
2021-09-15T05:39:38.566082Z ERR kernel: [  915.318165] sof-audio-pci 0000:00:1f.3: EPC1     0xbe02e437 EPC2     0xbe02d7e6 EPC3     0x00000000 EPC4    0x00000000
2021-09-15T05:39:38.566082Z ERR kernel: [  915.318167] sof-audio-pci 0000:00:1f.3: EPC5     0x00000000 EPC6     0x00000000 EPC7     0x00000000 DEPC    0x00000000
2021-09-15T05:39:38.566084Z ERR kernel: [  915.318168] sof-audio-pci 0000:00:1f.3: EPS2     0x00060120 EPS3     0x00000000 EPS4     0x00000000 EPS5    0x00000000
2021-09-15T05:39:38.566084Z ERR kernel: [  915.318169] sof-audio-pci 0000:00:1f.3: EPS6     0x00000000 EPS7     0x00000000 INTENABL 0x00000000 INTERRU 0x00000262
2021-09-15T05:39:38.566085Z ERR kernel: [  915.318170] sof-audio-pci 0000:00:1f.3: stack dump from 0xbe143d40
2021-09-15T05:39:38.566085Z ERR kernel: [  915.318172] sof-audio-pci 0000:00:1f.3: 0xbe143d40: 00000000 3ff00000 3fe1feb3 41b00000
2021-09-15T05:39:38.566086Z ERR kernel: [  915.318173] sof-audio-pci 0000:00:1f.3: 0xbe143d44: 00000000 00000000 00000000 00000000
2021-09-15T05:39:38.566086Z ERR kernel: [  915.318174] sof-audio-pci 0000:00:1f.3: 0xbe143d48: 00000000 00000000 00000000 00000000
2021-09-15T05:39:38.566087Z ERR kernel: [  915.318176] sof-audio-pci 0000:00:1f.3: 0xbe143d4c: f66ba030 ffff9667 01cebdc0 ffffaed4
2021-09-15T05:39:38.566088Z ERR kernel: [  915.318177] sof-audio-pci 0000:00:1f.3: 0xbe143d50: 00000000 00000000 b641b700 ffffffff
2021-09-15T05:39:38.566088Z ERR kernel: [  915.318178] sof-audio-pci 0000:00:1f.3: 0xbe143d54: b05d7828 ffff9667 000c0800 00000000
2021-09-15T05:39:38.566089Z ERR kernel: [  915.318179] sof-audio-pci 0000:00:1f.3: 0xbe143d58: 00000000 00000000 01cebe50 ffffaed4
2021-09-15T05:39:38.566090Z ERR kernel: [  915.318180] sof-audio-pci 0000:00:1f.3: 0xbe143d5c: 01cebe78 ffffaed4 c05ac157 ffffffff

@lgirdwood
Copy link
Member

lgirdwood commented Sep 16, 2021

oh it appears the oops is after the set large config to comp 57 (see time delay)

2021-09-15T05:39:35.594012Z ERR kernel: [  912.347475] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
2021-09-15T05:39:38.566053Z ERR kernel: [  915.318051] sof-audio-pci 0000:00:1f.3: error : DSP panic!

The other crashes show similar delays with previous IPC errors.

Can we reproduce with kernel IPC logs ON and FW logs ON ?

@1994lwz
Copy link

1994lwz commented Sep 17, 2021

This issue is reproduce on the chrome platform at SHZ1 site.

[core dump]

# Core header:

# arch        00000001 # totalsize   0000017c # stackptr    be143d40 # stackoffset 000001a8
# configidhi  c2f3fbfe # configidlo  1cc6c29b # numaregs    00000040

# CPU registers:

# exccause    0000000d # excvaddr    0000bb88 # ps          00060225
# epc1        be02e437 # epc2        be02d7e6 # epc3        00000000 # epc4        00000000
# epc5        00000000 # epc6        00000000 # epc7        00000000
# eps2        00060a20 # eps3        00000000 # eps4        00000000 # eps5        00000000
# eps6        00000000 # eps7        00000000
# depc        00000000 # intenable   00000000 # interrupt   00000262 # sar         00000000
# debugcause  00000000
# windowbase  0000000e # windowstart 00005400
# excsave1    00000000
# ar56        be04a59f # ar57        be143d50 # ar58        be00881c # ar59        1cc6c29b
# ar60        c2f3fbfe # ar61        0000017c # ar62        00000001 # ar63        be143d40
# ar0         be04a51f # ar1         be143d30 # ar2         be008f28 # ar3         00000016
# ar4         00000028 # ar5         00000538 # ar6         9e266400 # ar7         00000004
# ar8         be03ec8b # ar9         be143d00 # ar10        be0089a8 # ar11        be143d40
# ar12        00000538 # ar13        00000800 # ar14        be2d2480 # ar15        00000003
# ar16        be03ebfb # ar17        be143ce0 # ar18        be0089a8 # ar19        be144278
# ar20        00000000 # ar21        be008ee0 # ar22        be02da67 # ar23        be146b30
# ar24        be03ebfb # ar25        be143cc0 # ar26        be00a670 # ar27        00000000
# ar28        00000000 # ar29        be00ae70 # ar30        00000005 # ar31        00000005
# ar32        be144170 # ar33        be144150 # ar34        be2d8e00 # ar35        00000000
# ar36        be144180 # ar37        00000000 # ar38        be03dc34 # ar39        00060826
# ar40        fe04fd68 # ar41        be143d90 # ar42        be143dc0 # ar43        0000000d
# ar44        00000000 # ar45        9e2664d0 # ar46        9e266480 # ar47        be2d8e4c
# ar48        be02e449 # ar49        be143d60 # ar50        0dead006 # ar51        00000000
# ar52        be143d90 # ar53        00000000 # ar54        00000001 # ar55        00000001


# windowbase: E
#               0 1 2
# windowstart: b101010000000000

#      reg         a0         a1
#                  (return)   (sptr)
#      ---         --------   -------
#  0 # ar56        be04a59f   be143d50
#  1 # ar48        be02e449   be143d60
#  2 # ar40        fe04fd68   be143d90

# Stack dumped from be143d40 dwords num 406

# *EXCEPTION*

# exccause: LoadStorePIFDataErrorCause: ynchronous PIF data error during LoadStore access
# excvaddr: 48008

@johnylin76
Copy link
Contributor Author

@johnylin76 do you have chance to capture the FW log?

Here are the logs with kernel IPC logs ON and FW logs ON:
sof-logger.txt
messages.txt

@lgirdwood
Copy link
Member

The logs show 2 things

  1. What appears to be heavy load on the DSP, as the scheduler cant shedule the agent on time ?
TIMESTAMP         (us)              DELTA  C# COMPONENT          LOCATION                      CONTENT	ktime=370.318s  @  2021-09-17 17:42:59 +0800 CST

Found valid LDC address after skipping 48 bytes (one line uses 20 + 0 to 16 bytes)
[           0.000000] (           0.000000) c0 unknown      0.0            src/lib/agent.c:79   WARN validate(), ll drift detected, delta = 51426
[   231528321.320724] (   231528320.000000) c0 sa                          src/lib/agent.c:79   WARN validate(), ll drift detected, delta = 51273
[   231568316.631635] (       39995.312500) c0 sa                          src/lib/agent.c:79   WARN validate(), ll drift detected, delta = 51290
[   231608321.109212] (       40004.476562) c0 sa                          src/lib/agent.c:79   WARN validate(), ll drift detected, delta = 51377
[   231648337.305539] (       40016.195312) c0 sa                          src/lib/agent.c:79   WARN validate(), ll drift detected, delta = 51574
[   231688318.553950] (       39981.250000) c0 sa                          src/lib/agent.c:79   WARN validate(), ll drift detected, delta = 51322
[   231728324.073194] (       40005.519531) c0 sa                          src/lib/agent.c:79   WARN validate(), ll drift detected, delta = 51435

But the delay is consistent, meaning we are likely scheduling agent in a timely manner but perhaps with the wrong clock or timebase.

  1. We are doing a lot (~200) IPC configuration messages prior to the crash.
2021-09-17T09:38:57.731047Z DEBUG kernel: [  128.874270] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.732027Z DEBUG kernel: [  128.874692] sof-audio-pci 0000:00:1f.3: ipc tx succeeded: 0x50030000
2021-09-17T09:38:57.732037Z DEBUG kernel: [  128.874705] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.732040Z DEBUG kernel: [  128.875252] sof-audio-pci 0000:00:1f.3: ipc tx succeeded: 0x50030000
2021-09-17T09:38:57.732050Z DEBUG kernel: [  128.875301] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.733017Z DEBUG kernel: [  128.875705] sof-audio-pci 0000:00:1f.3: ipc tx succeeded: 0x50030000
2021-09-17T09:38:57.733030Z DEBUG kernel: [  128.875716] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.733037Z DEBUG kernel: [  128.876259] sof-audio-pci 0000:00:1f.3: ipc tx succeeded: 0x50030000
2021-09-17T09:38:57.733040Z DEBUG kernel: [  128.876299] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.734022Z DEBUG kernel: [  128.876699] sof-audio-pci 0000:00:1f.3: ipc tx succeeded: 0x50030000
2021-09-17T09:38:57.734030Z DEBUG kernel: [  128.876710] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.734040Z DEBUG kernel: [  128.877260] sof-audio-pci 0000:00:1f.3: ipc tx succeeded: 0x50030000
2021-09-17T09:38:57.734042Z DEBUG kernel: [  128.877293] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.735041Z DEBUG kernel: [  128.877708] sof-audio-pci 0000:00:1f.3: ipc tx succeeded: 0x50030000
2021-09-17T09:38:57.735044Z DEBUG kernel: [  128.877730] sof-audio-pci 0000:00:1f.3: ipc tx: 0x50030000
2021-09-17T09:38:57.736014Z ERR kernel: [  128.878698] sof-audio-pci 0000:00:1f.3: error : DSP panic!
2021-09-17T09:38:57.736019Z DEBUG kernel: [  128.878702] sof-audio-pci 0000:00:1f.3: panic: dsp_oops_offset 788480 offset 788480
2021-09-17T09:38:57.736020Z ERR kernel: [  128.878718] sof-audio-pci 0000:00:1f.3: status: fw entered - code 00000005
2021-09-17T09:38:57.736021Z ERR kernel: [  128.878806] sof-audio-pci 0000:00:1f.3: error: runtime exception

We are not doing anything else at this time so could be a race when we are copying config data.

@bkokoszx bkokoszx self-assigned this Sep 20, 2021
@mengdonglin mengdonglin added the DSP panic DSP panic observed label Sep 22, 2021
@johnylin76
Copy link
Contributor Author

This issue is reproduce on the chrome platform at SHZ1 site.

Hi @1994lwz , may I know which device do you use? Delbin? Drobit? Do you know the phase (Proto/DVT/EVT) and the SKU of your device?
Thanks.

@1994lwz
Copy link

1994lwz commented Sep 24, 2021

This issue is reproduce on the chrome platform at SHZ1 site.

Hi @1994lwz , may I know which device do you use? Delbin? Drobit? Do you know the phase (Proto/DVT/EVT) and the SKU of your device?
Thanks.

the device is Delbin EVT SKU2.

@lgirdwood
Copy link
Member

@1994lwz @bkokoszx any update or findings so far ?

@bkokoszx
Copy link
Collaborator

@lgirdwood
Not yet, I was debugging problem with D0i3->D0 multicore transition.

Is there any way to reproduce this issue by cmd line?

@johnylin76
Copy link
Contributor Author

Unfortunately I still cannot make a script to successfully reproduce the issue with "cras_test_client --plug ::<0|1>" commands. It seems like a complicated problem that cannot be created by the simple combination of commands.

However, I can still reproduce the issue on my Drobit device by the manual steps, although it is also not easy to make it happen. My strategy is rebooting the device and trying again when I failed to reproduce the issue after 30~40 times of headset plug-and-unplug iterations. (I rebooted 3 times and finally hit it...)

Attached logs:
messages.txt
sof-logger.txt

The log from sof-logger looks weird and unfinished, but it is indeed the whole log I could retrieve after hitting the issue.

@bkokoszx
Copy link
Collaborator

@johnylin76
Are you able to provide core dump file (after reproduction) with corresponding firmware .lst file (sof-tgl.lst - one of a building artifact)?

@johnylin76
Copy link
Contributor Author

@bkokoszx
Do you mean the core dump file generated by the method as mentioned in sof/issues/4648?
I have generated one. Please see as attached:

messages.txt
sof-logger.txt
oops_coredump.txt

@bkokoszx
Copy link
Collaborator

@johnylin76
Yes, but additionally *.lst file will be helpful (sof-tgl.lst). It contains disassembly of the code, so that we can associate addresses from coredump file provided by you with the code.
*.lst file is one of the build artifacts (in <build_dir>/src/arch/xtensa/).

@ranj063
Copy link
Collaborator

ranj063 commented Sep 30, 2021

here's a tentative fix for this issue: #4833

Still being tested

@cujomalainey
Copy link
Member

here's a tentative fix for this issue: #4833

Where would the crash happen here? I am agree it is a good fix but I am having trouble reasoning out the race that would leave the lib with either freed memory or an incorrect size.

@ranj063
Copy link
Collaborator

ranj063 commented Sep 30, 2021

Where would the crash happen here? I am agree it is a good fix but I am having trouble reasoning out the race that would leave the lib with either freed memory or an incorrect size.

@cujomalainey I'm not certain what could cause the crash given that we're likely sending the same blob. So, theoretically the size should remain the same. But from the logs we see that there are 2 blobs (or the same blob senttwice) sent during headset removal/pug-in. Do you know what those 2 blobs are for?

@cujomalainey
Copy link
Member

@ranj063 we know they are the hotword models, but something is being weird in how they are sent. Previously on 012 this was not an issue. FYI i now have a script you can run to repro this. Repro times range from 5s to 10min

cras_test_client --capture_f /dev/null &
cras_test_client --listen /dev/null &
while true; do cras_test_client --set_hotword_model 15:0:null & cras_test_client --set_hotword_model 15:0:en_all; done

@ranj063
Copy link
Collaborator

ranj063 commented Sep 30, 2021

@cujomalainey do you mind trying my PR to see if the panic still hapens on your device?

@cujomalainey
Copy link
Member

I don't have the latest IGO lib. @johnylin76 can test. He will be coming online soon.

@ranj063
Copy link
Collaborator

ranj063 commented Oct 1, 2021

I don't have the latest IGO lib. @johnylin76 can test. He will be coming online soon.

with my PR i'm able to hit the issue much sooner than without

@bkokoszx
Copy link
Collaborator

bkokoszx commented Oct 1, 2021

@ranj063
I've tried to reproduce panic remotely using @cujomalainey script and I did not hit reproduction. Only I saw ipc errors:

[  290.425084] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
[  290.425086] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57
[  290.426537] sof-audio-pci 0000:00:1f.3: error: ipc error for 0x50030000 size 384
[  290.426539] sof-audio-pci 0000:00:1f.3: error: set/get large ctrl ipc comp 57

Update:
I think I hit the repro. I will double check and let you know.

@cujomalainey
Copy link
Member

According to @bzhg when he was checking crash, the actual crash is inside the microspeech lib on a load, but he can't tell what it is loading. I am betting we are pulling memory out from under its feet and causing it to jump into oblivion. As to how I am not sure.

@ranj063
Copy link
Collaborator

ranj063 commented Oct 1, 2021

According to @bzhg when he was checking crash, the actual crash is inside the microspeech lib on a load, but he can't tell what it is loading. I am betting we are pulling memory out from under its feet and causing it to jump into oblivion. As to how I am not sure.

@cujomalainey do you think that instead of realloc'ing the blob_handler->data pointer to a new pointer everytime, allocating data with a max size and only updating it would help solve the issue?

@bkokoszx
Copy link
Collaborator

bkokoszx commented Oct 8, 2021

If you would like to check values of specific variable you can do it using e.g. trace_points:

diff --git a/src/audio/google_hotword_detect.c b/src/audio/google_hotword_detect.c
index 8c46eaa785c8..1f3da320ebe3 100644
--- a/src/audio/google_hotword_detect.c
+++ b/src/audio/google_hotword_detect.c
@@ -343,6 +343,8 @@ static void ghd_detect(struct comp_dev *dev,
        if (cd->detected)
                return;

+       mailbox_sw_reg_write(SRAM_REG_DEBUG_WIN_1, bytes);
+
        /* Assuming 1 channel, verified in ghd_params.
         *
         * TODO Make the logic multi channel safe when new hotword library can
diff --git a/src/include/kernel/mailbox.h b/src/include/kernel/mailbox.h
index c9fad0885b69..f6e707c8be02 100644
--- a/src/include/kernel/mailbox.h
+++ b/src/include/kernel/mailbox.h
@@ -36,7 +36,8 @@
 #define SRAM_REG_R0_STATE_TRACE                        SRAM_REG_R_STATE_TRACE_BASE
 #define SRAM_REG_R1_STATE_TRACE                        0x38
 #define SRAM_REG_R2_STATE_TRACE                        0x40
-#define SRAM_REG_FW_END                                (SRAM_REG_R2_STATE_TRACE + 0x8)
+#define SRAM_REG_DEBUG_WIN_1        0x48
+#define SRAM_REG_FW_END                                (SRAM_REG_R2_STATE_TRACE + 0x4)

 /** @}*/

And dump it by command:

hexdump /sys/kernel/debug/sof/regs

You can define bigger or several debug "windows" and dump/check bigger part of memory.

@keyonjie
Copy link
Contributor

@sathya-nujella @johnylin76 @cujomalainey @bkokoszx I have a one line change which fixes quite some cache coherency issues when debugging multi-core + dynamic pipelines, can you help to give it quick try if it help:
keyonjie@c5d156d

@cujomalainey
Copy link
Member

@bkokoszx thanks for the helpful tip. Will definitely remember to try that out. @johnylin76 can you test @keyonjie's change?

@johnylin76
Copy link
Contributor Author

Unfortunately by applying @keyonjie's change it is still easy to reproduce DSP panic issue by script (less than running 1 minute).
Here are the logs:
messages.txt
sof-logger.txt

@keyonjie
Copy link
Contributor

Thanks for trying it out @johnylin76 .
From the FW logs we need PR: #4833 @bkokoszx

@johnylin76
Copy link
Contributor Author

When I tried I also included PR#4833.
According to #4769 (comment) DSP panic issue still can be reproduced by script with PR#4833

@keyonjie
Copy link
Contributor

When I tried I also included PR#4833. According to #4769 (comment) DSP panic issue still can be reproduced by script with PR#4833

Sorry didn't notice to this, so let's focus on @bkokoszx 's suggestion and debugging then.

@bkokoszx
Copy link
Collaborator

I can also reproduce panic issue with script with @keyonjie change.

@bkokoszx
Copy link
Collaborator

@cujomalainey @johnylin76
Did you find any time to look into this panic issue from Google perspective?

@cujomalainey
Copy link
Member

@bkokoszx I have not, I am sheriff for my team this week so my time is thin, @johnylin76 might have. I still think this is likely a data corruption issue caused by stressing malloc given the memory map works fine when not constantly being swapped. It might be worth re-enabling the CRC and seeing if we catch some CRC errors.

@bkokoszx
Copy link
Collaborator

@cujomalainey
Thanks for update

@cujomalainey
Copy link
Member

We are testing the CRAS workaround right now, but still underlying error exists

@cujomalainey
Copy link
Member

So our workaround prevents the crash, but we have a new crash that is blocking the release

@lgirdwood
Copy link
Member

@cujomalainey are both crashes solved now ? I've not seen any updates since the DMA fixes were merged so I'm assuming were are good ?
@sathyap-chrome @sathya-nujella fyi.

@cujomalainey
Copy link
Member

cujomalainey commented Dec 13, 2021

@lgirdwood i believe this crash can still be replicated with the script. We simply modified CRAS so it won't hit it. It might be worth testing with another component that has a large config payload. (Note this payload is about 60k)

@lgirdwood lgirdwood added this to the v2.1 milestone Jan 25, 2022
@lgirdwood lgirdwood changed the title [BUG][TGL][Chrome] DSP Panic on plugging/unplugging a headset while using "Online Voice Recorder" [BUG][TGL][Chrome] DSP Panic with large config payload (was plugging/unplugging a headset while using "Online Voice Recorder") Jan 25, 2022
@lgirdwood
Copy link
Member

Tracking for 2.1 - @cujomalainey @sathya-nujella pls append the script here if you have it. Thanks.

@cujomalainey
Copy link
Member

@lgirdwood

cras_test_client --capture_f /dev/null &
cras_test_client --listen /dev/null &
while true; do cras_test_client --set_hotword_model 15:0:null & cras_test_client --set_hotword_model 15:0:en_all; done

Usually takes a couple of minutes

@lgirdwood
Copy link
Member

@marc-hb @greg-intel @aiChaoSONG @XiaoyunWu6666 fyi - we need a stress/CI test that sends a sequence of large config messages to our WOV dummy component as we have a crash in the large message handling whilst doing a playback and capture.

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 31, 2022

Can someone who worked on this bug and understand enough about it please file a more detailed test description in a new https://github.com/thesofproject/sof-test/issues? A detailed focused description of what CRAS does or did seems like a minimum.

@lgirdwood
Copy link
Member

@marc-hb The aim is to not test for this exact bug, but for the use case where it occurs. The active streams and large config IPC run at different contexts with xtos so we may have a race. Being able to do a test with running streams and sending large config IPCs would be helpful. Feature request added here thesofproject/sof-test#847
Fwiw, doing active streams and flood IPC would also be a good test too (and may hit this issue).

@kv2019i
Copy link
Collaborator

kv2019i commented Feb 15, 2023

@cujomalainey @lgirdwood I removed the v2.1 milestone as that ship has sailed. I think we should use bug priorities if this needs more attention. The thesofproject/sof-test#847 test issue is also tracked (we could use e.g. EQ to mimic a similar test in upstream).

@kv2019i kv2019i removed this from the v2.1 milestone Feb 15, 2023
@cujomalainey
Copy link
Member

@johnylin76 is this still relevant?

@johnylin76
Copy link
Contributor Author

I think this issue can be closed which should be fixed (or substantially alleviated) by the following module_handler improvements. I barely see this issue on ADL projects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working as expected DSP panic DSP panic observed IGO NC Observed in IGO NC integration TGL Applies to Tiger Lake
Projects
None yet
Development

No branches or pull requests