
intermittent kernel.mutex sanitycheck with mps2_an385 #12352

Closed
pabigot opened this issue Jan 6, 2019 · 11 comments
Labels: area: Toolchains, bug, priority: low


pabigot (Collaborator) commented Jan 6, 2019

I'm finding that tests/kernel/mutex/mutex/kernel.mutex on mps2_an385 fails intermittently on current master at 330cbfa using SDK 0.9.5.

Instrumenting the test shows that although the test_mutex routine normally reaches the point of taking the mutex within the first 10 ms of runtime, in some situations thread_09 will complete its 500 ms delay first and take the mutex instead.

Reproduce with:

sanitycheck -p mps2_an385 --test tests/kernel/mutex/mutex/kernel.mutex
while grep '385,True' scripts/sanity_chk/last_sanity.csv ; do
  sanitycheck -p mps2_an385 --test tests/kernel/mutex/mutex/kernel.mutex
done
pabigot added the bug label on Jan 6, 2019
galak added the priority: low label on Jan 8, 2019
galak (Collaborator) commented Jan 15, 2019

This is with QEMU.

andrewboie (Contributor) commented:

also seen on qemu_cortex_m3

andyross (Contributor) commented:

I can't make this happen on master as of commit cd65ed8. tests/kernel/mutex on mps2_an385 got through 106 sanitycheck iterations on my build machine with no failures. Can you retest? What kind of frequency were you seeing?

pabigot (Collaborator, Author) commented Feb 26, 2019

Failed on the sixth run in the above loop, with SDK 0.9.5. I don't remember exactly what failed before, but the assertion below is consistent with thread_09 already holding the mutex.

***** Booting Zephyr OS v1.14.0-rc1-562-gcd65ed857c *****
Running test suite mutex_complex
===================================================================
starting test - test_mutex
***** MPU FAULT *****
  Data Access Violation
  MMFAR Address: 0x20000000
***** Hardware exception *****
Current thread ID = 0x20000290
Faulting instruction address = 0x668
Fatal fault in thread 0x20000290! Aborting.
starting test - Test kernel Mutex API
===================================================================

    Assertion failed at /scratch/zp/zephyr/tests/kernel/mutex/mutex/src/main.c:274: test_mutex: (rv not equal to 0)
Failed to lock mutex 0x20003e28

FAIL - test_mutex
===================================================================
Test suite mutex_complex failed.
===================================================================
PROJECT EXECUTION FAILED
***** MPU FAULT *****
  Data Access Violation
  MMFAR Address: 0x20000000
***** Hardware exception *****
Current thread ID = 0x20000188
Faulting instruction address = 0x56a
Fatal fault in thread 0x20000188! Aborting.

andyross (Contributor) commented:

Odd. What's your host OS? Is the system otherwise unloaded? I was using SDK 0.9.5 for that test, but it seems similarly fine for me with 0.10.0-rc3.

pabigot (Collaborator, Author) commented Feb 26, 2019

Ubuntu 18.04.2, 64 GB RAM, on an i7-6800K (12 hardware threads). There's one 2-core VM running on it plus my development environment; top shows it about 98% idle. I think that's pretty lightly loaded, but it certainly isn't absolutely idle, so that's probably a contributing factor.

andrewboie (Contributor) commented:

Wondering if this is a real kernel bug, or another instance of us not having working icount in QEMU.
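
(For context, icount is QEMU's instruction-counting virtual clock, which ties guest time to instructions executed rather than host wall-clock time. A rough sketch of how it is enabled on a bare qemu-system-arm invocation follows; the shift value and ELF path are illustrative assumptions, and Zephyr's build system supplies its own emulator flags.)

# Illustrative only: run the guest with a deterministic instruction-counter clock.
qemu-system-arm -machine mps2-an385 -nographic \
    -icount shift=6,align=off,sleep=off \
    -kernel build/zephyr/zephyr.elf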

andyross (Contributor) commented:

If @pabigot is getting failures on an unloaded system within 6 iterations, that's for sure not a host-nondeterminism issue. Qemu really is pretty solid as long as the host OS delivers its timing signals promptly; the tests that give us trouble in CI will (literally!) run for thousands of iterations for me if I just run sanitycheck on a single test in a loop like this.

I'll dig around. My own box is a Fedora 29 system (with a similar CPU). At the office, gale is an Ubuntu 18.04 install, but it's in use right now for static analysis runs. I also have an increasingly stale docker image that reproduces the CI infrastructure (which I should really update for cases like this). Surely something will reproduce...

andyross (Contributor) commented Mar 1, 2019

Confirmed on the 18.04 box. No idea why it's sensitive to the host system like this (the toolchain and qemu versions are identical and provided by the SDK). Perhaps the host kernel's scheduler behaves differently.

andyross (Contributor) commented Mar 2, 2019

A quick update now that it's occurred to me: it's not the OS, it's the clock rate. The failing box is a server whose cores sustain 2.6 GHz under load; the machine I couldn't reproduce on is an overclocked 5 GHz desktop. I should clock it down to confirm.
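
(One way to test that theory, assuming the host exposes a cpufreq driver that honors frequency caps and has the cpupower utility installed, is to cap the core clock before re-running the repro loop above; the 2600 MHz figure simply mirrors the server's sustained clock.)

# Cap all cores at roughly the server's sustained clock, then re-run the loop.
sudo cpupower frequency-set --max 2600MHz
# Restore afterwards (5000 MHz shown as an example ceiling for the overclocked desktop).
sudo cpupower frequency-set --max 5000MHz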

So this is almost certainly timing related: on one system qemu can race ahead of the timer ISRs, and on the other it gets caught before it's finished. I strongly suspect this is the same root cause as the still-undiagnosed ARM userspace-vs-timer instability detailed in #12553.

andyross (Contributor) commented Mar 7, 2019

I was close. It's not only the clock rate, it's the CONFIG_HZ parameter of the host kernel. Fedora sets HZ=1000, whereas Ubuntu uses a 250 Hz tick rate, which isn't precise enough to wake or signal the qemu process when the guest expects it in all test cases. Running this test against a host kernel rebuilt to change that one setting (and with the simulation rate slowed down, because ARM userspace turns out to be a little too slow for the 2.6 GHz box) makes it reliable.
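
(For anyone checking their own host, a quick way to read the tick rate out of the running kernel's build config, assuming the distro installs a /boot/config-<version> file as both Fedora and Ubuntu do:)

grep '^CONFIG_HZ' /boot/config-"$(uname -r)"
# Typical output: CONFIG_HZ_1000=y / CONFIG_HZ=1000 on Fedora,
#                 CONFIG_HZ_250=y  / CONFIG_HZ=250  on Ubuntu.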

Closing this bug, as it's not an issue with Zephyr code or this particular test. We'll track the root cause and workarounds in #12553.

andyross closed this as completed on Mar 7, 2019