
intermittent kernel.mutex sanitycheck with mps2_an385 #12352

Closed
pabigot opened this issue Jan 6, 2019 · 11 comments
Labels: area: Toolchains, bug, priority: low


pabigot (Collaborator) commented Jan 6, 2019

I'm finding that tests/kernel/mutex/mutex/kernel.mutex on mps2_an385 fails intermittently on current master at 330cbfa using SDK 0.9.5.

Instrumenting the test shows that although the test_mutex routine normally reaches the point of taking the mutex within the first 10 ms of runtime, in some situations thread_09 will complete its 500 ms delay first and take the mutex instead.

Reproduce with:

sanitycheck -p mps2_an385 --test tests/kernel/mutex/mutex/kernel.mutex
while grep '385,True' scripts/sanity_chk/last_sanity.csv ; do
  sanitycheck -p mps2_an385 --test tests/kernel/mutex/mutex/kernel.mutex
done
pabigot added the bug label on Jan 6, 2019
galak added the priority: low label on Jan 8, 2019
galak (Collaborator) commented Jan 15, 2019

This is with QEMU.

andrewboie (Contributor) commented:

also seen on qemu_cortex_m3

andyross (Contributor) commented:

I can't make this happen on master as of commit cd65ed8. tests/kernel/mutex on mps2_an385 got through 106 sanitycheck iterations on my build machine with no failures. Can you retest? What kind of frequency were you seeing?

pabigot (Collaborator, Author) commented Feb 26, 2019

Failed on the sixth run in the above loop, with SDK 0.9.5. I don't remember exactly what failed before, but the assertion below is consistent with thread_09 already holding the mutex.

***** Booting Zephyr OS v1.14.0-rc1-562-gcd65ed857c *****
Running test suite mutex_complex
===================================================================
starting test - test_mutex
***** MPU FAULT *****
  Data Access Violation
  MMFAR Address: 0x20000000
***** Hardware exception *****
Current thread ID = 0x20000290
Faulting instruction address = 0x668
Fatal fault in thread 0x20000290! Aborting.
starting test - Test kernel Mutex API
===================================================================

    Assertion failed at /scratch/zp/zephyr/tests/kernel/mutex/mutex/src/main.c:274: test_mutex: (rv not equal to 0)
Failed to lock mutex 0x20003e28

FAIL - test_mutex
===================================================================
Test suite mutex_complex failed.
===================================================================
PROJECT EXECUTION FAILED
***** MPU FAULT *****
  Data Access Violation
  MMFAR Address: 0x20000000
***** Hardware exception *****
Current thread ID = 0x20000188
Faulting instruction address = 0x56a
Fatal fault in thread 0x20000188! Aborting.

andyross (Contributor) commented:

Odd. What's your host OS? Is the system otherwise unloaded? I was using SDK 0.9.5 for that test, but it seems similarly fine for me with 0.10.0-rc3.

pabigot (Collaborator, Author) commented Feb 26, 2019

Ubuntu 18.04.2, 64 GB RAM, on an i7-6800K (12 hardware threads). There's one 2-core VM running on it plus my development environment; top shows it about 98% idle. I think that's pretty lightly loaded, but it certainly isn't absolutely idle, so that's probably a contributing factor.

andrewboie (Contributor) commented:

Wondering if this is a real kernel bug, or another instance of us not having working icount in QEMU.
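
(For context, icount is QEMU's instruction-counting virtual clock, which ties guest time to instructions executed rather than host wall-clock time. A rough sketch of how it is enabled on a bare qemu-system-arm invocation follows; the shift value and ELF path are illustrative assumptions, and Zephyr's build system supplies its own emulator flags.)

# Illustrative only: run the guest with a deterministic instruction-counter clock.
qemu-system-arm -machine mps2-an385 -nographic \
    -icount shift=6,align=off,sleep=off \
    -kernel build/zephyr/zephyr.elf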

andyross (Contributor) commented:

If @pabigot is getting failures on an unloaded system within 6 iterations, that's for sure not a host-nondeterminism issue. Qemu really is pretty solid as long as the host OS delivers its timing signals promptly; the tests that give us trouble in CI will (literally!) run for thousands of iterations for me if I just run sanitycheck on a single test in a loop like this.

I'll dig around. My own box is a Fedora 29 system (with a similar CPU). At the office, gale is an Ubuntu 18.04 install, but it's in use right now for static analysis runs. I also have an increasingly stale docker image that reproduces the CI infrastructure (which I should really update for cases like this). Surely something will reproduce...

andyross (Contributor) commented Mar 1, 2019

Confirmed on the 18.04 box. No idea why it's sensitive to the host system like this (the toolchain and qemu versions are identical and provided by the SDK). Perhaps the host kernel's scheduler behaves differently.

andyross (Contributor) commented Mar 2, 2019

A quick update now that it's occurred to me: it's not the OS, it's the clock rate. The failing box is a server whose cores sustain 2.6 GHz under load; the machine I couldn't reproduce on is an overclocked 5 GHz desktop. I should clock it down to confirm.
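
(One way to test that theory, assuming the host exposes a cpufreq driver that honors frequency caps and has the cpupower utility installed, is to cap the core clock before re-running the repro loop above; the 2600 MHz figure simply mirrors the server's sustained clock.)

# Cap all cores at roughly the server's sustained clock, then re-run the loop.
sudo cpupower frequency-set --max 2600MHz
# Restore afterwards (5000 MHz shown as an example ceiling for the overclocked desktop).
sudo cpupower frequency-set --max 5000MHz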

So this is almost certainly timing related: on one system qemu can race ahead of the timer ISRs, and on the other it gets caught before it's finished. I strongly suspect this is the same root cause as the still-undiagnosed ARM userspace-vs-timer instability detailed in #12553.

andyross (Contributor) commented Mar 7, 2019

I was close. It's not only the clock rate, it's the CONFIG_HZ parameter of the host kernel. Fedora sets HZ=1000, whereas Ubuntu uses a 250 Hz tick rate, which isn't precise enough to wake or signal the qemu process when the guest expects it in all test cases. Running this test against a host kernel rebuilt to change that one setting (and with the simulation rate slowed down, because ARM userspace turns out to be a little too slow for the 2.6 GHz box) makes it reliable.
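
(For anyone checking their own host, a quick way to read the tick rate out of the running kernel's build config, assuming the distro installs a /boot/config-<version> file as both Fedora and Ubuntu do:)

grep '^CONFIG_HZ' /boot/config-"$(uname -r)"
# Typical output: CONFIG_HZ_1000=y / CONFIG_HZ=1000 on Fedora,
#                 CONFIG_HZ_250=y  / CONFIG_HZ=250  on Ubuntu.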

Closing this bug, as it's not an issue with Zephyr code or this particular test. We'll track the root cause and workarounds in #12553.

andyross closed this as completed on Mar 7, 2019