intermittent kernel.mutex sanitycheck with mps2_an385 #12352
Comments
This is with QEMU.
Also seen on qemu_cortex_m3.
I can't make this happen on master as of commit cd65ed8. tests/kernel/mutex on mps2_an385 got through 106 sanitycheck iterations on my build machine with no failures. Can you retest? What kind of frequency were you seeing?
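The "run it in a loop until it fails" approach used in this thread can be sketched as a small shell helper. This is a hypothetical reconstruction; the exact sanitycheck invocation from the original loop was not preserved in the thread, so the commented-out example flags are an assumption:

```shell
# Run a command repeatedly until it fails, reporting the failing iteration.
# This is a generic sketch of the repeated-sanitycheck loop discussed above.
run_until_fail() {
  i=1
  while "$@"; do
    i=$((i+1))
  done
  echo "failed on iteration $i"
}

# Example (hypothetical flags; adjust to your Zephyr tree and board):
# run_until_fail ./scripts/sanitycheck -p mps2_an385 -T tests/kernel/mutex/
```

Counting iterations this way makes it easy to compare failure frequency across hosts, which is what the discussion below turns on.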
Failed on the sixth run in the above loop, with SDK 0.9.5. I don't remember exactly what failed before, but the assertion below is consistent with it.
Odd. What's your host OS? Is the system otherwise unloaded? I was using 0.9.5 for that test, but it seems similarly fine for me with 0.10.0-rc3.
Ubuntu 18.04.2, 64 GB with 12 cores of i7-6800K. There's one 2-core VM running on it plus my development environment; top shows it about 98% idle. I think that's pretty lightly loaded, but it certainly isn't absolutely idle, so that's probably a contributing factor.
Wondering if this is a real kernel bug, or another instance of us not having working icount in QEMU. |
If @pabigot is getting failures on an unloaded system after 6 iterations, that's for sure not a nondeterminism issue. QEMU really is pretty solid as long as it's allowed to receive its timing signals from the host OS promptly; the tests that give us trouble in CI will (literally!) run for thousands of cycles for me if I just run sanitycheck on a single test in a loop like this. I'll dig around.

My own box is an F29 system (and a similar CPU). At the office, gale is an 18.04 install, but it's in use right now for static analysis runs. I have an increasingly stale docker image that reproduces the CI infrastructure (that I should really update for stuff like this). Surely something will reproduce...
Confirmed on the 18.04 box. No idea why it's sensitive to host system like this (the toolchain and qemu versions are identical and provided in the SDK). Could be the kernel scheduler on the host being different, maybe. |
Just to update while it occurred to me: it's not the OS, it's the clock rate. The failing box is a server with cores that sustain 2.6GHz under load. The machine I couldn't reproduce on is an overclocked 5GHz desktop. I should clock it down to confirm. So this is almost certainly timing related, where on one system qemu can race ahead of the timer ISRs and on the other it gets caught before it's finished. Strongly suspect this is the same root cause as the still-undiagnosed ARM userspace-vs-timer instability detailed in #12553. |
I was close. It's not only the clock rate, it's the CONFIG_HZ parameter in the host kernel. Fedora sets HZ=1000, where Ubuntu uses a 250Hz tick rate, which isn't precise enough to wake up or signal a qemu process when the guest expects for all test cases. Running this test with a rebuilt kernel to fix that one setting (and slowing down the simulation rate because ARM userspace turns out to be a little too slow for the 2.6GHz box) makes this test reliable. Closing this bug, as it's not an issue with Zephyr code or this particular test. We'll track the root cause and workarounds in #12553. |
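For readers wanting to check whether their own host is affected, the host kernel's tick rate can usually be read from its build config. This is a sketch, not an official diagnostic; config file locations vary by distro, and the function simply prints nothing if no config is found:

```shell
# Print the host kernel's CONFIG_HZ from its build config, if available.
# Common locations: /boot/config-$(uname -r) (Debian/Ubuntu/Fedora) or
# /proc/config.gz (kernels built with CONFIG_IKCONFIG_PROC).
host_hz() {
  { cat "/boot/config-$(uname -r)" 2>/dev/null \
      || zcat /proc/config.gz 2>/dev/null; } \
    | sed -n 's/^CONFIG_HZ=//p'
}
```

On the machines discussed above, this would print 1000 on the Fedora box and 250 on the Ubuntu 18.04 one.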
I'm finding that tests/kernel/mutex/mutex/kernel.mutex on mps2_an385 fails intermittently on current master at 330cbfa using SDK 0.9.5.

Instrumenting the test shows that although the test_mutex routine normally reaches the point of taking the mutex within the first 10 ms of runtime, in some situations thread_09 will complete its 500 ms delay first and take the mutex instead.

Reproduce with: