List of tests that keep failing sporadically #12553
tests/kernel/threads/thread_apis/kernel.threads on qemu_x86
Added a failure I had reported, and sorted the list so it's easier to see if something's already there.
The bulk of these are almost certainly the known timer issue with qemu -- the way the emulator works is that it tries to slave the returned cycle time in the guest to real/host time. But when run on a loaded host, it's possible that the qemu process gets timesliced out for "big" chunks of time (i.e. on the order of a host scheduling quantum, routinely around 10ms or more, around the size of our own OS ticks). So to the guest, it looks like the machine "froze" for a significant amount of time, and any interrupts that were supposed to have been delivered arrive very late. For unknown reasons, the recent update of the qemu version in the SDK seems to have made this problem a little worse. And tickless (which is the default now, relies on much longer intervals between tick interrupts, and is thus more sensitive to delays) has made it more visible too.

There's a QEMU_TICKLESS_WORKAROUND kconfig that is designed to partially address this (it effectively forces the driver into a ticked mode internally where interrupts are delivered regularly) for specific tests. We should check to be sure that it's set for all the ones affected here. Likewise it would be helpful to see which of these tests fail spuriously on an unloaded CPU. Those are probably not timer related and deserve direct attention.

Qemu has a -icount feature that is designed to provide deterministic behavior, which would be the real gold-standard fix, but in practice it works poorly and routinely enters a state where timer interrupts stop arriving. There's another github issue tracking this behavior somewhere. My suspicion is that this is a qemu bug that might be fixable, but it seems likely to take a while as we don't have any qemu experts on staff. (Note that tests which don't care about timer interrupt delivery can run with this right now, but obviously that doesn't do anything for the situation at hand.)
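As a quick way to audit that coverage, something like the following lists which test configurations already set the option. This is a sketch only; it assumes the option is enabled via prj.conf or extra *.conf fragments under tests/, and that the Kconfig symbol is spelled CONFIG_QEMU_TICKLESS_WORKAROUND as referenced above:

```sh
# List test config fragments that already enable the qemu tickless workaround.
# Run from the zephyr tree root; adjust the path if your layout differs.
grep -rl --include='*.conf' 'CONFIG_QEMU_TICKLESS_WORKAROUND=y' tests/
```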
FWIW:
Yeah, finished 100 runs each of the tests above. With the exception of the x86_64 issue in #12478 (which I've confirmed already and which is almost certainly an architecture-specific failure) they all work flawlessly on an unloaded system. I'll let the rig run overnight just for fun, but these are for sure artifacts of the emulation environment, and the described timing behavior of qemu (which is known experimentally) would for sure explain it. Let me think for a bit. Maybe we can be more aggressive with the workaround in the hpet and systick drivers by simply assuming that the interrupt must have been delivered on time and adjusting the reported cycle count by whatever the detected offset is. But then that has to have detection for IRQ masking situations where the interrupt is supposed to be delayed (I'm pretty sure we have a few cases in there where this is done deliberately...)
Just had to kill the script so I could use that system for something else. Other than one event where the host kernel reported that the linker sigsegv'd causing a build failure (it's an overclocked build host, sue me), all ~800 runs of each of the tests passed.
Try running a few other things in parallel maybe:
I variously see these failures all the time, FWIW. For the ones that are QEMU timing related, I don't see a way out unless we can get icount working and completely decouple from the system clock. As a short-term workaround, we may want to tag all these tests as "flaky" in testcase.yaml, and build them in a separate CI job which retries them a large number of times.
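For illustration, a minimal sketch of what that tagging might look like in a testcase.yaml (the test identifier below is a placeholder, and the "flaky" tag itself is only the proposal above, not an existing convention):

```yaml
tests:
  kernel.sched.timeslicing:     # placeholder test identifier
    tags: kernel flaky          # hypothetical tag a separate, retry-happy CI job could select on
    timeout: 60
```

A CI configuration could then run the "flaky" set in its own job with retries, keeping failures in the main job meaningful.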
Sounds like a good idea.
Seems to fail for me every 6-7 attempts (and failed in #13366 - https://app.shippable.com/github/zephyrproject-rtos/zephyr/runs/34294/2/tests)
I have a few tests that are consistently breaking on nrf52_blenano2:
You can see the errors here: https://ci.foundries.io/projects/zephyr/builds/4850/sanity-test-nrf52/tests/ |
tests/kernel/sched/schedule_api/kernel.sched.native_posix failing rarely in native_posix would point to an uninitialized memory issue(?)
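If it is uninitialized memory, one nice property of native_posix is that the test binary runs directly on the host, so ordinary host tooling applies. A sketch, assuming a default west build layout (the board, paths, and build directory name here are just examples):

```sh
# Build the schedule_api test for native_posix and run it under valgrind
# to look for reads of uninitialized memory.
west build -b native_posix -d build-sched tests/kernel/sched/schedule_api
valgrind --track-origins=yes ./build-sched/zephyr/zephyr.exe
```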
Some notes here regarding mps2_an385: this platform does sorta feel less stable than the others, and it seems to be down to the MPU code, or interactions thereof. When I run sanitycheck on that platform (in this case, with the fix in PR #13770), I routinely get 3-5 failures out of 155 tests. If I apply the following, the failure rate drops to just one failure in 20 iterations -- roughly a two-order-of-magnitude improvement in reliability:
Now, userspace does affect timing for sure, so this isn't a smoking gun. But any race/timing delta introduced by syscalls should be deterministic, and this really isn't. I'm wondering if something is up here and if @ioannisg might be interested in checking on things. For example, at one point I bisected an instability with tests/kernel/mem_pool/mem_pool_threadsafe on mps2_an385. It turns out that this test has been pretty reliably unstable on this platform going all the way back (exactly) to the introduction of MPU support. It has never run well, even under the older timer subsystem.

And one random theory (probably wrong, but...): what is the relative priority of the PendSV and system call exceptions? Are we guaranteed that a _Swap() done inside a system call is going to synchronously trap? Or does it leave it pending like an interrupt would? If that's the case, _Swap() in syscalls (i.e. almost every swap that isn't in a non-MMU thread like a workqueue or idle thread!) would have the effect of not only being nonatomic, but of being async. We'd return back from it every time, and really confuse things. Can someone prove me wrong really quickly here so I don't have to dig around for the priority handling?
Yikes. That is EXACTLY what is happening. Check out arch/arm/include/cortex_m/exc.h. The diff below shows the relevant code. Unless I'm completely misreading this, the PendSV exception is always set to a priority of 0xff (the lowest possible priority), whereas SVCall gets _EXC_SVC_PRIO instead, a higher priority. So when code running in a syscall handler tries to context switch by calling _Swap(), it does nothing and returns synchronously, having scheduled a PendSV to happen whenever the CPU gets around to it. Apparently that works more often than not, because eventually you try to return to userspace and get swapped away. But any timer interrupt that arrives after this moment is going to see a completely bonkers state where _current thinks it got swapped away but never did. And we don't handle that well at all.

Applying the patch below (again, on top of the nonatomic swap patch linked above) works almost as well as disabling userspace; I'm through 9 sanitycheck runs on mps2_an385 with no spurious failures seen yet. BUT it reliably breaks the tests/kernel/fatal/kernel.common.stack_sentinel case, which apparently requires that _EXC_FAULT_PRIO be identical to _EXC_SVC_PRIO for reasons I haven't figured out (and changing the fault priority breaks other stuff). And there's another problem, in that the actual value of _EXC_SVC_PRIO is either 0 or 1 (depending on whether or not zero is claimed by zero latency irqs), meaning that (again, unless I'm missing something) in our code right now syscalls cannot be preempted, which is sort of a latency disaster, I'd think.

Basically, this isn't a fix yet, but we seem to have something pretty seriously wrong with our SVC/PendSV priority choices on ARM, and they seem to be the source of almost all the (non-qemu-nondeterminism) instability tracked here. @ioannisg @agross-linaro any thoughts?
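For readers less familiar with Cortex-M exception priorities, here is a generic CMSIS-style sketch of the arrangement being described (PendSV parked at the lowest possible priority, SVCall near the highest). It is illustrative only -- not Zephyr's actual exception init code -- and the device header and numeric values are assumptions for the example:

```c
#include "nrf52.h"  /* placeholder: any CMSIS device header for a Cortex-M part */

/* Lowest encodable priority; shows up as 0xff/0xe0 in the raw register,
 * depending on how many priority bits the part implements. */
#define LOWEST_EXC_PRIO ((1U << __NVIC_PRIO_BITS) - 1U)

static void exc_prio_setup(void)
{
    /* SVCall (syscall entry) sits at a high priority... */
    NVIC_SetPriority(SVCall_IRQn, 0U);

    /* ...while PendSV (the deferred context switch) is the lowest, so a
     * swap requested from SVC context is only *pended*: the SVC handler
     * returns, and the actual switch happens later, once nothing of
     * higher priority is active. */
    NVIC_SetPriority(PendSV_IRQn, LOWEST_EXC_PRIO);
}
```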
Zero spurious failures (again, not including the reliably broken one due to fault priority) in 17 runs now. I'm going to leave this running overnight for kicks.
FWIW: overnight it saw 35 total failures in 115 runs of 167 tests each (99.82% success rate), which is if anything rather better than we see on other qemu platforms due to the known time-warp-due-to-host-scheduling issue. Some of that may be due to the relative priorities. To bump pendsv above svc, I moved it all the way up to priority zero, and realized last night that this was effectively "fixing" the nonatomic swap problem by ensuring that pendsv interrupts would always be taken synchronously. Nonetheless we're clearly on to something here. Professional allegiances mean I should probably work on other bugs in the immediate future but I'll return to this in a few days if no one else gets to it.
@andyross, just to answer your question (FYI @agross-linaro): yes, the PendSV exception has the lowest priority level in ARM systems, while SVC (Supervisor Call) has the highest.
@andyross I am not sure I follow 100%:
So the actual system call executes in Thread/PRIV mode, and if a swap is invoked, the PendSV will be set to pending, which immediately switches to the exception handler...
So, it is not that the whole system call cannot be preempted; it is only the system-call-prepare code that cannot be preempted, which is really minimal. BTW, it can be preempted by zero-latency interrupts, which are the most critical ones.
That said, @andyross, I am not sure about this patch:
PendSV won't work during the SVC, which only does the preparation for the sys-call. Is this a problem, @andrewboie?
Yeah, that's the magic I was missing. Indeed the SVC has higher priority than PendSV, but what I missed was that the syscall does not actually happen in the context of an SVC exception; the handler plays with the stack frame such that it can "return from exception" into the kernel running in thread mode. What actually "fixed" things in the patch turns out not to be the relative pendsv priority at all; it's the relative priority of SVC and the timer interrupt that matters. You can take out the pendsv change and leave it at 0xff and it still addresses the stability problems. And I see why now, or at least one reason why (the entry to kernel mode after the exception returns is racy vs. interrupts that can change _current). Thinking on how to fix it...
Then perhaps we should just disable interrupts temporarily inside the SVC call, @agross-linaro? I am going to think about it.
I don't think it's the SVC linkage to the return from EXC_RET. I think it's the timer hitting at the right point on the pendsv (right before the MSR BASEPRI gets set). That said, the SVC cannot be nested. But I can't see that being the problem either. We need to narrow this down some.
OK, now we're getting somewhere! Bear with me, this is long, I know.

First off, sanitycheck turns out to be a big part of the problem. It spawns a thread for every concurrent test (so literally dozens to hundreds of extra threads) to read the output of the qemu process. This thread has significant CPU work to do to parse the output, it runs simultaneously with the output, and it's not captured by the JOBS parameter at all and thus tends to overflow available cores. You have to cut the amount of work done (and thus the speed of the test run) by an extra factor of two to make this go away, and even then I'm still seeing some spurious failures.

So screw it. Let's rewrite the qemu part of sanitycheck. This patch instruments the build and test to save off the qemu command line and list of qemu tests from sanitycheck:

And this script (yes, it's in perl, I don't want to hear it) represents my attempt at a minimally invasive sanitycheck run; it's fed by the list of test directories spit out above. It supports only ztest output, it doesn't try to parse regexes from yaml files -- but effectively all the spurious failures we see are in ztest cases. Critically, it's designed to never saturate the CPU cores. It is single threaded, leaves one core free for itself, dumps all output to files and parses them only after they stop being modified. https://gist.github.com/andyross/f6df418223e6224b214f74264fe0721d

So I fired that up and let it run on my build box overnight on top of the tests/kernel directory and using the qemu platforms (qemu_x86/xtensa/riscv32/nios2 and mps2_an385). And IT WORKS GREAT, but not quite perfect. Over 122 separate runs of 300 test directories (36600 distinct tests), I saw 18 failures. And here's the breakdown:
That is, almost all of the failures are in just one test case on one platform, which is already captured in #12559. This almost certainly represents a real bug and I should fix it. The remaining five are so rare that I honestly don't know what to say about them. Remember that nothing in this test architecture actually guarantees reliable behavior, just makes sure we aren't making things worse. Certainly on this test host (a 6-core Coffee Lake PC at 5 GHz running F29) it looks like we can get very reliable and successful test runs. But then I tried it on my bigger build host (more to come)...
And on that server (a 56-core Skylake at 2.5 GHz running Ubuntu 18.04 LTS) things don't look nearly as neat. I get more testing done there (127021 total cases), but also see a failure rate almost an order of magnitude higher: 1068 total failures! That's too many to paste here, but we can look at the breakdown by test (there are others in the long tail of low rates, these are the worst):
And see that the mem_pool_threadsafe case is still a notable problem. And we can look at the breakdown by platform:
And here's that ARM userspace clustering we see, clear as day! 85% of all "non-load-related" test failures are on that one platform, with xtensa a distant second and altera/riscv showing rates that frankly are barely higher than the ones we saw on the "good" test host. Yet... those failures completely disappear when run on the other host. And both of these runs were in docker images using exactly the same tools built from the same checkout of the CI project.

And I think I finally realized why: it's not the clock rate (at one point I downclocked my 5 GHz host to check failures in #12352 and saw no effect), and it's not the architecture syscall code (despite the fact that I was once convinced it was), it's the host kernel. Fedora ships with a CONFIG_HZ of 1000.
But on Ubuntu it's 250.
So timer precision, as delivered via a signal or wakeup to a qemu process (but not polling of the TSC inside the interpreter loop -- qemu is known to do both), is going to be four times more accurate on Fedora than on Ubuntu. I'm going to build a new kernel for that server machine to verify. But if I'm right, I think we can put this whole issue to bed for now and work in 1.15 on fixing up the way the tests are run. Also, we should probably modify tests to work with a coarser host tick -- at CONFIG_HZ=250 that's a 4ms tick, and in a few cases we have tests written to 1000Hz ticks, a precision such a host obviously won't ever be able to provide.
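For anyone checking their own test host, the distro kernel's tick rate can usually be read straight from the packaged config. This is a sketch; the paths assume a typical Fedora/Ubuntu layout, with /proc/config.gz as a fallback on kernels built with IKCONFIG_PROC:

```sh
# Print the host kernel's tick frequency (e.g. CONFIG_HZ=1000 on Fedora,
# CONFIG_HZ=250 on stock Ubuntu kernels, per the discussion above).
grep -E '^CONFIG_HZ(=|_)' "/boot/config-$(uname -r)" 2>/dev/null \
  || zcat /proc/config.gz 2>/dev/null | grep -E '^CONFIG_HZ(=|_)'
```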
Executive summary (but please try to digest all that stuff above):
Yup, that did it. Building a kernel with CONFIG_HZ=1000 pushed the failure rate down to the level seen on the Fedora box[1]... except for mps2_an385, which was seeing mostly unchanged rates. As it turns out, it was the clock rate for that particular platform. ARM userspace is just... slow, I guess. I cranked CONFIG_HW_CLOCK_CYCLES_PER_SEC from 25MHz up to 50MHz (counterintuitive, but by making all the clock deltas twice as large we effectively slow down the simulation relative to the real world), and now it looks perfect. No failures yet. I'll continue to watch this, but I think @agross-linaro and @ioannisg are off the hook.

For 1.15 we'll come up with some test and scaling solutions to make this reliable. I'm 100% certain now that it's possible. But the presumptive timeslicing bug in #12559 remains real. I officially wash my hands of the reliability stuff and recommend closing this bug, reprioritizing and/or bumping to 1.15.

[1] Roughly. Obviously these tests haven't been running long so I don't have overnight-scale data yet.
This seems to have gone quiet. I'm convinced this is well-characterized and resolvable, but of course it won't be for 1.14. I've opened #14649 to track a firm "must be retry-free" requirement for the next release, and am going to close this one as a 1.14 bug.
After a chat with @andyross I'm volunteering to add a new tag.
Fine with me. Anyone with spare development bandwidth: someday I'd love this fixed once and for all, see #14173.
We already have the info so let's show it. This helps spot intermittent issues[*], gives an indication of the time --build-only saves, can help spot an overloaded test system, highlights the most time-consuming tests which may need a longer timeout in their config, shows the effective timeout value when one occurs... all this for a dirt cheap screen real estate price and two extra lines of code.

Sample -v output:

  32/81 board123 tests/testme PASSED (qemu 2.049s)
  33/81 board456 samples/hello PASSED (build)
  34/81 qemu_x3 tests/kernel.stack.usage FAILED: timeout (qemu 60.029s) see: sanity-out/qemu_x3/tests/kernel.stack.usage/handler.log
  35/81 board456 tests/testme PASSED (build)
  36/81 qemu_x5 tests/kernel.queue FAILED: failed (qemu 2.191s) see: sanity-out/qemu_x5/tests/kernel.queue/handler.log

[*] running qemu in heavily packed cloud virtual machines comes to mind, also see zephyrproject-rtos#12553, zephyrproject-rtos#14173 etc.

Signed-off-by: Marc Herbert <[email protected]>
Quoting bug "Configure QEMU to run independent of the host clock zephyrproject-rtos#14173":

We have been struggling for years with issues related to how QEMU attempts to synchronize guest timer interrupts with the host clock, for example zephyrproject-rtos#12553. The symptom is that heavily loaded sanitycheck runs have tests spuriously failing due to timing related issues. This creates noise in our CI runs which masks true bugs in our system which manifest only intermittently, causing real issues that will happen all the time at scale to be 'swept under the rug'; right now any time a test fails, sanitycheck retries it a few times and only consecutive failures produce an error.

There's also a lot of relevant information and more links in: "List of tests that keep failing sporadically" zephyrproject-rtos#12553.

This new "emu_time" tag helps by letting users either select or exclude the tests that really need accurate time to pass and have a high chance to actually be impacted by this emulation issue. As an example, it's only with 'sanitycheck --exclude emu_time' that I could spot and file the intermittent but non-emu_time issue zephyrproject-rtos#16915. As Andrew predicted above, it was drowned in emu_time noise before that. Conversely, "--tag emu_time" can be used by developers focusing on fixing qemu's -icount feature, for instance zephyrproject-rtos#14173 or others.

Even before qemu's -icount is fixed, Continuous Integration could be split in two separate runs: A. --tag emu_time and B. --exclude emu_time. Only A tests would be allowed retries, which would stop hiding other, unrelated intermittent issues affecting B tests.

This initial commit does not pretend to exhaustively tag all affected tests. However it's an already functional and useful start of 14 tests, collected from and tested over weeks of local sanitycheck runs and _partially_ reviewed by qemu clock expert Andy Ross.

This commit also increases the individual timeout of 7 tests that have been observed to consistently take much longer than the median (2-3s). This flags how unusually long these are, lets users temporarily reduce the very long 60s default timeout in their local workspace, and finally should reduce the chance of them timing out on a heavily loaded system. Their timeout is set to 3-4 times the duration observed in CI and locally.

Signed-off-by: Marc Herbert <[email protected]>
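A minimal sketch of what that CI split could look like as two sanitycheck invocations. The --tag/--exclude spellings come from the commit message above; the path assumes running from the zephyr tree root, and the retry policy itself would live in the CI job definitions rather than in these commands:

```sh
# Run A: timing-sensitive tests only -- the CI job wrapping this run is the
# one that would be allowed to retry failures.
./scripts/sanitycheck --tag emu_time

# Run B: everything else -- a failure here should be treated as a real bug,
# with no retries to hide it.
./scripts/sanitycheck --exclude emu_time
```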
As we continue to see intermittent failures I'm reopening this in lieu of creating a new one. We can discuss its priority and assignees in the next triage meeting. I believe that #17155 will help with this, and should be merged before we confirm tests. I would use #18997 to verify that reported tests are still failing even if they seem to pass CI, and update the list of specific issues in the header. #14649 and #14173 may also contribute to resolving this "permanently".
The issues here are fundamental. There's really nothing #17155 can do to fix them, though on the margins it does work to tighten up timeout behavior and might make failures less likely.

Take a look at #18656. I'd really appreciate it if we could make testing with that a requirement for opening a new spurious failure bug. The tooling there runs sanitycheck in a context where host load is carefully controlled (but still as parallel as possible). Using that, on a host that is fast enough (a 3+ GHz host clock seems OK; 2.5GHz shows occasional failures on mps2_an385 due to the simple CPU overhead of ARM userspace emulation) and with a 1000 Hz host kernel tick (i.e. not Ubuntu, which has CONFIG_HZ=250), I get perfectly reliable operation on existing tests (up to the 10k or so test iterations I've been able to do in single runs).

That's not "proof" that there are no bugs lurking. But if removing host clock interaction on systems where host clock interaction is inevitable works to resolve the issue, by definition we can't be sure that there is a real bug. Spurious failures on hardware, of course, are real issues that need to be addressed. But the qemu stuff is as good as we're going to get it at this point absent changes to our test tools.
Closing as this was resolved in different places.
A few tests keep failing sporadically; here is a list to make sure all are addressed.
Issues/PRs that may assist with this task:
Issues seen on real hardware, which are real bugs:
Issues seen ONLY on QEMU emulators, which may be due to QEMU timekeeping problems: