
List of tests that keep failing sporadically #12553

Closed
2 of 18 tasks
nashif opened this issue Jan 17, 2019 · 33 comments
Labels
bug The issue is a bug, or the PR is fixing a bug

Comments

@nashif
Member

nashif commented Jan 17, 2019

A few tests keep failing sporadically; here is a list to make sure all are addressed.

Issues/PRs that may assist with this task:

Issues seen on real hardware and are real bugs:

Issues ONLY seen on QEMU emulators and may be due to QEMU timekeeping problems:

@nashif nashif added bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug labels Jan 17, 2019
@nashif
Member Author

nashif commented Jan 17, 2019

tests/kernel/threads/thread_apis/kernel.threads on qemu_x86

***** Kernel Panic! *****
Current thread ID = 0x00400020
eax: 0x00000206, ebx: 0x00000000, ecx: 0x00000000, edx: 0x004000a0
esi: 0x00000000, edi: 0x00001811, ebp: 0x004021dc, esp: 0x004021d8
eflags: 0x00000006 cs: 0x0008
call trace:
eip: 0x00001383
     0x00001820 (0x0)
Caught system error -- reason 8
test stack overflow - timer irq
posting 4096 bytes of junk to stack...
busy waiting...
FAIL - stack_thread1@132. should never see this
***** Stack Check Fail! *****
Current thread ID = 0x00400020
eax: 0x00401208, ebx: 0x00000216, ecx: 0x00000000, edx: 0x00400020
esi: 0x00400020, edi: 0x00001811, ebp: 0x004021d4, esp: 0x004021d0
eflags: 0x00000083 cs: 0x0008
call trace:
eip: 0x00003d6e
     0x00004089 (0x0)
     0x0000182e (0x400020)
Caught system error -- reason 4

    Assertion failed at /home/nashif/github/zephyr/tests/kernel/fatal/src/main.c:236: test_fatal: (rv equal to TC_FAIL)
thread was not aborted
FAIL - test_fatal
===================================================================
Test suite fatal failed.
===================================================================
PROJECT EXECUTION FAILED

@pabigot
Collaborator

pabigot commented Jan 19, 2019

Added a failure I had reported, and sorted the list so it's easier to see if something's already there.

@nashif nashif pinned this issue Jan 23, 2019
@nashif nashif added this to the v1.14.0 milestone Jan 23, 2019
@rveerama1 rveerama1 unpinned this issue Jan 24, 2019
@andyross
Contributor

The bulk of these are almost certainly the known timer issue with qemu -- the way that the emulator works is that it tries to slave the returned cycle time in the guest to real/host time. But when run on a loaded host, it's possible that the qemu process gets timesliced out for "big" chunks of time (i.e. on the order of a host scheduling quantum, routinely around 10ms or more, around the size of our own OS ticks). So to the guest, it looks like the machine "froze" for a significant amount of time, and any interrupts that were supposed to have been delivered arrive very late.

For unknown reasons, the recent update of the qemu version in the SDK seems to have made this problem a little worse. And tickless (which is the default now, and relies on much longer intervals between tick interrupts and thus is more sensitive to delays) has made it more visible too. There is a QEMU_TICKLESS_WORKAROUND kconfig that is designed to partially address this (it effectively forces the driver into a ticked mode internally where interrupts are delivered regularly) for specific tests. We should check to be sure that it's set for all the ones affected here.
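
For concreteness, "forcing the driver into a ticked mode" amounts to something like the sketch below. This is illustrative only, not the actual hpet/systick source; sketch_set_timeout() and program_compare_register() are placeholders for whatever the real driver does.

#include <stdint.h>
#include <stdbool.h>

static void program_compare_register(uint32_t ticks_ahead)
{
        (void)ticks_ahead;      /* stand-in for the real hardware timer programming */
}

void sketch_set_timeout(int32_t ticks, bool idle)
{
        (void)idle;
#ifdef CONFIG_QEMU_TICKLESS_WORKAROUND
        /* Behave as if ticked: always program just one tick ahead, so the
         * guest keeps receiving regular timer interrupts even when the host
         * stalls the emulator for a while.
         */
        ticks = 1;
#endif
        program_compare_register((uint32_t)ticks);
}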

Likewise it would be helpful to see which of these tests fail spuriously on an unloaded CPU. Those are probably not timer related and deserve direct attention.

Qemu has a -icount feature that is designed to provide deterministic behavior which would be the real gold standard fix, but in practice it works poorly and routinely enters a state where timer interrupts stop arriving. There's another github issue tracking this behavior somewhere. My suspicion is that this is a qemu bug that might be fixable, but seems likely to take a while as we don't have any qemu experts on staff. (Note that tests which don't care about timer interrupt delivery can run with this right now, but obviously that doesn't do anything for the situation at hand).

@andyross
Contributor

FWIW: for i in {1..100}; do scripts/sanitycheck -T tests/kernel/fatal -p qemu_x86; done reports no errors on an unloaded system. Will repeat for the others as time allows.

@andyross
Contributor

Yeah, finished 100 runs each of the tests above. With the exception of the x86_64 issue in #12478 (which I've confirmed already and which is almost certainly an architecture-specific failure) they all work flawlessly on an unloaded system. I'll let the rig run overnight just for fun, but these are for sure artifacts of the emulation environment, and the described timing behavior of qemu (which is known experimentally) would for sure explain it.

Let me think for a bit. Maybe we can be more aggressive with the workaround in the hpet and systick drivers by simply assuming that the interrupt must have been delivered on time and adjusting the reported cycle count by whatever the detected offset is. But then that has to have detection for IRQ masking situations where the interrupt is supposed to be delayed (I'm pretty sure we have a few cases in there where this is done deliberately...)
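
Roughly, that idea would look something like this (illustrative sketch only, not real hpet/systick code; read_hw_cycles() and CYCLES_PER_TICK are placeholders, and the IRQ-masking caveat above is deliberately ignored):

#include <stdint.h>

#define CYCLES_PER_TICK 10000u   /* arbitrary value for the sketch */

static uint32_t last_deadline;   /* cycle count the comparator was programmed for */
static uint32_t late_offset;     /* accumulated host-induced lateness */

static uint32_t read_hw_cycles(void)
{
        return 0;                /* placeholder for the real counter register read */
}

void sketch_timer_isr(void)
{
        uint32_t now = read_hw_cycles();

        if ((int32_t)(now - last_deadline) > 0) {
                /* The interrupt arrived later than programmed (e.g. the host
                 * descheduled qemu): pretend it fired exactly on time and
                 * remember the difference.
                 */
                late_offset += now - last_deadline;
        }
        last_deadline += CYCLES_PER_TICK;   /* next ideal deadline */
        /* ... announce one tick to the kernel based on the adjusted count ... */
}

uint32_t sketch_cycle_get_32(void)
{
        return read_hw_cycles() - late_offset;   /* hide the stall from the guest */
}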

@andyross
Contributor

Just had to kill the script so I could use that system for something else. Other than one event where the host kernel reported that the linker sigsegv'd causing a build failure (it's an overclocked build host, sue me), all ~800 runs of each of the tests passed.

@nashif
Member Author

nashif commented Jan 31, 2019

Try running a few other things in parallel, maybe:

for i in {1..100}; do scripts/sanitycheck -T tests/kernel/fatal -T tests/kernel/mem_pool/mem_pool_threadsafe/ -T tests/kernel/mem_protect/mem_protect/ -T tests/kernel/mutex/mutex/ -T tests/kernel/sched/schedule_api -T tests/kernel/threads/thread_apis -p qemu_x86; done

@andrewboie
Contributor

I variously see these failures all the time, FWIW.

For the ones that are QEMU timing related, I don't see a way out unless we can get icount working and completely decouple from the system clock.

As a short-term workaround, we may want to tag all these tests as "flaky" in testcase.yaml, and build them in a separate CI job which retries them a large number of times.

@nashif
Member Author

nashif commented Feb 3, 2019

As a short-term workaround, we may want to tag all these tests as "flaky" in testcase.yaml, and build them in a separate CI job which retries them a large number of times.

Sounds like a good idea.

@tgorochowik
Member

tgorochowik commented Feb 13, 2019

./scripts/sanitycheck --platform mps2_an385 -v -s tests/kernel/mem_slab/mslab_threadsafe/kernel.memory_slabs

Seems to fail for me every 6-7 attempts (and failed in #13366 - https://app.shippable.com/github/zephyrproject-rtos/zephyr/runs/34294/2/tests)

@doanac
Collaborator

doanac commented Feb 13, 2019

I have a few tests that are consistently breaking on nrf52_blenano2:

  • tests/benchmarks/sched
  • tests/kernel/interrupt
  • tests/kernel/sched/schedule_api

You can see the errors here: https://ci.foundries.io/projects/zephyr/builds/4850/sanity-test-nrf52/tests/

@aescolar
Member

aescolar commented Feb 15, 2019

tests/kernel/sched/schedule_api/kernel.sched.native_posix failing rarely on native_posix would point to an uninitialized-memory issue(?)

@andyross
Contributor

Some notes here regarding mps2_an385: this platform does sorta feel less stable than the others, and it seems to be down to the MPU code, or interactions thereof.

When I run sanitycheck on that platform (in this case, with the fix in PR #13770), I routinely get 3-5 failures out of 155 tests. If I apply the following, the failure rate drops to just one failure in 20 iterations, roughly a two-order-of-magnitude improvement in reliability:

diff --git a/soc/arm/arm/mps2/Kconfig.soc b/soc/arm/arm/mps2/Kconfig.soc
index 642f41cc3c..8258556a4d 100644
--- a/soc/arm/arm/mps2/Kconfig.soc
+++ b/soc/arm/arm/mps2/Kconfig.soc
@@ -11,6 +11,5 @@ depends on SOC_SERIES_MPS2
 config SOC_MPS2_AN385
        bool "ARM Cortex-M3 SMM on V2M-MPS2 (Application Note AN385)"
        select CPU_CORTEX_M3
-       select CPU_HAS_ARM_MPU
 
 endchoice

Now, userspace does affect timing for sure, so this isn't a smoking gun. But any race/timing delta introduced by syscalls should be deterministic, and this really isn't. I'm wondering if something is up here and if @ioannisg might be interested in checking on things.

For example, at one point I bisected an instability with tests/kernel/mem_pool/mem_pool_threadsafe on mps2_an385. That test has been pretty reliably unstable on this platform going all the way back (exactly) to the introduction of MPU support; it has never run well, even under the older timer subsystem.

And one random theory (probably wrong, but...) what is the relative priority of the PendSV and system call exceptions? Are we guaranteed that a _Swap() done inside a system call is going to synchronously trap? Or does it leave it pending like an interrupt would? If that's the case, _Swap() in syscalls (i.e. almost every swap that isn't in a non-MMU thread like a workqueue or idle thread!) would have the effect of not only being nonatomic, but of being async. We'd return back from it every time, and really confuse things. Can someone prove me wrong really quickly here so I don't have to dig around for the priority handling?

@andyross
Contributor

what is the relative priority of the PendSV and system call exceptions?

Yikes. That is EXACTLY what is happening. Check out arch/arm/include/cortex_m/exc.h. The diff below shows the relevant code. Unless I'm completely misreading this, the PendSV exception is always set to a priority of 0xff (the lowest possible priority), where SVCall gets _EXC_SVC_PRIO instead, a higher priority.

So when code running in a syscall handler tries to context switch by calling _Swap(), it does nothing and returns synchronously having scheduled a PendSV to happen whenever the CPU gets around to it. Apparently, that works more often than not because eventually you try to return to userspace and get swapped away. But any timer interrupt that arrives after this moment is going to see a completely bonkers state where _current thinks it got swapped away but never did. And we don't handle that well at all.
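
For anyone not steeped in Cortex-M details, the mechanism in question boils down to something like the sketch below. This is not Zephyr's actual _Swap() path, just an illustration; the register address and bit come from the ARMv7-M architecture manual.

#include <stdint.h>

#define SCB_ICSR        (*(volatile uint32_t *)0xE000ED04UL)  /* ARMv7-M Interrupt Control and State Register */
#define ICSR_PENDSVSET  (1UL << 28)                           /* PendSV set-pending bit */

static inline void request_context_switch(void)
{
        SCB_ICSR = ICSR_PENDSVSET;            /* mark PendSV pending */
        __asm volatile ("dsb" ::: "memory");  /* make sure the pend is visible */
        __asm volatile ("isb" ::: "memory");
        /* If PendSV's priority is lower than the currently active exception
         * (or it is masked via BASEPRI), execution simply continues here and
         * the real switch happens "whenever the CPU gets around to it" --
         * exactly the synchronous-return behavior described above.
         */
}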

Applying the patch below (again, on top of the nonatomic swap patch linked above) works almost as well as disabling userspace; I'm through 9 sanitycheck runs on mps2_an385 with no spurious failures seen yet. BUT it reliably breaks the tests/kernel/fatal/kernel.common.stack_sentinel case, which apparently requires that _EXC_FAULT_PRIO be identical to _EXC_SVC_PRIO for reasons I haven't figured out (and changing the fault priority breaks other stuff).

And there's another problem, in that the actual value of _EXC_SVC_PRIO is either 0 or 1 (depending on whether or not zero is claimed by zero latency irqs), meaning that (again unless I'm missing something) in our code right now syscalls cannot be preempted, which is sort of a latency disaster, I'd think.

Basically, this isn't a fix yet, but we seem to have something pretty seriously wrong with our SVC/PendSV priority choices on ARM, and they seem to be the source of almost all the (non-qemu-nondeterminism) instability tracked here. @ioannisg @agross-linaro any thoughts?

diff --git a/arch/arm/include/cortex_m/exc.h b/arch/arm/include/cortex_m/exc.h
index e111974f03..a51342de07 100644
--- a/arch/arm/include/cortex_m/exc.h
+++ b/arch/arm/include/cortex_m/exc.h
@@ -95,7 +95,9 @@ static ALWAYS_INLINE bool _IsInIsr(void)
  */
 static ALWAYS_INLINE void _ExcSetup(void)
 {
-       NVIC_SetPriority(PendSV_IRQn, 0xff);
+       BUILD_ASSERT_MSG(_EXC_PENDSV_PRIO < _EXC_SVC_PRIO,
+                        "PendSV MUST have higher priority than SVC, or swap won't work in syscalls!");
+       NVIC_SetPriority(PendSV_IRQn, _EXC_PENDSV_PRIO);
 
 #ifdef CONFIG_CPU_CORTEX_M_HAS_BASEPRI
        NVIC_SetPriority(SVCall_IRQn, _EXC_SVC_PRIO);
diff --git a/include/arch/arm/cortex_m/exc.h b/include/arch/arm/cortex_m/exc.h
index 23e4341ca3..cae51b3ad7 100644
--- a/include/arch/arm/cortex_m/exc.h
+++ b/include/arch/arm/cortex_m/exc.h
@@ -30,10 +30,12 @@ extern "C" {
 #define _EXC_FAULT_PRIO 0
 #ifdef CONFIG_ZERO_LATENCY_IRQS
 #define _EXC_ZERO_LATENCY_IRQS_PRIO 0
-#define _EXC_SVC_PRIO 1
+#define _EXC_PENDSV_PRIO 1
+#define _EXC_SVC_PRIO 2
 #define _IRQ_PRIO_OFFSET (_EXCEPTION_RESERVED_PRIO + 1)
 #else
-#define _EXC_SVC_PRIO 0
+#define _EXC_PENDSV_PRIO 0
+#define _EXC_SVC_PRIO 1
 #define _IRQ_PRIO_OFFSET (_EXCEPTION_RESERVED_PRIO)
 #endif

@andyross
Contributor

Zero spurious failures (again, not including the reliably broken one due to fault priority) in 17 runs now. I'm going to leave this running overnight for kicks.

@andyross
Contributor

FWIW: overnight it saw 35 total failures in 115 runs of 167 tests each (99.82% success rate), which is if anything rather better than we see on other qemu platforms due to the known time-warp-due-to-host-scheduling issue.

Some of that may be due to the relative priorities. To bump pendsv above svc, I moved it all the way up to priority zero, and realized last night that this was effectively "fixing" the nonatomic swap problem by ensuring that pendsv interrupts would always be taken synchronously.

Nonetheless we're clearly on to something here. Professional allegiances mean I should probably work on other bugs in the immediate future but I'll return to this in a few days if no one else gets to it.

@ioannisg
Member

ioannisg commented Feb 28, 2019

And one random theory (probably wrong, but...) what is the relative priority of the PendSV and system call exceptions?

@andyross, just to answer your question, (FYI @agross-linaro) yes, the PendSV exception has the lowest priority level in ARM systems, while the SVC (Supervisor Calls) has the highest.

@ioannisg
Member

ioannisg commented Feb 28, 2019

So when code running in a syscall handler tries to context switch by calling _Swap(), it does nothing and returns synchronously having scheduled a PendSV to happen whenever the CPU gets around to it.

@andyross I am not sure I follow 100%:
the code in __SVC_Handler() - called when a user thread invokes a system call - does not call any swap. It does the following:

  • switches to PRIV stack
  • switches to PRIV mode
  • modifies stacked frame, so we return to _arm_do_syscall, in Thread mode.

So the actual system call executes in Thread/PRIV mode, and if a swap is invoked, PendSV will be set to pending, which immediately switches to the exception handler...
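
To make the "modifies stacked frame" step concrete, the trick is roughly the following. This is a generic illustration, not the actual arch/arm handler; struct basic_esf and redirect_exception_return() are made up here and just mirror the ARMv7-M hardware-stacked frame.

#include <stdint.h>

struct basic_esf {                    /* frame the hardware pushes on exception entry */
        uint32_t r0, r1, r2, r3;
        uint32_t r12, lr, pc, xpsr;
};

/* Rewriting the stacked pc decides where the "return from exception" lands;
 * pointing it at the C system-call dispatcher makes the rest of the syscall
 * run in Thread mode, outside the SVC exception.
 */
static inline void redirect_exception_return(struct basic_esf *esf,
                                             void (*target)(void))
{
        esf->pc = (uint32_t)(uintptr_t)target;
}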

@ioannisg
Member

And there's another problem, in that the actual value of _EXC_SVC_PRIO is either 0 or 1 (depending on whether or not zero is claimed by zero latency irqs), meaning that (again unless I'm missing something) in our code right now syscalls cannot be preempted, which is sort of a latency disaster, I'd think.

So, it is not that the whole system call cannot be preempted. It is only the system-call-prepare code that cannot be preempted, which is really minimal.

BTW, it can be preempted by zero-latency interrupts, which are the most critical ones.

@ioannisg
Member

That said, @andyross , I am not sure about this patch:

-       NVIC_SetPriority(PendSV_IRQn, 0xff);
+       BUILD_ASSERT_MSG(_EXC_PENDSV_PRIO < _EXC_SVC_PRIO,
+                        "PendSV MUST have higher priority than SVC, or swap won't work in syscalls!");
+       NVIC_SetPriority(PendSV_IRQn, _EXC_PENDSV_PRIO);

PendSV won't work during the SVC, which only does the preparation for the sys-call. Is this a problem, @andrewboie?

@andyross
Contributor

andyross commented Feb 28, 2019

Yeah, that's the magic I was missing. Indeed the SVC has higher priority than PendSV, but what I missed was that the syscall does not actually happen in the context of an SVC exception; the handler plays with the stack frame such that it can "return from exception" into the kernel running in thread mode.

What actually "fixed" things in the patch turns out not to be the relative pendsv priority at all; it's the relative priority of SVC and the timer interrupt that matters. You can take out the pendsv change and leave it at 0xff and it still addresses the stability problems. And I see why now, or at least one reason why (the entry to kernel mode after the exception returns is racy vs. interrupts that can change _current). Thinking on how to fix it...

@ioannisg
Member

What actually "fixed" things in the patch turns out not to be the relative pendsv priority at all; it's the relative priority of SVC and the timer interrupt that matters. You can take out the pendsv change and leave it at 0xff and it still addresses the stability problems. And I see why now, or at least one reason why (the entry to kernel mode after the exception returns is racy vs. interrupts that can change _current). Thinking on how to fix it...

Then perhaps we should just disable interrupts temporarily inside the SVC call, @agross-linaro? I am going to think about it.

@agross-oss
Collaborator

I don't think it's the SVC linkage to the return from EXC_RET. I think it's the timer hitting at the right point on the pendsv (right before the MSR BASEPRI gets set). That said, the SVC cannot be nested. But I can't see that being the problem either. We need to narrow this down some.

@andyross
Contributor

andyross commented Mar 7, 2019

OK, now we're getting somewhere! Bear with me, this is long, I know.

First off, sanitycheck turns out to be a big part of the problem. It spawns a thread for every concurrent test (so literally dozens to hundreds of extra threads) to read the output of the qemu process. This thread has significant CPU work to do to parse the output, it runs concurrently with the qemu process producing that output, and it's not captured by the JOBS parameter at all, so it tends to oversubscribe the available cores. You have to cut the amount of work done (and thus the speed of the test run) by an extra factor of two to make this go away, and even then I'm still seeing some spurious failures.

So screw it. Let's rewrite the qemu part of sanitycheck. This patch instruments the build and test to save off the qemu command line and list of qemu tests from sanitycheck:

andyross@1f144d4

And this script (yes, it's in perl, I don't want to hear it) represents my attempt at a minimally invasive sanitycheck run; it's fed by the list of test directories spit out above. It supports only ztest output and doesn't try to parse regexes from yaml files, but effectively all the spurious failures we see are in ztest cases. Critically, it's designed to never saturate the CPU cores: it is single threaded, leaves one core free for itself, dumps all output to files, and parses them only after they stop being modified.

https://gist.github.com/andyross/f6df418223e6224b214f74264fe0721d
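
The core trick, stripped of the perl, is just "don't parse a log until it has stopped changing". A standalone C sketch of that idea follows; the file name and the 5-second quiet window are arbitrary here, not what the script actually uses.

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/stat.h>

/* Block until 'path' has gone 'quiet_secs' consecutive seconds without being
 * modified; only then is it safe to parse the ztest PASS/FAIL output.
 */
static int wait_until_quiet(const char *path, int quiet_secs)
{
        struct stat st;
        time_t last_mtime = 0;
        int quiet = 0;

        while (quiet < quiet_secs) {
                if (stat(path, &st) != 0) {
                        return -1;                /* missing or unreadable */
                }
                if (st.st_mtime == last_mtime) {
                        quiet++;                  /* another second with no writes */
                } else {
                        last_mtime = st.st_mtime;
                        quiet = 0;                /* qemu is still writing; start over */
                }
                sleep(1);
        }
        return 0;
}

int main(int argc, char **argv)
{
        if (argc < 2) {
                fprintf(stderr, "usage: %s <handler.log>\n", argv[0]);
                return 1;
        }
        if (wait_until_quiet(argv[1], 5) != 0) {
                return 1;
        }
        printf("%s has settled; parse it for ztest results now\n", argv[1]);
        return 0;
}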

So I fired that up and let it run on my build box overnight on top of the tests/kernel directory and using the qemu platforms (qemu_x86/xtensa/riscv32/nios2 and mps2_an385). And IT WORKS GREAT, but not quite perfect. Over 122 separate runs of 300 test directories (36600 distinct tests), I saw 18 failures. And here's the breakdown:

      1 FAIL (timeout) /home/andy/zephyr/sanity-out/mps2_an385/tests/kernel/fifo/fifo_timeout/kernel.fifo.timeout.poll
      1 FAIL (timeout) /home/andy/zephyr/sanity-out/qemu_nios2/tests/kernel/mem_pool/mem_pool_threadsafe/kernel.memory_pool
      1 FAIL (timeout) /home/andy/zephyr/sanity-out/qemu_xtensa/tests/kernel/queue/kernel.queue.poll
      2 FAIL (timeout) /home/andy/zephyr/sanity-out/qemu_xtensa/tests/kernel/sleep/kernel.common.timing
     13 FAIL (timeout) /home/andy/zephyr/sanity-out/mps2_an385/tests/kernel/mem_pool/mem_pool_threadsafe/kernel.memory_pool

That is, almost all of the failures are in just one test case on one platform, which is already captured in #12559. This almost certainly represents a real bug and I should fix it.

The remaining five are so rare that I honestly don't know what to say about them. Remember that nothing in this test architecture actually guarantees reliable behavior, just makes sure we aren't making things worse.

Certainly on this test host (a 6-core Coffee Lake PC at 5GHz running F29) it looks like we can get very reliable and successful test runs. But then I tried it on my bigger build host (more to come)...

@andyross
Contributor

andyross commented Mar 7, 2019

And on that server (a 56-core Skylake at 2.5GHz running Ubuntu 18.04 LTS) things don't look nearly as neat. I get more testing done there (127021 total cases), but also see a failure rate almost an order of magnitude higher: 1068 total failures!

That's too many to paste here, but we can look at the breakdown by test (there are others in the long tail of low rates; these are the worst):

     14 tests/kernel/fatal/kernel.common.stack_sentinel
     17 tests/kernel/mem_pool/mem_pool_api/kernel.memory_pool
     17 tests/kernel/timer/timer_api/kernel.timer
     22 tests/kernel/sched/deadline/kernel.sched.deadline
     27 tests/kernel/mem_protect/mem_protect/kernel.memory_protection
     31 tests/kernel/msgq/msgq_api/kernel.message_queue
     33 tests/kernel/timer/timer_api/kernel.timer.tickless
     34 tests/kernel/stack/stack_usage/kernel.stack.usage
     44 tests/kernel/mbox/mbox_usage/kernel.mailbox
     49 tests/kernel/mem_slab/mslab_threadsafe/kernel.memory_slabs
     76 tests/kernel/mutex/mutex/kernel.mutex
    124 tests/kernel/threads/thread_apis/kernel.threads
    227 tests/kernel/lifo/lifo_usage/kernel.lifo.usage
    262 tests/kernel/mem_pool/mem_pool_threadsafe/kernel.memory_pool

And see that the mem_pool_threadsafe case is still a notable problem. And we can look at the breakdown by platform:

     10 qemu_riscv32
     41 qemu_nios2
    170 qemu_xtensa
    847 mps2_an385

And here's that ARM userspace clustering we see, clear as day! 85% of all "non-load-related" test failures are on that one platform, with xtensa a distant second and altera/riscv showing rates that frankly are barely higher than the ones we saw on the "good" test host.

Yet... those failures completely disappear when run on the other host. And both of these runs were in docker images using exactly the same tools built from the same checkout of the CI project. And I think I finally realized why: it's not the clock rate (at one point I downclocked my 5GHz host to check failures in #12352 and saw no effect), and it's not the architecture syscall code (despite the fact that I was once convinced it was), it's the host kernel.

Fedora ships with a CONFIG_HZ of 1000:

manx:~$ fgrep _HZ /boot/config-`uname -r`
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_NO_HZ=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000

But on Ubuntu it's 250:

andy@gale:~$ fgrep _HZ /boot/config-`uname -r`
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250

So timer precision, as delivered via a signal or wakeup to a qemu process (but not polling of the TSC inside the interpreter loop -- qemu is known to do both) is going to be four times more accurate on Fedora than on Ubuntu. I'm going to build a new kernel for that server machine to verify.

But if I'm right, I think we can put this whole issue to bed for now and work in 1.15 on fixing up the way the tests are run. Also, we should probably modify tests to work with a 4ms host tick -- in a few cases we have tests written assuming 1000Hz tick precision, which such a host obviously won't ever be able to provide.

@andyross
Contributor

andyross commented Mar 7, 2019

Executive summary (but please try to digest all that stuff above):

  1. On a carefully unloaded test run, error rates (outside of the bug in "tests/kernel/mem_pool/mem_pool_threadsafe/kernel.memory_pool fails sporadically" #12559) are on the order of 1 in 10,000, which is about as good as we can get with a rig like this. Trying to get beyond four nines of reliability requires more than host timing for sure.

  2. Our test suite puts requirements on the host tick rate that Ubuntu (and probably most CI environments) doesn't meet in its default kernel. Much of this can be addressed in 1.15 by changing test timings and tick rates appropriately (or by making -icount work).

  3. Sanitycheck needs to change the way it executes the qemu processes as long as we are going to try to use host-timed qemu. I'm simply not able to find a way to run them in parallel with the build such that it works reliably.

@andyross
Contributor

andyross commented Mar 7, 2019

Yup, that did it. Building a kernel with CONFIG_HZ=1000 pushed the failure rate down to the level seen on the Fedora box[1] ... except for mps2_an385, which was seeing mostly unchanged rates.

As it turns out, it was the clock rate for that particular platform. ARM userspace is just... slow, I guess. I cranked CONFIG_HW_CLOCK_CYCLES_PER_SEC from 25MHz up to 50MHz (counterintuitive, but by making all the clock deltas twice as large we effectively slow down the simulation relative to the real world), and now it looks perfect. No failures yet.

I'll continue to watch this, but I think @agross-linaro and @ioannisg are off the hook. For 1.15 we'll come up with some test and scaling solutions to make this reliable. I'm 100% certain now that it's possible.

But the presumptive timeslicing bug in #12559 remains real. I officially wash my hands of the reliability stuff and recommend closing this bug, reprioritizing and/or bumping to 1.15.

[1] Roughly. Obviously these tests haven't been running long so I don't have overnight-scale data yet.

@andyross
Contributor

This seems to have gone quiet. I'm convinced this is well-characterized and resolvable, but of course it won't be for 1.14. I've opened #14649 to track a firm "must be retry-free" requirement for the next release, and am going to close this one as a 1.14 bug.

@marc-hb
Collaborator

marc-hb commented May 22, 2019

As a short-term workaround, we may want to tag all these tests as "flaky" in testcase.yaml, and build them in a separate CI job which retries them a large number of times.

After a chat with @andyross I'm volunteering to add a new tag emu_time to the tests that fail when and only when qemu or any other overloaded emulator fails to provide a "good time". Tags more specific than just "flaky" will be useful to anyone trying to fix or avoid one of these particular issues.
Please start bikeshedding the name of this tag (or its existence) in 3, 2, 1,...

@andrewboie
Contributor

After a chat with @andyross I'm volunteering to add a new tag emu_time to the tests that fail when and only when qemu or any other overloaded emulator fails to provide a "good time". Tags more specific than just "flaky" will be useful to anyone trying to fix or avoid one of these particular issues.
Please start bikeshedding the name of this tag (or its existence) in 3, 2, 1,...

Fine with me.

Anyone with spare development bandwidth: someday I'd love this fixed once and for all, see #14173

marc-hb added a commit to marc-hb/zephyr that referenced this issue Jun 21, 2019
We already have the info so let's show it. This helps spot intermittent
issues[*], gives an indication of the time --build-only saves, can help
spot an overloaded test system, highlights the most time-consuming tests
which may need a longer timeout in their config, shows the effective
timeout value when one occurs... all this for a dirt cheap screen estate
price and two extra lines of code.

Sample -v output:

32/81 board123  tests/testme              PASSED (qemu 2.049s)
33/81 board456  samples/hello             PASSED (build)
34/81 qemu_x3   tests/kernel.stack.usage  FAILED: timeout (qemu 60.029s)
     see: sanity-out/qemu_x3/tests/kernel.stack.usage/handler.log
35/81 board456  tests/testme              PASSED (build)
36/81 qemu_x5   tests/kernel.queue        FAILED: failed (qemu 2.191s)
     see: sanity-out/qemu_x5/tests/kernel.queue/handler.log

[*] running qemu in heavily packed cloud virtual machines comes to mind,
    also see zephyrproject-rtos#12553, zephyrproject-rtos#14173 etc.

Signed-off-by: Marc Herbert <[email protected]>
nashif pushed a commit that referenced this issue Jun 21, 2019
marc-hb added a commit to marc-hb/zephyr that referenced this issue Jun 27, 2019
Quoting bug "Configure QEMU to run independent of the host clock zephyrproject-rtos#14173"

  We have been struggling for years with issues related to how QEMU
  attempts to synchronize guest timer interrupts with the host clock,
  for example zephyrproject-rtos#12553. The symptom is that heavily loaded sanitycheck
  runs have tests spuriously failing due to timing related issues.

  This creates noise in our CI runs which masks true bugs in our system
  which manifest only intermittently, causing real issues that will
  happen all the time at scale to be 'swept under the rug'; right now
  any time a test fails sanitycheck retries it a few times and only
  consecutive failures produce an error.

There's also a lot of relevant information and more links in:
"List of tests that keep failing sporadically" zephyrproject-rtos#12553

This new "emu_time" tag helps by letting users either select or exclude
the tests that really need accurate time to pass and have a high chance
to actually be impacted by this emulation issue. As an example, it's
only with 'sanitycheck --exclude emu_time' that I could spot and file
intermittent but non-emu_time issue zephyrproject-rtos#16915. As Andrew predicted above,
it was drowned in emu_time noise before that.

Conversely, "--tag emu_time" can be used by developers focusing on
fixing qemu's -icount feature, for instance zephyrproject-rtos#14173 or others.

Even before qemu's -icount is fixed, Continuous Integration could be
split in two separate runs: A. --tag emu_time and B. --exclude
emu_time. Only A tests would be allowed retries which would stop hiding
other, unrelated intermittent issues affecting B tests.

This initial commit does not pretend to exhaustively tag all affected
tests. However it's an already functional and useful start of 14 tests
collected from and tested over weeks of local sanitycheck runs and
_partially_ reviewed by qemu clock expert Andy Ross.

This commit also increases the individual timeout of 7 tests that have
been observed to consistently take much longer than the
median (2-3s). This flags how unusually long these are, lets users
temporarily reduce the very long 60s default timeout in their local
workspace and finally should reduce the chance of them timing out on a
heavily loaded system. Set their timeout to 3-4 times the duration
observed in CI and locally.

Signed-off-by: Marc Herbert <[email protected]>
@pabigot pabigot removed the priority: high High impact/importance bug label Sep 9, 2019
@pabigot
Collaborator

pabigot commented Sep 9, 2019

As we continue to see intermittent failures I'm reopening this in lieu of creating a new one. We can discuss its priority and assignees in the next triage meeting.

I believe that #17155 will help with this, and should be merged before we confirm tests. I would use #18997 to verify that reported tests are still failing even if they seem to pass CI, and update the list of specific issues in the header. #14649 and #14173 may also contribute to resolving this "permanently".

@pabigot pabigot reopened this Sep 9, 2019
@andyross
Contributor

andyross commented Sep 9, 2019

The issues here are fundamental. There's really nothing #17155 can do to fix them, though on the margins it does work to tighten up timeout behavior and might make failures less likely.

Take a look at #18656. I'd really appreciate it if we could make testing with that a requirement for opening a new spurious failure bug. The tooling there runs sanitycheck output in a context where host load is carefully controlled (but still as parallel as possible). Using that, on a host that is fast enough (a 3+ GHz host clock seems OK, 2.5GHz shows occasional failures on mps2_an385 due to simple CPU overhead of ARM userspace emulation) and with a 1000 Hz host clock (i.e. not Ubuntu, which has CONFIG_HZ=250) results in perfectly reliable (up to the 10k or so test iterations I've been able to test in single runs) operation on existing tests.

That's not "proof" that there are no bugs lurking. But if removing host clock interaction on systems where host clock interaction is inevitable works to resolve the issue, by definition we can't be sure that there is a real bug.

Spurious failures on hardware, of course, are real issues that need to be addressed. But the qemu stuff is as good as we're going to get it at this point absent changes to our test tools.

@andyross andyross removed their assignment Sep 9, 2019
@nashif
Member Author

nashif commented Dec 3, 2019

Closing as this was resolved in different places.
