Configure QEMU to run independent of the host clock #14173
Comments
I did some sanitycheck runs with "-icount shift=4,align=off,sleep=off" on qemu_x86 and mps2_an385; only tests/kernel/critical/kernel_critical failed, on mps2_an385. Running that case (tests/kernel/critical/kernel_critical) on mps2_an385 by itself, it does pass, but it takes a long time to execute, which is why it shows up as a timeout in the sanitycheck runs. So icount mode produces deterministic output, but it is not cycle accurate.
@wentongwu thanks for the update :)
I remember tests/kernel/critical takes a while even without icount, so perhaps we need to change the test to not take so much time.
I ran several rounds of stress tests and found more failures/timeouts; most of them are related to timer accuracy. I need to do more debugging and code review.
This test is already flaky, but becomes even flakier when coverage is enabled. Disable until we put a stake through the QEMU timing issues being worked on in #14173. Signed-off-by: Andrew Boie <[email protected]>
With the patch below, I did some sanitychecks again with "-icount shift=4,align=off,sleep=off -rtc clock=vm" on qemu_x86 and mps2_an385; so far no errors have occurred. Still reviewing the QEMU code...
diff --git a/tests/kernel/critical/src/main.c b/tests/kernel/critical/src/main.c
-#define NUM_MILLISECONDS 5000
static u32_t critical_var;
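For reference, a minimal sketch of what these options look like on a QEMU command line; the machine, CPU and ELF path below are illustrative (modelled on the mps2_an385 target), not copied from the actual board.cmake:

```sh
# Illustrative invocation with the icount settings discussed above.
# -icount shift=N makes the virtual CPU execute one instruction every 2^N ns of
# virtual time (shift=4 -> 16 ns per instruction); align=off stops QEMU from
# throttling the virtual clock to stay aligned with host time; sleep=off makes
# QEMU jump virtual time forward instead of sleeping while the guest is idle;
# -rtc clock=vm drives the RTC from the virtual clock rather than the host clock.
qemu-system-arm -machine mps2-an385 -cpu cortex-m3 -nographic \
    -icount shift=4,align=off,sleep=off \
    -rtc clock=vm \
    -kernel build/zephyr/zephyr.elf
```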
With the patch below in every code path, four processes run `for i in {1..10000}; do ./scripts/sanitycheck`
Five minutes is a lot for a default. How many / which tests start taking longer than 60 seconds with the icount change?
Can you give some detail on how these parameters were selected? What do you observe for different shift values, and why did you settle on 4?
We already have the info so let's show it. This helps spot intermittent issues[*], gives an indication of the time --build-only saves, can help spot an overloaded test system, highlights the most time-consuming tests which may need a longer timeout in their config, shows the effective timeout value when one occurs... all this for a dirt cheap screen estate price and two extra lines of code. Sample -v output:
32/81 board123 tests/testme PASSED (qemu 2.049s)
33/81 board456 samples/hello PASSED (build)
34/81 qemu_x3 tests/kernel.stack.usage FAILED: timeout (qemu 60.029s) see: sanity-out/qemu_x3/tests/kernel.stack.usage/handler.log
35/81 board456 tests/testme PASSED (build)
36/81 qemu_x5 tests/kernel.queue FAILED: failed (qemu 2.191s) see: sanity-out/qemu_x5/tests/kernel.queue/handler.log
[*] running qemu in heavily packed cloud virtual machines comes to mind, also see #12553, #14173 etc. Signed-off-by: Marc Herbert <[email protected]>
Quoting bug "Configure QEMU to run independent of the host clock zephyrproject-rtos#14173" We have been struggling for years with issues related to how QEMU attempts to synchronize guest timer interrupts with the host clock, for example zephyrproject-rtos#12553. The symptom is that heavily loaded sanitycheck runs have tests spuriously failing due to timing related issues. This creates noise in our CI runs which masks true bugs in our system which manifest only intermittently, causing real issues that will happen all the time at scale to be 'swept under the rug'; right now any time a test fails sanitycheck retries it a few times and only consecutive failures produce an error. There's also a lot of relevant information and more links in: "List of tests that keep failing sporadically" zephyrproject-rtos#12553 This new "emu_time" tag helps by letting users either select or exclude the tests that really need accurate time to pass and have a high chance to actually be impacted by this emulation issue. As an example, it's only with 'sanitycheck --exclude emu_time' that I could spot and file intermittent but non-emu_time issue zephyrproject-rtos#16915. As Andrew predicted above, it was drown in emu_time noise before that. Conversely, "--tag emu_time" can be used by developers focusing on fixing qemu's -icount feature, for instance zephyrproject-rtos#14173 or others. Even before qemu's -icount is fixed, Continuous Integration could be split in two separate runs: A. --tag emu_time and B. --exclude emu_time. Only A tests would be allowed retries which would stop hiding other, unrelated intermittent issues affecting B tests. This initial commit does not pretend to exhaustively tag all affected tests. However it's an already functional and useful start of 14 tests collected from and tested over weeks of local sanitycheck runs and _partially_ reviewed by qemu clock expert Andy Ross. This commit also increases the individual timeout of 7 tests that have been observed to consistently take much longer than the median (2-3s). This flags how unusually long these are, lets users temporarily reduce the very long 60s default timeout in their local workspace and finally should reduce the chance of them timing out on a heavily loaded system. Set their timeout to 3-4 times the duration observed in CI and locally. Signed-off-by: Marc Herbert <[email protected]>
@andrewboie if 10 PRs are submitted at the same time, what is the CI's behavior? Does it run Shippable for each PR one by one, or for all the PRs at the same time? Thanks
@wentongwu it runs it on the last patch in the series.
I mean if ten different developers submit their own PRs at the same time, will CI schedule them one by one or run them concurrently? If concurrently, icount mode can't guarantee the completion time of a test case, because first it takes longer than normal mode, and second more QEMU processes on one machine leave less CPU time for each QEMU process.
I see what you mean. The build slaves are supposed to be load balanced, with the number of CPUs selected for the job matching the hardware it is running on. Having said that, perhaps we need more sophisticated accounting of how long tests are taking to complete, to determine whether they have gotten stuck somewhere. Are you at the point where icount is working but we just need longer test timeouts?
Quick testing suggests most of the other QEMU platforms can also work, with a few timeout or failed cases still to pick up.
This commit enables the QEMU icount emulation mode for improved timing stability. In normal emulation mode (without icount), the emulation timing of the TTC system timer is particularly unstable and this results in a high CI failure rate. For more details, refer to the issues #14173 and #22904. Signed-off-by: Stephanos Ioannidis <[email protected]>
@andrewboie Are you going to develop a test case for that QEMU configuration?
The current manifestation of the lack of this enhancement is the following:
I found and fixed two issues that were exposed by enabling icount on x86 targets. After #24879 is merged, I have found icount works well on 32-bit x86, and I think we should enable it. I also tested x86_64, which unfortunately has other issues. With icount turned on and the above fix applied, I still see:
This problem should be: with icount mode enabled, although QEMU is configured with multiple CPUs, there is only one vCPU thread to handle guest code. For the case
These three issues are all related to there being only one vCPU when icount is enabled, while at run time CONFIG_MP_NUM_CPUS=2.
this case requires
but we only have one core with icount enabled.
I see only the above 5 failed cases with icount enabled on the qemu_x86_64 platform; if CONFIG_MP_NUM_CPUS=1 is configured for these 5 cases, qemu_x86_64 can work well with
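A minimal sketch of how one of those cases could be rebuilt locally with a single CPU to check this; CONFIG_MP_NUM_CPUS is the symbol discussed above, but the test path used here is only illustrative, not one of the five cases in question:

```sh
# Hypothetical local check: force a single vCPU on qemu_x86_64 and run the
# image under QEMU. The test path is an example, not one of the failing five.
west build -p -b qemu_x86_64 tests/kernel/smp -- -DCONFIG_MP_NUM_CPUS=1
west build -t run
```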
OK. I think we should open a separate issue specifically for x86_64 SMP. We may need a completely different emulator to do this right, or this may have to be a documented limitation. Let's reduce the scope of this ticket to non-SMP targets.
The solution for ARC should be in #26646.
We have been struggling for years with issues related to how QEMU attempts to synchronize guest timer interrupts with the host clock, for example #12553. The symptom is that heavily loaded sanitycheck runs have tests spuriously failing due to timing related issues.
This creates noise in our CI runs which masks true bugs in our system that manifest only intermittently, causing real issues that will happen all the time at scale to be 'swept under the rug'. Right now, any time a test fails, sanitycheck retries it a few times and only consecutive failures produce an error.
QEMU does nominally support some options to decouple the guest from the host clock:
Unfortunately to date nobody has ever gotten this working.
The objective is as follows: for all architectures that use QEMU for sanitycheck runs (currently x86, x86_64, arm, riscv32, xtensa, nios2) configure QEMU to use icount.
Beware that there may be bugs in QEMU which prevent this from working, and whoever works on this should be prepared to dive into the QEMU source and/or work with QEMU upstream.
Also it's worth noting that moving ANY arch to use icount is progress, so even if we can get only some but not all arches working with this, that is worth doing. At this time I recommend x86 and ARM (the qemu_x86 and mps2_an385 targets) as priority. But eventually we need them all to finally get rid of these spurious sanitycheck failures.
If we manage to get this working for all arches, then we can remove the logic in CI which keeps retrying failed tests.
Known build targets that use QEMU emulation; each tickbox is a separate board.cmake that sets QEMU_FLAGS_${ARCH}:
Check these tickboxes as patches are merged (not submitted) to indicate that they are now using icount.
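As a hedged pointer for anyone picking up a tickbox, the board.cmake files in question can be located like this (assuming a standard Zephyr tree; the exact flags each file sets are per-architecture):

```sh
# Find every board.cmake that sets QEMU_FLAGS_${ARCH}, i.e. the files the
# tickboxes above refer to. Run from the zephyr repository root.
grep -rl 'QEMU_FLAGS_${ARCH}' boards --include=board.cmake
```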