Configure QEMU to run independent of the host clock #14173
Comments
I did some sanitycheck runs with "-icount shift=4,align=off,sleep=off" on qemu_x86 and mps2_an385; only tests/kernel/critical/kernel_critical failed, on mps2_an385. Running that case (tests/kernel/critical/kernel_critical) on mps2_an385 by itself, it does pass, but it takes a long time to execute, which is why it shows up as a timeout in the sanitycheck runs. So icount mode produces deterministic output, but it is not cycle accurate.
@wentongwu thanks for the update :)
I remember tests/kernel/critical takes a while even without icount, so perhaps we need to change the test to not take so much time.
I ran several rounds of stress tests and found more failures/timeouts; most of them are related to timer accuracy. I need to do more debugging and code review.
This test is already flaky, but becomes even flakier when coverage is enabled. Disable until we put a stake through the QEMU timing issues being worked on in #14173. Signed-off-by: Andrew Boie <[email protected]>
With the patch below, I did some sanitychecks again with "-icount shift=4,align=off,sleep=off -rtc clock=vm" on qemu_x86 and mps2_an385; so far no errors have occurred. Still reviewing the QEMU code...
diff --git a/tests/kernel/critical/src/main.c b/tests/kernel/critical/src/main.c
-#define NUM_MILLISECONDS 5000
static u32_t critical_var;
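For reference, a minimal sketch of what these options look like on a QEMU command line; the machine, CPU and ELF path below are illustrative (modelled on the mps2_an385 target), not copied from the actual board.cmake:

```sh
# Illustrative invocation with the icount settings discussed above.
# -icount shift=N makes the virtual CPU execute one instruction every 2^N ns of
# virtual time (shift=4 -> 16 ns per instruction); align=off stops QEMU from
# throttling the virtual clock to stay aligned with host time; sleep=off makes
# QEMU jump virtual time forward instead of sleeping while the guest is idle;
# -rtc clock=vm drives the RTC from the virtual clock rather than the host clock.
qemu-system-arm -machine mps2-an385 -cpu cortex-m3 -nographic \
    -icount shift=4,align=off,sleep=off \
    -rtc clock=vm \
    -kernel build/zephyr/zephyr.elf
```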
With the patch below in every code path, four processes run `for i in {1..10000}; do ./scripts/sanitycheck`
Five minutes is a lot for a default. How many / which tests start taking longer than 60 seconds with the icount change?
Can you give some detail on how these parameters were selected? What do you observe for different shift values, and why did you settle on 4?
We already have the info so let's show it. This helps spot intermittent issues[*], gives an indication of the time --build-only saves, can help spot an overloaded test system, highlights the most time-consuming tests which may need a longer timeout in their config, shows the effective timeout value when one occurs... all this for a dirt cheap screen estate price and two extra lines of code. Sample -v output:
32/81 board123 tests/testme PASSED (qemu 2.049s)
33/81 board456 samples/hello PASSED (build)
34/81 qemu_x3 tests/kernel.stack.usage FAILED: timeout (qemu 60.029s) see: sanity-out/qemu_x3/tests/kernel.stack.usage/handler.log
35/81 board456 tests/testme PASSED (build)
36/81 qemu_x5 tests/kernel.queue FAILED: failed (qemu 2.191s) see: sanity-out/qemu_x5/tests/kernel.queue/handler.log
[*] running qemu in heavily packed cloud virtual machines comes to mind, also see #12553, #14173 etc. Signed-off-by: Marc Herbert <[email protected]>
Quoting bug "Configure QEMU to run independent of the host clock zephyrproject-rtos#14173" We have been struggling for years with issues related to how QEMU attempts to synchronize guest timer interrupts with the host clock, for example zephyrproject-rtos#12553. The symptom is that heavily loaded sanitycheck runs have tests spuriously failing due to timing related issues. This creates noise in our CI runs which masks true bugs in our system which manifest only intermittently, causing real issues that will happen all the time at scale to be 'swept under the rug'; right now any time a test fails sanitycheck retries it a few times and only consecutive failures produce an error. There's also a lot of relevant information and more links in: "List of tests that keep failing sporadically" zephyrproject-rtos#12553 This new "emu_time" tag helps by letting users either select or exclude the tests that really need accurate time to pass and have a high chance to actually be impacted by this emulation issue. As an example, it's only with 'sanitycheck --exclude emu_time' that I could spot and file intermittent but non-emu_time issue zephyrproject-rtos#16915. As Andrew predicted above, it was drown in emu_time noise before that. Conversely, "--tag emu_time" can be used by developers focusing on fixing qemu's -icount feature, for instance zephyrproject-rtos#14173 or others. Even before qemu's -icount is fixed, Continuous Integration could be split in two separate runs: A. --tag emu_time and B. --exclude emu_time. Only A tests would be allowed retries which would stop hiding other, unrelated intermittent issues affecting B tests. This initial commit does not pretend to exhaustively tag all affected tests. However it's an already functional and useful start of 14 tests collected from and tested over weeks of local sanitycheck runs and _partially_ reviewed by qemu clock expert Andy Ross. This commit also increases the individual timeout of 7 tests that have been observed to consistently take much longer than the median (2-3s). This flags how unusually long these are, lets users temporarily reduce the very long 60s default timeout in their local workspace and finally should reduce the chance of them timing out on a heavily loaded system. Set their timeout to 3-4 times the duration observed in CI and locally. Signed-off-by: Marc Herbert <[email protected]>
@andrewboie if 10 PRs are submitted at the same time, what is the CI's behavior? Does it run Shippable for each PR one by one, or for all the PRs at the same time? Thanks
@wentongwu it runs it on the last patch in the series.
I mean if ten different developers submit their own PRs at the same time, will CI schedule them one by one or run them concurrently? If concurrently, icount mode can't guarantee the completion time of a test case, because first it takes longer than normal mode, and second more QEMU processes on one machine leave less CPU time for each QEMU process.
I see what you mean. The build slaves are supposed to be load balanced, with the number of CPUs selected for the job matching the hardware it is running on. Having said that, perhaps we need more sophisticated accounting of how long tests are taking to complete, to determine whether they have gotten stuck somewhere. Are you at the point where icount is working but we just need longer test timeouts?
Quick testing suggests most of the other QEMU platforms can also work, with a few timeout or failed cases still to pick up.
This commit enables the QEMU icount emulation mode for improved timing stability. In normal emulation mode (without icount), the emulation timing of the TTC system timer is particularly unstable and this results in a high CI failure rate. For more details, refer to the issues #14173 and #22904. Signed-off-by: Stephanos Ioannidis <[email protected]>
@andrewboie Are you going to develop a test case for that QEMU configuration?
The current manifestation of the lack of this enhancement is the following:
I found and fixed two issues that were exposed by enabling icount on x86 targets. After #24879 is merged, I have found icount works well on 32-bit x86, and I think we should enable it. I also tested x86_64, which unfortunately has other issues. With icount turned on and the above fix applied, I still see:
This problem should be: with icount mode enabled, although QEMU is configured with multiple CPUs, there is only one vCPU thread to handle guest code. For the case
These three issues are all related to there being only one vCPU when icount is enabled, while at run time CONFIG_MP_NUM_CPUS=2.
this case requires
but we only have one core with icount enabled.
I see only the above 5 failed cases with icount enabled on the qemu_x86_64 platform; if CONFIG_MP_NUM_CPUS=1 is configured for these 5 cases, qemu_x86_64 can work well with
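A minimal sketch of how one of those cases could be rebuilt locally with a single CPU to check this; CONFIG_MP_NUM_CPUS is the symbol discussed above, but the test path used here is only illustrative, not one of the five cases in question:

```sh
# Hypothetical local check: force a single vCPU on qemu_x86_64 and run the
# image under QEMU. The test path is an example, not one of the failing five.
west build -p -b qemu_x86_64 tests/kernel/smp -- -DCONFIG_MP_NUM_CPUS=1
west build -t run
```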
OK. I think we should open a separate issue specifically for x86_64 SMP. We may need a completely different emulator to do this right, or this may have to be a documented limitation. Let's reduce the scope of this ticket to non-SMP targets.
The solution for ARC should be in #26646.
We have been struggling for years with issues related to how QEMU attempts to synchronize guest timer interrupts with the host clock, for example #12553. The symptom is that heavily loaded sanitycheck runs have tests spuriously failing due to timing related issues.
This creates noise in our CI runs which masks true bugs in our system that manifest only intermittently, causing real issues that will happen all the time at scale to be 'swept under the rug'. Right now, any time a test fails, sanitycheck retries it a few times and only consecutive failures produce an error.
QEMU does nominally support some options to decouple the guest from the host clock:
Unfortunately to date nobody has ever gotten this working.
The objective is as follows: for all architectures that use QEMU for sanitycheck runs (currently x86, x86_64, arm, riscv32, xtensa, nios2) configure QEMU to use icount.
Beware that there may be bugs in QEMU which prevent this from working, and whoever works on this should be prepared to dive into the QEMU source and/or work with QEMU upstream.
Also it's worth noting that moving ANY arch to use icount is progress, so even if we can get only some but not all arches working with this, that is worth doing. At this time I recommend x86 and ARM (the qemu_x86 and mps2_an385 targets) as priority. But eventually we need them all to finally get rid of these spurious sanitycheck failures.
If we manage to get this working for all arches, then we can remove the logic in CI which keeps retrying failed tests.
Known build targets that use QEMU emulation; each tickbox is a separate board.cmake that sets QEMU_FLAGS_${ARCH}:
Check these tickboxes as patches are merged (not submitted) to indicate that they are now using icount.
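As a hedged pointer for anyone picking up a tickbox, the board.cmake files in question can be located like this (assuming a standard Zephyr tree; the exact flags each file sets are per-architecture):

```sh
# Find every board.cmake that sets QEMU_FLAGS_${ARCH}, i.e. the files the
# tickboxes above refer to. Run from the zephyr repository root.
grep -rl 'QEMU_FLAGS_${ARCH}' boards --include=board.cmake
```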