Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

boards: arm: mps2_an385: Enable QEMU icount mode #24811

Merged

Conversation

stephanosio
Copy link
Member

@stephanosio stephanosio commented Apr 29, 2020

This commit enables the QEMU icount mode for `mps2_an385`, in order to
decouple the host clock from the emulated guest clock.

This prevents guest timing instability from causing test failures when
the host CPU load is very high.

The icount `shift` value of 7 was empirically chosen to allow the tests
to complete in both realistic and reasonable amount of time.

The following are quick notes on the parameters used:

* -icount shift=7: Execute one instruction every 128ns of virtual time
* -icount align=off: Do not synchronise the host and guest clocks
* -icount sleep=off: Advance virtual time without sleeping/waiting
* -rtc clock=vm: Isolate the guest RTC time from the host

Signed-off-by: Stephanos Ioannidis <[email protected]>

This fixes the infamous mps2_an385-related "intermittent" CI failures.

I have locally run mps2_an385 sanitycheck to verify that this fixes the current problem -- it passed 10/10 times. Without this fix, it is practically impossible to get the sanitycheck to pass in the first run.

https://gist.github.com/stephanosio/03c38bb39deca2efa96eda6bc68de5b2

Provides a partial fix for #14173.

Copy link
Collaborator

@galak galak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind updating the commit message to explain what the various arguments to -icount mean.

@stephanosio
Copy link
Member Author

Do you mind updating the commit message to explain what the various arguments to -icount mean.

well, one should refer to the QEMU documentation for that ... the semantic of what it does is already explained in the commit message: "decouple the host clock from the emulated guest clock."

This commit enables the QEMU icount mode for `mps2_an385`, in order to
decouple the host clock from the emulated guest clock.

This prevents guest timing instability from causing test failures when
the host CPU load is very high.

The icount `shift` value of 7 was empirically chosen to allow the tests
to complete in both realistic and reasonable amount of time.

The following are quick notes on the parameters used:

* -icount shift=7: Execute one instruction every 128ns of virtual time
* -icount align=off: Do not synchronise the host and guest clocks
* -icount sleep=off: Advance virtual time without sleeping/waiting
* -rtc clock=vm: Isolate the guest RTC time from the host

Signed-off-by: Stephanos Ioannidis <[email protected]>
@andrewboie
Copy link
Contributor

I don't think we need to duplicate what's explained in the QEMU documentation.

@stephanosio
Copy link
Member Author

Added quick notes on the parameters used; for more details, one should reference the QEMU documentation.

@galak
Copy link
Collaborator

galak commented Apr 29, 2020

I don't think we need to duplicate what's explained in the QEMU documentation.

I guess I meant more why align=off,sleep=off

@andrewboie
Copy link
Contributor

I don't think we need to duplicate what's explained in the QEMU documentation.

I guess I meant more why align=off,sleep=off

Gotcha, yeah that's useful.

@galak galak merged commit e7297c1 into zephyrproject-rtos:master Apr 30, 2020
@stephanosio stephanosio deleted the fix_ci_intermittent_failures branch April 30, 2020 03:12
@wentongwu
Copy link
Contributor

just see this PR, but why the shift value is 7?

@stephanosio
Copy link
Member Author

just see this PR, but why the shift value is 7?

@wentongwu the shift value of 5 was making tests run too slow (e.g. 1000ms sleep was taking over 5000ms IIRC) and causing them to time out. 7 turned out to be the closest to the expected time.

@wentongwu
Copy link
Contributor

@stephanosio The virtual time of shift = 5 is the most closet clock compared with the real platform, there will be failed cases when running in heavy loaded host with shift = 7 (e.g. with 10 different sanity-check processes running at the same time on 10 different Zephyr source code folders). But it's true as I said in #14173 comment that it will take long wall time to complete the tests if shift value is low. So I add some changes in sanity-check in #22904 to calculate QEMU's cpu time.

@stephanosio
Copy link
Member Author

The virtual time of shift = 5 is the most closet clock compared with the real platform

Since the guest does not interact with external time sources and there are no external time constraints, the shift value used does not really matter -- we should choose a value that is the most advantageous for testing. Moreover, the "virtual clock" produced by the shift value of 5 is also off by tens of percent from the actual clock (25MHz).

with 10 different sanity-check processes running at the same time on 10 different Zephyr source code folders

I am pretty sure sanitycheck is not meant to be run that way. If you run 10 instances of anything that is supposed to utilise 100% of CPU, that will likely fail from various complications (e.g. timeout).

I have tested sanitycheck with 32 parallel "jobs" (equal to the CPU thread count) and did not see any failures; the CPU load was at 100% for the entire time.

@andrewboie
Copy link
Contributor

@stephanosio did you attempt/run into any issues making the same change to mps2_an521, qemu_cortex_m3, qemu_cortex_m0, or qemu_cortex_a53? Seems like we ought to apply this to all ARM QEMU targets.

@stephanosio
Copy link
Member Author

@stephanosio did you attempt/run into any issues making the same change to mps2_an521, qemu_cortex_m3, qemu_cortex_m0, or qemu_cortex_a53? Seems like we ought to apply this to all ARM QEMU targets.

@andrewboie It does seem that mps2_an521 is another offender.. I will check tomorrow.

@andrewboie
Copy link
Contributor

@stephanosio this issue actually appears to affect, to varying degrees, all QEMU targets. See #14173

@stephanosio
Copy link
Member Author

@stephanosio this issue actually appears to affect, to varying degrees, all QEMU targets. See #14173

@andrewboie sure, I meant another "major" offender :)

I locally ran sanitycheck for mps2_an521 and was able to reproduce the errors in the CI with a quite high failure rate (definitely more than 50%), though not as high as mps2_an385. I will open a PR to fix mps2_an521 since it is another platform with a high failure rate.

As for the rest of ARM platforms, I was thinking @wentongwu's #22904 could address them, but if he does not have a short-term plan to get that PR in, then I could look into them as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: ARM ARM (32-bit) Architecture area: Boards area: QEMU QEMU Emulation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants