tests/kernel/sched/schedule_api fails on nrf5* boards #11721
Comments
@inakypg @nashif @kumarvikash1: JFYI
@pizi-nordic can you take a look?
It seems that this failure is caused by the same issue that triggered #11722. I was not able to reproduce it on unmodified code; however, after altering the FYI: @carlescufi, @pdunaj, @andyross
On recent commit a7afdc3, the test case test_slice_scheduling fails with an assertion failure. Execution log:
This was working as of the most recent timer work. This is still on nrf52840_pca100xx only? Any bisection info for when it was last working in the validation tree?
@andyross: git bisect shows that the below commit is the culprit.
0cc362f confirmed to be the culprit
This test also fails, though very seldom, on native_posix, so I do not think this is a platform-specific issue. Neither valgrind nor address sanitizer detects the problem.
It seems this issue is caused by a timer delay on nrf5* boards, which triggers time-slice scheduling between threads of the same priority. Can we check the timer delay from the platform perspective? I do see a difference in how the system tick is produced on nrf5* boards compared to other Arm boards. Normal Arm boards use the SysTick interrupt to calculate the system tick, while nrf5* boards use the RTC with IRQ number NRF5_IRQ_RTC1_IRQn = 17, which may have a slightly lower priority than on other Arm boards. From the code I also see that nrf5* boards have some other interrupts (like NRF5_IRQ_POWER_CLOCK_IRQn and NRF5_IRQ_TEMP_IRQn) with higher priority than the RTC's.
@wentongwu: Actually it's simpler than that. The patch @spoorthik and @nashif bisected to was just an emulator workaround. The original code would spin waiting for a specified number of ticks to pass, which was tickling bugs on x86_64, so I replaced it with a simpler busy wait, which worked everywhere except on nRF. The difference in behavior is tick alignment: the old code always exited the wait loop immediately after a tick had expired, whereas the new one waits exactly N milliseconds[1] and is likely to finish anywhere between two tick boundaries. Later code then checks time deltas in ticks and sees a +/- 1 tick aliasing issue where the aligned code does not. nRF tends to hit these issues more easily because its underlying clock counter runs at a very different rate than on other platforms; it makes a great bug magnet for timer stuff. In this case I think the right solution is to revert the patch (I just checked and indeed a simple reversion fixes the issue) and come up with a more targeted workaround for x86_64. We've actually swapped qemu versions in CI since this was added; it's possible the bug isn't visible any more...
[1] "Exact" to the precision of the underlying timer, anyway.
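The +/- 1 tick aliasing described above can be illustrated with a small host-side sketch. This is not the test's code: `ticks_at` and `ticks_elapsed` are hypothetical helpers, and the 10 ms tick and 25 ms window are illustrative numbers chosen so the window is not a whole number of ticks. It only shows that the same wait, measured in ticks, gives a different count depending on where inside a tick it starts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: tick count a timer would report at time t_us. */
static uint64_t ticks_at(uint64_t t_us, uint64_t tick_us)
{
    assert(tick_us > 0);
    return t_us / tick_us;
}

/*
 * Ticks observed across a wait of wait_us that begins at start_us.
 * The result depends on the phase of start_us within a tick.
 */
static uint64_t ticks_elapsed(uint64_t start_us, uint64_t wait_us,
                              uint64_t tick_us)
{
    return ticks_at(start_us + wait_us, tick_us) -
           ticks_at(start_us, tick_us);
}
```

With a 10000 us tick and a 25000 us wait, a start aligned to a tick boundary (`start_us = 0`) observes 2 ticks, while a start 6000 us into a tick observes 3.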
Commit 0cc362f ("tests/kernel: Simplify timer spinning") was added to work around a qemu bug with dropped interrupts on x86_64. But it turns out that the tick alignment that the original implementation provided (fundamentally, it spins waiting on the timer driver to report tick changes) was needed for correct operation on nRF52. This effectively reverts that commit (and refactors all the spinning into a single utility) and replaces it with a workaround targeted to qemu on x86_64 only. Fixes zephyrproject-rtos#11721 Signed-off-by: Andy Ross <[email protected]>
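The idea of spinning until the timer reports a tick change can be sketched against a simulated clock. This is an assumption-laden illustration, not the utility the commit adds: `struct sim_clock`, `sim_read_ticks`, and `spin_to_next_tick` are hypothetical names, and the simulated clock simply advances by 1 us on every read.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simulated clock: every read advances time by 1 us. */
struct sim_clock {
    uint64_t now_us;   /* current simulated time */
    uint64_t tick_us;  /* length of one tick */
};

static uint64_t sim_read_ticks(struct sim_clock *c)
{
    assert(c->tick_us > 0);
    c->now_us += 1;    /* time passes while we poll */
    return c->now_us / c->tick_us;
}

/*
 * Spin until the reported tick count changes, so the caller resumes
 * immediately after a tick boundary rather than somewhere mid-tick.
 */
static void spin_to_next_tick(struct sim_clock *c)
{
    uint64_t start = sim_read_ticks(c);

    while (sim_read_ticks(c) == start) {
        /* busy-wait */
    }
}
```

Starting 6000 us into a 10000 us tick, the spin returns with the simulated time on the next tick boundary, which is the alignment property the commit message says the nRF52 tests rely on.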
Describe the bug
The test case test_slice_reset in tests/kernel/sched/schedule_api failed on a CI run but could not be reproduced locally.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
All test cases should pass without any assert failure.
Screenshots or console output
Environment (please complete the following information):
Since the failure is not seen locally, I couldn't run git bisect to find the commit on which the failure was first seen.