-
Notifications
You must be signed in to change notification settings - Fork 6.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tests/kernel/timer/timer_api fails on nrf52_pca10040 board #11722
Comments
@pizi-nordic can you take a look? |
Interesting... We are calling |
@andyross: |
Hm. That code is by design, actually. The intent is indeed that (for an example) if you have a callback that is supposed to have fired at timer T (but is actually being handled at T+3 because the interrupt arrived late), and it sets a timeout of its own to fire in 1 tick, that that tick gets handled synchronously inside the same ISR at wherever "T+2" timeouts are being issued. Doing it in a separate list would defeat that. Will take a look. I have a 10056 board, not 10040, though it seems like they're behaving identically? |
Yes, both 10040 and 10056 uses the same timer. |
Hi @andyross , this behavior may trigger serious problems as module scheduling a timeout may require separation of The problem I see is that even if the callback is called with a delay, it will not know about it as I don't think timeout system should hide from the other modules a fact that delay happened. Collecting expired timeouts to the local list and fireing them after the loop is done would ensure that delay is not hidden as |
I think we have to choose between two modes of the operation:
Also this bug shows me, that the "MIN_DELAY" concept in the RTC timer is not working perfectly, as every call to |
I changed priority to high, as the problem is located in the system timers. |
I would ask a question here - which modules depend on the first approach (i.e. approach where you are interested in average number of calls, not timings). Delayed works can execute periodically but they set new timeout from workqueue thread context (so out of this case). Also in this case you want to be called not sooner then. For waking up threads (stopping sleep, breaking on timeouts, etc.) we also want to wait at least a given time. Plus timeout is set from thread context (so out of this case). There are timers. We have two types of timers - one-shot and periodic. Is there anything else? |
@andyross: Do you have an idea how to solve this? |
Anything involving k_uptime_get(). Ticks are assumed to be regular, otherwise the wall clock time you retrieve for computing delays and human-space timeouts are going to be wrong. I guess I'm not seeing the problem here. We only show a difference between the two schemes in our code when an interrupt has been delayed for a significant fraction of a tick. That's thousands and thousands of cycles, and almost certainly means there's some other realtime promise being broken. But if we really needed to, we could expose the difference in the API somehow. There would be "here is the time your timeout callback thinks it is" and "this is the time the hardware counter actually reflects for the current instant". Drivers and subsystems that care about the difference between "time from REALTIME NOW" and "time from when this callback was scheduled" could then correct on their own. It would be a pretty special purpose API, and sort of hard to document in a way that wouldn't make it into a foot-cannon. But it would be easy to implement. There's an entirely separate issue of "setting a timeout when one is about to expire can unexpectedly push it back one tick because of the MIN_DELAY check". That is fixable (I mean, yeah, don't change the timeout if we have one expiring at the next tick -- should be simple enough) and is a bug right now. I think I should be able to fix this pretty quickly. |
I think that there is significant problem in
Problem: High level effect: I hope you see the problem now. |
Oh oh, I see it now. Actually this has nothing to do with adding timeout entries inside a handler (which really is as intended, I swear[1]). This affects the loop any time there are two or more timeouts scheduled to happen in the same tick. The loop will adjust the notion of "current time" at the end of the loop iteration, meaning the next timeout which "should" have happened at the same time and with the same state gets adjusted incorrectly. Pretty sure I know what to do. Need to find a cleaner way to express that logic than testing at both the top and bottom of the loop though. Should have something ready today. [1] We really do want to adjust times of timeouts added inside the handler if the handler was late. Otherwise periodic timers like k_timer will reflect late timer interrupts and not tick regularly. We're just not doing the adjustment correctly. |
I think the problem is just here. The fact is that due to whatever caused scheduling delay, periodic timer is late. It will be late whatever SW trick is done. We can lie about it as we do now or allow the timer callback to accept the reality. |
I thought about it for a minute and found two examples with slightly different needs.
If we skip missed callbacks but keep correct information about time, the phase change problem could be corrected as state can be derived from system time (i.e. ticks).
If we skip missed callbacks but check time we will know some samples are missed. |
We may just need to table this discussion for now. The cases you mention are clearly realtime applications. This behavior happens only when you miss more than a whole tick due to interrupt delay, which is going to wreck any waveform generator or ADC sampler right out of the gate. The case I'm looking at is someone implementing a "clock" (or clock-like thingy) with k_timer, which is (1) not realtime and (2) clearly "wrong" in one case and "right" in the other. It's also (3) somewhat simpler to implement and (4) actually required by some of our tests right now (which is why the feature got added, though I forget exactly where, and wouldn't cry if we wanted to take that case out). That said:
We can't skip a missed callback ever, a periodic timer (the k_timer API one level up) might do that, I guess, but the low level k_timeout interface is oneshot, once registered you have to call it when it expires, otherwise the user never knows what happened. The question we're arguing about is whether or not the "current tick" seen by the callback (e.g. when setting new timeouts) is (1) the current "real" time or (2) a simulated time that corresponds to when the timer was scheduled. Currently we do 2, for the reasons listed above. |
I see a bit different problem than @pdunaj: Due to the "corrections" timers scheduled in some specific point of time are fired prematurely. As result, the time flow in application is not "steady" - sometimes 5 ticks takes as little as 4 ticks ;-). While delayed handler firing is something we should expect, the premature fire is something we cannot tolerate. I think that might have to move handling periodic timers down to |
Right, the premature timeouts are a bug. Fixing that right now. |
Please try the attached patch, which should fix the bug reported. We'll have to continue arguing about the proper behavior in the face of very late interrupt delivery -- I'm hopeful this can be addressed with clearer documentation and maybe a simple API option somewhere. |
Unfortunately the problem still exists on top of the #12204.
|
Do you have a rig or something that you use to get those measurements? Does it work to hack the test to execute the particular case in a loop? All I'm know to do right now on a nrf52840_pca10056 is to hit the reset button hundreds of times.... |
I am using simple shell one liner ( |
Running it in a loop (by editing the test program) works. As another clue: when the test fails, it fails on the first iteration (of the for loop over Which probably means |
Yes. Feel free to look into #11722 (comment) where I traced this. |
I'm still having zero luck reproducing this on a 10056 board (which of course I left at home post-vacation and am going to have to run back at some point today to continue testing). @pabigot - Yeah, it makes sense that repeated loops might end up aligning the same way on tick boundaries and show the same behavior. If you have the rig ready and have seen the error, can you try stuffing a locked random delay (e.g. "irq_lock(); k_busy_wait(nnn); irq_unlock();" with a randomized delay of between 0 and 1 ticks (should be 0-10000 us on this app unless it's changed the tick rate). Note that I think I stumbled on a plausible issue though -- I managed to confuse the previous fix from @pdunaj (synchronizing set-last-time) with the other issue we were discussing, where a timeout set very close to a timeout expiration would push it forward by a full tick, thus entering the "late interrupt" case (which @pizi-nordic has seen) erroneously. We don't have a fix for that even though my brain thought we did, and we for sure need one. Will do that today and maybe we'll get lucky... |
I think we still haven't fully prised apart all the pieces of this issue and how they relate to the semantics of interval timers. With the reproducing modification described here I only ever see the failure on the first iteration of the loop in The comment above by @pizi-nordic is tempting because it explains the failure as a situation where the delay would be short because the hardware timer was asked to wait for less time than it should, because the delay was reduced by time that had already passed before when the timeout was rescheduled. The That "undesirable extra" reduction could be removed by ensuring newly scheduled delays were not affected by adjustments intended for subsequent expirations. So I did this in pabigot/zephyr:wip/2019010a, splitting the timeout_list so that everything that was due at or before the tick-to-come was removed before any callbacks were invoked, so any new timeouts added during callbacks would not be adjusted. I believe this has an effect similar to changing the interval increment behavior to the alternate choice described in #12247. The timer_api test that fails for this issue passes with flying colors with that change, though several other tests fail. Which suggests that both update models may be required to handle all use cases, or that there's a different error in that solution. |
I'm telling you guys again: the issue here is going to turn out to be a delayed interrupt or callback. It's not an API bug. There's a valid argument to be had about the API, but you can't fix it that way (c.f. the tests you discovered that assume the current behavior -- I bumped into those too during the rewrite, which is why this feature exists at all). The root cause is unexpected timing of the z_clock_announce() call. I promise. Still struggling with seeing a reproduction. I'm guessing at this point it's my programming methodology: I'm using the MBED filesystem interface, and not the SEGGER/J-link/nrfjprog layer. My guess is that this boots the system with a more reliably synchronized clock or something, where the direct stuff might do the programming and reboot synchronously with the clock in an arbitrary state? I'm currently torn trying to figure out whether it's easier to install new tooling or to start doing invasive random waits at the start of the test... |
Got it. See comment in #12204 |
Excellent. I'm leaving my reproducer on top of your PR running overnight, and will take a look tomorrow when I'm fresh. Assuming there isn't anything obvious, I'll sign off on the PR. These points are outside the scope of this issue but remain to be addressed:
@andyross Thanks for tracking this down, and well-done. It was a nasty one. |
No, this is sort of a generic thing with time APIs. "Time" is a real function of the universe and can't be made atomic. Here's POSIX code that exhibits exactly the same bug:
If you play with that enough, you'll eventually hit a situation where the current time in seconds changes between the first call to time() and the entry to sleep(), and the assert will fail. This is exactly what our test code was doing. That's not a bug in the API, it's just how time works. But if you're careful to arrange things by ensuring you start just after a seconds boundary, you (might[1]) can make that assertion always true by controlling the initial conditions:
And that's fundamentally what the patch is doing. Except that we work in units of ticks and not seconds, and have different APIs for "wait until the next tick" than crude spinning. Heh, actually spinning would work too, but I happen to know that tickles qemu bugs on x86. And in fact we could have just thrown in a k_sleep() to do the same thing, but this is a lower level test of the k_timer API and I felt that sleeping would be a layering violation of the test intent. [1] POSIX itself doesn't guarantee that kind of precision of course. On most systems there's still an internal tick granularity like ours. But this would surely work in practice. |
@carlescufi @pizi-nordic I propose that the specific problem covered by this issue has been resolved in #12204 by adding the required synchronization code to make the test pass reliably, so this should be closed. I don't think keeping it open is going to contribute to addressing the other anomalous behavior and functional gaps identified in the discussion here. But it's assigned to you two so I won't presume to close it. |
@pizi-nordic I am closing this since the test itself has been fixed. Additional bugs have their own GitHub issues. Please let me know if you want me to reopen this. |
Describe the bug
The CI run caught
test_timer_periodicity
test case failure in tests/kernel/timer/timer_api on nrf52_pca10040 board. This issue is sporadic and I couldn't reproduce locally.To Reproduce
Steps to reproduce the behavior:
Expected behavior
All test cases should pass without any assertion failure or crash.
Screenshots or console output
Environment (please complete the following information):
Since the failure is not seen locally, I couldn't git bisect to find on what commit the failure was first seen.
The text was updated successfully, but these errors were encountered: