-
Notifications
You must be signed in to change notification settings - Fork 6.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Undesirable behaviour of z_clock_announce() when ticks argument is greater by 1 #12332
Comments
@andyross, @pabigot, @carlescufi: FYI |
@pizi-nordic, this is getting out of hand. Not. A. Bug. I don't know how else to explain it. Existing tests require the behavior as implemented. What you are worrying about is an untested edge case that results from an an API convention issue and happens only in latency crisis situations where interrupts have been delayed more than a tick. It can be fixed via API (and worked around right now), but it requires design and ideas and discussion. I tried to isolate that discussion in #12247. But now it's leaked out as a "bug" and I genuinely don't know what to do. If you have code that actually fails and has no workaround, submit it as a test first and then submit the bug. But for now... I just don't know how else to deal with this. Stop this senseless argument. Please do not resubmit this without an exercising test case (and not a whiteboxed one!) |
We need consensus on whether this is a bug, since @pizi-nordic and I disagree with your assessment. If it is a bug, it may very well require API and other changes to resolve. That's fine, and not a motivation for closing it until we have consensus. |
(And yes, I know you think you understand this better and that I'm missing something important. But without a failing test case it's not a bug, and I'm tired of typing about it) |
Without a failing test case, it's not a bug. Please stop this senseless fight. There's a perfectly good discussion elsewhere. |
I'm reopening this after changing it to an enhancement and changing the description slightly. I asked @pizi-nordic to open this as an issue because it had previously been discussed in comments to #11722, #12204, and #12247. At least two collaborators involved in those discussions believe the behavior of the current system does not meet our needs in cases of the sort described here, and that working around the existing behavior is not easy, if it is possible at all. The concern raised is possibly but not necessarily related to #12247; for the #11722/#12204 it was unrelated to what turned out to be the real issue, but a result of the same code. Putting it in its own issue allows us to remember to address it if and when the timer architecture is revisited, and make it visible to others who might have an opinion. I understand there is disagreement on whether the behavior is a bug. I hope changing the issue to an enhancement is acceptable to everybody. If not, we have a triage meeting once a week where disagreements on categorization and validity of issues can be raised, discussed, and resolved in a less abrupt manner. |
I've been doing some work on a path toward alarms that use the system clock rather than the tick clock. In this context it is almost certain that the corresponding clock increment will be greater than one when carving off ready-to-run timeouts. We don't have a test case that demonstrates the expected behavior described in this issue, but the adjustment to a newly added alarm can be eliminated without introducing a failure in any existing test (i.e., delays for rescheduled alarms are still relative to the original deadline). Code to do so can be seen on pabigot/zephyr:wip/20190106a. This of course isn't an argument that the change should be made, but does show making the change is not difficult if it does appear to be desirable. [Edit: Actually, no, the work-in-progress basically switches the implementation to #12247's reschedule management policy (b): delay timers reschedule with a delay. There's a second patch that's intended to restore latency-ignoring behavior within the current API, but it fails eight tests.] |
@andyross: The test you want, the test you have :) This test validates only one thing: It fails if timer expired too early. On my board the result is following (it looks that the z_clock_announce() issue is covered by something else):
|
@pizi-nordic As an equal-opportunity complainer I have to say that test doesn't do what you think. 2ms of that 3ms lateness is due to the time it takes to printk the timer1 expiration within the callback. If the printk is removed it's 49.83 ms (1633 32 KiHz ticks), which is still 6 ticks too early to reach 50 ms. Also the fact the check always fails on the first call suggests that either the test isn't correct or the timer system really is broken and a 50 ms timeout will normally complete in less than 50 ms. Adding the lets-synchronize-to-the-tick dance before timer1 is started doesn't make it succeed either. So I remain convinced timers in Zephyr are weird, but not that we've reproduced the theoretical problem. |
That'd be it. A tick is 327 increments of the 32 KiHz clock. So 1 s as measured by Zephyr's tick clock on Nordic hardware is 997.92 ms in the real world. (So 50 "Zephyr-ms" is 49.896 ms). |
Does the test fix itself with TICKLESS_KERNEL=n? That's the easiest way to rule out whether or not we're looking at the multi-tick expiry issue. |
No,
where the displayed 32700 is the difference between the The test is unlikely to produce a multi-tick expiry issue; to do so there'd have to be an interrupt-blocking delay in excess of something like 20 ms somewhere. |
Right right, what I meant is that if this test is exercising the bug that it thinks then it will always pass with tickless disabled, becasue all ticks come one at a time and the ambiguity we're discussing can't exist. I buy that there may be precision bugs too. |
That run I displayed is from |
Then I'm confused. You can't hit the situation @pizi-nordic is complaining about unless z_clock_announce() is called with a ticks argument of 2 or greater. The issue is whether or not it is correct to decrement timeouts that were "just added" to the list by another timeout handler. If the argument was one tick, then announce_remaining must have been decremented to zero inside the loop during the execution of the handler, and the bug won't happen. This is isomorphic to our old ticked behavior, thus it cannot happen if TICKLESS_KERNEL is unset. If it is happening, then the problem is somewhere else. You can verify with some whiteboxed asserts with z_clock_announce() if you like to see that the decrement being called out above never happens. |
Yes. What I've been trying to say all along is that this test is faulty: it never hits a condition where multiple ticks are being announced. It fails because of accumulated error in measuring the passage of time, not because the deadline was reduced by unaccounted ticks. |
I think we should. There are other reasons why the existing timer infrastructure doesn't meet my requirements, including that the API only supports 1 ms resolution. For low-power devices I should be able to sleep for periods significantly less than 1 ms, and I have no platform-independent way to do that. |
FWIW: per a not-as-contentious-as-it-might-have-been call with most of the stakeholders just now, there's an emerging consensus for a fix that might make everyone happy. It works at the level of the k_timer_start() API (which is the higher level API, not the timeout internals). The idea is the call takes two timeout values, a "duration" argument (IMHO poorly named) that indicates how long after the current time the first callback should arrive, and a "period" argument that indicates how long should pass between subsequent callbacks. So we should make "duration" specify a "realtime" value that is based on current real time irrespective of what the current state of the timeout system (i.e. whether or not there is a currently-active multi-tick z_clock_announce() call in progress). But the "period" time will honor the intent of the caller to have a steady beat over time by adjusting its scheduling to account for late/missed interrupts. That will be mildly hard to document, but probably not awful (realistically most users aren't going to care about the distinction). And it shouldn't be hard to implement. @pabigot has WIP code to be submitted that does something like this, I believe. My only worry is that I'll wake up tomorrow and realize this scheme misses another edge case or has compatibility issues, though. I'll think on it. |
Sorry, thought I had posted this but didn't. To repeat what was said in the call:
That's really a bug in the timer drivers; I got too cute reusing the ISR implementations between tickless/non-tickless. The ticks argument to the z_clock_announce() call should always be 1 if tickless is not enabled, because that matches the behavior of true non-tickless systems. Really the only value to disabling tickless on these drivers is code size. Given that we've never been able to detect of handle late interrupts, there's no reason to start now. We don't want to be doing that math if it isn't a requirement. |
Note that my solution does not change the API at all; it just clarifies how the existing arguments are interpreted (a documentation-only change). The behavioral modification is done internal through coordination between k_timer and timeout. If that's not what people expect, please let's wait until the PR is submitted and hash it out there.
I remain skeptical that it's that simple, but that can be hashed out in a separate issue. |
I think you chosen good direction making timer drivers common. If we implement the
|
Honestly I don't care whether this really gets fixed or not. The point is that the current state represents a regression from the way timekeeping was done prior to the rewrite, which honored single ticks as the sole reference for timekeeping. With the current architecture, the only reason you'd ever choose to disable TICKLESS_KERNEL is for code size reasons. And if that's the case, then you almost certainly don't care about theoretical behavior and just want the way Zephyr (and basically everything) has worked since the invention of timer interrupts in the 1960's. |
@andyross @pabigot @pizi-nordic The proposal to solve the now months-old debate about the kernel vs real time that was discussed has been described here:
I just want to make sure that, regardless of other considerations, this is still considered a good solution for all the people involved here. Can we get an acknowledgement in that direction? |
I think that the proposed solution will fix |
tl;dr: the proposal is still a good solution; it does not address the general question. I have many concerns with "kernel time" and "real time" are poorly defined. We know from #12409 that there is a non-linear relation between the system clock ( At this time I'm considering prototyping a |
@pabigot: I fully support removal of tick abstraction and sticking to (optionally prescaled) hardware cycles. The same was suggested months ago. However I am not sure if we should implement another subsystem aside from existing timeout management (I assume you would like to have a clean start :)). In my opinion such direction will consume much more effort. |
The existing loop over timeouts in z_clock_announce adjusts the delay of the first timeout that has not been expired by the number of ticks remaining. This adjustment is (by some interpretations) incorrect when the timeout was added while processing a previous timeout and the adjustment is not zero, which can happen if multiple ticks are announced in one call. Resolve this by changing how the tick clock is managed: * curr_ticks/k_uptime_get() is adjusted as soon as the z_clock_announce() is invoked. This means that timer callbacks are able to observe the best available estimate of the current time, rather than the time at which each callback was supposed to be invoked. The new policy also ensures that k_uptime_get() is consistent with k_cycle_get(). * Prior to invoking any callbacks the sequence of expired timeouts is removed from the timeout queue, ensuring new timeouts are scheduled without the risk of inadvertently reducing their delay by unprocessed ticks. * Prior to a callback the timeout dticks field is updated to be relative to the current time, i.e. a non-positive value corresponding to how late the timeout was invoked. * The tick offset used when scheduling a recurring timeout is adjusted by the lateness of the invocation, ensuring that the expected callback times are at a fixed interval even if individual callbacks are late. * The dticks field is zeroed when a timer is initialized, aborted, started, or stopped, ensuring that non-recurring timeouts respect the offset passed as the initial duration. The new policy also enables a future enhancement where each k_timer can be configured to select between policies where repeating timeouts should be relative to the scheduled time of the last callback (historical behavior) or the actual time of the last callback (useful for certain cases). Closes zephyrproject-rtos#12332. Signed-off-by: Peter A. Bigot <[email protected]>
This verifies that when a second timer is scheduled within a delayed timer callback the lateness of the callback (as reflected by unprocessed announced ticks) does not impact the timeout of the newly scheduled timer. Relates to issue zephyrproject-rtos#12332. Signed-off-by: Peter A. Bigot <[email protected]>
Hi @nordic-krch, This issue, marked as an Enhancement, was opened a while ago and did not get any traction. Please confirm the issue is correctly assigned and re-assign it otherwise. Please take a moment to review if the issue is still relevant to the project. If it is, please provide feedback and direction on how to move forward. If it is not, has already been addressed, is a duplicate, or is no longer relevant, please close it with a short comment explaining the reason. @pizi-nordic you are also encouraged to help moving this issue forward by providing additional information and confirming this request/issue is still relevant to you. Thanks! |
The
z_clock_announce()
unconditionally adjustdticks
of first timeout in the list:However this adjustment is not always valid:
If the adjusted timeout was on the list before call to
z_clock_announce()
, we have to do such correction in order to not delay next timeout.If the adjusted timeout belongs to periodic k_timer which was handled during current
z_clock_announce()
execution, we should do such correction if we would like to not accumulate error.If the adjusted timeout was added during current
z_clock_announce()
execution but it is not related periodic timer which was handled, we have to skip the adjustment. If we perform it, the newly scheduled timeout will fire earlier than expected.This issue has been initially mentioned in #11722.
Affected use case is presented here: #12247 (comment)
The text was updated successfully, but these errors were encountered: