
Undesirable behaviour of z_clock_announce() when ticks argument is greater than 1 #12332

Open
pizi-nordic opened this issue Jan 4, 2019 · 38 comments
Labels: area: Kernel · area: Timer · Enhancement

@pizi-nordic
Collaborator

pizi-nordic commented Jan 4, 2019

z_clock_announce() unconditionally adjusts the dticks of the first timeout in the list:

if (first() != NULL) {
	first()->dticks -= announce_remaining;
}

However, this adjustment is not always valid:

  1. If the adjusted timeout was on the list before the call to z_clock_announce(), we have to make this correction in order not to delay the next timeout.

  2. If the adjusted timeout belongs to a periodic k_timer which was handled during the current z_clock_announce() execution, we should make this correction to avoid accumulating error.

  3. If the adjusted timeout was added during the current z_clock_announce() execution but is not related to a periodic timer that was handled, we have to skip the adjustment. If we perform it, the newly scheduled timeout will fire earlier than expected (see the sketch below).
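
To make case 3 concrete, here is a walkthrough of the failure mode (a sketch with illustrative numbers, not actual kernel code):

/* z_clock_announce(2) runs after a delayed tick interrupt:
 *
 *   announce_remaining = 2;
 *   loop: timer1 is due (dticks == 1)
 *     announce_remaining -= 1;              -> now 1
 *     timer1's handler runs and calls k_timer_start(&timer2, 50, 0),
 *     inserting timer2 with dticks = 5 (50 ms at a 100 Hz tick rate);
 *   loop exits: timer2 is not due yet
 *
 *   first()->dticks -= announce_remaining;  <- the adjustment above
 *
 * timer2->dticks is now 4, so timer2 fires one tick (10 ms) earlier
 * than the 50 ms its caller asked for.
 */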

This issue was initially mentioned in #11722.
Affected use case is presented here: #12247 (comment)

@pizi-nordic added the 'bug' and 'area: Kernel' labels Jan 4, 2019
@pizi-nordic changed the title from 'Improper behaviour of z_clock_anounce() when ticks argument is geather by 1' to 'Improper behaviour of z_clock_announce() when ticks argument is greater than 1' Jan 4, 2019
@pizi-nordic
Collaborator Author

@andyross, @pabigot, @carlescufi: FYI

@andyross
Contributor

andyross commented Jan 4, 2019

@pizi-nordic, this is getting out of hand. Not. A. Bug. I don't know how else to explain it. Existing tests require the behavior as implemented. What you are worrying about is an untested edge case that results from an API convention issue and happens only in latency-crisis situations where interrupts have been delayed by more than a tick.

It can be fixed via API (and worked around right now), but it requires design and ideas and discussion. I tried to isolate that discussion in #12247. But now it's leaked out as a "bug" and I genuinely don't know what to do.

If you have code that actually fails and has no workaround, submit it as a test first and then submit the bug. But for now... I just don't know how else to deal with this. Stop this senseless argument. Please do not resubmit this without a test case that exercises it (and not a whiteboxed one!)

@andyross closed this as completed Jan 4, 2019
@pabigot added the 'TSC' label Jan 4, 2019
@pabigot
Collaborator

pabigot commented Jan 4, 2019

We need consensus on whether this is a bug, since @pizi-nordic and I disagree with your assessment. If it is a bug, it may very well require API and other changes to resolve. That's fine, and not a motivation for closing it until we have consensus.

@pabigot reopened this Jan 4, 2019
@andyross
Contributor

andyross commented Jan 4, 2019

(And yes, I know you think you understand this better and that I'm missing something important. But without a failing test case it's not a bug, and I'm tired of typing about it)

@andyross
Contributor

andyross commented Jan 4, 2019

Without a failing test case, it's not a bug. Please stop this senseless fight. There's a perfectly good discussion elsewhere.

@andyross closed this as completed Jan 4, 2019
@pabigot added the 'Enhancement' label and removed the 'bug' label Jan 6, 2019
@pabigot changed the title from 'Improper behaviour of z_clock_announce() when ticks argument is greater than 1' to 'Undesirable behaviour of z_clock_announce() when ticks argument is greater than 1' Jan 6, 2019
@pabigot
Collaborator

pabigot commented Jan 6, 2019

I'm reopening this after changing it to an enhancement and changing the description slightly.

I asked @pizi-nordic to open this as an issue because it had previously been discussed in comments to #11722, #12204, and #12247. At least two collaborators involved in those discussions believe the behavior of the current system does not meet our needs in cases of the sort described here, and that working around the existing behavior is not easy, if it is possible at all.

The concern raised is possibly but not necessarily related to #12247; for #11722/#12204 it was unrelated to what turned out to be the real issue, but a result of the same code. Putting it in its own issue allows us to remember to address it if and when the timer architecture is revisited, and makes it visible to others who might have an opinion.

I understand there is disagreement on whether the behavior is a bug. I hope changing the issue to an enhancement is acceptable to everybody. If not, we have a triage meeting once a week where disagreements on categorization and validity of issues can be raised, discussed, and resolved in a less abrupt manner.

@pabigot reopened this Jan 6, 2019
@pabigot removed the 'TSC' label Jan 6, 2019
@pabigot
Collaborator

pabigot commented Jan 6, 2019

I've been doing some work on a path toward alarms that use the system clock rather than the tick clock. In this context it is almost certain that the corresponding clock increment will be greater than one when carving off ready-to-run timeouts.

We don't have a test case that demonstrates the expected behavior described in this issue, but the adjustment to a newly added alarm can be eliminated without introducing a failure in any existing test (i.e., delays for rescheduled alarms are still relative to the original deadline). Code to do so can be seen on pabigot/zephyr:wip/20190106a.

This of course isn't an argument that the change should be made, but it does show that making the change is not difficult if it does appear to be desirable.

[Edit: Actually, no, the work-in-progress basically switches the implementation to #12247's reschedule management policy (b): delayed timers reschedule with a delay. There's a second patch that's intended to restore latency-ignoring behavior within the current API, but it fails eight tests.]

@pizi-nordic
Collaborator Author

pizi-nordic commented Jan 7, 2019

@andyross: The test you want, the test you have :)
https://github.com/pizi-nordic/zephyr/tree/timer-issue-12332/tests/kernel/timer/timer_issue_12332

This test validates only one thing: it fails if the timer expires too early.
The test scenario is described here: #12247 (comment)
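
The core of the check is essentially the following (a sketch, not the exact test source; names are illustrative):

static struct k_timer timer1, timer2;
static s64_t timer1_expired_at;

static void timer1_expire(struct k_timer *t)
{
	timer1_expired_at = k_uptime_get();
	/* timer2 is added to the timeout list from within the announce */
	k_timer_start(&timer2, 50, 0);
}

static void timer2_expire(struct k_timer *t)
{
	s64_t elapsed_ms = k_uptime_get() - timer1_expired_at;

	/* timer2 must not fire earlier than the requested 50 ms */
	zassert_true(50 <= elapsed_ms, "timer2 expired too early");
}

static void test_issue_12332(void)
{
	k_timer_init(&timer1, timer1_expire, NULL);
	k_timer_init(&timer2, timer2_expire, NULL);
	k_timer_start(&timer1, 1000, 1000);  /* periodic, 1000 ms */
	k_sleep(2000);                       /* let both timers run */
}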

On my board the result is the following (it looks like the z_clock_announce() issue is masked by something else):

***** Booting Zephyr OS zephyr-v1.13.0-3011-ge98e6a8a87 *****
Running test suite timer_api
===================================================================
starting test - test_issue_12332
Test Run 1 of 300...
 -> Timer 1 expired!
 -> Timer 2 expired! (after 47 ms)

    Assertion failed at /home/pizi/repos/zephyr/tests/kernel/timer/timer_issue_12332/src/main.c:52: timer2_expire: 50 <= elapsed_ms is false

@pabigot
Collaborator

pabigot commented Jan 7, 2019

@pizi-nordic As an equal-opportunity complainer I have to say that test doesn't do what you think. 2 ms of that 3 ms shortfall is due to the time it takes to printk the timer1 expiration within the callback. If the printk is removed it's 49.83 ms (1633 32 KiHz ticks), which is still 6 ticks too early to reach 50 ms.

Also, the fact that the check always fails on the first call suggests that either the test isn't correct or the timer system really is broken and a 50 ms timeout will normally complete in less than 50 ms. Adding the let's-synchronize-to-the-tick dance before timer1 is started doesn't make it succeed either. So I remain convinced that timers in Zephyr are weird, but not that we've reproduced the theoretical problem.

@pabigot
Collaborator

pabigot commented Jan 7, 2019

... a 50 ms timeout will normally complete in less than 50 ms.

That'd be it. A tick is 327 increments of the 32 KiHz clock. So 1 s as measured by Zephyr's tick clock on Nordic hardware is 997.92 ms in the real world. (So 50 "Zephyr-ms" is 49.896 ms).
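
Working that through (assuming the default 100 Hz tick rate, which matches the 327-cycle figure above):

cycles per tick = 32768 / 100 = 327.68, truncated to 327
1 s of ticks    = 100 * 327 = 32700 cycles = 32700 / 32768 s = 997.92 ms
50 ms request   = 5 ticks = 5 * 327 = 1635 cycles = 49.896 ms real time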

@andyross
Contributor

andyross commented Jan 7, 2019

Does the test fix itself with TICKLESS_KERNEL=n? That's the easiest way to determine whether or not we're looking at the multi-tick expiry issue.

@pabigot
Collaborator

pabigot commented Jan 7, 2019

No, TICKLESS_KERNEL has no effect. The test can never pass because 50 ms isn't 50 ms. With enhanced output I get:

T1 1000 ms, T2 50 ms, 32768 sysclock per tick                                   
 -> Timer 2 run 2 expired 32700 ; 35643 - 34009 = 1634 ! (after 49865 us)       
 -> Timer 2 run 3 expired 32700 ; 68343 - 66709 = 1634 ! (after 49865 us)       
 -> Timer 2 run 4 expired 32700 ; 101043 - 99409 = 1634 ! (after 49865 us)      

where the displayed 32700 is the difference between the k_cycle_get_32() values captured in successive callbacks of timer 1, which is configured to call back every 1000 ms (32768 cycles). 49.865 ms is within one rounded-up system clock tick (30.5175... us) of the 49.896 ms interval that the timeout is set for when 50 ms is requested. So it's as accurate as it's going to be.

The test is unlikely to produce a multi-tick expiry issue; to do so there'd have to be an interrupt-blocking delay in excess of something like 20 ms somewhere.

@andyross
Contributor

andyross commented Jan 7, 2019

Right, right. What I meant is that if this test were exercising the bug it thinks it is, then it would always pass with tickless disabled, because all ticks come one at a time and the ambiguity we're discussing can't exist. I buy that there may be precision bugs too.

@pabigot
Collaborator

pabigot commented Jan 7, 2019

That run I displayed is from "# CONFIG_TICKLESS_KERNEL is not set", so it doesn't get fixed. The 1634 system clock ticks measured as latency is the same as the steady state when CONFIG_TICKLESS_KERNEL=y.

@andyross
Contributor

andyross commented Jan 7, 2019

Then I'm confused. You can't hit the situation @pizi-nordic is complaining about unless z_clock_announce() is called with a ticks argument of 2 or greater. The issue is whether or not it is correct to decrement timeouts that were "just added" to the list by another timeout handler. If the argument was one tick, then announce_remaining must have been decremented to zero inside the loop during the execution of the handler, and the bug won't happen. This is isomorphic to our old ticked behavior, thus it cannot happen if TICKLESS_KERNEL is unset.

If it is happening, then the problem is somewhere else. You can verify with some whiteboxed asserts in z_clock_announce(), if you like, that the decrement called out above never happens.
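
For example (a sketch only: in_announce and added_during_announce are hypothetical debug helpers, not fields that exist in the kernel):

/* in z_add_timeout(), when inserting a timeout: */
to->added_during_announce = in_announce;

/* in z_clock_announce(), just before the final adjustment: */
if (first() != NULL) {
	__ASSERT(!(first()->added_during_announce && announce_remaining > 0),
		 "newly added timeout would lose %d unconsumed tick(s)",
		 announce_remaining);
	first()->dticks -= announce_remaining;
}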

@pabigot
Collaborator

pabigot commented Jan 7, 2019

Yes. What I've been trying to say all along is that this test is faulty: it never hits a condition where multiple ticks are being announced. It fails because of accumulated error in measuring the passage of time, not because the deadline was reduced by unaccounted ticks.

@pabigot
Collaborator

pabigot commented Jan 9, 2019

I think we should. There are other reasons why the existing timer infrastructure doesn't meet my requirements, including that the API only supports 1 ms resolution. For low-power devices I should be able to sleep for periods significantly less than 1 ms, and I have no platform-independent way to do that.

@andyross
Contributor

andyross commented Jan 9, 2019

FWIW: per a not-as-contentious-as-it-might-have-been call with most of the stakeholders just now, there's an emerging consensus for a fix that might make everyone happy. It works at the level of the k_timer_start() API (which is the higher level API, not the timeout internals).

The idea is the call takes two timeout values, a "duration" argument (IMHO poorly named) that indicates how long after the current time the first callback should arrive, and a "period" argument that indicates how long should pass between subsequent callbacks.

So we should make "duration" specify a "realtime" value that is based on current real time, irrespective of the current state of the timeout system (i.e. whether or not there is a currently-active multi-tick z_clock_announce() call in progress). But the "period" time will honor the intent of the caller to have a steady beat over time by adjusting its scheduling to account for late/missed interrupts.
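
In terms of the existing API (the signature below matches the current k_timer API; the comments describe the proposed semantics, not necessarily today's behavior):

/* void k_timer_start(struct k_timer *timer, s32_t duration, s32_t period); */

struct k_timer my_timer;  /* illustrative; initialized elsewhere */

/* First expiry 50 ms from now in real time, even if this call is made
 * from a callback inside a late multi-tick z_clock_announce().
 * Subsequent expiries keep a steady 100 ms beat, with scheduling
 * adjusted to compensate for late or missed interrupts. */
k_timer_start(&my_timer, 50, 100);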

That will be mildly hard to document, but probably not awful (realistically most users aren't going to care about the distinction). And it shouldn't be hard to implement. @pabigot has WIP code to be submitted that does something like this, I believe.

My only worry is that I'll wake up tomorrow and realize this scheme misses another edge case or has compatibility issues. I'll think on it.

@andyross
Contributor

andyross commented Jan 9, 2019

Sorry, thought I had posted this but didn't. To repeat what was said in the call:

CONFIG_TICKLESS_KERNEL=n does not affect the theoretical bug: all it does is make the system attempt to ensure that ticks is always one when z_clock_announce() is invoked.

That's really a bug in the timer drivers; I got too cute reusing the ISR implementations between tickless/non-tickless. The ticks argument to the z_clock_announce() call should always be 1 if tickless is not enabled, because that matches the behavior of true non-tickless systems.

Really the only value in disabling tickless on these drivers is code size. Given that we've never been able to detect or handle late interrupts, there's no reason to start now. We don't want to be doing that math if it isn't a requirement.

@pabigot
Collaborator

pabigot commented Jan 9, 2019

FWIW: per a not-as-contentious-as-it-might-have-been call with most of the stakeholders just now, there's an emerging consensus for a fix that might make everyone happy. It works at the level of the k_timer_start() API (which is the higher level API, not the timeout internals).

Note that my solution does not change the API at all; it just clarifies how the existing arguments are interpreted (a documentation-only change). The behavioral modification is done internally, through coordination between k_timer and the timeout code. If that's not what people expect, please let's wait until the PR is submitted and hash it out there.

That's really a bug in the timer drivers; I got too cute reusing the ISR implementations between tickless/non-tickless. The ticks argument to the z_clock_announce() call should always be 1 if tickless is not enabled, because that matches the behavior of true non-tickless systems.

I remain skeptical that it's that simple, but that can be hashed out in a separate issue.

@pizi-nordic
Collaborator Author

That's really a bug in the timer drivers; I got too cute reusing the ISR implementations between tickless/non-tickless. The ticks argument to the z_clock_announce() call should always be 1 if tickless is not enabled, because that matches the behavior of true non-tickless systems.

I think you chose a good direction in making the timer drivers common. If we implement the z_clock_announce(1) requirement on non-tickless systems, we will end up in one of two situations:

  • Either timer drivers will have to detect and handle missed ticks by calling z_clock_announce(1) multiple times (sketched below), which will result in the same problems we are observing now, but a bit harder to solve, as the kernel will not have knowledge of the missed ticks, or ...
  • ... we just ignore missed ticks, and our system clock will always run slower than real time.
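
The first option would look roughly like this in a driver ISR (a sketch; elapsed_ticks() is a hypothetical helper returning ticks since the last serviced interrupt):

static void timer_isr(void *arg)
{
	u32_t missed = elapsed_ticks();

	/* Announce one tick at a time so the kernel never sees ticks > 1.
	 * The late-tick problem does not disappear; it just moves from
	 * the kernel into the driver. */
	while (missed-- > 0) {
		z_clock_announce(1);
	}
}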

@andyross
Contributor

Honestly I don't care whether this really gets fixed or not. The point is that the current state represents a regression from the way timekeeping was done prior to the rewrite, which honored single ticks as the sole reference for timekeeping.

With the current architecture, the only reason you'd ever choose to disable TICKLESS_KERNEL is for code size reasons. And if that's the case, then you almost certainly don't care about theoretical behavior and just want the way Zephyr (and basically everything) has worked since the invention of timer interrupts in the 1960's.

@carlescufi
Member

@andyross @pabigot @pizi-nordic

The proposal to resolve the now months-old debate about kernel time vs. real time is described here:

FWIW: per a not-as-contentious-as-it-might-have-been call with most of the stakeholders just now, there's an emerging consensus for a fix that might make everyone happy. It works at the level of the k_timer_start() API (which is the higher level API, not the timeout internals).
The idea is the call takes two timeout values, a "duration" argument (IMHO poorly named) that indicates how long after the current time the first callback should arrive, and a "period" argument that indicates how long should pass between subsequent callbacks.

I just want to make sure that, regardless of other considerations, this is still considered a good solution for all the people involved here. Can we get an acknowledgement in that direction?

@pizi-nordic
Collaborator Author

I think that the proposed solution will fix k_timer. However, the same problem could affect scheduling (causing premature thread wake-ups), and we will have to fix that too.

@pabigot
Collaborator

pabigot commented Jan 15, 2019

tl;dr: the proposal is still a good solution; it does not address the general question.

I have many concerns with k_timer. Once #12400 is merged, then #12248 can be rebased and merged, then I can submit the solution for the particular problem outlined in this issue. (Got a nice test and implementation, just waiting on the blockers....)

"kernel time" and "real time" are poorly defined. We know from #12409 that there is a non-linear relation between the system clock (k_uptime_get()) and the hardware clock (k_cycle_get()) when tickless is disabled; this probably cannot/should not be changed as it's historical and has a reasonable justification. In tickless mode the relationship should be linear, but it has an inherent 0.2% error for Nordic at 100 Hz ticks, 0.02% at 1000 Hz ticks. The error reduces to zero at 32768 Hz ticks, but the maximum documented Zephyr typical tick rate is 1000 Hz and somewhere there's a comment warning about exceeding that.

At this time I'm considering prototyping a k_alarm infrastructure that will use the undivided hardware clock and take more care to deal with any remaining truncation due to unit conversion. The main driver for that effort is to eliminate the continuing impact of the concept of a "tick" on timer operations, which currently cannot express sub-millisecond sleeps, a feature lack that for me blocks use of Zephyr in low-power applications. k_timer could be implemented on top of the new infrastructure. All that, though, is a different issue.

@pizi-nordic
Collaborator Author

pizi-nordic commented Jan 15, 2019

@pabigot: I fully support removing the tick abstraction and sticking to (optionally prescaled) hardware cycles. The same was suggested months ago. However, I am not sure we should implement another subsystem alongside the existing timeout management (I assume you would like a clean start :)). In my opinion such a direction will consume much more effort.

pabigot added a commit to pabigot/zephyr that referenced this issue Jan 15, 2019
The existing loop over timeouts in z_clock_announce adjusts the delay
of the first unexpired timeout by the number of ticks remaining.  This
adjustment is (by some interpretations) incorrect when the timeout was
added while processing a previous timeout and the adjustment is
non-zero, which can happen if multiple ticks are announced in one call.

Resolve this by changing how the tick clock is managed:
* curr_ticks/k_uptime_get() is adjusted as soon as z_clock_announce()
  is invoked.  This means that timer callbacks are
  able to observe the best available estimate of the current time,
  rather than the time at which each callback was supposed to be
  invoked.  The new policy also ensures that k_uptime_get() is
  consistent with k_cycle_get().
* Prior to invoking any callbacks the sequence of expired timeouts is
  removed from the timeout queue, ensuring new timeouts are scheduled
  without the risk of inadvertently reducing their delay by unprocessed
  ticks.
* Prior to a callback the timeout dticks field is updated to be relative
  to the current time, i.e. a non-positive value corresponding to how
  late the timeout was invoked.
* The tick offset used when scheduling a recurring timeout is adjusted
  by the lateness of the invocation, ensuring that the expected callback
  times are at a fixed interval even if individual callbacks are late.
* The dticks field is zeroed when a timer is initialized, aborted,
  started, or stopped, ensuring that non-recurring timeouts respect the
  offset passed as the initial duration.

The new policy also enables a future enhancement where each k_timer can
be configured to select between policies where repeating timeouts should
be relative to the scheduled time of the last callback (historical
behavior) or the actual time of the last callback (useful for certain
cases).

Closes zephyrproject-rtos#12332.

Signed-off-by: Peter A. Bigot <[email protected]>
pabigot added a commit to pabigot/zephyr that referenced this issue Jan 15, 2019
This verifies that when a second timer is scheduled within a delayed
timer callback, the lateness of the callback (as reflected by unprocessed
announced ticks) does not impact the timeout of the newly scheduled
timer.

Relates to issue zephyrproject-rtos#12332.

Signed-off-by: Peter A. Bigot <[email protected]>
@carlescufi added the 'area: Timer' label Mar 27, 2019
@pabigot assigned nordic-krch and unassigned pabigot Mar 13, 2020
@zephyrbot
Collaborator

Hi @nordic-krch,

This issue, marked as an Enhancement, was opened a while ago and did not get any traction. Please confirm the issue is correctly assigned and re-assign it otherwise.

Please take a moment to review if the issue is still relevant to the project. If it is, please provide feedback and direction on how to move forward. If it is not, has already been addressed, is a duplicate, or is no longer relevant, please close it with a short comment explaining the reason.

@pizi-nordic you are also encouraged to help move this issue forward by providing additional information and confirming this request/issue is still relevant to you.

Thanks!
