What is the proper timer "now" convention for late interrupts? #12247
In the context of the current clock architecture where timeouts are tightly coupled to clock management, I think the current implementation is the best alternative. However, I do not like that the timeout is invoked without information about the true passage of time. If Zephyr were to transition to a tickless system clock driven by a free-running monotonic fixed-rate counter that will not overflow in the life of the device (which I hope would be a design goal for #12068), I would prefer an alarm (timeout) infrastructure based on deadlines as absolute system times. Alarm callbacks should be invoked with access to the true system time and the deadline at which they were scheduled. Repeating alarms that are not cancelled before the callback returns should be re-scheduled, nominally at the interval after the original deadline, but at a different deadline if one is set before the callback returns. See how systemd does this as an example.
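A minimal sketch of those repeat semantics, with entirely hypothetical types and names (none of this is existing Zephyr API):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical alarm record. */
struct alarm {
	uint64_t deadline;   /* absolute system time at which to fire */
	uint64_t interval;   /* repeat interval, 0 for one-shot       */
	bool cancelled;
	void (*handler)(struct alarm *a, uint64_t deadline, uint64_t now);
};

/* Invoke a due alarm: the handler sees both its scheduled deadline and
 * the live "now".  An uncancelled repeating alarm whose deadline the
 * handler did not change is re-armed one interval past the ORIGINAL
 * deadline, so a late invocation doesn't shift the whole series. */
static void alarm_invoke(struct alarm *a, uint64_t now)
{
	uint64_t due = a->deadline;

	a->handler(a, due, now);

	if (a->interval != 0 && !a->cancelled && a->deadline == due) {
		a->deadline = due + a->interval;
	}
}
```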
So would we all, and that's what the API exposes. The problem is that we can't have that here because an interrupt was delayed. The question is how to recover from the failure.
I don't see that. I see both k_timer and _timeout using deadlines as a relative offset from "now", where "now" means "whatever value the kernel last managed to write into the tick counter".
Yes, because we're relying on a software-maintained tick timer. For systems where (a) the system time is driven by a free-running live monotonic counter that can't be delayed by interrupt processing, (b) deadlines are absolute values that can be compared with the live counter, and (c) the callback can compare the live counter against the time it was supposed to be invoked, I believe the problem largely just goes away (or is delegated to the callback, which is the same thing). For systems where some intervention is required (e.g. to extend a 24-bit RTC counter by using an overflow counter to provide more bits) there'd be some overhead, but that can still be decoupled from processing timeouts. "Which of the two times should we expose to the callback?" is the wrong question. A better question is "how can we give the callback both times?" An answer is to make system time track real-world elapsed time more accurately, then provide the users with both the "scheduled at" time and access to the (constantly updating) "true" time. #12068 is an opportunity to pursue that approach.
You kind of lost me here. Do you have a mechanism in mind for invoking a timeout callback in a context other than a timer interrupt handler? I mean, I get your API criticism, and it's valid (I don't see any reason not to have a global monotonic time_t kind of thing at the timeout API level). I just don't see how it amounts to anything but a shuffling of the proverbial deck chairs. The problem still exists: you have a callback that was supposed to have been run earlier but is actually being run late, and someone needs to make the decision as to which time to use for "six ticks later". Here's some code to focus the discussion:
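A sketch of the shape of that code (the function name is illustrative): a thread that sleeps its way through an hour, one second at a time.

```c
#include <zephyr/kernel.h>   /* <kernel.h> in Zephyr of this vintage */

void hour_counter(void)
{
	for (int i = 0; i < 3600; i++) {
		k_sleep(K_SECONDS(1));   /* one "tick" of the clock */
		/* ...update whatever is being counted or displayed... */
	}
}
```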
Under the current scheme (and assuming no pressure from other threads in the system that would preempt it), this code works correctly and ticks for one hour even if some driver or hypervisor or qemu environment (yes, qemu has exactly this behavior) acts to block interrupts or halt the system for some amount of time greater than one tick. If the callback that was being used by k_sleep() to wake up the calling thread instead used "current real time" as the base, then it would be late. And an API change like you propose can't fix that, because this code isn't using that API, it's using a more primitive one, and you'd have to know whether the calling code was trying to implement a "1 second or greater delay" (in which case late/skewed delivery is fine) or the one hour counter it actually implements. I'm fine with having an alternative API available, which timeout callbacks that understand the difference can use to discriminate (which is what this whole issue is about). But it's not an API problem.
(Actually no, k_sleep() won't have that behavior, because obviously the calling thread won't be awake after a late interrupt until after all ticks have been announced and current time updated. I was too cute with the example. But in fact k_timer does work exactly like this, naively scheduling another iteration in N ticks every time the callback fires. It's the natural way to implement a clock, and it breaks if you try to use realtime as the base for scheduled callbacks.)
I don't seem to be getting my thoughts across clearly.
No, this is exactly what I'm saying should not be decided. I envision a system where I can do:
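Something like the following, where every type and function is hypothetical and exists only to convey the shape of the idea:

```c
#include <stdint.h>

#define MAX_ACCEPTABLE_LAG 5     /* ticks; illustrative */
#define SAMPLE_INTERVAL    1000  /* ticks; illustrative */

struct alarm;                                            /* hypothetical */
void alarm_schedule(struct alarm *a, uint64_t deadline); /* hypothetical */

/* The callback receives both the absolute deadline it was scheduled
 * for and the live system time at which it is actually running. */
void my_handler(struct alarm *alarm, uint64_t deadline, uint64_t now)
{
	if (now - deadline > MAX_ACCEPTABLE_LAG) {
		/* I'm late, I know by how much, and I decide what to do. */
	}

	/* Repeat relative to the ORIGINAL deadline, not "now". */
	alarm_schedule(alarm, deadline + SAMPLE_INTERVAL);
}
```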
where that code waves its hands wildly in hopes it conveys the idea. From what I see, it's the coupling of clock increment with alarm processing in z_clock_announce() that produces the problem. Or maybe I'm wrong and that's not how it works. As for feasibility: I never intended to imply that the API I proposed could work with the current implementation (although, looking at the example above, maybe it could...). What I propose is that by changing how the system clock is maintained, which is perhaps on the table in #12068, that API could become feasible. And I propose that that API would eliminate the problem, because even when a callback is late it has enough information to detect its lateness and handle it in the way it feels is necessary: i.e., it gets both times. In any case I'm going to try to shut up now and let others express a view.
I fully support that 👍
I am still not fully satisfied with the current implementation. What I am afraid of is that a timeout requested while tick processing is running behind can be silently shortened. Here is why:
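A worked example with illustrative numbers (not real Zephyr internals):

```c
#include <stdint.h>

void shortened_timeout_example(void)
{
	uint32_t announced_now = 100; /* the kernel's idea of "now"; two  */
	uint32_t true_now = 102;      /* ticks of announcements are still */
	                              /* pending when the call is made    */

	/* The caller asks for a 10-tick timeout; it is queued relative
	 * to the announced time: */
	uint32_t deadline = announced_now + 10;      /* fires at tick 110 */

	/* Once the pending ticks are announced, announced time catches up
	 * with true time, so the timeout fires only 8 real ticks after it
	 * was requested: */
	uint32_t real_delay = deadline - true_now;   /* 8 < 10 */
	(void)real_delay;
}
```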
I'm not really sure I buy that logic. I mean, we know that k_timer relies on the behavior specified. Yet you're asserting that all other users... don't? Remember again that there are no "normal" circumstances where this even matters. You only hit it if there is a delayed interrupt (or software bug) that causes the tick announcement to fall more than a tick behind where it should have been. So, sure, it may seem "unexpected", but everything is going to be unexpected when that happens. As far as changing the API to use absolute time units, I'm not at all opposed. I just don't think it's going to have the benefits advertised; someone still needs to worry about this distinction. It also has the potential to start making more elaborate demands on the driver layer to support that kind of realtime counter, which is IMHO a bad place to be putting complexity. Not all hardware has such a realtime counter; they have funny interactions with idle, and driver code is just generally a rough place to be working. SysTick lacks a proper free-running counter, the APIC's is somewhat loosely coupled to the TSC, stuff like that.
We've been touching at least three distinct questions here.

**1: Behavior when z_clock_announce() is passed ticks > 1**

Most recently summarized above as a potential reduction in the requested delay for a timer when something delays tick processing for more than one tick duration. I think you (@andyross) are acknowledging that this is undesirable behavior (aka a bug), but suggesting it only occurs in situations where the system is already unstable, so we shouldn't worry about it.
I don't think that the fact the system is slightly unstable removes Zephyr's responsibility to take low-cost steps to avoid destabilizing it further. If the cost of maintaining the requested interval correctly is "too high" then it shouldn't be applied, but we can't judge that until we see a working solution. This should be recorded as a separate issue, since it's behavior that can be demonstrated in the current implementation.

**2: Should alarm deadlines be relative or absolute?**

This shouldn't matter if it's done right. Which representation is more maintainable and usable depends on implementation specifics of how alarms are managed internally and configured externally. Let's move that elsewhere too, once there's a context where it can be discussed (e.g. a proposal to support absolute deadlines).

**3: "Which now is now" in an alarm callback**

I believe this is what we're supposed to be addressing here. I propose two general statements that should be interpreted without involving Zephyr or any other system:
I hope there's no disagreement with either statement. If there is, let's pause and clean that up first; I've labeled them so they can be discussed independently. Moving on, much of the discussion here is because:
Again, pause for consensus on this statement.
An observation, hopefully not controversial.
Here consensus doesn't matter: that's a subjective statement, true at a minimum for me, and I believe also for @pizi-nordic.
Follows from [3] and [4].

**Summary**

I don't claim the alarm management scheme (a), currently implemented, has no use. I don't even claim that the alarm management scheme (a) should not be the default. But any suggestion (intentional, inadvertent, or perceived) that we are mistaken in our belief that we have a use for the other management scheme doesn't contribute helpfully to the discussion. This issue appears to be based on an assumption that we have to pick one management scheme or the other. I reject that assumption, and believe we need a proposal for an alarm infrastructure that supports both schemes as well as additional capabilities. At some point I will create such a proposal, assuming nobody does it first.
@pabigot: Nice summary 👍 Let's assume that we have an external temperature sensor and we would like to measure the ambient temperature. The measurement will be taken every minute, and a slight delay is not a big problem for us, as we just want to display the current temperature. The sensor needs 40 ms to perform a conversion. The temperature can be read after that time. What is the most obvious method to implement that?
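Presumably something like this, using two k_timers (the sensor helpers are hypothetical stand-ins for real driver calls):

```c
#include <zephyr/kernel.h>

void sensor_start_conversion(void);   /* hypothetical driver helpers */
void sensor_read_and_display(void);

static struct k_timer measure_timer;  /* fires once a minute         */
static struct k_timer readout_timer;  /* fires 40 ms after a trigger */

static void readout_fn(struct k_timer *t)
{
	/* The conversion time has elapsed; the result is valid to read.
	 * (Real code would likely defer this to a thread or workqueue
	 * rather than doing it in the timer's ISR context.) */
	sensor_read_and_display();
}

static void measure_fn(struct k_timer *t)
{
	sensor_start_conversion();
	/* Rule [1]: read no sooner than 40 ms from now. */
	k_timer_start(&readout_timer, K_MSEC(40), K_NO_WAIT);
}

void start_sampling(void)
{
	k_timer_init(&measure_timer, measure_fn, NULL);
	k_timer_init(&readout_timer, readout_fn, NULL);
	k_timer_start(&measure_timer, K_SECONDS(60), K_SECONDS(60));
}
```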
For me, such a solution looks good. Rule [1] guarantees that we always fetch valid data from the sensor. For the periodic timer, the delay introduced by [1] is not a problem, as we know that the ambient temperature will not change in dozens of µs (we do not have a real-time requirement here). Also, both periodic timer management schemes (a) and (b) work for us (the user will not notice accumulated error on the display update). Where is the problem? If tick processing is delayed, the 40 ms timeout can be silently shortened, and we may read the sensor before the conversion has finished, violating rule [1]. This is why I see the *Behavior when z_clock_announce() is passed ticks > 1* issue as a critical one. Everything else is either a feature request (different periodic timer handling schemes, passing information about now() to handlers, etc.) or affects maintenance cost and resource usage (absolute/relative timeouts).
Thanks.
I don't see an issue specifically for this, though it was also noted in #11722. Could you open one? I agree it's important[*], but I think that discussion should be separate from this one. [*]: I'm working with the SHT21/HTU21D/Si7021, which have exactly the behavior you describe and which drive several of my alarm-related expectations that Zephyr can't currently support.
Please stop submitting bugs, guys. Submitting a bug is aggressive. It's an attempt to get my management to force me to do what you want (yes, we track bug counts, and the people who pay me care whether or not I'm fixing the ones I'm assigned), when we have a genuine design disagreement here. Discuss, don't escalate. It sounds like @pizi-nordic's complaint is addressable in the immediate term with an API to get the "timeout shortfall" or whatever. Or via a more elaborate API change per @pabigot that swaps the time units to absolute times (though again I warn there are edge cases and driver complexity issues with doing that). Or just by building without tickless enabled, which eliminates the ambiguity (not the late interrupts, of course!). I'm more than happy to see code submitted along either of those lines. But please stop going around my back, especially when I deliberately carved out a space for this argument and am genuinely trying to hear your opinions.
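For concreteness, a hand-waving sketch of what a "timeout shortfall" accessor might look like; every name here is hypothetical, and nothing like it exists today:

```c
#include <stdint.h>

uint64_t sys_clock_hw_ticks(void);        /* live hardware count;  */
uint64_t sys_clock_announced_ticks(void); /* last announced count; */
                                          /* both hypothetical     */

/* From inside a timeout callback: how many ticks behind schedule
 * the announced time currently is. */
static inline int64_t timeout_shortfall(void)
{
	return (int64_t)(sys_clock_hw_ticks() - sys_clock_announced_ticks());
}
```

A callback could subtract the shortfall to follow convention (1), or ignore it to keep convention (2).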
Closing this one. I think at this point everyone seems to agree that the issue I describe way above and the one in #12332 are, in fact, exactly the same problem. And in any case all current discussion is there and not here.
Opening this new issue to get the giant (and IMHO mostly irrelevant) bike shed out of #11722 and #12204.
With the advent of pervasive tickless, we have a new API convention to argue about: What happens when a multi-tick timer interrupt is delayed?
Inside delayed timer callbacks, code may (and usually does) set more callbacks for future events. Should the "current" time used as the reference for these callbacks be (1) the "true" current time or (2) the time the tick was expected to be delivered?
Right now the code implements choice (2). The reasoning is that often you have periodic processing where a timeout callback always sets a new timeout N ticks in the future (this is exactly the way k_timer works, for example). Under choice (1), an interrupt delayed by 1 tick will add that tick to the sequence, and all further ticks will slowly fall behind the intended pace. With choice (2), we can recover the lost time and keep the timekeeping steady even in the face of giant latency problems. With choice (1), the default latency of the "next" interrupt can be held more constant[1], making it a better fit for code trying to do "realtime" control using the tick timer[2]. It's also worth pointing out that choice (2) results in a simpler implementation, and that the "saved expired timers list" implementation used in the old code had multiple bugs fixed during the rewrite.
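As a sketch of the difference in tick arithmetic (numbers illustrative): a periodic callback re-arms itself every 10 ticks, and the interrupt that ran it arrived 3 ticks late.

```c
#include <stdint.h>

void rearm_example(void)
{
	uint32_t expected = 1000;  /* tick the callback was due at  */
	uint32_t actual = 1003;    /* tick at which it actually ran */
	uint32_t period = 10;

	uint32_t next1 = actual + period;    /* choice (1): 1013; the */
	                                     /* series drifts 3 ticks */
	uint32_t next2 = expected + period;  /* choice (2): 1010; the */
	                                     /* series keeps its pace */
	(void)next1;
	(void)next2;
}
```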
Basically, the question is this: do we want a separate API variant (or maybe kconfig) available to allow code to use the convention it wants, or can we achieve what we want with the existing code?
[1] But IMHO this is of marginal value as we're by definition experiencing catastrophic latency problems already!
[2] IMHO a poor API fit if the code or interrupt environment can cause latencies like this. Such apps should be using separate hardware timers (most SoCs have a few available) or a higher priority tick timer.