
System timer handling with low-frequency timers #9904

Closed
carlescufi opened this issue Sep 11, 2018 · 44 comments
Labels
area: Kernel area: Timer Timer Enhancement Changes/Updates/Additions to existing features platform: nRF Nordic nRFx priority: high High impact/importance bug

Comments

@carlescufi
Member

carlescufi commented Sep 11, 2018

Nordic ICs use a low-power, 32 kHz timer to drive the Zephyr kernel.
This results in several precision errors that affect timer handling in real-life applications.

Sub-issues:
#12332 Undesirable behaviour of z_clock_announce() when ticks argument is greater by 1
#16195 k_uptime_delta(): Defective by design

Related closed issues:
#16238: k_cycle_get_32() API is useless in some Zephyr configurations
#12509: Fix rounding in _ms_to_ticks()
#12542: nrf timers unstable with ticks faster than 100 Hz
#9708: tests/kernel/tickless/tickless_concept fails on nRF5x with 1000msec boot delay
#8159: tests/kernel/fifo/fifo_timeout fails on nrf51_pca10028 and nrf52_pca10040
#9843: tests sched/deadline fails on nRF boards

@carlescufi carlescufi added Enhancement Changes/Updates/Additions to existing features area: Kernel platform: nRF Nordic nRFx area: Timer Timer labels Sep 11, 2018
@pizi-nordic
Collaborator

pizi-nordic commented Sep 11, 2018

First, I would like to present some background information about this complex problem:

Issue 1: Tick time calculation.
The sys_clock_hw_cycles_per_tick value, which is used as a reference for RTC programming, is calculated as (CONFIG_SYS_CLOCK_HW_CYCLES_PER_SEC / sys_clock_ticks_per_sec), where sys_clock_ticks_per_sec equals CONFIG_SYS_CLOCK_TICKS_PER_SEC.

On the nRF5x series, this approach leads to the following kernel settings (assuming CONFIG_SYS_CLOCK_TICKS_PER_SEC = 100):

  • sys_clock_hw_cycles_per_tick = 32768 / 100 = 327.
  • Real tick time = 327/32768 s = 9.979 ms

Problem: CONFIG_SYS_CLOCK_TICKS_PER_SEC is no longer true. In fact we have 32768/327 ≈ 100.2 ticks per second. All code using CONFIG_SYS_CLOCK_TICKS_PER_SEC for timing calculations, directly or indirectly, produces invalid output because the fractional part is not taken into account. However, many tests still calculate timeouts based on this value and on the hidden assumption that CONFIG_SYS_CLOCK_HW_CYCLES_PER_SEC is divisible by CONFIG_SYS_CLOCK_TICKS_PER_SEC without remainder.

Of course, we could try to mitigate this problem, for example by setting CONFIG_SYS_CLOCK_TICKS_PER_SEC to 128, however this will not eliminate the bigger issue described below.
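The Issue 1 numbers can be reproduced with a few lines of arithmetic (a sketch using the values above; the constants are the nRF5x RTC rate and the assumed 100 Hz tick configuration):

```python
# Issue 1 arithmetic: truncating division loses the fractional tick.
HW_CYCLES_PER_SEC = 32768          # nRF5x RTC frequency
TICKS_PER_SEC_CONFIG = 100         # CONFIG_SYS_CLOCK_TICKS_PER_SEC

cycles_per_tick = HW_CYCLES_PER_SEC // TICKS_PER_SEC_CONFIG   # 327, not 327.68
real_tick_ms = cycles_per_tick * 1000 / HW_CYCLES_PER_SEC     # ~9.979 ms
real_ticks_per_sec = HW_CYCLES_PER_SEC / cycles_per_tick      # ~100.2, not 100

# Error accumulated over one wall-clock hour if code assumes 100 ticks == 1 s:
drift_ms_per_hour = 3600 * 1000 - 3600 * TICKS_PER_SEC_CONFIG * real_tick_ms
```

With these numbers the "software second" of 100 ticks is about 2.1 ms short, which adds up to roughly 7.5 seconds of drift per hour.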

Issue 2: Tick <-> ms conversion

Issue 3: Tickless mode
I have not even touched this yet, as I planned to fix the simpler tick-based mode first :)

Summary:
I see only one solution to fix the timing issues on the nRF5x platform. However, it is a drastic one: we have to eliminate the tick idea from the Zephyr kernel (this includes removing tick-based operation as well as using system clock cycles as the timing unit in tickless mode).

BTW: We can also "nicely" hide all of these issues just by rounding up the sys_clock_hw_cycles_per_tick calculation. The tick will be longer than 10 ms (POSIX will be happy), and the difference will be small enough to let us pass all tests, as they usually use relatively small timeouts ;-)

@pizi-nordic
Collaborator

pizi-nordic commented Sep 11, 2018

@ioannisg, @nashif : Could you please look at this? Maybe you will see better solution than the one above.

@carlescufi
Member Author

I see only one solution to fix the timing issues on the nRF5x platform. However, it is a drastic one: we have to eliminate the tick idea from the Zephyr kernel (this includes removing tick-based operation as well as using system clock cycles as the timing unit in tickless mode).

CC @andrewboie @andyross

@ioannisg
Member

I see only one solution to fix the timing issues on the nRF5x platform. However, it is a drastic one: we have to eliminate the tick idea from the Zephyr kernel (this includes removing tick-based operation as well as using system clock cycles as the timing unit in tickless mode).

This is a bit confusing to me. The system clock is the tick-based "soft" clock (as opposed to the 32-bit HW clock). "System clock cycles as timing unit": you mean 32-bit HW clock cycles as timing unit?

@pizi-nordic
Collaborator

pizi-nordic commented Sep 11, 2018

I see only one solution to fix the timing issues on the nRF5x platform. However, it is a drastic one: we have to eliminate the tick idea from the Zephyr kernel (this includes removing tick-based operation as well as using system clock cycles as the timing unit in tickless mode).

This is a bit confusing to me. The system clock is the tick-based "soft" clock (as opposed to the 32-bit HW clock). "System clock cycles as timing unit": you mean 32-bit HW clock cycles as timing unit?

Even in tickless mode, the system is still using "ticks" to manage time slices:

#ifdef CONFIG_TICKLESS_KERNEL
#define sys_clock_ticks_per_sec \
                (1000000 / (CONFIG_TICKLESS_KERNEL_TIME_UNIT_IN_MICRO_SECS))
(...)

Without eliminating the whole idea of ticks, we will always end up fighting ms <-> tick conversion issues.
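Expanding the quoted macro shows that the tickless configuration still quantizes time (a sketch with an assumed CONFIG_TICKLESS_KERNEL_TIME_UNIT_IN_MICRO_SECS of 10000; the helper name is illustrative, not the kernel's):

```python
# The tickless "time unit" still defines a tick rate.
TIME_UNIT_US = 10000                                  # assumed 10 ms time unit
sys_clock_ticks_per_sec = 1_000_000 // TIME_UNIT_US   # 100 -- ticks in disguise

def ms_to_ticks_trunc(ms):
    # Truncating ms -> tick conversion, as used for timing calculations.
    return ms * sys_clock_ticks_per_sec // 1000

# A 15 ms request becomes 1 tick, i.e. 10 ms of granularity.
requested_ticks = ms_to_ticks_trunc(15)
```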

@andyross
Contributor

I agree with @carlescufi for the most part. There's an idea floating around locally here to try to make TICKLESS_KERNEL the default on all platforms that support it, which will go most of the way to what you want.

Note that there's still a need for something like a "tick". There are two time units internal to the Zephyr kernel: the high speed one returned by k_cycle_get_32(), and a lower speed one that is needed for scheduling timeouts (because the cycle counter rolls over too fast) -- that's our "tick" right now, and it will still be around in any design that needs to use a 32 bit timeout counter, even if we don't call it a "tick".

@carlescufi
Member Author

carlescufi commented Sep 11, 2018

@andyross I assume you meant that you agree with @pizi-nordic and his comments here and here

@andyross
Contributor

Heh, indeed. Misread the attribution in your quote :)

@pdunaj
Collaborator

pdunaj commented Sep 12, 2018

@pizi-nordic maybe a correction could be applied once per second (or once per n ticks) that would make one of the ticks shorter (or longer)? It should be relatively cheap.

@pizi-nordic
Collaborator

Note that there's still a need for something like a "tick". There are two time units internal to the Zephyr kernel: the high speed one returned by k_cycle_get_32(), and a lower speed one that is needed for scheduling timeouts (because the cycle counter rolls over too fast) -- that's our "tick" right now, and it will still be around in any design that needs to use a 32 bit timeout counter, even if we don't call it a "tick".

In my opinion the sole place where we need a "low speed" timer is scheduler time slices (as we cannot swap threads every hw clock cycle). All other things (sleep times, timeouts, etc.) could be specified directly using the full hardware timer resolution.

Now we are converting everything into ticks, which gives us only coarse timing.
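To put "coarse" in numbers, compare the granularity of a 100 Hz tick against the raw 32768 Hz RTC (a back-of-envelope sketch, not Zephyr code):

```python
# Granularity comparison: tick-based vs. raw hardware-cycle timing.
HW_HZ = 32768     # nRF5x RTC
TICK_HZ = 100     # typical tick rate

tick_resolution_us = 1_000_000 / TICK_HZ    # 10000 us per tick
cycle_resolution_us = 1_000_000 / HW_HZ     # ~30.5 us per hardware cycle
resolution_ratio = tick_resolution_us / cycle_resolution_us   # ~328x coarser
```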

@pizi-nordic
Collaborator

@pizi-nordic maybe a correction could be applied once per second (or once per n ticks) that would make one of the ticks shorter (or longer)? It should be relatively cheap.

@pdunaj: I thought about that, but this solution introduces jitter.

@pdunaj
Collaborator

pdunaj commented Sep 12, 2018

@pizi-nordic , sure, but why not accept the jitter? I don't think ticks should be used for precise time accounting (as you mentioned above, sleep may take more than requested), and the biggest problem I see here is the time shift introduced over a long period.

@pizi-nordic
Collaborator

pizi-nordic commented Sep 12, 2018

@pdunaj: Let's assume the following situation:

  • We are using rounded up 'sys_clock_hw_cycles_per_tick' (tick time > 10ms).
  • To correct the timing, we introduce a shorter tick (< 10 ms) from time to time. In this example every 3rd tick is shortened.
+--------------------------+--------------------------+--------------------------+
|        Tick #1 > 10ms    |      Tick #2 > 10ms      |     Tick #3 < 10ms       |
+--------------------------+--------------------------+--------------------------+-----> t
  • Thread sleeping over ticks 1-3: Perfect timing
  • Thread sleeping over ticks 1-2: Too long - acceptable.
  • Thread sleeping over ticks 2-3, or just over tick 3: Too short - not acceptable.
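The example above can be made concrete with assumed cycle counts (two 328-cycle ticks followed by one 327-cycle tick, which keeps the three-tick average near 10 ms on a 32768 Hz clock):

```python
# Jitter introduced by a periodically shortened tick on a 32768 Hz clock.
HW_HZ = 32768
pattern_cycles = [328, 328, 327]   # assumed correction pattern: 3rd tick shorter

durations_ms = [c * 1000 / HW_HZ for c in pattern_cycles]

long_tick_ms = durations_ms[0]       # ~10.010 ms: slightly too long
short_tick_ms = durations_ms[2]      # ~9.979 ms: too short -- early wakeup
three_ticks_ms = sum(durations_ms)   # ~30 ms: the average is (nearly) correct
```

A thread whose sleep happens to cover only the shortened tick wakes up early, which is the unacceptable case in the list above.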

@pizi-nordic
Collaborator

pizi-nordic commented Sep 12, 2018

I agree with @carlescufi for the most part. There's an idea floating around locally here to try to make TICKLESS_KERNEL the default on all platforms that support it, which will go most of the way to what you want.
Note that there's still a need for something like a "tick". There are two time units internal to the Zephyr kernel: the high speed one returned by k_cycle_get_32(), and a lower speed one that is needed for scheduling timeouts (because the cycle counter rolls over too fast) -- that's our "tick" right now, and it will still be around in any design that needs to use a 32 bit timeout counter, even if we don't call it a "tick".

In my opinion making TICKLESS_KERNEL the default is only a half-way solution. If we keep the tick abstraction we will not gain precision, which is easily achievable in a tickless kernel. From the user perspective the lack of precision is the real problem: the hardware timer could schedule events using its full resolution, but the tick abstraction enforces a bigger time unit. If we add to that the error introduced by the ms-to-tick conversion, this issue becomes really painful.

Of course such an approach will not be easy. We will have to hold 64-bit timestamps, handle hardware timer wraps, etc. Fortunately the direction of the changes is already drawn in #8340 (please see this discussion: #8340 (comment)). If we follow the ideas presented there and create a generic abstraction of a 64-bit timer, the kernel part will be easy.

In my opinion the effort is justified in this case, as we could gain several advantages:

  • Precise timekeeping, as timer resolution is not reduced by software.
  • Simplified kernel and applications, as there will be only one time source and representation.
  • No more issues like POSIX clock_gettime() is discontinuous #8009 (please note that the applied fix just removes the problematic part, sacrificing precision).

@andyross
Contributor

The "tick" doesn't have to be a ~100 Hz number. It's reasonable to increase precision on a tickless system to whatever the underlying timer can support. So if you want a 1MHz tick rate, you're limited to a longest timeout of about an hour, etc... You don't have to go all the way to 64 bit to get a reasonable tradeoff, though specific numbers are likely going to have to be chosen per-app and not per-system.
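The tradeoff described here can be quantified (a sketch assuming a 32-bit tick counter, using the rates mentioned in the thread):

```python
# Longest representable timeout for a given tick rate with a 32-bit counter.
def max_timeout_seconds(tick_hz, bits=32):
    return (1 << bits) / tick_hz

one_mhz_hours = max_timeout_seconds(1_000_000) / 3600   # ~1.2 h at 1 MHz ticks
rtc_hours = max_timeout_seconds(32768) / 3600           # ~36.4 h at 32768 Hz
```

This also matches the nRF5x observation later in the thread: at the full 32768 Hz RTC rate, 32 bits still cover more than 36 hours.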

@andyross
Contributor

To be clear: I'm not opposed to the idea of a fully 64 bit timer subsystem. I just don't think the benefit is going to be that high vs. a well-chosen SYS_CLOCK_TICKS_PER_SEC value on an existing tickless kernel.

@pizi-nordic
Collaborator

Basically we would like to have SYS_CLOCK_TICKS_PER_SEC = SYS_CLOCK_HW_CYCLES_PER_SEC.
As the RTC clock in the nRF5x is slow, 32 bits are enough to hold more than 36 hours.

I just thought about a more generic solution which enables all platforms to use the full hw clock precision.
I am also not sure about the scheduler behaviour in such a case. How does the scheduler manage time slicing in tickless mode?

@andyross
Contributor

That doesn't work in all systems, though. On a PC, the TSC runs in GHz and will overflow the 32 bit counter in seconds, requiring either a fully 64 bit system or some other kind of "intermediate representation" (i.e. represent timeouts in "TSC >> 10" units or something) for timeout scheduling, at which point we've simply reinvented the tick.

@pizi-nordic
Collaborator

I agree that we have to represent timeouts longer than the hw counter overflow time (this is why I proposed 64 bits). I also agree that direct usage of the TSC value for event scheduling is pointless (we cannot efficiently handle timeouts separated by a few TSC cycles).

In such a case an intermediate representation is unavoidable. However, we can hide this representation in the timer driver, virtually reducing the hardware timer frequency.
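The "virtually reducing the hardware timer frequency" idea can be sketched as a shift-based prescaler inside the driver (names and the shift value are illustrative, not a real Zephyr API):

```python
# A driver-internal prescaler: the kernel sees a slower, wrap-friendly counter.
PRESCALE_SHIFT = 10   # e.g. a ~3 GHz TSC becomes a ~3 MHz virtual timer

def virtual_now(tsc_value):
    # Counter value exposed to the kernel.
    return tsc_value >> PRESCALE_SHIFT

def virtual_to_hw(virtual_ticks):
    # Convert a kernel deadline back to hardware units when arming the timer.
    return virtual_ticks << PRESCALE_SHIFT
```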

@pizi-nordic
Collaborator

Conclusions from an offline meeting between @nashif, @andyross, @cvinayak, @carlescufi and @pizi-nordic:

  • Tickless mode should be the default operating mode of Zephyr.
  • After the transition to tickless, we should consider removing the other operating modes.
  • The "tick" should remain as a time unit inside kernel (all kernel timers will have tick-limited granularity).
    This is needed as some architectures have fast timers and typical timeouts do not fit in a 32-bit variable. On architectures with slower timers, for example the nRF5x, the tick rate might be equal to the hardware timer clock rate, eliminating the loss of precision.
  • All time-related computation inside the kernel should be done using ticks in order to avoid excessive time unit conversions. Exceptions:
    • Kernel API: the time given by the application (in ms) has to be converted to ticks as early as possible, ideally in _impl_k_xxxx()
    • The system timer driver, which has to convert ticks to hardware cycles.
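The "convert at the API boundary" rule can be sketched like this (hypothetical helper names, not the actual Zephyr implementation; the conversion rounds up so a sleep never completes early):

```python
# ms -> ticks conversion done once, at the kernel API boundary.
TICK_HZ = 100   # assumed tick rate

def ms_to_ticks_ceil(ms):
    return -(-ms * TICK_HZ // 1000)   # ceiling division

def k_sleep_impl_sketch(ms):
    ticks = ms_to_ticks_ceil(ms)      # single conversion at the boundary
    # ... from here on, all internal bookkeeping stays in ticks ...
    return ticks
```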

@andyross
Contributor

Now that all the timer/timeout/tickless/driver work is merged, we should go back and retest all of these. I think all of these bugs should be addressed now, but I don't have access to all the hardware to check.

@carlescufi
Member Author

Another way to think about it: nothing you're asking for is impossible for an application to provide on its own. The tick rate is available via kconfig, and you know timeout events are always delivered at tick boundaries. I mean... is this really such a hardship? What is it you are trying to do that you can't make work? Maybe we can help.

@andyross I think we cannot expect every nRF-based application to implement this by itself. This has been raised by several users, both upstream and downstream, so it is my impression that there is justification for solving this problem in a way that is reusable across applications. By that I am not suggesting any particular change, just that timer accuracy with low-power, low-frequency timers should be part of the common codebase so that every user can benefit from it.

@pizi-nordic
Collaborator

pizi-nordic commented Mar 28, 2019

@andyross, @carlescufi, @nashif: As you see the whole discussion about timers is continuously repeating. Could we eventually fix these issues?

Just a short recap of the things that should be changed in my opinion (any comments welcome):

  • Get rid of tick alignment in the kernel. The API specifies that k_sleep() sleeps from now() to now() + timeout, not from magic_tick_alignment(now()) to magic_tick_alignment(now() + timeout).

  • When we do the above, the tick abstraction will be reduced to a simple prescaler. I would like to remove this prescaling too, as it reduces precision, but based on the previous discussions I assume that we still do not want to use 64-bit timeouts inside the kernel (with a 32-bit timeout, the maximum possible timeout is really reduced on systems with a fast timer). So if the prescaling layer has to remain, I would name it correctly (system clock prescaler) and locate it outside the kernel (see Use counter drivers as a system clock source #12068).

  • I still think that we should use absolute timeouts in the timeout list. This needs some code to handle wrapping, but the timeout will be calculated once and we will not introduce errors during timeout management (this will be important once we eliminate ticks, as reading the timer takes time, and with a fast-running timer the result we get will be outdated).
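The wrap handling needed for absolute timeouts can be sketched with the usual modular-arithmetic comparison on a 32-bit counter (illustrative code, assuming deadlines always lie within half the counter range of now):

```python
# Wrap-safe "has this absolute deadline expired?" check on a 32-bit counter.
MASK = 0xFFFFFFFF  # emulate 32-bit unsigned wrap-around

def deadline_expired(now, deadline):
    diff = (deadline - now) & MASK
    # diff in the upper half of the range means the deadline is in the past.
    return diff == 0 or diff > 0x7FFFFFFF

# Correct even across the wrap: now just below 2^32, deadline just past 0.
still_pending = not deadline_expired(0xFFFFFFF0, 0x00000010)
```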

@kubiaj: Will these changes fix your issues?

@pabigot
Collaborator

pabigot commented Mar 28, 2019

@kubiaj I agree with much of your first comment here, but from the second I don't believe Zephyr is internally perceived to be in its early stages. There's a core of Intel engineers who have been working on it for at least four years, and there's a lot of corporate investment and marketing that is inconsistent with accepting a description of it as immature.

Which it is. I spent three months in late 2018 and early 2019 doing a full-time deep evaluation of Zephyr's suitability for applications like you describe on Nordic hardware. By the end I was unable to identify enough stable functionality to support a low-power multi-sensor environmental beacon application. The gaps were in timer/alarm/sleep/power features, GPIO/ADC/I2C, and sensor APIs. I didn't even get to the beacon part.

My conclusion was that Zephyr is a too-big-to-fail initiative that is loved by the people who have invested in it, especially for applications that involve multi-core DSP engines and x86-64-class hardware, but perhaps not so much by people with other needs and perspectives. As @pizi-nordic notes, these issues keep recurring and keep getting dismissed. I don't see a path to evolving Zephyr to be suitable for ultra-low-power sensor applications on Cortex-M-class hardware, but I wish everyone the very best of good luck in proving me wrong.

@leapslabs

Hi All, I am TDK (Kai) and will help my colleague @kubiaj to take over this discussion on behalf of our team. He will be out of the office in the upcoming days.

@pizi-nordic So far I can give these comments

  1. "Get rid of timeout alignment" - yes, I think it is important not to introduce uncertainty. It can use ticks or time, but not something in between.

  2. "Regarding the tick abstraction" - we are new to Zephyr so I am not sure I understand all your ideas; give me a bit of time to digest them. But it is important not to introduce rounding errors in the kernel core. These errors can be introduced at the final HW clock set point, so that the error does not scale.

  3. "Absolute timeouts" - yes, I think this is the right approach. A kind of monotonic timer is needed.

@pabigot The last comments posted by @kubiaj came from me. I think there was a misunderstanding, and apologies if it was offensive. What I meant by "early stage" relates to market adoption. I think Zephyr is still new for the community and it will need time to prove itself mature.

We have been monitoring Zephyr since 2016, and late last year we decided to switch to it. I think it has all the ingredients to become the "Linux" of the RTOS and MCU world. The work that has been done is really amazing, and it is great that there is finally a community project which will make a real difference and has real potential to become a standard. We have tested or used different OS/RTOSes in the past (RTEMS, Contiki, FreeRTOS, eCos, RIOT, mynewt, ..) but I think none really has the potential that Zephyr has. I think, and I hope, that it will become a standard in a couple of years.

Our comments are not meant to undermine the work the Zephyr community has done. We really appreciate what you have done and want to contribute back. Not much, as we are a small team, but I hope it will be helpful.

I would not be skeptical about ultra-low-power sensor applications on Cortex-M class hardware. We have been using eCos in our products. The eCos kernel was a great RTOS, but I think it did not have the investment and marketing force to make it successful. We had to make a couple of changes in its kernel to achieve an ultra-low-power application. In fact we created a kind of tickless mode in eCos with a fixed time base.

So Zephyr is a few steps further along already. But the way the sleep timer in Zephyr is done at the moment is, I think, not correct. We can work toward that, and hopefully in a couple of months we might prove you wrong ;). Let's see.

@andyross
Contributor

andyross commented Apr 1, 2019

Can someone point to an example of an OS that implements timing the way you want? I recognize that I've become sort of a whipping boy for issues of timing of all sorts, so maybe my input isn't really desired here. Some of the criticisms just don't make sense to me (in particular the ones about "tick abstractions" -- in the presence of pervasive tickless, ticks are just a scaling factor, you can set them to whatever rate you want!).

All I can say at this point is just to repeat:

  1. Timing is hard
  2. Zephyr's timing behavior is broadly similar to other OSes that have been successful in the market
  3. Timing is hard

@andyross
Contributor

andyross commented Apr 1, 2019

More broadly, I think I'm going to take myself out of the timer design discussions. The goals of the rewrite in 1.14 were to shrink the code size, simplify driver implementation, unify the kernel timeout handling, fix a bunch of existing bugs (especially on nRF), and make CONFIG_TICKLESS_KERNEL the pervasive default. And as I see it (and yeah, I may be the only one) that's been an outrageous success and I'm very pleased with the result. I'm absolutely on the hook for fixing bugs with that code.

Issues about changing the underlying behavior aren't really in scope for that stuff, may require a third timing rewrite, and quite frankly may be beyond my capability. I genuinely don't see an easy path to getting everyone what they want given the breadth of the comments above. Timing is, y'know, hard.

@andrewboie
Contributor

Is this really a bug?

@rljordan-zz

@carlescufi Does this remain to be a bug?

@andrewboie
Contributor

andrewboie commented Apr 2, 2019

Put another way: This is set to High priority, blocking the upcoming release of 1.14. What are your expectations for resolving this for 1.14, and of what has been discussed here, what falls under wishlist/feature request/enhancements as opposed to actual bugs?

@carlescufi
Member Author

carlescufi commented Apr 2, 2019

@andrewboie @rljordan this is an issue that has come up several times. The idea behind setting it as a high-priority bug was agreed with @nashif so that we would raise awareness about the fact that it impacts users of Nordic boards in general and to discuss a path forward. I am happy to talk this over tomorrow at the TSC and discuss what we can do about it in the short and mid term.
Since this is unlikely to be addressed in any way for 1.14, I am happy to defer the conversation until after the release; I would just like @nashif to weigh in here.

@carlescufi
Member Author

@andyross

Thanks for the feedback.

Can someone point to an example of an OS that implements timing the way you want? I recognize that I've become sort of a whipping boy for issues of timing of all sorts, so maybe my input isn't really desired here.

I am certainly not an expert in the area, so I will defer to @pizi-nordic, @ioannisg and @leapslabs to discuss the fine technical details, but I would certainly disagree with the statement about your input. You are the author of a timer rewrite that you accurately describe as highly successful, and you have an overview of the different requirements of the myriad of platforms supported by the kernel and its timer subsystem. Your input is not only desired, but essential.

Some of the criticisms just don't make sense to me (in particular the ones about "tick abstractions" -- in the presence of pervasive tickless, ticks are just a scaling factor, you can set them to whatever rate you want!).

Does this mean that, as long as timeslicing is not enabled, the precision errors would disappear if we were to set the tick rate to 32768 on our platforms?

@leapslabs

@andyross @andrewboie @rljordan @carlescufi I think we should separate the discussion into 2 issues. One is related to using k_sleep and the rounding errors. I think this one is a bug. This has been observed on nrf52_pca10040 when k_sleep is used:

  • k_sleep(15) - the real sleep is 30 ms
  • k_sleep(10) - the real sleep is 20 ms
  • k_sleep(20) - the real sleep is 30 ms
  • k_sleep(100) - the real sleep is 110 ms
  • k_sleep(1000) - the real sleep is 1010 ms, etc.

I agree timing is hard, but this is IMO a fixable issue. We did not suggest removing the tick abstraction, but the current implementation introduces errors. Please see the first post by @kubiaj. If we compare with RTOSes like VxWorks, FreeRTOS, eCos or RTEMS, they either offer sleep with ticks as the time base or some higher-resolution sleep, down to nanoseconds. My point here is that k_sleep says it sleeps in milliseconds, but the user will not get even close to millisecond resolution because of the design. There are 3 sources of error introduced along the way: the conversion from milliseconds to ticks, the tick rounding, and the real resolution of the HW clock. Can the user really determine the sum of those errors for any given value? This does not sound OK to me.

Regarding the second requirement of high-resolution sleep, I agree it is too difficult to do at this stage. We can park it for now. Andy, perhaps you could point me to some documentation explaining the timer design in more detail, so I could first do a PoC before we have further discussion about this. I hope this might be more helpful for you guys since I understand you are very busy at the moment completing other high-priority stuff. Thanks!

@nashif
Member

nashif commented Apr 2, 2019

@rljordan

@carlescufi Does this remain to be a bug?

yes, we know this is difficult to fix in 1.14 and it will end up transferred to the next release; however, it should not be "hidden" as a future item until we agree on next steps.

@nashif nashif added this to the future milestone Apr 2, 2019
@nashif
Member

nashif commented Apr 2, 2019

moving to a future milestone: Nordic will provide an RFC and a fix to all issues listed after 1.14.

@andrewboie
Contributor

We plan on using a 32 kHz timer for the Microchip boards instead of SYSTICK, so we have an interest in helping resolve these issues as well.

@andrewboie
Contributor

@nashif can we scope this for 2.0?

@nashif nashif modified the milestones: future, v2.0.0 May 21, 2019
@carlescufi
Member Author

This is well in progress for 2.0. The relevant PR in question is #17155 by @andyross

@ioannisg ioannisg added Enhancement Changes/Updates/Additions to existing features priority: high High impact/importance bug and removed priority: high High impact/importance bug bug The issue is a bug, or the PR is fixing a bug labels Aug 20, 2019
@ioannisg ioannisg modified the milestones: v2.0.0, v2.1.0 Aug 20, 2019
@ioannisg
Member

This needs to be pushed to the 2.1 release - there is no point in keeping it as a bug at this stage of the release cycle.

@carlescufi
Member Author

This issue is not really applicable as a roll-up anymore.
