Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drivers: timer: nrf: Optimize k_cycle_get_32(). #9275

Merged
merged 2 commits into from
Aug 13, 2018

Conversation

pizi-nordic
Copy link
Collaborator

The previous implementation of _timer_cycle_get_32() (which is directly
mapped to k_cycle_get_32()) taken from 87 to 132 cycles. As result it
was too heavy for using it as source of time for logger.

This commit makes this function faster by removing redundant access
to the RTC register (each access consumed 28 CPU cycles) as well as
loop, which made this call non-deterministic.

After these changes the k_cycle_get_32() needs only 50-52 cycles
in to calculate 32-bit timestamp.

This PR fixes #8798.

@codecov-io
Copy link

codecov-io commented Aug 3, 2018

Codecov Report

Merging #9275 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #9275   +/-   ##
=======================================
  Coverage   52.65%   52.65%           
=======================================
  Files         202      202           
  Lines       25330    25330           
  Branches     5301     5301           
=======================================
  Hits        13337    13337           
  Misses       9894     9894           
  Partials     2099     2099

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 18cec24...481011a. Read the comment docs.

@carlescufi
Copy link
Member

recheck

RTC_MASK;
rtc_prev = rtc_now;
rtc_now = RTC_COUNTER;
} while (rtc_now != rtc_prev);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop ensures that at any realtime the returned value is not off by 1 system clock tick.

Copy link
Collaborator Author

@pizi-nordic pizi-nordic Aug 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that such "protection" does not give any value:

int get_time()
{
    int time;

/*
 * If RTC_REGISTER changes here, code below will get updated value.
 */
    do {
        time = RTC_REGISTER;
        /*
         * If RTC_REGISTER changes here, the loop will actually delay
         * return from the function and returned value will include
         * effect of the delay.
         */
     } while (time != RTC_REGISTER)

/*
 * If RTC_REGISTER changes here, value returned by this function will
 *  be off by 1 system clock cycle.
 */

return time;
}

In both versions, the one with loop and the one without, there is risk that a "stale" value will be returned. It is unavoidable, as even return from the function takes time and RTC counter could be updated during that period. I personally do not see any benefit from this loop (especially as in this case access to RTC is atomic and there is no risk that we get inconsistent value of the register).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you are not considering is the one clock tick that elapses, and the kernel cannot tolerate a stale clock tick. You can use your implementation on nRF52 boards and run a typical peripheral sample, it will stall eventually when the system clock ticks between the reading of the kernel variable and reading of the RTC counter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you mentioning here problem you pointed there: #9275 (comment) ? If so, please look on my comment posted here: #9275 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand your explanation. except for:
/*

  • If RTC_REGISTER changes here, value returned by this function will
  • be off by 1 system clock tick.
    */

Here the returned value is only off by 1 cycle, not 1 system clock tick.

But as long as, tests cases pass as before, I am ok with the changes (because I am not very well informed on the kernel's use of this function).

Copy link
Collaborator Author

@pizi-nordic pizi-nordic Aug 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true. I had 1 clock cycle in my mind, not tick. Example updated.

ticked_cycles = sys_clock_tick_count * sys_clock_hw_cycles_per_tick;

/* Number of timer cycles since last announced tick we know about. */
elapsed_cycles = rtc_count - (ticked_cycles & RTC_MASK);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rtc_count rollover needs handling. Mask the entire RHS to get the absolute difference.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thank you.
Feel free to look at the updated code.

* outdated at this point. However the value of rtc_count will
* include missed ticks.
*/
rtc_count = RTC_COUNTER;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When _sys_clock_tick_count is incremented by the ISR between lines 559 and 571, then at that realtime, system clock ticks versus the returned value from this function is off by 1 clock tick. (kernel will setup a wake up in the past, causing a hang).

Copy link
Collaborator Author

@pizi-nordic pizi-nordic Aug 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it is compensated by rtc_count. Please note that the _sys_clock_tick_count is read before the RTC_COUNTER register (this order is enforced by compiler barrier). As RTC_COUNTER is not reset on tick, it will include the missed time.

Please look at the following example:

RTC_COUNTER value:...............100..............200...............300
tick:..............................1................2.................3
_sys_clock_tick_count read:                ^
RTC_COUNTER read:                                            ^

In this example we we read _sys_clock_tick_count somewhere between tick 1 and 2. The read value was 1. Then we got an interrupt, which delayed RTC_COUNTER read (which returned 250, as RTC is counting all the time). The final calculation uses cached value of _sys_clock_tick_count (which is stale in this example) and current value of RTC_COUNTER:

ticked_cycles = cached _sys_clock_tick_count * RTC_COUNTS_PER_TICK = 1 * 100 = 100
elapsed_cycles = RTC_COUNT - ticked_cycles = 250 - 100 = 150
(please note that elapsed_cycles is larger than one tick).

total_time = 100 + 150 = 250 (correct)

rtc_now = RTC_COUNTER;
} while (rtc_now != rtc_prev);
/* Get number of announced ticks. */
sys_clock_tick_count = _sys_clock_tick_count;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see much value in having the number of sys ticks in a local variable. Actually, it may be a bit confusing.
What would be wrong in computing timer cycles directly here? Like this:

/* Number of timer cycles announced as ticks so far. */
ticked_cycles = _sys_clock_tick_count * sys_clock_hw_cycles_per_tick;

I would also simplify the following part. To me seeing just this:

/* Make sure that compiler will not reverse access to RTC and
 * _sys_clock_tick_count.
 */
compiler_barrier();

/* Number of timer cycles since last announced tick we know about. */
elapsed_cycles = ((RTC_COUNTER - ticked_cycles) & RTC_MASK);

return ticked_cycles + elapsed_cycles;

would make it easier to get what is actually done here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to explain the whole idea, however you are right that the code was a bit too long for the job.
Feel free to look at the updated code.


/* Number of timer cycles since last announced tick we know about.
*
* The value of RTC_COUNTER is not reset on tick, so it could
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not only it could, but even it will compensate this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

* which could have happen between the ticked_cycles calculation
* and the code below.
*/
elapsed_cycles = (RTC_COUNTER - (ticked_cycles & RTC_MASK)) & RTC_MASK;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really need to do this masking twice? I think that only the final one will be sufficient (the upper bits are discarded anyway, so it doesn't matter if they are all zeros, all ones, or a mix of these).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cvinayak: What is your opinion about this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @anangl, single overall masking is sufficient.

Copy link
Collaborator Author

@pizi-nordic pizi-nordic Aug 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought so, however I wanted to hear your opinion before I do the change. Fixed.

@pizi-nordic pizi-nordic force-pushed the faster_k_cycle_get_32 branch 2 times, most recently from 531c87d to 06321e7 Compare August 6, 2018 13:49
@cvinayak
Copy link
Contributor

cvinayak commented Aug 6, 2018

@pizi-nordic could you check #9310 , which is failing even before your PR over master, this was the test suite that used to fail prior to the changes done for #3665 in #374

@pizi-nordic
Copy link
Collaborator Author

pizi-nordic commented Aug 6, 2018

@cvinayak: #9137 should fix #9310. You also might need changes from #9073.

@cvinayak
Copy link
Contributor

cvinayak commented Aug 6, 2018

@pizi-nordic could you also change the implementation in lines 278, 336, and 379 in a separate commit.

@pizi-nordic
Copy link
Collaborator Author

@cvinayak: Done, please check.

rtc_prev = rtc_now;
rtc_now = RTC_COUNTER;
} while (rtc_prev != rtc_now);
rtc_elapsed = (RTC_COUNTER - rtc_past) & RTC_MASK;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have to read the rtc_past before RTC_COUNTER, just like what you did in the other commit.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Please note, that old code did the opposite (RTC_COUNTER was read before rtc_prev).


rtc_elapsed = (RTC_COUNTER - rtc_past) & RTC_MASK;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, rtc_past needs to be read before RTC_COUNTER, as rtc_past is modified by ISR. Is it ok if rtc_past changes between the statements?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Similarly like in the previous case, the old code did the opposite (RTC_COUNTER was read before rtc_prev).

return (sys_clock_tick_count * sys_clock_hw_cycles_per_tick) +
elapsed_cycles;
/* Number of timer cycles announced as ticks so far. */
ticked_cycles = _sys_clock_tick_count * sys_clock_hw_cycles_per_tick;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I realized that this information was stored in rtc_past, so actually there's no need to do this multiplication here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is subtle difference:

rtc_past = (_sys_clock_tick_count * sys_clock_hw_cycles_per_tick) & RTC_MASK;
ticked_cycles = (_sys_clock_tick_count * sys_clock_hw_cycles_per_tick);

We are doing this multiplication because we want 32-bit timestamp which will not wrap up when RTC do so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I missed it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, ;-)
it seems that rtc_past could be stored without prior masking (then it could be used instead of the multiplication mentioned above), as in all places where it is later used the masking would be done anyway (either by using RTC_MASK or in hardware when rtc_compare_set is called).

But I guess this can be left as a field for future optimizations.


/*Convert number of Machine cycles to SYS TICS*/
/* Convert number of Machine cycles to SYS TICS */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SYS_TICKS. Otherwise, PR LGTM now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

The previous implementation of _timer_cycle_get_32() (which is directly
mapped to k_cycle_get_32()) taken from 87 to 132 cycles. As result it
was too heavy for using it as source of time for logger.

This commit makes this function faster by removing redundant access
to the RTC register (each access consumed 28 CPU cycles) as well as
loop, which made this call non-deterministic.

After these changes the k_cycle_get_32() needs only 50-52 cycles
in to calculate 32-bit timestamp.

Signed-off-by: Piotr Zięcik <[email protected]>
@cvinayak
Copy link
Contributor

cvinayak commented Aug 8, 2018

@pizi-nordic @carlescufi LGTM

/* Read _sys_clock_tick_count and rtc_past before RTC_COUNTER */
elapsed = _sys_clock_tick_count;
rtc_past_copy = rtc_past;
compiler_barrier();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need the compiler_barrier here? should we add an inline comment that justifies it?

Copy link
Collaborator Author

@pizi-nordic pizi-nordic Aug 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to @cvinayak, the rtc_past must be read before RTC_COUNTER. Barrier ensures the correct order.

Comment regarding barrier usage is in other function in this file. I have not seen a reason for placing same comment in other places. Do you think, that it is needed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the problem of placing the same comment in 2 locations in a file. It is really educative in this case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Member

@ioannisg ioannisg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, only few minor comments from my side.

This commit removes redundant reads of RTC_COUNTER register
propagating optimizations made in the k_cycle_get_32() function.

Signed-off-by: Piotr Zięcik <[email protected]>
@nashif nashif merged commit 24abedb into zephyrproject-rtos:master Aug 13, 2018
@pizi-nordic pizi-nordic deleted the faster_k_cycle_get_32 branch August 20, 2018 12:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

k_cycle_get_32() implementation on nrf series is too slow.
7 participants