Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kernel/sched: Correct k_sleep() return value when a 32-bit tick wraparound occurs #82272

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

akasona
Copy link

@akasona akasona commented Nov 28, 2024

Fixes #79863

The expected_wakeup_ticks and sys_clock_tick_get_32() are uint32_t values, and may wrap around individually.
If the expected_wakeup_ticks has a wraparound and sys_clock_tick_get_32() doesn't, so expected_wakeup_ticks < sys_clock_tick_get_32(), the API return value will be corrupted.

Example: try to sleep 1000 ticks, but wakeup occurs in 10 ticks

sys_clock_tick_get_32() before sleep expected_wakeup_ticks sys_clock_tick_get_32() after wakeup return value (remaining ticks)
10000 11000 10010 990 (OK)
0xffffffff (4,294,967,295) 0x000003e7 (999) ↩️ 0x00000009 (9) ↩️ 990 (OK)
0xffffff00 (4,294,967,040) 0x000002eb (744) ↩️ 0xFFFFFF0A (4,294,967,050) -4,294,966,300 😱 (returns 0)

The API return value should be calculated in 32bit-unsigned-integer manner, and any wraparound will be treated properly.

sys_clock_tick_get_32() before sleep expected_wakeup_ticks sys_clock_tick_get_32() after wakeup return value (remaining ticks)
10000 11000 10010 990 (OK)
0xffffffff (4,294,967,295) 0x000003e7 (999) ↩️ 0x00000009 (9) ↩️ 990 (OK)
0xffffff00 (4,294,967,040) 0x000002eb (744) ↩️ 0xFFFFFF0A (4,294,967,050) 990 (OK) ↩️

Review Request

  • I add a test to reproduce the issue (it failed first), and the fixed code passed the test. The test is very very slow. May I commit the test to the repository?

Copy link

Hello @akasona, and thank you very much for your first pull request to the Zephyr project!
Our Continuous Integration pipeline will execute a series of checks on your Pull Request commit messages and code, and you are expected to address any failures by updating the PR. Please take a look at our commit message guidelines to find out how to format your commit messages, and at our contribution workflow to understand how to update your Pull Request. If you haven't already, please make sure to review the project's Contributor Expectations and update (by amending and force-pushing the commits) your pull request if necessary.
If you are stuck or need help please join us on Discord and ask your question there. Additionally, you can escalate the review when applicable. 😊

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Pretty sure the fix can be simpler. Also, yeah, that test can't really go in as written (though on the non-SMP qemu platforms it should evaluate quickly as they have non-realtime simulated time). Testing this for real probably requires adding a way to white-box an offset into the system uptime counter so that you can start near rollover.

ticks = (k_ticks_t)expected_wakeup_ticks - sys_clock_tick_get_32();
uint32_t left_ticks = expected_wakeup_ticks - sys_clock_tick_get_32();

ticks = (k_ticks_t)(int32_t)left_ticks;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with the fix, but this can be expressed cleaner:

ticks = expected_wakeup_ticks - sys_clock_tick_get_32();

Basically the bug here (unless I'm misunderstanding) is just putting the typecast in the wrong place. The correct time delta requires a 32 bit subtraction, but we were casting the first argument to a int64_t (usually, k_ticks_t can be configured to be 32 bit but that isn't default) and so the rollover presented a negative 64 bit result.

The cast doesn't need to be there at all, AFAICT. Just let the expression evaluate in the (unsigned 32 bit) type of the operands and then assign itself to the 64 bit ticks variable after evaluation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, "requires a 32 bit unsigned subtraction", that is.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @andyross, I have to add more explanation.

I think there should be (int32_t) casting, because sys_clock_tick_get_32() can exceed expected_wakeup_ticks in the timeout case.

Case 1. sys_clock_tick_get_32() equals expected_wakeup_ticks after wakeup.
It may be timeout case without k_wakeup() call.

ticks = expected_wakeup_ticks - sys_clock_tick_get_32()  /* maybe 0, no problem */

Case 2. sys_clock_tick_get_32() exceeds expected_wakeup_ticks after wakeup.
The k_sleep() caller thread can be suspended by higher priority threads after wakeup.

ticks = expected_wakeup_ticks - sys_clock_tick_get_32()  /* maybe very very big positive number */

Solution:

uint32_t left_ticks = expected_wakeup_ticks - sys_clock_tick_get_32();  /* first, make a 32 bit unsigned subtraction */
int32_t signed_left_ticks = (int32_t)left_ticks;  /* for handling the negative value correctly */
ticks = (k_ticks_t)signed_left_ticks;  /* force return value type */

I put the last casting for avoiding implicit type conversion, and this PR version is intended to highlight a 32 bit unsigned subtraction. If I write down the code in the simplest way:

ticks = (int32_t)(expected_wakeup_ticks - sys_clock_tick_get_32());

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andyross May I put some comments to the code?

        /* We require a 32 bit unsigned subtraction to care a wraparound */
	uint32_t left_ticks = expected_wakeup_ticks - sys_clock_tick_get_32();

        /* To handle a negative value correctly, once type-cast it to signed 32 bit */
	ticks = (k_ticks_t)(int32_t)left_ticks;

@akasona akasona force-pushed the fix_k_sleep_incorrect_return_value branch 2 times, most recently from be4e61e to d284de1 Compare December 1, 2024 15:20
Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'm convinced. The only remaining squawk is that the commit order is backwards, you need to fix the bug before adding the test, otherwise the test gets committed in a failing state and we have a bisection problem in the tree.

Feel free to assume +1 from me once that's addressed.

static const uint32_t timeout_ticks = 300; /* 3 seconds @ 100Hz tick */
static const uint32_t wakeup_ticks = 10; /* 100 ms @ 100Hz tick */

sys_clock_tick_set(start_ticks);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah, nice. I didn't know this API existed[1]. Looks like you aren't the first person needing to white-box the tick value, @cfriedt beat you to it in commit 4108e14.

[1] Yes, the astute will point out that I did in fact review that PR...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm lucky.
I will arrange commit order after fixing up the test. Please wait a little longer...

andyross
andyross previously approved these changes Dec 2, 2024
Fix the issue: zephyrproject-rtos#79863

The expected_wakeup_ticks and sys_clock_tick_get_32() are uint32_t values,
and may wrap around individually.
If the expected_wakeup_ticks has a wraparound and sys_clock_tick_get_32()
doesn't, so expected_wakeup_ticks < sys_clock_tick_get_32(), the API return
value will be corrupted.

The API return value, that is the remaining time, should be calculated in
32bit-unsigned-integer manner, and any wraparound will be treated properly.

Signed-off-by: Akaiwa Wataru <[email protected]>
This test reproduce the issue: zephyrproject-rtos#79863

In this test, instead of waiting the 32-bit tick wraparound for several
days, patching the kernel internal tick with the test API.

Signed-off-by: Akaiwa Wataru <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

k_sleep() returns incorrect remaining time 12 days after booting
4 participants