
kernel: Improve cooperative and preemptive performance as per thread_metric benchmark #81311

Open · wants to merge 9 commits into base: main
Conversation

peter-mitsis (Collaborator) commented:
This set of commits offers improvements to both the cooperative and preemptive results in the thread_metric benchmark.

The following numbers were obtained with the thread_metric benchmark project on the disco_l475_iot1 board. MQ denotes use of CONFIG_SCHED_MULTIQ and DQ denotes use of CONFIG_SCHED_DUMB.

Before:
Time Period Total: 12436655 (Cooperative, MQ)
Time Period Total: 11268922 (Cooperative, DQ)
Time Period Total: 5730607 (Preemptive, MQ)
Time Period Total: 6404080 (Preemptive, DQ)

After:
Time Period Total: 16001868 (Cooperative, MQ) : +28.7%
Time Period Total: 11595554 (Cooperative, DQ) : +2.9%
Time Period Total: 6473166 (Preemptive, MQ) : +12.9%
Time Period Total: 7018286 (Preemptive, DQ) : +9.6%

andyross (Contributor) left a comment:
Some notes. Out of conservatism, my desire to see the halt_thread() refactor slow-walked and split out just barely rises to a -1, simply to give it a little more time to show any stability impacts.


sys_dlist_t *l = &pq->queues[i * NBITS + TRAILING_ZEROS(pq->bitmask[i])];
sys_dnode_t *n = sys_dlist_peek_head(l);
if (likely(index != 0xFFFFFFFF)) {
andyross (Contributor) commented on this snippet:
Is this just for an initialization state? Seems like you could elide that test by just leaving the initial zero, which then would index the highest priority list always (and be caught by the empty-list NULL as expected on an empty queue).

peter-mitsis (Collaborator, Author) replied:
Good idea. The initial state, though, should probably be K_NUM_THREAD_PRIO - 1 (the slot for the lowest-priority thread) so that the update works correctly.

This should also reduce the patch set by 1. :)
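
A minimal sketch of the idea that came out of this exchange (illustrative only; prio_index comes from the diff below, the remaining names are taken from the snippet above):

/* Initialize once to the lowest-priority slot so that the min-style
 * update on every add works without a sentinel test.
 */
pq->prio_index = K_NUM_THREAD_PRIO - 1;

/* Pick: read the cached slot directly; an empty queue is caught by the
 * NULL returned from the peek, as noted above.
 */
sys_dlist_t *l = &pq->queues[pq->prio_index];
sys_dnode_t *n = sys_dlist_peek_head(l);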

@@ -122,6 +122,9 @@ struct _priq_rb {
struct _priq_mq {
sys_dlist_t queues[K_NUM_THREAD_PRIO];
unsigned long bitmask[PRIQ_BITMAP_SIZE];
+#ifndef CONFIG_SMP
+	unsigned int prio_index;
andyross (Contributor):
Naming: "priority index" doesn't mean much. Maybe "lowest_used_prio" or something more descriptive?

peter-mitsis (Collaborator, Author) replied:
I'm currently thinking cached_queue_index

kernel/sched.c (outdated diff):
@@ -158,8 +158,10 @@ static inline bool is_halting(struct k_thread *thread)
/* Clear the halting bits (_THREAD_ABORTING and _THREAD_SUSPENDING) */
static inline void clear_halting(struct k_thread *thread)
{
+#if CONFIG_MP_MAX_NUM_CPUS > 1
andyross (Contributor):
Style: IMHO a regular if() would be cleaner and clearer here.

andyross (Contributor):
Non-style: CONFIG_SMP is the better tunable to test here, not the number of cores. Non-SMP multicore builds are a legal edge case, and you can only run Zephyr threads on one core, so it doesn't need this.

peter-mitsis (Collaborator, Author) replied:
As I understand it, CONFIG_SMP with only one core is also legitimate. Perhaps something like ...

#if defined(CONFIG_SMP) && (CONFIG_MP_MAX_NUM_CPUS > 1)

... or its if () equivalent.
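
For reference, a sketch of what that runtime-if form could look like (assuming Zephyr's IS_ENABLED() helper; the function body is inferred from the comment in the diff above):

static inline void clear_halting(struct k_thread *thread)
{
	if (IS_ENABLED(CONFIG_SMP) && (CONFIG_MP_MAX_NUM_CPUS > 1)) {
		/* Only SMP builds need the halting bits cleared. */
		thread->base.thread_state &=
			~(_THREAD_ABORTING | _THREAD_SUSPENDING);
	}
}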

kernel/sched.c (outdated diff):
}
(void)z_abort_thread_timeout(thread);
unpend_all(&thread->join_queue);
if (likely(new_state == _THREAD_SUSPENDED)) {
andyross (Contributor):
How impactful is this patch to halt_thread()? This code was excruciatingly hard to get right and coming at it with a refactoring hatchet just for a minor performance boost gives me the willies. I don't see anything wrong, but...

I guess my strong preference would be to split this one patch out into a separate PR and give it a ton of testing in isolation.

peter-mitsis (Collaborator, Author) replied:

This one patch gave us a 9.7% performance boost in the thread_metric preemptive benchmark (multiq) on the disco_l475_iot1 board when compared to the numbers from the previous commit.

@@ -36,7 +36,7 @@ struct k_spinlock _sched_spinlock;
__incoherent struct k_thread _thread_dummy;

static ALWAYS_INLINE void update_cache(int preempt_ok);
-static void halt_thread(struct k_thread *thread, uint8_t new_state);
+static ALWAYS_INLINE void halt_thread(struct k_thread *thread, uint8_t new_state);
andyross (Contributor):

If it's always going to be inlined, it shouldn't need a prototype (which, obviously, can't be inlined per the language spec, though in the post-LTO world many compilers are able to do so). Take this out, and if it doesn't build let's fix the declaration order.

peter-mitsis (Collaborator, Author) replied:

In this case, they are both serving as forward declarations to work around the fact that each calls the other.

@@ -1258,7 +1258,7 @@ extern void thread_abort_hook(struct k_thread *thread);
* @param thread Identify the thread to halt
* @param new_state New thread state (_THREAD_DEAD or _THREAD_SUSPENDED)
*/
-static void halt_thread(struct k_thread *thread, uint8_t new_state)
+static ALWAYS_INLINE void halt_thread(struct k_thread *thread, uint8_t new_state)
andyross (Contributor):

I think I count three spots where this medium-sized function is called, so an ALWAYS_INLINE is going to have a non-trivial code size impact. Did you look at how bad it is? My gut says that this is too big to be inlining, but will defer to numbers.

peter-mitsis (Collaborator, Author) replied:

Inlining this code showed a 124 byte code size increase on disco_l475_iot1 and a 7.8% performance boost. On qemu_x86, the code size increased by 128 bytes.

peter-mitsis (Collaborator, Author) commented:

The main changes in this updated patch set are ...

  1. Split out the halt_thread() reordering patch as per Andy's preference, so that it can be its own PR later.
  2. Updated the implementation for caching the queue index in the multi-queue.

TaiJuWu previously approved these changes Nov 20, 2024
peter-mitsis (Collaborator, Author) commented:

Apologies for all the updates. I kept realizing that I had forgotten a small detail to fix each time. This should hopefully be it.

cfriedt previously approved these changes Nov 23, 2024
Minor cleanups include ...
 1. Eliminating unnecessary if-defs and forward declarations
 2. Co-locating routines of the same queue type

Signed-off-by: Peter Mitsis <[email protected]>
Inlines z_sched_prio_cmp() to get better performance.

Signed-off-by: Peter Mitsis <[email protected]>
Dequeuing from a doubly linked list is similar to removing an item
except that it does not re-initialize the dequeued node.

This comes in handy when sorting a doubly linked list (where the node
gets removed and immediately re-added). In that circumstance,
re-initializing the node is not required, since its link pointers are
overwritten when it is re-inserted. Furthermore, the compiler does not
always 'understand' this and cannot optimize the redundant stores away.
Thus, when performance is critical, dequeuing may be preferred to
removing.

Signed-off-by: Peter Mitsis <[email protected]>
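
A self-contained sketch of the distinction (assumed shapes, not the verbatim Zephyr implementation):

/* Unlink the node but leave its next/prev pointers stale; the caller
 * is expected to re-insert (or discard) the node right away.
 */
static inline void sys_dlist_dequeue(sys_dnode_t *node)
{
	sys_dnode_t *prev = node->prev;
	sys_dnode_t *next = node->next;

	prev->next = next;
	next->prev = prev;
}

/* Removing additionally re-initializes the node so that it can later
 * be tested safely for list membership.
 */
static inline void sys_dlist_remove(sys_dnode_t *node)
{
	sys_dlist_dequeue(node);
	sys_dnode_init(node);
}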
Adds routines for setting and clearing the _THREAD_QUEUED
thread_state bit.

Signed-off-by: Peter Mitsis <[email protected]>
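
These presumably reduce to single read-modify-write operations on thread_state; a hypothetical shape (helper names assumed):

static ALWAYS_INLINE void z_mark_thread_as_queued(struct k_thread *thread)
{
	thread->base.thread_state |= _THREAD_QUEUED;
}

static ALWAYS_INLINE void z_mark_thread_as_not_queued(struct k_thread *thread)
{
	thread->base.thread_state &= ~_THREAD_QUEUED;
}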
peter-mitsis (Collaborator, Author) commented:

Rebased to resolve merge conflicts.

Adds customized yield implementations based upon the selected
scheduler (dumb, multiq or scalable). Although each follows the
same broad outline, some of them allow for additional tweaking
to extract maximal performance. For example, the multiq variant
improves the performance of k_yield() by about 20%.

Signed-off-by: Peter Mitsis <[email protected]>
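
As an illustration of the kind of tweak available to the multiq variant (a sketch only; thread_priq_index() is a hypothetical helper, the rest are names already seen in this PR): because the yielding thread stays in its own priority list, it can simply be moved to that list's tail, with no bitmask or cache-index bookkeeping.

static void z_priq_mq_yield(struct _priq_mq *pq, struct k_thread *thread)
{
	sys_dlist_t *l = &pq->queues[thread_priq_index(thread)];

	/* The list still holds this thread, so the bitmask bit for this
	 * priority stays set; a plain dequeue + append suffices.
	 */
	sys_dlist_dequeue(&thread->base.qnode_dlist);
	sys_dlist_append(l, &thread->base.qnode_dlist);
}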
Even though calculating the priority queue index in the priority
multiq is quick, caching it allows us to extract an extra 2% in
terms of performance as measured by the thread_metric cooperative
benchmark.

Signed-off-by: Peter Mitsis <[email protected]>
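
A sketch of the add-side bookkeeping for that cache (illustrative; prio_index, queues and bitmask come from the struct shown earlier, the rest is assumed):

static ALWAYS_INLINE void z_priq_mq_add(struct _priq_mq *pq,
					struct k_thread *thread)
{
	unsigned int i = thread_priq_index(thread); /* hypothetical helper */

	sys_dlist_append(&pq->queues[i], &thread->base.qnode_dlist);
	pq->bitmask[i / NBITS] |= 1UL << (i % NBITS);

	/* A higher-priority (lower-index) queue may have just become
	 * non-empty; keep the cache pointing at the best candidate.
	 */
	if (i < pq->prio_index) {
		pq->prio_index = i;
	}
}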
There is no need for clear_halting() to do anything on UP systems.

Signed-off-by: Peter Mitsis <[email protected]>
Gives a hint to the compiler that the bail-out paths in both
k_thread_suspend() and k_thread_resume() are unlikely events.

Signed-off-by: Peter Mitsis <[email protected]>
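
An illustrative shape of such a hint (the exact bail-out condition is assumed; likely()/unlikely() are the same hints already visible in the diffs above):

void z_impl_k_thread_suspend(struct k_thread *thread)
{
	k_spinlock_key_t key = k_spin_lock(&_sched_spinlock);

	/* Suspending an already-suspended thread is rare, so keep this
	 * early return off the hot path.
	 */
	if (unlikely((thread->base.thread_state & _THREAD_SUSPENDED) != 0U)) {
		k_spin_unlock(&_sched_spinlock, key);
		return;
	}

	/* ... normal suspend path ... */
}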
Inlining these routines helps to improve the performance of
k_thread_suspend().

Signed-off-by: Peter Mitsis <[email protected]>
Labels: area: Base OS (Base OS Library, lib/os), area: Kernel
Projects: None yet
6 participants