kernel: Improve cooperative and preemptive performance as per thread_metric benchmark #81311
Conversation
Some notes. The desire to see that halt_thread() refactor slow-walked and separated just barely rises to a -1 out of conservatism, just to get it a little more time to show stability impacts.
kernel/include/priority_q.h (outdated)

    sys_dlist_t *l = &pq->queues[i * NBITS + TRAILING_ZEROS(pq->bitmask[i])];
    sys_dnode_t *n = sys_dlist_peek_head(l);
    if (likely(index != 0xFFFFFFFF)) {
Is this just for an initialization state? Seems like you could elide that test by just leaving the initial zero, which then would index the highest priority list always (and be caught by the empty-list NULL as expected on an empty queue).
Good idea. The initial state, though, should probably be K_NUM_THREAD_PRIO - 1 (the slot for the lowest priority thread) so that the update works correctly.
This should also reduce the patch set by 1. :)
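To make that concrete, here is a rough sketch of the cached-index idea (not the actual patch: the helper names are made up, the field name matches the prio_index member shown in the kernel_structs.h hunk below, and surrounding priority_q.h definitions such as NBITS are assumed). Starting the cache at K_NUM_THREAD_PRIO - 1 lets the add path maintain it with a plain minimum update and no 0xFFFFFFFF sentinel test:

    /* Sketch only, not the PR code. */
    static ALWAYS_INLINE void priq_mq_init_sketch(struct _priq_mq *pq)
    {
            /* Lowest-priority slot, so the comparison below is always valid. */
            pq->prio_index = K_NUM_THREAD_PRIO - 1;
    }

    static ALWAYS_INLINE void priq_mq_add_sketch(struct _priq_mq *pq,
                                                  struct k_thread *thread,
                                                  unsigned int queue_index)
    {
            sys_dlist_append(&pq->queues[queue_index], &thread->base.qnode_dlist);
            pq->bitmask[queue_index / NBITS] |= BIT(queue_index % NBITS);

            /* No "uninitialized" sentinel needed: the cache always holds a valid index. */
            if (queue_index < pq->prio_index) {
                    pq->prio_index = queue_index;
            }
    }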
include/zephyr/kernel_structs.h (outdated)

@@ -122,6 +122,9 @@ struct _priq_rb {
struct _priq_mq {
    sys_dlist_t queues[K_NUM_THREAD_PRIO];
    unsigned long bitmask[PRIQ_BITMAP_SIZE];
#ifndef CONFIG_SMP
    unsigned int prio_index;
Naming: "priority index" doesn't mean much. Maybe "lowest_used_prio" or something more descriptive?
I'm currently thinking cached_queue_index
kernel/sched.c (outdated)

@@ -158,8 +158,10 @@ static inline bool is_halting(struct k_thread *thread)
/* Clear the halting bits (_THREAD_ABORTING and _THREAD_SUSPENDING) */
static inline void clear_halting(struct k_thread *thread)
{
#if CONFIG_MP_MAX_NUM_CPUS > 1
Style: IMHO a regular if() would be cleaner and clearer here.
Non-style: CONFIG_SMP is the better tunable to test here, not the number of cores. Non-SMP multicore builds are a legal edge case, and in that configuration Zephyr threads can only run on one core, so it doesn't need this.
As I understand it, CONFIG_SMP with only 1 core is also legit. Perhaps something like #if defined(CONFIG_SMP) && (CONFIG_MP_MAX_NUM_CPUS > 1), or its if () equivalent.
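For illustration, a minimal sketch of what the if ()-based form of clear_halting() could look like with that combined condition (this is not the code in the PR; the body here only clears the two halting bits and ignores any memory-ordering details the real function may need):

    static inline void clear_halting(struct k_thread *thread)
    {
            /* Compile-time constant condition: the body is dropped entirely on
             * non-SMP or single-CPU builds, matching the #if variant above.
             */
            if (IS_ENABLED(CONFIG_SMP) && (CONFIG_MP_MAX_NUM_CPUS > 1)) {
                    thread->base.thread_state &= ~(_THREAD_ABORTING | _THREAD_SUSPENDING);
            }
    }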
kernel/sched.c (outdated)

    }
    (void)z_abort_thread_timeout(thread);
    unpend_all(&thread->join_queue);
    if (likely(new_state == _THREAD_SUSPENDED)) {
How impactful is this patch to halt_thread()? This code was excruciatingly hard to get right and coming at it with a refactoring hatchet just for a minor performance boost gives me the willies. I don't see anything wrong, but...
I guess my strong preference would be to split this one patch out into a separate PR and give it a ton of testing in isolation.
This one patch gave us a 9.7% performance boost in the thread_metric preemptive benchmark (multiq) on the disco_l475_iot1 board when compared to the numbers from the previous commit.
@@ -36,7 +36,7 @@ struct k_spinlock _sched_spinlock;
 __incoherent struct k_thread _thread_dummy;

 static ALWAYS_INLINE void update_cache(int preempt_ok);
-static void halt_thread(struct k_thread *thread, uint8_t new_state);
+static ALWAYS_INLINE void halt_thread(struct k_thread *thread, uint8_t new_state);
If it's always going to be inlined, it shouldn't need a prototype (which, obviously, can't be inlined per the language spec, though in the post-LTO world many compilers are able to do so). Take this out, and if it doesn't build let's fix the declaration order.
In this case, they are both serving as forward declarations to work around the fact that each calls the other.
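As a generic illustration of that situation (nothing Zephyr-specific here), two static functions that call each other need at least one forward declaration no matter how they are ordered, inlined or not:

    /* Generic example, not kernel/sched.c: mutual calls force a forward
     * declaration even though both definitions live in the same file.
     */
    static void helper_b(int depth);    /* forward declaration: helper_a() calls it */

    static void helper_a(int depth)
    {
            if (depth > 0) {
                    helper_b(depth - 1);
            }
    }

    static void helper_b(int depth)
    {
            if (depth > 0) {
                    helper_a(depth - 1);
            }
    }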
@@ -1258,7 +1258,7 @@ extern void thread_abort_hook(struct k_thread *thread);
  * @param thread Identify the thread to halt
  * @param new_state New thread state (_THREAD_DEAD or _THREAD_SUSPENDED)
  */
-static void halt_thread(struct k_thread *thread, uint8_t new_state)
+static ALWAYS_INLINE void halt_thread(struct k_thread *thread, uint8_t new_state)
I think I count three spots where this medium-sized function is called, so an ALWAYS_INLINE is going to have a non-trivial code size impact. Did you look at how bad it is? My gut says that this is too big to be inlining, but will defer to numbers.
Inlining this code showed a 124 byte code size increase on disco_l475_iot1 and a 7.8% performance boost. On qemu_x86, the code size increased by 128 bytes.
Force-pushed from 26c3180 to f683c61.
The main changes in this updated patch set are ...
Force-pushed from f683c61 to 82cecec.
Force-pushed from 82cecec to 735f935.
Force-pushed from 735f935 to d15951f.
Force-pushed from d15951f to 69a0656.
Apologies for all the updates. I kept realizing that I had forgotten a small detail to fix each time. This should hopefully be it.
Minor cleanups include ...
1. Eliminating unnecessary if-defs and forward declarations
2. Co-locating routines of the same queue type
Signed-off-by: Peter Mitsis <[email protected]>
Inlines z_sched_prio_cmp() to get better performance.
Signed-off-by: Peter Mitsis <[email protected]>
Dequeuing from a doubly linked list is similar to removing an item except that it does not re-initialize the dequeued node. This comes in handy when sorting a doubly linked list (where the node gets removed and immediately re-added), in which case re-initializing the node is not required. Furthermore, the compiler does not always 'understand' this and optimize the redundant re-initialization away. Thus, when performance is critical, dequeuing may be preferred to removing.
Signed-off-by: Peter Mitsis <[email protected]>
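A sketch of what such a dequeue helper might look like (the name and exact shape are assumptions based on the commit message above, not a quote of the patch); it unlinks the node but skips the re-initialization that sys_dlist_remove() performs:

    /* Assumed helper: unlink without re-initializing the node. */
    static inline void sys_dlist_dequeue(sys_dnode_t *node)
    {
            sys_dnode_t *prev = node->prev;
            sys_dnode_t *succ = node->next;

            prev->next = succ;
            succ->prev = prev;
            /* No sys_dnode_init(node) here: a caller that immediately re-adds
             * the node would only throw that work away.
             */
    }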
Adds routines for setting and clearing the _THREAD_QUEUED thread_state bit.
Signed-off-by: Peter Mitsis <[email protected]>
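A minimal sketch of what such helpers could look like (the names are assumptions, not necessarily what the patch uses); they simply set or clear the _THREAD_QUEUED bit in the thread's state field:

    /* Hypothetical helper names; sketch only. */
    static ALWAYS_INLINE void mark_thread_as_queued(struct k_thread *thread)
    {
            thread->base.thread_state |= _THREAD_QUEUED;
    }

    static ALWAYS_INLINE void mark_thread_as_not_queued(struct k_thread *thread)
    {
            thread->base.thread_state &= ~_THREAD_QUEUED;
    }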
Force-pushed from 69a0656 to d8a509d.
Rebased to resolve merge conflicts.
Adds customized yield implementations based upon the selected scheduler (dumb, multiq or scalable). Although each follows the same broad outline, some of them allow for additional tweaking to extract maximal performance. For example, the multiq variant improves the performance of k_yield() by about 20%.
Signed-off-by: Peter Mitsis <[email protected]>
Even though calculating the priority queue index in the priority multiq is quick, caching it allows us to extract an extra 2% in terms of performance as measured by the thread_metric cooperative benchmark.
Signed-off-by: Peter Mitsis <[email protected]>
There is no need for clear_halting() to do anything on UP systems.
Signed-off-by: Peter Mitsis <[email protected]>
Gives a hint to the compiler that the bail-out paths in both k_thread_suspend() and k_thread_resume() are unlikely events.
Signed-off-by: Peter Mitsis <[email protected]>
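As a generic illustration of that kind of hint (not the actual diff), the early-return check gets wrapped in unlikely() so the compiler lays out the common path first:

    /* Illustration only: tell the compiler the bail-out is the rare case. */
    if (unlikely((thread->base.thread_state & _THREAD_SUSPENDED) != 0U)) {
            /* Nothing to do; bail out early. */
            return;
    }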
Inlining these routines helps to improve the performance of k_thread_suspend().
Signed-off-by: Peter Mitsis <[email protected]>
Force-pushed from d8a509d to cdbcd22.
This set of commits offers improvements to both the cooperative and preemptive results in the thread_metric benchmark.
The following numbers were obtained with the thread_metric benchmark project on the disco_l475_iot1 board. MQ denotes use of CONFIG_SCHED_MULTIQ and DQ denotes use of CONFIG_SCHED_DUMB.
Configuration      Time Period Total (before)   Time Period Total (after)   Change
Cooperative, MQ    12436655                     16001868                    +28.7 %
Cooperative, DQ    11268922                     11595554                    +2.9 %
Preemptive, MQ     5730607                      6473166                     +12.9 %
Preemptive, DQ     6404080                      7018286                     +9.6 %