kernel: Improve cooperative and preemptive performance as per thread_metric benchmark #81311
Conversation
Some notes. The desire to see that halt_thread() refactor slow-walked and separated just barely rises to a -1 out of conservatism, just to get it a little more time to show stability impacts.
kernel/include/priority_q.h (outdated)

    sys_dlist_t *l = &pq->queues[i * NBITS + TRAILING_ZEROS(pq->bitmask[i])];
    sys_dnode_t *n = sys_dlist_peek_head(l);
    if (likely(index != 0xFFFFFFFF)) {
Is this just for an initialization state? Seems like you could elide that test by just leaving the initial zero, which then would index the highest priority list always (and be caught by the empty-list NULL as expected on an empty queue).
Good idea. The initial state, though, should probably be K_NUM_THREAD_PRIO - 1 (the slot for the lowest priority thread) so that the update works correctly.
This should also reduce the patch set by 1. :)
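To make that concrete, here is a rough sketch of the cached-index idea (not the actual patch: the helper names are made up, the field name matches the prio_index member shown in the kernel_structs.h hunk below, and surrounding priority_q.h definitions such as NBITS are assumed). Starting the cache at K_NUM_THREAD_PRIO - 1 lets the add path maintain it with a plain minimum update and no 0xFFFFFFFF sentinel test:

    /* Sketch only, not the PR code. */
    static ALWAYS_INLINE void priq_mq_init_sketch(struct _priq_mq *pq)
    {
            /* Lowest-priority slot, so the comparison below is always valid. */
            pq->prio_index = K_NUM_THREAD_PRIO - 1;
    }

    static ALWAYS_INLINE void priq_mq_add_sketch(struct _priq_mq *pq,
                                                  struct k_thread *thread,
                                                  unsigned int queue_index)
    {
            sys_dlist_append(&pq->queues[queue_index], &thread->base.qnode_dlist);
            pq->bitmask[queue_index / NBITS] |= BIT(queue_index % NBITS);

            /* No "uninitialized" sentinel needed: the cache always holds a valid index. */
            if (queue_index < pq->prio_index) {
                    pq->prio_index = queue_index;
            }
    }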
include/zephyr/kernel_structs.h (outdated)

@@ -122,6 +122,9 @@ struct _priq_rb {
struct _priq_mq {
    sys_dlist_t queues[K_NUM_THREAD_PRIO];
    unsigned long bitmask[PRIQ_BITMAP_SIZE];
#ifndef CONFIG_SMP
    unsigned int prio_index;
Naming: "priority index" doesn't mean much. Maybe "lowest_used_prio" or something more descriptive?
I'm currently thinking cached_queue_index
kernel/sched.c (outdated)

@@ -158,8 +158,10 @@ static inline bool is_halting(struct k_thread *thread)
/* Clear the halting bits (_THREAD_ABORTING and _THREAD_SUSPENDING) */
static inline void clear_halting(struct k_thread *thread)
{
#if CONFIG_MP_MAX_NUM_CPUS > 1
Style: IMHO a regular if() would be cleaner and clearer here.
Non-style: CONFIG_SMP is the better tunable to test here, not the number of cores. Non-SMP multicore builds are a legal edge case, and in that configuration Zephyr threads can only run on one core, so it doesn't need this.
As I understand it, CONFIG_SMP with only 1 core is also legit. Perhaps something like #if defined(CONFIG_SMP) && (CONFIG_MP_MAX_NUM_CPUS > 1), or its if () equivalent.
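For illustration, a minimal sketch of what the if ()-based form of clear_halting() could look like with that combined condition (this is not the code in the PR; the body here only clears the two halting bits and ignores any memory-ordering details the real function may need):

    static inline void clear_halting(struct k_thread *thread)
    {
            /* Compile-time constant condition: the body is dropped entirely on
             * non-SMP or single-CPU builds, matching the #if variant above.
             */
            if (IS_ENABLED(CONFIG_SMP) && (CONFIG_MP_MAX_NUM_CPUS > 1)) {
                    thread->base.thread_state &= ~(_THREAD_ABORTING | _THREAD_SUSPENDING);
            }
    }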
kernel/sched.c (outdated)

    }
    (void)z_abort_thread_timeout(thread);
    unpend_all(&thread->join_queue);
    if (likely(new_state == _THREAD_SUSPENDED)) {
How impactful is this patch to halt_thread()? This code was excruciatingly hard to get right and coming at it with a refactoring hatchet just for a minor performance boost gives me the willies. I don't see anything wrong, but...
I guess my strong preference would be to split this one patch out into a separate PR and give it a ton of testing in isolation.
This one patch gave us a 9.7% performance boost in the thread_metric preemptive benchmark (multiq) on the disco_l475_iot1 board when compared to the numbers from the previous commit.
@@ -36,7 +36,7 @@ struct k_spinlock _sched_spinlock;
 __incoherent struct k_thread _thread_dummy;

 static ALWAYS_INLINE void update_cache(int preempt_ok);
-static void halt_thread(struct k_thread *thread, uint8_t new_state);
+static ALWAYS_INLINE void halt_thread(struct k_thread *thread, uint8_t new_state);
If it's always going to be inlined, it shouldn't need a prototype (which, obviously, can't be inlined per the language spec, though in the post-LTO world many compilers are able to do so). Take this out, and if it doesn't build let's fix the declaration order.
In this case, they are both serving as forward declarations to work around the fact that each calls the other.
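As a generic illustration of that situation (nothing Zephyr-specific here), two static functions that call each other need at least one forward declaration no matter how they are ordered, inlined or not:

    /* Generic example, not kernel/sched.c: mutual calls force a forward
     * declaration even though both definitions live in the same file.
     */
    static void helper_b(int depth);    /* forward declaration: helper_a() calls it */

    static void helper_a(int depth)
    {
            if (depth > 0) {
                    helper_b(depth - 1);
            }
    }

    static void helper_b(int depth)
    {
            if (depth > 0) {
                    helper_a(depth - 1);
            }
    }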
@@ -1258,7 +1258,7 @@ extern void thread_abort_hook(struct k_thread *thread);
  * @param thread Identify the thread to halt
  * @param new_state New thread state (_THREAD_DEAD or _THREAD_SUSPENDED)
  */
-static void halt_thread(struct k_thread *thread, uint8_t new_state)
+static ALWAYS_INLINE void halt_thread(struct k_thread *thread, uint8_t new_state)
I think I count three spots where this medium-sized function is called, so an ALWAYS_INLINE is going to have a non-trivial code size impact. Did you look at how bad it is? My gut says that this is too big to be inlining, but will defer to numbers.
Inlining this code showed a 124 byte code size increase on disco_l475_iot1 and a 7.8% performance boost. On qemu_x86, the code size increased by 128 bytes.
Force-pushed from 26c3180 to f683c61.
The main changes in this updated patch set are ...
Force-pushed from f683c61 to 82cecec.
Force-pushed from 82cecec to 735f935.
Force-pushed from 735f935 to d15951f.
Force-pushed from d15951f to 69a0656.
Apologies for all the updates. I kept realizing that I had forgotten a small detail to fix each time. This should hopefully be it.
Minor cleanups include ...
1. Eliminating unnecessary if-defs and forward declarations
2. Co-locating routines of the same queue type
Signed-off-by: Peter Mitsis <[email protected]>
Inlines z_sched_prio_cmp() to get better performance.
Signed-off-by: Peter Mitsis <[email protected]>
Dequeuing from a doubly linked list is similar to removing an item except that it does not re-initialize the dequeued node. This comes in handy when sorting a doubly linked list (where the node gets removed and immediately re-added), in which case re-initializing the node is not required. Furthermore, the compiler does not always 'understand' this and optimize the redundant re-initialization away. Thus, when performance is critical, dequeuing may be preferred to removing.
Signed-off-by: Peter Mitsis <[email protected]>
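A sketch of what such a dequeue helper might look like (the name and exact shape are assumptions based on the commit message above, not a quote of the patch); it unlinks the node but skips the re-initialization that sys_dlist_remove() performs:

    /* Assumed helper: unlink without re-initializing the node. */
    static inline void sys_dlist_dequeue(sys_dnode_t *node)
    {
            sys_dnode_t *prev = node->prev;
            sys_dnode_t *succ = node->next;

            prev->next = succ;
            succ->prev = prev;
            /* No sys_dnode_init(node) here: a caller that immediately re-adds
             * the node would only throw that work away.
             */
    }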
Adds routines for setting and clearing the _THREAD_QUEUED thread_state bit.
Signed-off-by: Peter Mitsis <[email protected]>
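A minimal sketch of what such helpers could look like (the names are assumptions, not necessarily what the patch uses); they simply set or clear the _THREAD_QUEUED bit in the thread's state field:

    /* Hypothetical helper names; sketch only. */
    static ALWAYS_INLINE void mark_thread_as_queued(struct k_thread *thread)
    {
            thread->base.thread_state |= _THREAD_QUEUED;
    }

    static ALWAYS_INLINE void mark_thread_as_not_queued(struct k_thread *thread)
    {
            thread->base.thread_state &= ~_THREAD_QUEUED;
    }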
Force-pushed from 69a0656 to d8a509d.
Rebased to resolve merge conflicts.
Adds customized yield implementations based upon the selected scheduler (dumb, multiq or scalable). Although each follows the same broad outline, some of them allow for additional tweaking to extract maximal performance. For example, the multiq variant improves the performance of k_yield() by about 20%.
Signed-off-by: Peter Mitsis <[email protected]>
Even though calculating the priority queue index in the priority multiq is quick, caching it allows us to extract an extra 2% in terms of performance as measured by the thread_metric cooperative benchmark.
Signed-off-by: Peter Mitsis <[email protected]>
There is no need for clear_halting() to do anything on UP systems.
Signed-off-by: Peter Mitsis <[email protected]>
Gives a hint to the compiler that the bail-out paths in both k_thread_suspend() and k_thread_resume() are unlikely events.
Signed-off-by: Peter Mitsis <[email protected]>
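As a generic illustration of that kind of hint (not the actual diff), the early-return check gets wrapped in unlikely() so the compiler lays out the common path first:

    /* Illustration only: tell the compiler the bail-out is the rare case. */
    if (unlikely((thread->base.thread_state & _THREAD_SUSPENDED) != 0U)) {
            /* Nothing to do; bail out early. */
            return;
    }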
Inlining these routines helps to improve the performance of k_thread_suspend().
Signed-off-by: Peter Mitsis <[email protected]>
Force-pushed from d8a509d to cdbcd22.
This set of commits offers improvements to both the cooperative and preemptive results in the thread_metric benchmark.
The following numbers were obtained with the thread_metric benchmark project on the disco_l475_iot1 board. MQ denotes use of CONFIG_SCHED_MULTIQ and DQ denotes use of CONFIG_SCHED_DUMB.
Configuration      Time Period Total (before)   Time Period Total (after)   Change
Cooperative, MQ    12436655                     16001868                    +28.7 %
Cooperative, DQ    11268922                     11595554                    +2.9 %
Preemptive, MQ     5730607                      6473166                     +12.9 %
Preemptive, DQ     6404080                      7018286                     +9.6 %