Improve performance of unstable sort #104116
At its core it leverages branchless compare-and-swap operations employed in optimal sorting networks. A variety of strategies are used to optimize for hot and cold runtime, binary size, and the maximum number of comparisons done. Many patterns see a reduction in the average number of comparisons performed, so this improvement is applied to all types that are deemed cheap to move. It copies parts of the stable sort; before this gets merged they ought to be unified.
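As a rough illustration of what a branchless compare-and-swap inside a sorting network looks like, here is a minimal sketch. It is not the code from this PR: the function names, the `Copy + PartialOrd` bound, and the choice of a 4-element network are assumptions made to keep the example short. Whether the selects compile to conditional moves is ultimately up to the optimizer.

```rust
/// Orders `v[a]` and `v[b]` so that `v[a] <= v[b]` afterwards.
/// Both slots are always written, so there is no data-dependent branch
/// around the stores; for cheap `Copy` types the two selects usually
/// become conditional moves (an expectation, not a guarantee).
fn swap_if_less<T: Copy + PartialOrd>(v: &mut [T], a: usize, b: usize) {
    let (x, y) = (v[a], v[b]);
    let should_swap = y < x;
    v[a] = if should_swap { y } else { x };
    v[b] = if should_swap { x } else { y };
}

/// Sorts exactly four elements with a 5-comparator, 3-layer network.
/// Comparators within one layer touch disjoint indices, so they do not
/// depend on each other and can execute in parallel on the CPU.
fn sort4_network<T: Copy + PartialOrd>(v: &mut [T]) {
    assert_eq!(v.len(), 4);
    // Layer 1
    swap_if_less(v, 0, 1);
    swap_if_less(v, 2, 3);
    // Layer 2
    swap_if_less(v, 0, 2);
    swap_if_less(v, 1, 3);
    // Layer 3
    swap_if_less(v, 1, 2);
}
```

The sequence of comparisons is fixed up front and does not depend on the data, which is what distinguishes a sorting network from insertion sort.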
r? @m-ou-se (rustbot has picked a reviewer for you, use r? to override)
If this is about the core/alloc split then you can expose them as an unstable, hidden feature without tracking issue and use that feature in the other crate.
Generic methods are always eligible for inlining because they are only instantiated once the concrete type is known.
Looking at the linked page it looks like sorting networks optimize for short dependency chains, which should be good for ILP. But is there room for optimizing for cache effects too? I guess that's more of a research question than something that can be done in a PR.
FYI the CI fails because of a TODO comment in the code. That TODO is a question for the authors here, it would be relatively easy to re-use |
…homcc Unify stable and unstable sort implementations in same core module This moves the stable sort implementation to the core::slice::sort module. By virtue of being in core it can't access `Vec`. The two `Vec` used by merge sort, `buf` and `runs`, are modelled as custom types that implement the very limited required `Vec` interface with the help of provided allocation and free functions. This is done to allow future re-use of functions and logic between stable and unstable sort. Such as `insert_head`. This is in preparation of rust-lang#100856 and rust-lang#104116. It only moves code, it *doesn't* change any of the sort related logic. This unlocks the ability to share `insert_head`, `insert_tail`, `swap_if_less` `merge` and more. Tagging ``@Mark-Simulacrum`` I hope this allows progress on rust-lang#100856, by moving `merge_sort` here I hope future changes will be easier to review.
Closing for now, work will be resumed in other PRs. Most of the information in here is obsolete now, and I've gotten a lot further in optimising.
This is a followup to #100856, this time speeding up `slice::sort_unstable`. Fundamentally it uses optimal sorting networks to speed up sorting small slices. Before going into too much detail on the speedup, the most important thing is correctness. It passes my test suite in https://github.com/Voultapher/sort-research-rs both with normal Rust and Miri. That includes tests that are not found, or not found with the same rigor, in the standard library tests: `observable_is_less`, `panic_retain_original_set` and `violate_ord_retain_original_set`. From a code structure point of view, it copies several elements from `slice::sort`. They live in separate modules, and I don't know enough about the structure of the standard library to unify them. In essence I could imagine them both living in core, with stable sorting requiring a passed-in function that does the allocation. But even then I'm not sure how that affects LTO and inlining, which are critical for performance. In my repository I have all the implementations copied into individual modules, and I'm not sure how to test that in the standard library.
Speedups:
To understand this PR, it's advisable to read #100856. There I go into more detail on the test methodology and graphs.
The full benchmark results can be found here: https://voultapher.github.io/sort-research-rs/results/e746538/
Here are the speedups and slowdowns for Zen3:
For hot-u64 we can see that the speedups are the most extreme for smaller sizes and level out at 30%.
For hot-string, which is relatively expensive to access, we see it level out at 4% while seeing a 10-15% speedup for smaller sizes, and a 3x speedup in the most extreme case, hot-string-descending-20. These results are in line with the average reduced comparison counts, and the extreme speedups are explained by the ability to detect fully or mostly decreasing inputs even for small inputs. For smaller sizes there is also a noticeable number of slowdowns. I'd argue that the overall speedup is worth it here, but this can be tuned with the `qualifies_for_branchless_sort` heuristic.
The 1k type generally produces the noisiest results, so I'm not sure this is a real signal. But it might be thanks to the tuned insert right and left functions.
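To make the role of the `qualifies_for_branchless_sort` heuristic more concrete, here is a minimal sketch of a size-based gate. Only the function name comes from this PR; the body and in particular the four-word cutoff are assumptions for illustration and may not match the actual implementation.

```rust
use std::mem;

/// Hypothetical stand-in for the heuristic named in the PR description.
/// Assumption: a type counts as "cheap to move" if it is no larger than
/// four machine words. The real cutoff may differ.
const fn qualifies_for_branchless_sort<T>() -> bool {
    mem::size_of::<T>() <= mem::size_of::<[usize; 4]>()
}
```

Under this assumed cutoff, integers and `String` (three machine words) would take the branchless path, while a 1 KiB value type would fall back to the existing code. Because the function is `const` and depends only on `T`, the check is resolved per instantiation at compile time and the unused path can be removed by the compiler.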
The cheap to access but expensive to compare type f128 shows a modest speedup of 10-20%, with descending inputs again showing the largest outliers at a 5x speedup for descending-20, which switches from insertion sort to the more sophisticated `sort_small`.
The cold results don't look too hot for small sizes <= 20; these could be addressed by adding extra logic. But even then it's a tricky balance, where something like pattern analysis is really hard if not impossible to do without some slowdown for cold code. Also note, I'm not too sure about my methodology for cold code here.
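The outsized speedups on descending inputs mentioned above come from recognizing the pattern before doing any real sorting work. The following is a simplified sketch of that idea, not the `sort_small` from this PR: the function names are hypothetical, it only handles the fully descending case, and it falls back to a plain insertion sort.

```rust
/// Returns true if every element is strictly less than its predecessor.
/// Stops at the first pair that violates this, so non-descending inputs
/// usually pay for only one or two extra comparisons.
fn is_strictly_descending<T, F>(v: &[T], is_less: &mut F) -> bool
where
    F: FnMut(&T, &T) -> bool,
{
    v.windows(2).all(|w| is_less(&w[1], &w[0]))
}

/// Hypothetical small-sort entry point with a reverse fast path.
fn sort_small_with_reverse_check<T, F>(v: &mut [T], mut is_less: F)
where
    F: FnMut(&T, &T) -> bool,
{
    if v.len() >= 2 && is_strictly_descending(v, &mut is_less) {
        // n - 1 comparisons and a reversal instead of a quadratic sort.
        v.reverse();
        return;
    }
    // Fallback: simple insertion sort.
    for i in 1..v.len() {
        let mut j = i;
        while j > 0 && is_less(&v[j], &v[j - 1]) {
            v.swap(j, j - 1);
            j -= 1;
        }
    }
}
```

For a fully descending slice of 20 elements this performs 19 comparisons, whereas insertion sort on the same input needs 190. The extra check is also an example of the trade-off for cold code: it adds instructions that have to be fetched and executed even when the pattern is absent.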
The speedups and slowdowns for Firestorm are relatively similar:
Comparison statistics.
With some exceptions, overall we see a nice average reduction of comparisons. That's the reason why I tuned `qualifies_for_branchless_sort` to allow all types except those that are deemed expensive to move.
Binary bloat.
Compiling a simple hello world and sorting 6 different integer types and 2 different String containers, in release mode and stripping the binary yields:
Why is this faster?
Modern hardware is so complex that all I can do is guess, but in essence, if you look at the assembly call graph of `insertion_sort_shift_left`:
And compare that to `sort8_optimal`:
It might seem clear why `sort8_optimal` and its siblings are able to extract more Instruction-Level Parallelism (ILP) than insertion sort, which is currently being used.
Another reason is that pdqsort spends a lot of time in insertion sort:
So it's worthwhile to speed up this area. The kind of optimizations pdqsort does to avoid choosing a bad pivot, or even switching to heapsort, make a lot of sense at larger sizes but are overkill for smaller sizes, which are better addressed with something tailor-made like `sort_small`.
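To see why the small-size routine matters so much, consider how a quicksort-style algorithm hands off work. The sketch below is a plain quicksort with a naive pivot, not pdqsort, and the threshold of 20 is an assumption for illustration. `sort_small` here is an insertion sort placeholder for whatever small-sort is plugged in.

```rust
/// Assumed cutoff below which the recursion hands off to the small sort.
const SMALL_SORT_THRESHOLD: usize = 20;

/// Placeholder small sort (insertion sort); a network-based
/// implementation would slot in here instead.
fn sort_small<T: Ord>(v: &mut [T]) {
    for i in 1..v.len() {
        let mut j = i;
        while j > 0 && v[j] < v[j - 1] {
            v.swap(j, j - 1);
            j -= 1;
        }
    }
}

/// Simplified quicksort: partitions until sub-slices are small,
/// then delegates to `sort_small`.
fn quicksort<T: Ord>(v: &mut [T]) {
    if v.len() <= SMALL_SORT_THRESHOLD {
        sort_small(v);
        return;
    }
    // Lomuto partition using the middle element as pivot.
    let last = v.len() - 1;
    v.swap(v.len() / 2, last);
    let mut store = 0;
    for i in 0..last {
        if v[i] < v[last] {
            v.swap(i, store);
            store += 1;
        }
    }
    v.swap(store, last);
    // The pivot at `store` is in its final position; recurse on both sides.
    let (left, right) = v.split_at_mut(store);
    quicksort(left);
    quicksort(&mut right[1..]);
}
```

In this sketch every recursion path ends in `sort_small`, so no matter how large the input is, every element other than the chosen pivots passes through the small sort once.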