Use Arrow Row Format in SortExec to improve performance #5230
Comments
Hi. I've never contributed before, but I've been working with datafusion for the past month or so and am loving the project. I'd like to try my hand at this issue. But no worries if a more experienced contributor wants to take it over instead 😀
@jaylmiller, we will be happy to help with reviews + exchanging ideas about the details. Thank you, looking forward to collaborating!
Hi there. I've created a draft PR which implements the suggested first iteration. Would appreciate any comments/suggestions, and in the meantime I'm going to start implementing the second iteration. Thanks!
Looks good so far with a cursory look 🚀 Our team can review in detail early next week. If you think you can finish both steps in the next few days, we can do an end-to-end review too.
That would be great. Thank you!
Hi there. I've got a rough draft (#5242) that works for the most part. I still have to get the metrics tracking working, but I wanted to get all the sort logic working first (which it seems to be, since the SQL integration test suite is passing...) and potentially get some recommendations from experienced contributors on whether this approach seems correct. Additionally, I still need to spill the row encoding data to disk. Right now, when sorted batches are spilled, it works as it did previously (i.e. the row encoding is just (re)created in the SortPreservingMerge)... Any suggestions on the overall approach, or recommendations on spilling the row format to disk, would be appreciated!
I think @tustvold has some thoughts about spilling the row format to disk. Perhaps he can share here.
I've got the spill logic working now, just not sure what format to serialize it to on disk. I've got a temporary solution using arrow IPC--so I could test all the logic--but I'd imagine this is sub-optimal and would need to be changed. Any suggestions on serialization format would be appreciated! Thanks. Other than the serialization format, everything else should now be ready to go. 🚀
You should be able to serialize the raw row bytes directly. For example, a basic idea might be: write a 4-byte magic header, e.g. Parsing it should be a relatively straightforward case of reversing the framing above, and then feeding the parsed bytes into
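A minimal sketch of the kind of framing described above, assuming a simple length-prefixed layout. The magic bytes, the u32 length prefix, and the helper names are assumptions for illustration, not the implementation that actually landed:

```rust
use std::io::{self, Read, Write};

/// Hypothetical magic header identifying a spill file of raw row bytes.
const MAGIC: &[u8; 4] = b"ROW1";

/// Write each row's raw bytes prefixed by its length (u32, little-endian).
fn write_rows<W: Write>(mut out: W, rows: &[&[u8]]) -> io::Result<()> {
    out.write_all(MAGIC)?;
    for row in rows {
        out.write_all(&(row.len() as u32).to_le_bytes())?;
        out.write_all(row)?;
    }
    Ok(())
}

/// Reverse the framing: check the magic header, then read length-prefixed rows.
fn read_rows<R: Read>(mut input: R) -> io::Result<Vec<Vec<u8>>> {
    let mut magic = [0u8; 4];
    input.read_exact(&mut magic)?;
    if &magic != MAGIC {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad magic header"));
    }
    let mut rows = Vec::new();
    let mut len_buf = [0u8; 4];
    loop {
        // Stop cleanly at end-of-file; propagate any other error.
        match input.read_exact(&mut len_buf) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => break,
            Err(e) => return Err(e),
        }
        let len = u32::from_le_bytes(len_buf) as usize;
        let mut row = vec![0u8; len];
        input.read_exact(&mut row)?;
        rows.push(row);
    }
    Ok(rows)
}
```

On the read side, the recovered byte buffers would then be handed back to the row machinery; that wiring is omitted here.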
Ok perfect--I figured it'd end up being something along those lines, but wanted to make sure I wasn't missing anything... Thanks!
Having trouble getting actionable benchmark results on my laptop. Running the benchmark twice in a row on the exact same build will often show differences of 50% between the same case on each run... making it hard to tell if changes are actually improving anything or not 🙃. Any tips for that? I'm still kind of new to Rust and have never used the benchmarking stuff before... Other than that, everything is done (at least I think so! 😅)
My laptop also shows quite a bit of variance -- I typically use a cloud server in this case. I can help run these benchmarks on such a machine if needed. The other thing that might help could be to use a slightly larger input data set in the benchmarks 🤔
Yeah, larger inputs/longer runs help as @alamb suggests, as well as keeping your laptop connected to the charger.
I am going to take a shot at getting some benchmark numbers and will post them here.
Here are my performance results (I did not dig into this yet).
Methodology:
git checkout sort-preserve-row-encoding
git cherry-pick 322e92bea6e28ada9f8d57d9429748fb58b2a2a5
cargo bench -p datafusion --bench sort -- --save-baseline sort-preserve-row-encoding
Results
I plan to run a profiling pass next to see where time is being spent.
Not sure if this is helpful, but here is the perf record:
target/release/deps/sort-d963b058c1acc9f5 --bench "sort mixed tuple"
Thanks for this! Any suggestions on how I can improve this code would be great--still kinda new to Rust. 😁 I have some of my own ideas I might try to implement as well. Might it be good to have a separate PR for only v1 of this issue (much less code is changed there)?
Interesting. I would expect to see much better results. There are some cases where the difference is quite significant. I still think this will give us good results once we identify what is going on and fix the issues.
Cool. I'm happy to keep working on this and look for the issues if you're confident that it will yield good results. I'm going to set up a cloud machine so I can run the benchmarks myself and iterate faster, because I'm not having a good time benchmarking on my laptop 😅
Also, I just realized that the sort benchmark was kind of bad (my mistake 😬); I've added a fix for this in PR #5308. Here are my benchmark results after that fix.
So we have regressions in some cases. Seems like these are the cases having preserve_partitioning enabled.
@ozankabak I think this could be what's going on: for smaller amounts of data, the upfront time cost of encoding the rows is greater than the time saved by having a more efficient comparison for sorting. But as the number of rows increases, the time cost of sorting grows faster than the encoding cost, making the faster comparisons more beneficial. That may also be why it's only the preserve-partitioning cases (which work with smaller batches) that regress. That being said, I'm not totally sure how to approach this issue code-wise. Suggestions would be appreciated 😅
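To make this intuition concrete, here is a rough cost model of my own (a back-of-the-envelope sketch, not numbers from the benchmarks). With $n$ rows, per-row encoding cost $c_{enc}$, and per-comparison costs $c_{row} < c_{dyn}$ for the row format and DynComparator respectively:

$$T_{row}(n) \approx c_{enc}\,n + c_{row}\,n\log_2 n, \qquad T_{dyn}(n) \approx c_{dyn}\,n\log_2 n$$

The row path wins only when $(c_{dyn}-c_{row})\log_2 n > c_{enc}$, i.e. above a crossover size $n^{*} = 2^{\,c_{enc}/(c_{dyn}-c_{row})}$; below it the encoding overhead dominates, which matches the regressions seen on small inputs.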
@tustvold, can you advise us on how to use the row conversion facility most efficiently? It seems both @jaylmiller and we are seeing the same behavior, and I'd like to make sure we are using the tools at our disposal the right way. In summary, as batch sizes get smaller, row conversion seems to result in lower overall performance (probably due to gains not justifying the conversion cost for small batch sizes). If you could take a look at how @jaylmiller is using the facility and let us know whether it is used appropriately, that'd be great. If everything is done right yet this behavior still persists, maybe we can then think about how to identify a reasonable default crossover point (which could be overridden via config) and use different approaches for different batch sizes. I don't like these kinds of "impure" approaches in general, but sometimes they yield great performance. The history of sort algorithms is full of such "hacks", so maybe this is one of the places where it makes sense to do it 🙂
All the kernels in arrow rely on large batch sizes to amortise dispatch overheads; if partitioning is producing small batches, IMO that is a bug/limitation that should be fixed. Perhaps we could convert to the row format before partitioning or something? In general I don't think regressing small inputs is necessarily an issue; arrow is optimised for batches of 1000s of rows, and small datasets are not really what it is optimised for... TLDR: the behaviour you describe is in line with my expectations for most arrow functionality, small batches severely hurt throughput. This is especially true for dictionary arrays, where dictionaries are per-array.
Great. @jaylmiller, I think if we tweak the gating logic to utilize the row-converted path only for tuples + large batch sizes (like > 512 or something), the practical performance will be very good in a wide variety of use cases. IIRC you already have gating logic, so this should be a micro change, right? You can determine the crossover size experimentally. I expect a sane choice on an "average" computer will be good for many scenarios. We can also add an override mechanism through the config in a follow-on PR in the future.
How common are such batches in practice? I guess I'm wondering if the added complexity is justified for what is effectively a degenerate case that will cause issues far beyond just sort? The main reason I ask is DynComparator, which underpins non-single-column lexsort, has known issues w.r.t. sorting nulls, and I had hoped to eventually deprecate and remove it - apache/arrow-rs#2687
Sounds good @ozankabak
Yes, we are already only using the row format when doing multi-column (tuple) sorting, so adding additional gating should not be an issue at all.
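For illustration, a minimal sketch of what such gating might look like. The names (use_row_format, row_gate_batch_size) are hypothetical and the threshold is just a placeholder to be tuned experimentally; this is not the code in the PR:

```rust
/// Hypothetical configuration for deciding when to take the row-encoded path.
struct SortConfig {
    /// Minimum number of rows before row conversion pays for itself.
    /// Placeholder value; the real crossover should be measured.
    row_gate_batch_size: usize,
}

/// Decide whether to sort via the Arrow row format or fall back to the
/// existing lexicographic (DynComparator-based) path.
fn use_row_format(config: &SortConfig, num_sort_columns: usize, num_rows: usize) -> bool {
    // Single-column sorts already have a faster specialized path.
    let is_multi_column = num_sort_columns > 1;
    // Only pay the encoding cost when the batch is large enough to amortize it.
    is_multi_column && num_rows >= config.row_gate_batch_size
}

fn main() {
    let config = SortConfig { row_gate_batch_size: 1024 };
    assert!(!use_row_format(&config, 1, 100_000)); // single column: never
    assert!(!use_row_format(&config, 3, 100));     // multi-column but tiny batch
    assert!(use_row_format(&config, 3, 8192));     // multi-column, large batch
}
```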
No, 512 is way too small @tustvold. So for the sort bench, we are seeing regression when the
Can't speak for the usages at large, but I've personally seen multiple use cases with relatively small batch sizes (high hundreds, low thousands) before in my data pipelines at various jobs. At Synnada, we use this parameter to trade off throughput vs. latency; in some cases one is more important than the other depending on volumes etc. For this use case, this check adds no new complexity, so we are all good in that regard.
Good to know. I will think about this and discuss it with my team; this will be on our radar for future work.
Do you see a similar regression in the single-partition case if you instead reduce the size of the total benchmark by a factor of 8? I could understand it if there were dictionaries involved, but the worst regression appears to be "sort mixed tuple preserve partitioning", which is just strings and primitives...
@tustvold here are the results after scaling the input rows down by 8. The "sort mixed tuple preserve partitioning" regression is approximately the same. The performance gains from the row encoding decrease with fewer input rows, as we would expect @ozankabak
I wonder if there is something else going on here then; I can take a look next week. Unless I'm missing something, this is not consistent with the issue being the batch size, as we would expect to now see a regression for even the non-preserving case.
I agree it's not just the batch size. It seems to be some combination of the individual batch size as well as the total number of batches being sorted at once.
Thanks 👍 I agree that there may be something else going on. Improving our understanding will help us isolate the effect of batch size and make decisions based on that.
I ran some experiments investigating how batch size impacts performance when doing multi-column sorts on a single record batch. So the batch size theory seems wrong, but these results do demonstrate why the "preserve partitioning" cases are regressing. What's interesting is that while single-batch sorting performance for the row format is actually worse, we're still getting a significant performance increase when more than one batch is being sorted 🤔. For example, the benchmark comps for utf8-tuple
Methodology: https://github.com/jaylmiller/inspect-arrow-sort. If someone could check out the actual sorting code for the experiments, that would help--it's just a few lines and pretty much entirely lifted from the PR. Perhaps I'm just using the row format incorrectly here? My current idea is to buffer up a few record batches before sorting them (at the moment each batch is sorted immediately when it streams in) and use that to inform the decision of whether to gate the row encoding (in addition to the existing gating, which checks that there's no limit and that it's multi-column). This should be a fairly minimal change to the current PR's code, but determining when (that is, at what number of batches and/or size of those batches) to apply the row encoding might be more difficult. We know for sure that this change can give nice performance bumps (for example, when sorting 8 batches of size 12500 we see more than 2x increases for some cases), so I think this change is still worth making as long as we can minimize the perf regressions seen in some scenarios.
@jaylmiller, I haven't studied the sort code yet, so I'd like to ask a few quick questions to further my understanding first. Let's say we have
Is there a third output-related step I'm missing?
@ozankabak Not quite, the batches are not coalesced until the very end of the process. This is the process for a single partition: each batch is sorted individually as it streams in, the sorted batches are buffered (spilling to disk if memory runs out), and only once the input is exhausted are the buffered sorted batches merged to produce the output.
So we're actually performing one in-memory sort per incoming batch plus a final merge, rather than a single sort over a coalesced dataset. The docs for the sort implementation on main also describe the algorithm.
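A simplified sketch of that flow, based only on the description in this thread. The real sort operator also handles memory accounting, spilling, and metrics; the names and types here are illustrative rather than the actual DataFusion code:

```rust
// Illustrative pseudocode for the per-partition sort flow described above.
// `sort_batch` and `merge_sorted` stand in for the real sort / streaming-merge
// machinery and are not the actual DataFusion function signatures.

struct Batch(Vec<(String, i64)>); // placeholder for a RecordBatch

fn sort_batch(mut batch: Batch) -> Batch {
    // Sort a single batch by its sort key (here: the whole tuple).
    batch.0.sort();
    batch
}

fn merge_sorted(sorted: Vec<Batch>) -> Batch {
    // Merge of already-sorted batches; a naive concat + stable sort is enough
    // to show the shape of the algorithm (the real code streams a k-way merge).
    let mut all: Vec<(String, i64)> = sorted.into_iter().flat_map(|b| b.0).collect();
    all.sort();
    Batch(all)
}

fn sort_partition(input: impl Iterator<Item = Batch>) -> Batch {
    // 1. Sort each batch as it streams in and buffer the result.
    let buffered: Vec<Batch> = input.map(sort_batch).collect();
    // 2. Once the input is exhausted, merge the buffered sorted batches.
    merge_sorted(buffered)
}

fn main() {
    let batches = vec![
        Batch(vec![("b".into(), 2), ("a".into(), 1)]),
        Batch(vec![("c".into(), 3), ("a".into(), 0)]),
    ];
    let sorted = sort_partition(batches.into_iter());
    assert!(sorted.0.windows(2).all(|w| w[0] <= w[1]));
}
```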
During my cursory look at the comments, the wording "sort all buffered batches" made me think we were sorting some sort of coalesced dataset when there is no memory issue. So the comment is somewhat misleading (at least one person got it wrong!). Looking at the code more attentively, I see that it is doing what you are describing, i.e. buffering partially sorted batches. Given this, I am currently out of theories as to why we see the regression in the preserve-partitioning cases.
Maybe we will get more ideas when @tustvold takes a look at whether row conversion is done properly. I will keep thinking about this in parallel as well, and will share here if I can think of anything.
My current thinking is that since the single-batch scenario is a special case with its own code path, and based on the finding that row format sorting can often perform worse on single batches, it seems the performance benefits of the row encoding are gained during the step of the algorithm that merges the mem-sorted batches. One possible reason the row encoding's perf benefits show up when a merge is performed is that we can use a sorting algorithm that benefits from the fact that the data is concatenated sorted sequences (according to the Rust docs, the stable sort is an adaptive merge sort that is very fast on nearly-sorted input). So I'm thinking if we gate row encoding usage based on whether or not that merge will happen, we can keep the perf advantages and remove these regressions. Here are my current bench results:
Thanks for all this work @jaylmiller -- we plan to assist / comment on this ticket in the next day or so. All your work so far has been great.
Great work 💯 Your charts above support your theory.
I agree that this should solve the regressions if the merge theory is correct.
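To illustrate the merge-based gating idea discussed above, here is a rough sketch. It assumes a hypothetical two-phase flow where the number of buffered batches is known before output is produced; the names and structure are illustrative, not the actual PR code:

```rust
/// Which comparison strategy to use for the final output step.
enum SortStrategy {
    /// Single buffered batch: no merge will happen, so skip row encoding and
    /// use the existing lexicographic sort path.
    Lexicographic,
    /// Multiple buffered batches: a merge will happen, so pay the row-encoding
    /// cost once and reuse it for both the per-batch sorts and the merge.
    RowEncoded,
}

/// Gate the row-encoded path on whether a merge is actually going to occur.
fn choose_strategy(num_buffered_batches: usize, num_sort_columns: usize) -> SortStrategy {
    if num_sort_columns > 1 && num_buffered_batches > 1 {
        SortStrategy::RowEncoded
    } else {
        SortStrategy::Lexicographic
    }
}

fn main() {
    // Single batch, multi-column: no merge, so don't row-encode.
    assert!(matches!(choose_strategy(1, 3), SortStrategy::Lexicographic));
    // Many batches, multi-column: the merge benefits from cheap row comparisons.
    assert!(matches!(choose_strategy(8, 3), SortStrategy::RowEncoded));
}
```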
Apologies, I have been busy with the arrow-rs <-> arrow2 unification effort; I will try to get time to take a look this week. Sorry for the delay.
This turns out not to have been a good first issue 😢
Apologies, I underestimated how complicated this one was, but thank you once again @jaylmiller for your efforts here; they've definitely helped and informed the ongoing work in this space 💪
No worries! This was great to work on nonetheless: I learned a lot about datafusion and made some other contributions in the process 😀
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
SortPreservingMerge now makes use of the arrow row format, and this has yielded significant performance improvements over the prior DynComparator based approach. We can likely significantly improve the performance of SortExec by modifying sort_batch to also make use of the row format when performing multi-column sorts, instead of lexsort_to_indices, which internally uses DynComparator. For single-column sorts, lexsort_to_indices will call through to sort_to_indices, which will be faster than the row format, so we should make sure to keep this special case.
Describe the solution you'd like
A first iteration could simply modify sort_batch to use the row format for multi-column sorts, as demonstrated here, falling back to sort_to_indices if there is only a single column. A second iteration could then look to find a way to convert to the row format once, and preserve this encoding when feeding sorted batches into SortPreservingMerge.
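A rough sketch of what the first iteration could look like, computing sort indices via the arrow row format for multi-column sorts. This is an illustration of the idea, not the code from the PR; it assumes the arrow crate's row module API (RowConverter, SortField), and options handling, limits, and error cases are simplified:

```rust
use arrow::array::{ArrayRef, UInt32Array};
use arrow::compute::{sort_to_indices, SortOptions};
use arrow::error::ArrowError;
use arrow::row::{RowConverter, SortField};

/// Compute sort indices for one or more sort columns.
/// Multi-column: encode to the row format and sort by comparing encoded rows.
/// Single column: fall back to the specialized sort_to_indices kernel.
fn sort_indices(
    columns: &[ArrayRef],
    options: &[SortOptions],
) -> Result<UInt32Array, ArrowError> {
    assert_eq!(columns.len(), options.len());
    if columns.len() == 1 {
        // The single-column kernel is faster than row conversion.
        return sort_to_indices(columns[0].as_ref(), Some(options[0]), None);
    }

    // Describe each sort key column (type + sort options) for the converter.
    let fields: Vec<SortField> = columns
        .iter()
        .zip(options)
        .map(|(col, opts)| SortField::new_with_options(col.data_type().clone(), *opts))
        .collect();
    let mut converter = RowConverter::new(fields)?;
    let rows = converter.convert_columns(columns)?;

    // Sorting the encoded rows is a simple byte-wise comparison.
    let mut indexed: Vec<_> = rows.iter().enumerate().collect();
    indexed.sort_unstable_by(|(_, a), (_, b)| a.cmp(b));
    Ok(UInt32Array::from_iter_values(
        indexed.iter().map(|(i, _)| *i as u32),
    ))
}
```

The returned indices could then be applied with the take kernel, just as the existing lexsort_to_indices path does.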
Describe alternatives you've considered
We could not do this
Additional context
FYI @alamb @mustafasrepo @ozankabak