Rework GroupByHash for faster performance and support grouping by nulls #790
Comments
Yes, this (great) explanation / overview matches my suggestion. Thanks for the write up!! Somehow the ASCII art doesn't render well here? A couple of points:
Great proposal. From the hashing side, an unknown to me atm is how to efficiently hash
If we could write the code in a way that we could "easily" switch between implementations (during dev only, not a conf parameter), we could bench whether one wins over the other, or under which circumstances. Regardless, nulls in the group by are so important that IMO any is +1 at this point xD
@Dandandan I think this is a version of what I was trying to explain in the "Alternatives considered" section. I think we need an entire validity mask (as only some of the group keys might be null). Also, we would need to keep validity in the hash map's values to produce the correct output, but we could get that "for free" switching to
@jorgecarleitao -- excellent point. I have some idea of how to potentially benchmark these / keep the code separate to allow switching in different implementations. Thank you for the options (I was planning on doing the first thing you suggested, but it is good to think about the others).
Yes, that's right. I missed the part in the alternatives, but that should work I think 👍
One thing I'd like to benchmark is the storing of indices as part of the map values. I see that this can make updating the accumulators more efficient since we can use
The overhead in that approach would be in constructing and matching

I have been experimenting with another, completely different approach for some time, although with a much simplified type system and also without considering null values yet. It would require access to the whole input data and might be difficult to adapt to work with batches. The main idea is to use a hashmap that only maps input row numbers to consecutive integers that identify which group that row belongs to. So for an input like

Once the indices are calculated as described above, we can do the aggregation. For each aggregate there is a generically typed implementation that takes the input array for that aggregation and the indices as parameters. It creates a typed Vec of Accumulators, the length corresponding to the number of groups, and then iterates through indices and inputs and updates the accumulator at that index. This does a lot of random access, but there are no runtime typechecks or dynamic dispatch needed inside the inner loop.

Another benefit of the approach is that there can be separate optimized versions for the indices part, for example when grouping by a single dictionary-encoded array.
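To make the two passes concrete, here is a minimal Rust sketch of the idea as described above; names, types, and the SUM accumulator are illustrative, not the actual experiment:

```rust
use std::collections::HashMap;

/// First pass: map each input row to a dense group id (0, 1, 2, ...).
fn group_indices(keys: &[i64]) -> (Vec<usize>, usize) {
    let mut map: HashMap<i64, usize> = HashMap::new();
    let mut indices = Vec::with_capacity(keys.len());
    for &key in keys {
        let next = map.len();
        let group = *map.entry(key).or_insert(next);
        indices.push(group);
    }
    let num_groups = map.len();
    (indices, num_groups)
}

/// Second pass: generically typed accumulator update -- no dynamic dispatch
/// inside the inner loop (a SUM accumulator is used as the example).
fn sum_by_group(values: &[i64], indices: &[usize], num_groups: usize) -> Vec<i64> {
    let mut accumulators = vec![0i64; num_groups];
    for (value, &group) in values.iter().zip(indices) {
        accumulators[group] += *value; // random access per input row
    }
    accumulators
}
```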
More concretely regarding the proposal, how exactly the signature works is still a bit unclear to me. If it has a fixed length and is calculated somehow, then there is a possibility of collisions. If it is supposed to be an integer sequence then we'd need another hashmap to create it.

To fix the immediate problem of null values, I would try to encode them inline into the

I think in the current group by implementation, this vector is fully responsible for the equals and hashcode. That means the GroupByScalar implementations of Eq and Hash are not really used, and we could replace that with ScalarValue. The creation of this ScalarValue already only happens when a new entry needs to be inserted into the hashmap.
@jhorstmann the idea was that the "signature" is just a hash of the values. Collisions are handled by the fact that the entry in the hash table is a list of indices into the mutable area -- if there are multiple values in that list, each entry in the mutable area needs to be checked for equality to find the correct one. This was not very clear in the writeup and I apologize.
I agree this is a good plan (and I think it is similar to what I was trying to describe in the alternative section). I plan to try and code this version shortly so we have something to compare against.
Indeed -- I even have a PR which proposes exactly such a change: #786 :)
Nice write-up and very interesting discussions!
Yes, that's also what we currently do for the hash join algorithm. It's a small performance win. It also avoids the higher re-hashing cost when growing the hashmap. I believe a hashmap could also be implemented manually using a
I don't really see how we can do these equality checks faster than the hashmap itself could do it, but I would be very happy to be proven wrong. For optimized eq and hash handling, hashbrown has some more low-level functionality that we are not yet using:
Thank you for the pointers @jhorstmann -- I think this is basically exactly what the design is trying to do. I will study the hashbrown API some more and see if I can use it.
Some more info on why collisions might not be that bad, even with an alternative approach:
Summary

TLDR; If it is ok to change
Discussion

Based on the great suggestion from @jhorstmann, I spent some time playing with hashbrown's
The approach was to create a
And then use the hashbrown API to avoid creating

I made a cut-down example showing how this works here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=3c5e4356d83b3a5e2221dc82eb3f4ebf

In order to use this approach, the following changes are needed, as shown in the example:
The second two are straightforward. The first may require some discussion.

Consistent Hashing

Rationale

Sadly, for some reason even when using the

You can see the problem by switching the example to

Changes needed

To illustrate the change that is needed, here is a highly simplified version of how
Here is how a consistent version that uses the
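The full example is in the playground link above; the snippet below is only a rough, self-contained sketch of the general raw-entry pattern (illustrative names and a generic key type, not the playground code itself; note the raw-entry API may be feature-gated or absent in newer hashbrown versions):

```rust
use hashbrown::hash_map::RawEntryMut;
use hashbrown::HashMap;
use std::hash::Hash;

/// Look up or create a group entry using a hash computed elsewhere (e.g. by a
/// vectorized create_hashes-style kernel). The raw-entry API lets us reuse that
/// hash instead of hashing the key again. Crucially, `hash` must be the same
/// value the map's own BuildHasher would produce for `key`; otherwise entries
/// are lost when the map resizes -- the "consistent hashing" requirement above.
fn upsert_group<K: Hash + Eq + Clone>(
    map: &mut HashMap<K, usize>,
    key: &K,
    hash: u64,
    next_group: &mut usize,
) -> usize {
    match map.raw_entry_mut().from_hash(hash, |existing| existing == key) {
        RawEntryMut::Occupied(entry) => *entry.get(),
        RawEntryMut::Vacant(entry) => {
            let group = *next_group;
            *next_group += 1;
            // store the precomputed hash together with the key and group id
            entry.insert_hashed_nocheck(hash, key.clone(), group);
            group
        }
    }
}
```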
We also need to sort out how to hash null values, which are handled differently in create_hashes (not hashed) and ScalarValue (hashed). @jorgecarleitao has some good thoughts on the topic of hashing nulls here: #790 (comment)
@Dandandan if you have a moment, I would like to know if you have any concerns with the "change
Thanks, great writeup and nice example. Consistent hashes between Arrays and Scalars sound very nice, but I think it will require an extensive test suite for all data types so it doesn't get broken accidentally. An alternative could be to store the calculated hash also in the

For reference, my experiments with grouping by mapping each input row to a consecutive integer used hashbrown like this:
Since the hashmap key is just an index into the input columns, the

There is also a method
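The original snippet is not reproduced above; the following is a rough sketch of the row-index-as-key idea (illustrative names, not the experiment's code), written against hashbrown's RawTable (which requires the crate's "raw" feature) so that the hash stored with each entry is reused when the table grows:

```rust
// Assumes hashbrown with the "raw" feature, e.g.
// hashbrown = { version = "0.11", features = ["raw"] }
use hashbrown::raw::RawTable;
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

/// Each table entry stores (hash, representative row index, group id); equality
/// is checked against the input column, so no owned key is ever materialized.
fn assign_groups(column: &[i64]) -> Vec<usize> {
    let random_state = RandomState::new();
    let hash_value = |v: &i64| {
        let mut hasher = random_state.build_hasher();
        v.hash(&mut hasher);
        hasher.finish()
    };

    let mut table: RawTable<(u64, usize, usize)> = RawTable::new();
    let mut groups = Vec::with_capacity(column.len());

    for row in 0..column.len() {
        let hash = hash_value(&column[row]);
        // look up an existing group whose representative row holds an equal value
        let existing = table
            .get(hash, |(_, other_row, _)| column[*other_row] == column[row])
            .map(|(_, _, group)| *group);
        let group = match existing {
            Some(group) => group,
            None => {
                let group = table.len();
                // the stored hash is reused on growth, so nothing is rehashed
                table.insert(hash, (hash, row, group), |(stored_hash, _, _)| *stored_hash);
                group
            }
        };
        groups.push(group);
    }
    groups
}
```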
I agree the testing would be key here, and I am willing to write such tests.
I tried to do this but I couldn't figure out how to do so using the
Yes, that is a cool idea. I wonder if we could use something like that as an initial partial aggregate pass: we would initially aggregate each batch partially as you describe and then update the overall aggregates from the partials.
You're right, coming from a Java background I always forget that this works very differently in Rust. It should work with
Yes, that is my understanding. The method is probably really a bit dangerous since inconsistencies will then only show up when the map gets resized.
That is also my idea, maybe not per batch, but to aggregate bigger partitions in parallel and then merge the results. The main benefit is not the hashing though, but that updating the accumulators can be done with a generic function instead of dynamic dispatch.
I will try to have a better look at this later. The first feeling I have is that the example/proposal is:
It might still be an improvement over the current state (for hash aggregate); it looks like it simplifies some parts.
I got enough of the approach described by @Dandandan in #790 (comment) working in https://github.com/apache/arrow-datafusion/pull/808/files to take some benchmarks. Results look as follows:
The good news is that we didn't slow down; the not-so-good news is that it isn't much faster either, though this may be related to the specific benches used. I will see if I can run the tpch benchmarks and do some profiling of where time is being spent.

Also, in the "good news" category is that I ran out of RAM trying to find hash collisions -- lol. I have an idea to fix the hashing function for the test, but the current

Benchmarks run like this:
Cool. I am doing some tests with db-benchmark, which includes some more challenging queries.
Some good news: this is quite likely mostly an improvement on the (more challenging) db-benchmark aggregates.

Master:
PR:
Performance of q10 is improved > 2x!
Some profiling output of the gby_null_new branch. I think it shows that:
Thanks @Dandandan -- I think that some of the time is spent creating/dropping/comparing

I also did some profiling with the tpch q1 (which has a group by on two keys and no joins); my conclusion from that exercise is that this approach is about the same speed as the one on master.

Profiling command: `cargo run --release --bin tpch -- benchmark datafusion --iterations 10 --path ./data --format tbl --query 1 --batch-size 10000`

On master,

On the gby_null_new branch,

Which is well within the error bounds of my measurement setup.

My profiling suggests that Q1 spends 85% of the time parsing CSV data and approximately 15% of the time doing the aggregation. Of that 15%, 10% is in

My next steps are:
Yeah, Q1 from TPC-H is comparable to Q1 of the db-benchmark: very few unique keys. The performance improvement listed above (db-benchmark q10 and others) is for a lot of groups, which have some other hot paths. I think this benchmark only uses those types you mentioned, so I don't expect anything from that. There are a few other queries in TPC-H that are a bit more challenging on the aggregation side (q3 maybe?), but the benchmarks are all relatively easy as a lot of the data is filtered out on read and aggregated in big groups.
I see the benchmark also uses the int32 type, so it would be interesting to see whether that will speed things up.
@alamb Some time ago I added the
@jhorstmann I was dreaming about just such a feature this morning! That is an excellent idea; I will give it a try later today!
@jhorstmann -- I tried this and now the queries run too fast on SF 1 (70ms) lol -- I am trying to make SF10 data to see if that is reasonable enough for me to profile.
Increasing the number of partitions
TLDR; I did some more testing with a synthetic best-case harness and I can see a small (8%) improvement for grouping on utf8 key data (average of 22 characters each). While not as much as I had hoped, it is good enough for me to polish this up and get it ready for code review. Ok, I am out of time for the day but I will continue tomorrow.

Details: I whipped up a custom harness that is entirely CPU bound (feeds the same record batch in multiple times). Roughly speaking, it tests grouping on 1 billion rows, with 100 distinct keys, calculating a

I currently have single-column group keys of each type: int64, utf8 and dict.

master:
on gby_null_new / #808
This gives an idea of what is being tested:
Great! Maybe it would also be good to add a benchmark that groups on unique UTF-8 keys (1 group per key), to test a workload like deduplication. Probably there is a much larger gain there, like in the db-benchmark case.
@Dandandan how do you run db-benchmark against datafusion? Are you using h2oai/db-benchmark#182?
Yes. To generate the data:
Then change the implementation in
And running:
My results on the latest version.
q4 is improved in the latest version compared to earlier (it used an int32 column to group on). q2 still looks a bit (~10%) slower. The query is:

I am wondering whether this comment https://github.com/apache/arrow-datafusion/pull/808/files#r683975473 might help a bit, as it does some additional cloning of

Another cause could be that hashing and comparing one
Some profiling results reveal that a large part of the time now is spent in

The first one could be partly resolved by updating/introducing a method that takes

And it can go away when the actual array contents can be stored in an array directly from the start, which I think is the longer-term plan.
Introducing the variant hash methods would help in this case. For a query which groups by 3 columns of types [u8, u8, u16], a fixed u32 hash key will be enough.
And I think in an OLAP system, the hash key should always be unique (no rehashing exists).
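As an illustration of the fixed-width packing described above (a sketch, not DataFusion code):

```rust
/// Pack a (u8, u8, u16) group key into a single u32. Because the columns are
/// fixed width, every distinct key maps to a distinct u32, so the packed value
/// can serve directly as the hash-table key with no collisions possible.
fn pack_key(a: u8, b: u8, c: u16) -> u32 {
    (a as u32) | ((b as u32) << 8) | ((c as u32) << 16)
}

// e.g. pack_key(1, 2, 0x0304) == 0x0304_0201
```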
Thanks for the pointer @sundy-li. I agree the value being hashed is always unique. I do think, though, that in very rare cases it is possible for multiple different input values to produce the same output hash value -- aka collisions do happen.
@Dandandan yes, I think building up the actual array content (rather than using
FWIW I would expect
Yes, I think for some types the hashing method might be further specialized to speed up the hashing or to reduce the amount of memory needed for the hash value, instead of always using

I think for small ranges / data types we can even avoid using a hash table and move to direct indexing instead. That might be interesting for u8 values or small dictionaries.
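A minimal sketch of the direct-indexing idea for a u8 group key (COUNT chosen arbitrarily as the aggregate):

```rust
/// With a u8 group key there are at most 256 groups, so the key can index
/// directly into an array of accumulators instead of going through a hash table.
fn count_by_u8_key(keys: &[u8]) -> [u64; 256] {
    let mut counts = [0u64; 256];
    for &key in keys {
        counts[key as usize] += 1;
    }
    counts
}
```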
Yes, I believe this is the case based on the profiling results.
#808 is now ready for review by a wider group (no pun intended)
Rationale
Thus this ticket proposes to rearrange the GroupByHash code to be more efficient in both space and time, providing us with a performance budget to add NULL support without an overall regression.
Overview of Current GroupByHash
This section explains the current state of GroupByHash on master at 5416341
At a high level, the group by hash does the following for each input row, in a vectorized fashion:
When all the input has been processed, then the hash table is drained, producing one row for each entry in the hash table, in the following manner:
So for example, given a query such as
This looks something like
The hash table that is formed looks something like this:
Key Formation
The current state of the art, introduced in 93de66a / apache/arrow#8863 by @Dandandan, is quite clever. The code in `create_key` packs data from the group keys together into a single `mut Vec<u8>`, which is then used as the key for the hash table.

For example, if the input row was:

The resulting key is a 13 byte `Vec<u8>`: 11 bytes for "foo" (8 bytes for the length + 3 bytes for "foo") and 2 bytes for 0x1234, a 16 bit integer:

However, there are at least a few downsides of this approach: the group by values are stored twice -- once in the `Vec<u8>` key and once in the values as a `GroupByScalar` used to produce the output -- resulting in memory overhead, especially for variable length (e.g. string) values.
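For illustration, here is a rough sketch of this packing scheme (the field order and endianness are assumptions for the example, not the exact `create_key` layout):

```rust
/// Pack a (utf8, u16) group key into a single byte vector: 8 bytes of string
/// length, the string bytes, then the 2-byte integer -- 13 bytes in total for
/// ("foo", 0x1234).
fn pack_group_key(s: &str, n: u16) -> Vec<u8> {
    let mut key = Vec::with_capacity(8 + s.len() + 2);
    key.extend_from_slice(&(s.len() as u64).to_le_bytes()); // 8-byte length prefix
    key.extend_from_slice(s.as_bytes());                    // the string bytes
    key.extend_from_slice(&n.to_le_bytes());                // 2 bytes for the u16
    key
}

// pack_group_key("foo", 0x1234).len() == 13
```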
Proposal

Modeled after what I think @Dandandan is suggesting in #786 (comment):
The HashTable would not store the key or aggregate accumulators directly, but instead would map "signatures" computed from the group by keys to list offsets in a mutable storage area that contained the values and aggregates.
The "signature" is simply a hash of the values (and validity bit) of the group key values. The term "signature" is used to avoid confusion with the hash used in the hash table. It would be computed as a `u64` or similar, directly from the group by key values.

The hashtable composition would be different. Each entry is a `SmallVec` (non-allocating Vec) containing a list of indices into a "mutable storage" area.

Edit: Collisions are handled by the fact that the entry in the hash table is a list of indices into the mutable storage area -- if there are multiple values in that list, each entry in the mutable area needs to be checked for equality to find the correct one.
The mutable storage area contains:

- a `Vec` of `ScalarValue`s for each group key column
- a `Vec` of accumulators for each grouping

For example, this is one example of how (logically) this mutable storage would work:
I propose using `Vec<ScalarValue>` to store the group key values in the mutable area, as there is no equivalent of a mutable `Array` in arrow-rs yet (though I think there is `MutablePrimitiveArray` in arrow2). If/when we get access to a mutable array in datafusion, we can potentially switch to using that representation for the mutable storage area, which would likely both take less memory for some data types and allow for faster output generation.
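To make the proposed layout concrete, here is a rough Rust sketch. `GroupValue` stands in for `ScalarValue`, and a single `f64` running sum stands in for the real accumulators; none of these names are from the actual code:

```rust
use hashbrown::HashMap;
use smallvec::SmallVec;

/// Stand-in for ScalarValue; None encodes NULL.
#[derive(Clone, PartialEq)]
enum GroupValue {
    Utf8(Option<String>),
    Int16(Option<i16>),
}

/// The mutable storage area: one Vec of key values per group column, plus one
/// accumulator state per group (a running sum here, for simplicity).
struct MutableGroupState {
    group_values: Vec<Vec<GroupValue>>,
    accumulators: Vec<f64>,
}

/// The hash table maps the u64 "signature" of a group key to a small list of
/// offsets into the mutable storage; more than one offset means a signature
/// collision, resolved by comparing the stored key values.
type GroupIndex = HashMap<u64, SmallVec<[usize; 1]>>;

fn find_or_insert(
    index: &mut GroupIndex,
    state: &mut MutableGroupState,
    signature: u64,
    key: &[GroupValue], // one value per group column
) -> usize {
    let candidates = index.entry(signature).or_default();
    for &offset in candidates.iter() {
        // compare the stored group values column by column (collision check)
        let is_match = state
            .group_values
            .iter()
            .map(|column| &column[offset])
            .eq(key.iter());
        if is_match {
            return offset;
        }
    }
    // no existing group matched: append a new one to the mutable storage area
    let offset = state.accumulators.len();
    for (column, value) in state.group_values.iter_mut().zip(key) {
        column.push(value.clone());
    }
    state.accumulators.push(0.0);
    candidates.push(offset);
    offset
}
```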
Alternatives considered

One alternative that would require fewer changes but be slightly slower would be to append a validity bitmap on the end of both the keys and values in the hash table. For example:
And the keys would have a null bitmask appended on the end:
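A minimal sketch of that idea, assuming one validity byte per group column appended to the packed key (the exact encoding is not specified here):

```rust
/// Append one validity byte per group column to the packed Vec<u8> key, so a
/// NULL hashes and compares differently from every non-null value.
fn append_validity(key: &mut Vec<u8>, column_is_valid: &[bool]) {
    for &valid in column_is_valid {
        key.push(valid as u8);
    }
}
```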