Improve grouping performance by special casing small / fixed size keys #846
Comments
For the direct indexing idea, there is some more context here for the hash join #816 where a similar approach could be used.
If a column is nullable, we can use another byte to store the null bits. If [u8, u8, u16] are all nullable, a u64 key can be used.
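As a rough illustration of the nullable case, the sketch below packs three optional values into a u64: the low 32 bits hold the values and one extra byte holds a null bitmask. The function name `pack_nullable_key` and the bit layout are assumptions made for this example, not code from DataFusion or Datafuse. Null slots are zeroed so that rows with the same null pattern always produce the same key.

```rust
/// Pack three nullable group-by values into one u64 key (hypothetical layout):
/// bits 0..16  = c, bits 16..24 = b, bits 24..32 = a, bits 32..35 = null flags.
fn pack_nullable_key(a: Option<u8>, b: Option<u8>, c: Option<u16>) -> u64 {
    // One bit per column; a set bit means "this column is NULL".
    let mut nulls: u64 = 0;
    if a.is_none() {
        nulls |= 1;
    }
    if b.is_none() {
        nulls |= 1 << 1;
    }
    if c.is_none() {
        nulls |= 1 << 2;
    }

    // NULL values are stored as zero so they never leak stale bytes into the key.
    let a = u64::from(a.unwrap_or(0));
    let b = u64::from(b.unwrap_or(0));
    let c = u64::from(c.unwrap_or(0));

    (nulls << 32) | (a << 24) | (b << 16) | c
}
```

With this layout, `pack_nullable_key(None, Some(0), Some(0))` and `pack_nullable_key(Some(0), Some(0), Some(0))` differ only in the null byte, so NULL and zero land in different groups.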
How do you avoid that hashes are very similar to each other and only differ in some bits?
We don't care about rehashing in the hashmap; that is the hashmap's problem. With the specified key, we can get unique
Refer to the ClickHouse design: Why ClickHouse group-by is very faster? https://bohutang.me/2021/01/21/clickhouse-and-friends-groupby/
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The improved grouping algorithm on #790 improves grouping performance in general for DataFusion and is also general in that it works for all types of keys.
However, @sundy-li noted on #790 (comment) that additional performance is likely possible by special casing "small" and fixed sized keys.
Describe the solution you'd like
From @sundy-li's comment:
Introducing variant hash methods would help in this case.
E.g.:
For a query that groups by 3 columns of types [u8, u8, u16], a fixed-size U32 hash key will be enough.
Refer:
https://github.com/datafuselabs/datafuse/blob/master/common/datablocks/src/kernels/data_block_group_by.rs#L17-L36
https://github.com/datafuselabs/datafuse/blob/master/common/datablocks/src/kernels/data_block_group_by_hash.rs#L264-L274
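To make the [u8, u8, u16] example concrete, here is a minimal sketch of a fixed-width key, assuming the three columns are available as plain slices. The helper names (`pack_key`, `unpack_key`, `count_groups`) are hypothetical and are not taken from DataFusion or the linked Datafuse code; the point is only that 8 + 8 + 16 bits fit exactly in a u32, so the hash table can be keyed on a single integer instead of a variable-length byte buffer.

```rust
use std::collections::HashMap;

/// Pack three group-by column values into one fixed-width u32 key.
/// Layout (an assumption for this sketch): a in the top byte, b next, c in the low 16 bits.
fn pack_key(a: u8, b: u8, c: u16) -> u32 {
    (u32::from(a) << 24) | (u32::from(b) << 16) | u32::from(c)
}

/// Recover the original column values from a packed key,
/// e.g. when producing the output rows for each group.
fn unpack_key(key: u32) -> (u8, u8, u16) {
    ((key >> 24) as u8, (key >> 16) as u8, key as u16)
}

/// Count rows per group, keyed on the packed u32 rather than on a byte vector.
fn count_groups(a: &[u8], b: &[u8], c: &[u16]) -> HashMap<u32, u64> {
    assert!(a.len() == b.len() && b.len() == c.len());
    let mut counts = HashMap::new();
    for i in 0..a.len() {
        *counts.entry(pack_key(a[i], b[i], c[i])).or_insert(0u64) += 1;
    }
    counts
}
```

Because the packing is a bijection, two rows share a key exactly when all three column values are equal, so no separate key comparison on the original columns is needed.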
Alternate Ideas
@Dandandan also suggests that for small ranges / data types we can even avoid using a hash table and move to direct indexing instead. That might be interesting for u8 values or small dictionaries.
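The direct indexing idea could look roughly like the sketch below for a single u8 group key: since there are only 256 possible values, the key itself serves as the index into a fixed-size accumulator array and no hash table is involved. The function name `sum_by_u8` and the choice of a SUM aggregate are assumptions for illustration only.

```rust
/// Sum `values` per u8 group key using direct indexing instead of a hash table.
/// Returns (key, sum) pairs for the keys that actually occurred, in key order.
fn sum_by_u8(keys: &[u8], values: &[i64]) -> Vec<(u8, i64)> {
    assert_eq!(keys.len(), values.len());

    // One accumulator slot per possible key value.
    let mut sums = [0i64; 256];
    // Tracks which slots were touched, so empty groups are not reported.
    let mut seen = [false; 256];

    for (&k, &v) in keys.iter().zip(values) {
        let slot = usize::from(k);
        sums[slot] += v;
        seen[slot] = true;
    }

    (0..=255u8)
        .filter(|&k| seen[usize::from(k)])
        .map(|k| (k, sums[usize::from(k)]))
        .collect()
}
```

The same pattern would apply to small dictionaries by indexing on the dictionary code, as long as the number of distinct codes stays small enough for a dense array.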