Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve grouping performance by special casing small / fixed size keys #846

Closed
Tracked by #4973
alamb opened this issue Aug 9, 2021 · 4 comments · Fixed by #6904
Closed
Tracked by #4973

Improve grouping performance by special casing small / fixed size keys #846

alamb opened this issue Aug 9, 2021 · 4 comments · Fixed by #6904
Labels
enhancement New feature or request performance Make DataFusion faster

Comments

@alamb
Copy link
Contributor

alamb commented Aug 9, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The improved grouping algorithm on #790 improves grouping performance in general for DataFusion and is also general in that it works for all types of keys.

However, @sundy-li noted on #790 (comment) that additional performance is likely possible by special casing "small" and fixed sized keys.

Describe the solution you'd like
From @sundy-li ' comment:

Introduce the variant hash methods would help in this case.
E.G:

Query which group by 3 columns, which are [u8, u8, u16], a fixed hash key U32 will be enough.

  1. We can allocate one large fixed memory than multiple vec allocate.
  2. The fixed key saves the hash map memory size.

Refer:
https://github.com/datafuselabs/datafuse/blob/master/common/datablocks/src/kernels/data_block_group_by.rs#L17-L36

https://github.com/datafuselabs/datafuse/blob/master/common/datablocks/src/kernels/data_block_group_by_hash.rs#L264-L274

Alternate Ideas

@Dandandan also suggests that for small ranges / data types we can even avoid using a hash table and move to direct indexing instead. That might be interesting for u8 values or small dictionaries.

@Dandandan
Copy link
Contributor

For the direct indexing idea, there is some more context here for the hash join #816 where a similar approach could be used.

@sundy-li
Copy link
Contributor

If a column is nullable, we can use another byte to store the nullable bits.

If [u8, u8, u16] are all nullable, u64 key can be used.

@Dandandan
Copy link
Contributor

If a column is nullable, we can use another byte to store the nullable bits.

If [u8, u8, u16] are all nullable, u64 key can be used.

How do you avoid that hashes are very similar to each other and only differ in some bits?
As of now each element gets hashed using ahash. With grouping the values in one value I am wondering whether it's good enough for the hashtable? Or would you hash that again?

@sundy-li
Copy link
Contributor

With grouping the values in one value I am wondering whether it's good enough for the hashtable? Or would you hash that again?

We don't care about the rehash in hashmap, it's the problem of hashmap.
We just ensure the Key is unique (Fixed keys are represented as Number, String key can use hash256 method, don't need to care about the key conflict because it's as safe as crack the bank's password ).

With the specified key, we can get unique AggregateFunctionState from the HashMap, then we calculate/merge this row to the state. So the block is not modified by take, we just modified the state, and only need one function or expr for each aggregate function.

Refer to clickhouse design:

https://github.com/ClickHouse/ClickHouse/blob/master/src/Interpreters/Aggregator.h#L580-L625

Why ClickHouse group-by is very faster?

https://bohutang.me/2021/01/21/clickhouse-and-friends-groupby/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance Make DataFusion faster
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants