Improve grouping performance by special casing small / fixed size keys #846

alamb · 2021-08-09T17:34:26Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The improved grouping algorithm in #790 speeds up grouping in DataFusion across the board, and it is general: it works for all key types.

However, @sundy-li noted on #790 (comment) that additional performance is likely possible by special casing "small" and fixed sized keys.

Describe the solution you'd like
From @sundy-li's comment:

Introducing variant hash methods would help in this case.
E.g.:

For a query that groups by 3 columns of types [u8, u8, u16], a fixed u32 hash key is enough.

We can allocate one large fixed block of memory rather than making multiple Vec allocations.
The fixed key also reduces the memory used by the hash map.

Refer:
https://github.com/datafuselabs/datafuse/blob/master/common/datablocks/src/kernels/data_block_group_by.rs#L17-L36

https://github.com/datafuselabs/datafuse/blob/master/common/datablocks/src/kernels/data_block_group_by_hash.rs#L264-L274
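To make the fixed-key idea concrete, here is a minimal sketch of packing the [u8, u8, u16] example into a single u32 and using it as the only hash map key. The names `pack_key` and `count_groups` and the byte layout are assumptions chosen for illustration; this is not the code from #790 or from the datafuse files linked above.

```rust
use std::collections::HashMap;

/// Hypothetical helper: pack three fixed-width group-by columns (u8, u8, u16)
/// into one u32 key. Assumed layout, most significant byte first: a | b | c.
fn pack_key(a: u8, b: u8, c: u16) -> u32 {
    ((a as u32) << 24) | ((b as u32) << 16) | (c as u32)
}

/// Hypothetical helper: count rows per group, using the packed u32 as the
/// only hash map key instead of a per-row vector of column values.
fn count_groups(col_a: &[u8], col_b: &[u8], col_c: &[u16]) -> HashMap<u32, u64> {
    let mut counts = HashMap::new();
    for ((&a, &b), &c) in col_a.iter().zip(col_b).zip(col_c) {
        *counts.entry(pack_key(a, b, c)).or_insert(0) += 1;
    }
    counts
}
```

Because the three columns occupy exactly 32 bits, two rows get the same key if and only if all three column values match, so no separate equality check on the original columns is needed.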

Alternate Ideas

@Dandandan also suggests that for small ranges / data types we can even avoid using a hash table and move to direct indexing instead. That might be interesting for u8 values or small dictionaries.
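As a rough illustration of direct indexing, the sketch below sums values grouped by a single u8 key, using the key itself as an array index so that no hash table is involved. The function name `sum_by_u8_key` and the choice of a SUM aggregate are assumptions made for the example only.

```rust
/// Hypothetical helper: sum `values` grouped by a u8 key using direct indexing.
/// The key is the slot index, so there is no hashing and no probing.
fn sum_by_u8_key(keys: &[u8], values: &[i64]) -> Vec<(u8, i64)> {
    let mut sums = [0i64; 256]; // one slot per possible u8 value
    let mut seen = [false; 256]; // tracks which groups actually occurred
    for (&k, &v) in keys.iter().zip(values) {
        sums[k as usize] += v;
        seen[k as usize] = true;
    }
    // Emit only the groups that appeared, in ascending key order.
    (0..=255u8)
        .filter(|&k| seen[k as usize])
        .map(|k| (k, sums[k as usize]))
        .collect()
}
```

The same pattern could apply to small dictionaries by indexing with the dictionary code, as long as the number of distinct codes stays small enough for a dense array.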

Dandandan · 2021-08-09T17:58:02Z

For the direct indexing idea, there is more context in #816 (hash join), where a similar approach could be used.

sundy-li · 2021-08-10T00:05:18Z

If a column is nullable, we can use another byte to store the nullable bits.

If [u8, u8, u16] are all nullable, u64 key can be used.
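One way this could look, as a sketch: the 32 value bits stay in the low half of a u64 and a single extra byte above them carries one null flag per column. The layout and the name `pack_nullable_key` are assumptions for illustration; the comment above does not specify the exact encoding.

```rust
/// Hypothetical helper: pack three nullable columns (u8, u8, u16) into one u64.
/// Assumed layout: bits 0..32 hold the values, bits 32..35 hold one null flag
/// per column. Null slots are written as 0 so that every null in a column
/// produces the same key bits.
fn pack_nullable_key(a: Option<u8>, b: Option<u8>, c: Option<u16>) -> u64 {
    let values = ((a.unwrap_or(0) as u64) << 24)
        | ((b.unwrap_or(0) as u64) << 16)
        | (c.unwrap_or(0) as u64);
    let nulls = (a.is_none() as u64)
        | ((b.is_none() as u64) << 1)
        | ((c.is_none() as u64) << 2);
    (nulls << 32) | values
}
```

The null flags keep a null distinct from a real zero: `pack_nullable_key(None, Some(1), Some(2))` and `pack_nullable_key(Some(0), Some(1), Some(2))` differ in bit 32.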

Dandandan · 2021-08-10T05:08:13Z

> If a column is nullable, we can use another byte to store the nullable bits.
>
> If [u8, u8, u16] are all nullable, u64 key can be used.

How do you avoid hashes that are very similar to each other and differ in only a few bits?
As of now, each element gets hashed using ahash. When the values are grouped into a single value, I am wondering whether that value is good enough for the hash table as-is. Or would you hash it again?

sundy-li · 2021-08-10T06:26:53Z

> With grouping the values in one value I am wondering whether it's good enough for the hashtable? Or would you hash that again?

We don't need to care about rehashing inside the hash map; that is the hash map's concern.
We only need to ensure that the key is unique. Fixed keys are represented as numbers. String keys can use a 256-bit hash, and we don't need to worry about key collisions there, because finding one is about as unlikely as cracking a bank's password.

With the specified key, we can get a unique AggregateFunctionState from the HashMap, and then calculate/merge the current row into that state. The block is therefore not modified by take; we only modify the state, and we need just one function or expr for each aggregate function.
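A minimal sketch of that flow, assuming a u32 fixed key and a single AVG aggregate: each row looks up (or creates) the state for its key and is merged into it in place. `AvgState` and `aggregate_avg` are hypothetical names for this example and are not the AggregateFunctionState type referred to above.

```rust
use std::collections::HashMap;

/// Hypothetical per-group state for a single AVG aggregate.
#[derive(Debug, Default, PartialEq)]
struct AvgState {
    sum: f64,
    count: u64,
}

impl AvgState {
    /// Merge one input row into the state.
    fn update(&mut self, value: f64) {
        self.sum += value;
        self.count += 1;
    }

    /// Produce the final aggregate value for the group.
    fn finish(&self) -> f64 {
        self.sum / self.count as f64
    }
}

/// Hypothetical helper: one pass over the rows. The input columns are only
/// read; all mutation happens in the per-key state.
fn aggregate_avg(keys: &[u32], values: &[f64]) -> HashMap<u32, AvgState> {
    let mut states: HashMap<u32, AvgState> = HashMap::new();
    for (&key, &value) in keys.iter().zip(values) {
        states.entry(key).or_default().update(value);
    }
    states
}
```

Whether the hash map rehashes the integer key internally is left entirely to the `HashMap` implementation; the caller only guarantees that distinct groups produce distinct keys.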

Refer to clickhouse design:

https://github.com/ClickHouse/ClickHouse/blob/master/src/Interpreters/Aggregator.h#L580-L625

Why is ClickHouse group-by so fast?

https://bohutang.me/2021/01/21/clickhouse-and-friends-groupby/

alamb added the enhancement (New feature or request) and performance (Make DataFusion faster) labels Aug 9, 2021

alamb mentioned this issue Aug 9, 2021

Rework GroupByHash for faster performance and support grouping by nulls #790

Closed

alamb mentioned this issue Aug 10, 2021

Rework GroupByHash to for faster performance and support grouping by nulls #808

Merged

6 tasks

This was referenced Mar 3, 2023

Improve the performance of Aggregator, grouping, aggregation #4973

Closed

[EPIC] A list of performance improvement tickets #5546

Open

alamb mentioned this issue Jul 12, 2023

Vectorized hash grouping #6904

Merged

16 tasks

alamb closed this as completed in #6904 Jul 13, 2023
