Vectorized hashing for hash aggregation code #26

Dandandan · 2021-04-21T17:58:59Z

Updating the hash aggregate implementation to use vectorized hashing should give a decent speed up to queries that are dependant on fast hash aggregate implementations.

Currently keys are generated of type Vec<u8> and are hashed row-by-row which causes

more memory usage
slow re-hashing of the backing hashmap
type un-aware hashing for simple primitive values

The implementation should also solve hash collisions, so the original should be able to be compared with the values.

There is some WIP code here apache/arrow#9213 which can be used as a starting point / to continue from.

The text was updated successfully, but these errors were encountered:

jorgecarleitao · 2021-04-21T18:10:05Z

On arrow2, I took the following approach:

Expose an hash kernel
use hash_hasher

My hypothesis is that if we move hashing to arrow instead of doing item by item, we likely gain a lot. However, to do that, we need to tell our hasher to not re-hash the keys that it receives, thus the hash_hasher.

Dandandan · 2021-04-21T18:53:20Z

Thanks @jorgecarleitao !

Yeah I would agree, would be great to have a hashing kernel or maybe some basic primitives to build one easily using arrow. Something like hash_hasher looks cool too and is actually very similar to the one used in the hash join (hashbrown hashmap + IdHashBuilder which just uses the identity function) . Looking at the code of hash_hasher, it's something similar done as in the hash join (IdHashBuilder), but seems that it should be doing a bit more work (e.g. the "hash combiner" works over bytes) and hashbrown is also slightly faster. I believe because of more inlining as the standard library one uses / exports the same crate..

For this PR I was thinking to move the code to hash_utils

jorgecarleitao · 2021-04-21T19:03:53Z

Perfect, we have the same understanding of the problem, then :)

Yeah, I used hash_hasher to generalize the Dictionary builder to arbitrary types, but as long as we have the concepts right, we can change it; I agree that hashbrown + IdHashBuilder is more performant. 👍

xudong963 · 2022-04-23T14:44:56Z

The issue seems stale?

* Initial commit * initial commit * failing test * table scan projection * closer * test passes, with some hacks * use DataFrame (#2) * update README * update dependency * code cleanup (#3) * Add support for Filter operator and BinaryOp expressions (#4) * GitHub action (#5) * Split code into producer and consumer modules (#6) * Support more functions and scalar types (#7) * Use substrait 0.1 and datafusion 8.0 (#8) * use substrait 0.1 * use datafusion 8.0 * update datafusion to 10.0 and substrait to 0.2 (#11) * Add basic join support (#12) * Added fetch support (#23) Added fetch to consumer Added limit to producer Added unit tests for limit Added roundtrip_fill_none() for testing when None input can be converted to 0 Update src/consumer.rs Co-authored-by: Andy Grove <[email protected]> Co-authored-by: Andy Grove <[email protected]> * Upgrade to DataFusion 13.0.0 (#25) * Add sort consumer and producer (#24) Add consumer Add producer and test Modified error string * Add serializer/deserializer (#26) * Add plan and function extension support (#27) * Add plan and function extension support * Removed unwraps * Implement GROUP BY (#28) * Add consumer, producer and tests for aggregate relation Change function extension registration from absolute to relative anchor (reference) Remove operator to/from reference * Fixed function registration bug * Add test * Addressed PR comments * Changed field reference from mask to direct reference (#29) * Changed field reference from masked reference to direct reference * Handle unsupported case (struct with child) * Handle SubqueryAlias (#30) Fixed aggregate function register bug * Add support for SELECT DISTINCT (#31) Add test case * Implement BETWEEN (#32) * Add case (#33) * Implement CASE WHEN * Add more case to test * Addressed comments * feat: support explicit catalog/schema names in ReadRel (#34) * feat: support explicit catalog/schema names in ReadRel Signed-off-by: Ruihang Xia <[email protected]> * fix: use re-exported expr crate Signed-off-by: Ruihang Xia <[email protected]> Signed-off-by: Ruihang Xia <[email protected]> * move files to subfolder * RAT * remove rust.yaml * revert .gitignore changes * tomlfmt * tomlfmt Signed-off-by: Ruihang Xia <[email protected]> Co-authored-by: Daniël Heres <[email protected]> Co-authored-by: JanKaul <[email protected]> Co-authored-by: nseekhao <[email protected]> Co-authored-by: Ruihang Xia <[email protected]>

…nparser (#12161) * feat: Add DateFieldExtractStyle::Strftime support for SqliteDialect (#26) * feat: Add DateFieldExtractStyle::Strftime support for SqliteDialect * refactor: Refactor DateFieldExtractStyle if checks into if/match * Fix merge issue --------- Co-authored-by: Andrew Lamb <[email protected]>

Dandandan mentioned this issue May 9, 2021

Implement vectorized hashing for hash aggregate #149

Closed

Dandandan mentioned this issue Jul 29, 2021

Rework GroupByHash for faster performance and support grouping by nulls #790

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vectorized hashing for hash aggregation code #26

Vectorized hashing for hash aggregation code #26

Dandandan commented Apr 21, 2021 •

edited

Loading

jorgecarleitao commented Apr 21, 2021

Dandandan commented Apr 21, 2021 •

edited

Loading

jorgecarleitao commented Apr 21, 2021

xudong963 commented Apr 23, 2022

Vectorized hashing for hash aggregation code #26

Vectorized hashing for hash aggregation code #26

Comments

Dandandan commented Apr 21, 2021 • edited Loading

jorgecarleitao commented Apr 21, 2021

Dandandan commented Apr 21, 2021 • edited Loading

jorgecarleitao commented Apr 21, 2021

xudong963 commented Apr 23, 2022

Dandandan commented Apr 21, 2021 •

edited

Loading

Dandandan commented Apr 21, 2021 •

edited

Loading