Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorized hashing for hash aggregation code #26

Open
Dandandan opened this issue Apr 21, 2021 · 4 comments
Open

Vectorized hashing for hash aggregation code #26

Dandandan opened this issue Apr 21, 2021 · 4 comments

Comments

@Dandandan
Copy link
Contributor

Dandandan commented Apr 21, 2021

Updating the hash aggregate implementation to use vectorized hashing should give a decent speed up to queries that are dependant on fast hash aggregate implementations.

Currently keys are generated of type Vec<u8> and are hashed row-by-row which causes

  • more memory usage
  • slow re-hashing of the backing hashmap
  • type un-aware hashing for simple primitive values

The implementation should also solve hash collisions, so the original should be able to be compared with the values.

There is some WIP code here apache/arrow#9213 which can be used as a starting point / to continue from.

@jorgecarleitao
Copy link
Member

On arrow2, I took the following approach:

  1. Expose an hash kernel
  2. use hash_hasher

My hypothesis is that if we move hashing to arrow instead of doing item by item, we likely gain a lot. However, to do that, we need to tell our hasher to not re-hash the keys that it receives, thus the hash_hasher.

@Dandandan
Copy link
Contributor Author

Dandandan commented Apr 21, 2021

Thanks @jorgecarleitao !

Yeah I would agree, would be great to have a hashing kernel or maybe some basic primitives to build one easily using arrow. Something like hash_hasher looks cool too and is actually very similar to the one used in the hash join (hashbrown hashmap + IdHashBuilder which just uses the identity function) . Looking at the code of hash_hasher, it's something similar done as in the hash join (IdHashBuilder), but seems that it should be doing a bit more work (e.g. the "hash combiner" works over bytes) and hashbrown is also slightly faster. I believe because of more inlining as the standard library one uses / exports the same crate..

For this PR I was thinking to move the code to hash_utils

@jorgecarleitao
Copy link
Member

Perfect, we have the same understanding of the problem, then :)

Yeah, I used hash_hasher to generalize the Dictionary builder to arbitrary types, but as long as we have the concepts right, we can change it; I agree that hashbrown + IdHashBuilder is more performant. 👍

@xudong963
Copy link
Member

The issue seems stale?

andygrove added a commit that referenced this issue Jan 12, 2023
* Initial commit

* initial commit

* failing test

* table scan projection

* closer

* test passes, with some hacks

* use DataFrame (#2)

* update README

* update dependency

* code cleanup (#3)

* Add support for Filter operator and BinaryOp expressions (#4)

* GitHub action (#5)

* Split code into producer and consumer modules (#6)

* Support more functions and scalar types (#7)

* Use substrait 0.1 and datafusion 8.0 (#8)

* use substrait 0.1

* use datafusion 8.0

* update datafusion to 10.0 and substrait to 0.2 (#11)

* Add basic join support (#12)

* Added fetch support (#23)

Added fetch to consumer

Added limit to producer

Added unit tests for limit

Added roundtrip_fill_none() for testing when None input can be converted to 0

Update src/consumer.rs

Co-authored-by: Andy Grove <[email protected]>

Co-authored-by: Andy Grove <[email protected]>

* Upgrade to DataFusion 13.0.0 (#25)

* Add sort consumer and producer (#24)

Add consumer

Add producer and test

Modified error string

* Add serializer/deserializer (#26)

* Add plan and function extension support (#27)

* Add plan and function extension support

* Removed unwraps

* Implement GROUP BY (#28)

* Add consumer, producer and tests for aggregate relation

Change function extension registration from absolute to relative anchor
(reference)

Remove operator to/from reference

* Fixed function registration bug

* Add test

* Addressed PR comments

* Changed field reference from mask to direct reference (#29)

* Changed field reference from masked reference to direct reference

* Handle unsupported case (struct with child)

* Handle SubqueryAlias (#30)

Fixed aggregate function register bug

* Add support for SELECT DISTINCT (#31)

Add test case

* Implement BETWEEN (#32)

* Add case (#33)

* Implement CASE WHEN

* Add more case to test

* Addressed comments

* feat: support explicit catalog/schema names in ReadRel (#34)

* feat: support explicit catalog/schema names in ReadRel

Signed-off-by: Ruihang Xia <[email protected]>

* fix: use re-exported expr crate

Signed-off-by: Ruihang Xia <[email protected]>

Signed-off-by: Ruihang Xia <[email protected]>

* move files to subfolder

* RAT

* remove rust.yaml

* revert .gitignore changes

* tomlfmt

* tomlfmt

Signed-off-by: Ruihang Xia <[email protected]>
Co-authored-by: Daniël Heres <[email protected]>
Co-authored-by: JanKaul <[email protected]>
Co-authored-by: nseekhao <[email protected]>
Co-authored-by: Ruihang Xia <[email protected]>
peasee added a commit to peasee/datafusion that referenced this issue Aug 26, 2024
…pache#26)

* feat: Add DateFieldExtractStyle::Strftime support for SqliteDialect

* refactor: Refactor DateFieldExtractStyle if checks into if/match
alamb added a commit that referenced this issue Sep 3, 2024
…nparser (#12161)

* feat: Add DateFieldExtractStyle::Strftime support for SqliteDialect (#26)

* feat: Add DateFieldExtractStyle::Strftime support for SqliteDialect

* refactor: Refactor DateFieldExtractStyle if checks into if/match

* Fix merge issue

---------

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants