-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorized hashing for hash aggregation code #26
Comments
On arrow2, I took the following approach:
My hypothesis is that if we move hashing to arrow instead of doing item by item, we likely gain a lot. However, to do that, we need to tell our hasher to not re-hash the keys that it receives, thus the |
Thanks @jorgecarleitao ! Yeah I would agree, would be great to have a hashing kernel or maybe some basic primitives to build one easily using arrow. Something like For this PR I was thinking to move the code to |
Perfect, we have the same understanding of the problem, then :) Yeah, I used |
The issue seems stale? |
* Initial commit * initial commit * failing test * table scan projection * closer * test passes, with some hacks * use DataFrame (#2) * update README * update dependency * code cleanup (#3) * Add support for Filter operator and BinaryOp expressions (#4) * GitHub action (#5) * Split code into producer and consumer modules (#6) * Support more functions and scalar types (#7) * Use substrait 0.1 and datafusion 8.0 (#8) * use substrait 0.1 * use datafusion 8.0 * update datafusion to 10.0 and substrait to 0.2 (#11) * Add basic join support (#12) * Added fetch support (#23) Added fetch to consumer Added limit to producer Added unit tests for limit Added roundtrip_fill_none() for testing when None input can be converted to 0 Update src/consumer.rs Co-authored-by: Andy Grove <[email protected]> Co-authored-by: Andy Grove <[email protected]> * Upgrade to DataFusion 13.0.0 (#25) * Add sort consumer and producer (#24) Add consumer Add producer and test Modified error string * Add serializer/deserializer (#26) * Add plan and function extension support (#27) * Add plan and function extension support * Removed unwraps * Implement GROUP BY (#28) * Add consumer, producer and tests for aggregate relation Change function extension registration from absolute to relative anchor (reference) Remove operator to/from reference * Fixed function registration bug * Add test * Addressed PR comments * Changed field reference from mask to direct reference (#29) * Changed field reference from masked reference to direct reference * Handle unsupported case (struct with child) * Handle SubqueryAlias (#30) Fixed aggregate function register bug * Add support for SELECT DISTINCT (#31) Add test case * Implement BETWEEN (#32) * Add case (#33) * Implement CASE WHEN * Add more case to test * Addressed comments * feat: support explicit catalog/schema names in ReadRel (#34) * feat: support explicit catalog/schema names in ReadRel Signed-off-by: Ruihang Xia <[email protected]> * fix: use re-exported expr crate Signed-off-by: Ruihang Xia <[email protected]> Signed-off-by: Ruihang Xia <[email protected]> * move files to subfolder * RAT * remove rust.yaml * revert .gitignore changes * tomlfmt * tomlfmt Signed-off-by: Ruihang Xia <[email protected]> Co-authored-by: Daniël Heres <[email protected]> Co-authored-by: JanKaul <[email protected]> Co-authored-by: nseekhao <[email protected]> Co-authored-by: Ruihang Xia <[email protected]>
…pache#26) * feat: Add DateFieldExtractStyle::Strftime support for SqliteDialect * refactor: Refactor DateFieldExtractStyle if checks into if/match
…nparser (#12161) * feat: Add DateFieldExtractStyle::Strftime support for SqliteDialect (#26) * feat: Add DateFieldExtractStyle::Strftime support for SqliteDialect * refactor: Refactor DateFieldExtractStyle if checks into if/match * Fix merge issue --------- Co-authored-by: Andrew Lamb <[email protected]>
Updating the hash aggregate implementation to use vectorized hashing should give a decent speed up to queries that are dependant on fast hash aggregate implementations.
Currently keys are generated of type
Vec<u8>
and are hashed row-by-row which causesThe implementation should also solve hash collisions, so the original should be able to be compared with the values.
There is some WIP code here apache/arrow#9213 which can be used as a starting point / to continue from.
The text was updated successfully, but these errors were encountered: