Speed up hash partitioning #6822

Dandandan · 2023-07-02T09:37:33Z

Is your feature request related to a problem or challenge?

Also see the related request in arrow-rs: apache/arrow-rs#4476

In DataFusion, a common operation is to repartition a RecordBatch by hashing one or more columns and distributing its rows into per-partition record batches using the "formula" hash % num_partitions.

The current approach is to compute, for each partition, the indices of the rows that belong to it and use them to take from the individual arrays (see BatchPartitioner in DataFusion).

This is relatively expensive, however, as we visit each array num_partitions times, each time at scattered positions in the array, which makes the operator cache inefficient (especially when the number of partitions is high).
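To make the access pattern concrete, here is a minimal sketch of the index-then-take approach described above. It uses plain Rust vectors instead of Arrow arrays, and the names (hash_key, partition_indices, take, repartition) are hypothetical helpers for illustration, not the actual BatchPartitioner code. DefaultHasher is only a stand-in for whatever hash function DataFusion uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a single key value (stand-in hasher, an assumption of this sketch).
fn hash_key(key: i64) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}

/// For each partition p, collect the row indices with hash % num_partitions == p.
fn partition_indices(keys: &[i64], num_partitions: usize) -> Vec<Vec<usize>> {
    let mut indices = vec![Vec::new(); num_partitions];
    for (row, key) in keys.iter().enumerate() {
        let p = (hash_key(*key) % num_partitions as u64) as usize;
        indices[p].push(row);
    }
    indices
}

/// Gather the values of one column at the given row indices.
fn take<T: Clone>(column: &[T], indices: &[usize]) -> Vec<T> {
    indices.iter().map(|&i| column[i].clone()).collect()
}

/// Repartition a two-column "batch". Every column is gathered once per
/// partition, so each column is visited num_partitions times at scattered
/// positions.
fn repartition(
    keys: &[i64],
    values: &[String],
    num_partitions: usize,
) -> Vec<(Vec<i64>, Vec<String>)> {
    partition_indices(keys, num_partitions)
        .iter()
        .map(|idx| (take(keys, idx), take(values, idx)))
        .collect()
}
```

The loop in repartition is where the cost shows up: with many partitions, each take call touches only a few rows spread across the whole column.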

Describe the solution you'd like

Faster hash-partitioning implementation

Describe alternatives you've considered

No response

Additional context

No response

alamb · 2023-07-03T19:30:53Z

I recommend we look into implementing Selection Vectors / bitmasks -- then repartitioning could become a calculation of such filters / bitmasks
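As a rough illustration of the idea (a sketch under assumptions, not an existing DataFusion or arrow-rs API), repartitioning could compute one boolean mask per partition from the row hashes, and downstream operators would apply the mask instead of receiving copied arrays. The names partition_masks and filter are hypothetical, and Vec<bool> stands in for a packed bitmap.

```rust
/// Build one selection mask per partition: masks[p][row] is true when
/// hashes[row] % num_partitions == p.
fn partition_masks(hashes: &[u64], num_partitions: usize) -> Vec<Vec<bool>> {
    let mut masks = vec![vec![false; hashes.len()]; num_partitions];
    for (row, hash) in hashes.iter().enumerate() {
        let p = (*hash % num_partitions as u64) as usize;
        masks[p][row] = true;
    }
    masks
}

/// Apply a mask to a column, keeping only the selected rows.
/// With selection vectors this step could be deferred or skipped entirely.
fn filter<T: Clone>(column: &[T], mask: &[bool]) -> Vec<T> {
    column
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| value.clone())
        .collect()
}
```

In this form the partitioning step itself only writes the masks in a single pass over the hashes; whether and when the rows are materialized becomes a separate decision.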

Dandandan added the enhancement New feature or request label Jul 2, 2023

Dandandan changed the title ~~Speed up partitioning operator~~ Speed up hash partitioning operator Jul 2, 2023

Dandandan changed the title ~~Speed up hash partitioning operator~~ Speed up hash partitioning Jul 2, 2023

Dandandan mentioned this issue Jul 2, 2023

[EPIC] A list of performance improvement tickets #5546 (Open, 29 tasks)

Dandandan added the performance Make DataFusion faster label Jul 19, 2023

Dandandan mentioned this issue Jul 19, 2023

[EPIC] (Even More) Grouping / Group By / Aggregation Performance #7000 (Open, 17 tasks)
