Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Compute] Add support for imperfect grouping for use in radix partitioning #27873

Open
asfimport opened this issue Mar 22, 2021 · 1 comment

Comments

@asfimport
Copy link
Collaborator

asfimport commented Mar 22, 2021

ARROW-11591 adds Grouper for identifying groups based on multiple key columns.

For a large number of groups, it is beneficial to do a first pass partitioning on the key columns so that each worker thread only handles a subset of the query's groups. This is usually accomplished by computing only the hashes of the keys (not full group identity) and pushing slices of the input batches to workers based on those.

This would probably make sense as a member function of Grouper, maybe Grouper::ConsumeImperfect

Reporter: Ben Kietzman / @bkietz

Related issues:

Note: This issue was originally created as ARROW-12044. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Michal Nowakiewicz / @michalursa:
The description above suggests that this can potentially be solved at a level above Grouper using a computed field representing target partition id. A note regarding partitioning and group by: it is important to not use the same hash bits coming from the same hash function for both operations. Otherwise the number of hash conflicts in the hash table can grow big.

This issue is pointing out to an existing problem that needs to be solved. Currently Grouper has been designed and works only for up to around 10 million groups or for up to 4GB of data, whichever limit is hit first. 

There are multiple ways how Grouper can be extended to work with arbitrary number of groups up to the limit of available RAM. One solution is to just change current hash table to use 64-bit offset / hashes / group ids. From the experiments I did a few months ago this is not the most efficient way of doing group by aggregation for large hash tables - solutions based on partitioning of input data prior to grouping seem to perform better for >4GB hash tables. The performance bottleneck here is related to cache misses, including costly TLB cache misses, that happen on almost every lookup if the input has somewhat uniformly distributed keys. 

The other two approaches, which seem appealing to me, are based on the same idea of representing a large hash table as a collection of smaller hash tables. Each smaller hash table in the collection stores data for a distinct range of hashes and the sum of all the ranges covers the entire set of possible hash values. The size of the hash range for smaller hash tables can potentially vary from one to another within a collection.

One approach is to use a sequence of merges. Once a small hash table gets full, it is appended to the list of merge inputs, and a fresh empty hash table replaces it for subsequent processing of input exec batches. Fragments of merge input hash tables get merged together to form a new hash table that represents combined results. Fragments that are merged may represent a subset of hashes from input hash tables. A merge of multiple hash tables with hashes from 0 to 1023 may result for instance in a pair of hash tables representing combined grouped aggregation results, one for hashes from 0 to 511 and the other from 512 to 1023. 

Second approach is to use partitioning of data. The input keys either before or after initial aggregation are put into buckets corresponding to partitions. Each partition covers a distinct range of hash values but the number of partitions and size of the range are fixed. Then grouped aggregation would be done separately within each partition using a target hash table for that partition. Recursive partitioning or dynamic adjustments to the number of partitions may be used if the final size of the data cannot be correctly estimated.

Merging and partitioning are somewhat symmetrical. It is not immediately clear to me that one is better than the other.
So in summary the design choices I see are: a) whether a single hash table or a collection of hash tables for disjoint hash ranges should represent the aggregation results (my preference is right now on the collection, which seems to give more flexibility in the future), b) whether to use merge based or partition based approach (or neither). 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant