Speed up `DistinctCountAccumulator` #5472

Dandandan · 2023-03-03T18:39:13Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently the code uses a HashSet<ScalarValue> to track unique keys, which is slow and uses more memory than needed.

Describe the solution you'd like
Use a typed array for storing the entries & keep a hashmap.
We can use the same approach as present in the dictionary builders (PrimitiveDictionaryBuilder) or parquet Interner contributed by @tustvold.
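
For illustration, here is a minimal sketch of the "typed array plus hashmap" idea using only the standard library. The `Interner` type and its methods are hypothetical names, not the arrow or parquet API. Unlike the dictionary builders, this simplified version also keeps a copy of each value as the map key; a real implementation would more likely store only indices in the hash table.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Hypothetical interner: distinct values live in a typed `Vec`,
/// and a hash map points from each value to its slot in that `Vec`.
pub struct Interner<T> {
    values: Vec<T>,
    index: HashMap<T, usize>,
}

impl<T: Copy + Eq + Hash> Interner<T> {
    pub fn new() -> Self {
        Self { values: Vec::new(), index: HashMap::new() }
    }

    /// Returns the slot of `value`, inserting it first if it is new.
    pub fn intern(&mut self, value: T) -> usize {
        if let Some(&slot) = self.index.get(&value) {
            return slot;
        }
        let slot = self.values.len();
        self.values.push(value);
        self.index.insert(value, slot);
        slot
    }

    /// The number of distinct values seen so far.
    pub fn distinct_count(&self) -> usize {
        self.values.len()
    }

    /// The distinct values, in first-seen order.
    pub fn values(&self) -> &[T] {
        &self.values
    }
}
```

The distinct count is just the length of the typed `Vec`, and the values can be handed off as a contiguous slice without converting through `ScalarValue`.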


tustvold · 2023-03-03T19:25:27Z

I'm not intimately familiar with this operator, but if it needs to support tuples or nested types, the row format might be an option. You can implement the operator using rows and convert each row to an OwnedRow to store in a HashSet.
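
A rough sketch of the idea behind this suggestion, without depending on arrow: each tuple is encoded into a single byte key, and the keys go into a `HashSet`. The `encode_row` function below is a hypothetical stand-in for the row format and `OwnedRow`, not the actual arrow encoding, and the two-column schema is assumed purely for the example.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for an owned row: encodes the tuple
/// `(Option<i32>, Option<&str>)` into one unambiguous byte key.
pub fn encode_row(id: Option<i32>, name: Option<&str>) -> Vec<u8> {
    let mut key = Vec::new();
    // Each field starts with a validity byte: 0 = null, 1 = present.
    match id {
        None => key.push(0),
        Some(v) => {
            key.push(1);
            key.extend_from_slice(&v.to_be_bytes());
        }
    }
    match name {
        None => key.push(0),
        Some(s) => {
            key.push(1);
            // Length prefix keeps variable-length fields unambiguous.
            key.extend_from_slice(&(s.len() as u32).to_be_bytes());
            key.extend_from_slice(s.as_bytes());
        }
    }
    key
}

/// Counts distinct tuples by storing their encoded form in a `HashSet`.
pub fn count_distinct_rows(rows: &[(Option<i32>, Option<&str>)]) -> usize {
    let set: HashSet<Vec<u8>> = rows
        .iter()
        .map(|&(id, name)| encode_row(id, name))
        .collect();
    set.len()
}
```

Because every tuple becomes one opaque key, the same set type works regardless of how many columns or which types are involved.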

Dandandan · 2023-03-03T21:00:34Z

FYI @comphead For dictionary implementation see e.g. https://docs.rs/arrow-array/34.0.0/src/arrow_array/builder/primitive_dictionary_builder.rs.html#82

comphead · 2023-03-07T17:00:18Z

Hi @Dandandan I was crawling through the dictionary builder implementation.
It bundles three structures in one: a keys array, a values array, and a hashmap. Won't that cause overhead?

Could you elaborate on whether we can try a simple HashSet<ArrowPrimitiveType> as an alternative?

alamb · 2023-07-17T12:47:52Z

I think we could make COUNT DISTINCT go much faster if it were implemented via HashSet<ArrowPrimitiveType>. If that were implemented as a GroupsAccumulator, I suspect COUNT DISTINCT would be relatively screaming.

Dandandan · 2023-07-17T14:27:01Z

I wonder what a GroupsAccumulator implementation could look like.
We could make something like a HashSet<(u64, OwnedRow)> (group index + row content), or a HashSet<(u64, ArrowPrimitiveType)> for single columns of a primitive type.

A way to make it even faster is to use create_hashes + memoize hashes (i.e. also store hash in table) and use the RawTable API.
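
To make the "memoize hashes" idea concrete, here is a standard-library sketch. It hashes a whole batch up front (a stand-in for `create_hashes`) and then keys the table on the precomputed hash, with an identity hasher so the hash is not computed a second time. `MemoizedDistinct` and `IdentityHasher` are hypothetical names; hashbrown's RawTable would avoid the per-bucket `Vec` used here, so treat this as an illustration of the technique rather than the proposed implementation.

```rust
use std::collections::hash_map::RandomState;
use std::collections::HashMap;
use std::hash::{BuildHasher, BuildHasherDefault, Hasher};

/// Passes a precomputed `u64` hash through unchanged.
#[derive(Default)]
pub struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, _bytes: &[u8]) {
        unreachable!("only u64 keys are hashed");
    }
    fn write_u64(&mut self, hash: u64) {
        self.0 = hash;
    }
}

pub struct MemoizedDistinct {
    random_state: RandomState,
    /// Maps a stored hash to the indices in `values` that share it.
    table: HashMap<u64, Vec<usize>, BuildHasherDefault<IdentityHasher>>,
    values: Vec<i64>,
}

impl MemoizedDistinct {
    pub fn new() -> Self {
        Self {
            random_state: RandomState::new(),
            table: HashMap::default(),
            values: Vec::new(),
        }
    }

    pub fn update_batch(&mut self, batch: &[i64]) {
        // Step 1: hash the whole batch once, before touching the table.
        let hashes: Vec<u64> = batch
            .iter()
            .map(|v| self.random_state.hash_one(v))
            .collect();
        // Step 2: probe with the memoized hash; compare values only on a hit.
        for (&v, &h) in batch.iter().zip(&hashes) {
            let bucket = self.table.entry(h).or_default();
            if !bucket.iter().any(|&i| self.values[i] == v) {
                bucket.push(self.values.len());
                self.values.push(v);
            }
        }
    }

    pub fn distinct_count(&self) -> usize {
        self.values.len()
    }
}
```

Separating the hashing pass from the probing pass is what lets the hashing be done column-at-a-time, which is the part that tends to vectorize well.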

Also there are probably more opportunities to rewrite queries (i.e. rewrite select distinct + join on the group by keys might work faster in some cases).

alamb · 2023-07-17T18:00:43Z

I was thinking something like the following (templated on primitive type):

struct CountDistinctGroupsAccumulator<T: ArrowPrimitiveType> {
  values: HashSet<T>,
}

Clearly we could then take it a step further and use the hashbrown API to go even faster.

We would then use DataType::List, as we do today, to pass intermediate state.
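
As a rough, non-authoritative sketch of that shape, the following fixes the type to `i64` and models the list-typed intermediate state as a plain `Vec<i64>`. The struct and method names are hypothetical and only loosely mirror the accumulator interface; nulls are modeled as `None`.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified COUNT(DISTINCT) accumulator for `i64`.
pub struct DistinctCountI64 {
    values: HashSet<i64>,
}

impl DistinctCountI64 {
    pub fn new() -> Self {
        Self { values: HashSet::new() }
    }

    /// Adds a batch of input values; nulls (`None`) are ignored.
    pub fn update_batch(&mut self, batch: &[Option<i64>]) {
        self.values.extend(batch.iter().flatten().copied());
    }

    /// Intermediate state: the distinct values as a list
    /// (a stand-in for a `DataType::List` scalar).
    pub fn state(&self) -> Vec<i64> {
        self.values.iter().copied().collect()
    }

    /// Merges the list-typed states produced by other partitions.
    pub fn merge_batch(&mut self, states: &[Vec<i64>]) {
        for state in states {
            self.values.extend(state.iter().copied());
        }
    }

    /// Final result: the number of distinct non-null values.
    pub fn evaluate(&self) -> i64 {
        self.values.len() as i64
    }
}
```

Storing native values avoids the per-element enum overhead of `HashSet<ScalarValue>`, while the list-shaped state keeps the partial aggregation protocol unchanged.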

> Also there are probably more opportunities to rewrite queries (i.e. rewrite select distinct + join on the group by keys might work faster in some cases).

that is also quite interesting

Dandandan · 2023-07-23T22:02:10Z

I tried to prototype a solution.

I think we need more than just a single HashSet, because the count needs to be distinct per group index, not just the overall number of distinct values (that's why the original has to use a HashMap per accumulator). It needs to keep track of the unique values per group index and produce them (as a list) to compute the final count per group index.

I think an efficient way to store the data for primitives could be the following:

struct PrimitiveDistinctCountGroupsAccumulator<T: ArrowPrimitiveType> {
    /// Stores unique (group index, index of value in `values`) pairs
    uniques: RawTable<(usize, usize)>,
    /// Stores the unique values
    values: Vec<T::Native>,
    /// Stores the index (+1) of the first value for each group index (0 is end of list)
    /// The value is stored in `values`, the next (possible) index in `next`
    /// First item for group index is stored at group index
    first: Vec<usize>,
    /// Stores the location (+1) of the next value (0 is end of list)
    next: Vec<usize>,
    /// Tracks nulls in the input / filters
    null_state: NullState,
    /// Stores the number of groups (to produce state)
    total_num_groups: usize,
    /// Random state used for hashing
    random_state: RandomState,
}

uniques stores the group id + the index of the value.
We can use a data structure similar to the one in the hash join to store the data as chained lists in a single Vec.
This makes it possible to produce a list array for each group index.
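
A simplified sketch of the chained-list layout described above, fixed to `i64` and using only the standard library. One deliberate deviation, labeled here as an assumption: uniqueness is checked with a `HashSet<(usize, i64)>` instead of a `RawTable<(usize, usize)>` of indices, and null and filter handling is left out. `ChainedDistinct` is a hypothetical name.

```rust
use std::collections::HashSet;

pub struct ChainedDistinct {
    /// Stand-in for `uniques`: the (group index, value) pairs seen so far.
    uniques: HashSet<(usize, i64)>,
    /// Unique values, in insertion order across all groups.
    values: Vec<i64>,
    /// Per group: index (+1) of the head of its chain in `values`; 0 = empty.
    first: Vec<usize>,
    /// Per value: index (+1) of the next value in the same group; 0 = end.
    next: Vec<usize>,
}

impl ChainedDistinct {
    pub fn new() -> Self {
        Self {
            uniques: HashSet::new(),
            values: Vec::new(),
            first: Vec::new(),
            next: Vec::new(),
        }
    }

    pub fn update_batch(
        &mut self,
        values: &[i64],
        group_indices: &[usize],
        total_num_groups: usize,
    ) {
        if self.first.len() < total_num_groups {
            self.first.resize(total_num_groups, 0);
        }
        for (&v, &g) in values.iter().zip(group_indices) {
            if self.uniques.insert((g, v)) {
                self.values.push(v);
                // The new value becomes the head; the old head follows it.
                self.next.push(self.first[g]);
                self.first[g] = self.values.len();
            }
        }
    }

    /// Walks the chain for one group (most recently added value first).
    pub fn group_values(&self, group: usize) -> Vec<i64> {
        let mut out = Vec::new();
        let mut cursor = self.first[group];
        while cursor != 0 {
            out.push(self.values[cursor - 1]);
            cursor = self.next[cursor - 1];
        }
        out
    }

    /// Distinct count per group index.
    pub fn counts(&self) -> Vec<usize> {
        (0..self.first.len())
            .map(|g| self.group_values(g).len())
            .collect()
    }
}
```

Walking `first` and `next` yields each group's values without a separate allocation per group, which is what makes it straightforward to emit one list per group index as intermediate state.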

alamb · 2023-07-24T19:10:54Z

Storing chained values seems like a neat idea -- if you plan to use it for the Join maybe we could make it its own data structure and reuse in both places. 🤔

alamb · 2024-01-31T17:45:19Z

FWIW we made great progress in #8849 and #8721

The remaining types that might be important are Binary/LargeBinary, but I would be inclined to close this ticket as complete now that we have fast count distinct for integers and strings.

yjshen · 2024-02-29T20:08:53Z

Since #8827 has been merged (including support for both Binary and LargeBinary), I think it's okay to close this one.

Dandandan added the enhancement (New feature or request) and performance (Make DataFusion faster) labels Mar 3, 2023

comphead mentioned this issue Mar 10, 2023

Improve the performance of COUNT DISTINCT queries for high cardinality groups #5547

Closed

alamb mentioned this issue Jul 17, 2023

[EPIC] (Even More) Grouping / Group By / Aggregation Performance #7000

Open

17 tasks

korowa mentioned this issue Jan 2, 2024

feat: native types in DistinctCountAccumulator for primitive types #8721

Merged

jayzhan211 mentioned this issue Jan 13, 2024

Optimize COUNT( DISTINCT ...) for strings (up to 9x faster) #8849

Merged

This was referenced Jan 21, 2024

Change Accumulator::evaluate and Accumulator::state to take &mut self #8925

Merged

Minor: Add new Extended ClickBench benchmark queries #8950

Merged

alamb mentioned this issue Jan 31, 2024

Split count_distinct.rs into separate modules #9087

Merged

yjshen closed this as completed Feb 29, 2024
