Improve the performance of COUNT DISTINCT queries for high cardinality groups #5547
Comments
One suggestion from @jhorstmann: a rough outline of how deduplication could work using existing arrow kernels:
The sorting would probably make this more expensive than the existing accumulators. I missed that the question was in a DataFusion context. I think what is hindering the performance of the existing implementation is the per-value indirection through ScalarValue. One way to avoid these indirections would be to use the row format internally in the accumulator. The state of the accumulator would consist of a byte vector that contains serialized keys, and a hash table used to check whether a serialized value has already been seen.
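As a rough illustration of that row-format idea, here is a minimal sketch. It is not DataFusion's actual accumulator API: the names `RowDistinctAccumulator`, `try_new`, `update_batch`, and `count` are made up for illustration, and it assumes `arrow_row::Row` exposes its encoded bytes via `AsRef<[u8]>`.

```rust
use std::collections::HashSet;
use std::sync::Arc;

use arrow_array::{ArrayRef, StringArray};
use arrow_row::{RowConverter, SortField};
use arrow_schema::{ArrowError, DataType};

/// Sketch of a distinct accumulator that dedups on the arrow row format
/// instead of materializing one ScalarValue per input value.
struct RowDistinctAccumulator {
    converter: RowConverter,
    /// Row-format encoding of each distinct value seen so far
    seen: HashSet<Vec<u8>>,
}

impl RowDistinctAccumulator {
    fn try_new(data_type: DataType) -> Result<Self, ArrowError> {
        Ok(Self {
            converter: RowConverter::new(vec![SortField::new(data_type)])?,
            seen: HashSet::new(),
        })
    }

    /// Convert the whole input column to rows once, then dedup on the raw bytes.
    fn update_batch(&mut self, column: &ArrayRef) -> Result<(), ArrowError> {
        let rows = self.converter.convert_columns(&[Arc::clone(column)])?;
        for row in rows.iter() {
            // Copying the encoded bytes keeps the state independent of the
            // input batch's lifetime.
            self.seen.insert(row.as_ref().to_vec());
        }
        Ok(())
    }

    fn count(&self) -> usize {
        self.seen.len()
    }
}

fn main() -> Result<(), ArrowError> {
    let input: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "a", "c"]));
    let mut acc = RowDistinctAccumulator::try_new(DataType::Utf8)?;
    acc.update_batch(&input)?;
    assert_eq!(acc.count(), 3);
    Ok(())
}
```

Deduplicating on the encoded bytes sidesteps creating one ScalarValue per input value, at the cost of one small allocation per distinct value (which a hash-then-offset scheme like the one discussed below could also avoid).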
Another option which I suggested earlier would be something similar to the dictionary array builder (https://docs.rs/arrow-array/34.0.0/src/arrow_array/builder/primitive_dictionary_builder.rs.html#82), but without maintaining the duplicate values that the dictionary keeps. This would probably be a bit faster than … (though compared to …)
For distinct aggregates on strings, I think figuring out a way to avoid the overhead of per-value … 🤔 I wonder if we made ScalarValue::Utf8 use an …
Not sure if that makes sense; when doing so you already need to copy the original string to a new (columnar) array (to avoid keeping the original data in memory). I think it would be better to avoid using ScalarValue.
@Dandandan offered an interesting option of trying the primitive dictionary builder in #5472 (comment). Another question: for Ballista, to get the final aggregated result of a distributed COUNT DISTINCT, should the partial states use the same structure?
Yes, we would basically not have to use the … For primitive values we don't even have to use a similar structure, but can use something more like … So the structure could look something like:

use std::collections::HashSet;

use arrow_array::builder::GenericByteBuilder;
use arrow_array::types::{ByteArrayType, GenericBinaryType, GenericStringType};
use hashbrown::HashMap; // hashbrown, for the raw-entry API

#[derive(Debug)]
pub struct BinaryDistinctValues<T>
where
    T: ByteArrayType,
{
    state: ahash::RandomState,
    /// Used to provide a lookup from string value to key type
    ///
    /// Note: usize's hash implementation is not used, instead the raw entry
    /// API is used to store keys w.r.t. the hash of the strings themselves
    dedup: HashMap<usize, (), ()>,
    /// Each distinct value is appended here exactly once
    values_builder: GenericByteBuilder<T>,
}

enum DistinctAggregationState {
    /// primitive values (one variant per primitive native type in practice)
    Primitive(HashSet<i64>),
    Utf8(BinaryDistinctValues<GenericStringType<i32>>),
    Binary(BinaryDistinctValues<GenericBinaryType<i32>>),
}

(note this doesn't support the …)
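To make the raw-entry idea a bit more concrete, here is a self-contained sketch that mirrors `BinaryDistinctValues` but stores the distinct bytes in plain buffers instead of a `GenericByteBuilder`, so it needs no arrow types. It is illustrative only, not DataFusion code: `DistinctBytes`, `insert`, and `stored_bytes` are made-up names, and it assumes a hashbrown version that still exposes the raw-entry API (as the dictionary builder linked above uses).

```rust
use std::hash::BuildHasher;

use hashbrown::hash_map::RawEntryMut;
use hashbrown::HashMap;

/// Deduplicates byte strings while storing each distinct value exactly once,
/// back to back in a single buffer.
struct DistinctBytes {
    state: ahash::RandomState,
    /// `usize` keys are indexes of stored values; the raw-entry API lets us
    /// hash and compare the bytes they point to instead of the `usize` itself.
    dedup: HashMap<usize, (), ()>,
    values: Vec<u8>,
    /// Start offsets of each stored value; always `number of values + 1` long
    offsets: Vec<usize>,
}

impl DistinctBytes {
    fn new() -> Self {
        Self {
            state: ahash::RandomState::new(),
            dedup: HashMap::with_hasher(()),
            values: Vec::new(),
            offsets: vec![0],
        }
    }

    /// Read back the bytes of the `idx`-th stored value.
    fn stored_bytes<'a>(values: &'a [u8], offsets: &[usize], idx: usize) -> &'a [u8] {
        &values[offsets[idx]..offsets[idx + 1]]
    }

    /// Copy `bytes` into the buffer only if they have not been seen before.
    fn insert(&mut self, bytes: &[u8]) {
        let state = &self.state;
        let (values, offsets) = (&mut self.values, &mut self.offsets);
        let hash = state.hash_one(bytes);

        // Probe by the hash of the bytes; on a candidate match, compare
        // against the bytes already stored at that index.
        let entry = self.dedup.raw_entry_mut().from_hash(hash, |idx| {
            bytes == Self::stored_bytes(values, offsets, *idx)
        });

        if let RawEntryMut::Vacant(entry) = entry {
            let idx = offsets.len() - 1;
            values.extend_from_slice(bytes);
            offsets.push(values.len());
            // On rehash, keys are re-hashed by hashing the bytes they index.
            entry.insert_with_hasher(hash, idx, (), |idx| {
                state.hash_one(Self::stored_bytes(values, offsets, *idx))
            });
        }
    }

    fn distinct_count(&self) -> usize {
        self.dedup.len()
    }
}

fn main() {
    let mut distinct = DistinctBytes::new();
    for v in ["a", "b", "a", "c"] {
        distinct.insert(v.as_bytes());
    }
    assert_eq!(distinct.distinct_count(), 3);
}
```

The `usize` keys are only indexes; both the probe and rehashing hash the bytes they point to, which is the same trick the dictionary builder uses, just without maintaining a keys array.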
I was suggesting using an optimizer rule to rewrite an aggregate with DISTINCT into a double aggregation, eliminating the DISTINCT. The gist of the idea is to first move the distinct columns into additional grouping columns to compute the non-distinct aggregate results, and then use another round of aggregation to compute the values for the distinct expressions (since they have already been deduplicated in the first aggregation as grouping columns). I will paste the JavaDoc for Spark's RewriteDistinctAggregates below because it contains helpful examples (source here):

/**
* This rule rewrites an aggregate query with distinct aggregations into an expanded double
* aggregation in which the regular aggregation expressions and every distinct clause is aggregated
* in a separate group. The results are then combined in a second aggregate.
*
* First example: query without filter clauses (in scala):
* {{{
* val data = Seq(
* ("a", "ca1", "cb1", 10),
* ("a", "ca1", "cb2", 5),
* ("b", "ca1", "cb1", 13))
* .toDF("key", "cat1", "cat2", "value")
* data.createOrReplaceTempView("data")
*
* val agg = data.groupBy($"key")
* .agg(
* count_distinct($"cat1").as("cat1_cnt"),
* count_distinct($"cat2").as("cat2_cnt"),
* sum($"value").as("total"))
* }}}
*
* This translates to the following (pseudo) logical plan:
* {{{
* Aggregate(
* key = ['key]
* functions = [COUNT(DISTINCT 'cat1),
* COUNT(DISTINCT 'cat2),
* sum('value)]
* output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
* LocalTableScan [...]
* }}}
*
* This rule rewrites this logical plan to the following (pseudo) logical plan:
* {{{
* Aggregate(
* key = ['key]
* functions = [count('cat1) FILTER (WHERE 'gid = 1),
* count('cat2) FILTER (WHERE 'gid = 2),
* first('total) ignore nulls FILTER (WHERE 'gid = 0)]
* output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
* Aggregate(
* key = ['key, 'cat1, 'cat2, 'gid]
* functions = [sum('value)]
* output = ['key, 'cat1, 'cat2, 'gid, 'total])
* Expand(
* projections = [('key, null, null, 0, cast('value as bigint)),
* ('key, 'cat1, null, 1, null),
* ('key, null, 'cat2, 2, null)]
* output = ['key, 'cat1, 'cat2, 'gid, 'value])
* LocalTableScan [...]
* }}}
*
* Second example: aggregate function without distinct and with filter clauses (in sql):
* {{{
* SELECT
* COUNT(DISTINCT cat1) as cat1_cnt,
* COUNT(DISTINCT cat2) as cat2_cnt,
* SUM(value) FILTER (WHERE id > 1) AS total
* FROM
* data
* GROUP BY
* key
* }}}
*
* This translates to the following (pseudo) logical plan:
* {{{
* Aggregate(
* key = ['key]
* functions = [COUNT(DISTINCT 'cat1),
* COUNT(DISTINCT 'cat2),
* sum('value) FILTER (WHERE 'id > 1)]
* output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
* LocalTableScan [...]
* }}}
*
* This rule rewrites this logical plan to the following (pseudo) logical plan:
* {{{
* Aggregate(
* key = ['key]
* functions = [count('cat1) FILTER (WHERE 'gid = 1),
* count('cat2) FILTER (WHERE 'gid = 2),
* first('total) ignore nulls FILTER (WHERE 'gid = 0)]
* output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
* Aggregate(
* key = ['key, 'cat1, 'cat2, 'gid]
* functions = [sum('value) FILTER (WHERE 'id > 1)]
* output = ['key, 'cat1, 'cat2, 'gid, 'total])
* Expand(
* projections = [('key, null, null, 0, cast('value as bigint), 'id),
* ('key, 'cat1, null, 1, null, null),
* ('key, null, 'cat2, 2, null, null)]
* output = ['key, 'cat1, 'cat2, 'gid, 'value, 'id])
* LocalTableScan [...]
* }}}
*
* Third example: aggregate function with distinct and filter clauses (in sql):
* {{{
* SELECT
* COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_cnt,
* COUNT(DISTINCT cat2) FILTER (WHERE id > 2) as cat2_cnt,
* SUM(value) FILTER (WHERE id > 3) AS total
* FROM
* data
* GROUP BY
* key
* }}}
*
* This translates to the following (pseudo) logical plan:
* {{{
* Aggregate(
* key = ['key]
* functions = [COUNT(DISTINCT 'cat1) FILTER (WHERE 'id > 1),
* COUNT(DISTINCT 'cat2) FILTER (WHERE 'id > 2),
* sum('value) FILTER (WHERE 'id > 3)]
* output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
* LocalTableScan [...]
* }}}
*
* This rule rewrites this logical plan to the following (pseudo) logical plan:
* {{{
* Aggregate(
* key = ['key]
* functions = [count('cat1) FILTER (WHERE 'gid = 1 and 'max_cond1),
* count('cat2) FILTER (WHERE 'gid = 2 and 'max_cond2),
* first('total) ignore nulls FILTER (WHERE 'gid = 0)]
* output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
* Aggregate(
* key = ['key, 'cat1, 'cat2, 'gid]
* functions = [max('cond1), max('cond2), sum('value) FILTER (WHERE 'id > 3)]
* output = ['key, 'cat1, 'cat2, 'gid, 'max_cond1, 'max_cond2, 'total])
* Expand(
* projections = [('key, null, null, 0, null, null, cast('value as bigint), 'id),
* ('key, 'cat1, null, 1, 'id > 1, null, null, null),
* ('key, null, 'cat2, 2, null, 'id > 2, null, null)]
* output = ['key, 'cat1, 'cat2, 'gid, 'cond1, 'cond2, 'value, 'id])
* LocalTableScan [...]
* }}}
Yes, if you can avoid using ScalarValue I agree that would be the best.
I am not sure how it may inspire DataFusion yet, but just for reference, there are two improvements in DuckDB about parallelizing …
I am going to read the history of hash aggregation in DataFusion this week first and see if I can propose some ideas and a plan.
I can work on this. SparkSQL leverages the Expand operator to achieve this. In DataFusion, we do not have an Expand operator; instead, we can leverage GROUP BY with GroupingSet to achieve it (we need to use the GROUPING_ID() function together with CASE WHEN THEN exprs). This PR needs to be merged first, so that DataFusion can support the GROUPING() and GROUPING_ID() functions.
We can also implement Expand in DataFusion. Additionally, Grouping Sets, CUBE, and ROLLUP functionality can also be implemented through Expand in DataFusion. I think it will simplify the …
I would like to try the expand and rewrite approach to see how its performance differs from the current one.
There is already a rule …
And I think the current way that DataFusion implements group by GroupingSet is better than SparkSQL's Group By + Expand approach.
Yes, I'm aware of this.
Could you elaborate on this? My concern about the current GroupingSet implementation is the complexity it brings to …
Also, rewriting distinct with Expand would make memory tracking for Aggregate more feasible, and would make the aggregate state buffer more natural with the row format.
SparkSQL's … I think one reason for the slowness is that DataFusion's aggregation framework is more complex: it involves three data structures (the hashmap, the global group-aggregate-state vec, and the impacted-group-id index vec), has more complex control flow, and has a less efficient memory access pattern, especially when grouping by high-cardinality columns. Compared to that, DataBend's aggregation framework and control flow are more straightforward: the value in the hashmap is just a memory address, and the aggregate accumulators update that memory address directly. It is more unsafe and closer to the C++ way. I think they learned it from ClickHouse.
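To illustrate, very loosely, the "hash table value is just a memory address" style described above, here is a small safe-Rust sketch; it is not DataFusion, DataBend, or ClickHouse code, and `GroupedAgg`/`update` are made-up names. The map stores only a byte offset into one contiguous state arena, and each update writes that group's state in place (the real engines use raw pointers and unsafe code rather than offset arithmetic).

```rust
use std::collections::HashMap;

/// Fixed-size, in-place aggregate state per group: 8 bytes of count + 8 bytes of sum.
const STATE_SIZE: usize = 16;

struct GroupedAgg {
    /// group key -> byte offset of that group's state inside `arena`
    groups: HashMap<u64, usize>,
    /// all per-group aggregate states, laid out back to back
    arena: Vec<u8>,
}

impl GroupedAgg {
    fn new() -> Self {
        Self { groups: HashMap::new(), arena: Vec::new() }
    }

    /// Update `COUNT(*)` and `SUM(value)` for `key` in place, allocating the
    /// state on first sight of the key.
    fn update(&mut self, key: u64, value: i64) {
        let arena = &mut self.arena;
        let offset = *self.groups.entry(key).or_insert_with(|| {
            let offset = arena.len();
            arena.extend_from_slice(&[0u8; STATE_SIZE]);
            offset
        });

        // Go straight from the hash lookup to the state bytes and mutate them.
        let state = &mut arena[offset..offset + STATE_SIZE];
        let count = i64::from_ne_bytes(state[0..8].try_into().unwrap()) + 1;
        let sum = i64::from_ne_bytes(state[8..16].try_into().unwrap()) + value;
        state[0..8].copy_from_slice(&count.to_ne_bytes());
        state[8..16].copy_from_slice(&sum.to_ne_bytes());
    }
}

fn main() {
    let mut agg = GroupedAgg::new();
    for (key, value) in [(1, 10), (2, 5), (1, 3)] {
        agg.update(key, value);
    }
    // group 1: count = 2, sum = 13; group 2: count = 1, sum = 5
    assert_eq!(agg.groups.len(), 2);
}
```

The point of the sketch is only the control flow: the update goes directly from the hash lookup to the group's state, with no separate global state vector plus touched-group index list in between.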
Yes, there will be more rows, but I don't think there will be more data copies. The values in unrelated group columns will be nullified and the state buffer will be created from its initial state without copying the aggregate column.
We could talk about aggregate implementations/optimizations at #4973. And I believe Databend has made a lot of specific optimizations for the ClickHouse workload.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Queries like this (which compute distinct values for high cardinality columns) are currently relatively slow if there are many values of UserID. Here is a specific ClickBench query from the discussion on #5276.
Describe the solution you'd like
We could make this type of query faster. Hopefully we can collect ideas here.
Describe alternatives you've considered
TBD
Additional context
There are thoughts on improving aggregate performance in general in #4973.
This is one area where ClickHouse and DuckDB are particularly strong.
See #5276 and #5276 (comment)