-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix incorrect results for aggregation functions on case-sensitive types #8551
Fix incorrect results for aggregation functions on case-sensitive types #8551
Conversation
1aadeef
to
16800a0
Compare
@Override | ||
public boolean supportsAggregationPushdown(ConnectorSession session, JdbcTableHandle table, List<AggregateFunction> aggregates, Map<String, ColumnHandle> assignments, List<List<ColumnHandle>> groupingSets) | ||
{ | ||
// Postgres sorts textual types differently compared to Trino so we cannot safely pushdown any aggregations which take a text type as an input or as part of grouping set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it's about sorting (as we believe it is), limitation doesn't apply to GROUPING SETs.
Also it applies to some aggregation functions (min
, max
), but not others count
, or count(DISTINCT)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. The part about difference between aggregation functions came up before but I couldn't find a nice way to model it (other than introduce a field to AggregateFunction and having each of them declare if they depend on case-sensitivity for correctness).
For the grouping set - thanks for catching. Postgres indeed doesn't need the grouping set check - only the aggregate function check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re how to model - since this is function specific, it could be handled by io.trino.plugin.jdbc.expression.AggregateFunctionRule#rewrite
returning empty
or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I tried initially but it's a bit errorprone since each module has their own rewrites. I can try to move as many to base-jdbc as possible in a preparatory commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, to confirm, you mean that the grouping sets condition can be handled in supportsAggregation
like now.
And instead of handling the functions in supportsAggregation
I should handle them in the rewrite rules?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I tried initially but it's a bit errorprone since each module has their own rewrites. I can try to move as many to base-jdbc as possible in a preparatory commit.
That's unlikely to work. We have separate rewrites because there were some differences.
Also, to confirm, you mean that the grouping sets condition can be handled in supportsAggregation like now.
i didn't read that part, so "maybe"
And instead of handling the functions in supportsAggregation I should handle them in the rewrite rules?
that's my intuition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented the approach you suggested - it works and thankfully the only case-sensitive rewrites live in base-jdbc already.
public boolean supportsAggregationPushdown(ConnectorSession session, JdbcTableHandle table, List<AggregateFunction> aggregates, Map<String, ColumnHandle> assignments, List<List<ColumnHandle>> groupingSets) | ||
{ | ||
// Remote database can be case insensitive. | ||
return preventTextualTypeAggregationPushdown(aggregates, groupingSets); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
count(a_varchar)
should still be pushed down, as case sensitivity (or more broadly: collations), doesn't impact results.
plugin/trino-clickhouse/src/main/java/io/trino/plugin/clickhouse/ClickHouseClient.java
Show resolved
Hide resolved
@@ -55,6 +64,11 @@ | |||
JdbcColumnHandle columnHandle = (JdbcColumnHandle) context.getAssignment(input.getName()); | |||
verify(columnHandle.getColumnType().equals(aggregateFunction.getOutputType())); | |||
|
|||
// Remote database is case insensitive or sorts values differently from Trino | |||
if (!isRemoteCaseSensitive && (columnHandle.getColumnType() instanceof CharType || columnHandle.getColumnType() instanceof VarcharType)) { | |||
return Optional.empty(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's based on a wrong type.
For example, PostgreSQL's money and enum types are mapped to Trino varchar, while both will have different sorting properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the rewrite rule is generic I think the only solution is to allow connectors to pass a list of jdbcTypeName
through the constructor for types which should not be pushed down?
Or maybe add a static function to JdbcClient called isCollationSensitive(JdbcColumnHandle)
and let connectors define their own impls? This method already exists in the PostgreSQL client btw and may be useful for others where we can pass explicit collations (MySQL once we drop 5.x).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, i think we already use "is mapped to char or varchar" as a way to determine if it's potentially case insensitive (or collation-sensitive).
it may catch too much, but shouldn't catch too little, so it's fine.
leave as is
|
||
public ImplementMinMax(boolean isRemoteCaseSensitive) | ||
{ | ||
this.isRemoteCaseSensitive = isRemoteCaseSensitive; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not exactly about case sensitivity. For example, by default PostgreSQL is case-sensitive, but still sorts differently than Trino, so we shouldn't push min/max on varchar.
BTW do PostgresQL min/max accept COLLATE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the complaint here is that the variable is named in a misleading way? Maybe isCollationSensitive
would be better?
Functions don't accept explicit collations - I couldn't find any examples and the variations I tried led to syntax errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the complaint here is that the variable is named in a misleading way
yes
@@ -195,7 +195,7 @@ public void testAggregationPushdown() | |||
// GROUP BY above TopN | |||
assertConditionallyPushedDown( | |||
getSession(), | |||
"SELECT clerk, sum(totalprice) FROM (SELECT clerk, totalprice FROM orders ORDER BY orderdate ASC, totalprice ASC LIMIT 10) GROUP BY clerk", | |||
"SELECT custkey, sum(totalprice) FROM (SELECT custkey, totalprice FROM orders ORDER BY orderdate ASC, totalprice ASC LIMIT 10) GROUP BY custkey", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why change here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clerk is varchar and part of grouping set so it prevents pushdown after the change. Changed to use a numeric column since we just want to test GROUP BY + TOPN.
@@ -949,6 +950,24 @@ protected WriteMapping legacyToWriteMapping(@SuppressWarnings("unused") Connecto | |||
throw new TrinoException(NOT_SUPPORTED, "Unsupported column type: " + type.getDisplayName()); | |||
} | |||
|
|||
protected boolean preventTextualTypeAggregationPushdown(List<AggregateFunction> aggregates, List<List<ColumnHandle>> groupingSets) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be static?
(that would help understand the state flow)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, should've been static from the start. thanks.
|
||
// case-insensitive functions can still be pushed down as long as grouping sets are not case-sensitive | ||
assertThat(query("SELECT count(a_string), count(a_char) FROM " + table.getName())).isFullyPushedDown(); | ||
assertThat(query("SELECT count(a_string), count(a_char) FROM " + table.getName() + " GROUP BY a_bigint")).isFullyPushedDown(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add a correctness cases
count(DISTINCT a_string)
count(DISTINCT a_string),
count(DISTINCT a_bigint)` (together)
this could help avoid any regressions in #8562 cc @alexjo2144
@@ -949,6 +950,24 @@ protected WriteMapping legacyToWriteMapping(@SuppressWarnings("unused") Connecto | |||
throw new TrinoException(NOT_SUPPORTED, "Unsupported column type: " + type.getDisplayName()); | |||
} | |||
|
|||
protected boolean preventTextualTypeAggregationPushdown(List<AggregateFunction> aggregates, List<List<ColumnHandle>> groupingSets) | |||
{ | |||
// The remote database may be case-insensitive or sort textual types differently than Trino. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remote database may be case-insensitive
nit: Use "Remote database can be case insensitive" words, so that similar places in the code are searchable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
% earlier feedback
504c1ae
to
9e7dd3c
Compare
LGTM @hashhar - thanks for working on that |
Rebasing to make sure nothing broke since the upgrade. Will merge once CI finishes. |
Some databases are case-insensitive (MySQL, SQL Server) while others sort textual types differently compared to Trino (PostgreSQL). For such databases pushdown of aggregation functions when the grouping set includes a textual type can lead to incorrect results. So we prevent aggregation pushdown for such cases. We also prevent pushdown for functions whose results depend on sort order (min/max) when the input is a textual type.
9e7dd3c
to
6473c84
Compare
Pushdown of aggregation functions to case-insensitive databases (e.g. MySQL) on case-sensitive inputs (e.g. VARCHAR/CHAR) can lead to incorrect results.
This change disables aggregation pushdown when:
It also introduces a toggle to restore previous incorrect behaviour.Fixes #7320