Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incorrect results for aggregation functions on case-sensitive types #8551

Merged

Conversation

hashhar
Copy link
Member

@hashhar hashhar commented Jul 14, 2021

Pushdown of aggregation functions to case-insensitive databases (e.g. MySQL) on case-sensitive inputs (e.g. VARCHAR/CHAR) can lead to incorrect results.

This change disables aggregation pushdown when:

  • Any of the grouping sets is a case-sensitive type
  • If the aggregation function is case-sensitive (e.g. max/min but not count) and any of the inputs to the aggregation function is a case-sensitive type

It also introduces a toggle to restore previous incorrect behaviour.

Fixes #7320

@hashhar hashhar added the WIP label Jul 14, 2021
@cla-bot cla-bot bot added the cla-signed label Jul 14, 2021
@hashhar hashhar force-pushed the hashhar/case-insensitive-aggregation-pushdown branch from 1aadeef to 16800a0 Compare July 16, 2021 09:01
@Override
public boolean supportsAggregationPushdown(ConnectorSession session, JdbcTableHandle table, List<AggregateFunction> aggregates, Map<String, ColumnHandle> assignments, List<List<ColumnHandle>> groupingSets)
{
// Postgres sorts textual types differently compared to Trino so we cannot safely pushdown any aggregations which take a text type as an input or as part of grouping set
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it's about sorting (as we believe it is), limitation doesn't apply to GROUPING SETs.
Also it applies to some aggregation functions (min, max), but not others count, or count(DISTINCT)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The part about difference between aggregation functions came up before but I couldn't find a nice way to model it (other than introduce a field to AggregateFunction and having each of them declare if they depend on case-sensitivity for correctness).

For the grouping set - thanks for catching. Postgres indeed doesn't need the grouping set check - only the aggregate function check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re how to model - since this is function specific, it could be handled by io.trino.plugin.jdbc.expression.AggregateFunctionRule#rewrite returning empty or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I tried initially but it's a bit errorprone since each module has their own rewrites. I can try to move as many to base-jdbc as possible in a preparatory commit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, to confirm, you mean that the grouping sets condition can be handled in supportsAggregation like now.
And instead of handling the functions in supportsAggregation I should handle them in the rewrite rules?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I tried initially but it's a bit errorprone since each module has their own rewrites. I can try to move as many to base-jdbc as possible in a preparatory commit.

That's unlikely to work. We have separate rewrites because there were some differences.

Also, to confirm, you mean that the grouping sets condition can be handled in supportsAggregation like now.

i didn't read that part, so "maybe"

And instead of handling the functions in supportsAggregation I should handle them in the rewrite rules?

that's my intuition

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented the approach you suggested - it works and thankfully the only case-sensitive rewrites live in base-jdbc already.

public boolean supportsAggregationPushdown(ConnectorSession session, JdbcTableHandle table, List<AggregateFunction> aggregates, Map<String, ColumnHandle> assignments, List<List<ColumnHandle>> groupingSets)
{
// Remote database can be case insensitive.
return preventTextualTypeAggregationPushdown(aggregates, groupingSets);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count(a_varchar) should still be pushed down, as case sensitivity (or more broadly: collations), doesn't impact results.

@hashhar hashhar removed the WIP label Jul 19, 2021
@hashhar
Copy link
Member Author

hashhar commented Jul 19, 2021

PTAL @findepi @wendigo

I can add a session property to restore performance at the user's own discretion - what should we call it? (if we do this then I'll also address #7022 as part of it).

@hashhar hashhar marked this pull request as ready for review July 19, 2021 09:06
@hashhar hashhar requested review from wendigo and findepi July 19, 2021 09:06
@hashhar
Copy link
Member Author

hashhar commented Jul 22, 2021

@wendigo @findepi Gentle ping. 🙂

@@ -55,6 +64,11 @@
JdbcColumnHandle columnHandle = (JdbcColumnHandle) context.getAssignment(input.getName());
verify(columnHandle.getColumnType().equals(aggregateFunction.getOutputType()));

// Remote database is case insensitive or sorts values differently from Trino
if (!isRemoteCaseSensitive && (columnHandle.getColumnType() instanceof CharType || columnHandle.getColumnType() instanceof VarcharType)) {
return Optional.empty();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's based on a wrong type.

For example, PostgreSQL's money and enum types are mapped to Trino varchar, while both will have different sorting properties.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the rewrite rule is generic I think the only solution is to allow connectors to pass a list of jdbcTypeName through the constructor for types which should not be pushed down?

Or maybe add a static function to JdbcClient called isCollationSensitive(JdbcColumnHandle) and let connectors define their own impls? This method already exists in the PostgreSQL client btw and may be useful for others where we can pass explicit collations (MySQL once we drop 5.x).

Copy link
Member

@findepi findepi Jul 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, i think we already use "is mapped to char or varchar" as a way to determine if it's potentially case insensitive (or collation-sensitive).
it may catch too much, but shouldn't catch too little, so it's fine.

leave as is


public ImplementMinMax(boolean isRemoteCaseSensitive)
{
this.isRemoteCaseSensitive = isRemoteCaseSensitive;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not exactly about case sensitivity. For example, by default PostgreSQL is case-sensitive, but still sorts differently than Trino, so we shouldn't push min/max on varchar.

BTW do PostgresQL min/max accept COLLATE?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the complaint here is that the variable is named in a misleading way? Maybe isCollationSensitive would be better?

Functions don't accept explicit collations - I couldn't find any examples and the variations I tried led to syntax errors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the complaint here is that the variable is named in a misleading way

yes

@@ -195,7 +195,7 @@ public void testAggregationPushdown()
// GROUP BY above TopN
assertConditionallyPushedDown(
getSession(),
"SELECT clerk, sum(totalprice) FROM (SELECT clerk, totalprice FROM orders ORDER BY orderdate ASC, totalprice ASC LIMIT 10) GROUP BY clerk",
"SELECT custkey, sum(totalprice) FROM (SELECT custkey, totalprice FROM orders ORDER BY orderdate ASC, totalprice ASC LIMIT 10) GROUP BY custkey",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clerk is varchar and part of grouping set so it prevents pushdown after the change. Changed to use a numeric column since we just want to test GROUP BY + TOPN.

@@ -949,6 +950,24 @@ protected WriteMapping legacyToWriteMapping(@SuppressWarnings("unused") Connecto
throw new TrinoException(NOT_SUPPORTED, "Unsupported column type: " + type.getDisplayName());
}

protected boolean preventTextualTypeAggregationPushdown(List<AggregateFunction> aggregates, List<List<ColumnHandle>> groupingSets)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be static?

(that would help understand the state flow)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, should've been static from the start. thanks.


// case-insensitive functions can still be pushed down as long as grouping sets are not case-sensitive
assertThat(query("SELECT count(a_string), count(a_char) FROM " + table.getName())).isFullyPushedDown();
assertThat(query("SELECT count(a_string), count(a_char) FROM " + table.getName() + " GROUP BY a_bigint")).isFullyPushedDown();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add a correctness cases

  • count(DISTINCT a_string)
  • count(DISTINCT a_string), count(DISTINCT a_bigint)` (together)

this could help avoid any regressions in #8562 cc @alexjo2144

@@ -949,6 +950,24 @@ protected WriteMapping legacyToWriteMapping(@SuppressWarnings("unused") Connecto
throw new TrinoException(NOT_SUPPORTED, "Unsupported column type: " + type.getDisplayName());
}

protected boolean preventTextualTypeAggregationPushdown(List<AggregateFunction> aggregates, List<List<ColumnHandle>> groupingSets)
{
// The remote database may be case-insensitive or sort textual types differently than Trino.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remote database may be case-insensitive

nit: Use "Remote database can be case insensitive" words, so that similar places in the code are searchable.

Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

% earlier feedback

@hashhar hashhar force-pushed the hashhar/case-insensitive-aggregation-pushdown branch from 504c1ae to 9e7dd3c Compare July 27, 2021 09:29
@hashhar
Copy link
Member Author

hashhar commented Jul 27, 2021

@wendigo PTAL. Applied @findepi 's comments.

@wendigo
Copy link
Contributor

wendigo commented Jul 27, 2021

LGTM @hashhar - thanks for working on that

@hashhar
Copy link
Member Author

hashhar commented Aug 2, 2021

Rebasing to make sure nothing broke since the upgrade. Will merge once CI finishes.

Some databases are case-insensitive (MySQL, SQL Server) while others
sort textual types differently compared to Trino (PostgreSQL). For such
databases pushdown of aggregation functions when the grouping set
includes a textual type can lead to incorrect results. So we prevent
aggregation pushdown for such cases.
We also prevent pushdown for functions whose results depend on sort
order (min/max) when the input is a textual type.
@hashhar hashhar force-pushed the hashhar/case-insensitive-aggregation-pushdown branch from 9e7dd3c to 6473c84 Compare August 2, 2021 15:44
@hashhar
Copy link
Member Author

hashhar commented Aug 2, 2021

CI hit #8719 and #8432

@hashhar hashhar merged commit ee57029 into trinodb:master Aug 2, 2021
@hashhar hashhar added this to the 361 milestone Aug 2, 2021
@hashhar hashhar deleted the hashhar/case-insensitive-aggregation-pushdown branch August 2, 2021 17:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Incorrect aggregation pushdown for case insensitive columns in JDBC some connectors
3 participants