Support pushdown of count aggregation with distinct #8562

Merged (4 commits) on Sep 2, 2021

Conversation

alexjo2144
Member

@alexjo2144 alexjo2144 commented Jul 14, 2021

Moves the distinct aggregation rewriting optimizers to after PushAggregationIntoTableScan.

TODO: I'm not as familiar with testing around PlanOptimizers, is there a good way to test this?

#4323

@cla-bot cla-bot bot added the cla-signed label Jul 14, 2021
@findepi
Member

findepi commented Jul 15, 2021

TODO: I'm not as familiar with testing around PlanOptimizers, is there a good way to test this?

I'm afraid there isn't, since we don't have a mock connector for exercising aggregation pushdown SPI calls.
Unless you create one.

@findepi
Member

findepi commented Jul 15, 2021

Taking these words back. Since you're making ImplementCount support DISTINCT aggregations, you should be able to verify pushdown is happening in io.trino.plugin.jdbc.BaseJdbcConnectorTest#testAggregationPushdown (eg with PostgreSQL)

@findepi
Member

findepi commented Jul 16, 2021

@alexjo2144 can you please review build output?

@alexjo2144
Member Author

@alexjo2144 can you please review build output?

Yeah, I need to think about this a little more. There are some pushdowns that I'm still not fully understanding.

For example, with the original code SELECT min(DISTINCT regionkey) is fully pushed down as SELECT min("regionkey") AS "tmp" FROM (SELECT "regionkey" FROM "tpch"."nation" GROUP BY "regionkey"), which is a little odd. However, after my changes only the GROUP BY is pushed down, which is definitely not what we want.
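As a side note, the original rewrite is at least semantically safe, since the minimum over distinct values equals the minimum over all values. A quick illustrative check (toy Python, not Trino code):

```python
# min(DISTINCT x) == min(x): deduplicating the input cannot change the minimum,
# which is why rewriting min(DISTINCT regionkey) over a GROUP BY subquery is correct.
regionkeys = [0, 1, 1, 2, 2, 3, 3, 3]
assert min(set(regionkeys)) == min(regionkeys)
print(min(set(regionkeys)))  # 0
```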

@alexjo2144 alexjo2144 force-pushed the pushdown-count-distinct branch 7 times, most recently from 5086bb1 to cca1e07 on July 21, 2021 15:49
@losipiuk
Member

@findepi should be good now (I hope). Please give it another look.

@@ -0,0 +1,90 @@
/*
Member

Allows pushdown of multiple distinct aggregations as long as
only one non-count aggregation is distinct.

I may have asked about this before, but I don't remember the answer: why "only one non-count"?

Member

I am not sure what exactly @alexjo2144 had in mind here.
There are many constraints here, and definitely not all queries which satisfy the condition from the commit message will be fully pushed down.

E.g. this one will not:

select max(distinct regionkey), count(distinct regionkey), count(distinct nationkey) from nation;

I suggest just leaving the title line in the commit message and dropping the rest.
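To make this concrete, here is a toy sketch (hypothetical Python, not Trino's implementation) of the condition from the commit message, showing that the counterexample query above satisfies it even though it is not fully pushed down, i.e. the condition is necessary but not sufficient:

```python
def satisfies_condition(aggregations):
    """aggregations: list of (function_name, is_distinct) pairs.

    Models the commit-message condition: at most one non-count
    aggregation may be distinct.
    """
    non_count_distinct = [name for name, distinct in aggregations
                          if distinct and name != "count"]
    return len(non_count_distinct) <= 1

# The counterexample above satisfies the condition, yet is not fully pushed down:
counterexample = [("max", True), ("count", True), ("count", True)]
print(satisfies_condition(counterexample))  # True

# Two distinct non-count aggregations violate it:
print(satisfies_condition([("max", True), ("min", True)]))  # False
```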

@@ -282,8 +283,9 @@ public PostgreSqlClient(
this::quoted,
ImmutableSet.<AggregateFunctionRule<JdbcExpression>>builder()
.add(new ImplementCountAll(bigintTypeHandle))
.add(new ImplementCount(bigintTypeHandle))
Member

redundant move

@@ -118,18 +117,6 @@ protected TestTable createTableWithUnsupportedColumn()
return Optional.of(dataMappingTestSetup);
}

@Override
protected Optional<DataMappingTestSetup> filterCaseSensitiveDataMappingTestData(DataMappingTestSetup dataMappingTestSetup)
Member

Is this related?

Member

@findepi findepi left a comment

"core/trino-main/src/main/java/io/trino/sql/planner/PlanOptimizers.java"

@@ -627,19 +626,36 @@ public PlanOptimizers(
.addAll(projectionPushdownRules)
.add(new PushProjectionIntoTableScan(metadata, typeAnalyzer, scalarStatsCalculator))
.add(new RemoveRedundantIdentityProjections())
.add(new PushLimitThroughMarkDistinct())
Member

add a comment why this is inside pushIntoTableScanRulesExceptJoins set, otherwise looks like unintentional

maybe it shouldn't be here? instead, PushLimitThroughMarkDistinct could come together with MultipleDistinctAggregationToMarkDistinct?

Member

maybe it shouldn't be here? instead, PushLimitThroughMarkDistinct could come together with MultipleDistinctAggregationToMarkDistinct?

per @kasiafi

After removing MultipleDistinctAggregationToMarkDistinct from an IterativeOptimizer, there's no way that a MarkDistinctNode could be present at that stage, so the rule PushLimitThroughMarkDistinct can be also removed. However, I'd keep it there next to other PushLimit rules in case it's needed again in the future. Actually, the PushLimit rules could be extracted like projectionPushdownRules.

Member

never mind. Reworking after chat with @kasiafi

@@ -855,6 +871,7 @@ public PlanOptimizers(
// DetermineTableScanNodePartitioning is needed to ensure all table handles have proper partitioning determined
// Must run before AddExchanges
.add(new DetermineTableScanNodePartitioning(metadata, nodePartitioningManager, taskCountEstimator))
.add(new MultipleDistinctAggregationToMarkDistinct())
Member

We had MultipleDistinctAggregationToMarkDistinct once, and now we have it twice. What's the reason?

Member

Does not make sense.

@losipiuk
Copy link
Member

@kasiafi done (hopefully) as we discussed. Much simpler. Thanks. Let's see what CI says.

.add(new PushAggregationThroughOuterJoin())
.add(new ReplaceRedundantJoinWithSource()) // Run this after PredicatePushDown optimizer as it inlines filter constants
.add(new MultipleDistinctAggregationToMarkDistinct()) // Run this after aggregation pushdown
.addAll(limitPushdownRules)
Member

Suggested change
.addAll(limitPushdownRules)
.addAll(limitPushdownRules) // including through MarkDistinct

Member

@losipiuk losipiuk Aug 31, 2021

Actually, having this here results in test failures. @kasiafi looked deeply into this; the problem comes from an extra call of PushLimitThroughOuterJoin and the fact that applyJoin will not consume the same limit twice.
So without this line, the optimization went:

... Limit -> OuterJoin -> Scan(no limit)
(PushLimitThroughOuterJoin)
... Limit -> OuterJoin -> Limit -> Scan(no limit)
(PushLimitIntoTableScan)
... Limit -> OuterJoin -> Scan(with limit)

now join is pushable to TS

With this extra line, the optimization continues from where it previously completed:

... Limit -> OuterJoin -> Scan(with limit)
(PushLimitThroughOuterJoin)
... Limit -> OuterJoin -> Limit -> Scan(with limit)
(PushLimitIntoTableScan; applyJoin returns Optional.empty() because of https://github.com/trinodb/trino/blob/408476e5d4067db61324f773510c27731373b91b/plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/DefaultJdbcMetadata.java#L459-L461)
... Limit -> OuterJoin -> Limit -> Scan(with limit)

!!!!! join no longer pushable to table scan

This could be addressed with #9056 but I am not super convinced this is the way to go.
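The rule interaction described above can be sketched with a toy model (hypothetical Python, not Trino's actual rules): a plan is a list of node names, one rule duplicates a Limit below the outer join, and another absorbs a Limit only into a scan that does not carry a limit yet (mirroring the connector returning Optional.empty() the second time). Rerunning the limit rules then strands a Limit between the join and the scan:

```python
# Toy model of the plan rewrites discussed above (not Trino code).
# A plan is a list of node names from root to leaf.

def push_limit_through_outer_join(plan):
    # Duplicates the Limit below the OuterJoin, once.
    if "OuterJoin" in plan:
        i = plan.index("OuterJoin")
        if 0 < i and i + 1 < len(plan) and plan[i - 1] == "Limit" and plan[i + 1] != "Limit":
            return plan[:i + 1] + ["Limit"] + plan[i + 1:]
    return plan

def push_limit_into_table_scan(plan):
    # Absorbs a Limit into a plain Scan; a scan already carrying a limit
    # refuses a second one (Optional.empty() in the real connector).
    for i in range(len(plan) - 1):
        if plan[i] == "Limit" and plan[i + 1] == "Scan":
            return plan[:i] + ["Scan(limit)"] + plan[i + 2:]
    return plan

plan = ["Limit", "OuterJoin", "Scan"]
plan = push_limit_into_table_scan(push_limit_through_outer_join(plan))
print(plan)  # ['Limit', 'OuterJoin', 'Scan(limit)'] -> join directly above the scan

# An extra run of the limit rules strands a Limit between join and scan:
plan = push_limit_into_table_scan(push_limit_through_outer_join(plan))
print(plan)  # ['Limit', 'OuterJoin', 'Limit', 'Scan(limit)'] -> join pushdown blocked
```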

Member

@findepi, @kasiafi I suggest we go without limit pushdown here for now. It should not matter: whatever was to be pushed down should have been pushed down already.
Do you feel it is good to be merged?

Member

The change in PlanOptimizers looks OK.

.add(new RemoveRedundantIdentityProjections())
.add(new PushAggregationThroughOuterJoin())
.add(new ReplaceRedundantJoinWithSource()) // Run this after PredicatePushDown optimizer as it inlines filter constants
.add(new MultipleDistinctAggregationToMarkDistinct()) // Run this after aggregation pushdown
Member

... so that multiple distinct aggregations can be pushed into a connector

Member

I wonder whether we actually considered solving the problem differently. Instead of changing rule order, we could have a rule that matches MarkDistinct and invokes aggregation pushdown.

Trying this out could be a follow-up. If successful, we could simplify this class, presumably even rolling the changes back.

Member

MarkDistinct is not always a result of the MultipleDistinctAggregationToMarkDistinct rule, and when it is, it can be hard to identify the original aggregations because of intervening rules.
Also, applying the MultipleDistinctAggregationToMarkDistinct rule later might be beneficial from the viewpoint of decorrelation.

@losipiuk losipiuk merged commit d55bb1c into trinodb:master Sep 2, 2021
@ebyhr ebyhr added this to the 362 milestone Sep 13, 2021