
[SPARK-23301][SQL] data source column pruning should work for arbitrary expressions #20476

Closed · wants to merge 2 commits from cloud-fan/push-down into master

Conversation

@cloud-fan (Contributor)

What changes were proposed in this pull request?

This PR fixes a mistake in the `PushDownOperatorsToDataSource` rule: its column pruning logic handles `Project` incorrectly when the project list contains arbitrary expressions rather than bare column references.

How was this patch tested?

A new test case for column pruning with arbitrary expressions, plus improvements to the existing tests to make sure `PushDownOperatorsToDataSource` really works.

@@ -81,35 +81,34 @@ object PushDownOperatorsToDataSource extends Rule[LogicalPlan] with PredicateHel

// TODO: add more push down rules.

// TODO: nested fields pruning
def pushDownRequiredColumns(plan: LogicalPlan, requiredByParent: Seq[Attribute]): Unit = {
@cloud-fan (Contributor, Author) commented on the diff:
make it a private method instead of an inline method

def pushDownRequiredColumns(plan: LogicalPlan, requiredByParent: Seq[Attribute]): Unit = {
plan match {
case Project(projectList, child) =>
val required = projectList.filter(requiredByParent.contains).flatMap(_.references)
case _ => plan.children.foreach(child => pushDownRequiredColumns(child, child.output))
case relation: DataSourceV2Relation => relation.reader match {
case reader: SupportsPushDownRequiredColumns =>
val requiredColumns = relation.output.filter(requiredByParent.contains)
@cloud-fan (Contributor, Author) commented on the diff:
a cleaner way to retain the original case of attributes.
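To make the pruning mistake concrete, here is a minimal, self-contained sketch of the issue in the diff above. The types below (`Attr`, `Lit`, `Add`, `Alias`, `ColumnPruningDemo`) are toy stand-ins invented for illustration, not Spark's Catalyst classes, and the "fixed" line shows the spirit of the fix rather than the PR's exact code.

```scala
// Toy expression types for illustration only (not Spark's Catalyst classes).
sealed trait Expr { def references: Seq[Attr] }
case class Attr(name: String) extends Expr { def references: Seq[Attr] = Seq(this) }
case class Lit(value: Int) extends Expr { def references: Seq[Attr] = Nil }
case class Add(left: Expr, right: Expr) extends Expr {
  def references: Seq[Attr] = left.references ++ right.references
}
case class Alias(child: Expr, name: String) extends Expr {
  def references: Seq[Attr] = child.references
}

object ColumnPruningDemo extends App {
  // Models SELECT i + 1 AS x FROM t, where the parent plan requires column x.
  val projectList: Seq[Expr] = Seq(Alias(Add(Attr("i"), Lit(1)), "x"))
  val requiredByParent: Seq[Attr] = Seq(Attr("x"))

  // Buggy pruning: the Alias node never equals the bare attribute x, so the
  // filter keeps nothing and the relation would be asked for zero columns.
  val buggy = projectList.filter(requiredByParent.contains).flatMap(_.references)

  // The spirit of the fix: prune by the columns the project list references.
  val fixed = projectList.flatMap(_.references)

  println(s"buggy: $buggy") // List()
  println(s"fixed: $fixed") // List(Attr(i))
}
```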

@cloud-fan (Contributor, Author) commented:

cc @gatorsmile @rdblue most of the changes are tests.

@cloud-fan (Contributor, Author) commented:

@rdblue I know you wanna use PhysicalOperation to replace the current operator pushdown rule, but before we reach a consensus, I think we should still fix bugs in the existing code.

@SparkQA commented Feb 1, 2018:

Test build #86933 has finished for PR 20476 at commit 353dd6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor) commented Feb 1, 2018:

@cloud-fan, @gatorsmile, this PR demonstrates why we should use PhysicalOperation. I ported the tests from this PR over to our branch and they pass without modifying the push-down code. That's because it reuses code that we already trust.

I see no benefit to using a brand-new code path for push-down when we can use what is already well tested. I know you want to push other operations, but I've already raised concerns about the design of this new code: it is brittle because it requires matching specific plan nodes.

Push-down should work as it always has: by pushing nodes that are adjacent to relations in the logical plan and relying on the optimizer to push projections and filters down as far as possible. The separation of concerns into simple rules is fundamental to the design of the optimizer. I don't think there is a good argument for new code that breaks how the optimizer is intended to work.

cc @henryr, who might want to chime in.
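For readers unfamiliar with the pattern rdblue describes, here is a self-contained sketch of a `PhysicalOperation`-style extractor. The plan nodes and names below are toys invented for illustration, not Spark's actual classes; Spark's real `PhysicalOperation` also substitutes aliases through the projection chain, which this sketch omits.

```scala
// Toy plan nodes for illustration only (not Spark's Catalyst classes).
sealed trait Plan
case class Relation(table: String, output: Seq[String]) extends Plan
case class Project(columns: Seq[String], child: Plan) extends Plan
case class Filter(condition: String, child: Plan) extends Plan

// Collapses any Project/Filter chain above a relation into
// (columns to read, filter conditions to push, base relation): one shape a
// push-down rule can match without caring about the exact node layout.
object PhysicalOperationLike {
  def unapply(plan: Plan): Option[(Seq[String], Seq[String], Relation)] = plan match {
    case r: Relation => Some((r.output, Nil, r))
    case Project(cols, PhysicalOperationLike(_, filters, r)) => Some((cols, filters, r))
    case Filter(cond, PhysicalOperationLike(cols, filters, r)) => Some((cols, cond +: filters, r))
  }
}

object PushDownDemo extends App {
  val plan = Project(Seq("i"), Filter("j > 0", Relation("t", Seq("i", "j", "k"))))
  plan match {
    case PhysicalOperationLike(cols, filters, rel) =>
      // A real planner must also read columns referenced by pushed filters
      // (here j); the toy skips that bookkeeping.
      println(s"scan ${rel.table}: columns=$cols, pushed filters=$filters")
  }
}
```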

// After column pruning, we may have redundant PROJECT nodes in the query plan, remove them.
RemoveRedundantProject(filterPushed)
// TODO: there may be more operators can be used to calculate required columns, we can add
// more and more in the future.
@gatorsmile (Member) commented on the diff:
Nit: there may be more operators that can be used to calculate the required columns. We can add more and more in the future.
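For context on the `RemoveRedundantProject` call quoted above, here is a hedged, self-contained sketch of the idea (toy plan nodes, not Spark's actual rule): after column pruning, a `Project` that selects exactly its child's output is a no-op and can be dropped.

```scala
// Toy plan nodes for illustration only (not Spark's Catalyst classes).
sealed trait Plan { def output: Seq[String] }
case class Relation(table: String, output: Seq[String]) extends Plan
case class Project(columns: Seq[String], child: Plan) extends Plan {
  def output: Seq[String] = columns
}

object RemoveRedundantProjectDemo extends App {
  // Drops any Project whose column list is exactly its child's output.
  def removeRedundantProject(plan: Plan): Plan = plan match {
    case Project(cols, child) if cols == child.output => removeRedundantProject(child)
    case Project(cols, child) => Project(cols, removeRedundantProject(child))
    case other => other
  }

  // Once pruning has pushed i and j into the scan, the Project adds nothing:
  val pruned = Project(Seq("i", "j"), Relation("t", Seq("i", "j")))
  println(removeRedundantProject(pruned)) // Relation(t,List(i, j))
}
```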

@gatorsmile (Member) commented Feb 1, 2018:

@rdblue To be honest, the push-down solution in the current code base (which is based on PhysicalOperation) is not well designed. We have received a lot of feedback from the community (e.g., SAP and IBM Research): one group proposed a bottom-up solution, another a top-down solution. Neither is perfect.

In this release, we want to introduce a new solution to enhance the capability of operator push-down. The new code path is not stable yet, and we welcome the community to try it and provide more feedback.

@gatorsmile (Member) commented:

To everyone, this is a bug fix we should merge before the next RC of Spark 2.3.

@rdblue (Contributor) commented Feb 1, 2018:

@gatorsmile, thanks for the context. If we need to redesign push-down, then I think we should do that separately and with a design plan.

I don't think it's a good idea to bundle it into an unrelated API update.

For one thing, we want to be able to use the existing tests for the redesigned push-down strategy, not reimplement them in pieces. We also don't want to conflate the two changes for early adopters of the new API. V2 should be as reliable as possible by minimizing new behavior.

This just isn't the right place to test out experimental designs for push-down operations.

@gatorsmile (Member) commented:

@rdblue Operator pushdown is part of the data source API V2 SPIP: https://issues.apache.org/jira/browse/SPARK-15689

Based on the PR review history, it sounds like you also reviewed the proposal and the prototype. Since we are trying to finish the Spark 2.3 release, it might be too late to rewrite everything at the last minute.

When more users try it, we will get more feedback, and then we can have more discussion. Hopefully, in the next release, the community can reach a consensus on the design of operator push-down.

@rdblue (Contributor) commented Feb 1, 2018:

@gatorsmile, do you mean this?

“Extensibility is not good and operator push-down capabilities are limited.”

If so, that's very open to interpretation. I would assume it means that the V2 interfaces should support more than just projection and filter push-down, but not a redesign of how push-down happens in the optimizer. Even if it is called out as a goal, I now see it as a misguided choice.

But either way, you make a good point about changing things for a release. I'll defer to your judgement about what should be done for the release. But for the long term, I think this issue underscores my point about reusing code that already works. Let's separate DSv2 from a push-down redesign and get it working reliably without introducing more risk and unknown problems.

@gatorsmile (Member) commented Feb 1, 2018:

#19424 is the original PR that introduced the new rule `PushDownOperatorsToDataSource`. Both of us reviewed it. : )

Thank you for your understanding! We can have more design discussion in the next few months, after you have tried the new data source APIs. Code quality is always critical for Spark; we are trying to add more test cases to ensure the code is stable and well tested, even when we introduce new rules.

@rdblue (Contributor) commented Feb 1, 2018:

Yeah, I did review it, but at the time I wasn't familiar with how the other code paths worked and assumed it was necessary to introduce this. Since I wasn't confident about how it should work, I didn't +1 it.

There are a few telling comments though:

How do we know that there aren't more cases that need to be supported?

What are the guarantees made by the previous batches in the optimizer? The work done by FilterAndProject seems redundant to me because the optimizer should already push filters below projection. Is that not guaranteed by the time this runs?

In any case, I now think that we should not introduce a new push-down design in conjunction with DSv2. Let's get DSv2 working properly and redesign push-down separately. In parallel is fine by me.

@gatorsmile (Member) commented:

Since you are becoming more and more familiar with our code, I believe you can offer us more useful input.

Let me merge this PR to fix the bugs. Then we can have more detailed discussions later?

@SparkQA commented Feb 2, 2018:

Test build #86956 has finished for PR 20476 at commit 12c8035.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member) commented:

LGTM.

Thanks! Merged to master/2.3

asfgit pushed a commit that referenced this pull request Feb 2, 2018
[SPARK-23301][SQL] data source column pruning should work for arbitrary expressions

This PR fixes a mistake in the `PushDownOperatorsToDataSource` rule, the column pruning logic is incorrect about `Project`.

a new test case for column pruning with arbitrary expressions, and improve the existing tests to make sure the `PushDownOperatorsToDataSource` really works.

Author: Wenchen Fan <[email protected]>

Closes #20476 from cloud-fan/push-down.

(cherry picked from commit 19c7c7e)
Signed-off-by: gatorsmile <[email protected]>
asfgit closed this in 19c7c7e on Feb 2, 2018