
[SPARK-23301][SQL] data source column pruning should work for arbitrary expressions #20476

Closed · wants to merge 2 commits from cloud-fan/push-down into master

Conversation

@cloud-fan (Contributor)

What changes were proposed in this pull request?

This PR fixes a mistake in the `PushDownOperatorsToDataSource` rule: its column pruning logic handles `Project` incorrectly when the project list contains arbitrary expressions rather than bare column references.

How was this patch tested?

A new test case for column pruning with arbitrary expressions, plus improvements to the existing tests to make sure `PushDownOperatorsToDataSource` really works.

@@ -81,35 +81,34 @@ object PushDownOperatorsToDataSource extends Rule[LogicalPlan] with PredicateHel

// TODO: add more push down rules.

// TODO: nested fields pruning
def pushDownRequiredColumns(plan: LogicalPlan, requiredByParent: Seq[Attribute]): Unit = {
@cloud-fan (Contributor, Author) commented on the diff:
make it a private method instead of an inline method

def pushDownRequiredColumns(plan: LogicalPlan, requiredByParent: Seq[Attribute]): Unit = {
plan match {
case Project(projectList, child) =>
val required = projectList.filter(requiredByParent.contains).flatMap(_.references)
case _ => plan.children.foreach(child => pushDownRequiredColumns(child, child.output))
case relation: DataSourceV2Relation => relation.reader match {
case reader: SupportsPushDownRequiredColumns =>
val requiredColumns = relation.output.filter(requiredByParent.contains)
@cloud-fan (Contributor, Author) commented on the diff:
a cleaner way to retain the original case of attributes.
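To make the pruning mistake concrete, here is a minimal, self-contained sketch of the issue in the diff above. The types below (`Attr`, `Lit`, `Add`, `Alias`, `ColumnPruningDemo`) are toy stand-ins invented for illustration, not Spark's Catalyst classes, and the "fixed" line shows the spirit of the fix rather than the PR's exact code.

```scala
// Toy expression types for illustration only (not Spark's Catalyst classes).
sealed trait Expr { def references: Seq[Attr] }
case class Attr(name: String) extends Expr { def references: Seq[Attr] = Seq(this) }
case class Lit(value: Int) extends Expr { def references: Seq[Attr] = Nil }
case class Add(left: Expr, right: Expr) extends Expr {
  def references: Seq[Attr] = left.references ++ right.references
}
case class Alias(child: Expr, name: String) extends Expr {
  def references: Seq[Attr] = child.references
}

object ColumnPruningDemo extends App {
  // Models SELECT i + 1 AS x FROM t, where the parent plan requires column x.
  val projectList: Seq[Expr] = Seq(Alias(Add(Attr("i"), Lit(1)), "x"))
  val requiredByParent: Seq[Attr] = Seq(Attr("x"))

  // Buggy pruning: the Alias node never equals the bare attribute x, so the
  // filter keeps nothing and the relation would be asked for zero columns.
  val buggy = projectList.filter(requiredByParent.contains).flatMap(_.references)

  // The spirit of the fix: prune by the columns the project list references.
  val fixed = projectList.flatMap(_.references)

  println(s"buggy: $buggy") // List()
  println(s"fixed: $fixed") // List(Attr(i))
}
```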

@cloud-fan (Contributor, Author) commented:

cc @gatorsmile @rdblue most of the changes are tests.

@cloud-fan (Contributor, Author) commented:

@rdblue I know you wanna use PhysicalOperation to replace the current operator pushdown rule, but before we reach a consensus, I think we should still fix bugs in the existing code.

@SparkQA commented Feb 1, 2018:

Test build #86933 has finished for PR 20476 at commit 353dd6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor) commented Feb 1, 2018:

@cloud-fan, @gatorsmile, this PR demonstrates why we should use PhysicalOperation. I ported the tests from this PR over to our branch and they pass without modifying the push-down code. That's because it reuses code that we already trust.

I see no benefit to using a brand-new code path for push-down when we can use what is already well tested. I know you want to push other operations, but I've already raised concerns about the design of this new code: it is brittle because it requires matching specific plan nodes.

Push-down should work as it always has: by pushing nodes that are adjacent to relations in the logical plan and relying on the optimizer to push projections and filters down as far as possible. The separation of concerns into simple rules is fundamental to the design of the optimizer. I don't think there is a good argument for new code that breaks how the optimizer is intended to work.

cc @henryr, who might want to chime in.
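For readers unfamiliar with the pattern rdblue describes, here is a self-contained sketch of a `PhysicalOperation`-style extractor. The plan nodes and names below are toys invented for illustration, not Spark's actual classes; Spark's real `PhysicalOperation` also substitutes aliases through the projection chain, which this sketch omits.

```scala
// Toy plan nodes for illustration only (not Spark's Catalyst classes).
sealed trait Plan
case class Relation(table: String, output: Seq[String]) extends Plan
case class Project(columns: Seq[String], child: Plan) extends Plan
case class Filter(condition: String, child: Plan) extends Plan

// Collapses any Project/Filter chain above a relation into
// (columns to read, filter conditions to push, base relation): one shape a
// push-down rule can match without caring about the exact node layout.
object PhysicalOperationLike {
  def unapply(plan: Plan): Option[(Seq[String], Seq[String], Relation)] = plan match {
    case r: Relation => Some((r.output, Nil, r))
    case Project(cols, PhysicalOperationLike(_, filters, r)) => Some((cols, filters, r))
    case Filter(cond, PhysicalOperationLike(cols, filters, r)) => Some((cols, cond +: filters, r))
  }
}

object PushDownDemo extends App {
  val plan = Project(Seq("i"), Filter("j > 0", Relation("t", Seq("i", "j", "k"))))
  plan match {
    case PhysicalOperationLike(cols, filters, rel) =>
      // A real planner must also read columns referenced by pushed filters
      // (here j); the toy skips that bookkeeping.
      println(s"scan ${rel.table}: columns=$cols, pushed filters=$filters")
  }
}
```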

// After column pruning, we may have redundant PROJECT nodes in the query plan, remove them.
RemoveRedundantProject(filterPushed)
// TODO: there may be more operators can be used to calculate required columns, we can add
// more and more in the future.
@gatorsmile (Member) commented on the diff:
Nit: there may be more operators that can be used to calculate the required columns. We can add more and more in the future.
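For context on the `RemoveRedundantProject` call quoted above, here is a hedged, self-contained sketch of the idea (toy plan nodes, not Spark's actual rule): after column pruning, a `Project` that selects exactly its child's output is a no-op and can be dropped.

```scala
// Toy plan nodes for illustration only (not Spark's Catalyst classes).
sealed trait Plan { def output: Seq[String] }
case class Relation(table: String, output: Seq[String]) extends Plan
case class Project(columns: Seq[String], child: Plan) extends Plan {
  def output: Seq[String] = columns
}

object RemoveRedundantProjectDemo extends App {
  // Drops any Project whose column list is exactly its child's output.
  def removeRedundantProject(plan: Plan): Plan = plan match {
    case Project(cols, child) if cols == child.output => removeRedundantProject(child)
    case Project(cols, child) => Project(cols, removeRedundantProject(child))
    case other => other
  }

  // Once pruning has pushed i and j into the scan, the Project adds nothing:
  val pruned = Project(Seq("i", "j"), Relation("t", Seq("i", "j")))
  println(removeRedundantProject(pruned)) // Relation(t,List(i, j))
}
```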

@gatorsmile (Member) commented Feb 1, 2018:

@rdblue To be honest, the push-down solution in the current code base (which is based on PhysicalOperation) is not well designed. We have received a lot of feedback from the community (e.g., SAP and IBM Research): one group proposed a bottom-up solution, another a top-down solution. Neither is perfect.

In this release, we want to introduce a new solution to enhance the capability of operator push-down. The new code path is not stable yet, and we welcome the community to try it and provide more feedback.

@gatorsmile (Member) commented:

To everyone, this is a bug fix we should merge before the next RC of Spark 2.3.

@rdblue (Contributor) commented Feb 1, 2018:

@gatorsmile, thanks for the context. If we need to redesign push-down, then I think we should do that separately and with a design plan.

I don't think it's a good idea to bundle it into an unrelated API update.

For one thing, we want to be able to use the existing tests for the redesigned push-down strategy, not reimplement them in pieces. We also don't want to conflate the two changes for early adopters of the new API. V2 should be as reliable as possible by minimizing new behavior.

This just isn't the right place to test out experimental designs for push-down operations.

@gatorsmile (Member) commented:

@rdblue Operator pushdown is part of the data source API V2 SPIP: https://issues.apache.org/jira/browse/SPARK-15689

Based on the PR review history, it sounds like you also reviewed the proposal and the prototype. Since we are trying to finish the Spark 2.3 release, it might be too late to rewrite everything at the last minute.

When more users try it, we will get more feedback, and then we can have more discussion. Hopefully, in the next release, the community can reach a consensus on the design of operator push-down.

@rdblue (Contributor) commented Feb 1, 2018:

@gatorsmile, do you mean this?

“Extensibility is not good and operator push-down capabilities are limited.”

If so, that's very open to interpretation. I would assume it means that the V2 interfaces should support more than just projection and filter push-down, but not a redesign of how push-down happens in the optimizer. Even if it is called out as a goal, I now see it as a misguided choice.

But either way, you make a good point about changing things for a release. I'll defer to your judgement about what should be done for the release. But for the long term, I think this issue underscores my point about reusing code that already works. Let's separate DSv2 from a push-down redesign and get it working reliably without introducing more risk and unknown problems.

@gatorsmile (Member) commented Feb 1, 2018:

#19424 is the original PR that introduced the new rule `PushDownOperatorsToDataSource`. Both of us reviewed it. : )

Thank you for your understanding! We can have more design discussion in the next few months, after you have tried the new data source APIs. Code quality is always critical for Spark; we are trying to add more test cases to ensure the code is stable and well tested, even when we introduce new rules.

@rdblue (Contributor) commented Feb 1, 2018:

Yeah, I did review it, but at the time I wasn't familiar with how the other code paths worked and assumed it was necessary to introduce this. Since I wasn't confident about how it should work, I didn't +1 it.

There are a few telling comments though:

How do we know that there aren't more cases that need to be supported?

What are the guarantees made by the previous batches in the optimizer? The work done by FilterAndProject seems redundant to me because the optimizer should already push filters below projection. Is that not guaranteed by the time this runs?

In any case, I now think that we should not introduce a new push-down design in conjunction with DSv2. Let's get DSv2 working properly and redesign push-down separately. In parallel is fine by me.

@gatorsmile (Member) commented:

Since you are becoming more and more familiar with our code, I believe you can offer us more useful input.

Let me merge this PR to fix the bugs. Then we can have more detailed discussions later?

@SparkQA commented Feb 2, 2018:

Test build #86956 has finished for PR 20476 at commit 12c8035.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member) commented:

LGTM.

Thanks! Merged to master/2.3

asfgit pushed a commit that referenced this pull request Feb 2, 2018
[SPARK-23301][SQL] data source column pruning should work for arbitrary expressions

This PR fixes a mistake in the `PushDownOperatorsToDataSource` rule, the column pruning logic is incorrect about `Project`.

a new test case for column pruning with arbitrary expressions, and improve the existing tests to make sure the `PushDownOperatorsToDataSource` really works.

Author: Wenchen Fan <[email protected]>

Closes #20476 from cloud-fan/push-down.

(cherry picked from commit 19c7c7e)
Signed-off-by: gatorsmile <[email protected]>
asfgit closed this in 19c7c7e on Feb 2, 2018