[SPARK-23301][SQL] data source column pruning should work for arbitrary expressions #20476
Conversation
@@ -81,35 +81,34 @@ object PushDownOperatorsToDataSource extends Rule[LogicalPlan] with PredicateHel
// TODO: add more push down rules.
// TODO: nested fields pruning
def pushDownRequiredColumns(plan: LogicalPlan, requiredByParent: Seq[Attribute]): Unit = {
make it a private method instead of an inline method
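A minimal sketch of the suggested refactor, assuming the enclosing rule object from the diff header above (the bodies are elided; this is not the actual patch):

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, PredicateHelper}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object PushDownOperatorsToDataSource extends Rule[LogicalPlan] with PredicateHelper {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    // ... filter push-down elided ...
    pushDownRequiredColumns(plan, plan.output)
    plan
  }

  // Hoisted out of apply() as a private method on the rule object,
  // instead of a def nested inline.
  private def pushDownRequiredColumns(plan: LogicalPlan, requiredByParent: Seq[Attribute]): Unit = {
    // recursion over the plan, as shown in the surrounding diff
  }
}
```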
def pushDownRequiredColumns(plan: LogicalPlan, requiredByParent: Seq[Attribute]): Unit = {
  plan match {
    case Project(projectList, child) =>
      val required = projectList.filter(requiredByParent.contains).flatMap(_.references)
This line is wrong and I fixed it in https://github.com/apache/spark/pull/20476/files#diff-b7f3810e65a2bb1585de9609ea491469R93
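To spell out the bug (a before/after sketch, paraphrasing the linked fix rather than quoting it): `projectList` holds `NamedExpression`s, so filtering it against `requiredByParent` (a `Seq[Attribute]`) silently drops aliased expressions, and the columns they reference are never requested from the child.

```scala
// Before (buggy): an Alias such as ('i + 1).as("x") is not an Attribute, so it
// never matches requiredByParent, and the column it references ('i) is lost.
val requiredBefore = projectList.filter(requiredByParent.contains).flatMap(_.references)

// After (roughly, as I read the linked fix): take the references of everything
// the Project computes, so arbitrary expressions keep their inputs alive.
val requiredAfter = projectList.flatMap(_.references)
```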
case _ => plan.children.foreach(child => pushDownRequiredColumns(child, child.output))
case relation: DataSourceV2Relation => relation.reader match {
  case reader: SupportsPushDownRequiredColumns =>
    val requiredColumns = relation.output.filter(requiredByParent.contains)
a cleaner way to retain the original case of attributes.
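A gloss on "retain the original case" (my reading, not the author's words): building the pruned schema by filtering `relation.output` keeps the relation's own attribute instances, so with a case-insensitive analyzer the schema handed to the reader carries the source-declared column names rather than whatever casing appeared in the query.

```scala
// Sketch: the relation's attributes are the source of truth for name casing,
// so filter them by membership instead of reusing requiredByParent directly.
val requiredColumns = relation.output.filter(requiredByParent.contains)
reader.pruneColumns(requiredColumns.toStructType) // field names as the source declared them
```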
cc @gatorsmile @rdblue most of the changes are tests.
@rdblue I know you wanna use `PhysicalOperation`.
Test build #86933 has finished for PR 20476 at commit
@cloud-fan, @gatorsmile, this PR demonstrates why we should use PhysicalOperation. I ported the tests from this PR over to our branch and they pass without modifying the push-down code. That's because it reuses code that we already trust. I see no benefit to using a brand new code path for push-down when we can use what is already well tested.

I know you want to push other operations, but I've already raised concerns about the design of this new code: it is brittle because it requires matching specific plan nodes. Push-down should work as it always has: by pushing nodes that are adjacent to relations in the logical plan and relying on the optimizer to push projections and filters down as far as possible. The separation of concerns into simple rules is fundamental to the design of the optimizer. I don't think there is a good argument for new code that breaks how the optimizer is intended to work.

cc @henryr, who might want to chime in.
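For readers unfamiliar with the extractor being advocated here, a minimal sketch (the extractor and plan nodes are Spark's; the rule body is illustrative, not a proposed patch): `PhysicalOperation` collapses any stack of `Project`/`Filter` nodes sitting on top of a relation, so a rule written in this style matches a single shape and inherits the optimizer's existing pruning behavior.

```scala
import org.apache.spark.sql.catalyst.planning.PhysicalOperation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation

// Illustrative only: match the (projects, filters, relation) triple that
// PhysicalOperation extracts, however the nodes are stacked in the plan.
def describePushDown(plan: LogicalPlan): Unit = plan match {
  case PhysicalOperation(projects, filters, relation: DataSourceV2Relation) =>
    // `projects` is the output actually needed above the relation;
    // `filters` are the conjunctive predicates collected on the way down.
    // A rule in this style hands both to the reader, as pre-V2 sources do.
    println(s"prune to ${projects.map(_.name)}, push ${filters.size} filter(s) into $relation")
  case _ => // not a scan with adjacent Project/Filter nodes
}
```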
// After column pruning, we may have redundant PROJECT nodes in the query plan, remove them.
RemoveRedundantProject(filterPushed)
// TODO: there may be more operators can be used to calculate required columns, we can add
// more and more in the future.
Nit: there may be more operators that can be used to calculate the required columns. We can add more and more in the future.
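As an aside on the `RemoveRedundantProject(filterPushed)` call in the diff above, here is a minimal sketch of what that cleanup amounts to (my reconstruction for illustration, not the actual Spark rule): once the reader itself prunes columns, a `Project` that merely re-lists its child's output does nothing and can be dropped.

```scala
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule

// Reconstruction: a Project whose list is exactly the child's output is
// redundant after push-down, so replace it with its child.
object RemoveRedundantProjectSketch extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Project(projectList, child) if projectList == child.output => child
  }
}
```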
@rdblue To be honest, the push-down solution in the current code base (which is based on `PhysicalOperation`) is not ideal. In this release, we want to introduce a new solution to enhance the capability of operator push-down. The new code path is not stable yet. We welcome the community to try it and provide more feedback about it.
To everyone: this is a bug fix we should merge before the next RC of Spark 2.3.
@gatorsmile, thanks for the context. If we need to redesign push-down, then I think we should do that separately and with a design plan. I don't think it's a good idea to bundle it into an unrelated API update.

For one thing, we want to be able to use the existing tests for the redesigned push-down strategy, not reimplement them in pieces. We also don't want to conflate the two changes for early adopters of the new API. V2 should be as reliable as possible by minimizing new behavior. This just isn't the right place to test out experimental designs for push-down operations.
@rdblue Operator push-down is part of the data source API V2 SPIP: https://issues.apache.org/jira/browse/SPARK-15689. Based on the PR review history, it sounds like you also reviewed the proposal and the prototype. Since we are trying to finish the release of Spark 2.3, it might be too late to rewrite everything at the last minute. When more users try it, we will get more feedback about it. Then we can have more discussion. Hopefully, in the next release, the community can reach consensus on the design of operator push-down.
@gatorsmile, do you mean this?
If so, that's very open to interpretation. I would assume it means that the V2 interfaces should support more than just projection and filter push-down, but not a redesign of how push-down happens in the optimizer. Even if it is called out as a goal, I now see it as a misguided choice.

But either way, you make a good point about changing things for a release. I'll defer to your judgement about what should be done for the release. But for the long term, I think this issue underscores my point about reusing code that already works. Let's separate DSv2 from a push-down redesign and get it working reliably without introducing more risk and unknown problems.
#19424 is the original PR that introduced the new rule `PushDownOperatorsToDataSource`. Thank you for your understanding! We can have more design discussion in the next few months after you have tried the new data source APIs. Code quality is always critical for Spark. We are trying to add more test cases to ensure the code is stable and well tested, even when we introduce new rules/code.
Yeah, I did review it, but at the time I wasn't familiar with how the other code paths worked and assumed that it was necessary to introduce this. Since I wasn't sure how it should work, I didn't +1 it. There are a few telling comments, though:
In any case, I now think that we should not introduce a new push-down design in conjunction with DSv2. Let's get DSv2 working properly and redesign push-down separately. In parallel is fine by me.
Since you are getting more and more familiar with our code base, I believe you can offer us more useful input. Let me merge this PR to fix the bugs. Then we can have more detailed discussions later?
Test build #86956 has finished for PR 20476 at commit
LGTM. Thanks! Merged to master/2.3.
[SPARK-23301][SQL] data source column pruning should work for arbitrary expressions

This PR fixes a mistake in the `PushDownOperatorsToDataSource` rule: the column pruning logic was incorrect for `Project`. Tested with a new test case for column pruning with arbitrary expressions, plus improvements to the existing tests to make sure `PushDownOperatorsToDataSource` really works.

Author: Wenchen Fan <[email protected]>

Closes #20476 from cloud-fan/push-down.

(cherry picked from commit 19c7c7e)
Signed-off-by: gatorsmile <[email protected]>
What changes were proposed in this pull request?

This PR fixes a mistake in the `PushDownOperatorsToDataSource` rule: the column pruning logic was incorrect for `Project`.

How was this patch tested?

A new test case for column pruning with arbitrary expressions, plus improvements to the existing tests to make sure `PushDownOperatorsToDataSource` really works.
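To make "arbitrary expressions" concrete, here is a hypothetical test in the spirit of the ones this PR adds. The source name (`AdvancedDataSourceV2`, assumed to be a two-column `i`/`j` source producing `i` in 0..9), the `checkAnswer` helper, and the `extractV2Reader`/`requiredSchema` inspection are assumptions for illustration, not quotes from the patch:

```scala
import org.apache.spark.sql.Row
import spark.implicits._

// Hypothetical two-column (i, j) DataSourceV2 test source.
val df = spark.read.format(classOf[AdvancedDataSourceV2].getName).load()

// The projection is an arbitrary expression over 'i', not a bare column
// reference. Before this fix the pruning rule only recognized plain
// attributes, so 'j' would still be requested; after it, only 'i' is.
val q = df.select($"i" + 1 as "x")
checkAnswer(q, (0 until 10).map(i => Row(i + 1)))

// Assumed helper: dig the DataSourceV2 reader out of the executed plan and
// verify that column pruning actually reached it.
val reader = extractV2Reader(q) // hypothetical helper
assert(reader.requiredSchema.fieldNames.toSeq == Seq("i"))
```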