-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-23877][SQL][followup] use PhysicalOperation to simplify the handling of Project and Filter over partitioned relation #21111
Conversation
…over partitioned relation
val partitionData = fsRelation.location.listFiles(relFilters, Nil) | ||
// partition data may be a stream, which can cause serialization to hit stack level too | ||
// deep exceptions because it is a recursive structure in memory. converting to array | ||
// avoids the problem. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is already fixed in https://issues.apache.org/jira/browse/SPARK-21884
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that does fix it but that's in a non-obvious way. What isn't clear is what guarantees that the rows used to construct the LocalRelation will never need to be serialized. Would it be reasonable for a future commit to remove the @transient
modifier and re-introduce the problem?
I would rather this return the data in a non-recursive structure, but it's a minor point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be reasonable for a future commit to remove the @transient modifier and re-introduce the problem?
That's very unlikely. SPARK-21884 guarantees Spark won't serialize the rows and we have regression tests to protect us. BTW it would be a lot of work to make sure all the places that create LocalRelation
do not use recursive structure. I'll add some comments to LocalRelation
to emphasize it.
@@ -32,13 +33,22 @@ class OptimizeHiveMetadataOnlyQuerySuite extends QueryTest with TestHiveSingleto | |||
|
|||
import spark.implicits._ | |||
|
|||
before { | |||
override def beforeAll(): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this test suite to follow the existing style in OptimizeMetadataOnlyQuerySuite
Test build #89607 has finished for PR 21111 at commit
|
retest this please |
Thanks for doing this as a follow-up. I had one minor comment, but otherwise I'm +1. I see what you mean about using |
Test build #89628 has finished for PR 21111 at commit
|
Test build #89695 has finished for PR 21111 at commit
|
retest this please |
Test build #89706 has finished for PR 21111 at commit
|
thanks, merging to master! |
…ndling of Project and Filter over partitioned relation A followup of apache#20988 `PhysicalOperation` can collect Project and Filters over a certain plan and substitute the alias with the original attributes in the bottom plan. We can use it in `OptimizeMetadataOnlyQuery` rule to handle the Project and Filter over partitioned relation. existing test Author: Wenchen Fan <[email protected]> Closes apache#21111 from cloud-fan/refactor. (cherry picked from commit f70f46d) Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala
…ndling of Project and Filter over partitioned relation A followup of apache#20988 `PhysicalOperation` can collect Project and Filters over a certain plan and substitute the alias with the original attributes in the bottom plan. We can use it in `OptimizeMetadataOnlyQuery` rule to handle the Project and Filter over partitioned relation. existing test Author: Wenchen Fan <[email protected]> Closes apache#21111 from cloud-fan/refactor. (cherry picked from commit f70f46d) Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala
What changes were proposed in this pull request?
A followup of #20988
PhysicalOperation
can collect Project and Filters over a certain plan and substitute the alias with the original attributes in the bottom plan. We can use it inOptimizeMetadataOnlyQuery
rule to handle the Project and Filter over partitioned relation.How was this patch tested?
existing test