Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23877][SQL][followup] use PhysicalOperation to simplify the handling of Project and Filter over partitioned relation #21111

Closed
wants to merge 2 commits into from

Conversation

cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Apr 20, 2018

What changes were proposed in this pull request?

A followup of #20988

PhysicalOperation can collect Project and Filters over a certain plan and substitute the alias with the original attributes in the bottom plan. We can use it in OptimizeMetadataOnlyQuery rule to handle the Project and Filter over partitioned relation.

How was this patch tested?

existing test

@cloud-fan
Copy link
Contributor Author

cc @rdblue @gatorsmile

val partitionData = fsRelation.location.listFiles(relFilters, Nil)
// partition data may be a stream, which can cause serialization to hit stack level too
// deep exceptions because it is a recursive structure in memory. converting to array
// avoids the problem.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is already fixed in https://issues.apache.org/jira/browse/SPARK-21884

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that does fix it but that's in a non-obvious way. What isn't clear is what guarantees that the rows used to construct the LocalRelation will never need to be serialized. Would it be reasonable for a future commit to remove the @transient modifier and re-introduce the problem?

I would rather this return the data in a non-recursive structure, but it's a minor point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be reasonable for a future commit to remove the @transient modifier and re-introduce the problem?

That's very unlikely. SPARK-21884 guarantees Spark won't serialize the rows and we have regression tests to protect us. BTW it would be a lot of work to make sure all the places that create LocalRelation do not use recursive structure. I'll add some comments to LocalRelation to emphasize it.

@@ -32,13 +33,22 @@ class OptimizeHiveMetadataOnlyQuerySuite extends QueryTest with TestHiveSingleto

import spark.implicits._

before {
override def beforeAll(): Unit = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this test suite to follow the existing style in OptimizeMetadataOnlyQuerySuite

@SparkQA
Copy link

SparkQA commented Apr 20, 2018

Test build #89607 has finished for PR 21111 at commit c3a6190.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@rdblue
Copy link
Contributor

rdblue commented Apr 20, 2018

Thanks for doing this as a follow-up. I had one minor comment, but otherwise I'm +1. I see what you mean about using PhysicalOperation now. It is slightly cleaner and I guess PhysicalOperation is the right way to accumulate the filters and projection from a sub-tree.

@SparkQA
Copy link

SparkQA commented Apr 20, 2018

Test build #89628 has finished for PR 21111 at commit c3a6190.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 23, 2018

Test build #89695 has finished for PR 21111 at commit d624955.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Apr 23, 2018

Test build #89706 has finished for PR 21111 at commit d624955.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

thanks, merging to master!

@asfgit asfgit closed this in f70f46d Apr 23, 2018
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Mar 7, 2019
…ndling of Project and Filter over partitioned relation

A followup of apache#20988

`PhysicalOperation` can collect Project and Filters over a certain plan and substitute the alias with the original attributes in the bottom plan. We can use it in `OptimizeMetadataOnlyQuery` rule to handle the Project and Filter over partitioned relation.

existing test

Author: Wenchen Fan <[email protected]>

Closes apache#21111 from cloud-fan/refactor.

(cherry picked from commit f70f46d)

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Oct 15, 2019
…ndling of Project and Filter over partitioned relation

A followup of apache#20988

`PhysicalOperation` can collect Project and Filters over a certain plan and substitute the alias with the original attributes in the bottom plan. We can use it in `OptimizeMetadataOnlyQuery` rule to handle the Project and Filter over partitioned relation.

existing test

Author: Wenchen Fan <[email protected]>

Closes apache#21111 from cloud-fan/refactor.

(cherry picked from commit f70f46d)

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants