
[SPARK-9689][SQL]Fix bug of not invalidate the cache for InsertIntoHadoopFsRelation #8023

Closed

Conversation

chenghao-intel
Contributor

We didn't refresh the cache (CacheManager) in InsertIntoHadoopFsRelation. Even after adding the refresh call, I noticed that the cached Spark plan is immutable (in CachedData), which is probably a bug if the underlying files are changed externally (added/deleted).

So I made PhysicalRDD mutable, and it now creates a new RDD whenever the doExecute() function is called.
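The idea can be sketched in a few lines of plain Scala (class and method names here are simplified stand-ins, not the real Spark internals): instead of capturing a materialized RDD once at construction, the plan node keeps a builder function and re-invokes it on every execution, so external file changes are picked up.

```scala
// Simplified stand-in for an InternalRow.
final case class Row(values: Seq[Any])

// Before: the scan result is captured at construction and never refreshed.
final class StaticScan(rows: Seq[Row]) {
  def execute(): Seq[Row] = rows
}

// After: each execute() call re-invokes the builder, so a scan that lists
// files inside the builder will see files added or deleted externally.
final class RebuildingScan(buildRows: () => Seq[Row]) {
  def execute(): Seq[Row] = buildRows()
}
```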

@chenghao-intel
Contributor Author

cc @liancheng

(a, f) =>
toCatalystRDD(l, a, t.buildScan(a.map(_.name).toArray, f, t.paths, confBroadcast))) :: Nil
(a, f) => {
t.refresh()
Contributor Author

Refresh the HadoopFsRelation right before building the RDD.

Contributor

Ah, this is really not acceptable. We can't afford a refresh for every data source table scan... I missed those two similar refresh() calls when merging #7696. I'm trying to figure out a more reasonable fix for this.

Contributor Author

After rethinking this, the refresh here can be removed: InsertIntoHadoopFsRelation refreshes the file status for us, so we don't need to do it again.

@chenghao-intel
Contributor Author

no logs?

@chenghao-intel
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 7, 2015

Test build #40156 has finished for PR 8023 at commit 0fcbda2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 7, 2015

Test build #40174 has finished for PR 8023 at commit 051c31e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chenghao-intel
Contributor Author

This fix probably causes some other failures; I will look at it tomorrow.

@@ -92,14 +92,20 @@ private[sql] case class LogicalRDD(
)
}

private[sql] object PhysicalRDD {
def apply(output: Seq[Attribute], rdd: RDD[InternalRow]): PhysicalRDD = {
Contributor

This can be made another constructor of PhysicalRDD.
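A minimal sketch of that suggestion (hypothetical simplified signatures, not the real Spark ones): the companion-object factory can be folded into an auxiliary constructor of the class itself.

```scala
// Simplified stand-in for a plan output attribute.
final case class Attribute(name: String)

final class PhysicalRDDLike(val output: Seq[Attribute], val nodeName: String) {
  // Auxiliary constructor covering the common "plain existing RDD" case,
  // replacing a separate companion-object apply method.
  def this(output: Seq[Attribute]) = this(output, "ExistingRDD")
}
```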

@@ -565,6 +565,7 @@ abstract class HadoopFsRelation private[sql](maybePartitionSpec: Option[Partitio
filters: Array[Filter],
inputPaths: Array[String],
broadcastedConf: Broadcast[SerializableConfiguration]): RDD[Row] = {
refresh()
Contributor Author

@liancheng it seems refreshing the file status is unavoidable; let's do it right before getting the input files.

Contributor

Yeah, I agree. Basically it's impossible to

  1. Create a temporary JSON table pointing to path P
  2. Change the contents by arbitrary means without notifying the temporary table
  3. Read the table again and expect to get updated contents

In the old JSON relation implementation, the refreshing logic was handled by TextInputFormat.listStatus(), while the new JSONRelation relies on HadoopFsRelation. We can use SqlNewHadoopRDD and override the input format there to inject a FileStatus cache and avoid the extra refreshing cost, similar to what we did in ParquetRelation.
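A stdlib-only sketch of that idea (the names below are simplified stand-ins for the Hadoop/Spark types, not the real API): the overridden input format answers listStatus() from a pre-computed FileStatus cache and only falls back to a real listing on a cache miss, so repeated scans avoid the refresh cost.

```scala
// Simplified stand-in for org.apache.hadoop.fs.FileStatus.
final case class FileStatus(path: String, length: Long)

trait InputFormatLike {
  def listStatus(dir: String): Seq[FileStatus]
}

final class CachedStatusInputFormat(
    cache: Map[String, Seq[FileStatus]],
    realListing: String => Seq[FileStatus]) extends InputFormatLike {
  // Serve the injected cache; hit the (expensive) filesystem only on a miss.
  override def listStatus(dir: String): Seq[FileStatus] =
    cache.getOrElse(dir, realListing(dir))
}
```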

Contributor Author

I agree we'd better provide our own InputFormat; at the least, it minimizes the directories being refreshed. But it probably requires a lot of code change, so can we do that in a separate PR?

Contributor Author

We also need to refresh the partition directories before pruning partitions; we probably need to think further about how to fix that as well, in follow-up PR(s).

Contributor

Yeah, I'll probably work on this later this week, it can be relatively tricky to handle...

@chenghao-intel
Contributor Author

no logs?

@chenghao-intel
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 11, 2015

Test build #40389 has finished for PR 8023 at commit 94d6804.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class MQTTUtils(object):

@@ -183,6 +183,16 @@ private[sql] case class InMemoryRelation(
batchStats).asInstanceOf[this.type]
}

private[sql] def withChild(newChild: SparkPlan): this.type = {
Contributor Author

@yhuai @liancheng After double-checking the source code, the Spark plan inside InMemoryRelation is a PhysicalRDD, which holds a data source scanning RDD instance as its property.

That's what I meant by not picking up the latest files under the path when the recache method is called: the RDD is already materialized and never changes. This PR re-creates the SparkPlan from the logical plan, so the DataSourceStrategy rebuilds the RDD based on the latest files.

See:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L99
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L312
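A miniature, stdlib-only model of the fix described above (hypothetical names, not the real Spark classes): the relation keeps the *logical* plan and re-plans it on recache, so the physical scan is rebuilt from the latest file listing instead of reusing the stale, already-materialized one.

```scala
// Simplified stand-ins for a logical plan node and its physical plan.
final case class LogicalScan(path: String)
final case class PhysicalScan(files: Seq[String])

final class MiniInMemoryRelation(
    logical: LogicalScan,
    listFiles: String => Seq[String]) {
  private var physical: PhysicalScan = plan()
  // Planning lists the files under the path at that moment.
  private def plan(): PhysicalScan = PhysicalScan(listFiles(logical.path))
  // Recache re-plans from the logical plan instead of reusing the old scan.
  def recache(): Unit = { physical = plan() }
  def scan: PhysicalScan = physical
}
```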

I've actually tried some other approaches for this fix:

  1. Update PhysicalRDD to take an RDD builder instead of the RDD itself as its property; however, this failed because it impacts the existing code too widely.
  2. Create a customized RDD which takes the path as a parameter (instead of the file status); however, this requires lots of interface changes in HadoopFsRelation, since inputFiles: Array[FileStatus] is widely used by buildScan, and partition pruning in particular is done in DataSourceStrategy, not HadoopFsRelation.

@rxin
Contributor

rxin commented Jun 15, 2016

I think this has been addressed in 2.0 with the introduction of refreshByPath.

@asfgit asfgit closed this in 1a33f2e Jun 15, 2016