
[SPARK-9689][SQL]Fix bug of not invalidate the cache for InsertIntoHadoopFsRelation #8023

Closed

Conversation

chenghao-intel
Contributor

We didn't refresh the cache (CacheManager) in InsertIntoHadoopFsRelation. Even after adding the refresh call, I noticed that the cached Spark plan is immutable (in CachedData), which is probably a bug if the underlying files are changed externally (added/deleted).

So I made PhysicalRDD mutable, and it now creates a new RDD whenever the doExecute() function is called.
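The idea can be sketched in a few lines of plain Scala (class and method names here are simplified stand-ins, not the real Spark internals): instead of capturing a materialized RDD once at construction, the plan node keeps a builder function and re-invokes it on every execution, so external file changes are picked up.

```scala
// Simplified stand-in for an InternalRow.
final case class Row(values: Seq[Any])

// Before: the scan result is captured at construction and never refreshed.
final class StaticScan(rows: Seq[Row]) {
  def execute(): Seq[Row] = rows
}

// After: each execute() call re-invokes the builder, so a scan that lists
// files inside the builder will see files added or deleted externally.
final class RebuildingScan(buildRows: () => Seq[Row]) {
  def execute(): Seq[Row] = buildRows()
}
```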

@chenghao-intel
Contributor Author

cc @liancheng

(a, f) =>
toCatalystRDD(l, a, t.buildScan(a.map(_.name).toArray, f, t.paths, confBroadcast))) :: Nil
(a, f) => {
t.refresh()
Contributor Author

Refresh the HadoopFsRelation right before building the RDD.

Contributor

Ah, this is really not acceptable. We can't afford a refresh for every data source table scan... I missed those two similar refresh() calls when merging #7696. I'm trying to figure out a more reasonable fix for this.

Contributor Author

After rethinking this, the refresh here can be removed: InsertIntoHadoopFsRelation refreshes the file status for us, so we don't need to do it again.

@chenghao-intel
Contributor Author

no logs?

@chenghao-intel
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 7, 2015

Test build #40156 has finished for PR 8023 at commit 0fcbda2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 7, 2015

Test build #40174 has finished for PR 8023 at commit 051c31e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chenghao-intel
Contributor Author

This fix probably causes some other failures; I will look at it tomorrow.

@@ -92,14 +92,20 @@ private[sql] case class LogicalRDD(
)
}

private[sql] object PhysicalRDD {
def apply(output: Seq[Attribute], rdd: RDD[InternalRow]): PhysicalRDD = {
Contributor

This can be made another constructor of PhysicalRDD.
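A minimal sketch of that suggestion (hypothetical simplified signatures, not the real Spark ones): the companion-object factory can be folded into an auxiliary constructor of the class itself.

```scala
// Simplified stand-in for a plan output attribute.
final case class Attribute(name: String)

final class PhysicalRDDLike(val output: Seq[Attribute], val nodeName: String) {
  // Auxiliary constructor covering the common "plain existing RDD" case,
  // replacing a separate companion-object apply method.
  def this(output: Seq[Attribute]) = this(output, "ExistingRDD")
}
```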

@@ -565,6 +565,7 @@ abstract class HadoopFsRelation private[sql](maybePartitionSpec: Option[Partitio
filters: Array[Filter],
inputPaths: Array[String],
broadcastedConf: Broadcast[SerializableConfiguration]): RDD[Row] = {
refresh()
Contributor Author

@liancheng it seems refreshing the file status is unavoidable; let's do it right before getting the input files.

Contributor

Yeah, I agree. Basically it's impossible to

  1. Create a temporary JSON table pointing to path P
  2. Change the contents by arbitrary means without notifying the temporary table
  3. Read the table again and expect to get updated contents

In the old JSON relation implementation, the refreshing logic was handled by TextInputFormat.listStatus(), while the new JSONRelation relies on HadoopFsRelation. We can use SqlNewHadoopRDD and override the input format there to inject a FileStatus cache and avoid the extra refreshing cost, similar to what we did in ParquetRelation.
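A stdlib-only sketch of that idea (the names below are simplified stand-ins for the Hadoop/Spark types, not the real API): the overridden input format answers listStatus() from a pre-computed FileStatus cache and only falls back to a real listing on a cache miss, so repeated scans avoid the refresh cost.

```scala
// Simplified stand-in for org.apache.hadoop.fs.FileStatus.
final case class FileStatus(path: String, length: Long)

trait InputFormatLike {
  def listStatus(dir: String): Seq[FileStatus]
}

final class CachedStatusInputFormat(
    cache: Map[String, Seq[FileStatus]],
    realListing: String => Seq[FileStatus]) extends InputFormatLike {
  // Serve the injected cache; hit the (expensive) filesystem only on a miss.
  override def listStatus(dir: String): Seq[FileStatus] =
    cache.getOrElse(dir, realListing(dir))
}
```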

Contributor Author

I agree we'd better provide our own InputFormat; at the least, it minimizes the directories being refreshed. But it probably requires a lot of code change, so can we do that in a separate PR?

Contributor Author

We also need to refresh the partition directories before pruning partitions; we probably need to think further about how to fix that as well, in follow-up PR(s).

Contributor

Yeah, I'll probably work on this later this week, it can be relatively tricky to handle...

@chenghao-intel
Contributor Author

no logs?

@chenghao-intel
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 11, 2015

Test build #40389 has finished for PR 8023 at commit 94d6804.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class MQTTUtils(object):

@@ -183,6 +183,16 @@ private[sql] case class InMemoryRelation(
batchStats).asInstanceOf[this.type]
}

private[sql] def withChild(newChild: SparkPlan): this.type = {
Contributor Author

@yhuai @liancheng After double-checking the source code, the Spark plan inside InMemoryRelation is a PhysicalRDD, which holds a data source scanning RDD instance as its property.

That's what I meant by not picking up the latest files under the path when the recache method is called: the RDD is already materialized and never changes. This PR re-creates the SparkPlan from the logical plan, so the DataSourceStrategy rebuilds the RDD based on the latest files.

See:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L99
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L312
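A miniature, stdlib-only model of the fix described above (hypothetical names, not the real Spark classes): the relation keeps the *logical* plan and re-plans it on recache, so the physical scan is rebuilt from the latest file listing instead of reusing the stale, already-materialized one.

```scala
// Simplified stand-ins for a logical plan node and its physical plan.
final case class LogicalScan(path: String)
final case class PhysicalScan(files: Seq[String])

final class MiniInMemoryRelation(
    logical: LogicalScan,
    listFiles: String => Seq[String]) {
  private var physical: PhysicalScan = plan()
  // Planning lists the files under the path at that moment.
  private def plan(): PhysicalScan = PhysicalScan(listFiles(logical.path))
  // Recache re-plans from the logical plan instead of reusing the old scan.
  def recache(): Unit = { physical = plan() }
  def scan: PhysicalScan = physical
}
```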

I've actually tried some other approaches for this fix:

  1. Update PhysicalRDD to take an RDD builder instead of the RDD itself as its property; however, this failed because it impacts the existing code too widely.
  2. Create a customized RDD which takes the path as a parameter (instead of the file status); however, this requires lots of interface changes in HadoopFsRelation, since inputFiles: Array[FileStatus] is widely used by buildScan, and partition pruning in particular is done in DataSourceStrategy, not HadoopFsRelation.

@rxin
Contributor

rxin commented Jun 15, 2016

I think this has been addressed in 2.0 with the introduction of refreshByPath.

@asfgit asfgit closed this in 1a33f2e Jun 15, 2016