
[SPARK-26673][SQL] File source V2 writes: create framework and migrate ORC #23601

Closed
wants to merge 6 commits

Conversation

gengliangwang
Member

@gengliangwang gengliangwang commented Jan 21, 2019

What changes were proposed in this pull request?

Create a framework for write path of File Source V2.
Also, migrate write path of ORC to V2.

Supported:

  • Write to file as Dataframe

Not Supported:

  • Partitioning, which is still under development in the data source V2 project.
  • Bucketing, which is still under development in the data source V2 project.
  • Catalog.

How was this patch tested?

Unit test
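
For illustration, a minimal sketch of the kind of write this framework now handles through the V2 path (the local-mode session and paths below are assumptions for the example, not part of the PR):

```scala
import org.apache.spark.sql.SparkSession

object OrcV2WriteExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("orc-v2-write-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "name")

    // A plain DataFrame write: covered by the new V2 write framework.
    df.write.mode("overwrite").orc("/tmp/orc_v2_demo")

    // A partitioned write: partitioning is not supported by the V2 framework yet,
    // so this still goes through the V1 code path.
    df.write.mode("overwrite").partitionBy("name").orc("/tmp/orc_v1_fallback")

    spark.stop()
  }
}
```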

@SparkQA

SparkQA commented Jan 21, 2019

Test build #101471 has finished for PR 23601 at commit 91689ac.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FileSourceWriter(
  • case class FileDataWriterFactory (
  • abstract class FileWriteBuilder(options: DataSourceOptions)
  • class OrcWriteBuilder(options: DataSourceOptions) extends FileWriteBuilder(options)

@gengliangwang
Member Author

Due to #21381, the write path is much easier to implement.

@gengliangwang gengliangwang changed the title [WIP][SPARK-26673] File source V2 write: create framework and migrate ORC to it [SPARK-26673][SQL] File source V2 write: create framework and migrate ORC to it Jan 21, 2019
@gengliangwang gengliangwang changed the title [SPARK-26673][SQL] File source V2 write: create framework and migrate ORC to it [SPARK-26673][SQL] File source V2 writes: create framework and migrate ORC Jan 21, 2019
@SparkQA

SparkQA commented Jan 21, 2019

Test build #101484 has finished for PR 23601 at commit 54893e0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 21, 2019

Test build #101486 has finished for PR 23601 at commit d3cd59d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 21, 2019

Test build #101488 has finished for PR 23601 at commit ebf4466.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Hi, @gengliangwang . Please check UT failure in your local environment first.

@gengliangwang
Member Author

Hi @dongjoon-hyun ,
I ran the ORC test cases before pushing the code.
After pushing, I found some comments that needed to be revised, so I had to push several times. That is why the tests were triggered multiple times.
Sorry about that. I will try to avoid this in the future.

@gengliangwang gengliangwang changed the title [SPARK-26673][SQL] File source V2 writes: create framework and migrate ORC [WIP][SPARK-26673][SQL] File source V2 writes: create framework and migrate ORC Jan 22, 2019
@dongjoon-hyun
Member

I got it~ And, thanks for the fix, @gengliangwang .

@SparkQA

SparkQA commented Jan 22, 2019

Test build #101547 has finished for PR 23601 at commit d6b7a95.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Oh, SparkR seems to complain for some reason.

1. Failure: Call DataFrameWriter.save() API in Java without path and check argument types (@test_sparkSQL.R#3552) 
error$message does not match "Error in orc : analysis error - path file:.*already exists".
Actual value: "Error in orc : java.lang.RuntimeException: data already exists.\n\tat 

@SparkQA

SparkQA commented Jan 23, 2019

Test build #101575 has finished for PR 23601 at commit 2ca90a7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Jan 23, 2019

Test build #101580 has finished for PR 23601 at commit 2ca90a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
 * Returns whether this format supports the given [[DataType]] in write path.
 * By default all data types are supported.
 */
def supportDataType(dataType: DataType): Boolean = true
Member Author

I will try to find a better solution for this. Mark this PR as WIP for now.

Member Author

I think we can implement the supportDataType API in another PR. This PR is ready for review.
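
For context, a hypothetical sketch (not part of this PR) of how a concrete format could override such an API once it lands, e.g. rejecting interval values while recursing into nested types:

```scala
import org.apache.spark.sql.types._

object SupportDataTypeSketch {
  // Recursively check nested fields; reject CalendarIntervalType as an example.
  def supportDataType(dataType: DataType): Boolean = dataType match {
    case CalendarIntervalType => false
    case st: StructType => st.fields.forall(f => supportDataType(f.dataType))
    case ArrayType(elementType, _) => supportDataType(elementType)
    case MapType(keyType, valueType, _) =>
      supportDataType(keyType) && supportDataType(valueType)
    case _ => true
  }
}
```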

@gengliangwang gengliangwang changed the title [WIP][SPARK-26673][SQL] File source V2 writes: create framework and migrate ORC [SPARK-26673][SQL] File source V2 writes: create framework and migrate ORC Jan 27, 2019
@dongjoon-hyun
Member

Ur, #23639 seems to make conflicts. Could you resolve the conflicts?

@SparkQA

SparkQA commented Jan 27, 2019

Test build #101732 has finished for PR 23601 at commit 5fda97e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FileBatchWrite(
  • abstract class FileWriteBuilder(options: DataSourceOptions)
  • case class FileWriterFactory (
  • class OrcWriteBuilder(options: DataSourceOptions) extends FileWriteBuilder(options)

@SparkQA

SparkQA commented Jan 28, 2019

Test build #101761 has finished for PR 23601 at commit 9538a1b.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 28, 2019

Test build #101762 has finished for PR 23601 at commit 5358ad4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FileBatchWrite(
  • abstract class FileWriteBuilder(options: DataSourceOptions)
  • case class FileWriterFactory (
  • class OrcWriteBuilder(options: DataSourceOptions) extends FileWriteBuilder(options)

committer: FileCommitProtocol)
extends BatchWrite {
override def commit(messages: Array[WriterCommitMessage]): Unit = {
committer.commitJob(job, messages.map(_.asInstanceOf[WriteTaskResult].commitMsg))
Contributor

shall we call FileFormatWriter.processStats here?
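
For reference, a rough sketch of what that suggestion could look like; the member names used here (`description.statsTrackers`, `summary.stats`) mirror the V1 `FileFormatWriter` path and are assumptions rather than the final code of this PR:

```scala
// Sketch only: assumes `job`, `description` and `committer` are members of
// FileBatchWrite, as on the V1 write path.
override def commit(messages: Array[WriterCommitMessage]): Unit = {
  val results = messages.map(_.asInstanceOf[WriteTaskResult])
  committer.commitJob(job, results.map(_.commitMsg))
  // Feed the per-task stats back to each WriteJobStatsTracker, which is what
  // FileFormatWriter.processStats does on the V1 path.
  val statsPerTask: Seq[Seq[WriteTaskStats]] = results.map(_.summary.stats)
  description.statsTrackers.zipWithIndex.foreach { case (tracker, i) =>
    tracker.processStats(statsPerTask.map(_(i)))
  }
}
```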

@SparkQA

SparkQA commented Jan 29, 2019

Test build #101819 has finished for PR 23601 at commit 2bdd73a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

f.fallBackFileFormat
case _ => lookupCls
}
// SPARK-26673: In Data Source V2 project, partitioning is still under development.
Member

@dongjoon-hyun dongjoon-hyun Jan 29, 2019

Shall we remove this (SPARK-26673) since this is the current PR's JIRA?

}
// SPARK-26673: In Data Source V2 project, partitioning is still under development.
// Here we fallback to V1 if the write path if output partitioning is required.
// TODO: use V2 implementations when partitioning feature is supported.
Member

Could you clearly mention what JIRA ID is for this TODO?

@@ -29,7 +29,7 @@ import org.apache.spark.sql.execution.datasources.v2.orc.OrcTable
* E.g, with temporary view `t` using [[FileDataSourceV2]], inserting into view `t` fails
* since there is no corresponding physical plan.
* SPARK-23817: This is a temporary hack for making current data source V2 work. It should be
* removed when write path of file data source v2 is finished.
* removed when Catalog of file data source v2 is finished.
Member

Catalog of file data source v2 is finished? Does this mean catalog support of file data source v2?

@SparkQA

SparkQA commented Jan 30, 2019

Test build #101869 has finished for PR 23601 at commit 31bc1b7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Jan 30, 2019

Test build #101883 has finished for PR 23601 at commit 31bc1b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case _ => lookupCls
}
// In Data Source V2 project, partitioning is still under development.
// Here we fallback to V1 if the write path if output partitioning is required.
Contributor

Here we fallback to V1 if partitioning columns are specified
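
A simplified sketch of the fallback being discussed; `lookupCls` and `fallBackFileFormat` are the names from the excerpt above, while `partitioningColumns` and the exact condition are assumed for illustration:

```scala
val cls = lookupCls.newInstance() match {
  case f: FileDataSourceV2 if partitioningColumns.nonEmpty =>
    // Partitioned writes are still under development in Data Source V2,
    // so fall back to the corresponding V1 FileFormat implementation.
    f.fallBackFileFormat
  case _ => lookupCls
}
```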

this
}

override def buildForBatch(): BatchWrite = {
Contributor

This method is too long; it would be better to separate it into multiple methods.
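
A hypothetical refactoring sketch of that suggestion; the helper names below are invented for illustration and are not the methods this PR ends up with:

```scala
override def buildForBatch(): BatchWrite = {
  val schema = validateInputs()                    // check schema, save mode and options
  val job = createJob()                            // set up the Hadoop Job and output path
  val committer = createCommitter(job)             // FileCommitProtocol for this write
  val description = createWriteJobDescription(schema, job)
  handleSaveMode(job, committer)                   // ErrorIfExists / Ignore / Overwrite / Append
  new FileBatchWrite(job, description, committer)
}
```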

@@ -56,18 +56,25 @@ case class WriteToDataSourceV2Exec(batchWrite: BatchWrite, query: SparkPlan)
val writerFactory = batchWrite.createBatchWriterFactory()
val useCommitCoordinator = batchWrite.useCommitCoordinator
val rdd = query.execute()
val messages = new Array[WriterCommitMessage](rdd.partitions.length)
// SPARK-23271 If we are attempting to write a zero partition rdd, create a dummy single
// partition rdd to make sure we at least set up one write task to write the metadata.
Contributor

It's ok for now, but we should improve it later:

  1. use a config to do it, it seems only file source need it
  2. or do it in FileBatchWrite.commit. If commit messages are empty, write a metadata file.
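
For reference, a sketch of the SPARK-23271 workaround described in the excerpt above (close to, but not exactly, the code in WriteToDataSourceV2Exec; `rdd` and `sparkContext` are the surrounding plan's members, and `InternalRow` is org.apache.spark.sql.catalyst.InternalRow):

```scala
// If the query produced zero partitions, still run one (empty) write task so
// the committer creates the job-level metadata, e.g. the _SUCCESS file.
val rddWithNonEmptyPartitions =
  if (rdd.partitions.isEmpty) {
    sparkContext.parallelize(Array.empty[InternalRow], numSlices = 1)
  } else {
    rdd
  }
```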

@cloud-fan
Contributor

LGTM except a few minor comments

@SparkQA

SparkQA commented Jan 31, 2019

Test build #101937 has finished for PR 23601 at commit 8a6a9b6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 31, 2019

Test build #101939 has finished for PR 23601 at commit 7bd1c09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in df4c53e Jan 31, 2019
HyukjinKwon pushed a commit that referenced this pull request Feb 16, 2019
…ast object in FileWriterFactory

## What changes were proposed in this pull request?

This is a followup PR to fix two issues in #23601:
1. The class `FileWriterFactory` contains `conf: SerializableConfiguration` as a member, which duplicates `WriteJobDescription.serializableHadoopConf`. Removing it reduces the broadcast task binary size by around 70KB.
2. The test suites `OrcV1QuerySuite`/`OrcV1PartitionDiscoverySuite` didn't change the configuration `SQLConf.USE_V1_SOURCE_WRITER_LIST` to `"orc"`. We should set the conf.

## How was this patch tested?

Unit test

Closes #23800 from gengliangwang/reduceWriteTaskSize.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
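
A hedged sketch of the kind of fix point 2 above describes, assuming a test suite that mixes in SharedSparkSession/SQLTestUtils (the test name and body are illustrative):

```scala
import org.apache.spark.sql.internal.SQLConf

test("write ORC through the V1 path when listed in USE_V1_SOURCE_WRITER_LIST") {
  withSQLConf(SQLConf.USE_V1_SOURCE_WRITER_LIST.key -> "orc") {
    withTempPath { dir =>
      spark.range(10).write.orc(dir.getCanonicalPath)
      assert(spark.read.orc(dir.getCanonicalPath).count() == 10)
    }
  }
}
```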
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…e ORC

## What changes were proposed in this pull request?

Create a framework for write path of File Source V2.
Also, migrate write path of ORC to V2.

Supported:
* Write to file as Dataframe

Not Supported:
* Partitioning, which is still under development in the data source V2 project.
* Bucketing, which is still under development in the data source V2 project.
* Catalog.

## How was this patch tested?

Unit test

Closes apache#23601 from gengliangwang/orc_write.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…ast object in FileWriterFactory

## What changes were proposed in this pull request?

This is a followup PR to fix two issues in apache#23601:
1. The class `FileWriterFactory` contains `conf: SerializableConfiguration` as a member, which duplicates `WriteJobDescription.serializableHadoopConf`. Removing it reduces the broadcast task binary size by around 70KB.
2. The test suites `OrcV1QuerySuite`/`OrcV1PartitionDiscoverySuite` didn't change the configuration `SQLConf.USE_V1_SOURCE_WRITER_LIST` to `"orc"`. We should set the conf.

## How was this patch tested?

Unit test

Closes apache#23800 from gengliangwang/reduceWriteTaskSize.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
null

case SaveMode.Overwrite =>
committer.deleteWithJob(fs, path, true)
Member

What happens if the path does not exist? It is possible that the underlying committer's deleteWithJob might not handle this case.

Member

        if (fs.exists(path)) {
          committer.deleteWithJob(fs, path, recursive = true)
        }

Member Author

@gengliangwang gengliangwang Feb 25, 2019

@gatorsmile I checked the source code. Actually, all the implementations (that I can see in the IDE) handle the case where the file path does not exist. But in InsertIntoHadoopFsRelationCommand, deleteWithJob is used as follows:

if (fs.exists(path) && !committer.deleteWithJob(fs, path, true)) {
  throw new IOException(s"Unable to clear partition " +
    s"directory $path prior to writing to it")
}

Should we follow it?

Contributor

yea let's follow it.

Member Author

OK, created #23889 for this.
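
Putting the agreed approach together, a sketch of what the V2 overwrite branch would look like after that follow-up, mirroring the InsertIntoHadoopFsRelationCommand snippet quoted above (a fragment of the save-mode handling; the error message wording is illustrative):

```scala
case SaveMode.Overwrite =>
  // Only delete when the path exists; some committers may not tolerate a
  // missing path in deleteWithJob.
  if (fs.exists(path) && !committer.deleteWithJob(fs, path, recursive = true)) {
    throw new IOException(s"Unable to clear output directory $path prior to writing to it")
  }
```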

mccheah pushed a commit to palantir/spark that referenced this pull request May 15, 2019
…t path before delete it

## What changes were proposed in this pull request?
This is a followup PR to resolve comment: apache#23601 (review)

When Spark writes DataFrame with "overwrite" mode, it deletes the output path before actual writes. To safely handle the case that the output path doesn't exist,  it is suggested to follow the V1 code by checking the existence.

## How was this patch tested?

Apply apache#23836 and run unit tests

Closes apache#23889 from gengliangwang/checkFileBeforeOverwrite.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: gatorsmile <[email protected]>