
[SPARK-22252][SQL] FileFormatWriter should respect the input query schema #19474

Closed · wants to merge 2 commits

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

In #18064, we allowed RunnableCommand to have children in order to fix some UI issues. We then made the InsertIntoXXX commands take the input query as a child, and when we do the actual writing, we just pass the physical plan to the writer (FileFormatWriter.write).

However, this is problematic. In Spark SQL, the optimizer and planner are allowed to change the schema names slightly. For example, the ColumnPruning rule removes no-op Projects such as Project("A", Scan("a")), changing the output schema from <A: int> to <a: int>. At write time, especially for self-describing data formats like Parquet, we may write the wrong schema to the file and get null values on the read path.
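As a hypothetical miniature of the issue (these are not Spark's real classes, just a sketch of the shape of the bug): a ColumnPruning-style rule that drops a no-op Project silently changes the visible column name from "A" to "a".

```scala
// Hypothetical miniature of the schema-case bug (not Spark's real API).
sealed trait Plan { def output: Seq[String] }
case class Scan(cols: Seq[String]) extends Plan { def output: Seq[String] = cols }
case class Project(cols: Seq[String], child: Plan) extends Plan { def output: Seq[String] = cols }

// A ColumnPruning-style rule: drop a Project whose columns match the child's
// output case-insensitively, i.e. a "no-op" Project.
def pruneNoOpProject(plan: Plan): Plan = plan match {
  case Project(cols, child)
      if cols.map(_.toLowerCase) == child.output.map(_.toLowerCase) => child
  case other => other
}

val analyzed  = Project(Seq("A"), Scan(Seq("a")))  // the user asked for column "A"
val optimized = pruneNoOpProject(analyzed)

println(analyzed.output)   // List(A)  <- the schema the user expects
println(optimized.output)  // List(a)  <- the schema a writer given the optimized plan would see
```

If the writer takes its schema from the optimized plan, a self-describing format like Parquet records "a" instead of "A", which is why the fix reads the schema from the analyzed plan instead.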

Fortunately, in #18450 we decided to allow nested execution, so one query can map to multiple executions in the UI. This lifts the major restriction from #18604, and we no longer have to make the input query a child of the InsertIntoXXX commands.

The fix is simple: this PR partially reverts #18064 and makes the InsertIntoXXX commands leaf nodes again.

How was this patch tested?

A new regression test.

@cloud-fan
Contributor Author

cc @gatorsmile @viirya

@cloud-fan
Contributor Author

For a simple command like `Seq(1 -> "a").toDF("i", "j").write.parquet("/tmp/qwe")`, the UI before this PR:

[screenshot: SQL UI before this PR]

The UI after this PR:

[screenshot: SQL UI after this PR]

The scan node is no longer visible above the insert node; I'll fix that later. The writer bug is more important and should be fixed ASAP.

@SparkQA

SparkQA commented Oct 11, 2017

Test build #82638 has finished for PR 19474 at commit 3b1174f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait Command extends LeafNode
  • trait RunnableCommand extends Command
  • case class ExecutedCommandExec(cmd: RunnableCommand) extends LeafExecNode

@SparkQA

SparkQA commented Oct 11, 2017

Test build #82639 has finished for PR 19474 at commit 0667ac8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait Command extends LeafNode
  • trait RunnableCommand extends Command
  • case class ExecutedCommandExec(cmd: RunnableCommand) extends LeafExecNode

@SparkQA

SparkQA commented Oct 11, 2017

Test build #82640 has finished for PR 19474 at commit 9d4c7a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait Command extends LeafNode
  • trait RunnableCommand extends Command
  • case class ExecutedCommandExec(cmd: RunnableCommand) extends LeafExecNode

@viirya
Member

viirya commented Oct 12, 2017

The scan node is no longer visible above the insert node, I'll fix this later. The writer bug is more important and we should fix it ASAP.

Totally agreed. LGTM

@viirya
Member

viirya commented Oct 12, 2017

I like this change; the relationship between ExecutedCommandExec and RunnableCommand was a bit entangled before.

@@ -117,7 +117,7 @@ object FileFormatWriter extends Logging {
     job.setOutputValueClass(classOf[InternalRow])
     FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath))

-    val allColumns = plan.output
+    val allColumns = queryExecution.logical.output
I think it'd be good to leave a comment that we should not use the optimized output here, in case it gets changed in the future.


Btw, shall we use queryExecution.analyzed.output?

@viirya
Member

viirya commented Oct 12, 2017

Minor comments. LGTM


test("FileFormatWriter should respect the input query schema") {
  withTable("t1", "t2") {
    spark.range(1).select('id as 'col1, 'id as 'col2).write.saveAsTable("t1")
Also add another case here?

spark.range(1).select('id, 'id as 'col1, 'id as 'col2).write.saveAsTable("t3")

def query: LogicalPlan

// We make the input `query` an inner child instead of a child in order to hide it from the
// optimizer. This is because optimizer may change the output schema names, and we have to keep

You will scare others. :)

-> may not preserve the output schema names' case

@@ -117,7 +117,7 @@ object FileFormatWriter extends Logging {
     job.setOutputValueClass(classOf[InternalRow])
     FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath))

-    val allColumns = plan.output
+    val allColumns = queryExecution.logical.output
Explicitly using the analyzed plan's schema is better here.

@@ -30,6 +31,15 @@ import org.apache.spark.util.SerializableConfiguration
  */
trait DataWritingCommand extends RunnableCommand {

  def query: LogicalPlan
Add a one-line description for query?

-    val allColumns = plan.output
+    // Pick the attributes from analyzed plan, as optimizer may not preserve the output schema
+    // names' case.
+    val allColumns = queryExecution.analyzed.output
     val partitionSet = AttributeSet(partitionColumns)
You might need to double-check that the partitionColumns in all the other files are also taken from analyzed plans.

@gatorsmile
Member

LGTM pending Jenkins.

@SparkQA

SparkQA commented Oct 12, 2017

Test build #82661 has finished for PR 19474 at commit 5bdaf7d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

Thanks for the review, merging to master!

@asfgit asfgit closed this in 274f0ef Oct 12, 2017
asfgit pushed a commit that referenced this pull request Oct 13, 2017
## What changes were proposed in this pull request?

This is a minor follow-up of #19474.

#19474 partially reverted #18064 but accidentally introduced a behavior change. `Command` extended `LogicalPlan` before #18064, but #19474 made it extend `LeafNode`. This is an internal behavior change: now no `Command` subclass can define children, and each has to implement the `computeStatistic` method.

This PR fixes that by making `Command` extend `LogicalPlan` again.
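As a hypothetical miniature of the difference (simplified traits, not Spark's real ones): a leaf-node trait pins children to Nil, so making `Command` extend it silently forbids subclasses from declaring children, while extending the plan base trait leaves that open.

```scala
// Hypothetical miniature (not Spark's real traits) of the LeafNode vs
// LogicalPlan distinction.
trait LogicalPlan { def children: Seq[LogicalPlan] = Nil }
trait LeafNode extends LogicalPlan {
  final override def children: Seq[LogicalPlan] = Nil  // pinned: no overrides allowed
}

// With Command extending LogicalPlan (the state restored by this follow-up),
// a subclass may still declare children:
trait Command extends LogicalPlan
case class CommandWithChild(child: LogicalPlan) extends Command {
  override def children: Seq[LogicalPlan] = Seq(child)
}

case class Dummy() extends LeafNode

val c = CommandWithChild(Dummy())
println(c.children.length) // 1
// Had Command extended LeafNode instead, the override above would not compile,
// which is the internal behavior change this follow-up undoes.
```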

## How was this patch tested?

N/A

Author: Wenchen Fan <[email protected]>

Closes #19493 from cloud-fan/minor.