
[SPARK-21165] [SQL] [2.2] Use executedPlan instead of analyzedPlan in INSERT AS SELECT #18386

Closed
wants to merge 5 commits

Conversation

@gatorsmile (Member) commented Jun 22, 2017

What changes were proposed in this pull request?

The input query schema of INSERT AS SELECT can be changed by optimization. For example, the following query's output schema is changed by the rules `SimplifyCasts` and `RemoveRedundantAliases`.

 SELECT word, length, cast(first as string) as first FROM view1

This PR fixes the issue in Spark 2.2. Instead of using the analyzed plan of the input query, it uses the executed plan to determine the attributes in `FileFormatWriter`.

The related issue in the master branch has already been fixed by #18064. After this PR is merged, I will submit a separate PR to port the test case to master.
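
For concreteness, a minimal sketch of the kind of statement that hits this (not the PR's actual test case): the destination table `target` and its schema are assumptions for illustration, while `view1` and the SELECT come from the description, with `first` assumed to already be a string so the cast is a no-op.

```scala
// Hypothetical spark-shell sketch; `target` is an assumed destination table.
spark.sql("CREATE TABLE target (word STRING, length INT, first STRING) USING parquet")

// During optimization, SimplifyCasts can drop the no-op cast and
// RemoveRedundantAliases can drop the alias, so the optimized output
// attributes no longer match the analyzed schema handed to the writer.
spark.sql(
  """INSERT OVERWRITE TABLE target
    |SELECT word, length, cast(first as string) as first FROM view1""".stripMargin)
```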

How was this patch tested?

Added a test case

@gatorsmile (Member, Author):

cc @cloud-fan

@gatorsmile changed the title from "[SPARK-21165] [SQL] [2.2] Use executedPlan instead of analyzedPlan" to "[SPARK-21165] [SQL] [2.2] Use executedPlan instead of analyzedPlan in INSERT AS SELECT" on Jun 22, 2017
@SparkQA commented Jun 22, 2017

Test build #78435 has started for PR 18386 at commit 86ac975.

  val partitionSet = AttributeSet(partitionColumns)
- val dataColumns = queryExecution.logical.output.filterNot(partitionSet.contains)
+ val dataColumns = queryExecution.executedPlan.output.filterNot(partitionSet.contains)
Contributor:

nit: use allColumns here
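
Presumably the nit amounts to reusing the `allColumns` value introduced by this patch (see the `FileFormatWriter` hunk further down) instead of reading `queryExecution.executedPlan.output` a second time; a sketch of that shape, not the exact patch code:

```scala
// Sketch of the suggested shape inside FileFormatWriter.write; the names used
// here come from the hunks shown in this PR.
val allColumns = queryExecution.executedPlan.output
val partitionSet = AttributeSet(partitionColumns)
val dataColumns = allColumns.filterNot(partitionSet.contains)
```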

@cloud-fan (Contributor):

retest this please

@cloud-fan (Contributor):

LGTM, pending jenkins

@SparkQA commented Jun 22, 2017

Test build #78436 has finished for PR 18386 at commit 00a63b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

hmm, seems a legitimate test failure...

@SparkQA commented Jun 22, 2017

Test build #78441 has finished for PR 18386 at commit 00a63b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987 (Contributor):

Jenkins has been unstable recently.

@jiangxb1987 (Contributor):

retest this please

@SparkQA commented Jun 22, 2017

Test build #78458 has finished for PR 18386 at commit 00a63b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member, Author):

Weird. The test case passed in my local environment. Need to do more investigation.

@gatorsmile changed the title from "[SPARK-21165] [SQL] [2.2] Use executedPlan instead of analyzedPlan in INSERT AS SELECT" to "[SPARK-21165] [SQL] [2.2] Use executedPlan instead of analyzedPlan in INSERT AS SELECT [WIP]" on Jun 22, 2017
@SparkQA commented Jun 22, 2017

Test build #78473 has finished for PR 18386 at commit dfc8884.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 23, 2017

Test build #78502 has started for PR 18386 at commit 08015c8.

@SparkQA commented Jun 23, 2017

Test build #78503 has started for PR 18386 at commit bb8348d.

  val adjustedColumns = tableCols.map { col =>
-   query.resolve(Seq(col), resolver).getOrElse {
+   query.resolve(Seq(col), resolver).map(Alias(_, col)()).getOrElse {
@gatorsmile (Member, Author):

We need to add an alias to force the query to preserve the original column names of the table schema, whose case could differ from that of the underlying query schema.
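
A hedged illustration of this point, not code from the patch: with a case-insensitive resolver, a table column spelled `FIRST` still resolves against a query attribute named `first`, and wrapping the result in an `Alias` pins the output name to the table's spelling.

```scala
import org.apache.spark.sql.catalyst.analysis.Resolver
import org.apache.spark.sql.catalyst.expressions.{Alias, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical helper: resolve a table column against the query and re-alias it
// so downstream code sees the table's column name (and case), not whatever name
// the query happened to produce.
def resolvePreservingTableName(
    query: LogicalPlan,
    tableCol: String,
    resolver: Resolver): Option[NamedExpression] =
  query.resolve(Seq(tableCol), resolver).map(attr => Alias(attr, tableCol)())
```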

Contributor:

ah good catch!

@SparkQA commented Jun 23, 2017

Test build #78500 has finished for PR 18386 at commit fdd254e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member, Author):

retest this please

@SparkQA commented Jun 23, 2017

Test build #78511 has finished for PR 18386 at commit bb8348d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jun 23, 2017
… INSERT AS SELECT [WIP]

### What changes were proposed in this pull request?

The input query schema of INSERT AS SELECT can be changed by optimization. For example, the following query's output schema is changed by the rules `SimplifyCasts` and `RemoveRedundantAliases`.
```SQL
 SELECT word, length, cast(first as string) as first FROM view1
```

This PR fixes the issue in Spark 2.2. Instead of using the analyzed plan of the input query, it uses the executed plan to determine the attributes in `FileFormatWriter`.

The related issue in the master branch has already been fixed by #18064. After this PR is merged, I will submit a separate PR to port the test case to master.

### How was this patch tested?
Added a test case

Author: Xiao Li <[email protected]>
Author: gatorsmile <[email protected]>

Closes #18386 from gatorsmile/newRC5.
@cloud-fan (Contributor):

thanks, merging to 2.2!

@gatorsmile changed the title from "[SPARK-21165] [SQL] [2.2] Use executedPlan instead of analyzedPlan in INSERT AS SELECT [WIP]" to "[SPARK-21165] [SQL] [2.2] Use executedPlan instead of analyzedPlan in INSERT AS SELECT" on Jun 23, 2017
@gatorsmile closed this on Jun 23, 2017
@@ -111,9 +111,18 @@ object FileFormatWriter extends Logging {
job.setOutputValueClass(classOf[InternalRow])
FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath))

- val allColumns = queryExecution.logical.output
+ val allColumns = queryExecution.executedPlan.output
Contributor:

This is problematic. The physical plan may have a different schema from the logical plan (the schema names may differ), and the writer should respect the logical schema, since that is what users expect.

@gatorsmile (Member, Author):

Yes. We should always use analyzed.output
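
A short sketch of what this exchange argues for (illustrative only; the eventual fix landed via #19484, which reverted this change): take the writer's column names from the analyzed plan, since the physical plan's attribute names can drift during optimization.

```scala
// Illustrative only, not the code from #19484.
val allColumns = queryExecution.analyzed.output          // names as the user wrote them
val physicalColumns = queryExecution.executedPlan.output // may differ after optimization
```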

asfgit pushed a commit that referenced this pull request Oct 13, 2017
…ry schema

## What changes were proposed in this pull request?

#18386 fixes SPARK-21165 but breaks SPARK-22252. This PR reverts #18386 and picks the patch from #19483 to fix SPARK-21165.

## How was this patch tested?

new regression test

Author: Wenchen Fan <[email protected]>

Closes #19484 from cloud-fan/bug.
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…ry schema

## What changes were proposed in this pull request?

apache#18386 fixes SPARK-21165 but breaks SPARK-22252. This PR reverts apache#18386 and picks the patch from apache#19483 to fix SPARK-21165.

## How was this patch tested?

new regression test

Author: Wenchen Fan <[email protected]>

Closes apache#19484 from cloud-fan/bug.