
[SPARK-21165] [SQL] [2.2] Use executedPlan instead of analyzedPlan in INSERT AS SELECT #18386

Closed
wants to merge 5 commits

Conversation

@gatorsmile (Member) commented Jun 22, 2017

What changes were proposed in this pull request?

The input query schema of INSERT AS SELECT can be changed by optimization. For example, the following query's output schema is changed by the rules `SimplifyCasts` and `RemoveRedundantAliases`.

 SELECT word, length, cast(first as string) as first FROM view1

This PR fixes the issue in Spark 2.2. Instead of using the analyzed plan of the input query, it uses the executed plan to determine the attributes in `FileFormatWriter`.

The related issue in the master branch has already been fixed by #18064. After this PR is merged, I will submit a separate PR to port the test case to master.
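
For concreteness, a minimal sketch of the kind of statement that hits this (not the PR's actual test case): the destination table `target` and its schema are assumptions for illustration, while `view1` and the SELECT come from the description, with `first` assumed to already be a string so the cast is a no-op.

```scala
// Hypothetical spark-shell sketch; `target` is an assumed destination table.
spark.sql("CREATE TABLE target (word STRING, length INT, first STRING) USING parquet")

// During optimization, SimplifyCasts can drop the no-op cast and
// RemoveRedundantAliases can drop the alias, so the optimized output
// attributes no longer match the analyzed schema handed to the writer.
spark.sql(
  """INSERT OVERWRITE TABLE target
    |SELECT word, length, cast(first as string) as first FROM view1""".stripMargin)
```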

How was this patch tested?

Added a test case

@gatorsmile (Member, Author):

cc @cloud-fan

@gatorsmile changed the title from "[SPARK-21165] [SQL] [2.2] Use executedPlan instead of analyzedPlan" to "[SPARK-21165] [SQL] [2.2] Use executedPlan instead of analyzedPlan in INSERT AS SELECT" on Jun 22, 2017
@SparkQA commented Jun 22, 2017

Test build #78435 has started for PR 18386 at commit 86ac975.

  val partitionSet = AttributeSet(partitionColumns)
- val dataColumns = queryExecution.logical.output.filterNot(partitionSet.contains)
+ val dataColumns = queryExecution.executedPlan.output.filterNot(partitionSet.contains)
Contributor:

nit: use allColumns here
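
Presumably the nit amounts to reusing the `allColumns` value introduced by this patch (see the `FileFormatWriter` hunk further down) instead of reading `queryExecution.executedPlan.output` a second time; a sketch of that shape, not the exact patch code:

```scala
// Sketch of the suggested shape inside FileFormatWriter.write; the names used
// here come from the hunks shown in this PR.
val allColumns = queryExecution.executedPlan.output
val partitionSet = AttributeSet(partitionColumns)
val dataColumns = allColumns.filterNot(partitionSet.contains)
```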

@cloud-fan (Contributor):

retest this please

@cloud-fan (Contributor):

LGTM, pending jenkins

@SparkQA commented Jun 22, 2017

Test build #78436 has finished for PR 18386 at commit 00a63b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

hmm, seems a legitimate test failure...

@SparkQA commented Jun 22, 2017

Test build #78441 has finished for PR 18386 at commit 00a63b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987 (Contributor):

Jenkins has been unstable recently.

@jiangxb1987 (Contributor):

retest this please

@SparkQA commented Jun 22, 2017

Test build #78458 has finished for PR 18386 at commit 00a63b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member, Author):

Weird. The test case passed in my local environment. Need to do more investigation.

@gatorsmile changed the title from "[SPARK-21165] [SQL] [2.2] Use executedPlan instead of analyzedPlan in INSERT AS SELECT" to "[SPARK-21165] [SQL] [2.2] Use executedPlan instead of analyzedPlan in INSERT AS SELECT [WIP]" on Jun 22, 2017
@SparkQA commented Jun 22, 2017

Test build #78473 has finished for PR 18386 at commit dfc8884.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 23, 2017

Test build #78502 has started for PR 18386 at commit 08015c8.

@SparkQA commented Jun 23, 2017

Test build #78503 has started for PR 18386 at commit bb8348d.

  val adjustedColumns = tableCols.map { col =>
-   query.resolve(Seq(col), resolver).getOrElse {
+   query.resolve(Seq(col), resolver).map(Alias(_, col)()).getOrElse {
@gatorsmile (Member, Author):

We need to add an alias to force the query to preserve the original column names of the table schema, whose case could differ from that of the underlying query schema.
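
A hedged illustration of this point, not code from the patch: with a case-insensitive resolver, a table column spelled `FIRST` still resolves against a query attribute named `first`, and wrapping the result in an `Alias` pins the output name to the table's spelling.

```scala
import org.apache.spark.sql.catalyst.analysis.Resolver
import org.apache.spark.sql.catalyst.expressions.{Alias, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical helper: resolve a table column against the query and re-alias it
// so downstream code sees the table's column name (and case), not whatever name
// the query happened to produce.
def resolvePreservingTableName(
    query: LogicalPlan,
    tableCol: String,
    resolver: Resolver): Option[NamedExpression] =
  query.resolve(Seq(tableCol), resolver).map(attr => Alias(attr, tableCol)())
```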

Contributor:

ah good catch!

@SparkQA commented Jun 23, 2017

Test build #78500 has finished for PR 18386 at commit fdd254e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member, Author):

retest this please

@SparkQA commented Jun 23, 2017

Test build #78511 has finished for PR 18386 at commit bb8348d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jun 23, 2017
… INSERT AS SELECT [WIP]

### What changes were proposed in this pull request?

The input query schema of INSERT AS SELECT can be changed by optimization. For example, the following query's output schema is changed by the rules `SimplifyCasts` and `RemoveRedundantAliases`.
```SQL
 SELECT word, length, cast(first as string) as first FROM view1
```

This PR fixes the issue in Spark 2.2. Instead of using the analyzed plan of the input query, it uses the executed plan to determine the attributes in `FileFormatWriter`.

The related issue in the master branch has already been fixed by #18064. After this PR is merged, I will submit a separate PR to port the test case to master.

### How was this patch tested?
Added a test case

Author: Xiao Li <[email protected]>
Author: gatorsmile <[email protected]>

Closes #18386 from gatorsmile/newRC5.
@cloud-fan (Contributor):

thanks, merging to 2.2!

@gatorsmile changed the title from "[SPARK-21165] [SQL] [2.2] Use executedPlan instead of analyzedPlan in INSERT AS SELECT [WIP]" to "[SPARK-21165] [SQL] [2.2] Use executedPlan instead of analyzedPlan in INSERT AS SELECT" on Jun 23, 2017
@gatorsmile closed this on Jun 23, 2017
@@ -111,9 +111,18 @@ object FileFormatWriter extends Logging {
job.setOutputValueClass(classOf[InternalRow])
FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath))

- val allColumns = queryExecution.logical.output
+ val allColumns = queryExecution.executedPlan.output
Contributor:

This is problematic. The physical plan may have a different schema from the logical plan (the schema names may differ), and the writer should respect the logical schema, since that is what users expect.

@gatorsmile (Member, Author):

Yes. We should always use analyzed.output
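
A short sketch of what this exchange argues for (illustrative only; the eventual fix landed via #19484, which reverted this change): take the writer's column names from the analyzed plan, since the physical plan's attribute names can drift during optimization.

```scala
// Illustrative only, not the code from #19484.
val allColumns = queryExecution.analyzed.output          // names as the user wrote them
val physicalColumns = queryExecution.executedPlan.output // may differ after optimization
```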

asfgit pushed a commit that referenced this pull request Oct 13, 2017
…ry schema

## What changes were proposed in this pull request?

#18386 fixes SPARK-21165 but breaks SPARK-22252. This PR reverts #18386 and picks the patch from #19483 to fix SPARK-21165.

## How was this patch tested?

new regression test

Author: Wenchen Fan <[email protected]>

Closes #19484 from cloud-fan/bug.
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…ry schema

## What changes were proposed in this pull request?

apache#18386 fixes SPARK-21165 but breaks SPARK-22252. This PR reverts apache#18386 and picks the patch from apache#19483 to fix SPARK-21165.

## How was this patch tested?

new regression test

Author: Wenchen Fan <[email protected]>

Closes apache#19484 from cloud-fan/bug.