[SPARK-21165] [SQL] [2.2] Use executedPlan instead of analyzedPlan in INSERT AS SELECT #18386
Conversation
cc @cloud-fan
Test build #78435 has started for PR 18386 at commit
  val partitionSet = AttributeSet(partitionColumns)
- val dataColumns = queryExecution.logical.output.filterNot(partitionSet.contains)
+ val dataColumns = queryExecution.executedPlan.output.filterNot(partitionSet.contains)
nit: use `allColumns` here
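For readers skimming the hunk above, here is a minimal, hedged sketch of what that filtering does, built on Catalyst's `AttributeSet`; the column names are made up for illustration and this is not code from the PR:

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet}
import org.apache.spark.sql.types.{IntegerType, StringType}

// Hypothetical output attributes of the query being written out.
val word = AttributeReference("word", StringType)()
val length = AttributeReference("length", IntegerType)()
val allColumns = Seq(word, length)

// Suppose the table is partitioned by `length`.
val partitionColumns = Seq(length)
val partitionSet = AttributeSet(partitionColumns)

// Everything that is not a partition column becomes a data column.
val dataColumns = allColumns.filterNot(partitionSet.contains)
// dataColumns == Seq(word)
```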
retest this please
LGTM, pending Jenkins
Test build #78436 has finished for PR 18386 at commit
Hmm, seems a legitimate test failure...
Test build #78441 has finished for PR 18386 at commit
Jenkins has been unstable recently.
retest this please
Test build #78458 has finished for PR 18386 at commit
Weird. The test case passed in my local environment. Need to do more investigation.
Test build #78473 has finished for PR 18386 at commit
Test build #78502 has started for PR 18386 at commit
Test build #78503 has started for PR 18386 at commit
  val adjustedColumns = tableCols.map { col =>
-   query.resolve(Seq(col), resolver).getOrElse {
+   query.resolve(Seq(col), resolver).map(Alias(_, col)()).getOrElse {
Need to add an alias to enforce that the query preserves the original column names of the table schema, whose case could differ from the underlying query schema.
ah good catch!
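To make the point above concrete, a standalone sketch (made-up names, not code from this PR) of how wrapping the resolved attribute in an `Alias` keeps the table schema's spelling of the column name:

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeReference}
import org.apache.spark.sql.types.StringType

// Attribute as resolved from the underlying query (lower-case spelling).
val queryAttr = AttributeReference("first", StringType)()

// The table schema declares the column as `FIRST`; aliasing preserves that spelling
// even though the underlying attribute keeps its own name.
val adjusted = Alias(queryAttr, "FIRST")()

println(adjusted.name)   // FIRST
println(queryAttr.name)  // first
```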
Test build #78500 has finished for PR 18386 at commit
retest this please
Test build #78511 has finished for PR 18386 at commit
… INSERT AS SELECT [WIP]

### What changes were proposed in this pull request?

The input query schema of INSERT AS SELECT could be changed after optimization. For example, the following query's output schema is changed by the rules `SimplifyCasts` and `RemoveRedundantAliases`.

```SQL
SELECT word, length, cast(first as string) as first FROM view1
```

This PR fixes the issue in Spark 2.2. Instead of using the analyzed plan of the input query, it uses the executed plan to determine the attributes in `FileFormatWriter`.

The related issue in the master branch has been fixed by #18064. After this PR is merged, I will submit a separate PR to merge the test case to the master.

### How was this patch tested?

Added a test case.

Author: Xiao Li <[email protected]>
Author: gatorsmile <[email protected]>

Closes #18386 from gatorsmile/newRC5.
thanks, merging to 2.2!
@@ -111,9 +111,18 @@ object FileFormatWriter extends Logging {
    job.setOutputValueClass(classOf[InternalRow])
    FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath))

-   val allColumns = queryExecution.logical.output
+   val allColumns = queryExecution.executedPlan.output
This is problematic. The physical plan may have a different schema from the logical plan (the schema names may differ), and the writer should respect the logical schema, since that is what users expect.
Yes. We should always use analyzed.output
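One way to illustrate that idea, as a hedged sketch rather than the actual fix that was later merged: take the attributes the writer binds to from the physical plan, but rename them to the user-facing names from the analyzed plan, pairing by position.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("respect-analyzed-schema")
  .getOrCreate()

// A query whose analyzed output carries the user-facing name `ID`.
val df = spark.range(3).selectExpr("id AS ID")

// Names users expect, taken from the analyzed plan.
val analyzedNames = df.queryExecution.analyzed.output.map(_.name)
// Attributes the physical plan actually produces.
val physicalAttrs = df.queryExecution.executedPlan.output

// Pair them by position and keep the analyzed names on the physical attributes.
val writerColumns = physicalAttrs.zip(analyzedNames).map {
  case (attr, name) => attr.withName(name)
}
println(writerColumns.map(_.name))  // the analyzed (user-facing) names
```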
…ry schema

## What changes were proposed in this pull request?

#18386 fixes SPARK-21165 but breaks SPARK-22252. This PR reverts #18386 and picks the patch from #19483 to fix SPARK-21165.

## How was this patch tested?

new regression test

Author: Wenchen Fan <[email protected]>

Closes #19484 from cloud-fan/bug.
What changes were proposed in this pull request?
The input query schema of INSERT AS SELECT could be changed after optimization. For example, the output schema of a query such as `SELECT word, length, cast(first as string) as first FROM view1` is changed by the rules `SimplifyCasts` and `RemoveRedundantAliases`.
This PR fixes the issue in Spark 2.2. Instead of using the analyzed plan of the input query, it uses the executed plan to determine the attributes in `FileFormatWriter`.
The related issue in the master branch has been fixed by #18064. After this PR is merged, I will submit a separate PR to merge the test case to the master.
How was this patch tested?
Added a test case.
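A self-contained sketch (not part of the PR) that reproduces the schema drift described above in a local Spark shell: `first` is already a string, so the cast is a no-op and the optimizer may rewrite the projection, leaving the optimized plan with output attributes that no longer match the analyzed plan's.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("insert-as-select-schema-drift")
  .getOrCreate()
import spark.implicits._

// `first` is already a StringType column, so `cast(first as string)` is redundant.
Seq(("hello", 5, "x")).toDF("word", "length", "first").createOrReplaceTempView("view1")

val df = spark.sql("SELECT word, length, cast(first as string) as first FROM view1")

// Compare the attribute lists (names and expression ids) before and after optimization.
println(df.queryExecution.analyzed.output)
println(df.queryExecution.optimizedPlan.output)
```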