# [SPARK-22252][SQL] FileFormatWriter should respect the input query schema #19474

## Conversation
Test build #82638 has finished for PR 19474 at commit …

Test build #82639 has finished for PR 19474 at commit …

Test build #82640 has finished for PR 19474 at commit …
Totally agreed. LGTM

I like this change because the relation between …
```diff
@@ -117,7 +117,7 @@ object FileFormatWriter extends Logging {
     job.setOutputValueClass(classOf[InternalRow])
     FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath))

-    val allColumns = plan.output
+    val allColumns = queryExecution.logical.output
```
I think it'd be good to leave a comment that we should not use the optimized output here, in case it gets changed in the future.
Btw, shall we use `queryExecution.analyzed.output`?
Minor comments. LGTM
test("FileFormatWriter should respect the input query schema") { | ||
withTable("t1", "t2") { | ||
spark.range(1).select('id as 'col1, 'id as 'col2).write.saveAsTable("t1") |
Also add another case here?

```scala
spark.range(1).select('id, 'id as 'col1, 'id as 'col2).write.saveAsTable("t3")
```
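For reference, here is a sketch of how the full regression test might look with that extra case folded in. The table names `t2`/`t3`/`t4` and the exact assertions are illustrative, and it assumes a `QueryTest`-style suite that provides `spark`, `withTable`, `checkAnswer`, and `spark.implicits._` in scope:

```scala
import org.apache.spark.sql.Row

test("FileFormatWriter should respect the input query schema") {
  withTable("t1", "t2", "t3", "t4") {
    spark.range(1).select('id as 'col1, 'id as 'col2).write.saveAsTable("t1")
    // The upper-case references resolve against the lower-case table schema;
    // the analyzed output names ("COL1", "COL2") must survive until write time.
    spark.sql("select COL1, COL2 from t1").write.saveAsTable("t2")
    checkAnswer(spark.table("t2"), Row(0, 0))

    // The extra case suggested above: the query picks only some of the columns.
    spark.range(1).select('id, 'id as 'col1, 'id as 'col2).write.saveAsTable("t3")
    spark.sql("select COL1, COL2 from t3").write.saveAsTable("t4")
    checkAnswer(spark.table("t4"), Row(0, 0))
  }
}
```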
```scala
def query: LogicalPlan

// We make the input `query` an inner child instead of a child in order to hide it from the
// optimizer. This is because optimizer may change the output schema names, and we have to keep
```
You will scare others. :)
-> may not preserve the output schema names' case
Explicitly using `analyzed`'s schema is better here.
```diff
@@ -30,6 +31,15 @@ import org.apache.spark.util.SerializableConfiguration
  */
 trait DataWritingCommand extends RunnableCommand {

+  def query: LogicalPlan
```
Add a one-line description for `query`?
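One possible wording for that description (illustrative, not necessarily the comment that was eventually merged):

```scala
/** The input query plan that produces the data to be written. */
def query: LogicalPlan
```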
```diff
-    val allColumns = plan.output
+    // Pick the attributes from analyzed plan, as optimizer may not preserve the output schema
+    // names' case.
+    val allColumns = queryExecution.analyzed.output
     val partitionSet = AttributeSet(partitionColumns)
```
You might need to double-check that the `partitionColumns` in all the other files are also from analyzed plans.
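For context on why this matters: `FileFormatWriter` splits the columns by attribute-set membership, so the partition columns must come from the same analyzed output, or the containment check will silently miss them. A simplified sketch of that surrounding logic (abridged from `FileFormatWriter.write`; `queryExecution` and `partitionColumns` are the method's inputs):

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeSet

val allColumns = queryExecution.analyzed.output
val partitionSet = AttributeSet(partitionColumns)
// Data columns are whatever remains after removing the partition columns.
// This only works if both sides refer to the same resolved attributes.
val dataColumns = allColumns.filterNot(partitionSet.contains)
```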
LGTM pending Jenkins.

Test build #82661 has finished for PR 19474 at commit …
Thanks for the review, merging to master!
## What changes were proposed in this pull request?

This is a minor follow-up of #19474. #19474 partially reverted #18064 but accidentally introduced a behavior change: `Command` extended `LogicalPlan` before #18064, but #19474 made it extend `LeafNode`. This is an internal behavior change, as now all `Command` subclasses can't define children and have to implement the `computeStatistic` method. This PR fixes this by making `Command` extend `LogicalPlan` again.

## How was this patch tested?

N/A

Author: Wenchen Fan <[email protected]>

Closes #19493 from cloud-fan/minor.
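A sketch of the shape this follow-up restores (simplified; the real definition lives in Catalyst's logical plan hierarchy):

```scala
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// `Command` is a LogicalPlan again rather than a LeafNode: subclasses may
// still declare children, and none are forced to implement LeafNode's
// stats hook.
trait Command extends LogicalPlan {
  override def output: Seq[Attribute] = Seq.empty
  override def children: Seq[LogicalPlan] = Seq.empty
}
```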
## What changes were proposed in this pull request?

In #18064, we allowed `RunnableCommand` to have children in order to fix some UI issues. Then we made `InsertIntoXXX` commands take the input `query` as a child; when we do the actual writing, we just pass the physical plan to the writer (`FileFormatWriter.write`).

However, this is problematic. In Spark SQL, the optimizer and planner are allowed to change the schema names a little bit. For example, the `ColumnPruning` rule will remove no-op `Project`s, like `Project("A", Scan("a"))`, and thus change the output schema from `<A: int>` to `<a: int>`. When it comes to writing, especially for self-describing data formats like Parquet, we may write the wrong schema to the file and cause null values at the read path.
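An illustrative walk-through of the name-case mismatch (hypothetical snippet, assuming a local `SparkSession` named `spark`):

```scala
import spark.implicits._

// The table schema is lower-case; the query refers to the columns upper-case.
spark.range(1).select('id as 'col1, 'id as 'col2).write.saveAsTable("t1")
val df = spark.sql("select COL1, COL2 from t1")

df.queryExecution.analyzed.output.map(_.name)       // Seq("COL1", "COL2")
// After ColumnPruning removes the no-op Project, the optimized plan can
// expose the scan's attributes directly:
df.queryExecution.optimizedPlan.output.map(_.name)  // possibly Seq("col1", "col2")

// Before this fix, FileFormatWriter saw the post-optimization names, so the
// files on disk could carry a schema whose case disagrees with the table's
// metadata, yielding nulls when read back.
```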
Fortunately, in #18450, we decided to allow nested execution, so one query can map to multiple executions in the UI. This lifts the major restriction from #18604, and now we don't have to take the input `query` as a child of `InsertIntoXXX` commands.

So the fix is simple: this PR partially reverts #18064 and makes `InsertIntoXXX` commands leaf nodes again.

## How was this patch tested?

New regression test.