[SPARK-24277][SQL] Code clean up in SQL module: HadoopMapReduceCommitProtocol #21329

Closed
wants to merge 2 commits

Conversation

gengliangwang
Member

What changes were proposed in this pull request?

In HadoopMapReduceCommitProtocol and FileFormatWriter, there are unnecessary settings in the Hadoop configuration.

This PR also cleans up some code in the SQL module.

How was this patch tested?

Unit test

@SparkQA

SparkQA commented May 15, 2018

Test build #90624 has finished for PR 21329 at commit 353606c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented May 15, 2018

Test build #90631 has finished for PR 21329 at commit 353606c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented May 15, 2018

Test build #90639 has finished for PR 21329 at commit 353606c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented May 16, 2018

Test build #90669 has finished for PR 21329 at commit 353606c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented May 16, 2018

Test build #90680 has finished for PR 21329 at commit 353606c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented May 17, 2018

Test build #90731 has finished for PR 21329 at commit 5739621.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 17, 2018

Test build #90739 has finished for PR 21329 at commit 5739621.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit closed this in 7b2dca5 on May 18, 2018
@rxin
Contributor

rxin commented May 18, 2018

Why are we cleaning up stuff like this?

@gengliangwang
Member Author

gengliangwang commented May 18, 2018

@rxin When I was implementing the writer for Data Source V2, I found the code in HadoopMapReduceCommitProtocol quite misleading.
The code here just sets the configurations mapreduce.job.id and mapreduce.task.id to meaningless dummy values.
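
For context, here is a condensed sketch of the driver-side setup being discussed, simplified from HadoopMapReduceCommitProtocol of that era (not a verbatim copy of the Spark source; the ID derivation is inlined for readability):

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

import org.apache.hadoop.mapreduce.{JobContext, JobID, TaskAttemptID, TaskID, TaskType}

// Sketch of the driver-side setup (simplified, not the exact Spark source).
// The IDs are fixed placeholders: job 0, task 0, attempt 0. Each executor
// task later rebuilds these keys with its real partition/attempt numbers,
// which is why the values written here looked like dead weight.
def setupJobConf(jobContext: JobContext): Unit = {
  val jobTrackerId = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(new Date)
  val jobId = new JobID(jobTrackerId, 0)
  val taskId = new TaskID(jobId, TaskType.MAP, 0)
  val taskAttemptId = new TaskAttemptID(taskId, 0)

  val conf = jobContext.getConfiguration
  conf.set("mapreduce.job.id", jobId.toString)
  conf.set("mapreduce.task.id", taskAttemptId.getTaskID.toString)
  conf.set("mapreduce.task.attempt.id", taskAttemptId.toString)
  conf.setBoolean("mapreduce.task.ismap", true)
  conf.setInt("mapreduce.task.partition", 0)
}
```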

@JoshRosen
Contributor

In general, I'm very wary of cleanup changes like this: unless we have a concrete need to do this (e.g. it causes negative side effects, breaks workloads, or prevents specific concrete improvements), the risk of changing longstanding old code outweighs any benefits of "cleanliness".

In this specific case, I'm most concerned about the removal of those jobContext Configuration changes: they might appear superfluous, but I'd bet that they were originally added for a reason which made sense at the time the original code was authored. If we're going to remove such old code, I'd like to see a written explanation of why the change is safe which explains the original context for adding that code. I chased through the git blame for the removed lines and the oldest version I found was 0595b6d#diff-5c7d283b2f533b6f491dd1845dd86797R299, which is #5526: "[SPARK-3928] [SPARK-5182] [SQL] Partitioning support for the data sources API" by @liancheng. It looks like that PR simply copied even older code from either hiveWriterContainers.scala or HadoopRDD. It also looks like these properties did have some effect at one point in time, because #101 fixed a corner case or bug in that code in the HadoopRDD case. If you go even further back, similar lines are present in a commit which predated our merge script (so it's an intermediate commit from @mateiz as part of some larger changeset): 0ccfe20#diff-8382b1a276e6cbfae1efb3ee49c7e06eR144

Given this long history, I'd like to flag that change as potentially high-risk: it's not obvious to me that this code is unneeded and if we don't have a strong reason to change it then I'd prefer to leave it as it was before simply to help us manage / reduce risk.

@JoshRosen
Contributor

I'd also like to note that commit protocols have historically been a very high-risk area of the code, so I think we should have a much higher bar for explaining changes to that component.

@gatorsmile
Member

Let me revert it now. @gengliangwang Please address the code comments and resubmit the PR. Thanks!

@gengliangwang
Member Author

gengliangwang commented May 19, 2018

@JoshRosen Thanks for the explanation. I can understand your concerns.
My main point is that the job/task IDs set here https://github.com/apache/spark/pull/21329/files#diff-d97cfb5711116287a7655f32cd5675cbL149 are dummy values, not the real job/task/attempt IDs of the write action.

Meanwhile, in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L254 the configurations are correctly overwritten with the real values.

However, overall the cleanup is trivial. I am OK with the revert; we should focus on something more important :)
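
For reference, a sketch of the executor-side overwrite being pointed to, simplified from FileFormatWriter of that era (parameter names are illustrative, not a verbatim copy):

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{JobID, TaskAttemptContext, TaskAttemptID, TaskID, TaskType}
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl

// Sketch of the executor-side setup (simplified, not the exact Spark
// source). The same configuration keys are set again, this time with the
// task's real stage/partition/attempt numbers, so the committer's task
// context never sees the driver-side placeholder values.
def setupTaskContext(
    hadoopConf: Configuration,
    sparkStageId: Int,
    sparkPartitionId: Int,
    sparkAttemptNumber: Int): TaskAttemptContext = {
  val jobTrackerId = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(new Date)
  val jobId = new JobID(jobTrackerId, sparkStageId)
  val taskId = new TaskID(jobId, TaskType.MAP, sparkPartitionId)
  val taskAttemptId = new TaskAttemptID(taskId, sparkAttemptNumber)

  hadoopConf.set("mapreduce.job.id", jobId.toString)
  hadoopConf.set("mapreduce.task.id", taskAttemptId.getTaskID.toString)
  hadoopConf.set("mapreduce.task.attempt.id", taskAttemptId.toString)
  hadoopConf.setBoolean("mapreduce.task.ismap", true)
  hadoopConf.setInt("mapreduce.task.partition", 0)
  new TaskAttemptContextImpl(hadoopConf, taskAttemptId)
}
```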

@cloud-fan
Contributor

cloud-fan commented May 19, 2018

The history is exactly as @JoshRosen said: the conf-setting logic has been on the write side since day 1, and then #101 applied it to the read side. My major concern is that the driver side just sets some dummy values (job id 0, task id 0, numPartitions 0), and these confs will be set again on the executor side with the real values. It seems to me we set the confs on the driver side just to make the behavior consistent between driver and executor; there is no specific reason. (This needs confirmation from @mateiz.)

After migrating the file sources to Data Source V2, the implementation will be the best Data Source V2 example, and hopefully we won't have mysterious code confusing our readers :)

@steveloughran
Contributor

Very late commentary here: committers may expect a consistent job ID across the job and its tasks, so reverting this was the right thing to do. There are still some race conditions in FileFormatWriter, though, as more than one job can be created with the same job ID.
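
To illustrate the race: Spark derives the Hadoop job-tracker ID from a second-granularity timestamp (the derivation lives in SparkHadoopWriterUtils.createJobID), so two write jobs launched within the same second, each using sequence number 0, end up with the same job ID. A minimal sketch, assuming that derivation:

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

import org.apache.hadoop.mapreduce.JobID

// Minimal sketch of the job ID collision (assumes the second-granularity
// timestamp derivation used by SparkHadoopWriterUtils.createJobID).
def makeJobId(time: Date, id: Int): JobID =
  new JobID(new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(time), id)

val now = new Date()
val first = makeJobId(now, 0)
val second = makeJobId(now, 0) // a distinct job started in the same second

// Identical IDs for distinct jobs: committers that key staging or temp
// directories on the job ID can step on each other's output.
assert(first.toString == second.toString)
```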
