[SPARK-24277][SQL] Code clean up in SQL module: HadoopMapReduceCommitProtocol #21329

Closed
wants to merge 2 commits

Conversation

gengliangwang
Member

What changes were proposed in this pull request?

In HadoopMapReduceCommitProtocol and FileFormatWriter, there are unnecessary settings in the Hadoop configuration.

This PR also cleans up some code in the SQL module.

How was this patch tested?

Unit test

@SparkQA

SparkQA commented May 15, 2018

Test build #90624 has finished for PR 21329 at commit 353606c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented May 15, 2018

Test build #90631 has finished for PR 21329 at commit 353606c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented May 15, 2018

Test build #90639 has finished for PR 21329 at commit 353606c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented May 16, 2018

Test build #90669 has finished for PR 21329 at commit 353606c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented May 16, 2018

Test build #90680 has finished for PR 21329 at commit 353606c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented May 17, 2018

Test build #90731 has finished for PR 21329 at commit 5739621.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 17, 2018

Test build #90739 has finished for PR 21329 at commit 5739621.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit closed this in 7b2dca5 on May 18, 2018
@rxin
Contributor

rxin commented May 18, 2018

Why are we cleaning up stuff like this?

@gengliangwang
Member Author

gengliangwang commented May 18, 2018

@rxin When I was implementing the writer for Data Source V2, I found the code in HadoopMapReduceCommitProtocol quite misleading.
The code here just sets the configurations mapreduce.job.id and mapreduce.task.id to meaningless dummy values.
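
For context, here is a condensed sketch of the driver-side setup being discussed, simplified from HadoopMapReduceCommitProtocol of that era (not a verbatim copy of the Spark source; the ID derivation is inlined for readability):

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

import org.apache.hadoop.mapreduce.{JobContext, JobID, TaskAttemptID, TaskID, TaskType}

// Sketch of the driver-side setup (simplified, not the exact Spark source).
// The IDs are fixed placeholders: job 0, task 0, attempt 0. Each executor
// task later rebuilds these keys with its real partition/attempt numbers,
// which is why the values written here looked like dead weight.
def setupJobConf(jobContext: JobContext): Unit = {
  val jobTrackerId = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(new Date)
  val jobId = new JobID(jobTrackerId, 0)
  val taskId = new TaskID(jobId, TaskType.MAP, 0)
  val taskAttemptId = new TaskAttemptID(taskId, 0)

  val conf = jobContext.getConfiguration
  conf.set("mapreduce.job.id", jobId.toString)
  conf.set("mapreduce.task.id", taskAttemptId.getTaskID.toString)
  conf.set("mapreduce.task.attempt.id", taskAttemptId.toString)
  conf.setBoolean("mapreduce.task.ismap", true)
  conf.setInt("mapreduce.task.partition", 0)
}
```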

@JoshRosen
Contributor

In general, I'm very wary of cleanup changes like this: unless we have a concrete need to do this (e.g. it causes negative side effects, breaks workloads, or prevents specific concrete improvements), the risk of changing longstanding old code outweighs any benefits of "cleanliness".

In this specific case, I'm most concerned about the removal of those jobContext Configuration changes: they might appear superfluous, but I'd bet that they were originally added for a reason which made sense at the time the original code was authored. If we're going to remove such old code, I'd like to see a written explanation of why the change is safe which explains the original context for adding that code. I chased through the git blame for the removed lines and the oldest version I found was 0595b6d#diff-5c7d283b2f533b6f491dd1845dd86797R299, which is #5526: "[SPARK-3928] [SPARK-5182] [SQL] Partitioning support for the data sources API" by @liancheng. It looks like that PR simply copied even older code from either hiveWriterContainers.scala or HadoopRDD. It also looks like these properties did have some effect at one point in time, because #101 fixed a corner case or bug in that code in the HadoopRDD case. If you go even further back, similar lines are present in a commit which predated our merge script (so it's an intermediate commit from @mateiz as part of some larger changeset): 0ccfe20#diff-8382b1a276e6cbfae1efb3ee49c7e06eR144

Given this long history, I'd like to flag that change as potentially high-risk: it's not obvious to me that this code is unneeded and if we don't have a strong reason to change it then I'd prefer to leave it as it was before simply to help us manage / reduce risk.

@JoshRosen
Contributor

I'd also like to note that commit protocols have historically been a very high-risk area of the code, so I think we should have a much higher bar for explaining changes to that component.

@gatorsmile
Member

Let me revert it now. @gengliangwang Please address the code comments and resubmit the PR. Thanks!

@gengliangwang
Member Author

gengliangwang commented May 19, 2018

@JoshRosen Thanks for the explanation. I can understand your concerns.
My main point is that the job/task IDs set here https://github.com/apache/spark/pull/21329/files#diff-d97cfb5711116287a7655f32cd5675cbL149 are dummy values, not the real job/task/attempt IDs of the write action.

Meanwhile, in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L254 the configurations are correctly overwritten with the real values.

However, overall the cleanup is trivial. I am OK with the revert; we should focus on something more important :)
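
For reference, a sketch of the executor-side overwrite being pointed to, simplified from FileFormatWriter of that era (parameter names are illustrative, not a verbatim copy):

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{JobID, TaskAttemptContext, TaskAttemptID, TaskID, TaskType}
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl

// Sketch of the executor-side setup (simplified, not the exact Spark
// source). The same configuration keys are set again, this time with the
// task's real stage/partition/attempt numbers, so the committer's task
// context never sees the driver-side placeholder values.
def setupTaskContext(
    hadoopConf: Configuration,
    sparkStageId: Int,
    sparkPartitionId: Int,
    sparkAttemptNumber: Int): TaskAttemptContext = {
  val jobTrackerId = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(new Date)
  val jobId = new JobID(jobTrackerId, sparkStageId)
  val taskId = new TaskID(jobId, TaskType.MAP, sparkPartitionId)
  val taskAttemptId = new TaskAttemptID(taskId, sparkAttemptNumber)

  hadoopConf.set("mapreduce.job.id", jobId.toString)
  hadoopConf.set("mapreduce.task.id", taskAttemptId.getTaskID.toString)
  hadoopConf.set("mapreduce.task.attempt.id", taskAttemptId.toString)
  hadoopConf.setBoolean("mapreduce.task.ismap", true)
  hadoopConf.setInt("mapreduce.task.partition", 0)
  new TaskAttemptContextImpl(hadoopConf, taskAttemptId)
}
```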

@cloud-fan
Contributor

cloud-fan commented May 19, 2018

The history is exactly as @JoshRosen said: the conf-setting logic has been on the write side since day 1, and then #101 applied it to the read side. My major concern is that the driver side just sets some dummy values (job id 0, task id 0, numPartitions 0), and these confs will be set again on the executor side with the real values. It seems to me we set the confs on the driver side just to make the behavior consistent between driver and executor; there is no specific reason. (This needs confirmation from @mateiz.)

After migrating the file sources to Data Source V2, the implementation will be the best Data Source V2 example, and hopefully we won't have mysterious code confusing our readers :)

@steveloughran
Contributor

Very late commentary here: committers may expect a consistent job ID across the job and its tasks, so reverting this was the right thing to do. There are still some race conditions in FileFormatWriter, though, as more than one job can be created with the same job ID.
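
To illustrate the race: Spark derives the Hadoop job-tracker ID from a second-granularity timestamp (the derivation lives in SparkHadoopWriterUtils.createJobID), so two write jobs launched within the same second, each using sequence number 0, end up with the same job ID. A minimal sketch, assuming that derivation:

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

import org.apache.hadoop.mapreduce.JobID

// Minimal sketch of the job ID collision (assumes the second-granularity
// timestamp derivation used by SparkHadoopWriterUtils.createJobID).
def makeJobId(time: Date, id: Int): JobID =
  new JobID(new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(time), id)

val now = new Date()
val first = makeJobId(now, 0)
val second = makeJobId(now, 0) // a distinct job started in the same second

// Identical IDs for distinct jobs: committers that key staging or temp
// directories on the job ID can step on each other's output.
assert(first.toString == second.toString)
```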
