[SPARK-24277][SQL] Code clean up in SQL module: HadoopMapReduceCommitProtocol #21329
Conversation
Test build #90624 has finished for PR 21329 at commit
retest this please
Test build #90631 has finished for PR 21329 at commit
retest this please.
Test build #90639 has finished for PR 21329 at commit
retest this please.
Test build #90669 has finished for PR 21329 at commit
retest this please.
Test build #90680 has finished for PR 21329 at commit
retest this please.
Test build #90731 has finished for PR 21329 at commit
Test build #90739 has finished for PR 21329 at commit
thanks, merging to master!
Why are we cleaning up stuff like this?
@rxin When I was implementing the writer with Data Source V2, I found the code in
In general, I'm very wary of cleanup changes like this: unless we have a need to do this (i.e. it causes negative side effects, breaks workloads, prevents specific concrete improvements, etc.), the risk of changing longstanding old code outweighs any benefits of "cleanliness". In this specific case, I'm most concerned about the removal of those conf-setting calls. Given this long history, I'd like to flag that change as potentially high-risk: it's not obvious to me that this code is unneeded, and if we don't have a strong reason to change it then I'd prefer to leave it as it was before, simply to help us manage / reduce risk.
I'd also like to note that commit protocols have historically been a very high risk area of the code, so I think we should have a much higher bar for explaining changes to that component.
Let me revert it now. @gengliangwang Please address the code comments and resubmit the PR. Thanks!
@JoshRosen Thanks for the explanation. I can understand your concerns. That said, in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L254 the configurations are correctly overwritten. Overall, though, the cleanup is trivial. I am OK with the revert, and we should focus on something more important :)
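For readers who don't want to chase the link, here is a rough sketch of the executor-side setup being referenced (reconstructed from the discussion, not a verbatim excerpt of FileFormatWriter): each write task rebuilds the relevant Hadoop conf entries with the real job, task, and attempt IDs before creating its TaskAttemptContext, so anything the driver put there is overwritten.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{JobID, TaskAttemptContext, TaskAttemptID, TaskID, TaskType}
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl

object ExecutorSideSetupSketch {
  // Builds the per-task attempt context on the executor, overwriting whatever
  // placeholder IDs the driver put into the conf with the real values.
  def newTaskAttemptContext(
      hadoopConf: Configuration,
      jobId: JobID,
      sparkPartitionId: Int,
      sparkAttemptNumber: Int): TaskAttemptContext = {
    val taskId = new TaskID(jobId, TaskType.MAP, sparkPartitionId)
    val taskAttemptId = new TaskAttemptID(taskId, sparkAttemptNumber)

    // Real IDs replace any driver-side placeholders in the conf.
    hadoopConf.set("mapreduce.job.id", jobId.toString)
    hadoopConf.set("mapreduce.task.id", taskAttemptId.getTaskID.toString)
    hadoopConf.set("mapreduce.task.attempt.id", taskAttemptId.toString)
    hadoopConf.setBoolean("mapreduce.task.ismap", true)
    hadoopConf.setInt("mapreduce.task.partition", 0)

    new TaskAttemptContextImpl(hadoopConf, taskAttemptId)
  }
}
```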
The history is exactly like what @JoshRosen said: the conf-setting logic has been on the write side since day one, and then #101 applied it to the read side. My major concern is that the driver side just sets some dummy values (job id 0, task id 0, numPartitions 0), and these confs are set again on the executor side with real values. It seems to me we set the confs on the driver side just to make the behavior consistent between driver and executor; there is no specific reason (need confirmation from @mateiz). After migrating the file source to Data Source V2, the implementation will be the best Data Source V2 example, and hopefully we won't have mysterious code to confuse our readers :)
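As a companion to the executor-side sketch above, this is roughly the driver-side setupJob shape being debated, with the dummy values spelled out; the real code builds the job ID through a Spark helper, so treat this only as an approximation:

```scala
import java.util.Date
import org.apache.hadoop.mapreduce.{JobContext, JobID, TaskAttemptID, TaskID, TaskType}

object DriverSideSetupSketch {
  // Approximate shape of the driver-side setup under discussion: the conf is
  // filled with placeholder IDs that the executors later replace with real ones.
  def setupJob(jobContext: JobContext): Unit = {
    val jobId = new JobID(new Date().getTime.toString, 0)
    val taskId = new TaskID(jobId, TaskType.MAP, 0)    // dummy task 0
    val taskAttemptId = new TaskAttemptID(taskId, 0)   // dummy attempt 0

    val conf = jobContext.getConfiguration
    conf.set("mapreduce.job.id", jobId.toString)
    conf.set("mapreduce.task.id", taskAttemptId.getTaskID.toString)
    conf.set("mapreduce.task.attempt.id", taskAttemptId.toString)
    conf.setBoolean("mapreduce.task.ismap", true)
    conf.setInt("mapreduce.task.partition", 0)         // dummy partition 0
  }
}
```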
Very late commentary here: committers may expect a consistent jobID across the job and its tasks; reverting this was the right thing to do. There are still some race conditions in FileFormatWriter, though, as more than one job can be created with the same job ID.
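To make the collision concern concrete, here is a small, self-contained illustration (hypothetical helper names, assuming a second-granularity, timestamp-based job-tracker ID as in the usual Hadoop/Spark convention): two jobs started within the same second end up with equal JobIDs.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}
import org.apache.hadoop.mapreduce.JobID

object JobIdCollisionSketch {
  // Illustration only: a timestamp truncated to the second is not a unique
  // job-tracker identifier when jobs are submitted concurrently.
  def makeJobId(startTime: Date, jobNumber: Int): JobID = {
    val jobTrackerId = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(startTime)
    new JobID(jobTrackerId, jobNumber)
  }

  def main(args: Array[String]): Unit = {
    val now = new Date()
    val a = makeJobId(now, 0)
    val b = makeJobId(now, 0) // a second job submitted within the same second
    println(s"$a == $b: ${a == b}") // prints true: both jobs share one job ID
  }
}
```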
What changes were proposed in this pull request?
In HadoopMapReduceCommitProtocol and FileFormatWriter, there are unnecessary settings in the Hadoop configuration; this patch removes them.
It also cleans up some code in the SQL module.
How was this patch tested?
Unit test