-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revert soft dynamic allocation for SPARK-25299. #513
Revert soft dynamic allocation for SPARK-25299. #513
Conversation
We want to work off of the master branch's version of various parts of the code that soft dynamic allocation changed.
@rynorris we could use some help in figuring out if we reverted all the right things here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually a strategically valuable feature that we're trying to roll out to make Spark-on-k8s work better in production. We should not be reverting it.
I'd much rather contribute the dynamic allocation stuff upstream, since we've successfully tested it in production (albeit still needs work to be properly supported in https://github.com/palantir/k8s-spark-scheduler ) |
@lwwmanning we're not reverting it on master, just the feature branch on which we're developing the ESS changes. we want this so our perf and unit tests accurately reflect what's happening upstream |
Ohhh, I see. Well, was a good impetus for me to finally create apache#24083 |
compared this with Will's upstream PR and apache/master. for the most part, this looks good to me, except tests are failing. i think you've checked the files out from apache/master, not palantir/master, so might be worth merging from upstream first so we're not taking half of the changes in certain diffs in this PR. or not if there are a lot of merge conflicts, but tests should at least pass. also, as for the k8s soft dynamic allocation upstream PR, agreed that we should upstream the changes, so we'll take a look at that PR. however, i don't think we should block this work on that, so let's merge this into the feature branch for now and then undo this when the upstream PR merges |
Cool, yeah, don't want to block experimentation. However, I would want to block merging the feature branch into palantir/master on effectively reverting this PR. |
When we propose the experimental work against upstream we're going to cherry-pick our patches to make the diff there. Then when we merge into palantir/master we can resolve the merge conflicts and deal with our stuff then. Or, if apache#24083 merges we'll just take that in the feature branch anyways. |
The only place I didn't check out master in was in |
I'll try also taking a merge from upstream - there's a lot of conflicts but I think they're routine that we fix all the time for the most part. |
This reverts commit 1fc5ea9.
Can't run with dynamic allocation without the external shuffle service anymore.
I took a closer look at the diff. I believe everything is only with regards to |
Ok, the build is finally passing. Not sure what was causing the problems. @yifeih can we get a final sign off to merge here? |
I'm going to revert the changes I made to circleci YML that forced the caches to invalidate. That might make the build fail again if it restores bad caches, but we'll declare that the build was passing at one point and hopefully the build on the feature branch should pass regardless. |
Ok, merging now! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
================================================================================================
SortShuffleWriter writer
================================================================================================
Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
SortShuffleWriter without spills: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
small dataset without spills 10 17 6 0.1 10042.1 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
SortShuffleWriter with spills: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
no map side combine 14220 14375 148 0.5 2118.9 1.0X
with map side aggregation 14147 14279 169 0.5 2108.0 1.0X
with map side sort 14115 14310 174 0.5 2103.3 1.0X
When working on SPARK-25299 specifically, we want to work off of the master branch's version of various parts of the code that soft dynamic allocation changed.
Due to the way the code diverged, and due to the way conflict resolution was performed when merging from upstream, simply reverting the commits that introduced the feature was not trivial. We instead followed a more manual approach:
MapOutputTracker