Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert soft dynamic allocation for SPARK-25299. #513

Merged
merged 9 commits into from
Mar 14, 2019

Conversation

mccheah
Copy link

@mccheah mccheah commented Mar 13, 2019

When working on SPARK-25299 specifically, we want to work off of the master branch's version of various parts of the code that soft dynamic allocation changed.

Due to the way the code diverged, and due to the way conflict resolution was performed when merging from upstream, simply reverting the commits that introduced the feature was not trivial. We instead followed a more manual approach:

  1. git checkout the upstream master version of MapOutputTracker
  2. Find all compilation errors in dependent classes
  3. git checkout the upstream master version of all broken classes
  4. Repeat 2-3 until no compilation errors are left

We want to work off of the master branch's version of various parts of the code that soft dynamic allocation changed.
@mccheah
Copy link
Author

mccheah commented Mar 13, 2019

@rynorris we could use some help in figuring out if we reverted all the right things here.

Copy link

@lwwmanning lwwmanning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a strategically valuable feature that we're trying to roll out to make Spark-on-k8s work better in production. We should not be reverting it.

@lwwmanning
Copy link

lwwmanning commented Mar 13, 2019

I'd much rather contribute the dynamic allocation stuff upstream, since we've successfully tested it in production (albeit still needs work to be properly supported in https://github.com/palantir/k8s-spark-scheduler )

@yifeih
Copy link

yifeih commented Mar 13, 2019

@lwwmanning we're not reverting it on master, just the feature branch on which we're developing the ESS changes. we want this so our perf and unit tests accurately reflect what's happening upstream

@lwwmanning
Copy link

Ohhh, I see. Well, was a good impetus for me to finally create apache#24083

@lwwmanning lwwmanning dismissed their stale review March 13, 2019 17:34

misunderstood target branch

@yifeih
Copy link

yifeih commented Mar 13, 2019

compared this with Will's upstream PR and apache/master. for the most part, this looks good to me, except tests are failing. i think you've checked the files out from apache/master, not palantir/master, so might be worth merging from upstream first so we're not taking half of the changes in certain diffs in this PR. or not if there are a lot of merge conflicts, but tests should at least pass.

also, as for the k8s soft dynamic allocation upstream PR, agreed that we should upstream the changes, so we'll take a look at that PR. however, i don't think we should block this work on that, so let's merge this into the feature branch for now and then undo this when the upstream PR merges

@lwwmanning
Copy link

Cool, yeah, don't want to block experimentation. However, I would want to block merging the feature branch into palantir/master on effectively reverting this PR.

@mccheah
Copy link
Author

mccheah commented Mar 13, 2019

When we propose the experimental work against upstream we're going to cherry-pick our patches to make the diff there. Then when we merge into palantir/master we can resolve the merge conflicts and deal with our stuff then. Or, if apache#24083 merges we'll just take that in the feature branch anyways.

@mccheah
Copy link
Author

mccheah commented Mar 13, 2019

The only place I didn't check out master in was in SparkContext actually - just fixed the compilation errors manually there.

@mccheah
Copy link
Author

mccheah commented Mar 13, 2019

I'll try also taking a merge from upstream - there's a lot of conflicts but I think they're routine that we fix all the time for the most part.

mccheah added 3 commits March 13, 2019 13:18
Can't run with dynamic allocation without the external shuffle service anymore.
@mccheah
Copy link
Author

mccheah commented Mar 13, 2019

@ifilonenko

@rynorris
Copy link

@mccheah looks to me like you've hit all the parts of the core code (that I can recall).

You haven't reverted the changes to the k8s pieces made in #427 though, mainly ExecutorPodsAllocator. Are you intending to keep those changes?

@mccheah
Copy link
Author

mccheah commented Mar 14, 2019

I took a closer look at the diff. I believe everything is only with regards to SafeLogging. Might as well keep those.

@mccheah
Copy link
Author

mccheah commented Mar 14, 2019

Ok, the build is finally passing. Not sure what was causing the problems.

@yifeih can we get a final sign off to merge here?

@mccheah
Copy link
Author

mccheah commented Mar 14, 2019

I'm going to revert the changes I made to circleci YML that forced the caches to invalidate. That might make the build fail again if it restores bad caches, but we'll declare that the build was passing at one point and hopefully the build on the feature branch should pass regardless.

@mccheah
Copy link
Author

mccheah commented Mar 14, 2019

Ok, merging now!

@mccheah mccheah merged commit 5cf18c4 into spark-25299 Mar 14, 2019
Copy link

@svc-spark-25299 svc-spark-25299 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

================================================================================================
SortShuffleWriter writer
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
SortShuffleWriter without spills:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
small dataset without spills                         10             17           6          0.1       10042.1       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
SortShuffleWriter with spills:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
no map side combine                               14220          14375         148          0.5        2118.9       1.0X
with map side aggregation                         14147          14279         169          0.5        2108.0       1.0X
with map side sort                                14115          14310         174          0.5        2103.3       1.0X



mccheah added a commit that referenced this pull request Sep 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants