[SPARK-10797] RDD's coalesce should not write out the temporary key #8979
Conversation
Can I have a test build here?
Test build #1850 has finished for PR 8979 at commit
Fixed Scala style.
Can I have a test build here?
Hey @ehnalis, I like the core idea behind this patch. However, I'm a bit concerned about how many files it needs to touch and the level of complexity here. I think that your approach 1, creating a new ShuffledRDD variant to do a value-only shuffle, actually won't be too much work. I did something similar, but optimized only for Spark SQL's
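The value-only shuffle idea discussed above can be illustrated with a small, Spark-free sketch (plain Python with hypothetical helper names; Spark's actual shuffle path is far more involved). Today a `coalesce(shuffle = true)`-style shuffle serializes a temporary `(key, value)` pair per record and drops the key afterwards; a value-only variant would compute the target partition but serialize only the value:

```python
import pickle

def shuffle_with_temp_key(items, num_partitions):
    """Current scheme: wrap each value in a (position, value) pair,
    serialize the whole pair, then discard the key after the shuffle."""
    buckets = [[] for _ in range(num_partitions)]
    position = 0
    for t in items:
        position += 1
        key = position % num_partitions
        buckets[key].append(pickle.dumps((key, t)))  # key bytes are written too
    return buckets

def shuffle_value_only(items, num_partitions):
    """Proposed scheme: route by a computed partition id, but serialize
    only the value itself."""
    buckets = [[] for _ in range(num_partitions)]
    position = 0
    for t in items:
        position += 1
        buckets[position % num_partitions].append(pickle.dumps(t))
    return buckets

words = ["hello", "world", "spark", "rdd"]
a = shuffle_with_temp_key(words, 2)
b = shuffle_value_only(words, 2)

# Both schemes place the same values in the same partitions...
assert [[pickle.loads(x)[1] for x in p] for p in a] == \
       [[pickle.loads(x) for x in p] for p in b]
# ...but the keyed scheme writes strictly more bytes per record.
assert all(len(x) > len(y) for p1, p2 in zip(a, b) for x, y in zip(p1, p2))
```

The routing logic is identical in both functions; only the serialized payload changes, which is why the patch can reduce shuffle size without changing which records land where.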
Actually, let me think about this a bit more. The approach here might be easier, but I'll need to spend a bit more time thinking it through in the context of some other planned shuffle refactorings.
It does touch a few files, but the level of complexity it introduces is minimal, I think. It might be convenient for now to apply this patch and reconsider a step-by-step shuffle refactoring later. The sad thing is that other approaches would touch
I'm going to look at your previous PR tomorrow and think through a more generic way of solving this, but I think it would mean a huge hit on the shuffle codebase. Let me know if you would like me to rebase this PR.
I've looked into your implementation of
@JoshRosen, how's your refactoring going? Should I rebase the PR again?
@ehnalis, have you done any benchmarking on how much this actually saves us, both in terms of time and shuffle size?
@andrewor14 On a 4-node YARN cluster I measured a 38% shuffle-size reduction when the payload was an (Int, String) pair (JavaSerializer), where the average String length was 5.1 characters (English words). I did not benchmark time. Practically, you save an
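The mechanism behind that reported reduction can be sketched outside the JVM with Python's `pickle` (illustration only: JavaSerializer's framing is different, so the absolute numbers and ratio will not match the 38% figure, but the direction of the effect is the same for short values):

```python
import pickle

# Hypothetical payload mirroring the benchmark: short English words (~5 chars)
# paired with an Int key. With small values, the key is a large fraction of
# each serialized record, which is why dropping it shrinks the shuffle.
words = ["apple", "table", "chair", "mouse", "plant"]

keyed = sum(len(pickle.dumps((i, w))) for i, w in enumerate(words))
plain = sum(len(pickle.dumps(w)) for w in words)

overhead = (keyed - plain) / keyed
print(f"keyed={keyed}B plain={plain}B key overhead={overhead:.0%}")
```

The smaller the values, the larger the relative saving; for big values the temporary key becomes noise, which matches the observation above that the speed-up depends on payload size.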
@zzvara This is a pretty cool idea -- we probably want to generalize it a bit more in Spark 2.1 or 2.2. I'm going through a list of pull requests to cut them down, since the sheer number is breaking some of our tooling. Due to the lack of activity on this pull request, I'm going to push a commit to close it. Let's keep discussing it in the JIRA ticket and revisit.
I think we have the following options to solve this problem:
I've implemented the 3rd option and I'm using it. The speed-up you see depends on the size of the current payload (the values).