[SPARK-2871] [PySpark] add RDD.lookup(key) #2093
Conversation
QA tests have started for PR 2093 at commit
QA tests have finished for PR 2093 at commit
are you planning to add tests for this?
refactor
The doc tests should cover all the code paths; do we still need more tests?
QA tests have started for PR 2093 at commit
QA tests have finished for PR 2093 at commit
it's worth including a lookup for 1000 or 1234, which won't be found
@mattf I have added a test case for it, thx. I did a lot of refactoring in this PR, so please re-review it, thanks.
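(For reference, a hedged sketch of what a missing-key doctest could look like; the exact test added in the PR may differ.)

>>> rdd = sc.parallelize(zip(range(1000), range(1000)), 10)
>>> rdd.lookup(1234)                 # key not present anywhere
[]
>>> rdd.sortByKey().lookup(1234)     # partitioner-aware path, also empty
[]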
QA tests have started for PR 2093 at commit
QA tests have finished for PR 2093 at commit
>>> sc.parallelize([]).reduce(add)
Traceback (most recent call last):
    ...
ValueError: Can not reduce() of empty RDD
Minor nit, but I'd drop the 'of' and just say "Cannot reduce() empty RDD"
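(For context, a hedged sketch of how a reduce() like this could surface that error; illustrative only, not the exact PySpark source. It assumes Python 2, where reduce is a builtin, and an RDD that does not contain None elements.)

def reduce_sketch(rdd, f):
    # Fold each partition locally; only non-empty partitions yield a value.
    def func(iterator):
        acc = None                       # sketch: assumes no None elements in the RDD
        for obj in iterator:
            acc = obj if acc is None else f(acc, obj)
        if acc is not None:
            yield acc
    vals = rdd.mapPartitions(func).collect()
    if vals:
        return reduce(f, vals)           # Python 2 builtin reduce
    raise ValueError("Cannot reduce() empty RDD")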
This PR is great. I like how you systematically addressed all of the bugs in It looks like there's a merge conflict due to me merging another one of your patches. Once you fix that and address my comments, I'll merge this into master (and hopefully
Conflicts: python/pyspark/rdd.py
Jenkins, test this please.
QA tests have started for PR 2093 at commit
QA tests have started for PR 2093 at commit
QA tests have finished for PR 2093 at commit
QA tests have finished for PR 2093 at commit
i'm never a fan of code reformatting, whitespace changes or refactoring at the same time as functional changes, e.g.
- if (k, v) => if k, v
- v if k not in m else func(m[k], v) => func(m[k], v) if k in m else v
- MaxHeapQ
- if not x: return None => if x: return y

while all those are good changes, it's good practice (and future you, me and others will thank you) to bundle those changes w/ their own justification.

for instance - question: why did you change to using count() instead of collect() to force evaluation?
@mattf While I was scanning the whole file line by line to find all the issues related to preservesPartitioning, I reformatted lines at the same time when they did not look nice to me. It's a completely personal judgement, so maybe it does not make sense to others. It's not a good idea to do this kind of reformatting in a PR; I was also thinking of doing it as a separate PR, or not doing it at all if there is no compelling reason. Should I remove these unrelated changes?
count() is supposed to be cheaper than collect(); we call count() instead of collect() to trigger the computation in Scala/Java, so it's better to keep the same style in Python. But in PySpark, count() depends on collect(), which dumps the results to disk and loads them into Python. In the future this may change, and count() will return a number straight from the JVM. Right now there is no strong reason to change collect() to count(); should I revert it?
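(For illustration, a minimal sketch of the two ways of forcing evaluation being discussed. slow_fn is a hypothetical stand-in for the real transformation, and the "only counts come back" behaviour is the Scala/Java case, not PySpark's.)

>>> def slow_fn(x):
...     return x * x                     # hypothetical stand-in for real work
...
>>> rdd = sc.parallelize(range(10000), 10).map(slow_fn)   # lazy: nothing runs yet
>>> len(rdd.collect())   # forces evaluation, ships all 10000 results to the driver
10000
>>> rdd.count()          # also forces evaluation; on the JVM side only per-partition counts return
10000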
QA tests have started for PR 2093 at commit
QA tests have finished for PR 2093 at commit
@mattf I agree with you that it's generally a good idea to separate functional vs. cosmetic changes (I've weighed in against several PRs that only perform minor code formatting changes, especially ones that touch tens or hundreds of files). Ideally, this PR would have only made the necessary changes for
@ScrapCodes Can you explain the original motivation for MaxHeapQ? Is there a reason why you used it instead of
Simply because heapq is a min-heap implementation, and to track "top N" I needed a max heap. #97 has some discussion. Let me see if I can come up with something.
Hey, so I think heapq.nsmallest does the same thing. It should be correct to use it. From a quick look at its source, though, I was doubtful about its performance since it has a sort call in each nsmallest. For larger N, I think MaxHeapQ will perform better. What do you think?
Only when N (top) > S (total) will nsmallest() sort, and it's faster than a heap because it's implemented in C. For N << S, nsmallest() will also be faster than MaxHeapQ, because it's implemented in C (_heapq). nsmallest() is stable: for the same key, the values keep the same order as in the original RDD. nsmallest(n, it, key=None) / nlargest(n, it, key=None) are available since Python 2.5.
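(A small usage sketch of the stdlib functions being discussed, showing the key= argument and the stability for equal keys; not from the PR itself.)

>>> import heapq
>>> pairs = [(3, "a"), (1, "b"), (3, "c"), (2, "d"), (1, "e")]
>>> heapq.nsmallest(3, pairs, key=lambda kv: kv[0])   # ties keep their input order
[(1, 'b'), (1, 'e'), (2, 'd')]
>>> heapq.nlargest(2, pairs, key=lambda kv: kv[0])
[(3, 'a'), (3, 'c')]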
btw, MaxHeapQ is actually a min heap, so why do we call it
My point was that a MaxHeap implementation does not need to sort. For most values it may just replace the top, or do at worst log N swaps. But I think even though the sort is a C-based implementation, it would be N log N (at best). My Python knowledge is definitely limited; I will be happy to be corrected.
From what I have learnt, a MaxHeap keeps the largest element at the top. Is that wrong?
If N (top) > S (total), sort() will be S log S, and MaxHeapQ keeps all of the elements, so it's also S * log S (log S per swap); they have the same CPU time algorithmically (a heap is also a kind of sorting algorithm). @ScrapCodes Sorry, I misunderstood it. You are right, it's a max heap.
Some quick benchmark:

>>> from pyspark.rdd import MaxHeapQ
>>> import heapq, random, timeit
>>> l = range(1 << 13)
>>> random.shuffle(l)
>>> def take1():
...     q = MaxHeapQ(100)
...     for i in l:
...         q.insert(i)
...     return q.getElements()
...
>>> def take2():
...     return heapq.nsmallest(100, l)
...
>>> # for S >> N
>>> print timeit.timeit("take1()", "from __main__ import *", number=100)
0.748146057129
>>> print timeit.timeit("take2()", "from __main__ import *", number=100)
0.142593860626
>>> # for N > S
>>> l = range(80)
>>> random.shuffle(l)
>>> print timeit.timeit("take1()", "from __main__ import *", number=1000)
0.156821012497
>>> print timeit.timeit("take2()", "from __main__ import *", number=1000)
0.00907206535339

Whether S < N or S > N, nsmallest() is much faster than MaxHeapQ.
In PyPy 2.3, the result is reversed: MaxHeapQ is 3x faster than nsmallest(). They are both implemented in pure Python there, and nsmallest() does more work than MaxHeapQ because it keeps the result stable. Even if nsmallest() did not do a stable sort, MaxHeapQ would still be 30% faster, because nsmallest() will try to call __le__ if there is no __lt__ (while MaxHeapQ will fail if the object has no __gt__). BTW, I admire that PyPy does so well at optimizing these algorithms.
Looks like I understand what you mean now. I am still curious to benchmark the takeOrdered function with both approaches. There is definitely a 3x difference between the C and Python implementations (w/o PyPy). Can we use PyPy with Spark?
if it were up to me, i'd say yes. it's not though, so i'll go with the flow. i'm still trying to get a feel for what the spark community likes in its PRs and JIRAs.
thank you for the explanation, it wasn't clear from the code. my preference is for isolated changes, so i'd suggest reverting and doing it separately. however, others may not agree. so i'd say at least add a comment about why count() is used -- someone might come along and change it back to collect() without knowing they shouldn't.
@mattf I have only been working on Spark recently, and am trying to follow the process like others do; I have made some mistakes at times and hope I will do better, thanks. There are many things that need to be done right now (especially for the 1.1 release), and quality and process are both important things we should take care of. Could we merge this, so maybe it can catch the last train for 1.1?
@ScrapCodes I have run PySpark with PyPy successfully; I will send out a PR and some benchmarks later.
i understand, but not my call. i'll probably have a stronger opinion in coming weeks :smile:
@davies Thanks, just saw your patch.
This looks good to me. Thanks for discussing MaxHeapQ and verifying that its behavior is the same as I'm going to merge this into
RDD.lookup(key)

Return the list of values in the RDD for key `key`. This operation is done efficiently if the RDD has a known partitioner, by only searching the partition that the key maps to.

>>> l = range(1000)
>>> rdd = sc.parallelize(zip(l, l), 10)
>>> rdd.lookup(42)  # slow
[42]
>>> sorted = rdd.sortByKey()
>>> sorted.lookup(42)  # fast
[42]

It also cleans up the code in RDD.py and fixes several bugs (related to preservesPartitioning).

Author: Davies Liu <[email protected]>

Closes apache#2093 from davies/lookup and squashes the following commits:

1789cd4 [Davies Liu] `f` in foreach could be generator or not.
2871b80 [Davies Liu] Merge branch 'master' into lookup
c6390ea [Davies Liu] address all comments
0f1bce8 [Davies Liu] add test case for lookup()
be0e8ba [Davies Liu] fix preservesPartitioning
eb1305d [Davies Liu] add RDD.lookup(key)
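(To illustrate why a known partitioner makes lookup() cheap, here is a minimal plain-Python sketch of the idea: hash the key to its partition and scan only that partition. It is not the PySpark implementation, and partition_of is a hypothetical helper.)

def partition_of(key, num_partitions):
    return hash(key) % num_partitions          # same rule a hash partitioner would apply

data = list(zip(range(1000), range(1000)))
num_partitions = 10
partitions = [[] for _ in range(num_partitions)]
for k, v in data:
    partitions[partition_of(k, num_partitions)].append((k, v))

# Without a partitioner every element is scanned; with one, a single partition is.
slow = [v for k, v in data if k == 42]
fast = [v for k, v in partitions[partition_of(42, num_partitions)] if k == 42]
assert slow == fast == [42]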