[SPARK-2871] [PySpark] add RDD.lookup(key) #2093
Conversation
QA tests have started for PR 2093 at commit
QA tests have finished for PR 2093 at commit
are you planning to add tests for this?
refactor
The doc tests should cover all the code paths; do we still need more tests?
QA tests have started for PR 2093 at commit
QA tests have finished for PR 2093 at commit
it's worth including a lookup for 1000 or 1234, which won't be found
@mattf I have added a test case for it, thx. I did a lot of refactoring in this PR, so please re-review it, thanks.
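(For reference, a hedged sketch of what a missing-key doctest could look like; the exact test added in the PR may differ.)

>>> rdd = sc.parallelize(zip(range(1000), range(1000)), 10)
>>> rdd.lookup(1234)                 # key not present anywhere
[]
>>> rdd.sortByKey().lookup(1234)     # partitioner-aware path, also empty
[]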
QA tests have started for PR 2093 at commit
QA tests have finished for PR 2093 at commit
>>> sc.parallelize([]).reduce(add)
Traceback (most recent call last):
    ...
ValueError: Can not reduce() of empty RDD
Minor nit, but I'd drop the 'of' and just say "Cannot reduce() empty RDD"
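(For context, a hedged sketch of how a reduce() like this could surface that error; illustrative only, not the exact PySpark source. It assumes Python 2, where reduce is a builtin, and an RDD that does not contain None elements.)

def reduce_sketch(rdd, f):
    # Fold each partition locally; only non-empty partitions yield a value.
    def func(iterator):
        acc = None                       # sketch: assumes no None elements in the RDD
        for obj in iterator:
            acc = obj if acc is None else f(acc, obj)
        if acc is not None:
            yield acc
    vals = rdd.mapPartitions(func).collect()
    if vals:
        return reduce(f, vals)           # Python 2 builtin reduce
    raise ValueError("Cannot reduce() empty RDD")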
This PR is great. I like how you systematically addressed all of the bugs in It looks like there's a merge conflict due to me merging another one of your patches. Once you fix that and address my comments, I'll merge this into master (and hopefully
Conflicts: python/pyspark/rdd.py
Jenkins, test this please.
QA tests have started for PR 2093 at commit
QA tests have started for PR 2093 at commit
QA tests have finished for PR 2093 at commit
QA tests have finished for PR 2093 at commit
i'm never a fan of code reformatting, whitespace changes or refactoring at the same time as functional changes, e.g.
- if (k, v) => if k, v
- v if k not in m else func(m[k], v) => func(m[k], v) if k in m else v
- MaxHeapQ
- if not x: return None => if x: return y

while all those are good changes, it's good practice (and future you, me and others will thank you) to bundle those changes w/ their own justification.

for instance - question: why did you change to using count() instead of collect() to force evaluation?
@mattf While I was scanning the whole file line by line to find all the issues related to preservesPartitioning, I reformatted lines at the same time when they did not look nice to me. It's a completely personal judgement, so maybe it does not make sense to others. It's not a good idea to do this kind of reformatting in a PR; I was also thinking of doing it as a separate PR, or not doing it at all if there is no compelling reason. Should I remove these unrelated changes?
count() is supposed to be cheaper than collect(); we call count() instead of collect() to trigger the computation in Scala/Java, so it's better to keep the same style in Python. But in PySpark, count() depends on collect(), which dumps the results to disk and loads them into Python. In the future this may change, and count() will return a number straight from the JVM. Right now there is no strong reason to change collect() to count(); should I revert it?
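(For illustration, a minimal sketch of the two ways of forcing evaluation being discussed. slow_fn is a hypothetical stand-in for the real transformation, and the "only counts come back" behaviour is the Scala/Java case, not PySpark's.)

>>> def slow_fn(x):
...     return x * x                     # hypothetical stand-in for real work
...
>>> rdd = sc.parallelize(range(10000), 10).map(slow_fn)   # lazy: nothing runs yet
>>> len(rdd.collect())   # forces evaluation, ships all 10000 results to the driver
10000
>>> rdd.count()          # also forces evaluation; on the JVM side only per-partition counts return
10000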
QA tests have started for PR 2093 at commit
QA tests have finished for PR 2093 at commit
@mattf I agree with you that it's generally a good idea to separate functional vs. cosmetic changes (I've weighed in against several PRs that only perform minor code formatting changes, especially ones that touch tens or hundreds of files). Ideally, this PR would have only made the necessary changes for
@ScrapCodes Can you explain the original motivation for MaxHeapQ? Is there a reason why you used it instead of
Simply because heapq is a min-heap implementation, and to track "top N" I needed a max heap. #97 has some discussion. Let me see if I can come up with something.
Hey, so I think heapq.nsmallest does the same thing. It should be correct to use it. From a quick look at its source, though, I was doubtful about its performance since it has a sort call in each nsmallest. For larger N, I think MaxHeapQ will perform better. What do you think?
Only when N (top) > S (total) will nsmallest() sort, and it's faster than a heap because it's implemented in C. For N << S, nsmallest() will also be faster than MaxHeapQ, because it's implemented in C (_heapq). nsmallest() is stable: for the same key, the values keep the same order as in the original RDD. nsmallest(n, it, key=None) / nlargest(n, it, key=None) are available since Python 2.5.
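(A small usage sketch of the stdlib functions being discussed, showing the key= argument and the stability for equal keys; not from the PR itself.)

>>> import heapq
>>> pairs = [(3, "a"), (1, "b"), (3, "c"), (2, "d"), (1, "e")]
>>> heapq.nsmallest(3, pairs, key=lambda kv: kv[0])   # ties keep their input order
[(1, 'b'), (1, 'e'), (2, 'd')]
>>> heapq.nlargest(2, pairs, key=lambda kv: kv[0])
[(3, 'a'), (3, 'c')]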
btw, MaxHeapQ is actually a min heap, so why do we call it
My point was that a MaxHeap implementation does not need to sort. For most values it may just replace the top, or do at worst log N swaps. But I think even though the sort is a C-based implementation, it would be N log N (at best). My Python knowledge is definitely limited; I will be happy to be corrected.
From what I have learnt, a MaxHeap keeps the largest element at the top. Is that wrong?
If N (top) > S (total), sort() will be S log S, and MaxHeapQ keeps all of the elements, so it's also S * log S (log S per swap); they have the same CPU time algorithmically (a heap is also a kind of sorting algorithm). @ScrapCodes Sorry, I misunderstood it. You are right, it's a max heap.
Some quick benchmark:

>>> from pyspark.rdd import MaxHeapQ
>>> import heapq, random, timeit
>>> l = range(1 << 13)
>>> random.shuffle(l)
>>> def take1():
...     q = MaxHeapQ(100)
...     for i in l:
...         q.insert(i)
...     return q.getElements()
...
>>> def take2():
...     return heapq.nsmallest(100, l)
...
>>> # for S >> N
>>> print timeit.timeit("take1()", "from __main__ import *", number=100)
0.748146057129
>>> print timeit.timeit("take2()", "from __main__ import *", number=100)
0.142593860626
>>> # for N > S
>>> l = range(80)
>>> random.shuffle(l)
>>> print timeit.timeit("take1()", "from __main__ import *", number=1000)
0.156821012497
>>> print timeit.timeit("take2()", "from __main__ import *", number=1000)
0.00907206535339

Whether S < N or S > N, nsmallest() is much faster than MaxHeapQ.
In PyPy 2.3, the result is reversed: MaxHeapQ is 3x faster than nsmallest(). They are both implemented in pure Python there, and nsmallest() does more work than MaxHeapQ because it keeps the result stable. Even if nsmallest() did not do a stable sort, MaxHeapQ would still be 30% faster, because nsmallest() will try to call __le__ if there is no __lt__ (while MaxHeapQ will fail if the object has no __gt__). BTW, I admire that PyPy does so well at optimizing these algorithms.
Looks like I understand what you mean now. I am still curious to benchmark the takeOrdered function with both approaches. There is definitely a 3x difference between the C and Python implementations (w/o PyPy). Can we use PyPy with Spark?
if it were up to me, i'd say yes. it's not though, so i'll go with the flow. i'm still trying to get a feel for what the spark community likes in its PRs and JIRAs.
thank you for the explanation, it wasn't clear from the code. my preference is for isolated changes, so i'd suggest reverting and doing it separately. however, others may not agree. so i'd say at least add a comment about why count() is used -- someone might come along and change it back to collect() without knowing they shouldn't.
@mattf I have only been working on Spark recently, and am trying to follow the process like others do; I have made some mistakes at times and hope I will do better, thanks. There are many things that need to be done right now (especially for the 1.1 release), and quality and process are both important things we should take care of. Could we merge this, so maybe it can catch the last train for 1.1?
@ScrapCodes I have run PySpark with PyPy successfully; I will send out a PR and some benchmarks later.
i understand, but not my call. i'll probably have a stronger opinion in coming weeks :smile:
@davies Thanks, just saw your patch.
This looks good to me. Thanks for discussing MaxHeapQ and verifying that its behavior is the same as I'm going to merge this into
RDD.lookup(key)

Return the list of values in the RDD for key `key`. This operation is done efficiently if the RDD has a known partitioner, by only searching the partition that the key maps to.

>>> l = range(1000)
>>> rdd = sc.parallelize(zip(l, l), 10)
>>> rdd.lookup(42)  # slow
[42]
>>> sorted = rdd.sortByKey()
>>> sorted.lookup(42)  # fast
[42]

It also cleans up the code in RDD.py and fixes several bugs (related to preservesPartitioning).

Author: Davies Liu <[email protected]>

Closes apache#2093 from davies/lookup and squashes the following commits:

1789cd4 [Davies Liu] `f` in foreach could be generator or not.
2871b80 [Davies Liu] Merge branch 'master' into lookup
c6390ea [Davies Liu] address all comments
0f1bce8 [Davies Liu] add test case for lookup()
be0e8ba [Davies Liu] fix preservesPartitioning
eb1305d [Davies Liu] add RDD.lookup(key)
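(To illustrate why a known partitioner makes lookup() cheap, here is a minimal plain-Python sketch of the idea: hash the key to its partition and scan only that partition. It is not the PySpark implementation, and partition_of is a hypothetical helper.)

def partition_of(key, num_partitions):
    return hash(key) % num_partitions          # same rule a hash partitioner would apply

data = list(zip(range(1000), range(1000)))
num_partitions = 10
partitions = [[] for _ in range(num_partitions)]
for k, v in data:
    partitions[partition_of(k, num_partitions)].append((k, v))

# Without a partitioner every element is scanned; with one, a single partition is.
slow = [v for k, v in data if k == 42]
fast = [v for k, v in partitions[partition_of(42, num_partitions)] if k == 42]
assert slow == fast == [42]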