Improve combineByKey docs
* Add note on memory allocation
* Change example code to use different mergeValue and mergeCombiners
David Gingrich committed Apr 5, 2017
1 parent a2d8d76 commit 74691d1
Showing 1 changed file with 19 additions and 5 deletions: python/pyspark/rdd.py
@@ -1804,17 +1804,31 @@ def combineByKey(self, createCombiner, mergeValue, mergeCombiners,
               a one-element list)
             - C{mergeValue}, to merge a V into a C (e.g., adds it to the end of
               a list)
-            - C{mergeCombiners}, to combine two C's into a single one.
+            - C{mergeCombiners}, to combine two C's into a single one (e.g., merges
+              the lists)
 
+        To avoid memory allocation, both mergeValue and mergeCombiners are allowed to
+        modify and return their first argument instead of creating a new C.
+
         In addition, users can control the partitioning of the output RDD.
 
         .. note:: V and C can be different -- for example, one might group an RDD of type
            (Int, Int) into an RDD of type (Int, List[Int]).
 
-        >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
-        >>> def add(a, b): return a + str(b)
-        >>> sorted(x.combineByKey(str, add, add).collect())
-        [('a', '11'), ('b', '1')]
+        >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
+        >>> def to_list(a):
+        ...     return [a]
+        ...
+        >>> def append(a, b):
+        ...     a.append(b)
+        ...     return a
+        ...
+        >>> def extend(a, b):
+        ...     a.extend(b)
+        ...     return a
+        ...
+        >>> sorted(x.combineByKey(to_list, append, extend).collect())
+        [('a', [1, 2]), ('b', [1])]
         """
         if numPartitions is None:
             numPartitions = self._defaultReducePartitions()
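
As context for the revised doctest: the case where C differs from V is the main reason combineByKey exists at all. Below is a minimal sketch in the same doctest style, computing a per-key average by accumulating a (sum, count) pair; it assumes a live SparkContext bound to sc, and the helper names (make_acc, merge_value, merge_accs) are illustrative only, not part of this patch:

>>> data = sc.parallelize([("a", 1), ("a", 3), ("b", 5)])
>>> def make_acc(v):
...     return (v, 1)  # C is a (running sum, count) pair; V is a plain int
...
>>> def merge_value(acc, v):
...     return (acc[0] + v, acc[1] + 1)  # tuples are immutable, so a fresh C is returned
...
>>> def merge_accs(a, b):
...     return (a[0] + b[0], a[1] + b[1])  # combine accumulators from different partitions
...
>>> sums = data.combineByKey(make_acc, merge_value, merge_accs)
>>> sorted((k, float(s) / n) for (k, (s, n)) in sums.collect())
[('a', 2.0), ('b', 5.0)]

Because the accumulator here is a tuple, every merge allocates a new C; with a mutable C such as the list in the patch's doctest, mutating and returning the first argument avoids that allocation, which is exactly the behavior the new note documents as permitted.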