Skip to content

Commit

Permalink
misleading task number of groupByKey
Browse files Browse the repository at this point in the history
"By default, this uses only 8 parallel tasks to do the grouping." is a big misleading. Please refer to #389

detail is as following code :

  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
    for (r <- bySize if r.partitioner.isDefined) {
      return r.partitioner.get
    }
    if (rdd.context.conf.contains("spark.default.parallelism")) {
      new HashPartitioner(rdd.context.defaultParallelism)
    } else {
      new HashPartitioner(bySize.head.partitions.size)
    }
  }

Author: Chen Chao <[email protected]>

Closes #403 from CrazyJvm/patch-4 and squashes the following commits:

42f6c9e [Chen Chao] fix format
829a995 [Chen Chao] fix format
1568336 [Chen Chao] misleading task number of groupByKey

(cherry picked from commit 9c40b9e)
Signed-off-by: Reynold Xin <[email protected]>
  • Loading branch information
CrazyJvm authored and rxin committed Apr 17, 2014
1 parent f0abf5f commit 51c41da
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions docs/scala-programming-guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -189,8 +189,8 @@ The following tables list the transformations and actions currently supported (s
<tr>
<td> <b>groupByKey</b>([<i>numTasks</i>]) </td>
<td> When called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs. <br />
<b>Note:</b> By default, this uses only 8 parallel tasks to do the grouping. You can pass an optional <code>numTasks</code> argument to set a different number of tasks.
</td>
<b>Note:</b> By default, if the RDD already has a partitioner, the task number is decided by the partition number of the partitioner, or else relies on the value of <code>spark.default.parallelism</code> if the property is set , otherwise depends on the partition number of the RDD. You can pass an optional <code>numTasks</code> argument to set a different number of tasks.
</td>
</tr>
<tr>
<td> <b>reduceByKey</b>(<i>func</i>, [<i>numTasks</i>]) </td>
Expand Down

0 comments on commit 51c41da

Please sign in to comment.