[SPARK-8309] [CORE] Support for more than 12M items in OpenHashMap #6763
Conversation
Jenkins, this is ok to test.

Test build #34701 timed out for PR 6763 at commit …

I can't figure out how this failure could be related to my fix. The test I've added takes only a few seconds to complete: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/34701/testReport/org.apache.spark.util.collection/OpenHashMapSuite/ Could it be just a random timeout?
@@ -278,7 +279,7 @@ object OpenHashSet {

   val INVALID_POS = -1
   val NONEXISTENCE_MASK = 0x80000000
-  val POSITION_MASK = 0xEFFFFFF
+  val POSITION_MASK = 0x1FFFFFFF
I think you're right that this is a subtle but important bug, but it looks like the intent is to use all but the top bit. That's `0x7FFFFFFF`, not `0x1FFFFFFF`. Therefore the max position and size is 2^31-1, not 2^29, and that's already the max value of an int, so I don't think the check is needed. Well, you could check for a negative value. Basically it's reusing the sign bit, which would never otherwise be used since position and size must be positive.
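To make the masking point concrete, here is a small illustrative snippet (a sketch of the bit-packing idea, not the actual OpenHashSet code): the old constant silently drops bit 24, while an all-but-sign-bit mask preserves the position and leaves the sign bit free for the nonexistence flag.

```scala
// Illustrative sketch only, not OpenHashSet's code.
object MaskDemo extends App {
  val oldMask  = 0xEFFFFFF    // really 0x0EFFFFFF: bit 24 is zero
  val fullMask = 0x7FFFFFFF   // every bit except the sign bit
  val flagBit  = 0x80000000   // sign bit reserved as the "nonexistent" flag

  val pos = 1 << 24           // a slot index just past 16M entries
  println((pos & oldMask).toHexString)   // "0"       -> position lost
  println((pos & fullMask).toHexString)  // "1000000" -> position preserved

  // Packing the flag with a position and recovering the position:
  val packed = flagBit | pos
  println(packed & fullMask)             // 16777216 == pos
}
```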
It's easy to make it support 2^30 capacity, but supporting 2^31 would require some hacks. In JDK 8 the maximum array size is 2^31 - 1, so we'd need to store the item with hashCode 2^31 - 1 somewhere else. That would require an additional check that would probably affect performance.
As I remember, in JDK 6 the max array size is either 2^31 - 4 or 2^31 - 5, so JDK 6 support would require some additional work.
I see the following possibilities:
- Leave the fix as is
- Update it to support capacity 2^30
- Make it support 2^31 with some hacks
- Make it support even larger capacity by splitting value storage into several arrays.
IMO, the second option is the most reasonable, since 1B max capacity is definitely better than 500M. :)
On the other hand, options 3 & 4 look like overkill: due to the distributed nature of Spark, it's usually not necessary to collect more than a billion items on a single machine, even when working with multi-billion-item datasets.
Yes, 2^31 is not possible at all. There are caveats to the actual max array size, yes, but this is really an orthogonal issue. I think it's best to not assert about the size here at all, or just assert about a negative value on overflow. I don't think anything else can or should be done. The right value of …
Sean, I've updated the request. I've verified that the OpenHashMap works fine with 2^30 capacity. I didn't make a test for it, since it requires …
Test build #34778 has finished for PR 6763 at commit …
@@ -45,7 +45,7 @@ class OpenHashSet[@specialized(Long, Int) T: ClassTag](
     loadFactor: Double)
   extends Serializable {

-  require(initialCapacity <= (1 << 29), "Can't make capacity bigger than 2^29 elements")
+  require(initialCapacity <= (1 << 30), "Can't make capacity bigger than 2^30 elements")
Ah right, because it chooses the next greater power of 2 as the capacity, so this limit sounds correct.
Still, I'm not sure why the code had `1 << 29` then, and actually the old `POSITION_MASK` was `0x0EFFFFFF`, not `0xEFFFFFFF`. It looks like it really should be `0x7FFFFFFF`, but maybe we should check with @rxin to make sure I'm not missing something?
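For context on the "next greater power of 2" behavior mentioned above, here is a rough sketch of such a capacity computation (the helper name and exact rounding rule are assumptions for illustration, not a quote of OpenHashSet):

```scala
// Hypothetical sketch of rounding a requested capacity up to a power of two.
def nextPowerOf2(n: Int): Int = {
  val highBit = Integer.highestOneBit(n)
  if (highBit == n) n else highBit << 1
}

// A request for 12M items becomes a 16M-slot table (2^24); any request up to
// (1 << 30) rounds to at most 2^30 slots, so a (1 << 30) limit on the
// requested capacity keeps the rounded capacity within bounds.
println(nextPowerOf2(12000000))   // 16777216
println(nextPowerOf2(1 << 30))    // 1073741824
```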
Oh, hah, you'll have to change this test in … It runs out of memory now since it succeeds.
@srowen Fixed, sorry about that. I wonder how I could miss it.
Test build #34783 has finished for PR 6763 at commit …
LGTM and an important fix, potentially. Let me leave it for a short while for review.

/cc @zsxwing

@SlavikBaranov Could you check how …
@@ -278,7 +279,7 @@ object OpenHashSet {

   val INVALID_POS = -1
   val NONEXISTENCE_MASK = 0x80000000
-  val POSITION_MASK = 0xEFFFFFF
+  val POSITION_MASK = 0x7FFFFFFF
Could you change `NONEXISTENCE_MASK` to `1 << 31` and `POSITION_MASK` to `(1 << 31) - 1`? Just for readability.
As these are masks, I'm not sure that's more readable. A hex string is what I would expect.
@srowen I think it's very easy to miss an `F` in a hex string, just like in this issue.
It's a fair case in point here, yes. We fixed it. Well, I'm not against the alternate expression.
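For what it's worth, both spellings denote the same 32-bit patterns (Int arithmetic wraps), so the choice really is only about readability; a quick illustrative check:

```scala
// Sanity check: shift expressions vs. hex literals for the two masks.
assert(0x80000000 == (1 << 31))        // sign bit only
assert(0x7FFFFFFF == (1 << 31) - 1)    // all bits except the sign bit
```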
@zsxwing I'm not sure that's entirely safe, since the code appears to rely on rehash making more space. If it just does nothing when already at the max size, eventually an add operation will go into an infinite loop.
@zsxwing are you talking about changing the condition in …?
Well… it's possible (and it's possible to guard against an infinite loop), but as the load increases, both put and lookup time become O(n). I mean, in the worst-case scenario, adding the last item takes 2^30 iterations, and a subsequent lookup of that item takes the same number of iterations. Is my understanding correct, or am I missing something?
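To illustrate the concern in this exchange: an open-addressing insert probes until it finds a free slot, so a table that can no longer grow either loops forever or degrades toward O(n) probes as it fills up. A generic linear-probing sketch (illustrative only; OpenHashSet's actual probing scheme may differ):

```scala
// Generic open-addressing insert, illustrative only.
// If rehash() can no longer grow the table, a completely full table means the
// loop below never finds a free slot -- hence the need to fail loudly instead.
def insert(keys: Array[Int], used: Array[Boolean], k: Int): Int = {
  val mask = keys.length - 1               // capacity is a power of two
  var pos = k.hashCode & mask
  var probes = 0
  while (used(pos) && keys(pos) != k) {    // worst case: ~capacity probes
    pos = (pos + 1) & mask
    probes += 1
    if (probes > keys.length) throw new IllegalStateException("hash table is full")
  }
  keys(pos) = k
  used(pos) = true
  pos
}
```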
I agree with the concern about the worst-case scenario. Maybe the error message should be improved. @JoshRosen, what do you think about the max capacity issue?
Test build #34934 has finished for PR 6763 at commit …
@zsxwing Is the updated error message OK?
@@ -223,6 +224,8 @@ class OpenHashSet[@specialized(Long, Int) T: ClassTag](
    */
   private def rehash(k: T, allocateFunc: (Int) => Unit, moveFunc: (Int, Int) => Unit) {
     val newCapacity = _capacity * 2
+    require(newCapacity <= OpenHashSet.MAX_CAPACITY,
I agree this is the theoretically largest number of elements that can be in the set. The failure will occur any time that twice the grow threshold exceeds `MAX_CAPACITY`, which can happen when the collection is less full than this. So I am actually not sure what's clearer here. Up to you.
I think we still have a little problem here, because when capacity reaches 2^30, twice that number becomes negative and `newCapacity <= OpenHashSet.MAX_CAPACITY` is still true because of overflow. Check whether the existing capacity is `<= OpenHashSet.MAX_CAPACITY / 2` first?
@srowen Integer overflow is not possible with the current `MAX_CAPACITY` setting, since 2^31 is still a positive number. Anyway, I've added a check for positive capacity. IMO it's a clearer way to guard against overflow.
2^31 is positive, but it is not representable as a 32-bit signed int.
scala> val _capacity = 1 << 30
_capacity: Int = 1073741824
scala> val newCapacity = _capacity * 2
newCapacity: Int = -2147483648
So that's an important check.
Oh, yes. Sorry :)
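Putting the two observations together, the rehash-time guard ends up as a combined positivity-and-bound check; a sketch of the idea (paraphrasing the discussion, not quoting the final patch verbatim):

```scala
// Sketch: the positivity check catches 2^30 * 2 overflowing to Int.MinValue.
val MAX_CAPACITY = 1 << 30

def checkedDouble(capacity: Int): Int = {
  val newCapacity = capacity * 2
  require(newCapacity > 0 && newCapacity <= MAX_CAPACITY,
    s"Can't grow the hash table past $MAX_CAPACITY slots")
  newCapacity
}

// checkedDouble(1 << 29) == 1 << 30, while checkedDouble(1 << 30) fails
// instead of silently wrapping to a negative capacity.
```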
Test build #34978 has finished for PR 6763 at commit …
The problem occurs because the position mask `0xEFFFFFF` is incorrect. It has a zero 25th bit, so when the capacity grows beyond 2^24, `OpenHashMap` calculates an incorrect index of the value in the `_values` array. I've also added a size check in `rehash()`, so that it fails instead of reporting invalid item indices.

Author: Vyacheslav Baranov <[email protected]>

Closes #6763 from SlavikBaranov/SPARK-8309 and squashes the following commits:

8557445 [Vyacheslav Baranov] Resolved review comments
4d5b954 [Vyacheslav Baranov] Resolved review comments
eaf1e68 [Vyacheslav Baranov] Fixed failing test
f9284fd [Vyacheslav Baranov] Resolved review comments
3920656 [Vyacheslav Baranov] SPARK-8309: Support for more than 12M items in OpenHashMap

(cherry picked from commit c13da20)
Signed-off-by: Sean Owen <[email protected]>
…HashSetSuite and make it against OpenHashSet

## What changes were proposed in this pull request?

The specified test in OpenHashMapSuite for large numbers of items is somewhat flaky and can throw OOM. Considering the original work #6763 that added this test, the test can be moved to OpenHashSetSuite and run against OpenHashSet instead. Doing so should also save memory, because OpenHashMap allocates two more arrays when growing the map/set.

## How was this patch tested?

Existing tests.

Closes #22569 from viirya/SPARK-25542.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit b7d8034)
Signed-off-by: Dongjoon Hyun <[email protected]>
The problem occurs because the position mask `0xEFFFFFF` is incorrect. It has a zero 25th bit, so when the capacity grows beyond 2^24, `OpenHashMap` calculates an incorrect index of the value in the `_values` array.

I've also added a size check in `rehash()`, so that it fails instead of reporting invalid item indices.
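As a closing illustration, the large-capacity regression test discussed in this thread boils down to inserting enough items that the table grows past 2^24 slots and verifying that everything can still be found. Below is a rough sketch against OpenHashSet (the collection the later SPARK-25542 change tests against); the counts and wiring are assumptions rather than the exact suite code, and the class is `private[spark]`, so this only compiles inside the Spark source tree:

```scala
import org.apache.spark.util.collection.OpenHashSet

// Rough sketch of a "more than 12M items" check, not the exact suite code.
// With the default 0.7 load factor, ~12M entries push the table past 2^24
// slots, the region where the old 0xEFFFFFF mask corrupted positions.
val cnt = 12000000
val set = new OpenHashSet[Int](cnt)
var i = 0
while (i < cnt) {
  set.add(i)
  i += 1
}
assert(set.size == cnt)
assert((0 until cnt).forall(set.contains))
```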