
[SPARK-8309] [CORE] Support for more than 12M items in OpenHashMap #6763

Closed
wants to merge 5 commits

Conversation

SlavikBaranov
Contributor

The problem occurs because the position mask 0xEFFFFFF is incorrect. Its 25th bit is zero, so when the capacity grows beyond 2^24, OpenHashMap calculates an incorrect index into the _values array.

I've also added a size check in rehash(), so that it fails instead of reporting invalid item indices.
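
For illustration only (this snippet is not part of the patch), the effect of the missing bit can be seen directly: 0xEFFFFFF is really 0x0EFFFFFF, whose bit 24 is zero, so masking any position at or above 2^24 silently clears that bit.

    // Illustrative sketch, not Spark source: why 0xEFFFFFF corrupts positions >= 2^24
    val brokenMask = 0xEFFFFFF   // really 0x0EFFFFFF; bit 24 is zero
    val fixedMask  = 0x7FFFFFFF  // every bit except the sign bit

    val pos = 1 << 24            // first position past 2^24 (~12M items at load factor 0.7)
    println(pos & brokenMask)    // 0 -- the position is silently truncated
    println(pos & fixedMask)     // 16777216 -- the position is preserved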

@JoshRosen
Contributor

Jenkins, this is ok to test.

@SparkQA

SparkQA commented Jun 11, 2015

Test build #34701 timed out for PR 6763 at commit 3920656 after a configured wait of 175m.

@SlavikBaranov
Contributor Author

I can't figure out how this failure could be related to my fix. The test I've added takes only a few seconds to complete: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/34701/testReport/org.apache.spark.util.collection/OpenHashMapSuite/

Could it be just a random timeout?

@@ -278,7 +279,7 @@ object OpenHashSet {

   val INVALID_POS = -1
   val NONEXISTENCE_MASK = 0x80000000
-  val POSITION_MASK = 0xEFFFFFF
+  val POSITION_MASK = 0x1FFFFFFF
Member

I think you're right that this is a subtle but important bug, but it looks like the intent is to use all but the top bit. That's 0x7FFFFFFF, not 0x1FFFFFFF. Therefore the max position and size is 2^31 - 1, not 2^29, and that's already the max value of an int, so I don't think the check is needed. You could, however, check for a negative value. Basically it's reusing the sign bit, which would never otherwise be used since position and size must be positive.
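
A minimal sketch of the sign-bit reuse described above (illustrative code, not the exact OpenHashSet internals): the top bit can flag a freshly inserted position while the lower 31 bits carry the position itself.

    val NONEXISTENCE_MASK = 0x80000000
    val POSITION_MASK = 0x7FFFFFFF

    def markNew(pos: Int): Int = pos | NONEXISTENCE_MASK  // set the sign bit as a flag
    def position(code: Int): Int = code & POSITION_MASK   // strip the flag off again

    val code = markNew(123456789)
    assert(code < 0)                     // flagged values are negative
    assert(position(code) == 123456789)  // the original position is recoverable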

Contributor Author

It's easy to make it support a 2^30 capacity, but supporting 2^31 would require some hacks. In JDK 8 the maximum array size is 2^31 - 1, so we'd need to store the item with hashCode 2^31 - 1 somewhere else. That would require an additional check, which would probably affect performance.
As I remember, in JDK 6 the max array size is either 2^31 - 4 or 2^31 - 5, so JDK 6 support would require some additional work.

I see the following possibilities:

  1. Leave the fix as is
  2. Update it to support capacity 2^30
  3. Make it support 2^31 with some hacks
  4. Make it support even larger capacity by splitting value storage into several arrays.

IMO, the second option is the most reasonable, since a 1B max capacity is definitely better than 500M. :)
On the other hand, options 3 & 4 look like overkill: due to the distributed nature of Spark, it's usually not necessary to collect more than a billion items on a single machine, even when working with multi-billion datasets.

@srowen
Member

srowen commented Jun 12, 2015

Yes, 2^31 is not possible at all. There are caveats to the actual max array size, yes, but this is really an orthogonal issue. I think it's best not to assert about the size here at all, or just to assert about a negative value on overflow. I don't think anything else can or should be done.

The right value of POSITION_MASK is still 0x7FFFFFFF.

@SlavikBaranov
Contributor Author

Sean,

I've updated the request. I've verified that OpenHashMap works fine with a 2^30 capacity. I didn't add a test for it, since it requires the -Xmx16g java flag. Hope that's what you expected.

@SparkQA

SparkQA commented Jun 12, 2015

Test build #34778 has finished for PR 6763 at commit f9284fd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -45,7 +45,7 @@ class OpenHashSet[@specialized(Long, Int) T: ClassTag](
     loadFactor: Double)
   extends Serializable {

-  require(initialCapacity <= (1 << 29), "Can't make capacity bigger than 2^29 elements")
+  require(initialCapacity <= (1 << 30), "Can't make capacity bigger than 2^30 elements")
Member

Ah right, because it chooses the next greater power of 2 as a capacity, so this limit sounds correct.

Still, I'm not sure why the code had 1 << 29 then, and actually the old POSITION_MASK was 0x0EFFFFFF, not 0xEFFFFFFF. It looks like it really should be 0x7FFFFFFF, but maybe we should check with @rxin to make sure I'm not missing something?
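
As a hedged sketch of the rounding behavior mentioned above (the helper below is a stand-in, not necessarily Spark's exact implementation), the requested capacity is rounded up to the next power of two, which is why 2^30 is the largest request that stays representable as an Int:

    // Stand-in for the power-of-two rounding; Spark's own helper may differ in detail.
    def nextPowerOf2(n: Int): Int = {
      require(n > 0)
      val highBit = Integer.highestOneBit(n)
      if (highBit == n) n else highBit << 1
    }

    nextPowerOf2(1 << 30)        // 1073741824 (2^30) -- still a valid positive Int
    nextPowerOf2((1 << 30) + 1)  // would need 2^31, which overflows a signed Int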

@srowen
Member

srowen commented Jun 12, 2015

Oh, hah, you'll have to change this test in OpenHashMapSuite now:

    intercept[IllegalArgumentException] {
      new OpenHashMap[String, Int](1 << 30) // Invalid map size: bigger than 2^29
    }

It runs out of memory now, since the construction succeeds.
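
A hedged sketch of how that boundary test might look once the limit moves to 2^30 (the exact wording in OpenHashMapSuite may differ); requesting one element more than the new limit trips the require before any array is allocated, so it cannot run out of memory:

    intercept[IllegalArgumentException] {
      new OpenHashMap[String, Int]((1 << 30) + 1) // first invalid size after the new 2^30 limit
    }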

@SlavikBaranov
Contributor Author

@srowen Fixed, sorry about that. I wonder how I could have missed it.

@SparkQA

SparkQA commented Jun 12, 2015

Test build #34783 has finished for PR 6763 at commit eaf1e68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jun 12, 2015

LGTM, and potentially an important fix. Let me leave it open for a short while for review.

@JoshRosen
Contributor

/cc @zsxwing

@zsxwing
Member

zsxwing commented Jun 13, 2015

@SlavikBaranov Could you check how BytesToBytesMap.putNewKey grows the capacity? I think you can use a similar approach to increase the max capacity from 0.7 * (1 << 30) to 1 << 30.

@@ -278,7 +279,7 @@ object OpenHashSet {

   val INVALID_POS = -1
   val NONEXISTENCE_MASK = 0x80000000
-  val POSITION_MASK = 0xEFFFFFF
+  val POSITION_MASK = 0x7FFFFFFF
Member

Could you change NONEXISTENCE_MASK to 1 << 31 and POSITION_MASK to (1 << 31) - 1? Just for readability.
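
For reference, the proposed expressions produce exactly the same bit patterns as the hex constants (a quick REPL check; note that (1 << 31) - 1 wraps around to Int.MaxValue):

scala> (1 << 31) == 0x80000000
res0: Boolean = true

scala> ((1 << 31) - 1) == 0x7FFFFFFF
res1: Boolean = true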

Member

As these are masks, I'm not sure that's more readable. A hex string is what I would expect.

Member

@srowen I think it's very easy to miss an F in a hex string, just as happened in this issue.

Member

It's a fair case in point here, yes. We fixed it. Well, I'm not against the alternate expression.

@srowen
Member

srowen commented Jun 14, 2015

@zsxwing I'm not sure that's entirely safe, since the code appears to rely on rehash making more space. If it just does nothing when already at the max size, eventually an add operation will go into an infinite loop.

@SlavikBaranov
Contributor Author

@zsxwing are you talking about changing the condition in rehashIfNeeded to something like this:

if (_size > _growThreshold && _capacity < MAX_CAPACITY) {
    rehash(k, allocateFunc, moveFunc)
}

Well... it's possible (and it's possible to guard against an infinite loop), but as the load increases, both put and lookup time become O(n). In the worst-case scenario, adding the last item takes 2^30 iterations, and a subsequent lookup of that item takes the same number of iterations.

Is my understanding correct, or am I missing something?

@zsxwing
Member

zsxwing commented Jun 15, 2015

I agree with the concern about the worst-case scenario. Maybe the error message should be improved. "Can't make capacity bigger than 2^30 elements" will be confusing if the user finds they only inserted 0.7 * 2^30 items.

@JoshRosen what do you think about the max capacity issue?

@SparkQA

SparkQA commented Jun 15, 2015

Test build #34934 has finished for PR 6763 at commit 4d5b954.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SlavikBaranov
Contributor Author

@zsxwing Is the updated error message ok?

@@ -223,6 +224,8 @@ class OpenHashSet[@specialized(Long, Int) T: ClassTag](
    */
   private def rehash(k: T, allocateFunc: (Int) => Unit, moveFunc: (Int, Int) => Unit) {
     val newCapacity = _capacity * 2
+    require(newCapacity <= OpenHashSet.MAX_CAPACITY,
Member

I agree this is the theoretical maximum number of elements that can be in the set. The failure will occur any time twice the grow threshold exceeds MAX_CAPACITY, which can happen when the collection is less full than this. So I'm actually not sure what's clearer here. Up to you.

I think we still have a little problem here, though: when the capacity reaches 2^30, twice that number becomes negative, and newCapacity <= OpenHashSet.MAX_CAPACITY is still true because of overflow. Check whether the existing capacity is <= OpenHashSet.MAX_CAPACITY / 2 first?
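
For concreteness, a minimal sketch of the ordering suggested here, checking the current capacity before doubling so the multiplication can never wrap around; the message text is illustrative rather than the exact one in the patch:

    // Sketch only: guard before doubling so newCapacity can never overflow.
    require(_capacity <= OpenHashSet.MAX_CAPACITY / 2, "Can't grow past the maximum capacity")
    val newCapacity = _capacity * 2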

Contributor Author

@srowen Integer overflow is not possible with the current MAX_CAPACITY setting, since 2^31 is still a positive number. Anyway, I've added a check for positive capacity. IMO it's a clearer way to guard against overflow.

Member

2^31 is positive, but it is not representable as a 32-bit signed int.

scala> val _capacity = 1 << 30
_capacity: Int = 1073741824

scala> val newCapacity = _capacity * 2
newCapacity: Int = -2147483648

So that's an important check.

Contributor Author

Oh, yes. Sorry :)

@SparkQA

SparkQA commented Jun 16, 2015

Test build #34978 has finished for PR 6763 at commit 8557445.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jun 17, 2015
The problem occurs because the position mask `0xEFFFFFF` is incorrect. Its 25th bit is zero, so when the capacity grows beyond 2^24, `OpenHashMap` calculates an incorrect index into the `_values` array.

I've also added a size check in `rehash()`, so that it fails instead of reporting invalid item indices.

Author: Vyacheslav Baranov <[email protected]>

Closes #6763 from SlavikBaranov/SPARK-8309 and squashes the following commits:

8557445 [Vyacheslav Baranov] Resolved review comments
4d5b954 [Vyacheslav Baranov] Resolved review comments
eaf1e68 [Vyacheslav Baranov] Fixed failing test
f9284fd [Vyacheslav Baranov] Resolved review comments
3920656 [Vyacheslav Baranov] SPARK-8309: Support for more than 12M items in OpenHashMap

(cherry picked from commit c13da20)
Signed-off-by: Sean Owen <[email protected]>
@asfgit asfgit closed this in c13da20 Jun 17, 2015
asfgit pushed a commit that referenced this pull request Jun 17, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
asfgit pushed a commit that referenced this pull request Sep 28, 2018
…HashSetSuite and make it against OpenHashSet

## What changes were proposed in this pull request?

The test in OpenHashMapSuite that exercises a large number of items is somewhat flaky and can throw an OOM.
Considering the original work in #6763 that added this test, the test can be run against OpenHashSetSuite instead. Doing so should also save memory, because OpenHashMap allocates two more arrays when growing the map/set.

## How was this patch tested?

Existing tests.

Closes #22569 from viirya/SPARK-25542.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit b7d8034)
Signed-off-by: Dongjoon Hyun <[email protected]>
asfgit pushed a commit that referenced this pull request Sep 28, 2018
…HashSetSuite and make it against OpenHashSet

daspalrahul pushed a commit to daspalrahul/spark that referenced this pull request Sep 29, 2018
…HashSetSuite and make it against OpenHashSet

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…HashSetSuite and make it against OpenHashSet
