[SPARK-8309] [CORE] Support for more than 12M items in OpenHashMap #6763
Conversation
Jenkins, this is ok to test.

Test build #34701 timed out for PR 6763 at commit …

I can't figure out how this failure could be related to my fix. The test I've added takes only a few seconds to complete: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/34701/testReport/org.apache.spark.util.collection/OpenHashMapSuite/ Could it be just a random timeout?
@@ -278,7 +279,7 @@ object OpenHashSet {

   val INVALID_POS = -1
   val NONEXISTENCE_MASK = 0x80000000
-  val POSITION_MASK = 0xEFFFFFF
+  val POSITION_MASK = 0x1FFFFFFF
I think you're right that this is a subtle but important bug, but it looks like the intent is to use all but the top bit. That's `0x7FFFFFFF`, not `0x1FFFFFFF`. Therefore the max position and size is 2^31-1, not 2^29, and that's already the max value of an int, so I don't think the check is needed. Well, you could check for a negative value. Basically it's reusing the sign bit, which would never otherwise be used since position and size must be positive.
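To make the masking point concrete, here is a small illustrative snippet (a sketch of the bit-packing idea, not the actual OpenHashSet code): the old constant silently drops bit 24, while an all-but-sign-bit mask preserves the position and leaves the sign bit free for the nonexistence flag.

```scala
// Illustrative sketch only, not OpenHashSet's code.
object MaskDemo extends App {
  val oldMask  = 0xEFFFFFF    // really 0x0EFFFFFF: bit 24 is zero
  val fullMask = 0x7FFFFFFF   // every bit except the sign bit
  val flagBit  = 0x80000000   // sign bit reserved as the "nonexistent" flag

  val pos = 1 << 24           // a slot index just past 16M entries
  println((pos & oldMask).toHexString)   // "0"       -> position lost
  println((pos & fullMask).toHexString)  // "1000000" -> position preserved

  // Packing the flag with a position and recovering the position:
  val packed = flagBit | pos
  println(packed & fullMask)             // 16777216 == pos
}
```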
It's easy to make it support 2^30 capacity, but supporting 2^31 would require some hacks. In JDK 8 the maximum array size is 2^31 - 1, so we'd need to store the item with hashCode 2^31 - 1 somewhere else. That would require an additional check that would probably affect performance.
As I remember, in JDK 6 the max array size is either 2^31 - 4 or 2^31 - 5, so JDK 6 support would require some additional work.
I see the following possibilities:
- Leave the fix as is
- Update it to support capacity 2^30
- Make it support 2^31 with some hacks
- Make it support even larger capacity by splitting value storage into several arrays.
IMO, the second option is the most reasonable, since 1B max capacity is definitely better than 500M. :)
On the other hand, options 3 & 4 look like overkill: due to the distributed nature of Spark, it's usually not necessary to collect more than a billion items on a single machine, even when working with multi-billion-item datasets.
Yes, 2^31 is not possible at all. There are caveats to the actual max array size, yes, but this is really an orthogonal issue. I think it's best to not assert about the size here at all, or just assert about a negative value on overflow. I don't think anything else can or should be done. The right value of …
Sean, I've updated the request. I've verified that the OpenHashMap works fine with 2^30 capacity. I didn't make a test for it, since it requires …
Test build #34778 has finished for PR 6763 at commit …
@@ -45,7 +45,7 @@ class OpenHashSet[@specialized(Long, Int) T: ClassTag](
     loadFactor: Double)
   extends Serializable {

-  require(initialCapacity <= (1 << 29), "Can't make capacity bigger than 2^29 elements")
+  require(initialCapacity <= (1 << 30), "Can't make capacity bigger than 2^30 elements")
Ah right, because it chooses the next greater power of 2 as the capacity, so this limit sounds correct.
Still, I'm not sure why the code had `1 << 29` then, and actually the old `POSITION_MASK` was `0x0EFFFFFF`, not `0xEFFFFFFF`. It looks like it really should be `0x7FFFFFFF`, but maybe we should check with @rxin to make sure I'm not missing something?
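For context on the "next greater power of 2" behavior mentioned above, here is a rough sketch of such a capacity computation (the helper name and exact rounding rule are assumptions for illustration, not a quote of OpenHashSet):

```scala
// Hypothetical sketch of rounding a requested capacity up to a power of two.
def nextPowerOf2(n: Int): Int = {
  val highBit = Integer.highestOneBit(n)
  if (highBit == n) n else highBit << 1
}

// A request for 12M items becomes a 16M-slot table (2^24); any request up to
// (1 << 30) rounds to at most 2^30 slots, so a (1 << 30) limit on the
// requested capacity keeps the rounded capacity within bounds.
println(nextPowerOf2(12000000))   // 16777216
println(nextPowerOf2(1 << 30))    // 1073741824
```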
Oh, hah, you'll have to change this test in … It runs out of memory now since it succeeds.
@srowen Fixed, sorry about that. I wonder how I could miss it.
Test build #34783 has finished for PR 6763 at commit …
LGTM and an important fix, potentially. Let me leave it for a short while for review.

/cc @zsxwing

@SlavikBaranov Could you check how …
@@ -278,7 +279,7 @@ object OpenHashSet {

   val INVALID_POS = -1
   val NONEXISTENCE_MASK = 0x80000000
-  val POSITION_MASK = 0xEFFFFFF
+  val POSITION_MASK = 0x7FFFFFFF
Could you change `NONEXISTENCE_MASK` to `1 << 31` and `POSITION_MASK` to `(1 << 31) - 1`? Just for readability.
As these are masks, I'm not sure that's more readable. A hex string is what I would expect.
@srowen I think it's very easy to miss an `F` in a hex string, just like in this issue.
It's a fair case in point here, yes. We fixed it. Well, I'm not against the alternate expression.
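For what it's worth, both spellings denote the same 32-bit patterns (Int arithmetic wraps), so the choice really is only about readability; a quick illustrative check:

```scala
// Sanity check: shift expressions vs. hex literals for the two masks.
assert(0x80000000 == (1 << 31))        // sign bit only
assert(0x7FFFFFFF == (1 << 31) - 1)    // all bits except the sign bit
```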
@zsxwing I'm not sure that's entirely safe, since the code appears to rely on rehash making more space. If it just does nothing when already at the max size, eventually an add operation will go into an infinite loop.
@zsxwing are you talking about changing the condition in …?
Well… it's possible (and it's possible to guard against an infinite loop), but as the load increases, both put and lookup time become O(n). I mean, in the worst-case scenario, adding the last item takes 2^30 iterations, and a subsequent lookup of that item takes the same number of iterations. Is my understanding correct, or am I missing something?
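To illustrate the concern in this exchange: an open-addressing insert probes until it finds a free slot, so a table that can no longer grow either loops forever or degrades toward O(n) probes as it fills up. A generic linear-probing sketch (illustrative only; OpenHashSet's actual probing scheme may differ):

```scala
// Generic open-addressing insert, illustrative only.
// If rehash() can no longer grow the table, a completely full table means the
// loop below never finds a free slot -- hence the need to fail loudly instead.
def insert(keys: Array[Int], used: Array[Boolean], k: Int): Int = {
  val mask = keys.length - 1               // capacity is a power of two
  var pos = k.hashCode & mask
  var probes = 0
  while (used(pos) && keys(pos) != k) {    // worst case: ~capacity probes
    pos = (pos + 1) & mask
    probes += 1
    if (probes > keys.length) throw new IllegalStateException("hash table is full")
  }
  keys(pos) = k
  used(pos) = true
  pos
}
```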
I agree with the concern about the worst-case scenario. Maybe the error message should be improved. @JoshRosen, what do you think about the max capacity issue?
Test build #34934 has finished for PR 6763 at commit …
@zsxwing Is the updated error message OK?
@@ -223,6 +224,8 @@ class OpenHashSet[@specialized(Long, Int) T: ClassTag](
    */
   private def rehash(k: T, allocateFunc: (Int) => Unit, moveFunc: (Int, Int) => Unit) {
     val newCapacity = _capacity * 2
+    require(newCapacity <= OpenHashSet.MAX_CAPACITY,
I agree this is the theoretically largest number of elements that can be in the set. The failure will occur any time that twice the grow threshold exceeds `MAX_CAPACITY`, which can happen when the collection is less full than this. So I am actually not sure what's clearer here. Up to you.
I think we still have a little problem here, because when capacity reaches 2^30, twice that number becomes negative and `newCapacity <= OpenHashSet.MAX_CAPACITY` is still true because of overflow. Check whether the existing capacity is `<= OpenHashSet.MAX_CAPACITY / 2` first?
@srowen Integer overflow is not possible with the current `MAX_CAPACITY` setting, since 2^31 is still a positive number. Anyway, I've added a check for positive capacity. IMO it's a clearer way to guard against overflow.
2^31 is positive, but it is not representable as a 32-bit signed int.
scala> val _capacity = 1 << 30
_capacity: Int = 1073741824
scala> val newCapacity = _capacity * 2
newCapacity: Int = -2147483648
So that's an important check.
Oh, yes. Sorry :)
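Putting the two observations together, the rehash-time guard ends up as a combined positivity-and-bound check; a sketch of the idea (paraphrasing the discussion, not quoting the final patch verbatim):

```scala
// Sketch: the positivity check catches 2^30 * 2 overflowing to Int.MinValue.
val MAX_CAPACITY = 1 << 30

def checkedDouble(capacity: Int): Int = {
  val newCapacity = capacity * 2
  require(newCapacity > 0 && newCapacity <= MAX_CAPACITY,
    s"Can't grow the hash table past $MAX_CAPACITY slots")
  newCapacity
}

// checkedDouble(1 << 29) == 1 << 30, while checkedDouble(1 << 30) fails
// instead of silently wrapping to a negative capacity.
```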
Test build #34978 has finished for PR 6763 at commit …
The problem occurs because the position mask `0xEFFFFFF` is incorrect. It has a zero 25th bit, so when the capacity grows beyond 2^24, `OpenHashMap` calculates an incorrect index of the value in the `_values` array. I've also added a size check in `rehash()`, so that it fails instead of reporting invalid item indices.

Author: Vyacheslav Baranov <[email protected]>

Closes #6763 from SlavikBaranov/SPARK-8309 and squashes the following commits:

8557445 [Vyacheslav Baranov] Resolved review comments
4d5b954 [Vyacheslav Baranov] Resolved review comments
eaf1e68 [Vyacheslav Baranov] Fixed failing test
f9284fd [Vyacheslav Baranov] Resolved review comments
3920656 [Vyacheslav Baranov] SPARK-8309: Support for more than 12M items in OpenHashMap

(cherry picked from commit c13da20)
Signed-off-by: Sean Owen <[email protected]>
…HashSetSuite and make it against OpenHashSet

## What changes were proposed in this pull request?

The specified test in OpenHashMapSuite for large numbers of items is somewhat flaky and can throw OOM. Considering the original work #6763 that added this test, the test can be moved to OpenHashSetSuite and run against OpenHashSet instead. Doing so should also save memory, because OpenHashMap allocates two more arrays when growing the map/set.

## How was this patch tested?

Existing tests.

Closes #22569 from viirya/SPARK-25542.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit b7d8034)
Signed-off-by: Dongjoon Hyun <[email protected]>
The problem occurs because the position mask `0xEFFFFFF` is incorrect. It has a zero 25th bit, so when the capacity grows beyond 2^24, `OpenHashMap` calculates an incorrect index of the value in the `_values` array.

I've also added a size check in `rehash()`, so that it fails instead of reporting invalid item indices.
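As a closing illustration, the large-capacity regression test discussed in this thread boils down to inserting enough items that the table grows past 2^24 slots and verifying that everything can still be found. Below is a rough sketch against OpenHashSet (the collection the later SPARK-25542 change tests against); the counts and wiring are assumptions rather than the exact suite code, and the class is `private[spark]`, so this only compiles inside the Spark source tree:

```scala
import org.apache.spark.util.collection.OpenHashSet

// Rough sketch of a "more than 12M items" check, not the exact suite code.
// With the default 0.7 load factor, ~12M entries push the table past 2^24
// slots, the region where the old 0xEFFFFFF mask corrupted positions.
val cnt = 12000000
val set = new OpenHashSet[Int](cnt)
var i = 0
while (i < cnt) {
  set.add(i)
  i += 1
}
assert(set.size == cnt)
assert((0 until cnt).forall(set.contains))
```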