-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-35521: [C++] Hash null bitmap only if null count is 0 #35522
GH-35521: [C++] Hash null bitmap only if null count is 0 #35522
Conversation
|
2c167e6
to
1fbaace
Compare
cpp/src/arrow/scalar.cc
Outdated
@@ -153,9 +153,10 @@ struct ScalarHashImpl { | |||
|
|||
Status ArrayHash(const ArrayData& a) { | |||
RETURN_NOT_OK(StdHash(a.length) & StdHash(a.GetNullCount())); | |||
if (a.buffers[0] != nullptr) { | |||
if (a.GetNullCount() != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, the null count should get cached from calling GetNullCount()
above, so this line does not have performance degradations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, however, a NullArray can have a non-zero null count but without a null bitmap. So you must make sure to check the buffers as well.
cpp/src/arrow/scalar_test.cc
Outdated
auto data = empty_bitmap_scalar.value->data()->buffers[1]; | ||
std::vector<uint8_t> bitmap_data = {0,0,0}; | ||
auto null_bitmap = std::make_shared<Buffer>(bitmap_data.data(), 3); | ||
|
||
std::shared_ptr<Int16Array> arr(new Int16Array(3, data, null_bitmap, 0)); | ||
ASSERT_TRUE(arr->null_count() == 0); | ||
// this line fails - I don't know how to create an array with a null bitmap | ||
// that is all 0s. | ||
ASSERT_TRUE(arr->data()->buffers[0] != nullptr); | ||
ScalarType set_bitmap_scalar(arr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said in the description, I don't know how to make an array with a 0s null bitmap. I ran into this bug when using the filter()
function on a LargeListArray
of strings. The scalars in the array would have the null bitmap set even if there were no nulls, changing the hash values. I assume using that exact method is too specific to be used here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it seems we explicitly don't keep the bitmap around if there are no nulls in the Array construction:
arrow/cpp/src/arrow/array/data.cc
Lines 53 to 55 in ec29c6f
if (*null_count == 0) { | |
// In case there are no nulls, don't keep an allocated null bitmap around | |
(*buffers)[0] = nullptr; |
One way around this would be to create a ListArray, and then get one element of it as a scalar (like my code snippet in python)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, just rewrite the Python snippet (or some equivalent) in C++. It should actually be easy, using ArrayFromJSON
and Array::GetScalar
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM @wjones127 @pitrou @westonpace Mind take a look?
Would you mind take a look at the failed CI tests:
? |
@micah-white I'm not sure what you mean here. What did the data look like, precisely? |
@pitrou small example using python (you don't actually need to filter or slice yourself, since a scalar is already a slice into its parent array):
Those two scalars are equal, so should have the same hash. But the one has a validity bitmap, and the other not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this!
cpp/src/arrow/scalar_test.cc
Outdated
auto data = empty_bitmap_scalar.value->data()->buffers[1]; | ||
std::vector<uint8_t> bitmap_data = {0,0,0}; | ||
auto null_bitmap = std::make_shared<Buffer>(bitmap_data.data(), 3); | ||
|
||
std::shared_ptr<Int16Array> arr(new Int16Array(3, data, null_bitmap, 0)); | ||
ASSERT_TRUE(arr->null_count() == 0); | ||
// this line fails - I don't know how to create an array with a null bitmap | ||
// that is all 0s. | ||
ASSERT_TRUE(arr->data()->buffers[0] != nullptr); | ||
ScalarType set_bitmap_scalar(arr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it seems we explicitly don't keep the bitmap around if there are no nulls in the Array construction:
arrow/cpp/src/arrow/array/data.cc
Lines 53 to 55 in ec29c6f
if (*null_count == 0) { | |
// In case there are no nulls, don't keep an allocated null bitmap around | |
(*buffers)[0] = nullptr; |
One way around this would be to create a ListArray, and then get one element of it as a scalar (like my code snippet in python)
@jorisvandenbossche @pitrou Can you take another look? I fixed the case of the null array and fixed the test, which is passing locally. I don't see an official way to request review, so sorry if this isn't the standard. |
dd96fef
to
e1e1b43
Compare
@@ -153,9 +153,10 @@ struct ScalarHashImpl { | |||
|
|||
Status ArrayHash(const ArrayData& a) { | |||
RETURN_NOT_OK(StdHash(a.length) & StdHash(a.GetNullCount())); | |||
if (a.buffers[0] != nullptr) { | |||
if (a.GetNullCount() != 0 && a.buffers[0] != nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, it's also possible that a.buffers[0] == nullptr
if all the elements are valid. Is it possible that we still get differing hashes in this case?
- All elements valid and equal and validity bitmap present (would hash the validity bitmap)
- All elements valid and equal and validity bitmap missing (would not hash the validity bitmap)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's the bug that I am trying to fix. Since both cases have the same semantic value, their hashes should be the same.
e1e1b43
to
d55a593
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
It might be nice to test the case of a NullScalar as well (for which you had to change some code), but the test coverage for hashing is quite low to start with ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a few cosmestic changes. Thank you @micah-white !
Thanks everyone! Much easier to contribute than I expected. Hope to do it again sometime. |
Benchmark runs are scheduled for baseline = 401ae19 and contender = e7a885d. e7a885d is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
['Python', 'R'] benchmarks have high level of regressions. |
…e#35522) ### Are these changes tested? I am new to contributing and am having a hard time creating a good test for this. The steps to reproduce this bug originally are too complicated for a simple test. I included my attempt at making a good test in the PR, but some help would be nice. --> ### Are there any user-facing changes? No. **This PR contains a "Critical Fix".** * Closes: apache#35521 Lead-authored-by: micah-white <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…e#35522) ### Are these changes tested? I am new to contributing and am having a hard time creating a good test for this. The steps to reproduce this bug originally are too complicated for a simple test. I included my attempt at making a good test in the PR, but some help would be nice. --> ### Are there any user-facing changes? No. **This PR contains a "Critical Fix".** * Closes: apache#35521 Lead-authored-by: micah-white <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Are these changes tested?
I am new to contributing and am having a hard time creating a good test for this. The steps to reproduce this bug originally are too complicated for a simple test. I included my attempt at making a good test in the PR, but some help would be nice.
-->
Are there any user-facing changes?
No.
This PR contains a "Critical Fix".