GH-35521: [C++] Hash null bitmap only if null count is 0 #35522

Merged on May 11, 2023 (4 commits)

@micah-white micah-white commented May 9, 2023

Are these changes tested?

I am new to contributing and am having a hard time creating a good test for this. The original steps to reproduce this bug are too complicated for a simple test. I included my attempt at a good test in the PR, but some help would be nice.

Are there any user-facing changes?

No.

This PR contains a "Critical Fix".

github-actions bot commented May 9, 2023

⚠️ GitHub issue #35521 has been automatically assigned in GitHub to PR creator.

@micah-white micah-white force-pushed the null-bitmap-array-scalar-hash branch from 2c167e6 to 1fbaace May 9, 2023 22:26
@@ -153,9 +153,10 @@ struct ScalarHashImpl {

   Status ArrayHash(const ArrayData& a) {
     RETURN_NOT_OK(StdHash(a.length) & StdHash(a.GetNullCount()));
-    if (a.buffers[0] != nullptr) {
+    if (a.GetNullCount() != 0) {
Contributor Author

IIUC, the null count should already be cached by the call to GetNullCount() above, so this line should not degrade performance.

Member

Indeed. However, a NullArray can have a non-zero null count without having a null bitmap, so you must check the buffers as well.
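
To make the reviewer's point concrete, here is a toy Python model of the guard. It is only a sketch: `bitmap_to_hash` is a hypothetical helper, not an Arrow API, and it assumes that a null-typed array reports every slot as null while carrying no validity buffer at all. Guarding on the null count alone would then reach for a buffer that does not exist.

```python
from typing import Optional


def bitmap_to_hash(null_count: int, bitmap: Optional[bytes]) -> Optional[bytes]:
    """Return the validity bitmap bytes that should feed the hash, or None.

    Toy model of the guard under review (hypothetical helper): the bitmap
    is used only when there are nulls AND a bitmap buffer is present.
    """
    if null_count != 0 and bitmap is not None:
        return bitmap
    return None


# A null-typed array of length 3: every slot is null, yet no bitmap exists.
null_array_case = bitmap_to_hash(null_count=3, bitmap=None)

# An ordinary array with one null among three slots. Bit i set means
# slot i is valid, so 0b011 marks slots 0 and 1 valid and slot 2 null.
nullable_case = bitmap_to_hash(null_count=1, bitmap=bytes([0b00000011]))

# No nulls, but a bitmap is still attached (e.g. a slice of a larger array).
no_nulls_case = bitmap_to_hash(null_count=0, bitmap=bytes([0b00000111]))
```

Only the second case contributes a bitmap to the hash; the first is skipped because there is nothing to read, and the third because the bitmap carries no information.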

Comment on lines 1114 to 1123
auto data = empty_bitmap_scalar.value->data()->buffers[1];
std::vector<uint8_t> bitmap_data = {0,0,0};
auto null_bitmap = std::make_shared<Buffer>(bitmap_data.data(), 3);

std::shared_ptr<Int16Array> arr(new Int16Array(3, data, null_bitmap, 0));
ASSERT_TRUE(arr->null_count() == 0);
// this line fails - I don't know how to create an array with a null bitmap
// that is all 0s.
ASSERT_TRUE(arr->data()->buffers[0] != nullptr);
ScalarType set_bitmap_scalar(arr);
Contributor Author

@micah-white micah-white May 9, 2023

As I said in the description, I don't know how to make an array with a 0s null bitmap. I ran into this bug when using the filter() function on a LargeListArray of strings. The scalars in the array would have the null bitmap set even if there were no nulls, changing the hash values. I assume using that exact method is too specific to be used here.

Member

Yeah, it seems we explicitly don't keep the bitmap around if there are no nulls in the Array construction:

if (*null_count == 0) {
  // In case there are no nulls, don't keep an allocated null bitmap around
  (*buffers)[0] = nullptr;

One way around this would be to create a ListArray and then get one element of it as a scalar (like my code snippet in Python).

Member

Yes, just rewrite the Python snippet (or some equivalent) in C++. It should actually be easy, using ArrayFromJSON and Array::GetScalar.
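
Why direct construction loses the bitmap while a list element keeps it can be illustrated with a toy model. This is a sketch under stated assumptions: `ToyArray` is a hypothetical class, not an Arrow type; it assumes that the constructor drops an all-valid bitmap (as in the quoted snippet) and that a slice shares its parent's buffers and only moves the offset/length window.

```python
from typing import List, Optional


class ToyArray:
    """Hypothetical miniature of an Arrow array (not an Arrow API)."""

    def __init__(self, values: List[int], validity: Optional[List[bool]]):
        # Mirrors the quoted constructor logic: with no nulls, the
        # bitmap is dropped instead of being kept around.
        if validity is not None and all(validity):
            validity = None
        self.values = values
        self.validity = validity
        self.offset = 0
        self.length = len(values)

    def slice(self, offset: int, length: int) -> "ToyArray":
        # A slice shares the parent's buffers and only moves the window,
        # so the parent's bitmap survives even if the window has no nulls.
        view = ToyArray.__new__(ToyArray)
        view.values = self.values
        view.validity = self.validity
        view.offset = self.offset + offset
        view.length = length
        return view

    def null_count(self) -> int:
        if self.validity is None:
            return 0
        window = self.validity[self.offset:self.offset + self.length]
        return window.count(False)


# Built directly with an all-valid bitmap: the bitmap is discarded.
direct = ToyArray([0, 1], [True, True])

# Child values of a list array [[0, 1], [2, None]]; the first list
# element is the window [0, 2) into that child. The value stored in the
# null slot is arbitrary (0 here).
child = ToyArray([0, 1, 2, 0], [True, True, True, False])
first_element = child.slice(0, 2)
```

In this model `direct` has no bitmap, while `first_element` has a null count of 0 but still carries its parent's bitmap, which is the situation the test needs to construct.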

Member

@mapleFU mapleFU left a comment

This LGTM. @wjones127 @pitrou @westonpace, would you mind taking a look?

@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label May 10, 2023
Member

mapleFU commented May 10, 2023

Would you mind taking a look at the failed CI tests:

[ RUN      ] TestListScalar/0.TestHashing
C:/projects/arrow/cpp/src/arrow/scalar_test.cc(1122): error: Value of: arr->data()->buffers[0] != nullptr
  Actual: false
Expected: true
[  FAILED  ] TestListScalar/0.TestHashing, where TypeParam = class arrow::ListType (0 ms)
[----------] 3 tests from TestListScalar/0 (0 ms total)
[----------] 3 tests from TestListScalar/1, where TypeParam = class arrow::LargeListType
[ RUN      ] TestListScalar/1.Basics
[       OK ] TestListScalar/1.Basics (0 ms)
[ RUN      ] TestListScalar/1.ValidateErrors
[       OK ] TestListScalar/1.ValidateErrors (0 ms)
[ RUN      ] TestListScalar/1.TestHashing
C:/projects/arrow/cpp/src/arrow/scalar_test.cc(1122): error: Value of: arr->data()->buffers[0] != nullptr
  Actual: false
Expected: true
[  FAILED  ] TestListScalar/1.TestHashing, where TypeParam = class arrow::LargeListType (0 ms)
[----------] 3 tests from TestListScalar/1 (0 ms total)
[----------] 3 tests from TestListScalar/2, where TypeParam = class arrow::FixedSizeListType
[ RUN      ] TestListScalar/2.Basics
[       OK ] TestListScalar/2.Basics (0 ms)
[ RUN      ] TestListScalar/2.ValidateErrors
[       OK ] TestListScalar/2.ValidateErrors (0 ms)
[ RUN      ] TestListScalar/2.TestHashing
C:/projects/arrow/cpp/src/arrow/scalar_test.cc(1122): error: Value of: arr->data()->buffers[0] != nullptr
  Actual: false
Expected: true
[  FAILED  ] TestListScalar/2.TestHashing, where TypeParam = class arrow::FixedSizeListType (0 ms)

?

Member

pitrou commented May 10, 2023

> I ran into this bug when using the filter() function on a LargeListArray of strings. The scalars in the array would have the null bitmap set even if there were no nulls, changing the hash values.

@micah-white I'm not sure what you mean here. What did the data look like, precisely?

@jorisvandenbossche
Member

@pitrou Here is a small example using Python (you don't actually need to filter or slice yourself, since a scalar is already a slice into its parent array):

In [31]: arr1 = pa.array([[0, 1], [2, 3]])

In [32]: scalar1 = arr1[0]

In [33]: arr2 = pa.array([[0, 1], [2, None]])

In [34]: scalar2 = arr2[0]

In [35]: scalar1
Out[35]: <pyarrow.ListScalar: [0, 1]>

In [36]: scalar2
Out[36]: <pyarrow.ListScalar: [0, 1]>

In [37]: hash(scalar1)
Out[37]: 6972737373264176731

In [38]: hash(scalar2)
Out[38]: 5286180417804377197

In [39]: scalar1.values.buffers()
Out[39]: 
[None,
 <pyarrow.Buffer address=0x7fbc56a08200 size=32 is_cpu=True is_mutable=True>]

In [40]: scalar2.values.buffers()
Out[40]: 
[<pyarrow.Buffer address=0x7fbc56a08180 size=1 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x7fbc56a08280 size=32 is_cpu=True is_mutable=True>]

Those two scalars are equal, so they should have the same hash. But one has a validity bitmap and the other does not.
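
The mismatch can also be reproduced without pyarrow in a small toy model. This is only a sketch: `hashed_fields` is a hypothetical function that mimics the shape of the `ArrayHash` logic discussed in this PR (length, null count, then optionally the bitmap), and plain Python tuples stand in for Arrow buffers.

```python
from typing import Optional, Tuple


def hashed_fields(values: Tuple[int, ...],
                  validity: Optional[Tuple[bool, ...]],
                  only_if_nulls: bool) -> tuple:
    """Return the fields a toy ArrayHash would feed into the hash."""
    null_count = 0 if validity is None else validity.count(False)
    fields = [len(values), null_count, values]
    has_bitmap = validity is not None
    if only_if_nulls:
        # Behaviour after this PR: the bitmap counts only when nulls exist.
        include_bitmap = null_count != 0 and has_bitmap
    else:
        # Behaviour before this PR: the bitmap counts whenever it is present.
        include_bitmap = has_bitmap
    if include_bitmap:
        fields.append(validity)
    return tuple(fields)


# Same logical value [0, 1], once without and once with a validity bitmap.
no_bitmap = ((0, 1), None)
all_valid_bitmap = ((0, 1), (True, True))

old_a = hashed_fields(*no_bitmap, only_if_nulls=False)
old_b = hashed_fields(*all_valid_bitmap, only_if_nulls=False)
new_a = hashed_fields(*no_bitmap, only_if_nulls=True)
new_b = hashed_fields(*all_valid_bitmap, only_if_nulls=True)
```

With the old rule the two representations feed different fields into the hash (`old_a != old_b`); with the new rule the inputs are identical, so `hash(new_a) == hash(new_b)`.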

Member

@jorisvandenbossche jorisvandenbossche left a comment

Thanks for looking into this!


@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting committer review" label May 10, 2023
@micah-white
Contributor Author

@jorisvandenbossche @pitrou Can you take another look? I fixed the case of the null array and fixed the test, which is passing locally. I don't see an official way to request review, so sorry if this isn't the standard.

@micah-white micah-white force-pushed the null-bitmap-array-scalar-hash branch from dd96fef to e1e1b43 May 10, 2023 16:43
@github-actions github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label May 10, 2023
@@ -153,9 +153,10 @@ struct ScalarHashImpl {

   Status ArrayHash(const ArrayData& a) {
     RETURN_NOT_OK(StdHash(a.length) & StdHash(a.GetNullCount()));
-    if (a.buffers[0] != nullptr) {
+    if (a.GetNullCount() != 0 && a.buffers[0] != nullptr) {
Member

IIRC, it's also possible that a.buffers[0] == nullptr if all the elements are valid. Is it possible that we still get differing hashes in this case?

  • All elements valid and equal and validity bitmap present (would hash the validity bitmap)
  • All elements valid and equal and validity bitmap missing (would not hash the validity bitmap)

Contributor Author

@micah-white micah-white May 10, 2023

Yes, that's the bug that I am trying to fix. Since both cases have the same semantic value, their hashes should be the same.

@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting change review" label May 10, 2023
@micah-white micah-white requested a review from westonpace May 11, 2023 11:19
@micah-white micah-white force-pushed the null-bitmap-array-scalar-hash branch from e1e1b43 to d55a593 May 11, 2023 11:24
@github-actions github-actions bot removed the "awaiting changes" label May 11, 2023
@github-actions github-actions bot added the "awaiting change review" label May 11, 2023
Member

@jorisvandenbossche jorisvandenbossche left a comment

Looks good to me

It might be nice to test the case of a NullScalar as well (for which you had to change some code), but the test coverage for hashing is quite low to start with.

@github-actions github-actions bot added the "awaiting merge" label and removed the "awaiting change review" label May 11, 2023
Member

@pitrou pitrou left a comment

I made a few cosmetic changes. Thank you @micah-white!

@pitrou pitrou merged commit e7a885d into apache:main May 11, 2023
@micah-white micah-white deleted the null-bitmap-array-scalar-hash branch May 12, 2023 04:53
@micah-white
Contributor Author

Thanks everyone! Much easier to contribute than I expected. Hope to do it again sometime.

ursabot commented May 13, 2023

Benchmark runs are scheduled for baseline = 401ae19 and contender = e7a885d. e7a885d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️1.77% ⬆️0.0%] test-mac-arm
[Finished ⬇️1.52% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️2.32% ⬆️0.3%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] e7a885d8 ec2-t3-xlarge-us-east-2
[Finished] e7a885d8 test-mac-arm
[Finished] e7a885d8 ursa-i9-9960x
[Finished] e7a885d8 ursa-thinkcentre-m75q
[Finished] 401ae190 ec2-t3-xlarge-us-east-2
[Finished] 401ae190 test-mac-arm
[Finished] 401ae190 ursa-i9-9960x
[Finished] 401ae190 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot commented May 13, 2023

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
…e#35522)

### Are these changes tested?

I am new to contributing and am having a hard time creating a good test for this. The steps to reproduce this bug originally are too complicated for a simple test. I included my attempt at making a good test in the PR, but some help would be nice.

### Are there any user-facing changes?
No.

**This PR contains a "Critical Fix".**
* Closes: apache#35521

Lead-authored-by: micah-white <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
rtpsw pushed a commit to rtpsw/arrow that referenced this pull request May 16, 2023
Linked issue (#35521): [C++] Hashing array scalar with null bitmap and non-null 0s bitmap produces different hashes.
6 participants