Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-5881: [Java] Provide functionalities to efficiently determine if a validity buffer has completely 1 bits/0 bits #4829

Closed
wants to merge 4 commits into from

Conversation

liyafan82
Copy link
Contributor

These utilities can be used to efficiently determine, for example,

If all values in a vector are null
If a vector contains no null
If a vector contains any valid element
If a vector contains any invalid element

…f a validity buffer has completely 1 bits/0 bits
@emkornfield
Copy link
Contributor

@liyafan82 how are you imagining these would be used in comparison with the existing pop-count?

@liyafan82
Copy link
Contributor Author

@liyafan82 how are you imagining these would be used in comparison with the existing pop-count?

Good question. @emkornfield
These should provide better performance, because they provide short-circuit behaviors.
For some scenarios, the user only cares about if there is any 0/1 bit in the buffer. So if one is found, the methods return immediately.

Fundamentally, the count contains much more information than existence/non-existence.

@codecov-io
Copy link

Codecov Report

Merging #4829 into master will increase coverage by 2.16%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #4829      +/-   ##
=========================================
+ Coverage   87.44%   89.6%   +2.16%     
=========================================
  Files         997     662     -335     
  Lines      139804   96427   -43377     
  Branches     1418       0    -1418     
=========================================
- Hits       122246   86401   -35845     
+ Misses      17196   10026    -7170     
+ Partials      362       0     -362
Impacted Files Coverage Δ
cpp/src/arrow/json/converter.cc 90.53% <0%> (-1.78%) ⬇️
cpp/src/arrow/json/chunked-builder.cc 79.91% <0%> (-1.68%) ⬇️
cpp/src/arrow/io/readahead.cc 95.91% <0%> (-1.03%) ⬇️
r/src/recordbatch.cpp
r/R/Table.R
js/src/util/fn.ts
go/arrow/array/bufferbuilder.go
r/src/symbols.cpp
rust/datafusion/src/execution/projection.rs
rust/datafusion/src/execution/filter.rs
... and 330 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3f767ce...3deedaf. Read the comment docs.

@emkornfield
Copy link
Contributor

For some scenarios, the user only cares about if there is any 0/1 bit in the buffer. So if one is found, the methods return immediately.

What scenarios did you have in mind?

@liyafan82
Copy link
Contributor Author

For some scenarios, the user only cares about if there is any 0/1 bit in the buffer. So if one is found, the methods return immediately.

What scenarios did you have in mind?

In a distributed computational engine:

  1. Before shuffling a batch of vector data, if the vector is full of nulls, there is no need to send it.
  2. When the memory is insufficient, spilling is required (for example, in a Hash Join or Hash Aggregate). If the vector is full of null data, there is no need to spill it and no need to load it later.

In the current code, there is an example in AbstractStructVector#addOrGet: If the vector is full of nulls, it will be cleared and a new one will be created. Code for this example has been revised accordingly in this PR.

* @param valueCount the bit count.
* @return true if all bits are 0, and false otherwise.
*/
public static boolean allBitsNull(final ArrowBuf validityBuffer, final int valueCount) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please add a benchmark for this ? I'm curious if a simple loop that checks each byte performs any better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pravindra thanks for the good suggestion.
I have added a benchmark for the method. The results are as follows:

check by bit (current implementation)
BitVectorHelperBenchmarks.allBitsNullBenchmark avgt 5 1472.140 ± 13.670 ns/op

check by byte
BitVectorHelperBenchmarks.allBitsNullBenchmark avgt 5 139.718 ± 0.268 ns/op

check by long (implementation introduced by this PR)
BitVectorHelperBenchmarks.allBitsNullBenchmark avgt 5 26.445 ± 0.097 ns/op

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @liyafan82


int index = 0;
while (index + 8 <= fullBytesCount) {
long longValue = validityBuffer.getLong(index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen that these do bound checks on every call and too many of these can add up.

In #4847, I did one boundary check at the beginning of the function and bypassed the rest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pravindra thanks for the suggestion.

I agree that it is a good idea to do the boundary check once at the beginning. An alternative is to turn of the flag in org.apache.arrow.memory.BoundsChecking.

I think the problem of which way is better is a fundamental decision. Maybe the two ways are suitable for different scenarios:

  1. checking at the beginning provides both performance and security benefits.
  2. disabling the flag provides better code readability (Vector APIs are better to understand and follow than PlatformDependent APIs)

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer 1 since we have the boundary checks enabled by default (both in arrow and dremio).

For non-trivial code, we can create wrappers over the PlatformDependent code if it helps with readability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pravindra Changed to 1. Thanks for your suggestion.

final int fullBytesCount = remainder == 0 ? sizeInBytes : sizeInBytes - 1;

int index = 0;
while (index + 8 <= fullBytesCount) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we merge this with the code in above method..the only thing to change looks like the comparison value 0 and 1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@praveenbingo Greate suggestion. Thank you.

Copy link
Contributor

@praveenbingo praveenbingo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @liyafan82 nice changes

Copy link
Contributor

@praveenbingo praveenbingo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM. Thanks @liyafan82

@pravindra
Copy link
Contributor

Of the two failures, one is in rust and not related to this change. The other is in flight which I see is failing in the master builds too. so, i'm going ahead with merging this change.

@pravindra pravindra closed this in 5c61263 Jul 12, 2019
@liyafan82
Copy link
Contributor Author

@pravindra and @praveenbingo, thanks a lot for your valuable comments.

kszucs pushed a commit that referenced this pull request Jul 22, 2019
…f a validity buffer has completely 1 bits/0 bits

These utilities can be used to efficiently determine, for example,

If all values in a vector are null
If a vector contains no null
If a vector contains any valid element
If a vector contains any invalid element

Author: liyafan82 <[email protected]>

Closes #4829 from liyafan82/fly_0709_nullbit and squashes the following commits:

1762951 <liyafan82>  Merge methods and change method name
0dc0045 <liyafan82>  Do boundary check once at the beginning
c57cb6d <liyafan82>  Provide benchmark for allBitsNull
3deedaf <liyafan82>  Provide functionalities to efficiently determine if a validity buffer has completely 1 bits/0 bits
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants