-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-5881: [Java] Provide functionalities to efficiently determine if a validity buffer has completely 1 bits/0 bits #4829
Conversation
…f a validity buffer has completely 1 bits/0 bits
@liyafan82 how are you imagining these would be used in comparison with the existing pop-count? |
Good question. @emkornfield Fundamentally, the count contains much more information than existence/non-existence. |
Codecov Report
@@ Coverage Diff @@
## master #4829 +/- ##
=========================================
+ Coverage 87.44% 89.6% +2.16%
=========================================
Files 997 662 -335
Lines 139804 96427 -43377
Branches 1418 0 -1418
=========================================
- Hits 122246 86401 -35845
+ Misses 17196 10026 -7170
+ Partials 362 0 -362
Continue to review full report at Codecov.
|
What scenarios did you have in mind? |
In a distributed computational engine:
In the current code, there is an example in AbstractStructVector#addOrGet: If the vector is full of nulls, it will be cleared and a new one will be created. Code for this example has been revised accordingly in this PR. |
* @param valueCount the bit count. | ||
* @return true if all bits are 0, and false otherwise. | ||
*/ | ||
public static boolean allBitsNull(final ArrowBuf validityBuffer, final int valueCount) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please add a benchmark for this ? I'm curious if a simple loop that checks each byte performs any better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pravindra thanks for the good suggestion.
I have added a benchmark for the method. The results are as follows:
check by bit (current implementation)
BitVectorHelperBenchmarks.allBitsNullBenchmark avgt 5 1472.140 ± 13.670 ns/op
check by byte
BitVectorHelperBenchmarks.allBitsNullBenchmark avgt 5 139.718 ± 0.268 ns/op
check by long (implementation introduced by this PR)
BitVectorHelperBenchmarks.allBitsNullBenchmark avgt 5 26.445 ± 0.097 ns/op
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @liyafan82
|
||
int index = 0; | ||
while (index + 8 <= fullBytesCount) { | ||
long longValue = validityBuffer.getLong(index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen that these do bound checks on every call and too many of these can add up.
In #4847, I did one boundary check at the beginning of the function and bypassed the rest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pravindra thanks for the suggestion.
I agree that it is a good idea to do the boundary check once at the beginning. An alternative is to turn of the flag in org.apache.arrow.memory.BoundsChecking.
I think the problem of which way is better is a fundamental decision. Maybe the two ways are suitable for different scenarios:
- checking at the beginning provides both performance and security benefits.
- disabling the flag provides better code readability (Vector APIs are better to understand and follow than PlatformDependent APIs)
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer 1 since we have the boundary checks enabled by default (both in arrow and dremio).
For non-trivial code, we can create wrappers over the PlatformDependent code if it helps with readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pravindra Changed to 1. Thanks for your suggestion.
final int fullBytesCount = remainder == 0 ? sizeInBytes : sizeInBytes - 1; | ||
|
||
int index = 0; | ||
while (index + 8 <= fullBytesCount) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we merge this with the code in above method..the only thing to change looks like the comparison value 0 and 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@praveenbingo Greate suggestion. Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @liyafan82 nice changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 LGTM. Thanks @liyafan82
Of the two failures, one is in rust and not related to this change. The other is in flight which I see is failing in the master builds too. so, i'm going ahead with merging this change. |
@pravindra and @praveenbingo, thanks a lot for your valuable comments. |
…f a validity buffer has completely 1 bits/0 bits These utilities can be used to efficiently determine, for example, If all values in a vector are null If a vector contains no null If a vector contains any valid element If a vector contains any invalid element Author: liyafan82 <[email protected]> Closes #4829 from liyafan82/fly_0709_nullbit and squashes the following commits: 1762951 <liyafan82> Merge methods and change method name 0dc0045 <liyafan82> Do boundary check once at the beginning c57cb6d <liyafan82> Provide benchmark for allBitsNull 3deedaf <liyafan82> Provide functionalities to efficiently determine if a validity buffer has completely 1 bits/0 bits
These utilities can be used to efficiently determine, for example,
If all values in a vector are null
If a vector contains no null
If a vector contains any valid element
If a vector contains any invalid element