-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-38242: [Java] Fix incorrect internal struct accounting for DenseUnionVector#getBufferSizeFor #38305
Conversation
1d115bd
to
12c9d04
Compare
|
12c9d04
to
bb8d171
Compare
cc @jduo |
@@ -714,8 +714,22 @@ public int getBufferSizeFor(final int count) { | |||
if (count == 0) { | |||
return 0; | |||
} | |||
|
|||
int[] legCounts = new int[typeFields.length]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a "leg" here? This is not common terminology in Arrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could call this typeCounts
or childCounts
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgive the name 😓, internal team term leakage!
Fixed by adapting setChildVectorValueCounts (using counts
)
legCounts[tid]++; | ||
} | ||
|
||
long legBytes = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here: call this totalChildBytes
for example.
for (ValueVector v : internalStruct) { | ||
legBytes += (long)v.getBufferSizeFor(legCounts[tid]); | ||
tid++; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is right: internalStruct
has its fields indexed by physical child id (that is: 0, 1, 2...) but the type ids are logical values that are unrelated to the physical structure.
But I think that you should just copy what is already done in setChildVectorValueCounts
and adapt it for the use case of counting buffer bytes.
return (int) (count * TYPE_WIDTH + (long) count * OFFSET_WIDTH | ||
+ DataSizeRoundingUtil.divideBy8Ceil(count) + internalStruct.getBufferSizeFor(count)); | ||
+ DataSizeRoundingUtil.divideBy8Ceil(count) + legBytes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think DataSizeRoundingUtil.divideBy8Ceil(count)
is a mistake: union vectors don't have a validity bitmap, so they don't need any buffer for it.
import org.apache.arrow.vector.util.DataSizeRoundingUtil; | ||
import org.junit.jupiter.api.Test; | ||
|
||
public class TestDUVBufferSize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure "DUV" is a common abbreviation, can you just spell it out in full? (also in the file name)
new Field("b", FieldType.notNullable(new ArrowType.Binary()), null) | ||
); | ||
|
||
duv.initializeChildrenFromFields(fields); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but that's the easy case as it will use auto-incremented type ids (0,1,2...).
Instead, I would recommend calling DenseUnionVector.addVector
which allows you to add child fields with non-trivial type ids (7,6,5 for example).
int ac = BaseValueVector.INITIAL_VALUE_ALLOCATION + 1; | ||
for (int i = 0; i < ac; i++) { | ||
duv.setTypeId(i, (byte) 0); | ||
duv.setSafe(i, nih); | ||
} | ||
|
||
int bc = 1; | ||
for (int i = 0; i < bc; i++) { | ||
duv.setTypeId(i + ac, (byte) 1); | ||
duv.setSafe(i + ac, nvbh); | ||
} | ||
|
||
int count = ac + bc; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, can you avoid cryptic variable names? (nih
, ac
, etc.)
duv.setValueCount(count); | ||
|
||
// will not necessarily see an error unless bounds checking is on. | ||
assertDoesNotThrow(duv::getBufferSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also test getBufferSize
with an explicit argument?
bb8d171
to
d1a13b1
Compare
…UV getBufferSizeFor (apache#38242)
d1a13b1
to
680578a
Compare
Thanks a lot for the update @wotbrew ! CI failure is unrelated, so I'll merge. |
…enseUnionVector#getBufferSizeFor (apache#38305) ### What changes are included in this PR? Fix incorrect implementation of `DenseUnionVector.getBufferSizeFor`. Sum the type id counts before calling `getBufferSizeFor` on the union's child vectors. ### Are these changes tested? Yes. A test verifies the OOB read (requires bounds check), and an example count is correct, fails as expected before fix. **This PR contains a "Critical Fix".** For users of DenseUnionVector the size can calculated incorrectly, as well as cause of out-of-bounds buffer reads which may return garbage (or potentially segfaulting) if bounds checking is off. * Closes: apache#38242 Authored-by: Dan Stone <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 3793560. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…enseUnionVector#getBufferSizeFor (apache#38305) ### What changes are included in this PR? Fix incorrect implementation of `DenseUnionVector.getBufferSizeFor`. Sum the type id counts before calling `getBufferSizeFor` on the union's child vectors. ### Are these changes tested? Yes. A test verifies the OOB read (requires bounds check), and an example count is correct, fails as expected before fix. **This PR contains a "Critical Fix".** For users of DenseUnionVector the size can calculated incorrectly, as well as cause of out-of-bounds buffer reads which may return garbage (or potentially segfaulting) if bounds checking is off. * Closes: apache#38242 Authored-by: Dan Stone <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…enseUnionVector#getBufferSizeFor (apache#38305) ### What changes are included in this PR? Fix incorrect implementation of `DenseUnionVector.getBufferSizeFor`. Sum the type id counts before calling `getBufferSizeFor` on the union's child vectors. ### Are these changes tested? Yes. A test verifies the OOB read (requires bounds check), and an example count is correct, fails as expected before fix. **This PR contains a "Critical Fix".** For users of DenseUnionVector the size can calculated incorrectly, as well as cause of out-of-bounds buffer reads which may return garbage (or potentially segfaulting) if bounds checking is off. * Closes: apache#38242 Authored-by: Dan Stone <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…enseUnionVector#getBufferSizeFor (apache#38305) ### What changes are included in this PR? Fix incorrect implementation of `DenseUnionVector.getBufferSizeFor`. Sum the type id counts before calling `getBufferSizeFor` on the union's child vectors. ### Are these changes tested? Yes. A test verifies the OOB read (requires bounds check), and an example count is correct, fails as expected before fix. **This PR contains a "Critical Fix".** For users of DenseUnionVector the size can calculated incorrectly, as well as cause of out-of-bounds buffer reads which may return garbage (or potentially segfaulting) if bounds checking is off. * Closes: apache#38242 Authored-by: Dan Stone <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
What changes are included in this PR?
Fix incorrect implementation of
DenseUnionVector.getBufferSizeFor
.Sum the type id counts before calling
getBufferSizeFor
on the union's child vectors.Are these changes tested?
Yes. A test verifies the OOB read (requires bounds check), and an example count is correct, fails as expected before fix.
This PR contains a "Critical Fix".
For users of DenseUnionVector the size can calculated incorrectly, as well as cause of out-of-bounds buffer reads which may return garbage (or potentially segfaulting) if bounds checking is off.