Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-38242: [Java] Fix incorrect internal struct accounting for DenseUnionVector#getBufferSizeFor #38305

Merged
merged 1 commit into from
Oct 23, 2023

Conversation

wotbrew
Copy link
Contributor

@wotbrew wotbrew commented Oct 17, 2023

What changes are included in this PR?

Fix incorrect implementation of DenseUnionVector.getBufferSizeFor.
Sum the type id counts before calling getBufferSizeFor on the union's child vectors.

Are these changes tested?

Yes. A test verifies the OOB read (requires bounds check), and an example count is correct, fails as expected before fix.

This PR contains a "Critical Fix".

For users of DenseUnionVector the size can calculated incorrectly, as well as cause of out-of-bounds buffer reads which may return garbage (or potentially segfaulting) if bounds checking is off.

@wotbrew wotbrew changed the title GH-38242: [Java] Fix incorrect internal struct accounting for DUV getBufferSizeFor (#38242) GH-38242: [Java] Fix incorrect internal struct accounting for DUV getBufferSizeFor Oct 17, 2023
@github-actions github-actions bot added the awaiting review Awaiting review label Oct 17, 2023
@github-actions
Copy link

⚠️ GitHub issue #38242 has been automatically assigned in GitHub to PR creator.

@pitrou
Copy link
Member

pitrou commented Oct 19, 2023

cc @jduo

@@ -714,8 +714,22 @@ public int getBufferSizeFor(final int count) {
if (count == 0) {
return 0;
}

int[] legCounts = new int[typeFields.length];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a "leg" here? This is not common terminology in Arrow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could call this typeCounts or childCounts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgive the name 😓, internal team term leakage!

Fixed by adapting setChildVectorValueCounts (using counts)

legCounts[tid]++;
}

long legBytes = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: call this totalChildBytes for example.

Comment on lines 726 to 732
for (ValueVector v : internalStruct) {
legBytes += (long)v.getBufferSizeFor(legCounts[tid]);
tid++;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is right: internalStruct has its fields indexed by physical child id (that is: 0, 1, 2...) but the type ids are logical values that are unrelated to the physical structure.

But I think that you should just copy what is already done in setChildVectorValueCounts and adapt it for the use case of counting buffer bytes.

return (int) (count * TYPE_WIDTH + (long) count * OFFSET_WIDTH
+ DataSizeRoundingUtil.divideBy8Ceil(count) + internalStruct.getBufferSizeFor(count));
+ DataSizeRoundingUtil.divideBy8Ceil(count) + legBytes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think DataSizeRoundingUtil.divideBy8Ceil(count) is a mistake: union vectors don't have a validity bitmap, so they don't need any buffer for it.

import org.apache.arrow.vector.util.DataSizeRoundingUtil;
import org.junit.jupiter.api.Test;

public class TestDUVBufferSize {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure "DUV" is a common abbreviation, can you just spell it out in full? (also in the file name)

new Field("b", FieldType.notNullable(new ArrowType.Binary()), null)
);

duv.initializeChildrenFromFields(fields);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but that's the easy case as it will use auto-incremented type ids (0,1,2...).
Instead, I would recommend calling DenseUnionVector.addVector which allows you to add child fields with non-trivial type ids (7,6,5 for example).

Comment on lines 56 to 68
int ac = BaseValueVector.INITIAL_VALUE_ALLOCATION + 1;
for (int i = 0; i < ac; i++) {
duv.setTypeId(i, (byte) 0);
duv.setSafe(i, nih);
}

int bc = 1;
for (int i = 0; i < bc; i++) {
duv.setTypeId(i + ac, (byte) 1);
duv.setSafe(i + ac, nvbh);
}

int count = ac + bc;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, can you avoid cryptic variable names? (nih, ac, etc.)

duv.setValueCount(count);

// will not necessarily see an error unless bounds checking is on.
assertDoesNotThrow(duv::getBufferSize);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also test getBufferSize with an explicit argument?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Oct 19, 2023
@lidavidm lidavidm changed the title GH-38242: [Java] Fix incorrect internal struct accounting for DUV getBufferSizeFor GH-38242: [Java] Fix incorrect internal struct accounting for DenseUnionVector#getBufferSizeFor Oct 20, 2023
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting committer review Awaiting committer review labels Oct 20, 2023
@pitrou
Copy link
Member

pitrou commented Oct 23, 2023

Thanks a lot for the update @wotbrew ! CI failure is unrelated, so I'll merge.

@pitrou pitrou merged commit 3793560 into apache:main Oct 23, 2023
17 of 18 checks passed
@pitrou pitrou removed the awaiting merge Awaiting merge label Oct 23, 2023
JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…enseUnionVector#getBufferSizeFor (apache#38305)

### What changes are included in this PR?

Fix incorrect implementation of `DenseUnionVector.getBufferSizeFor`.
Sum the type id counts before calling `getBufferSizeFor` on the union's child vectors.

### Are these changes tested?

Yes. A test verifies the OOB read (requires bounds check), and an example count is correct, fails as expected before fix.

**This PR contains a "Critical Fix".** 

For users of DenseUnionVector the size can calculated incorrectly, as well as cause of out-of-bounds buffer reads which may return garbage (or potentially segfaulting) if bounds checking is off.

* Closes: apache#38242

Authored-by: Dan Stone <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 3793560.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them.

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 25, 2023
…enseUnionVector#getBufferSizeFor (apache#38305)

### What changes are included in this PR?

Fix incorrect implementation of `DenseUnionVector.getBufferSizeFor`.
Sum the type id counts before calling `getBufferSizeFor` on the union's child vectors.

### Are these changes tested?

Yes. A test verifies the OOB read (requires bounds check), and an example count is correct, fails as expected before fix.

**This PR contains a "Critical Fix".** 

For users of DenseUnionVector the size can calculated incorrectly, as well as cause of out-of-bounds buffer reads which may return garbage (or potentially segfaulting) if bounds checking is off.

* Closes: apache#38242

Authored-by: Dan Stone <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…enseUnionVector#getBufferSizeFor (apache#38305)

### What changes are included in this PR?

Fix incorrect implementation of `DenseUnionVector.getBufferSizeFor`.
Sum the type id counts before calling `getBufferSizeFor` on the union's child vectors.

### Are these changes tested?

Yes. A test verifies the OOB read (requires bounds check), and an example count is correct, fails as expected before fix.

**This PR contains a "Critical Fix".** 

For users of DenseUnionVector the size can calculated incorrectly, as well as cause of out-of-bounds buffer reads which may return garbage (or potentially segfaulting) if bounds checking is off.

* Closes: apache#38242

Authored-by: Dan Stone <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…enseUnionVector#getBufferSizeFor (apache#38305)

### What changes are included in this PR?

Fix incorrect implementation of `DenseUnionVector.getBufferSizeFor`.
Sum the type id counts before calling `getBufferSizeFor` on the union's child vectors.

### Are these changes tested?

Yes. A test verifies the OOB read (requires bounds check), and an example count is correct, fails as expected before fix.

**This PR contains a "Critical Fix".** 

For users of DenseUnionVector the size can calculated incorrectly, as well as cause of out-of-bounds buffer reads which may return garbage (or potentially segfaulting) if bounds checking is off.

* Closes: apache#38242

Authored-by: Dan Stone <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Java] Incorrect getBufferSize for a dense union vector
3 participants