GH-38254: [Java] Add reusable buffer getters to char/binary vectors #38266

Merged (2 commits) on Oct 23, 2023

jduo (Member) commented Oct 14, 2023

Rationale for this change

Provide a way for a user to reuse a buffer when iterating over byte-array-based ValueVectors to avoid excessive
reallocations.

What changes are included in this PR?

Add a reusable buffer interface that can be populated by character and binary vectors to avoid allocations when consuming vector content.

Optimize getObject() on VarCharVector/LargeVarCharVector to avoid an extra allocation of a byte array (copy from ArrowBuf directly to the resulting Text).
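
As a rough illustration of the reuse pattern described above, here is a minimal, self-contained sketch. It does not use the Arrow classes: `ReusableBytesSketch` and `ReuseDemo` are hypothetical names, and a plain `byte[]` plus an `int[]` of offsets stands in for a variable-width vector. The point is only that the holder's backing array is replaced when a value does not fit, not on every read.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a reusable byte holder; not the Arrow class. */
final class ReusableBytesSketch {
  private byte[] bytes = new byte[0];
  private int length;
  private int allocations; // how often the backing array was replaced

  /** Copies len bytes from src into this holder, growing only when needed. */
  void set(byte[] src, int start, int len) {
    if (bytes.length < len) {
      bytes = new byte[len];
      allocations++;
    }
    System.arraycopy(src, start, bytes, 0, len);
    length = len;
  }

  int getLength() {
    return length;
  }

  int getAllocations() {
    return allocations;
  }

  @Override
  public String toString() {
    return new String(bytes, 0, length, StandardCharsets.UTF_8);
  }
}

final class ReuseDemo {
  /** Reads every value of a toy variable-width column through one holder. */
  static int allocationsFor(byte[] data, int[] offsets) {
    ReusableBytesSketch holder = new ReusableBytesSketch();
    for (int i = 0; i + 1 < offsets.length; i++) {
      holder.set(data, offsets[i], offsets[i + 1] - offsets[i]);
    }
    return holder.getAllocations();
  }

  public static void main(String[] args) {
    byte[] data = "arrowjavabuf".getBytes(StandardCharsets.UTF_8);
    int[] offsets = {0, 5, 9, 12}; // "arrow", "java", "buf"
    System.out.println(allocationsFor(data, offsets)); // prints 1
  }
}
```

In this toy run the first value is the longest, so a single allocation serves all three reads.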

Are these changes tested?

Are there any user-facing changes?

Yes.

This PR includes breaking changes to public APIs.

jduo force-pushed the 38254-reusable-buffer-varcharvector branch from fcdf8cf to 0561808 on October 16, 2023 17:27
jduo marked this pull request as ready for review on October 16, 2023 17:28
jduo requested a review from lidavidm as a code owner on October 16, 2023 17:28
* @param index position of element.
* @param outputBuffer the buffer to write into.
*/
public void readBytes(int index, ReusableBuffer<?> outputBuffer) {


Should we allow any reusable buffer here, given that reusable buffers may be specialized for byte vs. string types?

Member Author

The idea is that users could implement ReusableBuffer on top of other byte-container types such as ByteBuffer or Netty ByteBuf objects.
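
A sketch of what such a user-side implementation could look like, under stated assumptions: `ReusableBufferSketch` below is a hypothetical, simplified interface written for this example (its source argument is a `java.nio.ByteBuffer` so the code has no Arrow dependency), and `ByteBufferBackedSketch` is likewise illustrative rather than part of the PR.

```java
import java.nio.ByteBuffer;

/** Hypothetical, simplified interface for this example only. */
interface ReusableBufferSketch<T> {
  long getLength();

  T getBuffer();

  void set(ByteBuffer src, long start, long len);
}

/** Reusable holder backed by a heap ByteBuffer instead of a byte[]. */
final class ByteBufferBackedSketch implements ReusableBufferSketch<ByteBuffer> {
  private ByteBuffer buffer = ByteBuffer.allocate(0);
  private long length;

  @Override
  public long getLength() {
    return length;
  }

  @Override
  public ByteBuffer getBuffer() {
    return buffer;
  }

  @Override
  public void set(ByteBuffer src, long start, long len) {
    int n = Math.toIntExact(len); // ByteBuffer capacities are int-sized
    if (buffer.capacity() < n) {
      buffer = ByteBuffer.allocate(n); // grow only when the value does not fit
    }
    buffer.clear();
    ByteBuffer slice = src.duplicate(); // leaves src's position and limit untouched
    slice.limit(Math.toIntExact(start + len));
    slice.position(Math.toIntExact(start));
    buffer.put(slice);
    buffer.flip(); // make the copied bytes readable from position 0
    length = len;
  }
}
```

The wildcard in `ReusableBuffer<?>` on the vector side is what would let a reader accept either this kind of holder or a byte-array-backed one without caring about the container type.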

*/
public void readBytes(int index, ReusableBuffer<?> outputBuffer) {
final long startOffset = getStartOffset(index);
final int dataLength =


Use getEndOffset instead? Do we need to worry about lengths exceeding max integer?

Member Author

Oddly, getEndOffset() is part of BaseVariableWidthVector and not part of BaseLargeVariableWidthVector, and BaseLargeVariableWidthVector doesn't extend BaseVariableWidthVector.

The existing getters on this class do the narrowing cast from long to int as well, so if this is a problem the scope is larger than this.

Member

I suppose they don't inherit because of the different integer types.

The long vs int problem is baked heavily into the library (and I guess ByteBuffer), so I don't think it's a concern here, but new interfaces should probably use long to be forward-looking (e.g., MemorySegment uses long now)
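
To make the long-vs-int concern concrete, here is a small sketch in plain Java (no Arrow types; `NarrowingSketch` is a hypothetical name). It contrasts the silent wrap-around of a narrowing cast with `Math.toIntExact`, which fails loudly when a length does not fit in an `int`.

```java
final class NarrowingSketch {
  /** Plain narrowing cast: silently keeps only the low 32 bits. */
  static int uncheckedLength(long start, long end) {
    return (int) (end - start);
  }

  /** Checked conversion: throws ArithmeticException if the length overflows an int. */
  static int checkedLength(long start, long end) {
    return Math.toIntExact(end - start);
  }

  public static void main(String[] args) {
    long start = 0L;
    long end = Integer.MAX_VALUE + 1L; // 2^31, one past the largest int
    System.out.println(uncheckedLength(start, end)); // prints -2147483648
    try {
      checkedLength(start, end);
    } catch (ArithmeticException e) {
      System.out.println("length does not fit in an int");
    }
  }
}
```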

Member

And I think that it's not possible for the current implementation to exceed int. However a Java 21+ memory module should actually be able to.

@danepitkin we should probably file some follow-up work for after your PR lands to shake out all the problems that are about to crop up.

*/
public void readBytes(int index, ReusableBuffer<?> outputBuffer) {
final long startOffset = getStartOffset(index);
final int dataLength =


Similar to the comment above: I thought the main purpose of the Large variants was to allow values larger than the max integer.

adamkennedy commented:

There are no unit tests written for these new classes, and we probably should have some.

github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label on Oct 16, 2023
jduo force-pushed the 38254-reusable-buffer-varcharvector branch from 0561808 to 519258f on October 16, 2023 23:09

jduo (Member Author) commented Oct 16, 2023

> There are no unit tests written for these new classes, and we probably should have some.

Added new tests.

*
* @return the number of valid bytes in the data
*/
int getLength();
Member

Should we use long for future compatibility?

github-actions bot added the "awaiting changes" label and removed the "awaiting committer review" label on Oct 20, 2023
jduo force-pushed the 38254-reusable-buffer-varcharvector branch from 519258f to cc90fb1 on October 20, 2023 18:28
github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label on Oct 20, 2023
github-actions bot added the "awaiting merge" label and removed the "awaiting change review" label on Oct 20, 2023

lidavidm (Member) commented:

I'm going to mark this a "breaking change" since Text#getLength now returns long

github-actions bot added the "awaiting changes" label and removed the "awaiting merge" label on Oct 20, 2023
jduo force-pushed the 38254-reusable-buffer-varcharvector branch from cc90fb1 to 6540883 on October 20, 2023 20:02
github-actions bot removed the "awaiting changes" label on Oct 20, 2023
github-actions bot added the "awaiting committer review" and "awaiting changes" labels and removed the "awaiting review", "awaiting changes", and "awaiting committer review" labels on Oct 20, 2023
* @param len the number of bytes we need
* @param keepData should the old data be kept
*/
protected void setCapacity(int len, boolean keepData) {


Under what situation would we need to keep the existing data?

Member Author

This method is ported from Text. It gets set to true when used by the append() method there.
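
The distinction can be sketched as follows. This is an illustrative, simplified holder written for this example (`GrowableBytesSketch` is a hypothetical name, and the growth policy is the simplest possible one), not the code from the PR or from Text: overwriting a value can discard the old bytes when growing, while appending has to carry them over.

```java
import java.util.Arrays;

final class GrowableBytesSketch {
  private byte[] bytes = new byte[0];
  private int length;

  /** Ensures room for len bytes; copies the current content over only if keepData is true. */
  void setCapacity(int len, boolean keepData) {
    if (bytes.length < len) {
      byte[] grown = new byte[len];
      if (keepData) {
        System.arraycopy(bytes, 0, grown, 0, length);
      }
      bytes = grown;
    }
  }

  /** Overwrites the content: old bytes are irrelevant, so keepData is false. */
  void set(byte[] src, int start, int len) {
    setCapacity(len, false);
    System.arraycopy(src, start, bytes, 0, len);
    length = len;
  }

  /** Extends the content: old bytes must survive the resize, so keepData is true. */
  void append(byte[] src, int start, int len) {
    setCapacity(length + len, true);
    System.arraycopy(src, start, bytes, length, len);
    length += len;
  }

  /** Returns a copy of the valid bytes (for inspection in this example). */
  byte[] copyBytes() {
    return Arrays.copyOf(bytes, length);
  }
}
```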

public void read(int index, ReusableBuffer<?> outputBuffer) {
final long startOffset = getStartOffset(index);
final long dataLength =
offsetBuffer.getLong((long) (index + 1) * OFFSET_WIDTH) - startOffset;


Use getEndOffset here rather than reimplementing it?

Member Author

getEndOffset() is not implemented for LargeVectors. I can use it for VarVector and write an implementation for Large*Vectors.
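
For readers unfamiliar with the offset layout being discussed, here is a hedged sketch of start and end offset lookups over 8-byte offsets. It is not the Arrow implementation: `LargeOffsetsSketch` is a hypothetical class, and a `java.nio.ByteBuffer` (indexed by `int`) stands in for the vector's offset buffer, which the snippet above indexes with a `long`.

```java
import java.nio.ByteBuffer;

final class LargeOffsetsSketch {
  static final int OFFSET_WIDTH = 8; // bytes per offset entry, since entries are longs

  private final ByteBuffer offsetBuffer;

  /** Builds an offset buffer holding n + 1 entries for n values. */
  LargeOffsetsSketch(long... offsets) {
    offsetBuffer = ByteBuffer.allocate(offsets.length * OFFSET_WIDTH);
    for (long offset : offsets) {
      offsetBuffer.putLong(offset);
    }
  }

  /** Entry i marks where value i begins in the data buffer. */
  long getStartOffset(int index) {
    return offsetBuffer.getLong(index * OFFSET_WIDTH);
  }

  /** Entry i + 1 marks where value i ends (and where value i + 1 begins). */
  long getEndOffset(int index) {
    return offsetBuffer.getLong((index + 1) * OFFSET_WIDTH);
  }

  long getValueLength(int index) {
    return getEndOffset(index) - getStartOffset(index);
  }
}
```

With both helpers available, a reader can compute a value's length as end minus start instead of repeating the raw buffer arithmetic at each call site.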

jduo force-pushed the 38254-reusable-buffer-varcharvector branch from d1471b4 to 63e8cf4 on October 20, 2023 23:03
github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label on Oct 20, 2023

lidavidm (Member) commented:

Can you rebase to pick up the compilation fix from this weekend?

}

/**
* Copied from Arrays.hashCode so we don't have to copy the byte array.
Member

nit: remove this comment? Unless you really did copy it from the JDK (in which case don't)
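
For context on what "not copying the byte array" buys, here is an independently written sketch (hypothetical class name `PrefixHashSketch`, not the PR's code). It applies the 31-based polynomial formula that the `java.util.List.hashCode` documentation specifies to only the first `length` bytes of an array, so the valid prefix never has to be copied out first.

```java
import java.util.Arrays;

final class PrefixHashSketch {
  /** Hashes bytes[0..length) with the formula documented for List.hashCode. */
  static int hashCode(byte[] bytes, int length) {
    if (bytes == null) {
      return 0;
    }
    int result = 1;
    for (int i = 0; i < length; i++) {
      result = 31 * result + bytes[i];
    }
    return result;
  }

  public static void main(String[] args) {
    byte[] backing = {1, 2, 3, 99, 99}; // only the first 3 bytes are valid
    int direct = hashCode(backing, 3);
    int viaCopy = Arrays.hashCode(Arrays.copyOf(backing, 3));
    System.out.println(direct == viaCopy); // prints true
  }
}
```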

github-actions bot added the "awaiting changes" label and removed the "awaiting change review" label on Oct 23, 2023
…tors

Add a reusable buffer interface that can be populated by
character and binary vectors to avoid allocations when consuming
vector content.

Optimize getObject() on VarCharVector/LargeVarCharVector to
avoid an extra allocation of a byte array (copy from ArrowBuf
directly to the resulting Text).
- Add getEndOffset() to BaseLargeVariableWidthVector.
- Change BaseVariableWidthVector, BaseLargeVariableWidthVector, VarBinaryVector, VarCharVector,
  LargeVarBinaryVector, and LargeVarCharVector to use getStartOffset() and getEndOffset() when possible.
- Add tests for ReusableByteArray equals(), hashCode(), and toString() functions.
- Rename the outputBuffer parameter in read() functions to buffer.
jduo force-pushed the 38254-reusable-buffer-varcharvector branch from 63e8cf4 to ae9d6a2 on October 23, 2023 15:23
github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label on Oct 23, 2023
github-actions bot added the "awaiting merge" label and removed the "awaiting change review" label on Oct 23, 2023
lidavidm merged commit 6f8f34b into apache:main on Oct 23, 2023
15 checks passed
lidavidm added the "Breaking Change" label and removed the "awaiting merge" label on Oct 23, 2023

conbench-apache-arrow commented:

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 6f8f34b.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 25, 2023
…tors (apache#38266)

* Closes: apache#38254

Authored-by: James Duong
Signed-off-by: David Li
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
Labels: Breaking Change, Component: Java

Linked issue that merging this pull request may close: [Java] Allow passing pre-allocated buffers to VarCharVector

3 participants