-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-38254: [Java] Add reusable buffer getters to char/binary vectors #38266
GH-38254: [Java] Add reusable buffer getters to char/binary vectors #38266
Conversation
fcdf8cf
to
0561808
Compare
java/memory/memory-core/src/main/java/org/apache/arrow/memory/ReusableBuffer.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/FixedSizeBinaryVector.java
Outdated
Show resolved
Hide resolved
* @param index position of element. | ||
* @param outputBuffer the buffer to write into. | ||
*/ | ||
public void readBytes(int index, ReusableBuffer<?> outputBuffer) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would we allow any reusable buffer here, when the reusable buffers may be specialized to byte vs string type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that users could implement ReusableBuffer on top of other byte-container types such as ByteBuffer or Netty ByteBuf objects.
*/ | ||
public void readBytes(int index, ReusableBuffer<?> outputBuffer) { | ||
final long startOffset = getStartOffset(index); | ||
final int dataLength = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use getEndOffset instead? Do we need to worry about lengths exceeding max integer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oddly, getEndOffset() is part of BaseVariableWidthVector and not part of BaseLargeVariableWidthVector, and BLVW doesn't extend VLW.
The existing getters on this class do the narrowing cast from long to int as well, so if this is a problem the scope is larger than this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose they don't inherit because of the different integer types.
The long
vs int
problem is baked heavily into the library (and I guess ByteBuffer), so I don't think it's a concern here, but new interfaces should probably use long
to be forward-looking (e.g., MemorySegment uses long
now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I think that it's not possible for the current implementation to exceed int
. However a Java 21+ memory module should actually be able to.
@danepitkin we should probably file some follow-up work for after your PR lands to shake out all the problems that are about to crop up.
*/ | ||
public void readBytes(int index, ReusableBuffer<?> outputBuffer) { | ||
final long startOffset = getStartOffset(index); | ||
final int dataLength = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to above comment, I thought main purpose of the Large variants was to allow for values larger than max integer.
java/vector/src/main/java/org/apache/arrow/vector/util/ReusableArrowBuf.java
Outdated
Show resolved
Hide resolved
There's no unit tests written for these new classes, and we probably should have some. |
0561808
to
519258f
Compare
Added new tests. |
* | ||
* @return the number of valid bytes in the data | ||
*/ | ||
int getLength(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use long
for future compatibility?
java/memory/memory-core/src/main/java/org/apache/arrow/memory/ReusableBuffer.java
Outdated
Show resolved
Hide resolved
*/ | ||
public void readBytes(int index, ReusableBuffer<?> outputBuffer) { | ||
final long startOffset = getStartOffset(index); | ||
final int dataLength = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose they don't inherit because of the different integer types.
The long
vs int
problem is baked heavily into the library (and I guess ByteBuffer), so I don't think it's a concern here, but new interfaces should probably use long
to be forward-looking (e.g., MemorySegment uses long
now)
*/ | ||
public void readBytes(int index, ReusableBuffer<?> outputBuffer) { | ||
final long startOffset = getStartOffset(index); | ||
final int dataLength = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I think that it's not possible for the current implementation to exceed int
. However a Java 21+ memory module should actually be able to.
@danepitkin we should probably file some follow-up work for after your PR lands to shake out all the problems that are about to crop up.
java/vector/src/main/java/org/apache/arrow/vector/util/ReusableByteArray.java
Outdated
Show resolved
Hide resolved
519258f
to
cc90fb1
Compare
I'm going to mark this a "breaking change" since |
cc90fb1
to
6540883
Compare
* @param len the number of bytes we need | ||
* @param keepData should the old data be kept | ||
*/ | ||
protected void setCapacity(int len, boolean keepData) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under what situation would we need to keep the existing data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is ported from Text. It gets set to true when used by the append() method there.
java/vector/src/main/java/org/apache/arrow/vector/FixedSizeBinaryVector.java
Outdated
Show resolved
Hide resolved
public void read(int index, ReusableBuffer<?> outputBuffer) { | ||
final long startOffset = getStartOffset(index); | ||
final long dataLength = | ||
offsetBuffer.getLong((long) (index + 1) * OFFSET_WIDTH) - startOffset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use getEndOffset instead here rather than reimplement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getEndOffset() is not implemented for LargeVectors. I can use it for VarVector and write an implementation for Large*Vectors.
d1471b4
to
63e8cf4
Compare
Can you rebase to pick up the compilation fix from this weekend? |
} | ||
|
||
/** | ||
* Copied from Arrays.hashCode so we don't have to copy the byte array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove this comment? Unless you really did copy it from the JDK (in which case don't)
…tors Add a reusable buffer interface that can be populated by character and binary vectors to avoid allocations when consuming vector content. Optimize getObject() on VarCharVector/LargeVarCharVector to avoid an extra allocation of a byte array (copy from ArrowBuf directly to the resulting Text).
- Add getEndOffset() to BaseLargeVariableWidthVector. - Change BaseVariableWidthVector, BaseLargeVariableWidthVector, VarBinaryVector, VarCharVector, LargeVarBinaryVector, and LargeVarCharVector to use getStartOffset() and getEndOffset() when possible. - Add tests for ReusableByteArray equals(), hashCode(), and toString() functions. - Rename the outputBuffer parameter in read() functions to buffer.
63e8cf4
to
ae9d6a2
Compare
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 6f8f34b. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them. |
…tors (apache#38266) ### Rationale for this change Provide a way for a user to reuse a buffer when iterating over byte-array-based ValueVectors to avoid excessive reallocations. ### What changes are included in this PR? Add a reusable buffer interface that can be populated by character and binary vectors to avoid allocations when consuming vector content. Optimize getObject() on VarCharVector/LargeVarCharVector to avoid an extra allocation of a byte array (copy from ArrowBuf directly to the resulting Text). ### Are these changes tested? ### Are there any user-facing changes? Yes. **This PR includes breaking changes to public APIs.** * Closes: apache#38254 Authored-by: James Duong <[email protected]> Signed-off-by: David Li <[email protected]>
…tors (apache#38266) ### Rationale for this change Provide a way for a user to reuse a buffer when iterating over byte-array-based ValueVectors to avoid excessive reallocations. ### What changes are included in this PR? Add a reusable buffer interface that can be populated by character and binary vectors to avoid allocations when consuming vector content. Optimize getObject() on VarCharVector/LargeVarCharVector to avoid an extra allocation of a byte array (copy from ArrowBuf directly to the resulting Text). ### Are these changes tested? ### Are there any user-facing changes? Yes. **This PR includes breaking changes to public APIs.** * Closes: apache#38254 Authored-by: James Duong <[email protected]> Signed-off-by: David Li <[email protected]>
…tors (apache#38266) ### Rationale for this change Provide a way for a user to reuse a buffer when iterating over byte-array-based ValueVectors to avoid excessive reallocations. ### What changes are included in this PR? Add a reusable buffer interface that can be populated by character and binary vectors to avoid allocations when consuming vector content. Optimize getObject() on VarCharVector/LargeVarCharVector to avoid an extra allocation of a byte array (copy from ArrowBuf directly to the resulting Text). ### Are these changes tested? ### Are there any user-facing changes? Yes. **This PR includes breaking changes to public APIs.** * Closes: apache#38254 Authored-by: James Duong <[email protected]> Signed-off-by: David Li <[email protected]>
Rationale for this change
Provide a way for a user to reuse a buffer when iterating over byte-array-based ValueVectors to avoid excessive
reallocations.
What changes are included in this PR?
Add a reusable buffer interface that can be populated by character and binary vectors to avoid allocations when consuming vector content.
Optimize getObject() on VarCharVector/LargeVarCharVector to avoid an extra allocation of a byte array (copy from ArrowBuf directly to the resulting Text).
Are these changes tested?
Are there any user-facing changes?
Yes.
This PR includes breaking changes to public APIs.