Add JNI support for converting Arrow buffers to CUDF ColumnVectors [skip ci] #7222
Conversation
Signed-off-by: Thomas Graves <[email protected]>
Codecov Report
@@             Coverage Diff              @@
##           branch-0.18    #7222      +/-   ##
===============================================
+ Coverage        82.09%   82.17%   +0.08%
===============================================
  Files               97       99       +2
  Lines            16474    16805     +331
===============================================
+ Hits             13524    13810     +286
- Misses            2950     2995      +45
Continue to review full report at Codecov.
 */
public final class ArrowColumnBuilder implements AutoCloseable {
  private DType type;
  private ArrayList<Long> data = new ArrayList<>();
all these container fields can be declared final
will update
for (ColumnVector cv : allVecs) {
  cv.close();
}
nit: could use the forEach method: `allVecs.forEach(v -> v.close());`
@Override
public String toString() {
  StringJoiner sj = new StringJoiner(",");
`sj` is unused
thanks, will update
  private ArrayList<Long> dataLength = new ArrayList<>();
  private ArrayList<Long> validity = new ArrayList<>();
  private ArrayList<Long> validityLength = new ArrayList<>();
  private ArrayList<Long> offsets = new ArrayList<>();
Speaking of things getting out of sync, would it be better for these to be an array of objects instead of an object of arrays? How many of these do you expect a user to pass in? And even if it is large is the access pattern for the metadata going to be one batch at a time or all of the offsets followed by all of the nullCounts, ...?
I had briefly considered making a class to hold the data for each Arrow batch so we didn't have as many lists, but honestly I didn't think it all the way through and kind of forgot about it, so I'm glad you brought it up.
You are going to get one of these whenever you hit the row limit while iterating the ColumnBatches in HostColumnToGpu. That is on the Spark side at least.
I'm not sure I follow your last question. You can see how it's used below: currently we build a column per entry here and then concatenate all of the column vectors at the end.
One thing about putting this into another class and making it an array of objects is that we could then just extend that class to support nested types and this API shouldn't have to change (hopefully?)...
I was just trying to understand if there was a performance reason to use an object of arrays vs an array of objects. From what I have seen of the access pattern, an array of objects should not be a performance problem, and it will hopefully make the code a bit more readable.
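As an aside, here is a minimal sketch of the array-of-objects idea being discussed, assuming one holder object per Arrow batch. The field names mirror the parallel lists in the diff, but the class itself is hypothetical and not code from this PR:

```java
// Hypothetical per-batch holder: one object per Arrow batch instead of the
// parallel lists (data, dataLength, validity, validityLength, offsets) shown
// in the diff, so the per-batch values cannot drift out of sync.
final class ArrowBatchRef {
  final long data;            // address of the Arrow data buffer
  final long dataLength;      // length of the data buffer in bytes
  final long validity;        // address of the validity (null mask) buffer
  final long validityLength;  // length of the validity buffer in bytes
  final long offsets;         // address of the offsets buffer (e.g. for strings)

  ArrowBatchRef(long data, long dataLength, long validity,
                long validityLength, long offsets) {
    this.data = data;
    this.dataLength = dataLength;
    this.validity = validity;
    this.validityLength = validityLength;
    this.offsets = offsets;
  }
}
```

The builder would then keep a single `private final List<ArrowBatchRef> batches = new ArrayList<>();`, append one entry per addBatch call, and iterate it one batch at a time, matching the access pattern described above.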
Actually, I could just have this take in the Arrow ValueVector and have it do the work internally.
If we do that, then we have to depend on Arrow as a dependency for the API.
I'm fine with how it is, it really was just a nit
Thanks @tgravescs, looks good overall. I have some minor nits with doc and code cleanup, but nothing that is a must-fix.
Co-authored-by: Jason Lowe <[email protected]>
tests and use ILLEGAL_ARG_CLASS
@gpucibot merge
private static ByteBuffer bufferAsDirect(ByteBuffer buf) {
  ByteBuffer bufferOut = buf;
  if (bufferOut != null && !bufferOut.isDirect()) {
    bufferOut = ByteBuffer.allocateDirect(buf.remaining());
Is ByteOrder an issue here, i.e. if buf is in non-native byte order?
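For what it's worth, here is a hedged sketch (not code from this PR) of one way the helper could carry the source buffer's ByteOrder over to the direct copy. The raw byte copy itself is order-agnostic; the order only matters if typed views (getInt/getLong) are taken on the copy later:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class DirectBufferOrderExample {
  // Illustrative variant of a bufferAsDirect-style helper that preserves the
  // source buffer's declared byte order on the direct copy.
  static ByteBuffer bufferAsDirect(ByteBuffer buf) {
    ByteBuffer bufferOut = buf;
    if (bufferOut != null && !bufferOut.isDirect()) {
      bufferOut = ByteBuffer.allocateDirect(buf.remaining())
          .order(buf.order());        // carry over the declared ByteOrder
      bufferOut.put(buf.duplicate()); // copy bytes without moving buf's position
      bufferOut.flip();               // make the copy readable from position 0
    }
    return bufferOut;
  }

  public static void main(String[] args) {
    ByteBuffer heap = ByteBuffer.allocate(Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    heap.putLong(42L);
    heap.flip();
    ByteBuffer direct = bufferAsDirect(heap);
    System.out.println(direct.isDirect() + " " + direct.order() + " " + direct.getLong());
  }
}
```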
This adds a JNI layer that can build up Arrow column vectors, which are just references to off-heap Arrow buffers, and then convert them into CUDF ColumnVectors by copying the Arrow data directly to the GPU.
The way this works is that you create an ArrowColumnBuilder for each column you need, call addBatch for each separate Arrow buffer you want to add to that column, and then call buildAndPutOnDevice() on the builder. That passes the Arrow pointers into CUDF, builds an Arrow table with one column, hands that table to cudf::from_arrow (which returns a CUDF table), and then grabs the single column from that table and returns it.
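To make the flow concrete, here is a hedged usage sketch. The names ArrowColumnBuilder, addBatch, and buildAndPutOnDevice come from this description, but the constructor argument and the addBatch parameter shape (rows, null count, data, validity, offsets) are assumptions for illustration, as are the placeholder buffers:

```java
import java.nio.ByteBuffer;

import ai.rapids.cudf.ArrowColumnBuilder;
import ai.rapids.cudf.ColumnVector;
import ai.rapids.cudf.DType;
import ai.rapids.cudf.HostColumnVector;

public class ArrowToCudfExample {
  public static void main(String[] args) {
    // Placeholder buffers standing in for references to off-heap Arrow buffers.
    ByteBuffer dataBuf = ByteBuffer.allocateDirect(4 * Integer.BYTES);
    ByteBuffer validityBuf = ByteBuffer.allocateDirect(1);

    // One builder per column; assumed here to take the column's data type.
    try (ArrowColumnBuilder builder =
             new ArrowColumnBuilder(new HostColumnVector.BasicType(true, DType.INT32))) {
      // One addBatch call per Arrow buffer contributing to this column.
      // The (rows, nullCount, data, validity, offsets) shape is an assumption.
      builder.addBatch(4, 0, dataBuf, validityBuf, null);

      // Hands the Arrow buffer references to CUDF, which builds a one-column
      // Arrow table, runs cudf::from_arrow, and returns the device column.
      try (ColumnVector cv = builder.buildAndPutOnDevice()) {
        System.out.println("rows on device: " + cv.getRowCount());
      }
    }
  }
}
```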
Note this only supports primitive types and Strings for now. List, Struct, Dictionary, and Decimal are not supported yet.
Signed-off-by: Thomas Graves [email protected]