Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add JNI support for converting Arrow buffers to CUDF ColumnVectors [skip ci] #7222
Add JNI support for converting Arrow buffers to CUDF ColumnVectors [skip ci] #7222
Changes from 4 commits
1cc14cb
0bcef38
512bf97
35bdba3
4ead8b9
3b8cffc
b922e00
e73a91e
9e4a846
55eed1b
6418e50
bf48c1d
139f5ab
cc23f78
364c993
5cc697c
34962f2
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all these container fields can be declared
final
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Speaking of things getting out of sync, would it be better for these to be an array of objects instead of an object of arrays? How many of these do you expect a user to pass in? And even if it is large is the access pattern for the metadata going to be one batch at a time or all of the offsets followed by all of the nullCounts, ...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had briefly considered making a class to hold each Arrow batch data so we didn't have as many lists but honestly didn't think it all the way thru and kind of forgot about it, so glad you brought it up.
You are going to get one of these whenever you hit the row limit while iterating the ColumnBatches in HostColumnToGpu. That is on the Spark side at least.
I'm not sure I follow your last question. you can see how its used below, currently we build a column per entry here and then concatenate all of the column vectors at the end.
one thing about putting this into another class and making it an array of objects is we could then just extend that class to support nested types and this api shouldn't have to change (hopefully?)...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just trying to understand if there was a performance reason to use a object of arrays vs an array of objects. From what I have seen because of the access pattern an array of objects should not be a performance problem, and hopefully will make the code a bit more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually I could just have this take in the Arrow ValueVector and have this do the work internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do that then we have to depend on Arrow as a dependency for the API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with how it is, it really was just a nit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could use forEach method
allVecs.forEach(v -> v.close());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sj
is unusedThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks will update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This API would be more Java-like if it took
ByteBuffer
instances (which can beDirectByteBuffer
for off-heap). Not sure if that makes it easier to wield for the intended use-case.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for pointing me to the jni bytebuffer api, I wasn't aware of a way to get the address and size