Add JNI support for converting Arrow buffers to CUDF ColumnVectors [skip ci] #7222

Merged
merged 17 commits into from
Jan 28, 2021

Conversation


@tgravescs tgravescs commented Jan 27, 2021

This adds the JNI layer needed to build up Arrow column vectors, which are just references to off-heap Arrow buffers, and then convert them into CUDF ColumnVectors by copying the Arrow data directly to the GPU.

The way this works is that you create an ArrowColumnBuilder for each column you need. You call addBatch for each separate Arrow buffer you want to add to that column, and then you call buildAndPutOnDevice() on the builder. That passes the Arrow pointers into CUDF, where an Arrow Table with one column is created. That Arrow Table is passed to cudf::from_arrow, which returns a CUDF Table, and we grab its single column and return it.
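
The builder workflow described above (add one batch at a time, then build a single column) can be illustrated with a minimal pure-Java sketch. It is not the cuDF API: `SketchInt32ColumnBuilder`, `directInts`, and `build` are hypothetical names, no JNI or GPU is involved, and the sketch assumes non-null INT32 values stored little-endian, with the result modelled as a plain `int[]`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the builder pattern described above.
// It does NOT call cuDF or JNI: the "device" column is just an int[].
final class SketchInt32ColumnBuilder implements AutoCloseable {
  // One entry per batch added; each is a view over the caller's buffer.
  private final List<ByteBuffer> batches = new ArrayList<>();
  private long totalRows = 0;

  // Record a reference to one batch of INT32 data; nothing is copied yet.
  void addBatch(long rowCount, ByteBuffer data) {
    if (data.remaining() != rowCount * Integer.BYTES) {
      throw new IllegalArgumentException("buffer size does not match row count");
    }
    // duplicate() resets the order to BIG_ENDIAN, so set it explicitly.
    batches.add(data.duplicate().order(ByteOrder.LITTLE_ENDIAN));
    totalRows += rowCount;
  }

  // Copy every batch, in insertion order, into a single result column.
  int[] build() {
    int[] out = new int[Math.toIntExact(totalRows)];
    int pos = 0;
    for (ByteBuffer b : batches) {
      // Read through a fresh view so build() can be called more than once.
      ByteBuffer view = b.duplicate().order(ByteOrder.LITTLE_ENDIAN);
      while (view.hasRemaining()) {
        out[pos++] = view.getInt();
      }
    }
    return out;
  }

  @Override
  public void close() {
    // Drop the references; the caller still owns the underlying buffers.
    batches.clear();
    totalRows = 0;
  }

  // Example helper: pack ints into a direct little-endian buffer.
  static ByteBuffer directInts(int... values) {
    ByteBuffer buf = ByteBuffer.allocateDirect(values.length * Integer.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN);
    for (int v : values) {
      buf.putInt(v);
    }
    buf.flip();
    return buf;
  }
}
```

In this sketch the builder holds only references until `build()` runs, which mirrors the idea of deferring the copy until the final call.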

Note this only supports primitive types and Strings for now. List, Struct, Dictionary, and Decimal are not supported yet.

Signed-off-by: Thomas Graves [email protected]

@tgravescs added the Java, 4 - Needs cuDF (Java) Reviewer, and improvement labels Jan 27, 2021
@tgravescs self-assigned this Jan 27, 2021
@tgravescs requested a review from a team as a code owner January 27, 2021 02:25
@tgravescs added the non-breaking label Jan 27, 2021

codecov bot commented Jan 27, 2021

Codecov Report

Merging #7222 (35bdba3) into branch-0.18 (8860baf) will increase coverage by 0.08%.
The diff coverage is 100.00%.

@@               Coverage Diff               @@
##           branch-0.18    #7222      +/-   ##
===============================================
+ Coverage        82.09%   82.17%   +0.08%     
===============================================
  Files               97       99       +2     
  Lines            16474    16805     +331     
===============================================
+ Hits             13524    13810     +286     
- Misses            2950     2995      +45     
| Impacted Files | Coverage Δ | |
|---|---|---|
| python/cudf/cudf/_lib/__init__.py | 100.00% <ø> (ø) | |
| python/cudf/cudf/core/column/__init__.py | 100.00% <ø> (ø) | |
| python/cudf/cudf/core/column/column.py | 87.75% <ø> (-0.39%) | ⬇️ |
| python/cudf/cudf/core/column/decimal.py | 94.87% <ø> (ø) | |
| python/cudf/cudf/core/column/lists.py | 91.57% <ø> (-0.18%) | ⬇️ |
| python/cudf/cudf/core/column/numerical.py | 94.19% <ø> (-0.22%) | ⬇️ |
| python/cudf/cudf/core/column/string.py | 86.56% <ø> (-0.09%) | ⬇️ |
| python/cudf/cudf/core/dataframe.py | 90.48% <ø> (-0.23%) | ⬇️ |
| python/cudf/cudf/core/dtypes.py | 90.10% <ø> (-0.28%) | ⬇️ |
| python/cudf/cudf/core/frame.py | 89.88% <ø> (-0.10%) | ⬇️ |
| ... and 55 more | | |

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6a4c760...34962f2. Read the comment docs.

@jlowe jlowe changed the title Add JNI support for converting Arrow buffers to CUDF ColumnVectors Add JNI support for converting Arrow buffers to CUDF ColumnVectors [skip ci] Jan 27, 2021
*/
public final class ArrowColumnBuilder implements AutoCloseable {
private DType type;
private ArrayList<Long> data = new ArrayList<>();
Contributor

all these container fields can be declared final

Contributor Author

will update

Comment on lines 85 to 87
for (ColumnVector cv : allVecs) {
cv.close();
}
Contributor

nit: could use forEach method allVecs.forEach(v -> v.close());
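
As a small illustration of this nit, the sketch below shows the explicit loop and the `forEach` form side by side. `CloseAllSketch` and `TrackedResource` are hypothetical names used only for the example. The lambda or method-reference form compiles only when `close()` declares no checked exception, which is assumed here.

```java
import java.util.List;

final class CloseAllSketch {
  // Hypothetical resource that records whether close() was called.
  static final class TrackedResource implements AutoCloseable {
    boolean closed = false;

    @Override
    public void close() { // narrows AutoCloseable.close(): no checked exception
      closed = true;
    }
  }

  // Explicit loop, as in the snippet under review.
  static void closeWithLoop(List<TrackedResource> all) {
    for (TrackedResource r : all) {
      r.close();
    }
  }

  // Equivalent forEach form suggested in the comment.
  static void closeWithForEach(List<TrackedResource> all) {
    all.forEach(TrackedResource::close);
  }
}
```

Both methods visit the elements in list order and behave the same; the choice is stylistic.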


@Override
public String toString() {
StringJoiner sj = new StringJoiner(",");
Contributor

sj is unused

Contributor Author

thanks will update

private ArrayList<Long> dataLength = new ArrayList<>();
private ArrayList<Long> validity = new ArrayList<>();
private ArrayList<Long> validityLength = new ArrayList<>();
private ArrayList<Long> offsets = new ArrayList<>();
Contributor

Speaking of things getting out of sync, would it be better for these to be an array of objects instead of an object of arrays? How many of these do you expect a user to pass in? And even if it is large is the access pattern for the metadata going to be one batch at a time or all of the offsets followed by all of the nullCounts, ...?

Contributor Author

I had briefly considered making a class to hold each Arrow batch's data so we didn't have as many lists, but honestly I didn't think it all the way through and kind of forgot about it, so I'm glad you brought it up.
You are going to get one of these whenever you hit the row limit while iterating the ColumnBatches in HostColumnToGpu. That is on the Spark side at least.
I'm not sure I follow your last question. You can see how it's used below: currently we build a column per entry here and then concatenate all of the column vectors at the end.

One thing about putting this into another class and making it an array of objects is that we could then just extend that class to support nested types, and this API shouldn't have to change (hopefully?)...

Contributor

I was just trying to understand if there was a performance reason to use an object of arrays vs an array of objects. From what I have seen, because of the access pattern, an array of objects should not be a performance problem, and hopefully it will make the code a bit more readable.

Contributor Author

Actually, I could just have this take in the Arrow ValueVector and have this do the work internally.

Contributor

If we do that, then the API has to take Arrow as a dependency.

Contributor

I'm fine with how it is, it really was just a nit
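
For reference, the "array of objects" layout discussed in this thread could look roughly like the sketch below. It is not the code that was merged: `BatchListSketch` and `ArrowBatch` are hypothetical names, and the fields simply mirror the lists visible in the snippet under review (`data`, `dataLength`, `validity`, `validityLength`, `offsets`), all treated as opaque `long` values.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchListSketch {
  // Hypothetical holder for one batch: the values that the reviewed code
  // keeps in five parallel lists live together in one immutable object.
  static final class ArrowBatch {
    final long data;
    final long dataLength;
    final long validity;
    final long validityLength;
    final long offsets;

    ArrowBatch(long data, long dataLength, long validity,
               long validityLength, long offsets) {
      this.data = data;
      this.dataLength = dataLength;
      this.validity = validity;
      this.validityLength = validityLength;
      this.offsets = offsets;
    }
  }

  // A single list replaces the parallel lists, so entries cannot drift
  // out of sync with each other.
  private final List<ArrowBatch> batches = new ArrayList<>();

  void addBatch(ArrowBatch batch) {
    batches.add(batch);
  }

  int batchCount() {
    return batches.size();
  }

  // Access is one batch at a time, matching the pattern described above.
  long totalDataLength() {
    long total = 0;
    for (ArrowBatch b : batches) {
      total += b.dataLength;
    }
    return total;
  }
}
```

A holder class like this also gives one place to add fields later (for example, for nested types) without adding more parallel lists.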

Member

@jlowe jlowe left a comment

Thanks @tgravescs, looks good overall. I have some minor nits with doc and code cleanup, but nothing that is must-fix.

@tgravescs
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit cbc0394 into rapidsai:branch-0.18 Jan 28, 2021
private static ByteBuffer bufferAsDirect(ByteBuffer buf) {
ByteBuffer bufferOut = buf;
if (bufferOut != null && !bufferOut.isDirect()) {
bufferOut = ByteBuffer.allocateDirect(buf.remaining());
Contributor

Is ByteOrder an issue if buf is in a non-native byte order?
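
To make the byte-order question concrete, here is a minimal sketch of a heap-to-direct copy that carries the source buffer's order over. It is not the PR's implementation (the snippet above is truncated, and the rest of the method is not shown); `ByteOrderSketch` is a hypothetical name. The sketch relies on two documented `java.nio` behaviours: a newly allocated `ByteBuffer` starts as `BIG_ENDIAN`, and a bulk `put(ByteBuffer)` copies the raw bytes unchanged.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteOrderSketch {
  // Return buf itself if it is already direct (or null); otherwise copy its
  // remaining bytes into a new direct buffer with the same byte order.
  static ByteBuffer bufferAsDirect(ByteBuffer buf) {
    ByteBuffer out = buf;
    if (out != null && !out.isDirect()) {
      out = ByteBuffer.allocateDirect(buf.remaining());
      // Bulk put copies bytes verbatim; duplicate() keeps buf's position intact.
      out.put(buf.duplicate());
      out.flip();
      // The new buffer defaults to BIG_ENDIAN, so restore the source order
      // for any later typed reads (getInt, getLong, ...) through this view.
      out.order(buf.order());
    }
    return out;
  }
}
```

Because the copy is byte-for-byte, the order setting only affects multi-byte reads and writes made through the Java view. Code that reads the memory directly sees the bytes exactly as they were written into the source buffer.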

Labels
improvement (Improvement / enhancement to an existing function), Java (Affects Java cuDF API.), non-breaking (Non-breaking change)
4 participants