Add JNI support for converting Arrow buffers to CUDF ColumnVectors [skip ci] #7222
Conversation
Signed-off-by: Thomas Graves <[email protected]>
Codecov Report
@@             Coverage Diff              @@
##           branch-0.18    #7222      +/-   ##
===============================================
+ Coverage        82.09%   82.17%   +0.08%
===============================================
  Files               97       99       +2
  Lines            16474    16805     +331
===============================================
+ Hits             13524    13810     +286
- Misses            2950     2995      +45
Continue to review full report at Codecov.
 */
public final class ArrowColumnBuilder implements AutoCloseable {
  private DType type;
  private ArrayList<Long> data = new ArrayList<>();
all these container fields can be declared final
will update
for (ColumnVector cv : allVecs) {
  cv.close();
}
nit: could use the forEach method: `allVecs.forEach(v -> v.close());`
@Override
public String toString() {
  StringJoiner sj = new StringJoiner(",");
`sj` is unused
thanks, will update
  private ArrayList<Long> dataLength = new ArrayList<>();
  private ArrayList<Long> validity = new ArrayList<>();
  private ArrayList<Long> validityLength = new ArrayList<>();
  private ArrayList<Long> offsets = new ArrayList<>();
Speaking of things getting out of sync, would it be better for these to be an array of objects instead of an object of arrays? How many of these do you expect a user to pass in? And even if it is large is the access pattern for the metadata going to be one batch at a time or all of the offsets followed by all of the nullCounts, ...?
I had briefly considered making a class to hold the data for each Arrow batch so we didn't have as many lists, but honestly I didn't think it all the way through and kind of forgot about it, so I'm glad you brought it up.
You are going to get one of these whenever you hit the row limit while iterating the ColumnBatches in HostColumnToGpu. That is on the Spark side at least.
I'm not sure I follow your last question. You can see how it's used below: currently we build a column per entry here and then concatenate all of the column vectors at the end.
One thing about putting this into another class and making it an array of objects is that we could then just extend that class to support nested types and this API shouldn't have to change (hopefully?)...
I was just trying to understand if there was a performance reason to use an object of arrays vs an array of objects. From what I have seen of the access pattern, an array of objects should not be a performance problem, and it will hopefully make the code a bit more readable.
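As an aside, here is a minimal sketch of the array-of-objects idea being discussed, assuming one holder object per Arrow batch. The field names mirror the parallel lists in the diff, but the class itself is hypothetical and not code from this PR:

```java
// Hypothetical per-batch holder: one object per Arrow batch instead of the
// parallel lists (data, dataLength, validity, validityLength, offsets) shown
// in the diff, so the per-batch values cannot drift out of sync.
final class ArrowBatchRef {
  final long data;            // address of the Arrow data buffer
  final long dataLength;      // length of the data buffer in bytes
  final long validity;        // address of the validity (null mask) buffer
  final long validityLength;  // length of the validity buffer in bytes
  final long offsets;         // address of the offsets buffer (e.g. for strings)

  ArrowBatchRef(long data, long dataLength, long validity,
                long validityLength, long offsets) {
    this.data = data;
    this.dataLength = dataLength;
    this.validity = validity;
    this.validityLength = validityLength;
    this.offsets = offsets;
  }
}
```

The builder would then keep a single `private final List<ArrowBatchRef> batches = new ArrayList<>();`, append one entry per addBatch call, and iterate it one batch at a time, matching the access pattern described above.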
Actually, I could just have this take in the Arrow ValueVector and have it do the work internally.
If we do that, then we have to depend on Arrow as a dependency for the API.
I'm fine with how it is, it really was just a nit
Thanks @tgravescs, looks good overall. I have some minor nits with doc and code cleanup, but nothing that is a must-fix.
Co-authored-by: Jason Lowe <[email protected]>
tests and use ILLEGAL_ARG_CLASS
@gpucibot merge
private static ByteBuffer bufferAsDirect(ByteBuffer buf) {
  ByteBuffer bufferOut = buf;
  if (bufferOut != null && !bufferOut.isDirect()) {
    bufferOut = ByteBuffer.allocateDirect(buf.remaining());
Is ByteOrder an issue here, i.e. if buf is in non-native byte order?
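For what it's worth, here is a hedged sketch (not code from this PR) of one way the helper could carry the source buffer's ByteOrder over to the direct copy. The raw byte copy itself is order-agnostic; the order only matters if typed views (getInt/getLong) are taken on the copy later:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class DirectBufferOrderExample {
  // Illustrative variant of a bufferAsDirect-style helper that preserves the
  // source buffer's declared byte order on the direct copy.
  static ByteBuffer bufferAsDirect(ByteBuffer buf) {
    ByteBuffer bufferOut = buf;
    if (bufferOut != null && !bufferOut.isDirect()) {
      bufferOut = ByteBuffer.allocateDirect(buf.remaining())
          .order(buf.order());        // carry over the declared ByteOrder
      bufferOut.put(buf.duplicate()); // copy bytes without moving buf's position
      bufferOut.flip();               // make the copy readable from position 0
    }
    return bufferOut;
  }

  public static void main(String[] args) {
    ByteBuffer heap = ByteBuffer.allocate(Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    heap.putLong(42L);
    heap.flip();
    ByteBuffer direct = bufferAsDirect(heap);
    System.out.println(direct.isDirect() + " " + direct.order() + " " + direct.getLong());
  }
}
```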
This adds a JNI layer that can build up Arrow column vectors, which are just references to off-heap Arrow buffers, and then convert them into CUDF ColumnVectors by copying the Arrow data directly to the GPU.
The way this works is that you create an ArrowColumnBuilder for each column you need, call addBatch for each separate Arrow buffer you want to add to that column, and then call buildAndPutOnDevice() on the builder. That passes the Arrow pointers into CUDF, builds an Arrow table with one column, hands that table to cudf::from_arrow (which returns a CUDF table), and then grabs the single column from that table and returns it.
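To make the flow concrete, here is a hedged usage sketch. The names ArrowColumnBuilder, addBatch, and buildAndPutOnDevice come from this description, but the constructor argument and the addBatch parameter shape (rows, null count, data, validity, offsets) are assumptions for illustration, as are the placeholder buffers:

```java
import java.nio.ByteBuffer;

import ai.rapids.cudf.ArrowColumnBuilder;
import ai.rapids.cudf.ColumnVector;
import ai.rapids.cudf.DType;
import ai.rapids.cudf.HostColumnVector;

public class ArrowToCudfExample {
  public static void main(String[] args) {
    // Placeholder buffers standing in for references to off-heap Arrow buffers.
    ByteBuffer dataBuf = ByteBuffer.allocateDirect(4 * Integer.BYTES);
    ByteBuffer validityBuf = ByteBuffer.allocateDirect(1);

    // One builder per column; assumed here to take the column's data type.
    try (ArrowColumnBuilder builder =
             new ArrowColumnBuilder(new HostColumnVector.BasicType(true, DType.INT32))) {
      // One addBatch call per Arrow buffer contributing to this column.
      // The (rows, nullCount, data, validity, offsets) shape is an assumption.
      builder.addBatch(4, 0, dataBuf, validityBuf, null);

      // Hands the Arrow buffer references to CUDF, which builds a one-column
      // Arrow table, runs cudf::from_arrow, and returns the device column.
      try (ColumnVector cv = builder.buildAndPutOnDevice()) {
        System.out.println("rows on device: " + cv.getRowCount());
      }
    }
  }
}
```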
Note this only supports primitive types and Strings for now. List, Struct, Dictionary, and Decimal are not supported yet.
Signed-off-by: Thomas Graves [email protected]