Support faster copy for a custom DataSource V2 which supplies Arrow data #1622

tgravescs · 2021-01-29T03:45:06Z

This adds support for copying data to the GPU more efficiently when a DataSource V2 source provides the data as an ArrowColumnVector. The CUDF side of this has already been merged in rapidsai/cudf#7222.

This currently supports only primitive types and strings. Decimal types and nested types are not supported. It falls back to the regular copy code if it encounters one of those unsupported types.
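
To make the fallback rule concrete, here is a minimal sketch of the decision, not the plugin's actual code. `ColType`, `arrowCopySupported`, `choosePath`, and the `arrowEnabled` flag are hypothetical names used only for illustration, and deciding per batch (rather than per column) is an assumption.

```scala
object ArrowCopySupport {
  // Hypothetical stand-ins for Spark SQL data types; not the plugin's real classes.
  sealed trait ColType
  case object IntCol extends ColType
  case object LongCol extends ColType
  case object DoubleCol extends ColType
  case object BoolCol extends ColType
  case object StringCol extends ColType
  final case class DecimalCol(precision: Int, scale: Int) extends ColType
  final case class ArrayCol(element: ColType) extends ColType
  final case class StructCol(fields: Seq[ColType]) extends ColType

  // Primitives and strings can take the Arrow path; decimals and nested types cannot.
  def arrowCopySupported(t: ColType): Boolean = t match {
    case IntCol | LongCol | DoubleCol | BoolCol | StringCol => true
    case _: DecimalCol | _: ArrayCol | _: StructCol => false
  }

  sealed trait CopyPath
  case object ArrowCopy extends CopyPath
  case object RegularCopy extends CopyPath

  // Assumption: a single unsupported column sends the whole batch down the regular path.
  def choosePath(schema: Seq[ColType], arrowEnabled: Boolean): CopyPath =
    if (arrowEnabled && schema.forall(arrowCopySupported)) ArrowCopy else RegularCopy
}
```

With this sketch, a schema of `IntCol` and `StringCol` selects `ArrowCopy`, while adding a `DecimalCol` or turning the flag off selects `RegularCopy`.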

The integration tests require an extra jar that contains a DataSource V2 implementation which supplies ArrowColumnVector. I'm looking into pulling that code in and how best to automate those tests; I filed #1620 to track that.

You will notice I added a new AccessibleArrowColumnVector. This is because the Spark ArrowColumnVector doesn't make its Arrow ValueVector publicly accessible. I use reflection here to get hold of it, but another option is for users to just use AccessibleArrowColumnVector. I would like people's feedback on whether to keep that or not.
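
For readers unfamiliar with the reflection approach, the following is a self-contained sketch of the general technique. It does not touch Spark: `OpaqueColumnVector` and its `backing` field are hypothetical stand-ins for a class that hides its underlying data, and the field name is not meant to match anything in Spark's ArrowColumnVector. The field lookup is a `lazy val` so that it runs only once.

```scala
object ReflectiveAccess {
  // Hypothetical stand-in for a column vector that hides its backing data,
  // in the way Spark's ArrowColumnVector hides its Arrow ValueVector.
  final class OpaqueColumnVector(data: Array[Int]) {
    private val backing: Array[Int] = data
    def getInt(i: Int): Int = backing(i)
  }

  // Resolved on first use and cached, so the reflective lookup happens once.
  private lazy val backingField: java.lang.reflect.Field = {
    val f = classOf[OpaqueColumnVector].getDeclaredField("backing")
    f.setAccessible(true)
    f
  }

  // Reads the private field; lookup or access failures are rethrown with context.
  def backingOf(cv: OpaqueColumnVector): Array[Int] =
    try {
      backingField.get(cv).asInstanceOf[Array[Int]]
    } catch {
      case e: ReflectiveOperationException =>
        throw new IllegalStateException(
          "Could not access the backing data of the column vector", e)
    }
}
```

The trade-off is the one raised above: reflection works with the stock class but depends on a private field that can change between releases, whereas a dedicated accessible class avoids that at the cost of asking data source authors to use it.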

I had to add a shim layer for the Arrow code because Spark 3.1.0 changed the Arrow version and the ArrowBuf class is now different.
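
As a rough illustration of what a shim layer looks like, here is a sketch under stated assumptions. `OldArrowBuf`, `NewArrowBuf`, `ArrowShim`, and `shimFor` are hypothetical names, the two buffer classes are plain stand-ins, and a real shim would be compiled separately against each Arrow version instead of pattern matching on `AnyRef` in one file.

```scala
object ArrowShims {
  // Hypothetical stand-ins for two incompatible buffer classes from different Arrow versions.
  final class OldArrowBuf(val addr: Long, val len: Long)
  final class NewArrowBuf(val addr: Long, val len: Long)

  // Common interface the rest of the code calls, whatever the Spark version.
  trait ArrowShim {
    // Returns (memory address, length in bytes) of the buffer.
    def addressAndLength(buf: AnyRef): (Long, Long)
  }

  object Spark30Shim extends ArrowShim {
    def addressAndLength(buf: AnyRef): (Long, Long) = buf match {
      case b: OldArrowBuf => (b.addr, b.len)
      case other => throw new IllegalArgumentException(s"Unexpected buffer type: ${other.getClass}")
    }
  }

  object Spark31Shim extends ArrowShim {
    def addressAndLength(buf: AnyRef): (Long, Long) = buf match {
      case b: NewArrowBuf => (b.addr, b.len)
      case other => throw new IllegalArgumentException(s"Unexpected buffer type: ${other.getClass}")
    }
  }

  // Picks a shim from the major.minor part of a version string such as "3.1.0".
  def shimFor(sparkVersion: String): ArrowShim = {
    val parts = sparkVersion.split("\\.")
    val major = parts(0).toInt
    val minor = parts(1).toInt
    if (major > 3 || (major == 3 && minor >= 1)) Spark31Shim else Spark30Shim
  }
}
```

Callers only see `ArrowShim`, so the copy code itself does not need to know which buffer class is on the classpath.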

Signed-off-by: Thomas Graves <[email protected]>

tgravescs · 2021-01-29T03:47:22Z

build

tgravescs · 2021-01-29T03:50:08Z

build

tgravescs · 2021-01-29T14:26:47Z

upmerging

tgravescs · 2021-01-29T14:35:54Z

build

integration_tests/src/main/python/datasourcev2_read.py

sql-plugin/src/main/scala/com/nvidia/spark/rapids/HostColumnarToGpu.scala

tgravescs · 2021-01-29T17:01:53Z

build

revans2 · 2021-01-29T17:07:41Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/HostColumnarToGpu.scala

+              "access its Arrow ValueVector", e)
+        }
+      case av: AccessibleArrowColumnVector =>
+        // val arrowVec = av.asInstanceOf[AccessibleArrowColumnVector]


nit: I think this can be removed.

oops missed that

tgravescs · 2021-01-29T17:11:21Z

build

tgravescs and others added 30 commits January 8, 2021 09:08

Add in data source v2, csv file and stub test

6509c8a

update datasource v2 test

1777628

fix up test issues

dbbaf30

add no partitioned test

927abfe

Fix up test to properly work with in memory table datasource

39dc434

logs in hostcolumnar

78a54ff

Add test code and AccessibleArrowcolumnVector

19d9685

Fix accesible retrieval

e012476

working

29f3d25

more debug

a983d33

first go

14844f2

building

6e9fdb9

more changes

d3db4ce

fix static

fb006cb

more changes

eb1dd14

more cudf changes

1414079

debug

396a488

working checkpoint

17e446b

working without mem crash on free

47de41b

remove use of HostmMeoryBuffer

ecc953a

check null count

016bad0

changes

a342c05

Merge remote-tracking branch 'origin/branch-0.4' into sktDatasourceV2

d72f255

comment

40e4115

working

0e441ef

working

7a7fc22

Update to the new CUDF code and add a config to toggle arrow copy on and off

2f42c08

have the reflection code to access ArrowColumnVEctor working

3bdf7f1

remove logging and commonize

5b7561c

cleanup

95a9e63

tgravescs added the feature request (New feature or request) and P0 (Must have for release) labels Jan 29, 2021

tgravescs added this to the Jan 18 - Jan 29 milestone Jan 29, 2021

tgravescs self-assigned this Jan 29, 2021

comment test

f5bfeae

Signed-off-by: Thomas Graves <[email protected]>

update comment in exception

5617421

Merge remote-tracking branch 'origin/branch-0.4' into sktDatasourceV2

5b5f9db

fix merge conflicts

c68c26d

revans2 reviewed Jan 29, 2021

tgravescs added 3 commits January 29, 2021 10:28

use case, logDebug, update test to use tmp table

513cea0

make some of the reflection lazy vals so it only happens once

9b1a2fa

fix line length

0c28106

revans2 reviewed Jan 29, 2021

remove commented out line

7ce4bef

revans2 approved these changes Jan 29, 2021

tgravescs merged commit f4c912a into NVIDIA:branch-0.4 Jan 29, 2021

tgravescs mentioned this pull request Jan 29, 2021

[FEA] Support for a custom DataSource V2 which supplies Arrow data #1072

Closed

tgravescs deleted the sktDatasourceV2 branch January 29, 2021 18:58

tgravescs restored the sktDatasourceV2 branch February 10, 2021 21:34

nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021

Support faster copy for a custom DataSource V2 which supplies Arrow d…

5180d0f

…ata (NVIDIA#1622) * Add in data source v2, csv file and test for arrow copy * remove commented out line

nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021

Support faster copy for a custom DataSource V2 which supplies Arrow d…

c2b3558

…ata (NVIDIA#1622) * Add in data source v2, csv file and test for arrow copy * remove commented out line
