Add GPU metrics to GpuFileSourceScanExec #547
Conversation
Signed-off-by: Jason Lowe <[email protected]>
It looks OK to me. A lot of the scan code appears to be related to partitioning and planning how to read bucketed data. We probably should add more tests for reading bucketed data, and look at the code coverage for these areas just to be sure we didn't mess anything up, and to give us more confidence that we are doing it right in newer versions of Spark.
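A minimal sketch of the kind of bucketed-read test being suggested, assuming a running `SparkSession`; the table name, data, and assertion here are made up for illustration, and a real test would compare GPU results against CPU:

```scala
import org.apache.spark.sql.SparkSession

object BucketedReadSmokeTest {
  // Sketch only: write a tiny bucketed table and read it back so the
  // bucketed planning paths in (Gpu)FileSourceScanExec get exercised.
  def run(spark: SparkSession): Unit = {
    import spark.implicits._
    Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "name")
      .write
      .bucketBy(4, "id")   // bucketBy requires saveAsTable, not a plain save
      .sortBy("id")
      .saveAsTable("bucketed_smoke")
    val rows = spark.table("bucketed_smoke").filter($"id" > 1).collect()
    assert(rows.length == 2) // basic sanity check on the bucketed read
  }
}
```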
FYI, Spark 3.1.0 has another param in the FileSourceScanExec, optionalNumCoalescedBuckets, so we probably still need a shim of some sort if we want to keep that functionality (https://issues.apache.org/jira/browse/SPARK-31350). I was actually in the process of adding a bucketing test because I ended up copying a bunch of functions as well for the small file optimization. I was trying to use tables and it's being a bit of a pain to add; I'm seeing some weirdness with the catalog. One issue, like you mention, which we already had but this makes worse, is how we keep this code up to date and discover changes on the Spark side. One example of this is the optionalNumCoalescedBuckets: we could have worked fine without noticing that was there. We may need to think up more extensive auditing mechanisms.
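To make the shim concern concrete: the constructor gained `optionalNumCoalescedBuckets` in Spark 3.1.0, so construction has to go through a per-version layer. The sketch below uses made-up stand-ins (`ScanArgs`, the shim objects), not the plugin's actual shim API:

```scala
// Hypothetical per-version shim; only the extra 3.1.0 parameter matters here.
case class ScanArgs(
    optionalBucketSet: Option[Seq[Int]],
    optionalNumCoalescedBuckets: Option[Int]) // only meaningful on 3.1.0+

trait ScanShim {
  def buildScan(args: ScanArgs): String
}

object Spark300ScanShim extends ScanShim {
  // Spark 3.0.x has no coalesced bucketing, so the field is ignored.
  def buildScan(args: ScanArgs): String =
    s"FileSourceScanExec(buckets=${args.optionalBucketSet})"
}

object Spark310ScanShim extends ScanShim {
  // Spark 3.1.0 forwards the extra constructor parameter (SPARK-31350).
  def buildScan(args: ScanArgs): String =
    s"FileSourceScanExec(buckets=${args.optionalBucketSet}, " +
      s"coalesced=${args.optionalNumCoalescedBuckets})"
}
```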
Signed-off-by: Jason Lowe <[email protected]>
Signed-off-by: Jason Lowe <[email protected]>
@tgravescs @revans2 I found what I think is a cleaner way to implement the metrics, leveraging the fact that we know we're using a file format that we control. That seems cleaner than abusing the options map to piggyback metrics on it. This is just about ready to go; I'm running into a single, weird test failure in the Mortgage test where the first query results in zero rows. Not sure how that's happening yet.
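A rough sketch of that idea, assuming only that the plugin controls the `FileFormat` instance it plans with; `SQLMetric` is real Spark API, but the wrapper below is a guess at the shape, not the plugin's actual `GpuReadFileFormatWithMetrics`:

```scala
import org.apache.spark.sql.execution.metric.SQLMetric

// Because the plugin supplies the format instance itself, the format can
// carry the scan's metrics directly; readers built from it can update the
// SQLMetrics without encoding anything into the options map.
class FormatWithMetrics[F](
    val delegate: F,
    val metrics: Map[String, SQLMetric]) {
  def metric(name: String): Option[SQLMetric] = metrics.get(name)
}
```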
Signed-off-by: Jason Lowe <[email protected]>
Signed-off-by: Jason Lowe <[email protected]>
Thanks to @revans2, the test failure has been fixed. This is now ready for review.
build
Still reviewing; here are a few comments.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuReadFileFormatWithMetrics.scala
Signed-off-by: Jason Lowe <[email protected]>
```scala
@@ -94,6 +94,8 @@ class Spark300dbShims extends Spark300Shims {
      wrapped.requiredSchema,
      wrapped.partitionFilters,
      wrapped.optionalBucketSet,
      // TODO: Does Databricks have coalesced bucketing implemented?
```
We probably need to go look at the source code for the various runtimes; if we have a jira, we may be able to look at the release notes for each one as well. Added a comment to #502.
build
build
* Add GPU metrics to GpuFileSourceScanExec
* Extract GpuFileSourceScanExec from shims
* Pass metrics via GPU file format rather than custom options map
* Update code checking for DataSourceScanExec
* Fix scaladoc warning and unused imports
* Fix copyright

Signed-off-by: Jason Lowe <[email protected]>
…IDIA#547) Signed-off-by: spark-rapids automation <[email protected]>
Fixes #524.

This provides GPU-specific metrics to the `FileSourceScanExec` override in the plugin. Apache Spark doesn't support format-specific metrics directly, so this adds a workaround by piggybacking the metrics map onto the format options map.

This looks like it's adding a ton of code to `GpuFileSourceScanExec`, and it is, but almost all of the code comes from Spark's `FileSourceScanExec`. With this change, `GpuFileSourceScanExec` no longer tries to construct a wrapped instance of `FileSourceScanExec` (which led to a number of cases where shims needed to be built across the Spark versions), but instead provides implementations of the overridden methods directly.

Opening this PR in draft form first to get some feedback on the approach. The key part of the implementation is the new `ReaderOptionsWithMetrics` class (I'm terrible at naming things). I'm open to other suggestions on how to handle the GPU metrics. This PR only updates the Spark 3.0.0 shim version of `GpuFileSourceScanExec` for now. If the approach is approved then I will apply a similar change to the other shim versions and potentially be able to combine the implementations such that they can be extracted from the shim layer. A rough sketch of the options-map idea follows.
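A hedged sketch of the piggyback idea, assuming Scala 2.12's immutable `Map` interface; the class name matches the description above, but the body is a guess at how it could work, not the PR's actual implementation:

```scala
import org.apache.spark.sql.execution.metric.SQLMetric

// Spark hands readers a Map[String, String] of options, so a Map subclass
// can ride along carrying non-string state (the SQLMetric map) through
// Spark's existing APIs unchanged.
class ReaderOptionsWithMetrics(
    options: Map[String, String],
    val metrics: Map[String, SQLMetric]) extends Map[String, String] {
  override def get(key: String): Option[String] = options.get(key)
  override def iterator: Iterator[(String, String)] = options.iterator
  override def +[V1 >: String](kv: (String, V1)): Map[String, V1] = options + kv
  override def -(key: String): Map[String, String] = options - key
}

object ReaderOptionsWithMetrics {
  /** A reader that knows about the wrapper can recover the metrics. */
  def metricsFrom(options: Map[String, String]): Map[String, SQLMetric] =
    options match {
      case o: ReaderOptionsWithMetrics => o.metrics
      case _ => Map.empty
    }
}
```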