
Add GPU metrics to GpuFileSourceScanExec #547

Merged 8 commits into NVIDIA:branch-0.2 on Aug 14, 2020

Conversation

jlowe (Member) commented Aug 11, 2020

Fixes #524.

This provides GPU-specific metrics to the FileSourceScanExec override in the plugin. Apache Spark doesn't support format-specific metrics directly, so this adds a workaround that piggybacks the metrics map onto the format options map.

This looks like it adds a ton of code to GpuFileSourceScanExec, and it does, but almost all of it comes from Spark's FileSourceScanExec. With this change, GpuFileSourceScanExec no longer tries to construct a wrapped instance of FileSourceScanExec (which led to a number of cases where shims needed to be built across Spark versions) but instead provides implementations of the overridden methods directly.

I'm opening this PR in draft form first to get feedback on the approach. The key part of the implementation is the new ReaderOptionsWithMetrics class (I'm terrible at naming things); I'm open to other suggestions on how to handle the GPU metrics. This PR only updates the Spark 3.0.0 shim version of GpuFileSourceScanExec for now. If the approach is approved, I will apply a similar change to the other shim versions and potentially combine the implementations so they can be extracted from the shim layer.
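The options-map workaround described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the class and metric names are made up, and the real ReaderOptionsWithMetrics would integrate with Spark's reader-options plumbing.

```scala
import scala.collection.mutable

// Illustrative sketch: the metrics map rides along on the reader options,
// and the reading side fishes it back out. Names here are hypothetical.
class ReaderOptionsWithMetrics(
    val options: Map[String, String],
    val metrics: mutable.Map[String, Long])

object ScanSketch {
  // The reader checks whether the options carrier also holds metrics and
  // bumps a counter if so; a plain options map passes through unchanged.
  def readFile(path: String, opts: AnyRef): Unit = opts match {
    case m: ReaderOptionsWithMetrics =>
      m.metrics("numFilesRead") = m.metrics.getOrElse("numFilesRead", 0L) + 1L
    case _ => ()
  }
}
```

Reading two files through such a carrier would leave `numFilesRead` at 2; the awkward part, as noted above, is that the metrics are smuggled through a map meant for string options.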

@jlowe jlowe added feature request New feature or request SQL part of the SQL/Dataframe plugin labels Aug 11, 2020
@jlowe jlowe added this to the Aug 3 - Aug 14 milestone Aug 11, 2020
@jlowe jlowe self-assigned this Aug 11, 2020
revans2 (Collaborator) commented Aug 12, 2020

It looks OK to me.

A lot of the scan code appears to be related to partitioning and to planning how to read bucketed data. We should probably add more tests for reading bucketed data and look at the code coverage for these areas, both to be sure we didn't break anything and to give us more confidence that we are doing it right in newer versions of Spark.

tgravescs (Collaborator)

FYI, Spark 3.1.0 has another parameter in FileSourceScanExec, optionalNumCoalescedBuckets, so we probably still need a shim of some sort if we want to keep that functionality. (https://issues.apache.org/jira/browse/SPARK-31350)

I was actually in the process of adding a bucketing test because I ended up copying a bunch of functions as well for the small file optimization. I was trying to use tables, and it's being a bit of a pain to add; I'm seeing some weirdness with the catalog.

One issue, like you mention, which we already had but this makes worse, is how we keep this code up to date and discover changes on the Spark side. One example is optionalNumCoalescedBuckets: we could have carried on without ever noticing it was there. We may need to think up more extensive auditing mechanisms.
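The shim seam hinted at above could look roughly like this. A hedged sketch only: the case classes below are stand-ins for the real Spark plan nodes, not Spark's actual API, and the method and field names are illustrative.

```scala
// Stand-ins for FileSourceScanExec across versions: Spark 3.1.0 adds an
// optionalNumCoalescedBuckets constructor parameter (SPARK-31350).
case class Scan300(schema: String, optionalBucketSet: Option[Set[Int]])
case class Scan310(schema: String, optionalBucketSet: Option[Set[Int]],
    optionalNumCoalescedBuckets: Option[Int])

// Each per-version shim owns construction, passing only the arguments its
// Spark version understands.
trait SparkShims {
  def buildFileScan(schema: String,
      buckets: Option[Set[Int]],
      coalesced: Option[Int]): Product
}

object Spark300Shims extends SparkShims {
  // Spark 3.0 has no coalesced-bucket support, so this shim drops the argument.
  def buildFileScan(schema: String,
      buckets: Option[Set[Int]],
      coalesced: Option[Int]): Product = Scan300(schema, buckets)
}

object Spark310Shims extends SparkShims {
  def buildFileScan(schema: String,
      buckets: Option[Set[Int]],
      coalesced: Option[Int]): Product = Scan310(schema, buckets, coalesced)
}
```

The risk the comment raises is visible here: nothing forces the 3.0 shim to notice when a new parameter appears upstream, which is why extra auditing would help.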

jlowe (Member, Author) commented Aug 13, 2020

@tgravescs @revans2 I found what I think is a cleaner way to implement the metrics: it leverages the fact that we know we're using a file format that we control. That seems cleaner than abusing the options map to piggyback metrics on it.

This is just about ready to go. I'm running into a single, weird test failure in the Mortgage test where the first query returns zero rows. Not sure how that's happening yet.
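The cleaner approach described above can be sketched as follows. Again a minimal illustration under assumed names, not the plugin's real API: because the plugin constructs the GPU file format itself, the metrics can be handed to the format directly instead of being smuggled through the options map.

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical sketch of a GPU file format that carries its metrics map.
// The reader the format builds updates those metrics as it goes.
class GpuFileFormatSketch(metrics: Map[String, AtomicLong]) {
  // buildReader returns a function that "reads" one file and records metrics.
  def buildReader(): String => Unit = { path =>
    metrics.get("numFiles").foreach(_.incrementAndGet())
  }
}
```

Since the format instance is created by the plugin's own planning code, wiring the metrics in at construction time needs no cast or marker class on the options path.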

@jlowe jlowe marked this pull request as ready for review August 13, 2020 18:04
jlowe (Member, Author) commented Aug 13, 2020

Thanks to @revans2 the test failure has been fixed. This is now ready for review.

jlowe (Member, Author) commented Aug 13, 2020

build

tgravescs (Collaborator) left a comment

Still reviewing; here are a few comments.

Signed-off-by: Jason Lowe <[email protected]>
@@ -94,6 +94,8 @@ class Spark300dbShims extends Spark300Shims {
wrapped.requiredSchema,
wrapped.partitionFilters,
wrapped.optionalBucketSet,
// TODO: Does Databricks have coalesced bucketing implemented?

We probably need to go look at the source code for the various runtimes; if we have a JIRA, we may be able to look at the release notes for each one as well. Added a comment to #502.

tgravescs previously approved these changes Aug 13, 2020
jlowe (Member, Author) commented Aug 13, 2020

build

jlowe (Member, Author) commented Aug 13, 2020

build

@tgravescs tgravescs merged commit 20afca1 into NVIDIA:branch-0.2 Aug 14, 2020
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Add GPU metrics to GpuFileSourceScanExec

Signed-off-by: Jason Lowe <[email protected]>

* Extract GpuFileSourceScanExec from shims

Signed-off-by: Jason Lowe <[email protected]>

* Pass metrics via GPU file format rather than custom options map

Signed-off-by: Jason Lowe <[email protected]>

* Update code checking for DataSourceScanExec

Signed-off-by: Jason Lowe <[email protected]>

* Fix scaladoc warning and unused imports

Signed-off-by: Jason Lowe <[email protected]>

* Fix copyright

Signed-off-by: Jason Lowe <[email protected]>
@jlowe jlowe deleted the scan-metrics branch September 10, 2021 15:31
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
…IDIA#547)

Signed-off-by: spark-rapids automation <[email protected]>
Labels
feature request New feature or request SQL part of the SQL/Dataframe plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Add GPU specific metrics to GpuFileSourceScanExec
3 participants