Add GPU metrics to GpuFileSourceScanExec #547
Conversation
Signed-off-by: Jason Lowe <[email protected]>
It looks OK to me. A lot of the scan code appears to be related to partitioning and planning how to read bucketed data. We probably should add more tests for reading bucketed data, and look at the code coverage for these areas just to be sure we didn't mess anything up, and to give us more confidence that we are doing it right in newer versions of Spark.
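A minimal sketch of the kind of bucketed-read test being suggested, assuming a running `SparkSession`; the table name, data, and assertion here are made up for illustration, and a real test would compare GPU results against CPU:

```scala
import org.apache.spark.sql.SparkSession

object BucketedReadSmokeTest {
  // Sketch only: write a tiny bucketed table and read it back so the
  // bucketed planning paths in (Gpu)FileSourceScanExec get exercised.
  def run(spark: SparkSession): Unit = {
    import spark.implicits._
    Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "name")
      .write
      .bucketBy(4, "id")   // bucketBy requires saveAsTable, not a plain save
      .sortBy("id")
      .saveAsTable("bucketed_smoke")
    val rows = spark.table("bucketed_smoke").filter($"id" > 1).collect()
    assert(rows.length == 2) // basic sanity check on the bucketed read
  }
}
```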
FYI, Spark 3.1.0 has another param in the FileSourceScanExec, optionalNumCoalescedBuckets, so we probably still need a shim of some sort if we want to keep that functionality (https://issues.apache.org/jira/browse/SPARK-31350). I was actually in the process of adding a bucketing test because I ended up copying a bunch of functions as well for the small file optimization. I was trying to use tables and it's being a bit of a pain to add; I'm seeing some weirdness with the catalog. One issue, like you mention, which we already had but this makes worse, is how we keep this code up to date and discover changes on the Spark side. One example of this is the optionalNumCoalescedBuckets: we could have worked fine without noticing that was there. We may need to think up more extensive auditing mechanisms.
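To make the shim concern concrete: the constructor gained `optionalNumCoalescedBuckets` in Spark 3.1.0, so construction has to go through a per-version layer. The sketch below uses made-up stand-ins (`ScanArgs`, the shim objects), not the plugin's actual shim API:

```scala
// Hypothetical per-version shim; only the extra 3.1.0 parameter matters here.
case class ScanArgs(
    optionalBucketSet: Option[Seq[Int]],
    optionalNumCoalescedBuckets: Option[Int]) // only meaningful on 3.1.0+

trait ScanShim {
  def buildScan(args: ScanArgs): String
}

object Spark300ScanShim extends ScanShim {
  // Spark 3.0.x has no coalesced bucketing, so the field is ignored.
  def buildScan(args: ScanArgs): String =
    s"FileSourceScanExec(buckets=${args.optionalBucketSet})"
}

object Spark310ScanShim extends ScanShim {
  // Spark 3.1.0 forwards the extra constructor parameter (SPARK-31350).
  def buildScan(args: ScanArgs): String =
    s"FileSourceScanExec(buckets=${args.optionalBucketSet}, " +
      s"coalesced=${args.optionalNumCoalescedBuckets})"
}
```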
Signed-off-by: Jason Lowe <[email protected]>
Signed-off-by: Jason Lowe <[email protected]>
@tgravescs @revans2 I found what I think is a cleaner way to implement the metrics, leveraging the fact that we know we're using a file format that we control. That seems cleaner than abusing the options map to piggyback metrics on it. This is just about ready to go; I'm running into a single, weird test failure in the Mortgage test where the first query results in zero rows. Not sure how that's happening yet.
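A rough sketch of that idea, assuming only that the plugin controls the `FileFormat` instance it plans with; `SQLMetric` is real Spark API, but the wrapper below is a guess at the shape, not the plugin's actual `GpuReadFileFormatWithMetrics`:

```scala
import org.apache.spark.sql.execution.metric.SQLMetric

// Because the plugin supplies the format instance itself, the format can
// carry the scan's metrics directly; readers built from it can update the
// SQLMetrics without encoding anything into the options map.
class FormatWithMetrics[F](
    val delegate: F,
    val metrics: Map[String, SQLMetric]) {
  def metric(name: String): Option[SQLMetric] = metrics.get(name)
}
```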
Signed-off-by: Jason Lowe <[email protected]>
Signed-off-by: Jason Lowe <[email protected]>
Thanks to @revans2, the test failure has been fixed. This is now ready for review.
build
Still reviewing; here are a few comments.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuReadFileFormatWithMetrics.scala
Signed-off-by: Jason Lowe <[email protected]>
```scala
@@ -94,6 +94,8 @@ class Spark300dbShims extends Spark300Shims {
      wrapped.requiredSchema,
      wrapped.partitionFilters,
      wrapped.optionalBucketSet,
      // TODO: Does Databricks have coalesced bucketing implemented?
```
We probably need to go look at the source code for the various runtimes; if we have a jira, we may be able to look at the release notes for each one as well. Added a comment to #502.
build
build
* Add GPU metrics to GpuFileSourceScanExec
* Extract GpuFileSourceScanExec from shims
* Pass metrics via GPU file format rather than custom options map
* Update code checking for DataSourceScanExec
* Fix scaladoc warning and unused imports
* Fix copyright

Signed-off-by: Jason Lowe <[email protected]>
…IDIA#547) Signed-off-by: spark-rapids automation <[email protected]>
Fixes #524.

This provides GPU-specific metrics to the `FileSourceScanExec` override in the plugin. Apache Spark doesn't support format-specific metrics directly, so this adds a workaround by piggybacking the metrics map onto the format options map.

This looks like it's adding a ton of code to `GpuFileSourceScanExec`, and it is, but almost all of the code comes from Spark's `FileSourceScanExec`. With this change, `GpuFileSourceScanExec` no longer tries to construct a wrapped instance of `FileSourceScanExec` (which led to a number of cases where shims needed to be built across the Spark versions), but instead provides implementations of the overridden methods directly.

Opening this PR in draft form first to get some feedback on the approach. The key part of the implementation is the new `ReaderOptionsWithMetrics` class (I'm terrible at naming things). I'm open to other suggestions on how to handle the GPU metrics. This PR only updates the Spark 3.0.0 shim version of `GpuFileSourceScanExec` for now. If the approach is approved then I will apply a similar change to the other shim versions and potentially be able to combine the implementations such that they can be extracted from the shim layer. A rough sketch of the options-map idea follows.
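A hedged sketch of the piggyback idea, assuming Scala 2.12's immutable `Map` interface; the class name matches the description above, but the body is a guess at how it could work, not the PR's actual implementation:

```scala
import org.apache.spark.sql.execution.metric.SQLMetric

// Spark hands readers a Map[String, String] of options, so a Map subclass
// can ride along carrying non-string state (the SQLMetric map) through
// Spark's existing APIs unchanged.
class ReaderOptionsWithMetrics(
    options: Map[String, String],
    val metrics: Map[String, SQLMetric]) extends Map[String, String] {
  override def get(key: String): Option[String] = options.get(key)
  override def iterator: Iterator[(String, String)] = options.iterator
  override def +[V1 >: String](kv: (String, V1)): Map[String, V1] = options + kv
  override def -(key: String): Map[String, String] = options - key
}

object ReaderOptionsWithMetrics {
  /** A reader that knows about the wrapper can recover the metrics. */
  def metricsFrom(options: Map[String, String]): Map[String, SQLMetric] =
    options match {
      case o: ReaderOptionsWithMetrics => o.metrics
      case _ => Map.empty
    }
}
```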