Display Spark read metrics on Spark SQL UI #7447
Conversation
karuppayya commented on Apr 27, 2023
seqAsJavaListConverter(df.queryExecution().executedPlan().collectLeaves()).asJava();
Map<String, SQLMetric> metrics = sparkPlans.get(0).metrics();

Assert.assertTrue(metrics.contains(TotalFileSize.name));
Suggested change:
- Assert.assertTrue(metrics.contains(TotalFileSize.name));
+ Assertions.assertThat(metrics).contains(TotalFileSize.name);
Using AssertJ here is preferable, as that will show the content of the map in case the assertion ever fails.
public CustomTaskMetric[] reportDriverMetrics() {
  List<CustomTaskMetric> customTaskMetrics = Lists.newArrayList();
  MetricsReport metricsReport = sparkReadMetricReporter.getMetricsReport();
  ScanReport scanReport = (ScanReport) metricsReport;
Rather than casting here, I think it would be better to provide the necessary type of metrics report in the SparkReadMetricsReporter. This is because report(..) doesn't guarantee that you will only get a ScanReport.
public class TotalPlanningDuration implements CustomMetric {

  static final String name = "planningDuration";
Should this be totalPlanningDuration instead? Same in the description.
}

public Optional<ScanReport> getScanReport() {
  return Optional.ofNullable((ScanReport) metricsReport);
Suggested change:
- return Optional.ofNullable((ScanReport) metricsReport);
+ return metricsReport instanceof ScanReport
+     ? Optional.of((ScanReport) metricsReport)
+     : Optional.empty();
@@ -256,4 +275,56 @@ public String toString() {
      runtimeFilterExpressions,
      caseSensitive());
}

@Override
public CustomTaskMetric[] reportDriverMetrics() {
Why implement this only in SparkBatchQueryScan? Can we do this in SparkScan?
done
@@ -65,6 +82,7 @@ class SparkBatchQueryScan extends SparkPartitioningAwareScan<PartitionScanTask>
  private final Long endSnapshotId;
  private final Long asOfTimestamp;
  private final String tag;
  private final SparkReadMetricReporter sparkReadMetricReporter;
What about using Supplier<ScanReport> scanReportSupplier to make this independent from a particular metrics reporter? We can use a metricsReporter::scanReport closure to construct it.
InMemoryMetricsReporter metricsReporter = new InMemoryMetricsReporter();
...
scan.metricsReporter(metricsReporter)
...
return new SparkBatchQueryScan(..., metricsReporter::scanReport);
done
@Override
public CustomTaskMetric[] reportDriverMetrics() {
  List<CustomTaskMetric> customTaskMetrics = Lists.newArrayList();
  Optional<ScanReport> scanReportOptional = sparkReadMetricReporter.getScanReport();
Iceberg historically uses null instead of Optional and I think we should continue to follow that. Also, Spotless makes these closures really hard to read.
What about something like this?
@Override
public CustomTaskMetric[] reportDriverMetrics() {
  ScanReport scanReport = scanReportSupplier.get();

  if (scanReport == null) {
    return new CustomTaskMetric[0];
  }

  List<CustomTaskMetric> driverMetrics = Lists.newArrayList();
  driverMetrics.add(TaskTotalFileSize.from(scanReport));
  ...
  return driverMetrics.toArray(new CustomTaskMetric[0]);
}
Where TaskTotalFileSize would be defined as follows:
public class TaskTotalFileSize implements CustomTaskMetric {
  private final long value;

  public static TaskTotalFileSize from(ScanReport scanReport) {
    CounterResult counter = scanReport.scanMetrics().totalFileSizeInBytes();
    long value = counter != null ? counter.value() : -1;
    return new TaskTotalFileSize(value);
  }

  private TaskTotalFileSize(long value) {
    this.value = value;
  }

  @Override
  public String name() {
    return TotalFileSize.NAME;
  }

  @Override
  public long value() {
    return value;
  }
}
done
}

@Override
public CustomMetric[] supportedCustomMetrics() {
This overrides supportedCustomMetrics() in SparkScan and breaks it. Can we move this logic there and combine with existing metrics?
done
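For reference, a minimal sketch of what the combined method in SparkScan could look like; the pre-existing NumSplits/NumDeletes metrics and the exact set of new metric classes are assumptions for illustration, not the final implementation:

// hypothetical combined implementation in SparkScan
@Override
public CustomMetric[] supportedCustomMetrics() {
  return new CustomMetric[] {
    // task metrics SparkScan reported before this change (assumed names)
    new NumSplits(),
    new NumDeletes(),
    // driver-side scan metrics added by this PR
    new TotalFileSize(),
    new TotalPlanningDuration(),
    new ScannedDataManifests(),
    new SkippedDataFiles(),
    new ResultDataFiles()
  };
}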
@@ -420,12 +423,15 @@ private Scan buildBatchScan() {
  private Scan buildBatchScan(Long snapshotId, Long asOfTimestamp, String branch, String tag) {
    Schema expectedSchema = schemaWithMetadataColumns();

    sparkReadMetricReporter = new SparkReadMetricReporter();
Why init it here again? We should either use local vars or not init it here.
removed
@Override
public String description() {
  return "Result data files";
We should follow what Spark does in other places. Specifically, Spark metric descriptions do not start with capital letters.
done
import java.util.Arrays;
import org.apache.spark.sql.connector.metric.CustomMetric;

public class ScannedDataManifests implements CustomMetric {
Comments above apply to all metrics below.
import org.apache.iceberg.metrics.MetricsReporter;
import org.apache.iceberg.metrics.ScanReport;

public class SparkReadMetricReporter implements MetricsReporter {
Is there a better name? Shall we call it InMemoryMetricsReporter and move it to core?
  this.metricsReport = report;
}

public Optional<ScanReport> getScanReport() {
Let's not use Optional and the getXXX prefix.
public ScanReport scanReport() {
  return (ScanReport) metricsReport;
}
We can check if the value is null later.
(resolved comment on spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/metrics/TaskResultDataFiles.java)
@Override
public ScanReport get() {
  return (ScanReport) metricsReport;
There's no guarantee that this will always return a ScanReport. There are also other types of reports.
Added code to check type
This seems close. I did another round.
(resolved comment on core/src/main/java/org/apache/iceberg/metrics/InMemoryReadMetricReporter.java)
super(spark, table, scan, readConf, expectedSchema, filters);
List<Expression> filters,
Supplier<ScanReport> metricsReportSupplier) {
What about calling it scanReportSupplier to be a bit more specific?
It is actually called scanReportSupplier in other places; let's make it consistent.
List<Expression> filters) {
List<Expression> filters,
Supplier<ScanReport> metricsReportSupplier) {
  this.metricsReportSupplier = metricsReportSupplier;
Can we add this assignment as the last line in the constructors to follow the existing style? Also, let's call it scanReportSupplier.
Table table,
SparkReadConf readConf,
Supplier<ScanReport> scanReportSupplier) {
  super(spark, table, readConf, table.schema(), ImmutableList.of(), scanReportSupplier);
Shall we simply pass null here and remove the supplier from the SparkStagedScan constructor? We know there would be no metrics report available as it is a staged scan.
@@ -39,6 +40,7 @@ class SparkStagedScanBuilder implements ScanBuilder {

@Override
public Scan build() {
  return new SparkStagedScan(spark, table, readConf);
  return new SparkStagedScan(
      spark, table, readConf, (new InMemoryReadMetricReporter())::scanReport);
This change would not be needed if we pass null to the parent constructor in SparkStagedScan.
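A minimal sketch of the suggested SparkStagedScan constructor with the supplier replaced by null in the call to the parent constructor; the exact argument list is inferred from the diff above and may differ from the actual code:

class SparkStagedScan extends SparkScan {

  SparkStagedScan(SparkSession spark, Table table, SparkReadConf readConf) {
    // a staged scan never produces a scan report, so no supplier is needed
    super(spark, table, readConf, table.schema(), ImmutableList.of(), null);
  }
}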
import java.util.Arrays;
import org.apache.spark.sql.connector.metric.CustomMetric;

public class TotalFileSize implements CustomMetric {
I am not sure whether I already asked: can we extend CustomSumMetric and rely on the built-in method for aggregating the result? This applies to all our CustomMetric implementations.
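A minimal sketch of one metric rewritten on top of CustomSumMetric, which already implements aggregateTaskMetrics as a sum; the metric name and description strings follow this PR but should be treated as illustrative:

import org.apache.spark.sql.connector.metric.CustomSumMetric;

public class TotalFileSize extends CustomSumMetric {

  static final String NAME = "totalFileSize";

  @Override
  public String name() {
    return NAME;
  }

  @Override
  public String description() {
    return "total file size (bytes)";
  }

  // aggregateTaskMetrics(long[]) is inherited from CustomSumMetric,
  // so no custom aggregation logic is needed here
}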
}

List<CustomTaskMetric> driverMetrics = Lists.newArrayList();
driverMetrics.add(TaskTotalFileSize.from(scanReport));
Is there a reason why we don't include all metrics from ScanMetricsResult?
I had added all metrics from ScanMetricsResult at the point in time when the change was done. Should we add the remaining ones now or in a different PR (since this change is already almost fully reviewed)?
Here is a list of the metrics:
TimerResult totalPlanningDuration(); // DONE
CounterResult resultDataFiles(); // DONE
CounterResult resultDeleteFiles(); // MISSING
CounterResult totalDataManifests(); // MISSING
CounterResult totalDeleteManifests(); // MISSING
CounterResult scannedDataManifests(); // DONE
CounterResult skippedDataManifests(); // DONE
CounterResult totalFileSizeInBytes(); // DONE
CounterResult totalDeleteFileSizeInBytes(); // MISSING
CounterResult skippedDataFiles(); // DONE
CounterResult skippedDeleteFiles(); // MISSING
CounterResult scannedDeleteManifests(); // MISSING
CounterResult skippedDeleteManifests(); // MISSING
CounterResult indexedDeleteFiles(); // MISSING
CounterResult equalityDeleteFiles(); // MISSING
CounterResult positionalDeleteFiles(); // MISSING
Can be done in a follow-up.
@@ -96,6 +97,7 @@ public class SparkScanBuilder
  private boolean caseSensitive;
  private List<Expression> filterExpressions = null;
  private Filter[] pushedFilters = NO_FILTERS;
  private final InMemoryReadMetricReporter metricsReporter;
Minor: This should go into the block with final variables above; you may init it in the definition (up to you).
private final SparkReadConf readConf;
private final List<String> metaColumns = Lists.newArrayList();
private final InMemoryMetricsReporter metricsReporter = new InMemoryMetricsReporter();
I think this still applies.
}

@Test
public void testReadMetrics() throws NoSuchTableException {
Can we add proper tests? I think the approach is correct, so we can add tests now.
@Override
public void report(MetricsReport report) {
  this.metricsReport = (ScanReport) report;
I think this cast needs to be removed
+1, this seems to be a generic container that allows intercepting a report. I'd also call it InMemoryMetricsReporter; there is nothing specific to reads right now.
We should also use Metrics vs Metric in the name to match the interface.
+1 to both suggestions
public static TaskScannedDataManifests from(ScanReport scanReport) {
  CounterResult counter = scanReport.scanMetrics().scannedDataManifests();
  long value = counter != null ? counter.value() : -1;
Is -1 typical for Spark metrics to indicate that no data was available?
I don't see any CustomTaskMetricImpl in Spark handling missing data. Should we send 0 here? (I think "N/A" might not be a good idea, as it might convey that this data scan manifest was not relevant, WDYT)
We would probably need to check built-in Spark metrics as a reference. It will probably be 0 in other cases as Spark passes an empty array of task values and then does sum on it.
Making it 0 based on this
public class TaskScannedDataManifests implements CustomTaskMetric {
  private final long value;

  public TaskScannedDataManifests(long value) {
This should be private because there's the from(ScanReport) factory method, I would have assumed? The same applies to all the other constructors.
seqAsJavaListConverter(df.queryExecution().executedPlan().collectLeaves()).asJava();
Map<String, SQLMetric> metricsMap =
    JavaConverters.mapAsJavaMapConverter(sparkPlans.get(0).metrics()).asJava();
Assertions.assertEquals(1, metricsMap.get("scannedDataManifests").value());
We're currently moving away from JUnit assertions to AssertJ; can you please update this to Assertions.assertThat(...).isEqualTo(1)?
also would it make sense to add some more metric checks here?
Changed it to use org.assertj.core.api.Assertions
Added more checks
(resolved comment on spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java)
@@ -74,9 +76,10 @@ abstract class SparkPartitioningAwareScan<T extends PartitionScanTask> extends S
  Scan<?, ? extends ScanTask, ? extends ScanTaskGroup<?>> scan,
  SparkReadConf readConf,
  Schema expectedSchema,
  List<Expression> filters) {
  List<Expression> filters,
  Supplier<ScanReport> metricsReportSupplier) {
Minor: metricsReportSupplier -> scanReportSupplier
@@ -67,7 +84,8 @@ abstract class SparkScan implements Scan, SupportsReportStatistics {
  Table table,
  SparkReadConf readConf,
  Schema expectedSchema,
  List<Expression> filters) {
  List<Expression> filters,
  Supplier<ScanReport> metricsReportSupplier) {
Minor: metricsReportSupplier -> scanReportSupplier
@Override
public String description() {
  return "result data files";
Would it be easier to understand if we show number of scanned data files in the UI?
@Override
public String description() {
  return "num scanned data manifests";
Is Spark using number of ... for output records? If so, can we match whatever Spark does?
based on DataSourceScanExec, changing the description
(resolved comment on spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java)
import org.junit.jupiter.api.Assertions;
import scala.collection.JavaConverters;

public class TestSparkReadMetrics extends SparkTestBaseWithCatalog {
Added tests for v1 and v2 tables.
Metadata tables (apart from the positional deletes table) don't update org.apache.iceberg.SnapshotScan#scanMetrics.
Since we don't have a delete manifest metric in this PR, I am skipping the test for the positional deletes metadata table.
Final minor comments and should be good to go. We will need to add tests for CoW and MoR plans in a separate PR.
@@ -74,9 +76,10 @@ abstract class SparkPartitioningAwareScan<T extends PartitionScanTask> extends S
  Scan<?, ? extends ScanTask, ? extends ScanTaskGroup<?>> scan,
  SparkReadConf readConf,
  Schema expectedSchema,
  List<Expression> filters) {
  List<Expression> filters,
  Supplier<ScanReport> scanReportSupplier) {
Same comment about the empty line here.
(resolved comment on spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkCopyOnWriteScan.java)
(resolved comment on spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java)
}

@Override
public String aggregateTaskMetrics(long[] taskMetrics) {
This seems redundant as we extend CustomSumMetric?
Added for some debugging, removed.
import org.apache.spark.sql.connector.metric.CustomSumMetric;

public class scannedDataFiles extends CustomSumMetric {
Seems like a typo? Should start with a capital letter?
public static TaskTotalPlanningDuration from(ScanReport scanReport) {
  TimerResult timerResult = scanReport.scanMetrics().totalPlanningDuration();
  long value = timerResult != null ? timerResult.count() : -1;
Shouldn't this be totalDuration().toMillis()? Shall we also use 0 as the default value?
Keeping the default at -1 based on Spark default
I think it would be helpful to add a comment on why this particular default value was chosen, with a reference to the Spark default (here and at the other place).
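A minimal sketch of the factory method with the duration conversion and the requested comment; class and metric names follow this PR, but the code is illustrative rather than the final implementation:

public static TaskTotalPlanningDuration from(ScanReport scanReport) {
  TimerResult timerResult = scanReport.scanMetrics().totalPlanningDuration();
  // report the elapsed planning time in milliseconds;
  // -1 mirrors the default Spark uses when no value is available
  long value = timerResult != null ? timerResult.totalDuration().toMillis() : -1;
  return new TaskTotalPlanningDuration(value);
}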
@Override
public String description() {
  return "total planning duration";
What about adding info on what values we show, e.g. total planning duration (ms)?
@Override
public String description() {
  return "total file size";
What about total file size (bytes)?
JavaConverters.mapAsJavaMapConverter(sparkPlans.get(0).metrics()).asJava();
Assertions.assertThat(metricsMap.get("skippedDataFiles").value()).isEqualTo(1);
Assertions.assertThat(metricsMap.get("scannedDataManifests").value()).isEqualTo(2);
Assertions.assertThat(metricsMap.get("resultDataFiles").value()).isEqualTo(1);
I guess some of these were renamed.
The test failures don't seem to be related.
public ScanReport scanReport() {
  Preconditions.checkArgument(
      metricsReport == null || metricsReport instanceof ScanReport,
      "Metric report is not a scan report");
"Metric report is not a scan report"); | |
"Metrics report is not a scan report"); |
JavaConverters.mapAsJavaMapConverter(sparkPlans.get(0).metrics()).asJava();
Assertions.assertThat(metricsMap.get("skippedDataFiles").value()).isEqualTo(1);
Assertions.assertThat(metricsMap.get("scannedDataManifests").value()).isEqualTo(2);
Assertions.assertThat(metricsMap.get("scannedDataFiles").value()).isEqualTo(1);
can you add checks for all the other metrics here as well (even if they are 0)?
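A sketch of what the extra checks could look like, assuming the remaining metric names match the classes used in this PR; both the names and the expected values are placeholders for this particular test scenario:

Assertions.assertThat(metricsMap.get("resultDataFiles").value()).isEqualTo(1);
Assertions.assertThat(metricsMap.get("skippedDataManifests").value()).isEqualTo(0);
Assertions.assertThat(metricsMap.get("totalFileSize").value()).isGreaterThan(0);
Assertions.assertThat(metricsMap.get("totalPlanningDuration").value()).isGreaterThan(0);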
JavaConverters.mapAsJavaMapConverter(sparkPlans.get(0).metrics()).asJava();
Assertions.assertThat(metricsMap.get("skippedDataFiles").value()).isEqualTo(1);
Assertions.assertThat(metricsMap.get("scannedDataManifests").value()).isEqualTo(2);
Assertions.assertThat(metricsMap.get("scannedDataFiles").value()).isEqualTo(1);
same as above
This LGTM once CI passes. @karuppayya, could you rebase onto latest master please?
LGTM. @karuppayya, could you rebase to fix CI?
Can we also follow up to add missing metrics discussed here and add tests for row-level operations?
@@ -449,8 +453,14 @@ private Scan buildBatchScan(Long snapshotId, Long asOfTimestamp, String branch,
  }

  scan = configureSplitPlanning(scan);
Can we keep this empty line?
It is awesome to get this done, @karuppayya! Thanks for reviewing, @nastra!
@karuppayya thanks for the contributions, would you mind porting these to lower Spark versions? (I am particularly interested in Spark 3.2.) Thanks
@puchengy This cannot be ported, as it is dependent on Spark changes to support driver metrics, which are only available from Spark 3.4.
@karuppayya Thanks for sharing! It seems like too much work to first backport the Spark changes to Spark 3.2, make a release, and then backport this change to the Spark 3.2 Iceberg code.
Thanks for this PR. Could I backport this to Spark 3.3, @karuppayya?