Core: Fix skipped file counts in ManifestReader with deleted entries #8432
  }

  private CloseableIterable<ManifestEntry<F>> entries(boolean onlyLive) {
    if (hasRowFilter() || hasPartitionFilter() || partitionSet != null) {
I had to add `hasRowFilter()` and `hasPartitionFilter()` to avoid cyclomatic complexity warnings. I also think it makes the condition more readable.

      return CloseableIterable.filter(
          content == FileType.DATA_FILES
              ? scanMetrics.skippedDataFiles()
              : scanMetrics.skippedDeleteFiles(),
          open(projection(fileSchema, fileProjection, projectColumns, caseSensitive)),
          onlyLive ? filterLiveEntries(entries) : entries,
The deleted entries must be discarded before evaluating stats, primarily so that the skipped counts are correct. It should also help performance.
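To illustrate why the order matters, here is a minimal, self-contained sketch. It does not use the Iceberg API: `Entry`, `sample`, `countSkipped`, and the `maxId > 5` check are hypothetical stand-ins for a manifest entry, a manifest's contents, the skipped-file counter, and a stats-based filter. It assumes a manifest where `a.parquet` has already been marked as deleted.

```java
import java.util.Arrays;
import java.util.List;

public class SkippedCountSketch {
  // Hypothetical stand-in for a manifest entry: a live flag plus one stat value.
  public static final class Entry {
    final String path;
    final boolean live;
    final int maxId;

    public Entry(String path, boolean live, int maxId) {
      this.path = path;
      this.live = live;
      this.maxId = maxId;
    }
  }

  // Hypothetical manifest: one deleted entry and two live entries.
  public static List<Entry> sample() {
    return Arrays.asList(
        new Entry("a.parquet", false, 3),
        new Entry("b.parquet", true, 4),
        new Entry("d.parquet", true, 9));
  }

  // Counts the entries rejected by the stats check (assumed filter: id > 5).
  public static int countSkipped(List<Entry> entries, boolean dropDeletedFirst) {
    int skipped = 0;
    for (Entry entry : entries) {
      if (dropDeletedFirst && !entry.live) {
        continue; // deleted entries never reach the stats check
      }
      if (entry.maxId <= 5) {
        skipped++; // rejected by stats, so it is counted as skipped
      }
    }
    return skipped;
  }

  public static void main(String[] args) {
    System.out.println(countSkipped(sample(), false)); // prints 2: the deleted entry is counted too
    System.out.println(countSkipped(sample(), true)); // prints 1: only the live file is counted
  }
}
```

When the stats check runs first, the deleted entry inflates the skipped count even though it would never be part of the scan.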
@@ -215,19 +215,24 @@ public void scanningWithSkippedDataFiles() throws IOException {
    Table table =
        TestTables.create(
            tableDir, tableName, SCHEMA, SPEC, SortOrder.unsorted(), formatVersion, reporter);
    table.newAppend().appendFile(FILE_A).appendFile(FILE_D).commit();
    table.newAppend().appendFile(FILE_B).appendFile(FILE_C).commit();
    table.newAppend().appendFile(FILE_A).appendFile(FILE_B).appendFile(FILE_D).commit();
I need to add `FILE_B` here so that we have a manifest with one skipped data file, as the original manifest will be rewritten with `FILE_A` as deleted during the overwrite below.
@@ -236,9 +241,9 @@ public void scanningWithSkippedDataFiles() throws IOException {
    assertThat(result.resultDeleteFiles().value()).isEqualTo(0);
    assertThat(result.scannedDataManifests().value()).isEqualTo(1);
    assertThat(result.scannedDeleteManifests().value()).isEqualTo(0);
    assertThat(result.skippedDataManifests().value()).isEqualTo(1);
    assertThat(result.skippedDataManifests().value()).isEqualTo(2);
The overwrite rewrites the original manifest and adds a new one for the new data file.
  }

  private CloseableIterable<ManifestEntry<F>> entries(boolean onlyLive) {
    if (hasRowFilter() || hasPartitionFilter() || partitionSet != null) {
Do we need to consider the condition `partitionSet.isEmpty()`? If the `partitionSet` is empty, we seem to filter out all entries. Maybe we can return empty entries directly in that case. I'm not sure whether the `partitionSet` is guaranteed to be non-empty. Correct me if I'm wrong.
I believe an empty `partitionSet` indicates there is no match and acts similar to `alwaysFalse`. We could add a branch for cases when the filters are always false or the matching partition set is empty, but we would have to analyze the usages first to make sure those are possible scenarios.
Thanks.
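The short-circuit discussed in this thread could look roughly like the sketch below. It is only an illustration and not the `ManifestReader` code: `filterByPartition` is a hypothetical helper, partitions are plain strings, and it assumes that a `null` set means "no partition filter" while an empty set means "nothing matches".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class EmptyPartitionSetSketch {
  // Hypothetical helper: keeps only the partitions contained in partitionSet.
  public static List<String> filterByPartition(List<String> partitions, Set<String> partitionSet) {
    if (partitionSet == null) {
      return partitions; // assumption: null means no partition filter was set
    }
    if (partitionSet.isEmpty()) {
      return Collections.emptyList(); // nothing can match, so skip the iteration entirely
    }
    List<String> matching = new ArrayList<>();
    for (String partition : partitions) {
      if (partitionSet.contains(partition)) {
        matching.add(partition);
      }
    }
    return matching;
  }

  public static void main(String[] args) {
    List<String> partitions = Arrays.asList("p0", "p1", "p2");
    System.out.println(filterByPartition(partitions, Collections.<String>emptySet())); // prints []
    System.out.println(filterByPartition(partitions, null)); // prints [p0, p1, p2]
  }
}
```

The early return only changes how much work is done; the result is the same as evaluating the membership check against an empty set for every entry.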
    return entries(true /* only live entries */);
  }

  private CloseableIterable<ManifestEntry<F>> filterLiveEntries(
Minor: these last two methods could be static.
I did it this way so that I don't have to add `F extends ContentFile<F>` to these static methods and can reuse the bound already defined in the class instead. I could use `?` in the second one but not here.
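The trade-off can be sketched with stand-in types. Everything below is hypothetical (`Content`, `Entry`, `Reader`, and the method names are not Iceberg classes); it only shows that an instance method inherits the class-level bound, a static method returning typed entries has to redeclare it, and a static method that merely inspects an entry can take a wildcard.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BoundReuseSketch {
  // Hypothetical stand-ins for ContentFile and ManifestEntry.
  public interface Content<F> {}

  public static final class DataFile implements Content<DataFile> {}

  public static final class Entry<F extends Content<F>> {
    final F file;
    final boolean live;

    public Entry(F file, boolean live) {
      this.file = file;
      this.live = live;
    }
  }

  public static final class Reader<F extends Content<F>> {
    // Instance method: reuses the bound on F declared by the class.
    public List<Entry<F>> liveEntries(List<Entry<F>> entries) {
      List<Entry<F>> result = new ArrayList<>();
      for (Entry<F> entry : entries) {
        if (isLive(entry)) {
          result.add(entry);
        }
      }
      return result;
    }

    // Static variant: cannot see the class-level F, so it must redeclare the bound.
    public static <T extends Content<T>> List<Entry<T>> liveEntriesStatic(List<Entry<T>> entries) {
      List<Entry<T>> result = new ArrayList<>();
      for (Entry<T> entry : entries) {
        if (isLive(entry)) {
          result.add(entry);
        }
      }
      return result;
    }

    // Static helper that only inspects the entry, so a wildcard is enough.
    public static boolean isLive(Entry<?> entry) {
      return entry.live;
    }
  }

  // Hypothetical input: one live entry and one deleted entry.
  public static List<Entry<DataFile>> sample() {
    List<Entry<DataFile>> entries = new ArrayList<>();
    entries.add(new Entry<>(new DataFile(), true));
    entries.add(new Entry<>(new DataFile(), false));
    return entries;
  }

  public static void main(String[] args) {
    Reader<DataFile> reader = new Reader<>();
    System.out.println(reader.liveEntries(sample()).size()); // prints 1
    System.out.println(Reader.liveEntriesStatic(sample()).size()); // prints 1
  }
}
```

Both variants behave the same; the difference is only how much generic boilerplate each signature carries.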
This PR fixes skipped file counts in `ManifestReader` with deleted entries as discussed here.