-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: Support commits with DVs #11495
Conversation
@@ -769,6 +794,82 @@ protected void validateDataFilesExist( | |||
} | |||
} | |||
|
|||
// validates there are no concurrently added DVs for referenced data files | |||
protected void validateAddedDVs( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This simply validates there is at most one DV per file. There would be a separate PR to automatically rewrite conflicting DVs during retries. The goal is to make DVs fully usable before optimizing.
e88993e
to
52e4898
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still need to review tests but the code changes themselves overall look good, just some minor comments. Thanks @aokolnychyi!
} | ||
|
||
// builds a set of data file locations referenced by new DVs | ||
private Set<String> dvRefs() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we call this referencedDataFiles
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it's a private method so name is not a big deal
I think that dvRefs()
(or dvReferences()
) makes more sense here as it checks if the deleteFile
is a DV. So, having DV in the method name makes sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to have dv
in the name. If so, I'd prefer a shorter var name to avoid splitting statements on multiple lines.
.stopOnFailure() | ||
.throwFailureWhenFinished() | ||
.executeWith(workerPool()) | ||
.run(m -> validateAddedDVs(m, conflictDetectionFilter, newSnapshotIds, dvRefs)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Should we use the full word manifest
in the lambda?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For clarity I agree with @amogh-jahagirdar (manifest
instead of m
). I'm fine to use DV
instead of DeleteVector
but manifest
here would help 😃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was me trying to work-around Spotless formatting for closures. I updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a first quick pass. I will to another round asap. Thanks @aokolnychyi great work as usual !
@@ -139,6 +139,10 @@ protected void validate(TableMetadata base, Snapshot parent) { | |||
if (validateNewDeleteFiles) { | |||
validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, parent); | |||
} | |||
|
|||
if (base.formatVersion() >= 3 && addsDeleteFiles()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need the format version here ? Why not just using addsDeleteFiles
flag ? (just wondering).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've restructured this part a bit to compute dv refs as deletes are being added. No longer applies.
"Must not use DVs for position deletes in V2: %s", | ||
ContentFileUtil.dvDesc(file)); | ||
Preconditions.checkArgument( | ||
formatVersion() == 2 || ContentFileUtil.isDV(file), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question here, why not just checking if it's DV: we know that if it's DV it's format version 3+ right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method validates input from the user and ensures it is not possible to add a DV to a V2 table.
.stopOnFailure() | ||
.throwFailureWhenFinished() | ||
.executeWith(workerPool()) | ||
.run(m -> validateAddedDVs(m, conflictDetectionFilter, newSnapshotIds, dvRefs)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For clarity I agree with @amogh-jahagirdar (manifest
instead of m
). I'm fine to use DV
instead of DeleteVector
but manifest
here would help 😃
} | ||
|
||
// builds a set of data file locations referenced by new DVs | ||
private Set<String> dvRefs() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it's a private method so name is not a big deal
I think that dvRefs()
(or dvReferences()
) makes more sense here as it checks if the deleteFile
is a DV. So, having DV in the method name makes sense to me.
return formatVersion >= 3 ? FILE_B_DV : FILE_B_DELETES; | ||
} | ||
|
||
protected DeleteFile newDeletes(DataFile dataFile) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe newDeleteFile
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other newDeleteFile
methods below return V2 position deletes, so I wanted to differentiate.
I am OK renaming too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe callers won't care as long as you get a correct DeleteFile
for the correct formatVersion
, so I was thinking that we most likely don't need to differentiate here with the naming
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no preference, to be honest. I'll update in the next round.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went to rename and I remembered why I did it this way. I wanted a way to construct a V2 delete file independently of the format version and to not need to update the other methods that create delete files now.
@@ -78,7 +78,7 @@ public void testFileSizeSummary() { | |||
|
|||
@TestTemplate | |||
public void testFileSizeSummaryWithDeletes() { | |||
assumeThat(formatVersion).isGreaterThan(1); | |||
assumeThat(formatVersion).isEqualTo(2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you planning to update these tests eventually to run against format version 3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure. There are new tests for DVs in this class. I think this one needs V2 position deletes, we simply didn't have the validation before. V2 position deletes are not allowed in V3 tables. What do you think, @nastra?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah sorry I forgot that you already added testFileSizeSummaryWithDVs
, so this LGTM
@@ -719,7 +719,7 @@ public void testRewriteLargeManifestsEvolvedUnpartitionedV1Table() throws IOExce | |||
|
|||
@TestTemplate | |||
public void testRewriteSmallDeleteManifestsNonPartitionedTable() throws IOException { | |||
assumeThat(formatVersion).isGreaterThan(1); | |||
assumeThat(formatVersion).isEqualTo(2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI I'm updating this test class in #11485
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, will take a look!
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
Outdated
Show resolved
Hide resolved
@@ -956,6 +956,62 @@ public void testRewriteLargeDeleteManifestsPartitionedTable() throws IOException | |||
assertThat(deleteManifests).hasSizeGreaterThanOrEqualTo(2); | |||
} | |||
|
|||
@TestTemplate | |||
public void testUpgradeToV3() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel a little awkward about the scope of this test. It's testing upgrade and manifest rewrite, but there are specific aspects that are included.
I feel like there are more scenarios to cover here (e.g. writing a DV to an existing file that has positional delete after the upgrade.). I would say either refine the name to be more explicit (e.g. testUpgradeToV3WithExistingV2Deletes
) so that we can add more test cases for the upgrade path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree the name is too generic, I'll update. I did cover adding a DV to an existing file with positional delete in a previous PR here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also an extra test in TestRowDelta
. This one was for the action that rewrites manifests, but I renamed both.
Thanks for reviewing, @nastra @danielcweeks @jbonofre @amogh-jahagirdar! |
This PR adds supports for committing operations with DVs.
This work is part of #11122.