-
Notifications
You must be signed in to change notification settings - Fork 466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
db: add ingest-time splitting of ssts into virtual ones #2582
Conversation
On closer thought, we can't do the optimization of dropping We would either need another mechanism to error out compactions that straddle the excise span / newly ingested spans, or we can just hold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 18 files reviewed, 5 unresolved discussions (waiting on @itsbilal and @sumeerbhola)
ingest.go
line 617 at r2 (raw file):
meta *fileMetadata, suggestSplit bool, ) (int, *fileMetadata, error) {
[nit] name the return values and add a function level comment
ingest.go
line 1578 at r2 (raw file):
// ingestFile is the file being ingested, and splitFile is the file that // needs to be split to allow ingestFile to slot into `level` level. ingestFile, splitFile *fileMetadata
[nit] separate the fields and comments
ingest.go
line 1599 at r2 (raw file):
) error { // Note the opposite order: Unlock() followed by Lock(). d.mu.Unlock()
I'm not sure this makes sense in here, we should move to the calling code instead
ingest.go
line 1616 at r2 (raw file):
for i := range replaced { if replaced[i].Meta.Overlaps(d.cmp, s.ingestFile.Smallest.UserKey, s.ingestFile.Largest.UserKey, s.ingestFile.Largest.IsExclusiveSentinel()) { splitFile = replaced[i].Meta
We expect this to be the case at most once, right? We can assert that updatedSplitFile
is false.
ingest.go
line 1652 at r2 (raw file):
replacedFiles[splitFile.FileNum] = added for i := range added { if s.ingestFile.Overlaps(d.cmp, added[i].Meta.Smallest.UserKey, added[i].Meta.Largest.UserKey, added[i].Meta.Largest.IsExclusiveSentinel()) {
Nice sanity checks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flushing some preliminary comments
Reviewed 2 of 7 files at r2, all commit messages.
Reviewable status: 2 of 18 files reviewed, 11 unresolved discussions (waiting on @itsbilal)
compaction.go
line 1861 at r2 (raw file):
d.newIters, d.tableNewRangeKeyIter, iterOpts, d.cmp, c.version, baseLevel, d.mu.compact.inProgress, file.FileMetadata, false, /* suggestSplit */
why is suggestSplit false in this case. Don't we want to ingest as low as possible in this case too?
ingest.go
line 769 at r2 (raw file):
// latter should never happen today, but possibly some future version has // started writing FileMetadatas with key bounds that are loose on user // keys. Handle this defensively.
Can we log a warning here with the file bound details?
ingest.go
line 1591 at r2 (raw file):
// unconditionally released and then re-acquired during this method as it does // IO. func (d *DB) ingestSplit(
(note to self) read this method
ingest.go
line 1699 at r2 (raw file):
baseLevel := d.mu.versions.picker.getBaseLevel() iterOps := IterOptions{logger: d.opts.Logger} // filesToSplit is a map of files being ingested -> files being split.
nit: it's not quite a map, though I realize this comment was intending to be more abstract.
We could say something more verbose like:
// filesToSplit is a list where each element is a pair consisting of a file being ingested and an existing file that needs to be split to make room for it. Each ingested file will appear at most once in this list.
ingest.go
line 1774 at r2 (raw file):
// replacedFiles maps files excised due to exciseSpan (or splitFiles returned // by ingestTargetLevel), to files that were created to replace it. This map //
incomplete sentence.
Why is this a map in that a file that is split is replaced with multiple files?
options.go
line 527 at r2 (raw file):
// sstables into two virtual sstables to allow ingestion sstables to slot // into a lower level than they otherwise would have. IngestSplit bool
to be on the safe side, I think we should make this dynamically changeable in a running cluster.
btw, do we have a metric counting the unique backing ssts? We should be able to see if such splitting is causing the count of filemetadata / count of backing ssts to become high.
3e9295c
to
d5eaf3c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Should have addressed all the comments so far, and should have also addressed my own concerns around how this interacted with in-progress compactions. The cancel cases for compactions are getting more complex, so I'll take a stab at a unit test that better tests the cancel cases tomorrow. Otherwise this should be good to review.
Reviewable status: 0 of 18 files reviewed, 5 unresolved discussions (waiting on @sumeerbhola)
compaction.go
line 1861 at r2 (raw file):
Previously, sumeerbhola wrote…
why is suggestSplit false in this case. Don't we want to ingest as low as possible in this case too?
Done. This was because this code didn't correctly handle metrics and cancel cases for ingest-time splitting, but now it does.
ingest.go
line 769 at r2 (raw file):
Previously, sumeerbhola wrote…
Can we log a warning here with the file bound details?
Done. Rather N/A because we now do loose-file-bound virtual sstables (for online restore). Procrastination can have pros like that!
ingest.go
line 1578 at r2 (raw file):
Previously, RaduBerinde wrote…
[nit] separate the fields and comments
Done.
ingest.go
line 1599 at r2 (raw file):
Previously, RaduBerinde wrote…
I'm not sure this makes sense in here, we should move to the calling code instead
Done. (Removed for the same reason as this was removed in the IngestAndExcise
PR - it messes with the view of in-progress compactions).
ingest.go
line 1616 at r2 (raw file):
Previously, RaduBerinde wrote…
We expect this to be the case at most once, right? We can assert that
updatedSplitFile
is false.
Yes. Asserted.
ingest.go
line 1699 at r2 (raw file):
Previously, sumeerbhola wrote…
nit: it's not quite a map, though I realize this comment was intending to be more abstract.
We could say something more verbose like:
// filesToSplit is a list where each element is a pair consisting of a file being ingested and an existing file that needs to be split to make room for it. Each ingested file will appear at most once in this list.
Done.
ingest.go
line 1774 at r2 (raw file):
Previously, sumeerbhola wrote…
incomplete sentence.
Why is this a map in that a file that is split is replaced with multiple files?
Clarified in the comments. It's because the entries are FileMetadatas, and we create two FileMetadatas every time we split a file.
options.go
line 527 at r2 (raw file):
Previously, sumeerbhola wrote…
to be on the safe side, I think we should make this dynamically changeable in a running cluster.
btw, do we have a metric counting the unique backing ssts? We should be able to see if such splitting is causing the count of filemetadata / count of backing ssts to become high.
Done.
We don't have a metric right now, but we probably should have one to be able to track stuff like this - this is a good call. Will file an issue shortly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 7 files at r2, 7 of 17 files at r3, all commit messages.
Reviewable status: 8 of 18 files reviewed, 14 unresolved discussions (waiting on @itsbilal)
-- commits
line 17 at r3:
nit: can you update this slightly stale comment
ingest.go
line 807 at r3 (raw file):
// - if there is only a file overlap at a given level, and no data overlap, // we can still slot a file at that level. We return the fileMetadata with // which we have file boundary overlap (must be only one file by definition, as
is this "must be only one file by definition" stale due to loose online restore sst bounds?
ingest.go
line 921 at r3 (raw file):
// Check boundary overlap. splitFile = nil
why are we setting this to nil here -- if we had advanced targetLevel
in a previous loop iteration, we don't want to lose that in case compactions (or something else) prevent us from advancing it here? I think we need to maintain consistency via setting splitFile
exactly where we set targetLevel
, otherwise it becomes to hard to reason about.
candidateSplitFile
is nil at this point, then the following blocks will try to set and unset candidateSplitFile
and if we eventually fall through we would do
if !overlaps {
targetLevel = level
splitFile = candidateSplitFile
}
ingest.go
line 926 at r3 (raw file):
if !boundaryOverlaps.Empty() { // We are already guaranteed to not have any data overlaps with files // in boundaryOverlaps, otherwise we'd have gone into the above if
nit: otherwise we'd have returned in the above if ...
ingest.go
line 1870 at r3 (raw file):
if replaced[i].Meta.Overlaps(d.cmp, s.ingestFile.Smallest.UserKey, s.ingestFile.Largest.UserKey, s.ingestFile.Largest.IsExclusiveSentinel()) { if updatedSplitFile { panic("updated with two files in ingestSplit")
This could use a comment with a visual example. I believe this is an invariant because the various ingested files in files
are non-overlapping.
ingest.go
line 1879 at r3 (raw file):
// None of the replaced files overlapped with the file being ingested. // This is only okay if we have an excise span that overlaps with // the ingested file.
I am not sure this is limited to the excise case. If we had a file in the version that had a span [a,z) and one ingested file at [b,c) caused it to split into [a,b) and [j,z) (because the file had no keys in [b,j)), then a second ingested file with span [d,e) can slide into the gap that was left.
ingest.go
line 1895 at r3 (raw file):
// as we're guaranteed to not have any data overlap between splitFile and // s.ingestFile, so panic if we do see a newly added file with an endKey // equalling s.ingestFile.Largest, and !s.ingestFile.Largest.IsExclusiveSentinel()
I am getting a bit confused by how this tree structure gets reflected in the versionEdit. My understanding is that we can have something like:
file1 (from the existing version) is split into file11, file12. Then file12 is split into file121, file122. replacedFiles
will have file1, file12 as keys. This is so that we can traverse down the tree when someone else tries to split file1, and we need to effectively split file121. This all make sense.
d.excise
will add file1 to ve.DeletedFiles
and then we are adding it again? It will also add file11, file12 to ve.NewFiles
. Then later we will add file12 to ve.DeletedFiles
and file121, file122 to ve.NewFiles
. The versionEdit documentation says:
// A file num may be present in both deleted files and new files when it
// is moved from a lower level to a higher level (when the compaction
// found that there was no overlapping file at the higher level).
Perhaps this is working out ok, because the code happens to work, even though file12 was never in the version?
Or we could flatten this tree after all the updates to replacedFiles
are done, and at that time update the versionEdit.
Do we have tests exercising this repeated splitting (I didn't see one in testdata/ingest)?
ingest.go
line 2016 at r3 (raw file):
// check from findTargetLevel, as that requires d.mu to be held. shouldIngestSplit := d.opts.Experimental.IngestSplit != nil && d.opts.Experimental.IngestSplit() && d.FormatMajorVersion() >= ExperimentalFormatVirtualSSTables
I think we should hoist this to the top of the function, and not do this in every iteration. It is simpler that way, and we don't need the setting change to apply in the middle of an ingest.
ingest.go
line 2094 at r3 (raw file):
for m := iter.First(); m != nil; m = iter.Next() { excised, err := d.excise(exciseSpan, m, ve, level)
nit: excised is not a good name for a slice containing NewFileEntry
.
d5eaf3c
to
c54bb3a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
This should be ready for another look; I've rebased and addressed known issues so far. I'll keep seeing if there are more ways to harden it up or improve testing, but from my perspective this is looking good again.
Reviewable status: 2 of 18 files reviewed, 11 unresolved discussions (waiting on @itsbilal and @sumeerbhola)
Previously, sumeerbhola wrote…
nit: can you update this slightly stale comment
Done.
ingest.go
line 807 at r3 (raw file):
Previously, sumeerbhola wrote…
is this "must be only one file by definition" stale due to loose online restore sst bounds?
Yep. Clarified that now.
ingest.go
line 921 at r3 (raw file):
Previously, sumeerbhola wrote…
why are we setting this to nil here -- if we had advanced
targetLevel
in a previous loop iteration, we don't want to lose that in case compactions (or something else) prevent us from advancing it here? I think we need to maintain consistency via settingsplitFile
exactly where we settargetLevel
, otherwise it becomes to hard to reason about.
candidateSplitFile
is nil at this point, then the following blocks will try to set and unsetcandidateSplitFile
and if we eventually fall through we would doif !overlaps { targetLevel = level splitFile = candidateSplitFile }
Good catch! Done. I added a candidateSplitFile
for this purpose.
ingest.go
line 1870 at r3 (raw file):
Previously, sumeerbhola wrote…
This could use a comment with a visual example. I believe this is an invariant because the various ingested files in
files
are non-overlapping.
I added a longer comment and referenced the visual examples in ingestTargetLevel
, hope that's sufficient?
ingest.go
line 1879 at r3 (raw file):
Previously, sumeerbhola wrote…
I am not sure this is limited to the excise case. If we had a file in the version that had a span [a,z) and one ingested file at [b,c) caused it to split into [a,b) and [j,z) (because the file had no keys in [b,j)), then a second ingested file with span [d,e) can slide into the gap that was left.
Done. Good catch; adding a test case with this, thanks!
ingest.go
line 1895 at r3 (raw file):
Previously, sumeerbhola wrote…
I am getting a bit confused by how this tree structure gets reflected in the versionEdit. My understanding is that we can have something like:
file1 (from the existing version) is split into file11, file12. Then file12 is split into file121, file122.replacedFiles
will have file1, file12 as keys. This is so that we can traverse down the tree when someone else tries to split file1, and we need to effectively split file121. This all make sense.
d.excise
will add file1 tove.DeletedFiles
and then we are adding it again? It will also add file11, file12 tove.NewFiles
. Then later we will add file12 tove.DeletedFiles
and file121, file122 tove.NewFiles
. The versionEdit documentation says:// A file num may be present in both deleted files and new files when it // is moved from a lower level to a higher level (when the compaction // found that there was no overlapping file at the higher level).
Perhaps this is working out ok, because the code happens to work, even though file12 was never in the version?
Or we could flatten this tree after all the updates toreplacedFiles
are done, and at that time update the versionEdit.Do we have tests exercising this repeated splitting (I didn't see one in testdata/ingest)?
Very good catch. I did the flattening in ingestSplit
. Also added tests for these multi-split cases that actually traverse the tree.
ingest.go
line 2016 at r3 (raw file):
Previously, sumeerbhola wrote…
I think we should hoist this to the top of the function, and not do this in every iteration. It is simpler that way, and we don't need the setting change to apply in the middle of an ingest.
Done.
ingest.go
line 2094 at r3 (raw file):
Previously, sumeerbhola wrote…
nit: excised is not a good name for a slice containing
NewFileEntry
.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 10 of 10 files at r4, all commit messages.
Reviewable status: 12 of 18 files reviewed, 15 unresolved discussions (waiting on @itsbilal)
ingest.go
line 955 at r4 (raw file):
// produce since compaction hasn't been done yet. However, there's no need // to check since all keys in them will either be from c.startLevel or // c.outputLevel, both levels having their data overlap already tested
since we now have multi-level compactions, consider saying something like:
... be from levels in [c.startLevel, c.outputLevel], and all those levels have their data overlap ...
ingest.go
line 966 at r4 (raw file):
overlaps = true break }
I'm forgetting something now: are we doing this check since we would prefer to not have to cancel compactions? In that if we didn't do this here and ingested, we would be ok?
I guess the answer is no -- these compactions could finish before the ingest and then this ingest would be incorrect since it could try to add a file that overlaps with others in the level.
And the check we do in terms of canceling compactions when installing the version edit for the ingest is for compactions that started after the ingest decision was made, yes?
The code could use some more commentary on this.
ingest.go
line 1861 at r4 (raw file):
// replacedFiles can be thought of as a tree, where we start iterating with // s.splitFile and run its fileNum through replacedFiles, then find which // of the replaced files overlaps with s.ingestFile, check its
I think this should say something about splitFile, since its mentioned at the end of the comment. Something like:
...overlaps with s.ingestFile, which becomes the new splitFile, check splitFile's replacements ...
ingest.go
line 1863 at r4 (raw file):
// of the replaced files overlaps with s.ingestFile, check its // replacements in replacedFiles again for overlap with s.ingestFile, and // so on until we can't find the current splitFile in replacedFiles.
how about expanding this comment like:
... so on until either we can't find the current splitFile in replacedFiles (which means the splitFile should be split) or we don't find a file that overlaps with s.ingestFile (which means another file being ingested already split things enough to accommodate s.ingestFile).
Currently, if we identify boundary overlap in a level during ingest target level calculation, but no data overlap, we are forced to find a target level above the file we saw the overlap with (if we can't fall below it, such as if the existing file is in L6, which happens commonly). This change takes advantage of virtual sstables to split existing sstables into two virtual sstables when an ingested sstable would be able to go into the same level had the sstables been split that way to begin with. Doing this split reduces a lot of write-amp as it avoids us from having to compact the newly-ingested sstable with the sstable it boundary-overlapped with. Fixes cockroachdb#1683.
c54bb3a
to
1c7a374
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Reviewable status: 5 of 18 files reviewed, 15 unresolved discussions (waiting on @itsbilal and @sumeerbhola)
ingest.go
line 955 at r4 (raw file):
Previously, sumeerbhola wrote…
since we now have multi-level compactions, consider saying something like:
... be from levels in [c.startLevel, c.outputLevel], and all those levels have their data overlap ...
Done.
ingest.go
line 966 at r4 (raw file):
Previously, sumeerbhola wrote…
I'm forgetting something now: are we doing this check since we would prefer to not have to cancel compactions? In that if we didn't do this here and ingested, we would be ok?
I guess the answer is no -- these compactions could finish before the ingest and then this ingest would be incorrect since it could try to add a file that overlaps with others in the level.
And the check we do in terms of canceling compactions when installing the version edit for the ingest is for compactions that started after the ingest decision was made, yes?The code could use some more commentary on this.
This code actually predates compaction cancellation and isn't related to ingest splitting - so the understanding was that if we saw an overlapping compaction going into a level, we'd also consider that level as having overlaps with us and stay out of that level. Today we can probably remove this code but that'd come at the cost of greater compaction cancellations and/or more coordination with that compaction to not create a boundary overlap with this ingested file. Something that we can/should consider as a future expansion though.
And yes , the versionEdit install-time compaction cancellation will catch anything that gets started in the meantime (but also anything that's running with a split file as an input file, which this code won't catch. this conditional only concerns with output levels).
Added a comment here and in the compaction cancellation code, good call.
ingest.go
line 1861 at r4 (raw file):
Previously, sumeerbhola wrote…
I think this should say something about splitFile, since its mentioned at the end of the comment. Something like:
...overlaps with s.ingestFile, which becomes the new splitFile, check splitFile's replacements ...
Done.
ingest.go
line 1863 at r4 (raw file):
Previously, sumeerbhola wrote…
how about expanding this comment like:
... so on until either we can't find the current splitFile in replacedFiles (which means the splitFile should be split) or we don't find a file that overlaps with s.ingestFile (which means another file being ingested already split things enough to accommodate s.ingestFile).
Done.
Ingest-time splitting was added in cockroachdb#2582. This change adds testing for it to the metamorphic test, by flipping it off/on in random options.
Ingest-time splitting was added in #2582. This change adds testing for it to the metamorphic test, by flipping it off/on in random options.
Currently, if we identify boundary overlap in a level
during ingest target level calculation, but no data overlap,
we are forced to find a target level above the file we saw
the overlap with (if we can't fall below it, such as if the
existing file is in L6, which happens commonly).
This change takes advantage of virtual sstables to split
existing sstables into two virtual sstables when an ingested
sstable would be able to go into the same level had the sstables
been split that way to begin with. Doing this split reduces a
lot of write-amp as it avoids us from having to compact the
newly-ingested sstable with the sstable it boundary-overlapped with.
Biggest part of #1683. First commit is #2538, which this shares
a lot of logic with (mostly just the excise function).