
db: add ingest-time splitting of ssts into virtual ones #2582

Merged: 1 commit into cockroachdb:master on Sep 26, 2023

Conversation

itsbilal
Member

Currently, if we identify boundary overlap in a level
during ingest target level calculation, but no data overlap,
we are forced to find a target level above the file we saw
the overlap with (if we can't fall below it, such as if the
existing file is in L6, which happens commonly).

This change takes advantage of virtual sstables to split
existing sstables into two virtual sstables when an ingested
sstable would be able to go into the same level had the sstables
been split that way to begin with. Doing this split reduces a
lot of write-amp, as it avoids having to compact the
newly-ingested sstable with the sstable it boundary-overlapped with.

Biggest part of #1683. First commit is #2538, which this shares
a lot of logic with (mostly just the excise function).
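For intuition, the split can be sketched with plain user-key spans. The types and helper below are illustrative stand-ins, not Pebble's actual API:

```go
package main

import "fmt"

// span models an sstable's user-key bounds; both ends inclusive here for
// simplicity (Pebble's real bounds carry exclusive-sentinel information).
type span struct{ smallest, largest string }

// boundaryOverlap reports whether two spans overlap on user keys.
func boundaryOverlap(a, b span) bool {
	return a.smallest <= b.largest && b.smallest <= a.largest
}

func main() {
	existing := span{"a", "z"} // an L6 file whose keys all fall in [a,b] and [j,z]
	ingested := span{"c", "e"} // boundary overlap with existing, but no data overlap

	fmt.Println(boundaryOverlap(existing, ingested)) // true: forced into a higher level today

	// Ingest-time splitting instead replaces `existing` with two virtual
	// sstables that exclude the ingested span, letting `ingested` land in L6.
	left, right := span{"a", "b"}, span{"j", "z"}
	fmt.Println(boundaryOverlap(left, ingested), boundaryOverlap(right, ingested)) // false false
}
```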

@itsbilal itsbilal requested a review from a team May 31, 2023 21:09
@itsbilal itsbilal self-assigned this May 31, 2023
@itsbilal itsbilal requested a review from sumeerbhola May 31, 2023 21:09
@cockroach-teamcity
Member

This change is Reviewable

@itsbilal
Member Author

itsbilal commented Jun 1, 2023

On closer thought, we can't do the optimization of dropping d.mu while we call Excise: doing so doesn't guarantee that a new compaction won't get scheduled in the meantime that produces a new sstable spanning the excise span or the newly-ingested file, resulting in a panic in logAndApply when that compaction finishes. #2538 is likely also affected by this race.

We would either need another mechanism to error out compactions that straddle the excise span / newly ingested spans, or we can just hold db.mu throughout. The latter is similar to what we already do for ingestTargetLevelFunc which also does read IO while holding the mutex, so I'm inclined to just do that for now for both PRs and file a follow-up issue to optimize locking around all this read IO.

Member

@RaduBerinde RaduBerinde left a comment


Reviewable status: 0 of 18 files reviewed, 5 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


ingest.go line 617 at r2 (raw file):

	meta *fileMetadata,
	suggestSplit bool,
) (int, *fileMetadata, error) {

[nit] name the return values and add a function level comment


ingest.go line 1578 at r2 (raw file):

	// ingestFile is the file being ingested, and splitFile is the file that
	// needs to be split to allow ingestFile to slot into `level` level.
	ingestFile, splitFile *fileMetadata

[nit] separate the fields and comments


ingest.go line 1599 at r2 (raw file):

) error {
	// Note the opposite order: Unlock() followed by Lock().
	d.mu.Unlock()

I'm not sure this makes sense here; we should move it to the calling code instead.


ingest.go line 1616 at r2 (raw file):

			for i := range replaced {
				if replaced[i].Meta.Overlaps(d.cmp, s.ingestFile.Smallest.UserKey, s.ingestFile.Largest.UserKey, s.ingestFile.Largest.IsExclusiveSentinel()) {
					splitFile = replaced[i].Meta

We expect this to be the case at most once, right? We can assert that updatedSplitFile is false.


ingest.go line 1652 at r2 (raw file):

		replacedFiles[splitFile.FileNum] = added
		for i := range added {
			if s.ingestFile.Overlaps(d.cmp, added[i].Meta.Smallest.UserKey, added[i].Meta.Largest.UserKey, added[i].Meta.Largest.IsExclusiveSentinel()) {

Nice sanity checks!

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment


flushing some preliminary comments

Reviewed 2 of 7 files at r2, all commit messages.
Reviewable status: 2 of 18 files reviewed, 11 unresolved discussions (waiting on @itsbilal)


compaction.go line 1861 at r2 (raw file):

			d.newIters, d.tableNewRangeKeyIter, iterOpts, d.cmp,
			c.version, baseLevel, d.mu.compact.inProgress, file.FileMetadata,
			false, /* suggestSplit */

Why is suggestSplit false in this case? Don't we want to ingest as low as possible in this case too?


ingest.go line 769 at r2 (raw file):

				// latter should never happen today, but possibly some future version has
				// started writing FileMetadatas with key bounds that are loose on user
				// keys. Handle this defensively.

Can we log a warning here with the file bound details?


ingest.go line 1591 at r2 (raw file):

// unconditionally released and then re-acquired during this method as it does
// IO.
func (d *DB) ingestSplit(

(note to self) read this method


ingest.go line 1699 at r2 (raw file):

	baseLevel := d.mu.versions.picker.getBaseLevel()
	iterOps := IterOptions{logger: d.opts.Logger}
	// filesToSplit is a map of files being ingested -> files being split.

nit: it's not quite a map, though I realize this comment was intending to be more abstract.
We could say something more verbose like:
// filesToSplit is a list where each element is a pair consisting of a file being ingested and an existing file that needs to be split to make room for it. Each ingested file will appear at most once in this list.


ingest.go line 1774 at r2 (raw file):

	// replacedFiles maps files excised due to exciseSpan (or splitFiles returned
	// by ingestTargetLevel), to files that were created to replace it. This map
	//

Incomplete sentence.
Also, why is this a map? Is it because a file that is split is replaced with multiple files?


options.go line 527 at r2 (raw file):

		// sstables into two virtual sstables to allow ingestion sstables to slot
		// into a lower level than they otherwise would have.
		IngestSplit bool

to be on the safe side, I think we should make this dynamically changeable in a running cluster.

btw, do we have a metric counting the unique backing ssts? We should be able to see if such splitting is causing the count of filemetadata / count of backing ssts to become high.

Member Author

@itsbilal itsbilal left a comment


TFTR!

Should have addressed all the comments so far, and should have also addressed my own concerns around how this interacted with in-progress compactions. The cancel cases for compactions are getting more complex, so I'll take a stab at a unit test that better tests the cancel cases tomorrow. Otherwise this should be good to review.

Reviewable status: 0 of 18 files reviewed, 5 unresolved discussions (waiting on @sumeerbhola)


compaction.go line 1861 at r2 (raw file):

Previously, sumeerbhola wrote…

Why is suggestSplit false in this case? Don't we want to ingest as low as possible in this case too?

Done. This was because this code didn't correctly handle metrics and cancel cases for ingest-time splitting, but now it does.


ingest.go line 769 at r2 (raw file):

Previously, sumeerbhola wrote…

Can we log a warning here with the file bound details?

Done. Rather, this is N/A now that we do loose-file-bound virtual sstables (for online restore). Procrastination can have pros like that!


ingest.go line 1578 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] separate the fields and comments

Done.


ingest.go line 1599 at r2 (raw file):

Previously, RaduBerinde wrote…

I'm not sure this makes sense in here, we should move to the calling code instead

Done. (Removed for the same reason as this was removed in the IngestAndExcise PR - it messes with the view of in-progress compactions).


ingest.go line 1616 at r2 (raw file):

Previously, RaduBerinde wrote…

We expect this to be the case at most once, right? We can assert that updatedSplitFile is false.

Yes. Asserted.


ingest.go line 1699 at r2 (raw file):

Previously, sumeerbhola wrote…

nit: it's not quite a map, though I realize this comment was intending to be more abstract.
We could say something more verbose like:
// filesToSplit is a list where each element is a pair consisting of a file being ingested and an existing file that needs to be split to make room for it. Each ingested file will appear at most once in this list.

Done.


ingest.go line 1774 at r2 (raw file):

Previously, sumeerbhola wrote…

Incomplete sentence.
Also, why is this a map? Is it because a file that is split is replaced with multiple files?

Clarified in the comments. It's because the entries are FileMetadatas, and we create two FileMetadatas every time we split a file.


options.go line 527 at r2 (raw file):

Previously, sumeerbhola wrote…

to be on the safe side, I think we should make this dynamically changeable in a running cluster.

btw, do we have a metric counting the unique backing ssts? We should be able to see if such splitting is causing the count of filemetadata / count of backing ssts to become high.

Done.

We don't have a metric right now, but we probably should have one to be able to track stuff like this - this is a good call. Will file an issue shortly.

Collaborator

@sumeerbhola sumeerbhola left a comment


Reviewed 1 of 7 files at r2, 7 of 17 files at r3, all commit messages.
Reviewable status: 8 of 18 files reviewed, 14 unresolved discussions (waiting on @itsbilal)


-- commits line 17 at r3:
nit: can you update this slightly stale comment


ingest.go line 807 at r3 (raw file):

	//   - if there is only a file overlap at a given level, and no data overlap,
	//     we can still slot a file at that level. We return the fileMetadata with
	//     which we have file boundary overlap (must be only one file by definition, as

is this "must be only one file by definition" stale due to loose online restore sst bounds?


ingest.go line 921 at r3 (raw file):

		// Check boundary overlap.
		splitFile = nil

Why are we setting this to nil here? If we had advanced targetLevel in a previous loop iteration, we don't want to lose that in case compactions (or something else) prevent us from advancing it here. I think we need to maintain consistency by setting splitFile exactly where we set targetLevel; otherwise it becomes too hard to reason about.

Say candidateSplitFile is nil at this point; then the following blocks try to set and unset candidateSplitFile, and if we eventually fall through we would do

		if !overlaps {
			targetLevel = level
			splitFile = candidateSplitFile
		}

ingest.go line 926 at r3 (raw file):

		if !boundaryOverlaps.Empty() {
			// We are already guaranteed to not have any data overlaps with files
			// in boundaryOverlaps, otherwise we'd have gone into the above if

nit: otherwise we'd have returned in the above if ...


ingest.go line 1870 at r3 (raw file):

				if replaced[i].Meta.Overlaps(d.cmp, s.ingestFile.Smallest.UserKey, s.ingestFile.Largest.UserKey, s.ingestFile.Largest.IsExclusiveSentinel()) {
					if updatedSplitFile {
						panic("updated with two files in ingestSplit")

This could use a comment with a visual example. I believe this is an invariant because the various ingested files in files are non-overlapping.


ingest.go line 1879 at r3 (raw file):

				// None of the replaced files overlapped with the file being ingested.
				// This is only okay if we have an excise span that overlaps with
				// the ingested file.

I am not sure this is limited to the excise case. If we had a file in the version that had a span [a,z) and one ingested file at [b,c) caused it to split into [a,b) and [j,z) (because the file had no keys in [b,j)), then a second ingested file with span [d,e) can slide into the gap that was left.
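A quick way to see this point: with illustrative half-open spans (hypothetical types, not Pebble's), neither replacement of the original [a,z) file overlaps the second ingested file [d,e):

```go
package main

import "fmt"

// span is a half-open user-key interval [smallest, largest).
type span struct{ smallest, largest string }

// overlaps reports whether two half-open spans intersect.
func overlaps(a, b span) bool {
	return a.smallest < b.largest && b.smallest < a.largest
}

func main() {
	// file1 spanned [a,z) but had no keys in [b,j); ingesting [b,c) split it into:
	replacements := []span{{"a", "b"}, {"j", "z"}}

	// A second ingested file [d,e) slides into the gap left behind:
	second := span{"d", "e"}
	anyOverlap := false
	for _, r := range replacements {
		if overlaps(r, second) {
			anyOverlap = true
		}
	}
	fmt.Println(anyOverlap) // false: no excise span involved, yet no replacement overlaps
}
```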


ingest.go line 1895 at r3 (raw file):

		// as we're guaranteed to not have any data overlap between splitFile and
		// s.ingestFile, so panic if we do see a newly added file with an endKey
		// equalling s.ingestFile.Largest, and !s.ingestFile.Largest.IsExclusiveSentinel()

I am getting a bit confused by how this tree structure gets reflected in the versionEdit. My understanding is that we can have something like:
file1 (from the existing version) is split into file11, file12. Then file12 is split into file121, file122. replacedFiles will have file1, file12 as keys. This is so that we can traverse down the tree when someone else tries to split file1, and we need to effectively split file121. This all make sense.

d.excise will add file1 to ve.DeletedFiles and then we are adding it again? It will also add file11, file12 to ve.NewFiles. Then later we will add file12 to ve.DeletedFiles and file121, file122 to ve.NewFiles. The versionEdit documentation says:

// A file num may be present in both deleted files and new files when it  
// is moved from a lower level to a higher level (when the compaction  
// found that there was no overlapping file at the higher level).

Perhaps this is working out ok, because the code happens to work, even though file12 was never in the version?
Or we could flatten this tree after all the updates to replacedFiles are done, and at that time update the versionEdit.

Do we have tests exercising this repeated splitting (I didn't see one in testdata/ingest)?


ingest.go line 2016 at r3 (raw file):

				// check from findTargetLevel, as that requires d.mu to be held.
				shouldIngestSplit := d.opts.Experimental.IngestSplit != nil &&
					d.opts.Experimental.IngestSplit() && d.FormatMajorVersion() >= ExperimentalFormatVirtualSSTables

I think we should hoist this to the top of the function, and not do this in every iteration. It is simpler that way, and we don't need the setting change to apply in the middle of an ingest.


ingest.go line 2094 at r3 (raw file):

			for m := iter.First(); m != nil; m = iter.Next() {
				excised, err := d.excise(exciseSpan, m, ve, level)

nit: excised is not a good name for a slice containing NewFileEntry.

Member Author

@itsbilal itsbilal left a comment


TFTR!

This should be ready for another look; I've rebased and addressed known issues so far. I'll keep seeing if there are more ways to harden it up or improve testing, but from my perspective this is looking good again.

Reviewable status: 2 of 18 files reviewed, 11 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


-- commits line 17 at r3:

Previously, sumeerbhola wrote…

nit: can you update this slightly stale comment

Done.


ingest.go line 807 at r3 (raw file):

Previously, sumeerbhola wrote…

is this "must be only one file by definition" stale due to loose online restore sst bounds?

Yep. Clarified that now.


ingest.go line 921 at r3 (raw file):

Previously, sumeerbhola wrote…

Why are we setting this to nil here? If we had advanced targetLevel in a previous loop iteration, we don't want to lose that in case compactions (or something else) prevent us from advancing it here. I think we need to maintain consistency by setting splitFile exactly where we set targetLevel; otherwise it becomes too hard to reason about.

Say candidateSplitFile is nil at this point; then the following blocks try to set and unset candidateSplitFile, and if we eventually fall through we would do

		if !overlaps {
			targetLevel = level
			splitFile = candidateSplitFile
		}

Good catch! Done. I added a candidateSplitFile for this purpose.


ingest.go line 1870 at r3 (raw file):

Previously, sumeerbhola wrote…

This could use a comment with a visual example. I believe this is an invariant because the various ingested files in files are non-overlapping.

I added a longer comment and referenced the visual examples in ingestTargetLevel, hope that's sufficient?


ingest.go line 1879 at r3 (raw file):

Previously, sumeerbhola wrote…

I am not sure this is limited to the excise case. If we had a file in the version that had a span [a,z) and one ingested file at [b,c) caused it to split into [a,b) and [j,z) (because the file had no keys in [b,j)), then a second ingested file with span [d,e) can slide into the gap that was left.

Done. Good catch; adding a test case with this, thanks!


ingest.go line 1895 at r3 (raw file):

Previously, sumeerbhola wrote…

I am getting a bit confused by how this tree structure gets reflected in the versionEdit. My understanding is that we can have something like:
file1 (from the existing version) is split into file11, file12. Then file12 is split into file121, file122. replacedFiles will have file1, file12 as keys. This is so that we can traverse down the tree when someone else tries to split file1, and we need to effectively split file121. This all make sense.

d.excise will add file1 to ve.DeletedFiles and then we are adding it again? It will also add file11, file12 to ve.NewFiles. Then later we will add file12 to ve.DeletedFiles and file121, file122 to ve.NewFiles. The versionEdit documentation says:

// A file num may be present in both deleted files and new files when it  
// is moved from a lower level to a higher level (when the compaction  
// found that there was no overlapping file at the higher level).

Perhaps this is working out ok, because the code happens to work, even though file12 was never in the version?
Or we could flatten this tree after all the updates to replacedFiles are done, and at that time update the versionEdit.

Do we have tests exercising this repeated splitting (I didn't see one in testdata/ingest)?

Very good catch. I did the flattening in ingestSplit. Also added tests for these multi-split cases that actually traverse the tree.


ingest.go line 2016 at r3 (raw file):

Previously, sumeerbhola wrote…

I think we should hoist this to the top of the function, and not do this in every iteration. It is simpler that way, and we don't need the setting change to apply in the middle of an ingest.

Done.


ingest.go line 2094 at r3 (raw file):

Previously, sumeerbhola wrote…

nit: excised is not a good name for a slice containing NewFileEntry.

Done.

Collaborator

@sumeerbhola sumeerbhola left a comment


:lgtm:

Reviewed 10 of 10 files at r4, all commit messages.
Reviewable status: 12 of 18 files reviewed, 15 unresolved discussions (waiting on @itsbilal)


ingest.go line 955 at r4 (raw file):

		// produce since compaction hasn't been done yet. However, there's no need
		// to check since all keys in them will either be from c.startLevel or
		// c.outputLevel, both levels having their data overlap already tested

since we now have multi-level compactions, consider saying something like:

... be from levels in [c.startLevel, c.outputLevel], and all those levels have their data overlap ...


ingest.go line 966 at r4 (raw file):

				overlaps = true
				break
			}

I'm forgetting something now: are we doing this check because we would prefer not to have to cancel compactions? That is, if we didn't do this here and ingested, would we be ok?
I guess the answer is no -- these compactions could finish before the ingest and then this ingest would be incorrect since it could try to add a file that overlaps with others in the level.
And the check we do in terms of canceling compactions when installing the version edit for the ingest is for compactions that started after the ingest decision was made, yes?

The code could use some more commentary on this.


ingest.go line 1861 at r4 (raw file):

		// replacedFiles can be thought of as a tree, where we start iterating with
		// s.splitFile and run its fileNum through replacedFiles, then find which
		// of the replaced files overlaps with s.ingestFile, check its

I think this should say something about splitFile, since it's mentioned at the end of the comment. Something like:

...overlaps with s.ingestFile, which becomes the new splitFile, check splitFile's replacements ...


ingest.go line 1863 at r4 (raw file):

		// of the replaced files overlaps with s.ingestFile, check its
		// replacements in replacedFiles again for overlap with s.ingestFile, and
		// so on until we can't find the current splitFile in replacedFiles.

how about expanding this comment like:

... so on until either we can't find the current splitFile in replacedFiles (which means the splitFile should be split) or we don't find a file that overlaps with s.ingestFile (which means another file being ingested already split things enough to accommodate s.ingestFile).
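The traversal this comment describes might be sketched as follows (types and names illustrative, not the PR's actual code): walk down replacedFiles from splitFile until either the current file has no replacements (it must be split) or no replacement overlaps the ingested file (enough room already exists).

```go
package main

import "fmt"

// span is a half-open user-key interval [smallest, largest).
type span struct{ smallest, largest string }

func overlaps(a, b span) bool {
	return a.smallest < b.largest && b.smallest < a.largest
}

// file is a stand-in for a fileMetadata with its user-key bounds.
type file struct {
	num    int
	bounds span
}

// findSplitFile walks the replacedFiles tree from splitFile. It stops when
// the current file has no replacements (that file must be split), or when no
// replacement overlaps ingest (another ingested file already split things
// enough to accommodate it), returning ok=false in the latter case.
func findSplitFile(splitFile file, ingest span, replaced map[int][]file) (file, bool) {
	for {
		children, ok := replaced[splitFile.num]
		if !ok {
			return splitFile, true // not replaced: split this file
		}
		next, found := file{}, false
		for _, c := range children {
			if overlaps(c.bounds, ingest) {
				next, found = c, true
				break
			}
		}
		if !found {
			return file{}, false // a gap already exists; nothing to split
		}
		splitFile = next
	}
}

func main() {
	replaced := map[int][]file{1: {{11, span{"a", "b"}}, {12, span{"j", "z"}}}}
	f, ok := findSplitFile(file{1, span{"a", "z"}}, span{"k", "m"}, replaced)
	fmt.Println(f.num, ok) // ingest [k,m) overlaps replacement file12, which must be split
}
```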

Currently, if we identify boundary overlap in a level
during ingest target level calculation, but no data overlap,
we are forced to find a target level above the file we saw
the overlap with (if we can't fall below it, such as if the
existing file is in L6, which happens commonly).

This change takes advantage of virtual sstables to split
existing sstables into two virtual sstables when an ingested
sstable would be able to go into the same level had the sstables
been split that way to begin with. Doing this split reduces a
lot of write-amp, as it avoids having to compact the
newly-ingested sstable with the sstable it boundary-overlapped with.

Fixes cockroachdb#1683.
Member Author

@itsbilal itsbilal left a comment


TFTR!

Reviewable status: 5 of 18 files reviewed, 15 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


ingest.go line 955 at r4 (raw file):

Previously, sumeerbhola wrote…

since we now have multi-level compactions, consider saying something like:

... be from levels in [c.startLevel, c.outputLevel], and all those levels have their data overlap ...

Done.


ingest.go line 966 at r4 (raw file):

Previously, sumeerbhola wrote…

I'm forgetting something now: are we doing this check because we would prefer not to have to cancel compactions? That is, if we didn't do this here and ingested, would we be ok?
I guess the answer is no -- these compactions could finish before the ingest and then this ingest would be incorrect since it could try to add a file that overlaps with others in the level.
And the check we do in terms of canceling compactions when installing the version edit for the ingest is for compactions that started after the ingest decision was made, yes?

The code could use some more commentary on this.

This code actually predates compaction cancellation and isn't related to ingest splitting; the understanding was that if we saw an overlapping compaction going into a level, we'd also consider that level as having overlaps with us and stay out of that level. Today we can probably remove this code, but that'd come at the cost of more compaction cancellations and/or more coordination with that compaction to not create a boundary overlap with this ingested file. Something we can/should consider as a future expansion, though.

And yes, the versionEdit install-time compaction cancellation will catch anything that gets started in the meantime (and also anything that's running with a split file as an input file, which this code won't catch; this conditional only concerns output levels).

Added a comment here and in the compaction cancellation code, good call.


ingest.go line 1861 at r4 (raw file):

Previously, sumeerbhola wrote…

I think this should say something about splitFile, since its mentioned at the end of the comment. Something like:

...overlaps with s.ingestFile, which becomes the new splitFile, check splitFile's replacements ...

Done.


ingest.go line 1863 at r4 (raw file):

Previously, sumeerbhola wrote…

how about expanding this comment like:

... so on until either we can't find the current splitFile in replacedFiles (which means the splitFile should be split) or we don't find a file that overlaps with s.ingestFile (which means another file being ingested already split things enough to accommodate s.ingestFile).

Done.

@itsbilal itsbilal merged commit 14b8ccd into cockroachdb:master Sep 26, 2023
11 checks passed
itsbilal added a commit to itsbilal/pebble that referenced this pull request Sep 26, 2023
Ingest-time splitting was added in cockroachdb#2582. This change adds
testing for it to the metamorphic test, by flipping it
off/on in random options.
itsbilal added a commit that referenced this pull request Sep 27, 2023
Ingest-time splitting was added in #2582. This change adds
testing for it to the metamorphic test, by flipping it
off/on in random options.