Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver,storage: ingest small snapshot as writes #110943

Merged
merged 1 commit into from
Oct 3, 2023

Conversation

sumeerbhola
Copy link
Collaborator

Small snapshots cause LSM overload by resulting in many tiny memtable flushes, which result in high sub-level count, which then needs to be compensated by running many inefficient compactions from L0 to Lbase. Despite some compaction scoring changes, we have not been able to fully eliminate impact of this in foreground traffic as discussed in cockroachdb/pebble#2832 (comment).

Fixes #109808

Epic: none

Release note (ops change): The cluster setting
kv.snapshot.ingest_as_write_threshold controls the size threshold below which snapshots are converted to regular writes. It defaults to 100KiB.

@blathers-crl
Copy link

blathers-crl bot commented Sep 19, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@itsbilal I owe tests, but would like a quick opinion about the structure before adding them.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal)

@sumeerbhola sumeerbhola force-pushed the small_snap branch 3 times, most recently from 0bd4625 to d4f9420 Compare September 20, 2023 02:38
Copy link
Contributor

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall design LGTM

Reviewed 3 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/storage/engine.go line 1058 at r1 (raw file):

	// a WriteBatch and commits the batch with sync=true. The files represented
	// in paths must not be overlapping -- this is the same contract as
	// IngestLocalFiles*. Additionally, exciseSpans represents the spans which

nit: maybe exciseSpans can be called clearedSpans as that's what the rest of the code is using, and it's more logically equivalent as in the excise case we won't be excising all of these spans, just one of them.


pkg/storage/pebble.go line 2420 at r1 (raw file):

		}
	}
	valid, err := iter.SeekEngineKeyGE(EngineKey{Key: roachpb.KeyMin})

I wonder if the real solution is to surface some sort of internal iterator here that exposes the rangedels as well. That would be less brittle and less overhead than passing around the cleared ranges separately. For now I'm good with this to get the performance win, but it's something we can probably track separately (and won't even be that hard to do). Maybe we can do a ScanInternalSST that uses the scanInternalIterator and takes in visitor functions?

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal)


pkg/storage/pebble.go line 2420 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I wonder if the real solution is to surface some sort of internal iterator here that exposes the rangedels as well. That would be less brittle and less overhead than passing around the cleared ranges separately. For now I'm good with this to get the performance win, but it's something we can probably track separately (and won't even be that hard to do). Maybe we can do a ScanInternalSST that uses the scanInternalIterator and takes in visitor functions?

I started with that idea. The difficulty I was having with it is that we want to easily extend this to the disagg storage case which has a stack of ssts: if we are going to ingest that as a batch, it's nice to have the external Iterator behavior where everything is collapsed down. We really only want the RANGEDELs and the RANGEKEYDELs we have written in the process of clearing LSM state for this incoming snapshot.

@sumeerbhola sumeerbhola force-pushed the small_snap branch 3 times, most recently from db2ac9c to 4ef5f3c Compare September 24, 2023 22:12
@sumeerbhola sumeerbhola marked this pull request as ready for review September 24, 2023 22:13
@sumeerbhola sumeerbhola requested a review from a team as a code owner September 24, 2023 22:13
@sumeerbhola sumeerbhola requested a review from a team September 24, 2023 22:13
@sumeerbhola sumeerbhola requested a review from a team as a code owner September 24, 2023 22:13
Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test added. This is ready for full review.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal)

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 5 files at r1, 5 of 5 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @sumeerbhola)


pkg/kv/kvserver/replica_raftstorage.go line 51 at r2 (raw file):

		"kv.snapshot.ingest_as_write_threshold",
		100<<10, /* default value is 100KiB */
		1<<30,   /* metamorphic value is 1GiB */

nit: Consider using ConstantWithMetamorphicTestChoice() to list a few representative values (e.g. 0 bytes and 1 GB). You may want to make that helper generic too, to avoid the interface assertions.


pkg/kv/kvserver/replica_raftstorage.go line 364 at r2 (raw file):

	// cleared by doing the Ingest*. This is tracked so that we can convert the
	// ssts into a WriteBatch if the total size of the ssts is small.
	clearedSpans []roachpb.Span

Something smells a bit wrong about both passing in inSnap.clearedSpans and then mutating it during application. This isn't really something that should ever diverge from the descriptor either, and passing it separately muddies that relationship. Should we derive this from the descriptor during application via MakeReplicatedKeySpans() and keep is as a local?


pkg/kv/kvserver/replica_raftstorage.go line 657 at r2 (raw file):

		} else {
			appliedAsWrite = true
			err := r.store.TODOEngine().ConvertFilesToBatchAndCommit(

Is it ok that we're not contributing ingest stats here? In principle, operators could set the threshold to e.g. 1 GB, or we could be ingesting lots of these. Would this then bypass admission control entirely?


pkg/storage/pebble.go line 2395 at r2 (raw file):

		f, err := p.FS.Open(fileName)
		if err != nil {
			closeFiles()

nit: consider a brief comment saying that NewSSTEngineIterator takes ownership of the file handles (even in the error case), so we only need to close prior to calling it.

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @itsbilal)


pkg/kv/kvserver/replica_raftstorage.go line 657 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Is it ok that we're not contributing ingest stats here? In principle, operators could set the threshold to e.g. 1 GB, or we could be ingesting lots of these. Would this then bypass admission control entirely?

This was a poor oversight -- glad you noticed!
The bypassing part is fine, since snapshot ingestion also bypasses. But snapshots are special in that we don't have them consume tokens either, so that they don't affect foreground latency. But we do account for them in the AC linear models. Fixing this properly will require some plumbing changes in AC.
I'll ping when this is ready for another look.

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this pull request Sep 27, 2023
This is needed for cockroachdb#110943
since we are sometime applying incoming snapshots as normal writes.
As a reminder, we do not deduct tokens for data ingested/written to the
LSM for range snapshots since this can starve out normal writes. We
also compensate for these ignored bytes when fitting the linear model.
The change in that compensation logic is what this PR addresses.

Informs cockroachdb#109808

Epic: none

Release note: None
craig bot pushed a commit that referenced this pull request Sep 28, 2023
111367: admission: allow range snapshot ignored bytes to be normal writes r=aadityasondhi a=sumeerbhola

This is needed for #110943 since we are sometime applying incoming snapshots as normal writes. As a reminder, we do not deduct tokens for data ingested/written to the LSM for range snapshots since this can starve out normal writes. We also compensate for these ignored bytes when fitting the linear model. The change in that compensation logic is what this PR addresses.

Informs #109808

Epic: none

Release note: None

111398: kv: include replicated locks in replica consistency checks r=nvanbenschoten a=nvanbenschoten

Fixes #111294.

This commit adds handling of replicated locks in replicate consistency checks. They are iterated over as part of MVCCStats calculations and hashed similarly to point keys.

The commit also ensures that the iteration is properly throttled.

Release note: None

Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @itsbilal)


pkg/kv/kvserver/replica_raftstorage.go line 51 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: Consider using ConstantWithMetamorphicTestChoice() to list a few representative values (e.g. 0 bytes and 1 GB). You may want to make that helper generic too, to avoid the interface assertions.

Done


pkg/kv/kvserver/replica_raftstorage.go line 364 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Something smells a bit wrong about both passing in inSnap.clearedSpans and then mutating it during application. This isn't really something that should ever diverge from the descriptor either, and passing it separately muddies that relationship. Should we derive this from the descriptor during application via MakeReplicatedKeySpans() and keep is as a local?

I don't want to have 2 places derive this separately from the descriptor since we can update one and forget to update the other. So I prefer to initialize this in the same place that is creating the newMultiSSTWriter.

Regarding the modifications to add more spans later, it is fair to not modify inSnap.clearedSpans. I've changed it to do a clearedSpans := inSnap.clearedSpans first and then modify clearedSpans.


pkg/storage/engine.go line 1058 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

nit: maybe exciseSpans can be called clearedSpans as that's what the rest of the code is using, and it's more logically equivalent as in the excise case we won't be excising all of these spans, just one of them.

Done


pkg/storage/pebble.go line 2395 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: consider a brief comment saying that NewSSTEngineIterator takes ownership of the file handles (even in the error case), so we only need to close prior to calling it.

The actual situation seems more complicated. I've added a TODO below.
Our testing of error paths is poor.

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @itsbilal)


pkg/kv/kvserver/replica_raftstorage.go line 657 at r2 (raw file):

Previously, sumeerbhola wrote…

This was a poor oversight -- glad you noticed!
The bypassing part is fine, since snapshot ingestion also bypasses. But snapshots are special in that we don't have them consume tokens either, so that they don't affect foreground latency. But we do account for them in the AC linear models. Fixing this properly will require some plumbing changes in AC.
I'll ping when this is ready for another look.

This is fixed. The additional AC plumbing was added in #111367

Small snapshots cause LSM overload by resulting in many tiny memtable
flushes, which result in high sub-level count, which then needs to be
compensated by running many inefficient compactions from L0 to Lbase.
Despite some compaction scoring changes, we have not been able to
fully eliminate impact of this in foreground traffic as discussed in
cockroachdb/pebble#2832 (comment).

Fixes cockroachdb#109808

Epic: none

Release note (ops change): The cluster setting
kv.snapshot.ingest_as_write_threshold controls the size threshold below
which snapshots are converted to regular writes. It defaults to 100KiB.
Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 6 of 6 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal)

@sumeerbhola
Copy link
Collaborator Author

TFTRs!

@sumeerbhola
Copy link
Collaborator Author

bors r=itsbilal,erikgrinaker

@craig
Copy link
Contributor

craig bot commented Oct 3, 2023

Build succeeded:

@craig craig bot merged commit e9bace3 into cockroachdb:master Oct 3, 2023
THardy98 pushed a commit to THardy98/cockroach that referenced this pull request Oct 6, 2023
This is needed for cockroachdb#110943
since we are sometime applying incoming snapshots as normal writes.
As a reminder, we do not deduct tokens for data ingested/written to the
LSM for range snapshots since this can starve out normal writes. We
also compensate for these ignored bytes when fitting the linear model.
The change in that compensation logic is what this PR addresses.

Informs cockroachdb#109808

Epic: none

Release note: None
aliher1911 pushed a commit to aliher1911/cockroach that referenced this pull request Oct 9, 2023
This is needed for cockroachdb#110943
since we are sometime applying incoming snapshots as normal writes.
As a reminder, we do not deduct tokens for data ingested/written to the
LSM for range snapshots since this can starve out normal writes. We
also compensate for these ignored bytes when fitting the linear model.
The change in that compensation logic is what this PR addresses.

Informs cockroachdb#109808

Epic: none

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kv: do not apply small snapshots as storage ingests
4 participants