kvserver,storage: ingest small snapshot as writes #110943
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
@itsbilal I owe tests, but would like a quick opinion about the structure before adding them.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal)
Force-pushed from 0bd4625 to d4f9420
Overall design LGTM
Reviewed 3 of 5 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/storage/engine.go
line 1058 at r1 (raw file):
// a WriteBatch and commits the batch with sync=true. The files represented
// in paths must not be overlapping -- this is the same contract as
// IngestLocalFiles*. Additionally, exciseSpans represents the spans which
nit: maybe exciseSpans can be called clearedSpans as that's what the rest of the code is using, and it's more logically equivalent as in the excise case we won't be excising all of these spans, just one of them.
pkg/storage/pebble.go
line 2420 at r1 (raw file):
		}
	}
	valid, err := iter.SeekEngineKeyGE(EngineKey{Key: roachpb.KeyMin})
I wonder if the real solution is to surface some sort of internal iterator here that exposes the rangedels as well. That would be less brittle and less overhead than passing around the cleared ranges separately. For now I'm good with this to get the performance win, but it's something we can probably track separately (and won't even be that hard to do). Maybe we can do a ScanInternalSST that uses the scanInternalIterator and takes in visitor functions?
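To make the visitor idea concrete, here is a minimal standalone sketch of what such a call shape could look like. The function name scanInternalSST, the visitor signatures, and the plain []byte keys are placeholders for illustration only, not the actual Pebble or CockroachDB storage API.

```go
package main

import "fmt"

type pointVisitor func(key, value []byte) error
type rangeDelVisitor func(start, end []byte) error

// scanInternalSST would walk the given SSTs with an iterator that still
// surfaces RANGEDEL (and RANGEKEYDEL) entries, dispatching each internal
// entry to the matching visitor instead of returning a flattened view.
// The body is elided here; only the call shape is shown.
func scanInternalSST(paths []string, onPoint pointVisitor, onRangeDel rangeDelVisitor) error {
	_ = paths
	_ = onPoint
	_ = onRangeDel
	return nil
}

func main() {
	err := scanInternalSST(
		[]string{"000123.sst"},
		func(k, v []byte) error { fmt.Printf("point %q (%d bytes)\n", k, len(v)); return nil },
		func(s, e []byte) error { fmt.Printf("rangedel [%q, %q)\n", s, e); return nil },
	)
	if err != nil {
		fmt.Println("scan failed:", err)
	}
}
```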
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal)
pkg/storage/pebble.go
line 2420 at r1 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
I wonder if the real solution is to surface some sort of internal iterator here that exposes the rangedels as well. That would be less brittle and less overhead than passing around the cleared ranges separately. For now I'm good with this to get the performance win, but it's something we can probably track separately (and won't even be that hard to do). Maybe we can do a ScanInternalSST that uses the scanInternalIterator and takes in visitor functions?
I started with that idea. The difficulty I was having with it is that we want to easily extend this to the disagg storage case which has a stack of ssts: if we are going to ingest that as a batch, it's nice to have the external Iterator behavior where everything is collapsed down. We really only want the RANGEDELs and the RANGEKEYDELs we have written in the process of clearing LSM state for this incoming snapshot.
Force-pushed from db2ac9c to 4ef5f3c
Test added. This is ready for full review.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal)
Reviewed 1 of 5 files at r1, 5 of 5 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @sumeerbhola)
pkg/kv/kvserver/replica_raftstorage.go
line 51 at r2 (raw file):
"kv.snapshot.ingest_as_write_threshold", 100<<10, /* default value is 100KiB */ 1<<30, /* metamorphic value is 1GiB */
nit: Consider using ConstantWithMetamorphicTestChoice() to list a few representative values (e.g. 0 bytes and 1 GB). You may want to make that helper generic too, to avoid the interface assertions.
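For readers unfamiliar with the metamorphic-constant pattern being suggested, a rough generic sketch follows. The helper name, the metamorphicEnabled flag, and the use of math/rand are assumptions for illustration; the real CockroachDB utility differs in its details.

```go
package main

import (
	"fmt"
	"math/rand"
)

// constantWithTestChoice returns defaultVal in normal builds; when
// metamorphic testing is enabled it picks one of the representative values
// at random, so tests exercise the extremes (e.g. 0 bytes and 1 GiB).
func constantWithTestChoice[T any](metamorphicEnabled bool, defaultVal T, choices ...T) T {
	if !metamorphicEnabled || len(choices) == 0 {
		return defaultVal
	}
	all := append([]T{defaultVal}, choices...)
	return all[rand.Intn(len(all))]
}

func main() {
	// Production-like call: metamorphic testing disabled, so the default wins.
	threshold := constantWithTestChoice[int64](false, 100<<10 /* 100 KiB */, 0, 1<<30 /* 1 GiB */)
	fmt.Println("ingest-as-write threshold:", threshold)
}
```

Making the helper generic, as suggested, avoids the interface{} assertions that a non-generic variant would force on callers.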
pkg/kv/kvserver/replica_raftstorage.go
line 364 at r2 (raw file):
// cleared by doing the Ingest*. This is tracked so that we can convert the
// ssts into a WriteBatch if the total size of the ssts is small.
clearedSpans []roachpb.Span
Something smells a bit wrong about both passing in inSnap.clearedSpans and then mutating it during application. This isn't really something that should ever diverge from the descriptor either, and passing it separately muddies that relationship. Should we derive this from the descriptor during application via MakeReplicatedKeySpans() and keep it as a local?
pkg/kv/kvserver/replica_raftstorage.go
line 657 at r2 (raw file):
} else {
	appliedAsWrite = true
	err := r.store.TODOEngine().ConvertFilesToBatchAndCommit(
Is it ok that we're not contributing ingest stats here? In principle, operators could set the threshold to e.g. 1 GB, or we could be ingesting lots of these. Would this then bypass admission control entirely?
pkg/storage/pebble.go
line 2395 at r2 (raw file):
f, err := p.FS.Open(fileName)
if err != nil {
	closeFiles()
nit: consider a brief comment saying that NewSSTEngineIterator takes ownership of the file handles (even in the error case), so we only need to close prior to calling it.
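The ownership concern here boils down to a common open-then-hand-off pattern. The sketch below, with a hypothetical openAll helper over plain os.File handles rather than the real storage types, shows why closing is only needed on the error paths taken before the consumer takes ownership.

```go
package main

import (
	"fmt"
	"os"
)

// openAll opens every path, or closes whatever was already opened and
// returns the error. Once it returns successfully, the caller (e.g. an
// iterator constructor) owns the handles and is responsible for closing them.
func openAll(paths []string) ([]*os.File, error) {
	files := make([]*os.File, 0, len(paths))
	closeFiles := func() {
		for _, f := range files {
			_ = f.Close()
		}
	}
	for _, p := range paths {
		f, err := os.Open(p)
		if err != nil {
			// Only needed here: ownership has not been handed off yet.
			closeFiles()
			return nil, err
		}
		files = append(files, f)
	}
	return files, nil
}

func main() {
	files, err := openAll([]string{"a.sst", "b.sst"})
	if err != nil {
		fmt.Println("open failed:", err)
		return
	}
	// From here on, the consumer owns and eventually closes the handles.
	for _, f := range files {
		_ = f.Close()
	}
}
```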
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @itsbilal)
pkg/kv/kvserver/replica_raftstorage.go
line 657 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Is it ok that we're not contributing ingest stats here? In principle, operators could set the threshold to e.g. 1 GB, or we could be ingesting lots of these. Would this then bypass admission control entirely?
This was a poor oversight -- glad you noticed!
The bypassing part is fine, since snapshot ingestion also bypasses. But snapshots are special in that we don't have them consume tokens either, so that they don't affect foreground latency. But we do account for them in the AC linear models. Fixing this properly will require some plumbing changes in AC.
I'll ping when this is ready for another look.
This is needed for cockroachdb#110943 since we are sometimes applying incoming snapshots as normal writes. As a reminder, we do not deduct tokens for data ingested/written to the LSM for range snapshots since this can starve out normal writes. We also compensate for these ignored bytes when fitting the linear model. The change in that compensation logic is what this PR addresses.
Informs cockroachdb#109808
Epic: none
Release note: None
111367: admission: allow range snapshot ignored bytes to be normal writes r=aadityasondhi a=sumeerbhola
This is needed for #110943 since we are sometimes applying incoming snapshots as normal writes. As a reminder, we do not deduct tokens for data ingested/written to the LSM for range snapshots since this can starve out normal writes. We also compensate for these ignored bytes when fitting the linear model. The change in that compensation logic is what this PR addresses.
Informs #109808
Epic: none
Release note: None

111398: kv: include replicated locks in replica consistency checks r=nvanbenschoten a=nvanbenschoten
Fixes #111294.
This commit adds handling of replicated locks in replica consistency checks. They are iterated over as part of MVCCStats calculations and hashed similarly to point keys. The commit also ensures that the iteration is properly throttled.
Release note: None

Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Force-pushed from 4ef5f3c to 51796af
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @itsbilal)
pkg/kv/kvserver/replica_raftstorage.go
line 51 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: Consider using ConstantWithMetamorphicTestChoice() to list a few representative values (e.g. 0 bytes and 1 GB). You may want to make that helper generic too, to avoid the interface assertions.
Done
pkg/kv/kvserver/replica_raftstorage.go
line 364 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Something smells a bit wrong about both passing in inSnap.clearedSpans and then mutating it during application. This isn't really something that should ever diverge from the descriptor either, and passing it separately muddies that relationship. Should we derive this from the descriptor during application via MakeReplicatedKeySpans() and keep it as a local?
I don't want to have 2 places derive this separately from the descriptor since we can update one and forget to update the other. So I prefer to initialize this in the same place that is creating the new multiSSTWriter.
Regarding the modifications to add more spans later, it is fair to not modify inSnap.clearedSpans. I've changed it to do a clearedSpans := inSnap.clearedSpans first and then modify clearedSpans.
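For illustration, a minimal standalone sketch of the copy-the-slice-header approach described here; the types are simplified stand-ins, not the actual kvserver structs.

```go
package main

import "fmt"

type span struct{ start, end string }

type inSnap struct {
	clearedSpans []span
}

// collectClearedSpans copies the slice header into a local and appends to
// the local, so the caller's s.clearedSpans keeps its original length. If
// the backing array must never be shared, an explicit copy would be needed.
func collectClearedSpans(s inSnap, extra []span) []span {
	clearedSpans := s.clearedSpans
	clearedSpans = append(clearedSpans, extra...)
	return clearedSpans
}

func main() {
	snap := inSnap{clearedSpans: []span{{"a", "b"}}}
	all := collectClearedSpans(snap, []span{{"c", "d"}})
	fmt.Println(len(snap.clearedSpans), len(all)) // 1 2
}
```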
pkg/storage/engine.go
line 1058 at r1 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
nit: maybe exciseSpans can be called clearedSpans as that's what the rest of the code is using, and it's more logically equivalent as in the excise case we won't be excising all of these spans, just one of them.
Done
pkg/storage/pebble.go
line 2395 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: consider a brief comment saying that NewSSTEngineIterator takes ownership of the file handles (even in the error case), so we only need to close prior to calling it.
The actual situation seems more complicated. I've added a TODO below.
Our testing of error paths is poor.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @itsbilal)
pkg/kv/kvserver/replica_raftstorage.go
line 657 at r2 (raw file):
Previously, sumeerbhola wrote…
This was a poor oversight -- glad you noticed!
The bypassing part is fine, since snapshot ingestion also bypasses. But snapshots are special in that we don't have them consume tokens either, so that they don't affect foreground latency. But we do account for them in the AC linear models. Fixing this properly will require some plumbing changes in AC.
I'll ping when this is ready for another look.
This is fixed. The additional AC plumbing was added in #111367
Small snapshots cause LSM overload by resulting in many tiny memtable flushes, which result in a high sub-level count, which then needs to be compensated for by running many inefficient compactions from L0 to Lbase. Despite some compaction scoring changes, we have not been able to fully eliminate the impact of this on foreground traffic, as discussed in cockroachdb/pebble#2832 (comment).
Fixes cockroachdb#109808
Epic: none
Release note (ops change): The cluster setting kv.snapshot.ingest_as_write_threshold controls the size threshold below which snapshots are converted to regular writes. It defaults to 100KiB.
Force-pushed from 51796af to d00178b
Reviewed 6 of 6 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal)
TFTRs!
bors r=itsbilal,erikgrinaker
Build succeeded: |
Small snapshots cause LSM overload by resulting in many tiny memtable flushes, which result in a high sub-level count, which then needs to be compensated for by running many inefficient compactions from L0 to Lbase. Despite some compaction scoring changes, we have not been able to fully eliminate the impact of this on foreground traffic, as discussed in cockroachdb/pebble#2832 (comment).
Fixes #109808
Epic: none
Release note (ops change): The cluster setting
kv.snapshot.ingest_as_write_threshold controls the size threshold below which snapshots are converted to regular writes. It defaults to 100KiB.
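As a closing illustration of the behavior the release note describes, here is a high-level sketch of the size-based decision. The function, the constant, and the exact comparison are simplified stand-ins for the real kvserver/storage code, not its implementation.

```go
package main

import "fmt"

const ingestAsWriteThreshold = 100 << 10 // sketch of the 100 KiB default

// applySnapshot decides how to apply a snapshot's SSTs: below the threshold,
// convert them into a regular write batch and commit it, avoiding a tiny
// memtable flush and the L0 sub-level growth that comes with ingesting many
// small files; above it, fall back to the usual SST ingestion path.
func applySnapshot(sstSizes []int64) string {
	var total int64
	for _, sz := range sstSizes {
		total += sz
	}
	if total <= ingestAsWriteThreshold {
		return "apply as write batch"
	}
	return "ingest ssts"
}

func main() {
	fmt.Println(applySnapshot([]int64{10 << 10, 20 << 10})) // apply as write batch
	fmt.Println(applySnapshot([]int64{1 << 20}))            // ingest ssts
}
```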