-
Notifications
You must be signed in to change notification settings - Fork 470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
db: allow excises to unconditionally be flushable ingests #3897
Conversation
7822bc8
to
47a772a
Compare
The earlier meta test issue is fixed now, so this should be RFAL. |
67b388b
to
59ff3eb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Jackson should take a look at the replayWAL
changes.
Reviewable status: 0 of 28 files reviewed, 5 unresolved discussions (waiting on @aadityasondhi and @itsbilal)
ingest.go
line 1289 at r3 (raw file):
// The excise span is going to outlive this ingestion call. Copy it. exciseSpan = KeyRange{ Start: append([]byte(nil), exciseSpan.Start...),
[nit] slices.Clone
open.go
line 874 at r3 (raw file):
return errors.Wrapf(err, "running compaction during WAL replay") } ve.NewFiles = append(ve.NewFiles, newVE.NewFiles...)
[nit] maybe this should be a VersionEdit.Merge
method?
open.go
line 883 at r3 (raw file):
} } if newVE.CreatedBackingTables != nil {
[nit] no need for the nil check
open.go
line 1007 at r3 (raw file):
} if kind == base.InternalKeyKindExcise { exciseSpan.Start = append([]byte(nil), encodedFileNum...)
[nit] slices.Clone
open.go
line 1028 at r3 (raw file):
panic("pebble: multiple excise spans in a single batch") } exciseSpan.Start = append([]byte(nil), key...)
[nit] slices.Clone
59ff3eb
to
c76349c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Reviewable status: 0 of 28 files reviewed, 1 unresolved discussion (waiting on @aadityasondhi, @jbowens, and @RaduBerinde)
ingest.go
line 1289 at r3 (raw file):
Previously, RaduBerinde wrote…
[nit]
slices.Clone
Done.
open.go
line 874 at r3 (raw file):
Previously, RaduBerinde wrote…
[nit] maybe this should be a
VersionEdit.Merge
method?
In this case we can make some simplifying assumptions that allows us to just directly merge two VEs (that is, we'll either create brand new files, or we'll delete files that existed from the very start, we will never delete a file that was created in the same VE). In other cases we might need something like BulkVersionEdit.Accumulate. So maybe it's best to just leave it here, to avoid misuse of a generic Merge method where these assumptions don't hold.
open.go
line 883 at r3 (raw file):
Previously, RaduBerinde wrote…
[nit] no need for the nil check
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 14 of 28 files at r1, 2 of 6 files at r2, 2 of 3 files at r3, 3 of 5 files at r4.
Reviewable status: 21 of 28 files reviewed, 6 unresolved discussions (waiting on @aadityasondhi, @itsbilal, and @RaduBerinde)
flushable.go
line 167 at r4 (raw file):
// flush. exciseSpan KeyRange seqNum base.SeqNum
nit: should this be called exciseSeqNum
? or add a comment above seqNum
discussing the sequence number assignment (with files adopting the len(files)
sequence numbers starting at seqNum+1
if exciseSpan.Valid()
)
flushable.go
line 305 at r4 (raw file):
) miter.Init(s.comparer, keyspan.NoopTransform, new(keyspanimpl.MergingBuffers), liter) if rkeydelIter != nil {
if rkdelIter == nil
should we just return liter
directly (and avoid allocating a MergingIter
)?
open.go
line 936 at r4 (raw file):
if kind, encodedFileNum, val, ok, err := br.Next(); err != nil { return nil, 0, err } else if ok && (kind == InternalKeyKindIngestSST || kind == InternalKeyKindExcise) {
can we pull this out into its own function? replayWAL
is already a bit hard to follow
open.go
line 940 at r4 (raw file):
// // Ingests require an up-to-date view of the LSM to determine the target // level of ingested sstables, and to accurately compute excises. Outside
would it simplify things to just flushable ingest these into L0 during recovery? I don't think there's much of a loss in doing that and would love to keep the recovery code as simple as we can.
internal/base/internal.go
line 163 at r4 (raw file):
// if the current InternalKeyKindMax is a kind that is never added to an // sstable or memtable (eg. InternalKeyKindExcise). InternalKeyKindMaxForSstable InternalKeyKind = InternalKeyKindDeleteSized
nit: I think we tend to consistently use SSTable
capitalization
c76349c
to
0775601
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Reviewable status: 21 of 28 files reviewed, 4 unresolved discussions (waiting on @aadityasondhi, @jbowens, and @RaduBerinde)
flushable.go
line 167 at r4 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit: should this be called
exciseSeqNum
? or add a comment aboveseqNum
discussing the sequence number assignment (with files adopting thelen(files)
sequence numbers starting atseqNum+1
ifexciseSpan.Valid()
)
Done.
flushable.go
line 305 at r4 (raw file):
Previously, jbowens (Jackson Owens) wrote…
if
rkdelIter == nil
should we just returnliter
directly (and avoid allocating aMergingIter
)?
Done.
open.go
line 936 at r4 (raw file):
Previously, jbowens (Jackson Owens) wrote…
can we pull this out into its own function?
replayWAL
is already a bit hard to follow
Done. I kept the versionEdit/toFlush interactions here but moved the flushableEntry construction and batch reading into a function and it's much cleaner now.
open.go
line 940 at r4 (raw file):
Previously, jbowens (Jackson Owens) wrote…
would it simplify things to just flushable ingest these into L0 during recovery? I don't think there's much of a loss in doing that and would love to keep the recovery code as simple as we can.
The issue isn't so much just the L0 ingestions, but also the fact that we need to excise files we've written up to this point. I think it's easier to reason about correctness when the recovery code shares more code with the regular ingest pipeline than to have it do something unique and special.
internal/base/internal.go
line 163 at r4 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit: I think we tend to consistently use
SSTable
capitalization
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 28 files at r1, 1 of 5 files at r4, 3 of 4 files at r5, all commit messages.
Reviewable status: 26 of 28 files reviewed, 7 unresolved discussions (waiting on @aadityasondhi, @itsbilal, and @RaduBerinde)
open.go
line 940 at r4 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
The issue isn't so much just the L0 ingestions, but also the fact that we need to excise files we've written up to this point. I think it's easier to reason about correctness when the recovery code shares more code with the regular ingest pipeline than to have it do something unique and special.
Thanks, I see. I still think this code structure is getting too unwieldy, and we're doing something unique and special right here. IIUC, before this PR replayWAL
performed no calls to logAndApply
. Its responsibility was a) building up the flushables
list and b) corresponding set of version edit mutations. Then the caller was responsible for either just going on its way with the flushables
as-is (read-only mode) or committing the version edit (read-write mode). (Are we respecting read-only mode here?)
open.go
line 792 at r5 (raw file):
} if kind == base.InternalKeyKindExcise { exciseSpan.Start = slices.Clone(encodedFileNum)
rename encodedFileNum
? it's not an encoded file num here.
open.go
line 795 at r5 (raw file):
exciseSpan.End = slices.Clone(val) } else { addFileNum(encodedFileNum)
I didn't follow why the first iteration is handled up outside the loop?
open.go
line 823 at r5 (raw file):
return nil, err } else if ok { panic("pebble: invalid number of entries in batch.")
nit: stray trailing period
open.go
line 950 at r5 (raw file):
// updateVE is used to update ve with information about new files created // during the flush of any flushable not of type ingestedFlushable. For the // flushable of type ingestedFlushable we use custom handling below.
this comment is outdated now
open.go
line 1047 at r5 (raw file):
entry = d.mu.mem.queue[len(d.mu.mem.queue)-1] d.mu.mem.mutable = nil d.mu.mem.queue = d.mu.mem.queue[:len(d.mu.mem.queue)-1]
this line is confusing. the mutable memtable is the last item in the queue. when we flush, we'll flush the entirety of the queue. why is this trimming off just the mutable memtable at the end? are we guaranteed that the queue is a single element, the mutable memtable? if so we should add a comment describing the invariant, enforce it and use d.mu.mem.queue[:0]
here
open.go
line 1061 at r5 (raw file):
if d.mu.versions.logSeqNum.Load() < maxSeqNum { d.mu.versions.logSeqNum.Store(maxSeqNum) }
why do we need to update the logSeqNum/visibleSeqNum early instead of waiting for the caller of replayWAL
to do it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think this code structure is getting too unwieldy
In thinking about this more, why do we need to actually perform the excise immediately in in replayWAL? The goal of ingested flushables is to defer the operation until flush time. Shouldn't we able to queue the appropriate flushables during replayWAL
and then perform flush operations to drain the queue of flushables? I.e, it seems like we uncouple the replaying into memtables/flushables from the construction of the version edits. We'll still be able to use the existing ingestion excise logic to perform the excise. If we go this route, we should first refactor the existing code to have the structure we want before we attempt to extend it to support excises with flushable ingestions.
Reviewable status: 26 of 28 files reviewed, 7 unresolved discussions (waiting on @aadityasondhi, @itsbilal, and @RaduBerinde)
e07a9ca
to
df24195
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTRs!
I agree that the cleaner approach is to just always replay WALs into the flushable queue, and then do flushes until the flushable queue is empty. That way we can maintain one code path for flushes and flushable ingests, and makes this replay code much easier to reason about. I've gone ahead and done that refactor.
There's still one "memtable buffer not freed" bug in one test (more like, when one test is run with other tests and not on its own) that I spent a lot of time taking a look at, but wasn't able to quite get to the bottom of yet. Other than that, this is ready for a look.
Reviewable status: 22 of 39 files reviewed, 5 unresolved discussions (waiting on @aadityasondhi, @jbowens, and @RaduBerinde)
open.go
line 792 at r5 (raw file):
Previously, jbowens (Jackson Owens) wrote…
rename
encodedFileNum
? it's not an encoded file num here.
Done.
open.go
line 823 at r5 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit: stray trailing period
Done.
open.go
line 950 at r5 (raw file):
Previously, jbowens (Jackson Owens) wrote…
this comment is outdated now
Done.
open.go
line 1047 at r5 (raw file):
Previously, jbowens (Jackson Owens) wrote…
this line is confusing. the mutable memtable is the last item in the queue. when we flush, we'll flush the entirety of the queue. why is this trimming off just the mutable memtable at the end? are we guaranteed that the queue is a single element, the mutable memtable? if so we should add a comment describing the invariant, enforce it and use
d.mu.mem.queue[:0]
here
Done.
open.go
line 1061 at r5 (raw file):
Previously, jbowens (Jackson Owens) wrote…
why do we need to update the logSeqNum/visibleSeqNum early instead of waiting for the caller of
replayWAL
to do it?
Done.
df24195
to
b46c6f6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's still one "memtable buffer not freed" bug in one test (more like, when one test is run with other tests and not on its own) that I spent a lot of time taking a look at, but wasn't able to quite get to the bottom of yet.
This is addressed now.
Reviewable status: 22 of 39 files reviewed, 5 unresolved discussions (waiting on @aadityasondhi, @jbowens, and @RaduBerinde)
f3e286f
to
6cffbb6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we go this route, we should first refactor the existing code to have the structure we want before we attempt to extend it to support excises with flushable ingestions.
Do you mind pulling out the refactor into a separate PR?
Reviewable status: 22 of 39 files reviewed, 5 unresolved discussions (waiting on @aadityasondhi, @itsbilal, and @RaduBerinde)
2f29b72
to
ee0beae
Compare
Done: #3955 |
Previously, we'd only allow IngestAndExcise to become flushable ingests if we had a guarantee from the caller that the ingestion contained rangedel/rangekeydel overlapping the excise span. This change reworks flushable ingests to return rangedels/rangekeydels from their own iterators, even if one does not exist in the sstables themselves. This, plus the update to the WAL to be able to persist the excise span, allows for IngestAndExcise to allow flushable ingests in more cases than before. Also resolves TODOs in replayWAL() around properly sequencing flushable ingests so their sstables can go in levels lower than L0. This sequencing was necessary to make excises work correctly; as excises require an up-to-date view of the version to work correctly. Fixes cockroachdb#2676.
ee0beae
to
3e8e6fa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
Reviewed 4 of 16 files at r6, 1 of 1 files at r8, 12 of 12 files at r9, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @aadityasondhi, @itsbilal, and @RaduBerinde)
TFTR! |
Previously, we'd only allow IngestAndExcise to become flushable ingests if we had a guarantee from the caller that the ingestion contained rangedel/rangekeydel overlapping the excise span. This change reworks flushable ingests to return rangedels/rangekeydels from their own iterators, even if one does not exist in the sstables themselves. This, plus the update to the WAL to be able to persist the excise span, allows for IngestAndExcise to allow flushable ingests in more cases than before.
Also resolves TODOs in replayWAL() around properly sequencing flushable ingests so their sstables can go in levels lower than L0. This sequencing was necessary to make excises work correctly; as excises require an up-to-date view of the version to work correctly.
Fixes #2676.