-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver/rangefeed: fix error handling in intent scanner #127204
base: master
Are you sure you want to change the base?
Conversation
3627400
to
73bcbb2
Compare
This comes from the PR review in #126490 (review). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)
pkg/kv/kvserver/replica_rangefeed.go
line 514 at r1 (raw file):
r.raftMu.AssertHeld() scanner, err := rangefeed.NewSeparatedIntentScanner(ctx, r.store.TODOEngine(), desc.RSpan())
Can we just return rangefeed.NewSeparatedIntentScanner(...)
?
pkg/kv/kvserver/rangefeed/processor.go
line 390 at r1 (raw file):
p.run(ctx, p.RangeID, newRtsIter, stopper) }); err != nil { p.reg.DisconnectWithErr(ctx, all, kvpb.NewError(err))
Not your change, but is this needed? If we never launched the LegacyProcesor.run
method, then we can't have any registrations, right?
pkg/kv/kvserver/rangefeed/processor.go
line 421 at r1 (raw file):
if err != nil { // No need to close rtsIter if error is non-nil. p.StopWithErr(kvpb.NewError(err))
Do we have test coverage for this case? I'm surprised that it's not deadlocking, given the call to syncEventC
in StopWithErr
. Should we just be returning here and allowing the defer close(p.stoppedC)
to shut down the processor?
pkg/kv/kvserver/rangefeed/processor.go
line 421 at r1 (raw file):
if err != nil { // No need to close rtsIter if error is non-nil. p.StopWithErr(kvpb.NewError(err))
Separately, how is this going to interact with this logic in rangefeed registration:
cockroach/pkg/kv/kvserver/replica_rangefeed.go
Lines 536 to 545 in 26ce6ee
reg, filter := p.Register(span, startTS, catchUpIter, withDiff, | |
withFiltering, withOmitRemote, stream, func() { r.maybeDisconnectEmptyRangefeed(p) }) | |
if !reg { | |
select { | |
case <-r.store.Stopper().ShouldQuiesce(): | |
return nil, &kvpb.NodeUnavailableError{} | |
default: | |
panic("unexpected Stopped processor") | |
} | |
} |
It seems like we would hit the unexpected Stopped processor
panic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @stevendanna)
pkg/kv/kvserver/replica_rangefeed.go
line 514 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Can we just
return rangefeed.NewSeparatedIntentScanner(...)
?
Right, done.
pkg/kv/kvserver/rangefeed/processor.go
line 390 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Not your change, but is this needed? If we never launched the
LegacyProcesor.run
method, then we can't have any registrations, right?
Legacy processor has now been removed on master. This comment doesn't seem to apply for scheduled processor - scheduled processor just returns an error if it fails to start.
func (p *ScheduledProcessor) Start( |
pkg/kv/kvserver/rangefeed/processor.go
line 421 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we have test coverage for this case? I'm surprised that it's not deadlocking, given the call to
syncEventC
inStopWithErr
. Should we just be returning here and allowing thedefer close(p.stoppedC)
to shut down the processor?
We didn’t have any tests for this, which was why we haven’t encountered the nil pointer panics during running initialSan.Run
with nil intent scanner. I added a unit test and confirmed that it panics without my changes.
pkg/kv/kvserver/rangefeed/processor.go
line 421 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Separately, how is this going to interact with this logic in rangefeed registration:
cockroach/pkg/kv/kvserver/replica_rangefeed.go
Lines 536 to 545 in 26ce6ee
reg, filter := p.Register(span, startTS, catchUpIter, withDiff, withFiltering, withOmitRemote, stream, func() { r.maybeDisconnectEmptyRangefeed(p) }) if !reg { select { case <-r.store.Stopper().ShouldQuiesce(): return nil, &kvpb.NodeUnavailableError{} default: panic("unexpected Stopped processor") } } It seems like we would hit the
unexpected Stopped processor
panic.
Legacy processor has been removed from master. For scheduled processor, we should return an error
return err |
cockroach/pkg/kv/kvserver/replica_rangefeed.go
Lines 516 to 518 in ed772eb
if err := p.Start(r.store.Stopper(), rtsIter); err != nil { | |
return nil, err | |
} |
2532b2f
to
8243c29
Compare
testutils.SucceedsSoon(t, func() error { | ||
select { | ||
case <-s.(*ScheduledProcessor).stoppedC: | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any other invariants we can check here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added checks for unregistering the processor from the scheduler and for rts.IsInit(). However, I realized that this isn’t testing what I want to test here since the resolved timestamp will only be initialized much later (after we forward closed timestamp). I tried to add a new case where the intent scanner operates correctly and the resolved timestamp initializes successfully. However, I found the implementation to be more complex than expected, effectively requiring us to reinvent the logic in TestProcessorInitializeResolvedTimestamp. If we feel strong about this, I can refactor the TestProcessorInitializeResolvedTimestamp test to exercise both error and non-error cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the code again, it looks to me like the only behaviour we are expecting in the case of a failure here is that this processor is unregistered from the scheduler. So just checking that, assuming we can, is fine by me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the assertion on rts.IsInit and left the check re: unregistering processor from scheduler in the test. Not sure how much you would love it - I'm checking by accessing the scheduler maps here.
_, ok := sch.shards[shardIndex(p.ID(), len(sch.shards), p.Priority)].procs[p.ID()]
require.False(t, ok)
require.False(t, sch.priorityIDs.Contains(p.ID()))
d5f6b23
to
3f00f76
Compare
Previously, when `IntentScannerConstructor` returned an error, only the first rangefeed registration on the replica was disconnected as part of the error handling. This worked because `IntentScannerConstructor` was called before the first `Register` function finished to avoid missing events. But this feels like a brittle assumption. Additionally, we should not continue running `initialSan.Run` when the constructor returns an error as this could lead to nil pointer panics. Comments also incorrectly stated that the resolved timestamp is considered immediately initialized if the provided iterator is nil. We actually check for `IntentScannerConstructor`, not the iterator itself. Passing a nil `IntentScannerConstructor` should only be possible in tests, and it shouldn't be possible for `IntentScannerConstructor` to return a nil `IntentScanner` without an error. This patch fixes these issues by updating the comments and stopping the entire processor when the constructor returns an error. Release note: Fixed a rare bug introduced in v20.2 that could cause panics if the intent scanner failed to construct during processor initialization. Epic: none
Previously, when
IntentScannerConstructor
returned an error, only the firstrangefeed registration on the replica was disconnected as part of the error
handling. This worked because
IntentScannerConstructor
was called before thefirst
Register
function finished to avoid missing events. But this feels likea brittle assumption.
Additionally, we should not continue running
initialSan.Run
when theconstructor returns an error as this could lead to nil pointer panics.
Comments also incorrectly stated that the resolved timestamp is considered
immediately initialized if the provided iterator is nil. We actually check for
IntentScannerConstructor
, not the iterator itself. Passing a nilIntentScannerConstructor
should only be possible in tests, and it shouldn't bepossible for
IntentScannerConstructor
to return a nilIntentScanner
withoutan error.
This patch fixes these issues by updating the comments and stopping the entire
processor when the constructor returns an error.
Epic: none
Release note: Fixed a rare bug introduced in v20.2 that could cause panics if
the intent scanner failed to construct during processor initialization.