Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver/rangefeed: fix error handling in intent scanner #127204

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

wenyihu6
Copy link
Contributor

@wenyihu6 wenyihu6 commented Jul 15, 2024

Previously, when IntentScannerConstructor returned an error, only the first
rangefeed registration on the replica was disconnected as part of the error
handling. This worked because IntentScannerConstructor was called before the
first Register function finished to avoid missing events. But this feels like
a brittle assumption.

Additionally, we should not continue running initialSan.Run when the
constructor returns an error as this could lead to nil pointer panics.

Comments also incorrectly stated that the resolved timestamp is considered
immediately initialized if the provided iterator is nil. We actually check for
IntentScannerConstructor, not the iterator itself. Passing a nil
IntentScannerConstructor should only be possible in tests, and it shouldn't be
possible for IntentScannerConstructor to return a nil IntentScanner without
an error.

This patch fixes these issues by updating the comments and stopping the entire
processor when the constructor returns an error.

Epic: none

Release note: Fixed a rare bug introduced in v20.2 that could cause panics if
the intent scanner failed to construct during processor initialization.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@wenyihu6 wenyihu6 force-pushed the intenterror branch 10 times, most recently from 3627400 to 73bcbb2 Compare July 18, 2024 17:46
@wenyihu6 wenyihu6 marked this pull request as ready for review July 18, 2024 17:46
@wenyihu6 wenyihu6 requested a review from a team as a code owner July 18, 2024 17:46
@wenyihu6 wenyihu6 requested a review from nvanbenschoten July 18, 2024 17:47
@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Jul 18, 2024

This comes from the PR review in #126490 (review).

@wenyihu6 wenyihu6 removed the request for review from a team July 18, 2024 17:48
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)


pkg/kv/kvserver/replica_rangefeed.go line 514 at r1 (raw file):

		r.raftMu.AssertHeld()

		scanner, err := rangefeed.NewSeparatedIntentScanner(ctx, r.store.TODOEngine(), desc.RSpan())

Can we just return rangefeed.NewSeparatedIntentScanner(...)?


pkg/kv/kvserver/rangefeed/processor.go line 390 at r1 (raw file):

		p.run(ctx, p.RangeID, newRtsIter, stopper)
	}); err != nil {
		p.reg.DisconnectWithErr(ctx, all, kvpb.NewError(err))

Not your change, but is this needed? If we never launched the LegacyProcesor.run method, then we can't have any registrations, right?


pkg/kv/kvserver/rangefeed/processor.go line 421 at r1 (raw file):

		if err != nil {
			// No need to close rtsIter if error is non-nil.
			p.StopWithErr(kvpb.NewError(err))

Do we have test coverage for this case? I'm surprised that it's not deadlocking, given the call to syncEventC in StopWithErr. Should we just be returning here and allowing the defer close(p.stoppedC) to shut down the processor?


pkg/kv/kvserver/rangefeed/processor.go line 421 at r1 (raw file):

		if err != nil {
			// No need to close rtsIter if error is non-nil.
			p.StopWithErr(kvpb.NewError(err))

Separately, how is this going to interact with this logic in rangefeed registration:

reg, filter := p.Register(span, startTS, catchUpIter, withDiff,
withFiltering, withOmitRemote, stream, func() { r.maybeDisconnectEmptyRangefeed(p) })
if !reg {
select {
case <-r.store.Stopper().ShouldQuiesce():
return nil, &kvpb.NodeUnavailableError{}
default:
panic("unexpected Stopped processor")
}
}

It seems like we would hit the unexpected Stopped processor panic.

Copy link
Contributor Author

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @stevendanna)


pkg/kv/kvserver/replica_rangefeed.go line 514 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Can we just return rangefeed.NewSeparatedIntentScanner(...)?

Right, done.


pkg/kv/kvserver/rangefeed/processor.go line 390 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Not your change, but is this needed? If we never launched the LegacyProcesor.run method, then we can't have any registrations, right?

Legacy processor has now been removed on master. This comment doesn't seem to apply for scheduled processor - scheduled processor just returns an error if it fails to start.

func (p *ScheduledProcessor) Start(


pkg/kv/kvserver/rangefeed/processor.go line 421 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we have test coverage for this case? I'm surprised that it's not deadlocking, given the call to syncEventC in StopWithErr. Should we just be returning here and allowing the defer close(p.stoppedC) to shut down the processor?

We didn’t have any tests for this, which was why we haven’t encountered the nil pointer panics during running initialSan.Run with nil intent scanner. I added a unit test and confirmed that it panics without my changes.


pkg/kv/kvserver/rangefeed/processor.go line 421 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Separately, how is this going to interact with this logic in rangefeed registration:

reg, filter := p.Register(span, startTS, catchUpIter, withDiff,
withFiltering, withOmitRemote, stream, func() { r.maybeDisconnectEmptyRangefeed(p) })
if !reg {
select {
case <-r.store.Stopper().ShouldQuiesce():
return nil, &kvpb.NodeUnavailableError{}
default:
panic("unexpected Stopped processor")
}
}

It seems like we would hit the unexpected Stopped processor panic.

Legacy processor has been removed from master. For scheduled processor, we should return an error

from p.Start()
if err := p.Start(r.store.Stopper(), rtsIter); err != nil {
return nil, err
}
before it reaches p.Register().

@wenyihu6 wenyihu6 force-pushed the intenterror branch 2 times, most recently from 2532b2f to 8243c29 Compare October 24, 2024 17:11
pkg/kv/kvserver/rangefeed/processor_test.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/rangefeed/processor_test.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/rangefeed/processor_test.go Outdated Show resolved Hide resolved
testutils.SucceedsSoon(t, func() error {
select {
case <-s.(*ScheduledProcessor).stoppedC:
return nil
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any other invariants we can check here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added checks for unregistering the processor from the scheduler and for rts.IsInit(). However, I realized that this isn’t testing what I want to test here since the resolved timestamp will only be initialized much later (after we forward closed timestamp). I tried to add a new case where the intent scanner operates correctly and the resolved timestamp initializes successfully. However, I found the implementation to be more complex than expected, effectively requiring us to reinvent the logic in TestProcessorInitializeResolvedTimestamp. If we feel strong about this, I can refactor the TestProcessorInitializeResolvedTimestamp test to exercise both error and non-error cases.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the code again, it looks to me like the only behaviour we are expecting in the case of a failure here is that this processor is unregistered from the scheduler. So just checking that, assuming we can, is fine by me.

Copy link
Contributor Author

@wenyihu6 wenyihu6 Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the assertion on rts.IsInit and left the check re: unregistering processor from scheduler in the test. Not sure how much you would love it - I'm checking by accessing the scheduler maps here.

_, ok := sch.shards[shardIndex(p.ID(), len(sch.shards), p.Priority)].procs[p.ID()]
require.False(t, ok)
require.False(t, sch.priorityIDs.Contains(p.ID()))

pkg/kv/kvserver/rangefeed/processor_test.go Outdated Show resolved Hide resolved
@wenyihu6 wenyihu6 force-pushed the intenterror branch 3 times, most recently from d5f6b23 to 3f00f76 Compare October 28, 2024 16:36
Previously, when `IntentScannerConstructor` returned an error, only the first
rangefeed registration on the replica was disconnected as part of the error
handling. This worked because `IntentScannerConstructor` was called before the
first `Register` function finished to avoid missing events. But this feels like
a brittle assumption.

Additionally, we should not continue running `initialSan.Run` when the
constructor returns an error as this could lead to nil pointer panics.

Comments also incorrectly stated that the resolved timestamp is considered
immediately initialized if the provided iterator is nil. We actually check for
`IntentScannerConstructor`, not the iterator itself. Passing a nil
`IntentScannerConstructor` should only be possible in tests, and it shouldn't be
possible for `IntentScannerConstructor` to return a nil `IntentScanner` without
an error.

This patch fixes these issues by updating the comments and stopping the entire
processor when the constructor returns an error.

Release note: Fixed a rare bug introduced in v20.2 that could cause panics if
the intent scanner failed to construct during processor initialization.

Epic: none
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants