kvserver/rangefeed: fix error handling in intent scanner #127204

wenyihu6 · 2024-07-15T21:32:02Z

Previously, when IntentScannerConstructor returned an error, only the first
rangefeed registration on the replica was disconnected as part of the error
handling. This worked because IntentScannerConstructor was called before the
first Register function finished to avoid missing events. But this feels like
a brittle assumption.

Additionally, we should not continue running initialScan.Run when the
constructor returns an error, as this could lead to nil pointer panics.

Comments also incorrectly stated that the resolved timestamp is considered
immediately initialized if the provided iterator is nil. We actually check for
IntentScannerConstructor, not the iterator itself. Passing a nil
IntentScannerConstructor should only be possible in tests, and it shouldn't be
possible for IntentScannerConstructor to return a nil IntentScanner without
an error.

This patch fixes these issues by updating the comments and stopping the entire
processor when the constructor returns an error.

Epic: none

Release note: Fixed a rare bug introduced in v20.2 that could cause panics if
the intent scanner failed to construct during processor initialization.
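The error handling described above can be illustrated with a minimal Go sketch. It is a toy model, not the rangefeed code: `processor`, `intentScanner`, `scannerConstructor`, `failingConstructor`, and `workingConstructor` are hypothetical names chosen for illustration. It only shows the shape of the fix: when the constructor fails, stop the whole processor and return before the scanner is ever used.

```go
package main

import (
	"errors"
	"fmt"
)

// intentScanner is a hypothetical stand-in for the rangefeed IntentScanner.
type intentScanner interface {
	Scan() error
}

type okScanner struct{}

func (okScanner) Scan() error { return nil }

// scannerConstructor models a constructor that returns either a usable
// scanner or an error, never (nil, nil).
type scannerConstructor func() (intentScanner, error)

func failingConstructor() (intentScanner, error) {
	return nil, errors.New("scanner construction failed")
}

func workingConstructor() (intentScanner, error) { return okScanner{}, nil }

// processor is a toy model; its fields are illustrative only.
type processor struct {
	stopped bool
	stopErr error
	scanned bool
}

// stopWithErr stops the whole processor rather than a single registration.
func (p *processor) stopWithErr(err error) {
	p.stopped = true
	p.stopErr = err
}

// initResolvedTS runs the initial scan only if the constructor succeeded.
func (p *processor) initResolvedTS(newScanner scannerConstructor) {
	if newScanner == nil {
		// Assumed test-only path: there is nothing to scan.
		return
	}
	s, err := newScanner()
	if err != nil {
		// Do not fall through: s is nil here, so s.Scan() would panic.
		p.stopWithErr(err)
		return
	}
	if err := s.Scan(); err != nil {
		p.stopWithErr(err)
		return
	}
	p.scanned = true
}

func main() {
	p := &processor{}
	p.initResolvedTS(failingConstructor)
	fmt.Println(p.stopped, p.stopErr, p.scanned)
}
```

The early `return` after `stopWithErr` is the important part: without it, the sketch would call a method on a nil interface value.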

wenyihu6 · 2024-07-18T17:47:30Z

This comes from the PR review in #126490 (review).

nvanbenschoten

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)

pkg/kv/kvserver/replica_rangefeed.go line 514 at r1 (raw file):

		r.raftMu.AssertHeld()

		scanner, err := rangefeed.NewSeparatedIntentScanner(ctx, r.store.TODOEngine(), desc.RSpan())

Can we just return rangefeed.NewSeparatedIntentScanner(...)?
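For readers unfamiliar with the idiom behind this suggestion, here is a small Go sketch. The names (`scanner`, `separatedScanner`, `newSeparatedScanner`, `verbose`, `direct`) are hypothetical, and it assumes the original closure did nothing with the results except check the error and return them. In that case a function whose result types match can forward both values in a single `return`.

```go
package main

import (
	"errors"
	"fmt"
)

// scanner is a hypothetical stand-in for the scanner interface.
type scanner interface{ Name() string }

type separatedScanner struct{ span string }

func (s *separatedScanner) Name() string { return "separated:" + s.span }

// newSeparatedScanner plays the role of the constructor. It returns the
// interface type, so a failure yields a true nil interface plus an error.
func newSeparatedScanner(span string) (scanner, error) {
	if span == "" {
		return nil, errors.New("empty span")
	}
	return &separatedScanner{span: span}, nil
}

// verbose unpacks the results only to return them again.
func verbose(span string) (scanner, error) {
	s, err := newSeparatedScanner(span)
	if err != nil {
		return nil, err
	}
	return s, nil
}

// direct forwards both results unchanged.
func direct(span string) (scanner, error) {
	return newSeparatedScanner(span)
}

func main() {
	for _, span := range []string{"a-z", ""} {
		s1, err1 := verbose(span)
		s2, err2 := direct(span)
		fmt.Println(s1 == nil, s2 == nil, err1, err2)
	}
}
```

The two forms behave the same here because the constructor already returns the interface type. If it returned a concrete pointer type instead, forwarding a nil pointer would produce a non-nil interface value, so the equivalence would need a second look.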

pkg/kv/kvserver/rangefeed/processor.go line 390 at r1 (raw file):

		p.run(ctx, p.RangeID, newRtsIter, stopper)
	}); err != nil {
		p.reg.DisconnectWithErr(ctx, all, kvpb.NewError(err))

Not your change, but is this needed? If we never launched the LegacyProcessor.run method, then we can't have any registrations, right?

pkg/kv/kvserver/rangefeed/processor.go line 421 at r1 (raw file):

		if err != nil {
			// No need to close rtsIter if error is non-nil.
			p.StopWithErr(kvpb.NewError(err))

Do we have test coverage for this case? I'm surprised that it's not deadlocking, given the call to syncEventC in StopWithErr. Should we just be returning here and allowing the defer close(p.stoppedC) to shut down the processor?
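The alternative raised here (return from the run loop and let the deferred close signal shutdown) can be sketched with a toy model. All names below (`toyProcessor`, `syncEvent`, `stoppedSoon`, `errInit`) are hypothetical and this is not the rangefeed implementation. It assumes a synchronous call that is answered only by the run loop, which is why issuing it from the loop's own goroutine would wait on itself.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errInit = errors.New("scanner construction failed")

// toyProcessor is a hypothetical model of an event-loop-driven processor.
type toyProcessor struct {
	syncC    chan chan struct{} // sync requests, served only by run
	stoppedC chan struct{}      // closed when run exits
}

func newToyProcessor() *toyProcessor {
	return &toyProcessor{
		syncC:    make(chan chan struct{}),
		stoppedC: make(chan struct{}),
	}
}

// syncEvent blocks until the run loop acknowledges the request or the
// processor has stopped. Calling it from inside run, before stoppedC is
// closed, would wait on the very goroutine that must answer it.
func (p *toyProcessor) syncEvent() {
	ack := make(chan struct{})
	select {
	case p.syncC <- ack:
		<-ack
	case <-p.stoppedC:
	}
}

// run returns early when initialization fails. The deferred close is what
// signals shutdown, so no synchronous stop call is needed on this path.
func (p *toyProcessor) run(initErr error) {
	defer close(p.stoppedC)
	if initErr != nil {
		return
	}
	for ack := range p.syncC {
		close(ack)
	}
}

// stoppedSoon reports whether the processor stopped within five seconds.
func stoppedSoon(p *toyProcessor) bool {
	select {
	case <-p.stoppedC:
		return true
	case <-time.After(5 * time.Second):
		return false
	}
}

func main() {
	p := newToyProcessor()
	go p.run(errInit)
	fmt.Println("stopped:", stoppedSoon(p))
	p.syncEvent() // returns immediately because stoppedC is closed
}
```

In this model, callers outside the loop can still use `syncEvent` safely, because it also selects on `stoppedC`.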

pkg/kv/kvserver/rangefeed/processor.go line 421 at r1 (raw file):

		if err != nil {
			// No need to close rtsIter if error is non-nil.
			p.StopWithErr(kvpb.NewError(err))

Separately, how is this going to interact with this logic in rangefeed registration:

cockroach/pkg/kv/kvserver/replica_rangefeed.go

Lines 536 to 545 in 26ce6ee

	reg, filter := p.Register(span, startTS, catchUpIter, withDiff,
		withFiltering, withOmitRemote, stream, func() { r.maybeDisconnectEmptyRangefeed(p) })
	if !reg {
		select {
		case <-r.store.Stopper().ShouldQuiesce():
			return nil, &kvpb.NodeUnavailableError{}
		default:
			panic("unexpected Stopped processor")
		}
	}

It seems like we would hit the unexpected Stopped processor panic.

wenyihu6

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @stevendanna)

pkg/kv/kvserver/replica_rangefeed.go line 514 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Can we just return rangefeed.NewSeparatedIntentScanner(...)?

Right, done.

pkg/kv/kvserver/rangefeed/processor.go line 390 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Not your change, but is this needed? If we never launched the LegacyProcessor.run method, then we can't have any registrations, right?

The legacy processor has since been removed on master, so this comment no longer seems to apply. The scheduled processor simply returns an error if it fails to start.

cockroach/pkg/kv/kvserver/rangefeed/scheduled_processor.go

Line 97 in ed772eb

func (p *ScheduledProcessor) Start(

pkg/kv/kvserver/rangefeed/processor.go line 421 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we have test coverage for this case? I'm surprised that it's not deadlocking, given the call to syncEventC in StopWithErr. Should we just be returning here and allowing the defer close(p.stoppedC) to shut down the processor?

We didn't have any tests for this, which is why we never hit the nil pointer panic from running initialScan.Run with a nil intent scanner. I added a unit test and confirmed that it panics without my changes.

pkg/kv/kvserver/rangefeed/processor.go line 421 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Separately, how is this going to interact with this logic in rangefeed registration:

cockroach/pkg/kv/kvserver/replica_rangefeed.go

Lines 536 to 545 in 26ce6ee

	reg, filter := p.Register(span, startTS, catchUpIter, withDiff,
		withFiltering, withOmitRemote, stream, func() { r.maybeDisconnectEmptyRangefeed(p) })
	if !reg {
		select {
		case <-r.store.Stopper().ShouldQuiesce():
			return nil, &kvpb.NodeUnavailableError{}
		default:
			panic("unexpected Stopped processor")
		}
	}

It seems like we would hit the unexpected Stopped processor panic.

The legacy processor has been removed from master. For the scheduled processor, we should return an error from p.Start() (cockroach/pkg/kv/kvserver/rangefeed/scheduled_processor.go, line 118 in ed772eb):

	return err

The caller then returns before it reaches p.Register() (cockroach/pkg/kv/kvserver/replica_rangefeed.go, lines 516 to 518 in ed772eb):

	if err := p.Start(r.store.Stopper(), rtsIter); err != nil {
		return nil, err
	}
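The ordering argument in this reply can be sketched as follows. This is a simplified model with hypothetical names (`toyProc`, `start`, `register`, `registerRangefeed`), not the actual replica code. It assumes a failed start leaves the processor stopped and that the caller checks the start error first, so the panic branch after registration is never reached on that path.

```go
package main

import (
	"errors"
	"fmt"
)

// toyProc is a hypothetical, heavily simplified processor.
type toyProc struct {
	started bool
	stopped bool
}

// start fails, and leaves the processor stopped, when the scanner cannot
// be constructed.
func (p *toyProc) start(newScanner func() error) error {
	if err := newScanner(); err != nil {
		p.stopped = true
		return err
	}
	p.started = true
	return nil
}

// register reports false if the processor has already stopped.
func (p *toyProc) register() bool { return !p.stopped }

// registerRangefeed mirrors the call order in the reply: start first, and
// register only if start succeeded.
func registerRangefeed(p *toyProc, newScanner func() error) (bool, error) {
	if err := p.start(newScanner); err != nil {
		return false, err // register is never reached on this path
	}
	if !p.register() {
		panic("unexpected stopped processor")
	}
	return true, nil
}

func failScanner() error { return errors.New("scanner construction failed") }

func okScannerFn() error { return nil }

func main() {
	ok, err := registerRangefeed(&toyProc{}, failScanner)
	fmt.Println(ok, err)
	ok, err = registerRangefeed(&toyProc{}, okScannerFn)
	fmt.Println(ok, err)
}
```

Under these assumptions the failing case surfaces as an ordinary error to the caller rather than as a panic.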

stevendanna · 2024-10-28T08:30:44Z

pkg/kv/kvserver/rangefeed/processor_test.go

+	testutils.SucceedsSoon(t, func() error {
+		select {
+		case <-s.(*ScheduledProcessor).stoppedC:
+			return nil


Are there any other invariants we can check here?

wenyihu6 · 2024-10-28

I added checks for unregistering the processor from the scheduler and for rts.IsInit(). However, I realized that this doesn't test what I want here, since the resolved timestamp is only initialized much later (after we forward the closed timestamp). I tried to add a new case where the intent scanner operates correctly and the resolved timestamp initializes successfully, but the implementation turned out to be more complex than expected and would effectively reinvent the logic in TestProcessorInitializeResolvedTimestamp. If we feel strongly about this, I can refactor TestProcessorInitializeResolvedTimestamp to exercise both the error and non-error cases.

stevendanna · 2024-10-28

Looking at the code again, it looks to me like the only behaviour we expect on failure here is that this processor is unregistered from the scheduler. So just checking that, assuming we can, is fine by me.

wenyihu6 · 2024-10-28

Removed the assertion on rts.IsInit and kept the check that the processor is unregistered from the scheduler. Not sure how much you'll love it: I'm checking by accessing the scheduler maps directly.

	_, ok := sch.shards[shardIndex(p.ID(), len(sch.shards), p.Priority)].procs[p.ID()]
	require.False(t, ok)
	require.False(t, sch.priorityIDs.Contains(p.ID()))
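The shape of the check discussed in this thread (wait until the processor reports that it has stopped, then assert that the scheduler no longer knows about it) can be sketched in isolation. Everything below is hypothetical: `toyScheduler`, `failStart`, `succeedsSoon`, and `checkStoppedAndUnregistered` are illustrative stand-ins, and the sketch assumes the processor unregisters itself before signalling that it stopped.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// toyScheduler tracks registered processor IDs.
type toyScheduler struct {
	mu    sync.Mutex
	procs map[int64]struct{}
}

func newToyScheduler() *toyScheduler {
	return &toyScheduler{procs: make(map[int64]struct{})}
}

func (s *toyScheduler) register(id int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.procs[id] = struct{}{}
}

func (s *toyScheduler) unregister(id int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.procs, id)
}

func (s *toyScheduler) has(id int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.procs[id]
	return ok
}

// succeedsSoon retries fn until it returns nil or five seconds pass.
func succeedsSoon(fn func() error) error {
	deadline := time.Now().Add(5 * time.Second)
	for {
		err := fn()
		if err == nil || time.Now().After(deadline) {
			return err
		}
		time.Sleep(time.Millisecond)
	}
}

// failStart simulates a processor whose start fails asynchronously: it
// unregisters itself and then closes the returned channel.
func failStart(s *toyScheduler, id int64) <-chan struct{} {
	stoppedC := make(chan struct{})
	s.register(id)
	go func() {
		defer close(stoppedC)
		s.unregister(id)
	}()
	return stoppedC
}

// checkStoppedAndUnregistered waits for the stop signal and then verifies
// that the scheduler has forgotten the processor.
func checkStoppedAndUnregistered(s *toyScheduler, id int64, stoppedC <-chan struct{}) error {
	if err := succeedsSoon(func() error {
		select {
		case <-stoppedC:
			return nil
		default:
			return errors.New("processor not stopped yet")
		}
	}); err != nil {
		return err
	}
	if s.has(id) {
		return errors.New("processor still registered with scheduler")
	}
	return nil
}

func main() {
	s := newToyScheduler()
	fmt.Println(checkStoppedAndUnregistered(s, 7, failStart(s, 7)))
}
```

Checking registration only after the stop signal avoids racing with the asynchronous failure path, at the cost of reaching into scheduler state from the test.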

wenyihu6 mentioned this pull request Jul 17, 2024

kvserver/rangefeed: remove future package #126490

Merged

wenyihu6 force-pushed the intenterror branch 10 times, most recently from 3627400 to 73bcbb2 Compare July 18, 2024 17:46

wenyihu6 marked this pull request as ready for review July 18, 2024 17:46

wenyihu6 requested a review from a team as a code owner July 18, 2024 17:46

wenyihu6 requested a review from nvanbenschoten July 18, 2024 17:47

wenyihu6 removed the request for review from a team July 18, 2024 17:48

nvanbenschoten reviewed Jul 29, 2024

wenyihu6 force-pushed the intenterror branch from 73bcbb2 to ed772eb Compare October 24, 2024 05:15

wenyihu6 requested a review from stevendanna October 24, 2024 05:15

wenyihu6 commented Oct 24, 2024

wenyihu6 force-pushed the intenterror branch 2 times, most recently from 2532b2f to 8243c29 Compare October 24, 2024 17:11

stevendanna reviewed Oct 28, 2024

wenyihu6 force-pushed the intenterror branch 3 times, most recently from d5f6b23 to 3f00f76 Compare October 28, 2024 16:36

wenyihu6 force-pushed the intenterror branch from 3f00f76 to b694612 Compare October 28, 2024 18:05

wenyihu6 requested a review from stevendanna November 1, 2024 19:20
