streamingccl: TestStreamingRegionalConstraint times out under stress due to allocator cpu starvation #112541

msbutler · 2023-10-17T15:09:51Z

Per discussion with @kvoli and @stevendanna , we're quite confident that #111541 fails due to allocator cpu starvation in our unit test environment. I'm opening this issue in case any of us would like to further investigate this, though we have much bigger fish to fry right now.

The allocator may be particularly slow in this test because TestStreamingRegionalConstraint adds a 1 replica constraint on a table, which could hit:

cockroach/pkg/kv/kvserver/allocator/plan/util.go

Lines 41 to 62 in cddcddd

    
           // If there's only one replica, the removal target is the 
        
           // leaseholder and this is unsupported and will fail. However, 
        
           // this is also the only way to rebalance in a single-replica 
        
           // range. If we try the atomic swap here, we'll fail doing 
        
           // nothing, and so we stay locked into the current distribution 
        
           // of replicas. (Note that maybeTransferLeaseAway above will not 
        
           // have found a target, and so will have returned (false, nil). 
        
           // 
        
           // Do the best thing we can, which is carry out the addition 
        
           // only, which should succeed, and the next time we touch this 
        
           // range, we will have one more replica and hopefully it will 
        
           // take the lease and remove the current leaseholder. 
        
           // 
        
           // It's possible that "rebalancing deadlock" can occur in other 
        
           // scenarios, it's really impossible to tell from the code given 
        
           // the constraints we support. However, the lease transfer often 
        
           // does not happen spuriously, and we can't enter dangerous 
        
           // configurations sporadically, so this code path is only hit 
        
           // when we know it's necessary, picking the smaller of two evils. 
        
           // 
        
           // See https://github.com/cockroachdb/cockroach/issues/40333. 
        
           log.KvDistribution.Infof(ctx, "can't swap replica due to lease; falling back to add")

(note that the the 1 replica constraint avoids the need to set up more test servers, which would further exacerbate cpu starvation problems)

It's also worth noting that other end-to-end unit tests that exercise allocator code are skipped under stress

cockroach/pkg/kv/kvserver/replicate_queue_test.go

Line 2212 in cddcddd

func TestPromoteNonVoterInAddVoter(t *testing.T) {

Jira issue: CRDB-32483

blathers-crl · 2023-10-17T15:09:54Z

Hi @msbutler, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

RaduBerinde · 2023-10-18T13:41:30Z

One failure here: https://teamcity.cockroachdb.com/viewLog.html?buildId=12228702&buildTypeId=Cockroach_BazelEssentialCi

RaduBerinde · 2023-10-18T13:46:40Z

And here: https://teamcity.cockroachdb.com/viewLog.html?buildId=12228674&buildTypeId=Cockroach_BazelEssentialCi

rafiss · 2023-10-18T18:44:31Z

And here https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelEssentialCi/12238589?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true

rafiss · 2023-10-19T16:12:38Z

https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelEssentialCi/12238589?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true

stevendanna · 2023-10-19T16:14:41Z

Let's reskip this test until we can write a stable version of it or investigate the slowness.

Informs cockroachdb#112541 Release note: None

stevendanna · 2023-10-19T16:20:30Z

Hrm, actually, I wonder if some of these will be resolved once most branches are ahead of #112597

msbutler · 2023-10-19T16:22:10Z

Yeah, it looks like the branch (#112613) rafi's referring too needs to be rebased

adityamaru · 2023-10-19T21:42:50Z

I see a timeout even after the new 3m45s - https://teamcity.cockroachdb.com/viewLog.html?buildId=12260852&buildTypeId=Cockroach_BazelExtendedCi in testrace. Is it worth skipping under test race too and only leaving regular execution unskipped?

kvoli · 2023-10-20T00:10:40Z

The test uses RF=1, which if unlucky will require multiple queue scanner cycles to remove replicas (3->1), then another couple scanner cycles to rebalance the replica to satisfy the constraint. This generally occurs when the new leaseholder doesn't have tracking for the follower replicas, and then excludes them as removal candidates (in addition to the comments mentioned on the issue):

cockroach/pkg/kv/kvserver/allocator/plan/replicate.go

Lines 583 to 612 in 34191e4

    
           	if len(raftStatus.Progress) <= 2 { 
        
           		// HACK(bdarnell): Downreplicating to a single node from 
        
           		// multiple nodes is not really supported. There are edge 
        
           		// cases in which the two peers stop communicating with each 
        
           		// other too soon and we don't reach a satisfactory 
        
           		// resolution. However, some tests (notably 
        
           		// TestRepartitioning) get into this state, and if the 
        
           		// replication queue spends its entire timeout waiting for the 
        
           		// downreplication to finish the test will time out. As a 
        
           		// hack, just fail-fast when we're trying to go down to a 
        
           		// single replica. 
        
           		break 
        
           	} 
        
           	// After upreplication, the candidates for removal could still 
        
           	// be catching up. The allocator determined that the range was 
        
           	// over-replicated, and it's important to clear that state as 
        
           	// quickly as we can (because over-replicated ranges may be 
        
           	// under-diversified). If we return an error here, this range 
        
           	// probably won't be processed again until the next scanner 
        
           	// cycle, which is too long, so we retry here. 
        
           } 
        
           if len(candidates) == 0 { 
        
           	// If we timed out and still don't have any valid candidates, give up. 
        
           	return roachpb.ReplicationTarget{}, "", benignerror.New( 
        
           		errors.Errorf( 
        
           			"no removable replicas from range that needs a removal: %s", 
        
           			rangeRaftProgress(repl.RaftStatus(), existingVoters), 
        
           		), 
        
           	) 
        
           }

Because a benign error is returned, and the remove voter retry loop here doesn't retry (<=2), the result is a full scanner cycle until the next time the range can be retried.

Relevant logs from a repro with --stress:

Details

I231019 22:49:16.025406 40130 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2258  repair needed (‹remove voter›), enqueuing
I231019 22:49:16.025575 40132 13@kv/kvserver/allocator/plan/replicate.go:200 ⋮ [T1,Vsystem,n2,replicate,s2,r88/3:‹/Tenant/2/Table/1{06-10}›] 2259  no rebalance target found, not enqueuing
I231019 22:49:16.025745 40089 13@kv/kvserver/allocator/plan/replicate.go:266 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2260  next replica action: ‹remove voter›
I231019 22:49:16.025771 40134 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2261  repair needed (‹remove voter›), enqueuing
I231019 22:49:16.025786 40089 13@kv/kvserver/replicate_queue.go:785 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2262  error processing replica: not raft leader while range needs removal
I231019 22:49:16.025916 40089 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2263  repair needed (‹remove voter›), enqueuing
I231019 22:49:16.026172 40090 13@kv/kvserver/allocator/plan/replicate.go:266 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2264  next replica action: ‹remove voter›
I231019 22:49:16.026211 40090 13@kv/kvserver/replicate_queue.go:785 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2265  error processing replica: not raft leader while range needs removal
I231019 22:49:16.028569 40101 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2266  repair needed (‹remove voter›), enqueuing
I231019 22:49:16.028979 40102 13@kv/kvserver/allocator/plan/replicate.go:266 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2267  next replica action: ‹remove voter›
I231019 22:49:16.029016 40102 13@kv/kvserver/allocator/plan/replicate.go:578 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2268  filtered unremovable replicas from [(n1,s1):1 (n3,s3):2 (n2,s2):3] to get [] as candidates for removal: ‹[1:0, 2:0, 3*:10]›
I231019 22:49:16.032431 40102 13@kv/kvserver/allocator/plan/replicate.go:578 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2269  filtered unremovable replicas from [(n1,s1):1 (n3,s3):2 (n2,s2):3] to get [(n1,s1):1] as candidates for removal: ‹[1:0, 2:11, 3*:11]›
I231019 22:49:16.032631 40102 13@kv/kvserver/allocator/plan/replicate.go:649 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2270  removing voting replica n1,s1 due to over-replication: ‹[1:0, 2:11, 3*:11]›
I231019 22:49:16.033752 40102 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2271  change replicas (add [] remove [(n1,s1):1VOTER_DEMOTING_LEARNER]): existing descriptor r89:‹/Tenant/{2/Table/110-3}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=9]
I231019 22:49:16.040697 40102 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n2,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2272  proposing ENTER_JOINT(r1 l1) [(n1,s1):1VOTER_DEMOTING_LEARNER]: after=[(n1,s1):1VOTER_DEMOTING_LEARNER (n3,s3):2 (n2,s2):3] next=4
I231019 22:49:16.045483 40102 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2273  change replicas (add [] remove []): existing descriptor r89:‹/Tenant/{2/Table/110-3}› [(n1,s1):1VOTER_DEMOTING_LEARNER, (n3,s3):2, (n2,s2):3, next=4, gen=10]
I231019 22:49:16.068407 40102 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n2,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2274  proposing LEAVE_JOINT: after=[(n1,s1):1LEARNER (n3,s3):2 (n2,s2):3] next=4
I231019 22:49:16.071357 40102 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2275  change replicas (add [] remove [(n1,s1):1LEARNER]): existing descriptor r89:‹/Tenant/{2/Table/110-3}› [(n1,s1):1LEARNER, (n3,s3):2, (n2,s2):3, next=4, gen=11]
I231019 22:49:16.082160 40102 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n2,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2276  proposing SIMPLE(r1) [(n1,s1):1LEARNER]: after=[(n2,s2):3 (n3,s3):2] next=4
I231019 22:49:16.109476 40102 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2277  repair needed (‹remove voter›), enqueuing
I231019 22:49:16.109637 40102 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2278  repair needed (‹remove voter›), enqueuing
I231019 22:49:16.110014 39062 13@kv/kvserver/allocator/plan/replicate.go:266 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2279  next replica action: ‹remove voter›
I231019 22:49:16.110063 39062 13@kv/kvserver/allocator/plan/replicate.go:578 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2280  filtered unremovable replicas from [(n2,s2):3 (n3,s3):2] to get [(n2,s2):3 (n3,s3):2] as candidates for removal: ‹[3*:17, 2:17]›
I231019 22:49:16.110412 39062 13@kv/kvserver/allocator/plan/replicate.go:1046 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2281  transferring lease to s3
I231019 22:49:16.110463 39062 13@kv/kvserver/replicate_queue.go:1075 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2282  transferring lease to s3
I231019 22:49:16.116972 39831 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2283  repair needed (‹remove voter›), enqueuing
I231019 22:49:16.117408 39832 13@kv/kvserver/allocator/plan/replicate.go:266 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2284  next replica action: ‹remove voter›
I231019 22:49:16.117453 39832 13@kv/kvserver/allocator/plan/replicate.go:578 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2285  filtered unremovable replicas from [(n2,s2):3 (n3,s3):2] to get [] as candidates for removal: ‹[3:0, 2*:18]›
I231019 22:49:16.117528 39832 13@kv/kvserver/replicate_queue.go:785 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2286  error processing replica: no removable replicas from range that needs a removal: ‹[3:0, 2*:18]›

I231019 22:51:14.295359 227327 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2690  repair needed (‹remove voter›), enqueuing
I231019 22:51:14.295700 227637 13@kv/kvserver/allocator/plan/replicate.go:266 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2691  next replica action: ‹remove voter›
I231019 22:51:14.295740 227637 13@kv/kvserver/allocator/plan/replicate.go:578 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2692  filtered unremovable replicas from [(n2,s2):3 (n3,s3):2] to get [(n2,s2):3 (n3,s3):2] as candidates for removal: ‹[3:56, 2*:56]›
I231019 22:51:14.295927 227637 13@kv/kvserver/allocator/plan/replicate.go:649 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2693  removing voting replica n2,s2 due to over-replication: ‹[3:56, 2*:56]›
I231019 22:51:14.298832 227637 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2694  change replicas (add [] remove [(n2,s2):3VOTER_DEMOTING_LEARNER]): existing descriptor r89:‹/Tenant/{2/Table/110-3}› [(n2,s2):3, (n3,s3):2, next=4, gen=12]
I231019 22:51:14.304499 227637 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n3,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2695  proposing ENTER_JOINT(r3 l3) [(n2,s2):3VOTER_DEMOTING_LEARNER]: after=[(n2,s2):3VOTER_DEMOTING_LEARNER (n3,s3):2] next=4
I231019 22:51:14.322502 227637 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2696  change replicas (add [] remove []): existing descriptor r89:‹/Tenant/{2/Table/110-3}› [(n2,s2):3VOTER_DEMOTING_LEARNER, (n3,s3):2, next=4, gen=13]
I231019 22:51:14.330840 227637 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n3,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2697  proposing LEAVE_JOINT: after=[(n2,s2):3LEARNER (n3,s3):2] next=4
I231019 22:51:14.335626 227637 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2698  change replicas (add [] remove [(n2,s2):3LEARNER]): existing descriptor r89:‹/Tenant/{2/Table/110-3}› [(n2,s2):3LEARNER, (n3,s3):2, next=4, gen=14]
I231019 22:51:14.346602 227637 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n3,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2699  proposing SIMPLE(r3) [(n2,s2):3LEARNER]: after=[(n3,s3):2] next=4
I231019 22:51:14.348065 227637 13@kv/kvserver/allocator/plan/replicate.go:182 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2700  rebalance target found for voter, enqueuing
I231019 22:51:14.348235 227637 13@kv/kvserver/allocator/plan/replicate.go:182 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2701  rebalance target found for voter, enqueuing
I231019 22:51:14.348476 227690 13@kv/kvserver/allocator/plan/replicate.go:266 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2702  next replica action: ‹consider rebalance›
I231019 22:51:14.348773 227690 13@kv/kvserver/allocator/plan/util.go:62 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2703  can't swap replica due to lease; falling back to add
I231019 22:51:14.352212 227690 13@kv/kvserver/allocator/plan/replicate.go:921 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2704  rebalancing ‹voter› n3,s3 to n1,s1: ‹[2*:62]›
I231019 22:51:14.353324 227690 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2705  change replicas (add [(n1,s1):4LEARNER] remove []): existing descriptor r89:‹/Tenant/{2/Table/110-3}› [(n3,s3):2, next=4, gen=15]
I231019 22:51:14.358194 227690 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n3,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2706  proposing SIMPLE(l4) [(n1,s1):4LEARNER]: after=[(n3,s3):2 (n1,s1):4LEARNER] next=5
I231019 22:51:14.361277 227796 13@kv/kvserver/store_snapshot.go:1863 ⋮ [T1,Vsystem,n3,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2707  streamed snapshot bfd02f16 at applied index 64 to (n1,s1):4LEARNER with 1.5 KiB in 0.00s @ 3.1 MiB/s: kvs=19 rangeKVs=0 sharedSSTs=0, rate-limit: 32 MiB/s, queued: 0.00s
I231019 22:51:14.361889 227798 13@kv/kvserver/replica_raftstorage.go:557 ⋮ [T1,Vsystem,n1,s1,r89/4:{-}] 2708  applying snapshot bfd02f16 from (n3,s3):2 at applied index 64
I231019 22:51:14.477813 227690 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2709  change replicas (add [(n1,s1):4] remove []): existing descriptor r89:‹/Tenant/{2/Table/110-3}› [(n3,s3):2, (n1,s1):4LEARNER, next=5, gen=16]
I231019 22:51:14.485898 227690 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n3,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2710  proposing SIMPLE(v4) [(n1,s1):4]: after=[(n3,s3):2 (n1,s1):4] next=5

I231019 22:51:14.295359 227327 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2690  repair needed (‹remove voter›), enqueuing
I231019 22:51:14.295700 227637 13@kv/kvserver/allocator/plan/replicate.go:266 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2691  next replica action: ‹remove voter›
I231019 22:51:14.295740 227637 13@kv/kvserver/allocator/plan/replicate.go:578 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2692  filtered unremovable replicas from [(n2,s2):3 (n3,s3):2] to get [(n2,s2):3 (n3,s3):2] as candidates for removal: ‹[3:56, 2*:56]›
I231019 22:51:14.295927 227637 13@kv/kvserver/allocator/plan/replicate.go:649 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2693  removing voting replica n2,s2 due to over-replication: ‹[3:56, 2*:56]›
I231019 22:51:14.298832 227637 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2694  change replicas (add [] remove [(n2,s2):3VOTER_DEMOTING_LEARNER]): existing descriptor r89:‹/Tenant/{2/Table/110-3}› [(n2,s2):3, (n3,s3):2, next=4, gen=12]
I231019 22:51:14.304499 227637 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n3,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2695  proposing ENTER_JOINT(r3 l3) [(n2,s2):3VOTER_DEMOTING_LEARNER]: after=[(n2,s2):3VOTER_DEMOTING_LEARNER (n3,s3):2] next=4
I231019 22:51:14.322502 227637 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2696  change replicas (add [] remove []): existing descriptor r89:‹/Tenant/{2/Table/110-3}› [(n2,s2):3VOTER_DEMOTING_LEARNER, (n3,s3):2, next=4, gen=13]
I231019 22:51:14.330840 227637 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n3,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2697  proposing LEAVE_JOINT: after=[(n2,s2):3LEARNER (n3,s3):2] next=4
I231019 22:51:14.335626 227637 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2698  change replicas (add [] remove [(n2,s2):3LEARNER]): existing descriptor r89:‹/Tenant/{2/Table/110-3}› [(n2,s2):3LEARNER, (n3,s3):2, next=4, gen=14]
I231019 22:51:14.346602 227637 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n3,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2699  proposing SIMPLE(r3) [(n2,s2):3LEARNER]: after=[(n3,s3):2] next=4
I231019 22:51:14.348065 227637 13@kv/kvserver/allocator/plan/replicate.go:182 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2700  rebalance target found for voter, enqueuing
I231019 22:51:14.348235 227637 13@kv/kvserver/allocator/plan/replicate.go:182 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2701  rebalance target found for voter, enqueuing
I231019 22:51:14.348476 227690 13@kv/kvserver/allocator/plan/replicate.go:266 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2702  next replica action: ‹consider rebalance›
I231019 22:51:14.348773 227690 13@kv/kvserver/allocator/plan/util.go:62 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2703  can't swap replica due to lease; falling back to add
I231019 22:51:14.352212 227690 13@kv/kvserver/allocator/plan/replicate.go:921 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2704  rebalancing ‹voter› n3,s3 to n1,s1: ‹[2*:62]›
I231019 22:51:14.353324 227690 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2705  change replicas (add [(n1,s1):4LEARNER] remove []): existing descriptor r89:‹/Tenant/{2/Table/110-3}› [(n3,s3):2, next=4, gen=15]
I231019 22:51:14.358194 227690 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n3,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2706  proposing SIMPLE(l4) [(n1,s1):4LEARNER]: after=[(n3,s3):2 (n1,s1):4LEARNER] next=5
I231019 22:51:14.361277 227796 13@kv/kvserver/store_snapshot.go:1863 ⋮ [T1,Vsystem,n3,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2707  streamed snapshot bfd02f16 at applied index 64 to (n1,s1):4LEARNER with 1.5 KiB in 0.00s @ 3.1 MiB/s: kvs=19 rangeKVs=0 sharedSSTs=0, rate-limit: 32 MiB/s, queued: 0.00s
I231019 22:51:14.361889 227798 13@kv/kvserver/replica_raftstorage.go:557 ⋮ [T1,Vsystem,n1,s1,r89/4:{-}] 2708  applying snapshot bfd02f16 from (n3,s3):2 at applied index 64
I231019 22:51:14.477813 227690 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2709  change replicas (add [(n1,s1):4] remove []): existing descriptor r89:‹/Tenant/{2/Table/110-3}› [(n3,s3):2, (n1,s1):4LEARNER, next=5, gen=16]
I231019 22:51:14.485898 227690 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n3,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2710  proposing SIMPLE(v4) [(n1,s1):4]: after=[(n3,s3):2 (n1,s1):4] next=5

I231019 22:52:10.487450 315745 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2803  repair needed (‹remove voter›), enqueuing
I231019 22:52:10.487828 315623 13@kv/kvserver/allocator/plan/replicate.go:266 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2804  next replica action: ‹remove voter›
I231019 22:52:10.487865 315623 13@kv/kvserver/allocator/plan/replicate.go:578 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2805  filtered unremovable replicas from [(n3,s3):2 (n1,s1):4] to get [(n3,s3):2 (n1,s1):4] as candidates for removal: ‹[2*:85, 4:85]›
I231019 22:52:10.488083 315623 13@kv/kvserver/allocator/plan/replicate.go:1046 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2806  transferring lease to s1
I231019 22:52:10.488112 315623 13@kv/kvserver/replicate_queue.go:1075 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2807  transferring lease to s1
I231019 22:52:10.493492 315784 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n1,replicate,s1,r89/4:‹/Tenant/{2/Table/…-3}›] 2808  repair needed (‹remove voter›), enqueuing
I231019 22:52:10.493863 315785 13@kv/kvserver/allocator/plan/replicate.go:266 ⋮ [T1,Vsystem,n1,replicate,s1,r89/4:‹/Tenant/{2/Table/…-3}›] 2809  next replica action: ‹remove voter›
I231019 22:52:10.493906 315785 13@kv/kvserver/allocator/plan/replicate.go:578 ⋮ [T1,Vsystem,n1,replicate,s1,r89/4:‹/Tenant/{2/Table/…-3}›] 2810  filtered unremovable replicas from [(n3,s3):2 (n1,s1):4] to get [] as candidates for removal: ‹[2:0, 4*:87]›
I231019 22:52:10.493961 315785 13@kv/kvserver/replicate_queue.go:785 ⋮ [T1,Vsystem,n1,replicate,s1,r89/4:‹/Tenant/{2/Table/…-3}›] 2811  error processing replica: no removable replicas from range that needs a removal: ‹[2:0, 4*:87]›

For higher RFs, the removeReplica retry loop makes this quicker when the lease just transferred. This doesn't appear like behavior worth changing generally just for testing.

112735: streamingccl: reduce scan interval for testing r=msbutler a=kvoli Reduce the replica scanner min interval from 1s, to 10ms for test clusters. This speeds up tests which rely on replica changes either on the source, or host cluster. ``` dev test pkg/ccl/streamingccl/streamingest \ -f TestStreamingRegionalConstraint -v --stress ... Stats over 1000 runs: max = 51.9s, min = 21.6s, avg = 38.3s, dev = 4.6s ``` Resolves: #112541 Release note: None Co-authored-by: Austen McClernon <[email protected]>

Reduce the replica scanner min interval from 1s, to 10ms for test clusters. This speeds up tests which rely on replica changes either on the source, or host cluster. ``` dev test pkg/ccl/streamingccl/streamingest \ -f TestStreamingRegionalConstraint -v --stress ... Stats over 1000 runs: max = 51.9s, min = 21.6s, avg = 38.3s, dev = 4.6s ``` Resolves: #112541 Release note: None

msbutler added the T-kv KV Team label Oct 17, 2023

msbutler added the C-investigation Further steps needed to qualify. C-label will change. label Oct 17, 2023

msbutler mentioned this issue Oct 17, 2023

ccl/streamingccl/streamingest: TestStreamingRegionalConstraint failed #111541

Closed

rharding6373 mentioned this issue Oct 18, 2023

sql: improve inner plan leaf transaction tracking #112498

Merged

RaduBerinde added branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 branch-master Failures and bugs on the master branch. and removed branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 labels Oct 18, 2023

RaduBerinde added C-test-failure Broken test (automatically or manually discovered). release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Oct 18, 2023

stevendanna added a commit to stevendanna/cockroach that referenced this issue Oct 19, 2023

streamingccl: skip TestStreamingRegionalConstraint

9f7e1d5

Informs cockroachdb#112541 Release note: None

stevendanna mentioned this issue Oct 19, 2023

streamingccl: skip TestStreamingRegionalConstraint #112694

Closed

adityamaru mentioned this issue Oct 19, 2023

sql: add WITH START TIMESTAMP to SHOW EXPERIMENTAL_FINGERPRINTS #112450

Merged

kvoli mentioned this issue Oct 19, 2023

streamingccl: reduce scan interval for testing #112735

Merged

shralex assigned kvoli Oct 20, 2023

craig bot closed this as completed in 08959eb Oct 23, 2023

blathers-crl bot mentioned this issue Oct 23, 2023

release-23.2: streamingccl: reduce scan interval for testing #112853

Merged

rafiss mentioned this issue Oct 31, 2023

jobs: handle retryable errors waiting for jobs #112613

Merged

github-project-automation bot added this to KV Aug 28, 2024

github-project-automation bot moved this to Closed in KV Aug 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

streamingccl: TestStreamingRegionalConstraint times out under stress due to allocator cpu starvation #112541

streamingccl: TestStreamingRegionalConstraint times out under stress due to allocator cpu starvation #112541

msbutler commented Oct 17, 2023 •

edited

Loading

blathers-crl bot commented Oct 17, 2023

RaduBerinde commented Oct 18, 2023

RaduBerinde commented Oct 18, 2023

rafiss commented Oct 18, 2023

rafiss commented Oct 19, 2023

stevendanna commented Oct 19, 2023

stevendanna commented Oct 19, 2023

msbutler commented Oct 19, 2023

adityamaru commented Oct 19, 2023

kvoli commented Oct 20, 2023 •

edited

Loading

streamingccl: TestStreamingRegionalConstraint times out under stress due to allocator cpu starvation #112541

streamingccl: TestStreamingRegionalConstraint times out under stress due to allocator cpu starvation #112541

Comments

msbutler commented Oct 17, 2023 • edited Loading

blathers-crl bot commented Oct 17, 2023

RaduBerinde commented Oct 18, 2023

RaduBerinde commented Oct 18, 2023

rafiss commented Oct 18, 2023

rafiss commented Oct 19, 2023

stevendanna commented Oct 19, 2023

stevendanna commented Oct 19, 2023

msbutler commented Oct 19, 2023

adityamaru commented Oct 19, 2023

kvoli commented Oct 20, 2023 • edited Loading

msbutler commented Oct 17, 2023 •

edited

Loading

kvoli commented Oct 20, 2023 •

edited

Loading