
streamingccl: TestStreamingRegionalConstraint times out under stress due to allocator cpu starvation #112541

Closed
msbutler opened this issue Oct 17, 2023 · 10 comments · Fixed by #112735
Labels: A-testing Testing tools and infrastructure · branch-master Failures and bugs on the master branch. · C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. · C-investigation Further steps needed to qualify. C-label will change. · C-test-failure Broken test (automatically or manually discovered). · T-kv KV Team

msbutler (Collaborator) commented Oct 17, 2023

Per discussion with @kvoli and @stevendanna, we're quite confident that #111541 fails due to allocator CPU starvation in our unit test environment. I'm opening this issue in case any of us wants to investigate further, though we have much bigger fish to fry right now.

The allocator may be particularly slow in this test because TestStreamingRegionalConstraint adds a 1 replica constraint on a table, which could hit:

```go
// If there's only one replica, the removal target is the
// leaseholder and this is unsupported and will fail. However,
// this is also the only way to rebalance in a single-replica
// range. If we try the atomic swap here, we'll fail doing
// nothing, and so we stay locked into the current distribution
// of replicas. (Note that maybeTransferLeaseAway above will not
// have found a target, and so will have returned (false, nil).
//
// Do the best thing we can, which is carry out the addition
// only, which should succeed, and the next time we touch this
// range, we will have one more replica and hopefully it will
// take the lease and remove the current leaseholder.
//
// It's possible that "rebalancing deadlock" can occur in other
// scenarios, it's really impossible to tell from the code given
// the constraints we support. However, the lease transfer often
// does not happen spuriously, and we can't enter dangerous
// configurations sporadically, so this code path is only hit
// when we know it's necessary, picking the smaller of two evils.
//
// See https://github.com/cockroachdb/cockroach/issues/40333.
log.KvDistribution.Infof(ctx, "can't swap replica due to lease; falling back to add")
```

(Note that the 1-replica constraint avoids the need to set up more test servers, which would further exacerbate the CPU starvation problem.)
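To make the add-then-remove fallback concrete, here is a toy model (illustrative Go, not CockroachDB code; all names are made up) of the RF=1 rebalance path: when the only replica is the leaseholder, an atomic swap would have to remove the leaseholder, which is unsupported, so the allocator adds first and can only remove the old replica on a later pass, after the lease has moved.

```go
package main

import "fmt"

type replicaSet struct {
	replicas    map[string]bool // store -> replica present
	leaseholder string
}

// step performs one toy allocator pass trying to move a single-replica
// range from its current store to target. Mirroring the fallback quoted
// above: an atomic swap would remove the leaseholder, so we add only,
// then transfer the lease, then remove, one pass at a time.
func step(rs *replicaSet, target string) string {
	if !rs.replicas[target] {
		// Can't swap: the removal target holds the lease. Add only.
		rs.replicas[target] = true
		return "add " + target + " (swap blocked by lease)"
	}
	if rs.leaseholder != target {
		// Over-replicated; move the lease so the old replica becomes removable.
		rs.leaseholder = target
		return "transfer lease to " + target
	}
	// Lease is on the target; the old replica can finally be removed.
	for s := range rs.replicas {
		if s != target {
			delete(rs.replicas, s)
			return "remove " + s
		}
	}
	return "done"
}

func main() {
	rs := &replicaSet{replicas: map[string]bool{"s3": true}, leaseholder: "s3"}
	// Each call stands in for one replicate-queue pass over the range.
	for i := 0; i < 3; i++ {
		fmt.Println(step(rs, "s1"))
	}
}
```

The point of the sketch is that satisfying a 1-replica constraint costs at least three separate queue passes, which is why anything that slows the scanner down slows this test disproportionately.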

It's also worth noting that other end-to-end unit tests that exercise allocator code are skipped under stress:

```go
func TestPromoteNonVoterInAddVoter(t *testing.T) {
```

Jira issue: CRDB-32483

@msbutler msbutler added the T-kv KV Team label Oct 17, 2023

@msbutler msbutler added the C-investigation Further steps needed to qualify. C-label will change. label Oct 17, 2023
@RaduBerinde RaduBerinde added branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 branch-master Failures and bugs on the master branch. and removed branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 labels Oct 18, 2023
@RaduBerinde RaduBerinde added C-test-failure Broken test (automatically or manually discovered). release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Oct 18, 2023
stevendanna (Collaborator) commented:

Let's re-skip this test until we can write a stable version of it or investigate the slowness.

stevendanna added a commit to stevendanna/cockroach that referenced this issue Oct 19, 2023
stevendanna (Collaborator) commented:

Hrm, actually, I wonder if some of these will be resolved once most branches are ahead of #112597.

msbutler (Collaborator, Author) commented:

Yeah, it looks like the branch (#112613) Rafi's referring to needs to be rebased.

adityamaru (Contributor) commented:

I see a timeout under testrace even after the new 3m45s limit: https://teamcity.cockroachdb.com/viewLog.html?buildId=12260852&buildTypeId=Cockroach_BazelExtendedCi. Is it worth skipping under race as well, leaving only regular execution unskipped?

kvoli (Collaborator) commented Oct 20, 2023

The test uses RF=1. If unlucky, this requires multiple queue scanner cycles to remove replicas (3->1), then another couple of scanner cycles to rebalance the remaining replica to satisfy the constraint. This generally occurs when the new leaseholder doesn't yet have Raft progress tracking for the follower replicas and therefore excludes them as removal candidates (in addition to the code path quoted earlier in this issue):

```go
	if len(raftStatus.Progress) <= 2 {
		// HACK(bdarnell): Downreplicating to a single node from
		// multiple nodes is not really supported. There are edge
		// cases in which the two peers stop communicating with each
		// other too soon and we don't reach a satisfactory
		// resolution. However, some tests (notably
		// TestRepartitioning) get into this state, and if the
		// replication queue spends its entire timeout waiting for the
		// downreplication to finish the test will time out. As a
		// hack, just fail-fast when we're trying to go down to a
		// single replica.
		break
	}
	// After upreplication, the candidates for removal could still
	// be catching up. The allocator determined that the range was
	// over-replicated, and it's important to clear that state as
	// quickly as we can (because over-replicated ranges may be
	// under-diversified). If we return an error here, this range
	// probably won't be processed again until the next scanner
	// cycle, which is too long, so we retry here.
}
if len(candidates) == 0 {
	// If we timed out and still don't have any valid candidates, give up.
	return roachpb.ReplicationTarget{}, "", benignerror.New(
		errors.Errorf(
			"no removable replicas from range that needs a removal: %s",
			rangeRaftProgress(repl.RaftStatus(), existingVoters),
		),
	)
}
```

Because a benign error is returned and the remove-voter retry loop here doesn't retry (<=2), the result is a wait of a full scanner cycle before the range can be processed again.

Relevant logs from a repro with --stress:


I231019 22:49:16.025406 40130 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2258  repair needed (‹remove voter›), enqueuing
I231019 22:49:16.025575 40132 13@kv/kvserver/allocator/plan/replicate.go:200 ⋮ [T1,Vsystem,n2,replicate,s2,r88/3:‹/Tenant/2/Table/1{06-10}›] 2259  no rebalance target found, not enqueuing
I231019 22:49:16.025745 40089 13@kv/kvserver/allocator/plan/replicate.go:266 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2260  next replica action: ‹remove voter›
I231019 22:49:16.025771 40134 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2261  repair needed (‹remove voter›), enqueuing
I231019 22:49:16.025786 40089 13@kv/kvserver/replicate_queue.go:785 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2262  error processing replica: not raft leader while range needs removal
I231019 22:49:16.025916 40089 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2263  repair needed (‹remove voter›), enqueuing
I231019 22:49:16.026172 40090 13@kv/kvserver/allocator/plan/replicate.go:266 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2264  next replica action: ‹remove voter›
I231019 22:49:16.026211 40090 13@kv/kvserver/replicate_queue.go:785 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2265  error processing replica: not raft leader while range needs removal
I231019 22:49:16.028569 40101 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2266  repair needed (‹remove voter›), enqueuing
I231019 22:49:16.028979 40102 13@kv/kvserver/allocator/plan/replicate.go:266 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2267  next replica action: ‹remove voter›
I231019 22:49:16.029016 40102 13@kv/kvserver/allocator/plan/replicate.go:578 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2268  filtered unremovable replicas from [(n1,s1):1 (n3,s3):2 (n2,s2):3] to get [] as candidates for removal: ‹[1:0, 2:0, 3*:10]›
I231019 22:49:16.032431 40102 13@kv/kvserver/allocator/plan/replicate.go:578 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2269  filtered unremovable replicas from [(n1,s1):1 (n3,s3):2 (n2,s2):3] to get [(n1,s1):1] as candidates for removal: ‹[1:0, 2:11, 3*:11]›
I231019 22:49:16.032631 40102 13@kv/kvserver/allocator/plan/replicate.go:649 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2270  removing voting replica n1,s1 due to over-replication: ‹[1:0, 2:11, 3*:11]›
I231019 22:49:16.033752 40102 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2271  change replicas (add [] remove [(n1,s1):1VOTER_DEMOTING_LEARNER]): existing descriptor r89:‹/Tenant/{2/Table/110-3}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=9]
I231019 22:49:16.040697 40102 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n2,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2272  proposing ENTER_JOINT(r1 l1) [(n1,s1):1VOTER_DEMOTING_LEARNER]: after=[(n1,s1):1VOTER_DEMOTING_LEARNER (n3,s3):2 (n2,s2):3] next=4
I231019 22:49:16.045483 40102 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2273  change replicas (add [] remove []): existing descriptor r89:‹/Tenant/{2/Table/110-3}› [(n1,s1):1VOTER_DEMOTING_LEARNER, (n3,s3):2, (n2,s2):3, next=4, gen=10]
I231019 22:49:16.068407 40102 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n2,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2274  proposing LEAVE_JOINT: after=[(n1,s1):1LEARNER (n3,s3):2 (n2,s2):3] next=4
I231019 22:49:16.071357 40102 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2275  change replicas (add [] remove [(n1,s1):1LEARNER]): existing descriptor r89:‹/Tenant/{2/Table/110-3}› [(n1,s1):1LEARNER, (n3,s3):2, (n2,s2):3, next=4, gen=11]
I231019 22:49:16.082160 40102 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n2,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2276  proposing SIMPLE(r1) [(n1,s1):1LEARNER]: after=[(n2,s2):3 (n3,s3):2] next=4
I231019 22:49:16.109476 40102 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2277  repair needed (‹remove voter›), enqueuing
I231019 22:49:16.109637 40102 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2278  repair needed (‹remove voter›), enqueuing
I231019 22:49:16.110014 39062 13@kv/kvserver/allocator/plan/replicate.go:266 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2279  next replica action: ‹remove voter›
I231019 22:49:16.110063 39062 13@kv/kvserver/allocator/plan/replicate.go:578 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2280  filtered unremovable replicas from [(n2,s2):3 (n3,s3):2] to get [(n2,s2):3 (n3,s3):2] as candidates for removal: ‹[3*:17, 2:17]›
I231019 22:49:16.110412 39062 13@kv/kvserver/allocator/plan/replicate.go:1046 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2281  transferring lease to s3
I231019 22:49:16.110463 39062 13@kv/kvserver/replicate_queue.go:1075 ⋮ [T1,Vsystem,n2,replicate,s2,r89/3:‹/Tenant/{2/Table/…-3}›] 2282  transferring lease to s3
I231019 22:49:16.116972 39831 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2283  repair needed (‹remove voter›), enqueuing
I231019 22:49:16.117408 39832 13@kv/kvserver/allocator/plan/replicate.go:266 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2284  next replica action: ‹remove voter›
I231019 22:49:16.117453 39832 13@kv/kvserver/allocator/plan/replicate.go:578 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2285  filtered unremovable replicas from [(n2,s2):3 (n3,s3):2] to get [] as candidates for removal: ‹[3:0, 2*:18]›
I231019 22:49:16.117528 39832 13@kv/kvserver/replicate_queue.go:785 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2286  error processing replica: no removable replicas from range that needs a removal: ‹[3:0, 2*:18]›

I231019 22:51:14.295359 227327 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2690  repair needed (‹remove voter›), enqueuing
I231019 22:51:14.295700 227637 13@kv/kvserver/allocator/plan/replicate.go:266 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2691  next replica action: ‹remove voter›
I231019 22:51:14.295740 227637 13@kv/kvserver/allocator/plan/replicate.go:578 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2692  filtered unremovable replicas from [(n2,s2):3 (n3,s3):2] to get [(n2,s2):3 (n3,s3):2] as candidates for removal: ‹[3:56, 2*:56]›
I231019 22:51:14.295927 227637 13@kv/kvserver/allocator/plan/replicate.go:649 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2693  removing voting replica n2,s2 due to over-replication: ‹[3:56, 2*:56]›
I231019 22:51:14.298832 227637 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2694  change replicas (add [] remove [(n2,s2):3VOTER_DEMOTING_LEARNER]): existing descriptor r89:‹/Tenant/{2/Table/110-3}› [(n2,s2):3, (n3,s3):2, next=4, gen=12]
I231019 22:51:14.304499 227637 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n3,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2695  proposing ENTER_JOINT(r3 l3) [(n2,s2):3VOTER_DEMOTING_LEARNER]: after=[(n2,s2):3VOTER_DEMOTING_LEARNER (n3,s3):2] next=4
I231019 22:51:14.322502 227637 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2696  change replicas (add [] remove []): existing descriptor r89:‹/Tenant/{2/Table/110-3}› [(n2,s2):3VOTER_DEMOTING_LEARNER, (n3,s3):2, next=4, gen=13]
I231019 22:51:14.330840 227637 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n3,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2697  proposing LEAVE_JOINT: after=[(n2,s2):3LEARNER (n3,s3):2] next=4
I231019 22:51:14.335626 227637 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2698  change replicas (add [] remove [(n2,s2):3LEARNER]): existing descriptor r89:‹/Tenant/{2/Table/110-3}› [(n2,s2):3LEARNER, (n3,s3):2, next=4, gen=14]
I231019 22:51:14.346602 227637 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n3,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2699  proposing SIMPLE(r3) [(n2,s2):3LEARNER]: after=[(n3,s3):2] next=4
I231019 22:51:14.348065 227637 13@kv/kvserver/allocator/plan/replicate.go:182 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2700  rebalance target found for voter, enqueuing
I231019 22:51:14.348235 227637 13@kv/kvserver/allocator/plan/replicate.go:182 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2701  rebalance target found for voter, enqueuing
I231019 22:51:14.348476 227690 13@kv/kvserver/allocator/plan/replicate.go:266 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2702  next replica action: ‹consider rebalance›
I231019 22:51:14.348773 227690 13@kv/kvserver/allocator/plan/util.go:62 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2703  can't swap replica due to lease; falling back to add
I231019 22:51:14.352212 227690 13@kv/kvserver/allocator/plan/replicate.go:921 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2704  rebalancing ‹voter› n3,s3 to n1,s1: ‹[2*:62]›
I231019 22:51:14.353324 227690 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2705  change replicas (add [(n1,s1):4LEARNER] remove []): existing descriptor r89:‹/Tenant/{2/Table/110-3}› [(n3,s3):2, next=4, gen=15]
I231019 22:51:14.358194 227690 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n3,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2706  proposing SIMPLE(l4) [(n1,s1):4LEARNER]: after=[(n3,s3):2 (n1,s1):4LEARNER] next=5
I231019 22:51:14.361277 227796 13@kv/kvserver/store_snapshot.go:1863 ⋮ [T1,Vsystem,n3,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2707  streamed snapshot bfd02f16 at applied index 64 to (n1,s1):4LEARNER with 1.5 KiB in 0.00s @ 3.1 MiB/s: kvs=19 rangeKVs=0 sharedSSTs=0, rate-limit: 32 MiB/s, queued: 0.00s
I231019 22:51:14.361889 227798 13@kv/kvserver/replica_raftstorage.go:557 ⋮ [T1,Vsystem,n1,s1,r89/4:{-}] 2708  applying snapshot bfd02f16 from (n3,s3):2 at applied index 64
I231019 22:51:14.477813 227690 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2709  change replicas (add [(n1,s1):4] remove []): existing descriptor r89:‹/Tenant/{2/Table/110-3}› [(n3,s3):2, (n1,s1):4LEARNER, next=5, gen=16]
I231019 22:51:14.485898 227690 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n3,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2710  proposing SIMPLE(v4) [(n1,s1):4]: after=[(n3,s3):2 (n1,s1):4] next=5

I231019 22:52:10.487450 315745 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2803  repair needed (‹remove voter›), enqueuing
I231019 22:52:10.487828 315623 13@kv/kvserver/allocator/plan/replicate.go:266 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2804  next replica action: ‹remove voter›
I231019 22:52:10.487865 315623 13@kv/kvserver/allocator/plan/replicate.go:578 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2805  filtered unremovable replicas from [(n3,s3):2 (n1,s1):4] to get [(n3,s3):2 (n1,s1):4] as candidates for removal: ‹[2*:85, 4:85]›
I231019 22:52:10.488083 315623 13@kv/kvserver/allocator/plan/replicate.go:1046 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2806  transferring lease to s1
I231019 22:52:10.488112 315623 13@kv/kvserver/replicate_queue.go:1075 ⋮ [T1,Vsystem,n3,replicate,s3,r89/2:‹/Tenant/{2/Table/…-3}›] 2807  transferring lease to s1
I231019 22:52:10.493492 315784 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n1,replicate,s1,r89/4:‹/Tenant/{2/Table/…-3}›] 2808  repair needed (‹remove voter›), enqueuing
I231019 22:52:10.493863 315785 13@kv/kvserver/allocator/plan/replicate.go:266 ⋮ [T1,Vsystem,n1,replicate,s1,r89/4:‹/Tenant/{2/Table/…-3}›] 2809  next replica action: ‹remove voter›
I231019 22:52:10.493906 315785 13@kv/kvserver/allocator/plan/replicate.go:578 ⋮ [T1,Vsystem,n1,replicate,s1,r89/4:‹/Tenant/{2/Table/…-3}›] 2810  filtered unremovable replicas from [(n3,s3):2 (n1,s1):4] to get [] as candidates for removal: ‹[2:0, 4*:87]›
I231019 22:52:10.493961 315785 13@kv/kvserver/replicate_queue.go:785 ⋮ [T1,Vsystem,n1,replicate,s1,r89/4:‹/Tenant/{2/Table/…-3}›] 2811  error processing replica: no removable replicas from range that needs a removal: ‹[2:0, 4*:87]›


For higher RFs, the removeReplica retry loop makes this quicker when the lease has just transferred. This doesn't seem like behavior worth changing in general just for testing.

craig bot pushed a commit that referenced this issue Oct 23, 2023
112735: streamingccl: reduce scan interval for testing r=msbutler a=kvoli

Reduce the replica scanner min interval from 1s to 10ms for test
clusters. This speeds up tests that rely on replica changes on either
the source or host cluster.

```
dev test pkg/ccl/streamingccl/streamingest \
  -f TestStreamingRegionalConstraint -v --stress
...
Stats over 1000 runs: max = 51.9s, min = 21.6s, avg = 38.3s, dev = 4.6s
```

Resolves: #112541
Release note: None

Co-authored-by: Austen McClernon <[email protected]>
@craig craig bot closed this as completed in 08959eb Oct 23, 2023
kvoli added a commit that referenced this issue Oct 23, 2023
@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Nov 29, 2023
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024