
roachtest: replicate/wide failed #99268

Closed
cockroach-teamcity opened this issue Mar 22, 2023 · 10 comments · Fixed by #99643


cockroach-teamcity commented Mar 22, 2023

roachtest.replicate/wide failed with artifacts on release-23.1 @ 80c4895c566a7eaa6f16c4098980509dd3795ad7:

test artifacts and logs in: /artifacts/replicate/wide/run_1
(test_runner.go:1008).runTest: test timed out (10m0s)
(cluster.go:1896).Start: ~ COCKROACH_CONNECT_TIMEOUT=0 ./cockroach sql --url 'postgres://root@localhost:26257?sslmode=disable' -e "CREATE SCHEDULE IF NOT EXISTS test_only_backup FOR BACKUP INTO 'gs://cockroachdb-backup-testing/roachprod-scheduled-backups/teamcity-9182772-1679462401-116-n9cpu1/1679504844327457704?AUTH=implicit' RECURRING '*/15 * * * *' FULL BACKUP '@hourly' WITH SCHEDULE OPTIONS first_run = 'now'"
ERROR: server closed the connection.
Is this a CockroachDB node?
unexpected EOF
Failed running "sql": COMMAND_PROBLEM: ssh verbose log retained in ssh_170724.327597136_n1_init-backup-schedule.log: exit status 1

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=1 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-25795

@cockroach-teamcity cockroach-teamcity added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 22, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Mar 22, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Mar 22, 2023

tbg commented Mar 24, 2023

Stuck here

// Stop the cluster and restart 2/3 of the nodes.
c.Stop(ctx, t.L(), option.DefaultStopOpts())
tBeginDown := timeutil.Now()
c.Start(ctx, t.L(), startOpts, settings, c.Range(1, 6))

when attempting to create a backup schedule:

goroutine 5607337 [select, 8 minutes]:
github.com/cockroachdb/cockroach/pkg/roachprod/install.(*remoteSession).CombinedOutput(0xc002312540, {0x79f3b38, 0xc00da62500})
        github.com/cockroachdb/cockroach/pkg/roachprod/install/session.go:155 +0x165
github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).createFixedBackupSchedule(0xc0074c65a0, {0x79f3b38, 0xc00da62500}, 0x2?, {0x7983bf0, 0x0})
        github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:840 +0x57a
github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Start(0xc0074c65a0, {0x79f3b38?, 0xc00da62500}, 0xc00ed2aba0, {0x0, 0x1, {0xc009a18ba0, 0x1, 0x1}, 0x1, ...})

url := c.NodeURL("localhost", c.NodePort(node), "" /* tenantName */)
fullCmd := fmt.Sprintf(`COCKROACH_CONNECT_TIMEOUT=0 %s sql --url %s -e %q`,
    binary, url, createScheduleCmd)
// Instead of using `c.ExecSQL()`, use the more flexible c.newSession(), which allows us to
// 1) prefix the schedule backup cmd with COCKROACH_CONNECT_TIMEOUT=0.
// 2) run the command against the first node in the cluster target.
sess := c.newSession(l, node, fullCmd, withDebugName("init-backup-schedule"))
defer sess.Close()
out, err := sess.CombinedOutput(ctx)

I think there's an internal endless DistSender retry loop (I also see a stuck SQL statement in DistSender waiting to hear back from single-range RPCs, like this stack here):

[screenshot: stack trace of the stuck SQL statement in DistSender]

It's this retry loop:

// Start a retry loop for sending the batch to the range. Each iteration of
// this loop uses a new descriptor. Attempts to send to multiple replicas in
// this descriptor are done at a lower level.
tBegin, attempts := timeutil.Now(), int64(0) // for slow log message
// prevTok maintains the EvictionToken used on the previous iteration.
var prevTok rangecache.EvictionToken
for r := retry.StartWithCtx(ctx, ds.rpcRetryOptions); r.Next(); {
    attempts++

There are various "slow range RPC":

teamcity-9182772-1679462401-116-n9cpu1-0003> W230322 17:08:23.950127 12420 kv/kvclient/kvcoord/dist_sender.go:1675 ⋮ [T1,n3,intExec=‹protectedts-GetMetadata›] 134  slow range RPC: have been waiting 60.18s (65 attempts) for RPC Get [/Table/5/1/‹1›/‹2›/‹1›,/Min), [txn: 26e77d8e] to r9:‹/Table/{5-6}› [(n8,s8):6, (n2,s2):2, (n9,s9):7, (n5,s5):4, (n3,s3):5, (n6,s6):8, (n4,s4):9, (n7,s7):10, (n1,s1):11, next=12, gen=24]; resp: ‹failed to send RPC: sending to all replicas failed; last error: [NotLeaseHolderError] can't determine lease status of (n8,s8):6VOTER_INCOMING due to node liveness error: liveness record not found in cache; r9: replica (n1,s1):11 not lease holder; lease holder unknown›

Note that only n1-n6 are up right now. n7, n8, n9 are intentionally down. What should happen is that such a request would quickly bounce to one of the live nodes (which are all in the descriptor and we shouldn't have lost quorum here). But it doesn't seem to happen.

Have to run now, need to investigate more later.


tbg commented Mar 24, 2023

Note that only n1-n6 are up right now. n7, n8, n9 are intentionally down. What should happen is that such a request would quickly bounce to one of the live nodes (which are all in the descriptor and we shouldn't have lost quorum here). But it doesn't seem to happen.

I was a bit sloppy here. We're getting NLHEs back, so these are clearly coming from n1-n6 (a down node can't respond). For example, the one above came from replica (n1,s1):11 not lease holder; lease holder unknown.

I see these messages from all replicas (for different RPCs), so it's a fair assumption that they are all responding in that same way. Why is nobody acquiring the lease?

This probably has to do with the message

can't determine lease status of (n8,s8):6VOTER_INCOMING due to node liveness error: liveness record not found in cache

which we see off and on, for n7-n9 (i.e. the down nodes). This originates here:

l, ok := r.store.cfg.NodeLiveness.GetLiveness(lease.Replica.NodeID)
status.Liveness = l.Liveness
if !ok || status.Liveness.Epoch < lease.Epoch {
    // If lease validity can't be determined (e.g. gossip is down
    // and liveness info isn't available for owner), we can neither
    // use the lease nor do we want to attempt to acquire it.
    var msg redact.StringBuilder
    if !ok {
        msg.Printf("can't determine lease status of %s due to node liveness error: %v",
            lease.Replica, liveness.ErrRecordCacheMiss)
    } else {
        msg.Printf("can't determine lease status of %s because node liveness info for n%d is stale. lease: %s, liveness: %s",
            lease.Replica, lease.Replica.NodeID, lease, l.Liveness)
    }
    if leaseStatusLogLimiter.ShouldLog() {
        log.Infof(ctx, "%s", msg)
    }
    status.State = kvserverpb.LeaseState_ERROR
    status.ErrInfo = msg.String()
    return status
}

I think the effect of that is to prevent lease acquisitions: it returns an "error" lease state, but we need an "expired" one if we are to take over the lease.
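
Roughly, the caller ends up in a situation like this (a simplified sketch for illustration only, not the actual replica code; the comments describe the states as I understand them):

switch status.State {
case kvserverpb.LeaseState_VALID:
    // Someone holds a usable lease: serve locally or redirect to the holder.
case kvserverpb.LeaseState_EXPIRED:
    // No usable lease: this replica is allowed to try to acquire one.
case kvserverpb.LeaseState_ERROR:
    // Can't tell (here: liveness record missing from the cache), so we
    // neither use the lease nor try to acquire it; the request bounces
    // back to the DistSender as a NotLeaseHolderError with an unknown
    // lease holder -- which matches the log message above.
}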

What should have happened is that the liveness of all nodes, including down ones, was gossiped when the (expiration-based) liveness lease was acquired. As it happens, we just made a change in that area two weeks ago:

#98150

Prior to this PR, all livenesses were gossiped on each extension, i.e. quite frequently. Now this only happens once, when the lease is first acquired. It's possible that there is a bug where this trigger does not fire.

Is it possible that leaseChangingHands ends up being false if the pre-restart leaseholder comes back up and still finds its lease valid?

leaseChangingHands := prevLease.Replica.StoreID != newLease.Replica.StoreID || prevLease.Sequence != newLease.Sequence

Then, even though the whole-cluster downtime cleared cluster-wide gossip state, the effectively "new" (same) leaseholder would not re-gossip, and so the liveness info for the nodes that remain down would never reappear in gossip.
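
As a hypothetical sketch of that gating (only leaseChangingHands is taken from the snippet above; isLivenessRange and gossipAllLivenessRecords are illustrative placeholders):

leaseChangingHands := prevLease.Replica.StoreID != newLease.Replica.StoreID ||
    prevLease.Sequence != newLease.Sequence
if leaseChangingHands && isLivenessRange { // isLivenessRange: placeholder
    // Since #98150 this is the only place where all liveness records get
    // (re-)gossiped. If the pre-restart leaseholder keeps an equivalent
    // lease (same store, same sequence), this branch is skipped and the
    // records for the still-down nodes n7-n9 never reappear in gossip.
    gossipAllLivenessRecords(ctx) // placeholder for the actual gossip call
}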

It seems like a likely explanation. cc @nvanbenschoten

@tbg tbg added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 24, 2023
tbg added a commit to tbg/cockroach that referenced this issue Mar 24, 2023
…ollowing restart

See: cockroachdb#99268 (comment)

This also sneaks in the logging requested in cockroachdb#99472.

Epic: none
Release note: None

tbg commented Mar 24, 2023

I was hoping that putting in an assertion and restarting a local single-node cluster would prove it, but I've run it a few times with no success so far.

#99499


tbg commented Mar 24, 2023

Still, reading the code, all indications are that a node re-requesting a lease it previously held would not bump the sequence number.

@cockroach-teamcity

roachtest.replicate/wide failed with artifacts on release-23.1 @ 52e55d2ef172b7cfec14e8a0a954f8864b2be779:

test artifacts and logs in: /artifacts/replicate/wide/run_1
(test_runner.go:1008).runTest: test timed out (10m0s)
(cluster.go:1896).Start: ~ COCKROACH_CONNECT_TIMEOUT=0 ./cockroach sql --url 'postgres://root@localhost:26257?sslmode=disable' -e "CREATE SCHEDULE IF NOT EXISTS test_only_backup FOR BACKUP INTO 'gs://cockroachdb-backup-testing/roachprod-scheduled-backups/teamcity-9225816-1679635256-116-n9cpu1/1679681580014183203?AUTH=implicit' RECURRING '*/15 * * * *' FULL BACKUP '@hourly' WITH SCHEDULE OPTIONS first_run = 'now'"
ERROR: server closed the connection.
Is this a CockroachDB node?
unexpected EOF
Failed running "sql": COMMAND_PROBLEM: ssh verbose log retained in ssh_181300.014211626_n1_init-backup-schedule.log: exit status 1

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=1 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity

roachtest.replicate/wide failed with artifacts on release-23.1 @ f351747ed97862fc037717cadec23f18073fb6be:

test artifacts and logs in: /artifacts/replicate/wide/run_1
(test_runner.go:1008).runTest: test timed out (10m0s)
(cluster.go:1896).Start: ~ COCKROACH_CONNECT_TIMEOUT=0 ./cockroach sql --url 'postgres://root@localhost:26257?sslmode=disable' -e "CREATE SCHEDULE IF NOT EXISTS test_only_backup FOR BACKUP INTO 'gs://cockroachdb-backup-testing/roachprod-scheduled-backups/teamcity-9244183-1679721573-114-n9cpu1/1679765745166115497?AUTH=implicit' RECURRING '*/15 * * * *' FULL BACKUP '@hourly' WITH SCHEDULE OPTIONS first_run = 'now'"
ERROR: server closed the connection.
Is this a CockroachDB node?
unexpected EOF
Failed running "sql": COMMAND_PROBLEM: ssh verbose log retained in ssh_173545.166149846_n1_init-backup-schedule.log: exit status 1

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=1 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!


tbg commented Mar 27, 2023

I think it's this bit that prevents the assertion I put in from firing:

// If the lease holder promised to not propose any commands below
// MinProposedTS, it must also not be allowed to extend a lease before that
// timestamp. We make sure that when a node restarts, its earlier in-flight
// commands (which are not tracked by the spanlatch manager post restart)
// receive an error under the new lease by making sure the sequence number
// of that lease is higher. This in turn is achieved by forwarding its start
// time here, which makes it not Equivalent() to the preceding lease for the
// same store.
//
// Note also that leasePostApply makes sure to update the timestamp cache in
// this case: even though the lease holder does not change, the sequence
// number does and this triggers a low water mark bump.
//
// The bug prevented with this is unlikely to occur in practice
// since earlier commands usually apply before this lease will.
if ts := args.MinProposedTS; isExtension && ts != nil {
    effectiveStart.Forward(*ts)
}

When the node starts up, we set minLeaseProposedTS to Now(), precisely to prevent re-use of leases from an earlier incarnation:

// Only do this if there was a previous lease. This shouldn't be important
// to do but consider that the first lease which is obtained is back-dated
// to a zero start timestamp (and this de-flakes some tests). If we set the
// min proposed TS here, this lease could not be renewed (by the semantics
// of minLeaseProposedTS); and since minLeaseProposedTS is copied on splits,
// this problem would multiply to a number of replicas at cluster bootstrap.
// Instead, we make the first lease special (which is OK) and the problem
// disappears.
if r.mu.state.Lease.Sequence > 0 {
    r.mu.minLeaseProposedTS = r.Clock().NowAsClockTimestamp()
}
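
Taken together, my reading is roughly this (an illustrative sketch, not the actual lease code; bumpSequence stands in for the real sequence assignment):

// After a restart, minLeaseProposedTS is Now(), so when the same store
// extends the lease it held before the restart, the request's MinProposedTS
// forwards the new lease's start time past the old lease's start time:
effectiveStart.Forward(*args.MinProposedTS)
// A lease whose start time moved is no longer Equivalent() to its
// predecessor, and a non-equivalent lease is assigned a higher sequence
// number, so leasePostApply sees the lease as changing hands after all --
// which would explain why the assertion in #99499 never fired locally.
if !newLease.Equivalent(prevLease) {
    bumpSequence(&newLease) // placeholder for the real sequence assignment
}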


tbg commented Mar 27, 2023

Roachstress is running for 574ded4 (presumed "bad") and 0fda198 (its left parent, presumed "good"). We also have the known-bad SHA 80c4895.

presumed bad: 10/10 passed
presumed good: 7/7 passed (then roachtest got stuck and I canceled it)
known bad: 8/10 passed, 2/10 failed with the symptoms seen in this issue.

I'll try "presumed bad" for another 20 iterations. If it's still not failing then, I'll have to assume the problem is in another PR.

Unfortunately I'm having some trouble with roachtest/roachprod: cluster creation sometimes hangs for 20m+. I suspected a deadlock but am not sure. I collected some stack traces, but for now I'll focus on the bisection.

update: seeing a cluster hung with these exact symptoms in "presumed bad". Its gossip network is missing the liveness info for n7-n9, as expected:

[screenshot: gossip network on the hung cluster, missing liveness info for n7-n9]

"presumed good" passed another 20+ times, there is a different failure mode that I also saw on "presumed bad", so I think it is unrelated and we can call "presumed good" "good".

@tbg tbg self-assigned this Mar 27, 2023

blathers-crl bot commented Mar 27, 2023

cc @cockroachdb/replication

@tbg tbg removed the T-kv KV Team label Mar 27, 2023

tbg commented Mar 27, 2023

Once GitHub is back up, I'll send the revert and backport it to 22.2 and 22.1.

craig bot pushed a commit that referenced this issue Mar 27, 2023
99643: kvserver: revert #98150 r=tbg a=tbg

This reverts #98150 because we think it introduced a problem that was detected
via the `replicate/wide` roachtest[^1]. It seems that for reasons still unknown
we do rely on the periodic gossip trigger on liveness lease extensions.

[^1]: #99268 (comment)

Touches #99268.
Touches #97966.
Closes #99268.
Touches #98945.

Tracked in #99652.

Epic: none
Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
@craig craig bot closed this as completed in 49d001d Mar 27, 2023
blathers-crl bot pushed a commit that referenced this issue Mar 27, 2023
This reverts #98150 because we think it introduced a problem that was detected
via the `replicate/wide` roachtest[^1]. It seems that for reasons still unknown
we do rely on the periodic gossip trigger on liveness lease extensions.

[^1]: #99268 (comment)

Touches #99268.
Touches #97966.
Closes #99268.
Touches #98945.

Epic: none
Release note: None
erikgrinaker added a commit to erikgrinaker/cockroach that referenced this issue Mar 27, 2023
This reverts cockroachdb#98150 because we think it introduced a problem that was detected
via the `replicate/wide` roachtest[^1]. It seems that for reasons still unknown
we do rely on the periodic gossip trigger on liveness lease extensions.

[^1]: cockroachdb#99268 (comment)

Touches cockroachdb#99268.
Touches cockroachdb#97966.
Closes cockroachdb#99268.
Touches cockroachdb#98945.

Epic: none
Release note: None
blathers-crl bot pushed a commit that referenced this issue Mar 27, 2023
This reverts #98150 because we think it introduced a problem that was detected
via the `replicate/wide` roachtest[^1]. It seems that for reasons still unknown
we do rely on the periodic gossip trigger on liveness lease extensions.

[^1]: #99268 (comment)

Touches #99268.
Touches #97966.
Closes #99268.
Touches #98945.

Epic: none
Release note: None
celiala pushed a commit that referenced this issue Mar 27, 2023
This reverts #98150 because we think it introduced a problem that was detected
via the `replicate/wide` roachtest[^1]. It seems that for reasons still unknown
we do rely on the periodic gossip trigger on liveness lease extensions.

[^1]: #99268 (comment)

Touches #99268.
Touches #97966.
Closes #99268.
Touches #98945.

Epic: none
Release note: None