
roachtest: replicate/wide failed #99268

Closed
cockroach-teamcity opened this issue Mar 22, 2023 · 10 comments · Fixed by #99643


cockroach-teamcity commented Mar 22, 2023

roachtest.replicate/wide failed with artifacts on release-23.1 @ 80c4895c566a7eaa6f16c4098980509dd3795ad7:

test artifacts and logs in: /artifacts/replicate/wide/run_1
(test_runner.go:1008).runTest: test timed out (10m0s)
(cluster.go:1896).Start: ~ COCKROACH_CONNECT_TIMEOUT=0 ./cockroach sql --url 'postgres://root@localhost:26257?sslmode=disable' -e "CREATE SCHEDULE IF NOT EXISTS test_only_backup FOR BACKUP INTO 'gs://cockroachdb-backup-testing/roachprod-scheduled-backups/teamcity-9182772-1679462401-116-n9cpu1/1679504844327457704?AUTH=implicit' RECURRING '*/15 * * * *' FULL BACKUP '@hourly' WITH SCHEDULE OPTIONS first_run = 'now'"
ERROR: server closed the connection.
Is this a CockroachDB node?
unexpected EOF
Failed running "sql": COMMAND_PROBLEM: ssh verbose log retained in ssh_170724.327597136_n1_init-backup-schedule.log: exit status 1

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=1 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-25795

@cockroach-teamcity cockroach-teamcity added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 22, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Mar 22, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Mar 22, 2023

tbg commented Mar 24, 2023

Stuck here

// Stop the cluster and restart 2/3 of the nodes.
c.Stop(ctx, t.L(), option.DefaultStopOpts())
tBeginDown := timeutil.Now()
c.Start(ctx, t.L(), startOpts, settings, c.Range(1, 6))

when attempting to create a backup schedule:

goroutine 5607337 [select, 8 minutes]:
github.com/cockroachdb/cockroach/pkg/roachprod/install.(*remoteSession).CombinedOutput(0xc002312540, {0x79f3b38, 0xc00da62500})
        github.com/cockroachdb/cockroach/pkg/roachprod/install/session.go:155 +0x165
github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).createFixedBackupSchedule(0xc0074c65a0, {0x79f3b38, 0xc00da62500}, 0x2?, {0x7983bf0, 0x0})
        github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:840 +0x57a
github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Start(0xc0074c65a0, {0x79f3b38?, 0xc00da62500}, 0xc00ed2aba0, {0x0, 0x1, {0xc009a18ba0, 0x1, 0x1}, 0x1, ...})

url := c.NodeURL("localhost", c.NodePort(node), "" /* tenantName */)
fullCmd := fmt.Sprintf(`COCKROACH_CONNECT_TIMEOUT=0 %s sql --url %s -e %q`,
    binary, url, createScheduleCmd)
// Instead of using `c.ExecSQL()`, use the more flexible c.newSession(), which allows us to
// 1) prefix the schedule backup cmd with COCKROACH_CONNECT_TIMEOUT=0.
// 2) run the command against the first node in the cluster target.
sess := c.newSession(l, node, fullCmd, withDebugName("init-backup-schedule"))
defer sess.Close()
out, err := sess.CombinedOutput(ctx)

I think there's an internal endless DistSender retry loop (I also see a stuck SQL statement in DistSender waiting to hear back from single-range RPCs, like this stack here):

[screenshot: stack trace of the stuck SQL statement in DistSender]

It's this retry loop:

// Start a retry loop for sending the batch to the range. Each iteration of
// this loop uses a new descriptor. Attempts to send to multiple replicas in
// this descriptor are done at a lower level.
tBegin, attempts := timeutil.Now(), int64(0) // for slow log message
// prevTok maintains the EvictionToken used on the previous iteration.
var prevTok rangecache.EvictionToken
for r := retry.StartWithCtx(ctx, ds.rpcRetryOptions); r.Next(); {
    attempts++

There are various "slow range RPC":

teamcity-9182772-1679462401-116-n9cpu1-0003> W230322 17:08:23.950127 12420 kv/kvclient/kvcoord/dist_sender.go:1675 ⋮ [T1,n3,intExec=‹protectedts-GetMetadata›] 134  slow range RPC: have been waiting 60.18s (65 attempts) for RPC Get [/Table/5/1/‹1›/‹2›/‹1›,/Min), [txn: 26e77d8e] to r9:‹/Table/{5-6}› [(n8,s8):6, (n2,s2):2, (n9,s9):7, (n5,s5):4, (n3,s3):5, (n6,s6):8, (n4,s4):9, (n7,s7):10, (n1,s1):11, next=12, gen=24]; resp: ‹failed to send RPC: sending to all replicas failed; last error: [NotLeaseHolderError] can't determine lease status of (n8,s8):6VOTER_INCOMING due to node liveness error: liveness record not found in cache; r9: replica (n1,s1):11 not lease holder; lease holder unknown›

Note that only n1-n6 are up right now. n7, n8, n9 are intentionally down. What should happen is that such a request would quickly bounce to one of the live nodes (which are all in the descriptor and we shouldn't have lost quorum here). But it doesn't seem to happen.

Have to run now, need to investigate more later.


tbg commented Mar 24, 2023

Note that only n1-n6 are up right now. n7, n8, n9 are intentionally down. What should happen is that such a request would quickly bounce to one of the live nodes (which are all in the descriptor and we shouldn't have lost quorum here). But it doesn't seem to happen.

I was a bit sloppy here. We're getting NLHEs back, so these are clearly coming from n1-n6 (a down node can't respond). For example, the one above came from replica (n1,s1):11 not lease holder; lease holder unknown.

I see these messages from all replicas (for different RPCs), so it's a fair assumption that they are all responding in that same way. Why is nobody acquiring the lease?

This probably has to do with the message

can't determine lease status of (n8,s8):6VOTER_INCOMING due to node liveness error: liveness record not found in cache

which we see off and on, for n7-n9 (i.e. the down nodes). This originates here:

l, ok := r.store.cfg.NodeLiveness.GetLiveness(lease.Replica.NodeID)
status.Liveness = l.Liveness
if !ok || status.Liveness.Epoch < lease.Epoch {
    // If lease validity can't be determined (e.g. gossip is down
    // and liveness info isn't available for owner), we can neither
    // use the lease nor do we want to attempt to acquire it.
    var msg redact.StringBuilder
    if !ok {
        msg.Printf("can't determine lease status of %s due to node liveness error: %v",
            lease.Replica, liveness.ErrRecordCacheMiss)
    } else {
        msg.Printf("can't determine lease status of %s because node liveness info for n%d is stale. lease: %s, liveness: %s",
            lease.Replica, lease.Replica.NodeID, lease, l.Liveness)
    }
    if leaseStatusLogLimiter.ShouldLog() {
        log.Infof(ctx, "%s", msg)
    }
    status.State = kvserverpb.LeaseState_ERROR
    status.ErrInfo = msg.String()
    return status
}

I think the effect of that is to prevent lease acquisitions: it returns an "error" lease state, but we need an "expired" one if we are to take over the lease.
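
Roughly, the caller ends up in a situation like this (a simplified sketch for illustration only, not the actual replica code; the comments describe the states as I understand them):

switch status.State {
case kvserverpb.LeaseState_VALID:
    // Someone holds a usable lease: serve locally or redirect to the holder.
case kvserverpb.LeaseState_EXPIRED:
    // No usable lease: this replica is allowed to try to acquire one.
case kvserverpb.LeaseState_ERROR:
    // Can't tell (here: liveness record missing from the cache), so we
    // neither use the lease nor try to acquire it; the request bounces
    // back to the DistSender as a NotLeaseHolderError with an unknown
    // lease holder -- which matches the log message above.
}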

What should have happened is that the liveness of all nodes, including down ones, was gossiped when the (expiration-based) liveness lease was acquired. As it happens, we just made a change in that area two weeks ago:

#98150

Prior to this PR, all livenesses were gossiped on each extension, i.e. quite frequently. Now this only happens once, when the lease is first acquired. It's possible that there is a bug where this trigger does not fire.

Is it possible that leaseChangingHands ends up being false if the pre-restart leaseholder comes back up and still finds its lease valid?

leaseChangingHands := prevLease.Replica.StoreID != newLease.Replica.StoreID || prevLease.Sequence != newLease.Sequence

Then, even though the whole-cluster downtime cleared cluster-wide gossip state, the effectively "new" (same) leaseholder would not re-gossip, and so the liveness info for the nodes that remain down would never reappear in gossip.
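
As a hypothetical sketch of that gating (only leaseChangingHands is taken from the snippet above; isLivenessRange and gossipAllLivenessRecords are illustrative placeholders):

leaseChangingHands := prevLease.Replica.StoreID != newLease.Replica.StoreID ||
    prevLease.Sequence != newLease.Sequence
if leaseChangingHands && isLivenessRange { // isLivenessRange: placeholder
    // Since #98150 this is the only place where all liveness records get
    // (re-)gossiped. If the pre-restart leaseholder keeps an equivalent
    // lease (same store, same sequence), this branch is skipped and the
    // records for the still-down nodes n7-n9 never reappear in gossip.
    gossipAllLivenessRecords(ctx) // placeholder for the actual gossip call
}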

It seems like a likely explanation. cc @nvanbenschoten

@tbg tbg added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 24, 2023
tbg added a commit to tbg/cockroach that referenced this issue Mar 24, 2023
…ollowing restart

See: cockroachdb#99268 (comment)

This also sneaks in the logging requested in cockroachdb#99472.

Epic: none
Release note: None

tbg commented Mar 24, 2023

I was hoping that putting in an assertion and restarting a local single-node cluster would prove it, but I've run it a few times with no success so far.

#99499


tbg commented Mar 24, 2023

Still, reading the code, all indications are that a node re-requesting a lease it previously held would not bump the sequence number.

@cockroach-teamcity

roachtest.replicate/wide failed with artifacts on release-23.1 @ 52e55d2ef172b7cfec14e8a0a954f8864b2be779:

test artifacts and logs in: /artifacts/replicate/wide/run_1
(test_runner.go:1008).runTest: test timed out (10m0s)
(cluster.go:1896).Start: ~ COCKROACH_CONNECT_TIMEOUT=0 ./cockroach sql --url 'postgres://root@localhost:26257?sslmode=disable' -e "CREATE SCHEDULE IF NOT EXISTS test_only_backup FOR BACKUP INTO 'gs://cockroachdb-backup-testing/roachprod-scheduled-backups/teamcity-9225816-1679635256-116-n9cpu1/1679681580014183203?AUTH=implicit' RECURRING '*/15 * * * *' FULL BACKUP '@hourly' WITH SCHEDULE OPTIONS first_run = 'now'"
ERROR: server closed the connection.
Is this a CockroachDB node?
unexpected EOF
Failed running "sql": COMMAND_PROBLEM: ssh verbose log retained in ssh_181300.014211626_n1_init-backup-schedule.log: exit status 1

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=1 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity

roachtest.replicate/wide failed with artifacts on release-23.1 @ f351747ed97862fc037717cadec23f18073fb6be:

test artifacts and logs in: /artifacts/replicate/wide/run_1
(test_runner.go:1008).runTest: test timed out (10m0s)
(cluster.go:1896).Start: ~ COCKROACH_CONNECT_TIMEOUT=0 ./cockroach sql --url 'postgres://root@localhost:26257?sslmode=disable' -e "CREATE SCHEDULE IF NOT EXISTS test_only_backup FOR BACKUP INTO 'gs://cockroachdb-backup-testing/roachprod-scheduled-backups/teamcity-9244183-1679721573-114-n9cpu1/1679765745166115497?AUTH=implicit' RECURRING '*/15 * * * *' FULL BACKUP '@hourly' WITH SCHEDULE OPTIONS first_run = 'now'"
ERROR: server closed the connection.
Is this a CockroachDB node?
unexpected EOF
Failed running "sql": COMMAND_PROBLEM: ssh verbose log retained in ssh_173545.166149846_n1_init-backup-schedule.log: exit status 1

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=1 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!


tbg commented Mar 27, 2023

I think it's this bit that prevents the assertion I put in from firing:

// If the lease holder promised to not propose any commands below
// MinProposedTS, it must also not be allowed to extend a lease before that
// timestamp. We make sure that when a node restarts, its earlier in-flight
// commands (which are not tracked by the spanlatch manager post restart)
// receive an error under the new lease by making sure the sequence number
// of that lease is higher. This in turn is achieved by forwarding its start
// time here, which makes it not Equivalent() to the preceding lease for the
// same store.
//
// Note also that leasePostApply makes sure to update the timestamp cache in
// this case: even though the lease holder does not change, the sequence
// number does and this triggers a low water mark bump.
//
// The bug prevented with this is unlikely to occur in practice
// since earlier commands usually apply before this lease will.
if ts := args.MinProposedTS; isExtension && ts != nil {
    effectiveStart.Forward(*ts)
}

When the node starts up, we set minLeaseProposedTS to Now(), precisely to prevent re-use of leases from an earlier incarnation:

// Only do this if there was a previous lease. This shouldn't be important
// to do but consider that the first lease which is obtained is back-dated
// to a zero start timestamp (and this de-flakes some tests). If we set the
// min proposed TS here, this lease could not be renewed (by the semantics
// of minLeaseProposedTS); and since minLeaseProposedTS is copied on splits,
// this problem would multiply to a number of replicas at cluster bootstrap.
// Instead, we make the first lease special (which is OK) and the problem
// disappears.
if r.mu.state.Lease.Sequence > 0 {
    r.mu.minLeaseProposedTS = r.Clock().NowAsClockTimestamp()
}
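
Taken together, my reading is roughly this (an illustrative sketch, not the actual lease code; bumpSequence stands in for the real sequence assignment):

// After a restart, minLeaseProposedTS is Now(), so when the same store
// extends the lease it held before the restart, the request's MinProposedTS
// forwards the new lease's start time past the old lease's start time:
effectiveStart.Forward(*args.MinProposedTS)
// A lease whose start time moved is no longer Equivalent() to its
// predecessor, and a non-equivalent lease is assigned a higher sequence
// number, so leasePostApply sees the lease as changing hands after all --
// which would explain why the assertion in #99499 never fired locally.
if !newLease.Equivalent(prevLease) {
    bumpSequence(&newLease) // placeholder for the real sequence assignment
}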


tbg commented Mar 27, 2023

Roachstress is running for 574ded4 (presumed "bad") and 0fda198 (its left parent, presumed "good"). We also have the known-bad SHA 80c4895.

presumed bad: 10/10 passed
presumed good: 7/7 passed (then roachtest got stuck and I canceled it)
known bad: 8/10 passed, 2/10 failed with the symptoms seen in this issue.

I'll try "presumed bad" for another 20 iterations. If it's still not failing then, I'll have to assume the problem is in another PR.

Unfortunately I'm having some trouble with roachtest/roachprod: cluster creation sometimes hangs for 20m+. I suspected a deadlock but am not sure. I collected some stack traces, but for now I'll focus on the bisection.

update: seeing a cluster hung with these exact symptoms in "presumed bad". Its gossip network is missing the liveness info for n7-n9, as expected:

[screenshot: gossip network on the hung cluster, missing liveness info for n7-n9]

"presumed good" passed another 20+ times, there is a different failure mode that I also saw on "presumed bad", so I think it is unrelated and we can call "presumed good" "good".

@tbg tbg self-assigned this Mar 27, 2023

blathers-crl bot commented Mar 27, 2023

cc @cockroachdb/replication

@tbg tbg removed the T-kv KV Team label Mar 27, 2023

tbg commented Mar 27, 2023

Once GitHub is back up, I'll send the revert and backport it to 22.2 and 22.1.

craig bot pushed a commit that referenced this issue Mar 27, 2023
99643: kvserver: revert #98150 r=tbg a=tbg

This reverts #98150 because we think it introduced a problem that was detected
via the `replicate/wide` roachtest[^1]. It seems that for reasons still unknown
we do rely on the periodic gossip trigger on liveness lease extensions.

[^1]: #99268 (comment)

Touches #99268.
Touches #97966.
Closes #99268.
Touches #98945.

Tracked in #99652.

Epic: none
Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
@craig craig bot closed this as completed in 49d001d Mar 27, 2023
blathers-crl bot pushed a commit that referenced this issue Mar 27, 2023
This reverts #98150 because we think it introduced a problem that was detected
via the `replicate/wide` roachtest[^1]. It seems that for reasons still unknown
we do rely on the periodic gossip trigger on liveness lease extensions.

[^1]: #99268 (comment)

Touches #99268.
Touches #97966.
Closes #99268.
Touches #98945.

Epic: none
Release note: None
erikgrinaker added a commit to erikgrinaker/cockroach that referenced this issue Mar 27, 2023
This reverts cockroachdb#98150 because we think it introduced a problem that was detected
via the `replicate/wide` roachtest[^1]. It seems that for reasons still unknown
we do rely on the periodic gossip trigger on liveness lease extensions.

[^1]: cockroachdb#99268 (comment)

Touches cockroachdb#99268.
Touches cockroachdb#97966.
Closes cockroachdb#99268.
Touches cockroachdb#98945.

Epic: none
Release note: None
blathers-crl bot pushed a commit that referenced this issue Mar 27, 2023
This reverts #98150 because we think it introduced a problem that was detected
via the `replicate/wide` roachtest[^1]. It seems that for reasons still unknown
we do rely on the periodic gossip trigger on liveness lease extensions.

[^1]: #99268 (comment)

Touches #99268.
Touches #97966.
Closes #99268.
Touches #98945.

Epic: none
Release note: None
celiala pushed a commit that referenced this issue Mar 27, 2023
This reverts #98150 because we think it introduced a problem that was detected
via the `replicate/wide` roachtest[^1]. It seems that for reasons still unknown
we do rely on the periodic gossip trigger on liveness lease extensions.

[^1]: #99268 (comment)

Touches #99268.
Touches #97966.
Closes #99268.
Touches #98945.

Epic: none
Release note: None