Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: clearrange/checks=false/rangeTs=false failed #104016

Closed
cockroach-teamcity opened this issue May 28, 2023 · 5 comments · Fixed by #104128
Closed

roachtest: clearrange/checks=false/rangeTs=false failed #104016

cockroach-teamcity opened this issue May 28, 2023 · 5 comments · Fixed by #104128
Assignees
Labels
A-kv Anything in KV that doesn't belong in a more specific category. branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented May 28, 2023

roachtest.clearrange/checks=false/rangeTs=false failed with artifacts on master @ dd8dbfc4fa61763dc4fa8fbc19d3a7f2a30bb1e2:

test artifacts and logs in: /artifacts/clearrange/checks=false/rangeTs=false/run_1
(cluster.go:1995).Start: ~ COCKROACH_CONNECT_TIMEOUT=1200 ./cockroach sql --url 'postgres://root@localhost:26257?sslmode=disable' -e "CREATE SCHEDULE IF NOT EXISTS test_only_backup FOR BACKUP INTO 'gs://cockroachdb-backup-testing/roachprod-scheduled-backups/teamcity-10281147-1685251278-43-n10cpu16/1685270926162596245?AUTH=implicit' RECURRING '*/15 * * * *' FULL BACKUP '@hourly' WITH SCHEDULE OPTIONS first_run = 'now'"
ERROR: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.
dial tcp 127.0.0.1:26257: connect: connection refused
Failed running "sql": COMMAND_PROBLEM: exit status 1
(test_runner.go:1109).func1: 4 dead node(s) detected

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-28313

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels May 28, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone May 28, 2023
@blathers-crl blathers-crl bot added the T-storage Storage Team label May 28, 2023
@jbowens jbowens marked this as a duplicate and then as not a duplicate of #104078 May 30, 2023
@jbowens
Copy link
Collaborator

jbowens commented May 30, 2023

a bunch of nodes failed to come back up:

10:04:35 cockroach.go:826: teamcity-10281147-1685251278-43-n10cpu16: creating backup schedule
10:04:37 cockroach.go:857: schedule_id	label	status	first_run	schedule	backup_stmt
869000404975878145	test_only_backup	PAUSED: Waiting for initial backup to complete	NULL	*/15 * * * *	BACKUP INTO LATEST IN 'gs://cockroachdb-backup-testing/roachprod-scheduled-backups/teamcity-10281147-1685251278-43-n10cpu16/1685268275988772044?AUTH=implicit' WITH detached
869000404981907457	test_only_backup	ACTIVE	2023-05-28 10:04:37.171545+00	@hourly	BACKUP INTO 'gs://cockroachdb-backup-testing/roachprod-scheduled-backups/teamcity-10281147-1685251278-43-n10cpu16/1685268275988772044?AUTH=implicit' WITH detached
10:48:14 clearrange.go:93: import took 2616.95s
10:48:14 cluster.go:689: test status: stopping cluster
teamcity-10281147-1685251278-43-n10cpu16: stopping
10:48:17 cluster.go:689: test status: starting cluster
10:48:17 cockroach.go:184: teamcity-10281147-1685251278-43-n10cpu16: starting nodes
10:48:20 cockroach.go:648: teamcity-10281147-1685251278-43-n10cpu16: initializing cluster
10:48:21 cockroach.go:666: teamcity-10281147-1685251278-43-n10cpu16: setting cluster settings
10:48:46 cockroach.go:826: teamcity-10281147-1685251278-43-n10cpu16: creating backup schedule
10:48:47 test_impl.go:348: test failure #1: full stack retained in failure_1.log: (cluster.go:1995).Start: ~ COCKROACH_CONNECT_TIMEOUT=1200 ./cockroach sql --url 'postgres://root@localhost:26257?sslmode=disable' -e "CREATE SCHEDULE IF NOT EXISTS test_only_backup FOR BACKUP INTO 'gs://cockroachdb-backup-testing/roachprod-scheduled-backups/teamcity-10281147-1685251278-43-n10cpu16/1685270926162596245?AUTH=implicit' RECURRING '*/15 * * * *' FULL BACKUP '@hourly' WITH SCHEDULE OPTIONS first_run = 'now'"
ERROR: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.
dial tcp 127.0.0.1:26257: connect: connection refused
Failed running "sql": COMMAND_PROBLEM: exit status 1
10:48:47 test_runner.go:1023: tearing down after failure; see teardown.log

n1:

I230528 10:48:32.901601 3353 sql/sqlstats/persistedsqlstats/provider.go:167 ⋮ [T1,n1] 130  starting sql-stats-worker with initial delay: 10m17.734078344s
I230528 10:48:32.901978 102 upgrade/upgrademanager/manager.go:170 ⋮ [T1,n1] 131  running permanent upgrades up to version: 23.1-8
I230528 10:48:32.902965 3387 kv/kvclient/rangefeed/rangefeedcache/watcher.go:335 ⋮ [T1,n1] 132  system-config-cache: established range feed cache
I230528 10:48:32.903476 3352 sql/temporary_schema.go:505 ⋮ [T1,n1] 133  running temporary object cleanup background job
I230528 10:48:32.921699 1 1@cli/start.go:1020 ⋮ [T1,n1] 134  initiating graceful shutdown of server
I230528 10:48:32.921844 1 1@cli/start.go:1081 ⋮ [T1,n1] 135  too early to drain; used hard shutdown instead
E230528 10:48:32.923737 1 1@cli/clierror/check.go:35 ⋮ [-] 136  server startup failed: cockroach server exited with error: ‹migration-job-find-already-completed›: key range id:42 is unavailable: ‹failed to send RPC: no replica node information available via gossip for r42›
I230528 10:48:32.926380 3352 sql/temporary_schema.go:555 ⋮ [T1,n1] 137  found 0 temporary schemas
I230528 10:48:32.926421 3352 sql/temporary_schema.go:558 ⋮ [T1,n1] 138  early exiting temporary schema cleaner as no temporary schemas were found
I230528 10:48:32.926439 3352 sql/temporary_schema.go:559 ⋮ [T1,n1] 139  completed temporary object cleanup job
I230528 10:48:32.926453 3352 sql/temporary_schema.go:640 ⋮ [T1,n1] 140  temporary object cleaner next scheduled to run at 2023-05-28 11:18:32.901565352 +0000 UTC m=+1813.720380332

n2:

I230528 10:48:32.903412 3069 kv/kvclient/rangefeed/rangefeedcache/watcher.go:335 ⋮ [T1,n2] 107  system-config-cache: established range feed cache
I230528 10:48:32.919891 1 1@cli/start.go:1020 ⋮ [T1,n2] 108  initiating graceful shutdown of server
I230528 10:48:32.920018 1 1@cli/start.go:1081 ⋮ [T1,n2] 109  too early to drain; used hard shutdown instead
E230528 10:48:32.920744 1 1@cli/clierror/check.go:35 ⋮ [-] 110  server startup failed: cockroach server exited with error: ‹migration-job-find-already-completed›: key range id:42 is unavailable: ‹failed to send RPC: no replica node information available via gossip for r42›

n3:

I230528 10:48:32.903398 2881 sql/temporary_schema.go:640 ⋮ [T1,n3] 83  temporary object cleaner next scheduled to run at 2023-05-28 11:18:32.903288847 +0000 UTC m=+1807.005457789
I230528 10:48:32.904430 104 upgrade/upgrademanager/manager.go:170 ⋮ [T1,n3] 84  running permanent upgrades up to version: 23.1-8
I230528 10:48:32.904925 3015 kv/kvclient/rangefeed/rangefeedcache/watcher.go:335 ⋮ [T1,n3] 85  system-config-cache: established range feed cache
I230528 10:48:32.920630 1 1@cli/start.go:1020 ⋮ [T1,n3] 86  initiating graceful shutdown of server
I230528 10:48:32.920766 1 1@cli/start.go:1081 ⋮ [T1,n3] 87  too early to drain; used hard shutdown instead
E230528 10:48:32.924899 1 1@cli/clierror/check.go:35 ⋮ [-] 88  server startup failed: cockroach server exited with error: ‹migration-job-find-already-completed›: key range id:42 is unavailable: ‹failed to send RPC: no replica node information available via gossip for r42›

n4:

I230528 10:48:32.903987 1780 kv/kvclient/rangefeed/rangefeedcache/watcher.go:335 ⋮ [T1,n4] 69  system-config-cache: established range feed cache
I230528 10:48:32.920182 1 1@cli/start.go:1020 ⋮ [T1,n4] 70  initiating graceful shutdown of server
I230528 10:48:32.920260 1 1@cli/start.go:1081 ⋮ [T1,n4] 71  too early to drain; used hard shutdown instead
E230528 10:48:32.920747 1 1@cli/clierror/check.go:35 ⋮ [-] 72  server startup failed: cockroach server exited with error: ‹migration-job-find-already-completed›: key range id:42 is unavailable: ‹failed to send RPC: no replica node information available via gossip for r42›

@blathers-crl blathers-crl bot added the T-server-and-security DB Server & Security label May 30, 2023
@jbowens
Copy link
Collaborator

jbowens commented May 30, 2023

I'm not sure what's up, or who should take a look at this. @knz do you mind helping me triage this?

@knz
Copy link
Contributor

knz commented May 30, 2023

Yep this looks like kv.

@jbowens jbowens added A-kv Anything in KV that doesn't belong in a more specific category. T-kv KV Team labels May 30, 2023
@jbowens jbowens removed T-server-and-security DB Server & Security T-storage Storage Team labels May 30, 2023
@jbowens
Copy link
Collaborator

jbowens commented May 30, 2023

thanks!

@tbg
Copy link
Member

tbg commented May 31, 2023

The error originates here:

return nil, newSendError(
fmt.Sprintf("no replica node information available via gossip for r%d", desc.RangeID))
}

and is wrapped here:

replicas, err := kvcoord.NewReplicaSlice(ctx, nodeDescs, desc, nil, filter)
if err != nil {
return kvcoord.ReplicaSlice{}, sqlerrors.NewRangeUnavailableError(desc.RangeID, err)
}

We have recently done some work to make operations in the critical startup path retry on certain errors (in particular circuit breaker errors): #97710.

However, here we are returning a pgerror, but we only retry on kv-level ambigous results and unavailability errors: https://github.com/cockroachdb/cockroach/pull/97710/files#diff-28828ba015feb6cc7f09785666cb858af92fa63899aa635ce187e4c9f9f266fdR204

I think we hit this code through this call:

if err := s.upgradeManager.RunPermanentUpgrades(
ctx,
s.cfg.Settings.Version.ActiveVersion(ctx).Version, /* upToVersion */
); err != nil {
return err
}

which calls into this code:

lastUpgradeCompleted, err := startup.RunIdempotentWithRetryEx(ctx,
m.deps.Stopper.ShouldQuiesce(),
"check if migration completed",
func(ctx context.Context) (bool, error) {
return migrationstable.CheckIfMigrationCompleted(
ctx, lastVer, nil /* txn */, m.ie,
// We'll do a follower read. This is all best effort anyway, and the
// follower read should keep the startup time low in the common case where
// all upgrades have run a long time ago before this node start.
enterpriseEnabled,
migrationstable.StaleRead)
})

Note that this is wrapped in startup.RunIdempotentWithRetryEx, however we happen to not retry this particular error.

@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels May 31, 2023
craig bot pushed a commit that referenced this issue Jun 2, 2023
104128: startup: retry pgerror.Code == RangeUnavailable r=aliher1911 a=tbg

Closes #104016.

Epic: None
Release note (bug fix): a bug was fixed due to which nodes could terminate with the following message:
    server startup failed: cockroach server exited with error:
    ‹migration-job-find-already-completed›: key range id:X is unavailable: ‹failed
    to send RPC: no replica node information available via gossip for rX›


Co-authored-by: Tobias Grieger <[email protected]>
@craig craig bot closed this as completed in df97a99 Jun 2, 2023
blathers-crl bot pushed a commit that referenced this issue Jun 2, 2023
Closes #104016.

Epic: None
Release note (bug fix): a bug was fixed due to which nodes could terminate with the following message:
    server startup failed: cockroach server exited with error:
    ‹migration-job-find-already-completed›: key range id:X is unavailable: ‹failed
    to send RPC: no replica node information available via gossip for rX›
tbg added a commit to tbg/cockroach that referenced this issue Jun 2, 2023
Closes cockroachdb#104016.

Epic: None
Release note (bug fix): a bug was fixed due to which nodes could terminate with the following message:
    server startup failed: cockroach server exited with error:
    ‹migration-job-find-already-completed›: key range id:X is unavailable: ‹failed
    to send RPC: no replica node information available via gossip for rX›
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv Anything in KV that doesn't belong in a more specific category. branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants