roachtest: decommission/mixed-versions failed #54908

Closed
cockroach-teamcity opened this issue Sep 29, 2020 · 14 comments

Comments

@cockroach-teamcity
Member

(roachtest).decommission/mixed-versions failed on release-20.2@bdb8cd0e7b2f25a08569a56464838486b6d16421:

		  |   | *
		  |   | * WARNING: RUNNING IN INSECURE MODE!
		  |   | * 
		  |   | * - Your cluster is open for any client that can access <all your IP addresses>.
		  |   | * - Any user, even root, can log in without providing a password.
		  |   | * - Any user, connecting as root, can read or write any data in your cluster.
		  |   | * - There is no network encryption nor authentication, and thus no confidentiality.
		  |   | * 
		  |   | * Check out how to secure your cluster: https://www.cockroachlabs.com/docs/v20.2/secure-a-cluster.html
		  |   | *
		  |   | *
		  |   | * ERROR: ERROR: cockroach server exited with error: store <no-attributes>=/mnt/data1/cockroach, last used with cockroach version v19.1-1, is too old for running version v20.1-21 (which requires data from v20.1 or later)
		  |   | *
		  |   | ERROR: cockroach server exited with error: store <no-attributes>=/mnt/data1/cockroach, last used with cockroach version v19.1-1, is too old for running version v20.1-21 (which requires data from v20.1 or later)
		  |   | Failed running "start"
		  |   | E200929 06:43:42.755194 1 cli/error.go:398  ERROR: exit status 1
		  |   | ERROR: exit status 1
		  |   | Failed running "start"
		  | Wraps: (3) exit status 1
		  | Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *exec.ExitError: 
		  | I200929 06:43:42.789482 1 cluster_synced.go:1733  command failed
		Wraps: (2) exit status 1
		Error types: (1) *main.withCommandDetails (2) *exec.ExitError

	cluster.go:1654,context.go:135,cluster.go:1643,test_runner.go:824: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2328570-1601361596-03-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1 3: 4266
		1: 5283
		2: 4430
		4: dead
		Error: UNCLASSIFIED_PROBLEM: 4: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1143
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:267
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 4: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
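
The dead node detection above reports one line per node, either `<node>: <pid>` or `<node>: dead`. A minimal Go sketch of reading that output follows. The `nodeStatus` type and the `parseMonitorOutput` and `deadNodes` helpers are hypothetical names made up for this illustration; they are not part of roachprod, and the line format is assumed only from the log shown here.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// nodeStatus is a hypothetical type for this sketch, not a roachprod type.
type nodeStatus struct {
	Node int
	PID  int // 0 when the node is reported dead
	Dead bool
}

// parseMonitorOutput reads lines of the form "<node>: <pid>" or
// "<node>: dead". Lines that do not match that shape are skipped.
func parseMonitorOutput(out string) []nodeStatus {
	var res []nodeStatus
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		parts := strings.SplitN(strings.TrimSpace(sc.Text()), ": ", 2)
		if len(parts) != 2 {
			continue
		}
		node, err := strconv.Atoi(parts[0])
		if err != nil {
			continue // e.g. "Error: UNCLASSIFIED_PROBLEM: 4: dead"
		}
		if parts[1] == "dead" {
			res = append(res, nodeStatus{Node: node, Dead: true})
			continue
		}
		pid, err := strconv.Atoi(parts[1])
		if err != nil {
			continue
		}
		res = append(res, nodeStatus{Node: node, PID: pid})
	}
	return res
}

// deadNodes returns the node numbers that were reported dead.
func deadNodes(sts []nodeStatus) []int {
	var dead []int
	for _, s := range sts {
		if s.Dead {
			dead = append(dead, s.Node)
		}
	}
	return dead
}

func main() {
	out := "3: 4266\n1: 5283\n2: 4430\n4: dead\nError: UNCLASSIFIED_PROBLEM: 4: dead\n"
	fmt.Println(deadNodes(parseMonitorOutput(out)))
}
```

In this run only node 4 is dead, which matches the node whose `start` command failed with the store version error.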


Artifacts: /decommission/mixed-versions
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-release-20.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Sep 29, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.2 milestone Sep 29, 2020
@nvanbenschoten
Member

ERROR: cockroach server exited with error: store <no-attributes>=/mnt/data1/cockroach, last used with cockroach version v19.1-1, is too old for running version v20.1-21 (which requires data from v20.1 or later)

Same as #54906.
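
For readers unfamiliar with this error, the check behind it can be sketched roughly as follows. This is an illustrative Go sketch only: the `version` type and `checkStoreVersion` function are hypothetical and are not CockroachDB's actual implementation. The assumption is that the version persisted in the store is compared against the minimum version the running binary supports.

```go
package main

import "fmt"

// version is a hypothetical stand-in for a cluster version such as
// "v19.1-1" (major 19, minor 1, unstable 1). It is not the real type.
type version struct {
	Major, Minor, Unstable int
}

// less reports whether v sorts strictly before o.
func (v version) less(o version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	if v.Minor != o.Minor {
		return v.Minor < o.Minor
	}
	return v.Unstable < o.Unstable
}

// String formats the version in the style used by the error message.
func (v version) String() string {
	if v.Unstable == 0 {
		return fmt.Sprintf("v%d.%d", v.Major, v.Minor)
	}
	return fmt.Sprintf("v%d.%d-%d", v.Major, v.Minor, v.Unstable)
}

// checkStoreVersion refuses to start when the version persisted in the
// store is older than the minimum the binary supports (assumed logic).
func checkStoreVersion(store, binary, minSupported version) error {
	if store.less(minSupported) {
		return fmt.Errorf(
			"store last used with cockroach version %s is too old for running version %s (which requires data from %s or later)",
			store, binary, minSupported)
	}
	return nil
}

func main() {
	// Values taken from the failure above.
	err := checkStoreVersion(
		version{Major: 19, Minor: 1, Unstable: 1},
		version{Major: 20, Minor: 1, Unstable: 21},
		version{Major: 20, Minor: 1},
	)
	fmt.Println(err)
}
```

Under this reading, the check itself behaves correctly; the surprising part is that the store claims to have last been used at v19.1-1 at all.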

@nvanbenschoten nvanbenschoten self-assigned this Sep 29, 2020
@nvanbenschoten
Member

This is strange. We see the logic to pick a predecessor version working as expected:

06:42:45 cluster.go:382: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod put teamcity-2328570-1601361596-03-n4cpu4:1-4 /home/agent/temp/buildTmp/cockroach-v20.1.4.linux-amd64 ./cockroach-20.1.4
teamcity-2328570-1601361596-03-n4cpu4: putting (dist) /home/agent/temp/buildTmp/cockroach-v20.1.4.linux-amd64 ./cockroach-20.1.4

So at no point during this test should the node have been running a v19.1-1 binary.

@nvanbenschoten
Member

I can reproduce this at will.

@nvanbenschoten
Member

The reproductions aren't as easy as I thought. But I was able to hit this again and take a look at the node in question. The v20.1 binary certainly looks like a real v20.1 binary.

ubuntu@nathan-1601486618-01-n4cpu4-0004:~$ ./cockroach-20.1.4 version
Build Tag:    v20.1.4
Build Time:   2020/07/29 22:56:36
Distribution: CCL
Platform:     linux amd64 (x86_64-unknown-linux-gnu)
Go Version:   go1.13.9
C Compiler:   gcc 6.3.0
Build SHA-1:  12049d3fe3650660e1b6abf1e522d9bb016acb88
Build Type:   release

But I do certainly see a v19.1-1 cluster version stored under the /Local/Store/clusterVersion key on the node that has only ever run this new binary:

ubuntu@nathan-1601517506-09-n4cpu4-0004:~$ ./cockroach debug keys --values /mnt/data1/cockroach
0,0 /Local/Store/clusterVersion (0x01736376657200): 19.1-1

0,0 /Local/Store/storeIdent (0x01736964656e00): {Txn:<nil> Timestamp:0,0 Deleted:false KeyBytes:0 ValBytes:0 RawBytes:[240 46 145 106 3 10 16 109 176 92 71 75 49 67 192 152 91 29 22 250 81 62 20 16 4 24 4] IntentHistory:[] MergeTimestamp:<nil>}

That output is from running the debug command with the following diff applied, so that the value is decoded as a ClusterVersion:

git diff

diff --git a/pkg/kv/kvserver/debug_print.go b/pkg/kv/kvserver/debug_print.go
index c179390..f5b1dc5 100644
--- a/pkg/kv/kvserver/debug_print.go
+++ b/pkg/kv/kvserver/debug_print.go
@@ -25,6 +25,7 @@ import (
        "github.com/cockroachdb/cockroach/pkg/util/protoutil"
        "github.com/cockroachdb/errors"
        "go.etcd.io/etcd/raft/raftpb"
+       "github.com/cockroachdb/cockroach/pkg/clusterversion"
 )

 // PrintKeyValue attempts to pretty-print the specified MVCCKeyValue to
@@ -226,7 +227,7 @@ func tryRaftLogEntry(kv storage.MVCCKeyValue) (string, error) {
 }

 func tryTxn(kv storage.MVCCKeyValue) (string, error) {
-       var txn roachpb.Transaction
+       var txn clusterversion.ClusterVersion
        if err := maybeUnmarshalInline(kv.Value, &txn); err != nil {
                return "", err
        }

It is strange to me that there are only two keys in the cockroach debug keys output.

@irfansharif does any of this ring any bells for you? I think you're most familiar with these decommission tests.

@irfansharif
Contributor

No I don't think I've seen this failure before. But happy to take this off your plate, I'm around this area (cluster version persistence) anyhow.

@tbg
Member

tbg commented Oct 2, 2020

I've been looking at this over in #54906, @irfansharif

@tbg tbg assigned tbg and unassigned irfansharif Oct 6, 2020
tbg added a commit to tbg/cockroach that referenced this issue Oct 6, 2020
Writes to a `storage.Engine` are not sync'ed by default, meaning that
they can get lost due to an ill-timed crash.

Fixes cockroachdb#54906.

(The backport will take care of cockroachdb#54908).

Release note (bug fix): a rare scenario in which a node would refuse
to start after updating the binary was fixed. The log message would
indicate: "store [...], last used with cockroach version [...], is too
old for running version [...] (which requires data from [...] or
later)".
tbg added a commit to tbg/cockroach that referenced this issue Oct 8, 2020
Writes to a RocksDB `storage.Engine` were not sync'ed by default,
meaning that they can get lost due to an ill-timed crash. They are now,
matching pebble's behavior. This affects only WriteClusterVersion,
updateBootstrapInfoLocked, WriteLastUpTimestamp, and Compactor.Suggest,
none of which are performance sensitive.

Fixes cockroachdb#54906.

(The backport will take care of cockroachdb#54908).

Release note (bug fix): a rare scenario in which a node would refuse
to start after updating the binary was fixed. The log message would
indicate: "store [...], last used with cockroach version [...], is too
old for running version [...] (which requires data from [...] or
later)".
craig bot pushed a commit that referenced this issue Oct 8, 2020
55240: kvserver: sync cluster version setting to store r=petermattis a=tbg

Writes to a `storage.Engine` are not sync'ed by default, meaning that
they can get lost due to an ill-timed crash.

Fixes #54906.

(The backport will take care of #54908).

Release note (bug fix): a rare scenario in which a node would refuse
to start after updating the binary was fixed. The log message would
indicate: "store [...], last used with cockroach version [...], is too
old for running version [...] (which requires data from [...] or
later)".

Co-authored-by: Tobias Grieger <[email protected]>
@nvanbenschoten
Member

@tbg we can close this issue now, right?

@thoszhang thoszhang removed branch-release-20.2 release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Oct 12, 2020
@tbg
Member

tbg commented Oct 13, 2020

No, needs backport, this is release-20.2 (though I am confused - @lucy-zhang , why are you removing the branch labels? We use them to track the branch the failure is on. This particular one has been fixed on master but not release-20.2).

tbg added a commit to tbg/cockroach that referenced this issue Oct 13, 2020
Writes to a RocksDB `storage.Engine` were not sync'ed by default,
meaning that they can get lost due to an ill-timed crash. They are now,
matching pebble's behavior. This affects only WriteClusterVersion,
updateBootstrapInfoLocked, WriteLastUpTimestamp, and Compactor.Suggest,
none of which are performance sensitive.

Fixes cockroachdb#54906.

(The backport will take care of cockroachdb#54908).

Release note (bug fix): a rare scenario in which a node would refuse
to start after updating the binary was fixed. The log message would
indicate: "store [...], last used with cockroach version [...], is too
old for running version [...] (which requires data from [...] or
later)".
tbg added a commit to tbg/cockroach that referenced this issue Oct 13, 2020
Writes to a RocksDB `storage.Engine` were not sync'ed by default,
meaning that they can get lost due to an ill-timed crash. They are now,
matching pebble's behavior. This affects only WriteClusterVersion,
updateBootstrapInfoLocked, WriteLastUpTimestamp, and Compactor.Suggest,
none of which are performance sensitive.

Fixes cockroachdb#54906.

(The backport will take care of cockroachdb#54908).

Release note (bug fix): a rare scenario in which a node would refuse
to start after updating the binary was fixed. The log message would
indicate: "store [...], last used with cockroach version [...], is too
old for running version [...] (which requires data from [...] or
later)".
@thoszhang
Contributor

Sorry, that was accidental. I'm putting them back on the other roachtest failures.

@cockroach-teamcity
Member Author

(roachtest).decommission/mixed-versions failed on release-20.2@973a6f33ecb8f99b2d510408826f37cd84a99f81:

The test failed on branch=release-20.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/decommission/mixed-versions/run_1
	mixed_version_decommission.go:248,versionupgrade.go:189,mixed_version_decommission.go:121,decommission.go:73,test_runner.go:755: expected to find 1 node with membership=decommissioning, found 0
		(1) attached stack trace
		  -- stack trace:
		  | main.checkOneMembership.func1.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/mixed_version_decommission.go:237
		  | github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:197
		  | main.checkOneMembership.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/mixed_version_decommission.go:228
		  | main.(*versionUpgradeTest).run
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/versionupgrade.go:189
		  | main.runDecommissionMixedVersions
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/mixed_version_decommission.go:121
		  | main.registerDecommission.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/decommission.go:73
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:755
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) expected to find 1 node with membership=decommissioning, found 0
		Error types: (1) *withstack.withStack (2) *errutil.leafError


Artifacts: /decommission/mixed-versions

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).decommission/mixed-versions failed on release-20.2@7f10df809db5076075b6ec63bc744b62109ee459:

		  |   | * - Intruders can log in without password and read or write any data in the cluster.
		  |   | * - Intruders can consume all your server's resources and cause unavailability.
		  |   | *
		  |   | *
		  |   | * INFO: To start a secure server without mandating TLS for clients,
		  |   | * consider --accept-sql-without-tls instead. For other options, see:
		  |   | * 
		  |   | * - https://go.crdb.dev/issue-v/53404/v20.2
		  |   | * - https://www.cockroachlabs.com/docs/v20.2/secure-a-cluster.html
		  |   | *
		  |   | *
		  |   | * ERROR: ERROR: cockroach server exited with error: store <no-attributes>=/mnt/data1/cockroach, last used with cockroach version v19.1-1, is too old for running version v20.2 (which requires data from v20.1 or later)
		  |   | *
		  |   | ERROR: cockroach server exited with error: store <no-attributes>=/mnt/data1/cockroach, last used with cockroach version v19.1-1, is too old for running version v20.2 (which requires data from v20.1 or later)
		  |   | Failed running "start"
		  |   | E201107 07:16:30.591477 1 cli/error.go:398  ERROR: exit status 1
		  |   | ERROR: exit status 1
		  |   | Failed running "start"
		  | Wraps: (3) exit status 1
		  | Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *exec.ExitError: 
		  | I201107 07:16:30.623127 1 cluster_synced.go:1733  command failed
		Wraps: (2) exit status 1
		Error types: (1) *main.withCommandDetails (2) *exec.ExitError

	cluster.go:1657,context.go:135,cluster.go:1646,test_runner.go:836: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2425022-1604733173-13-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1 1: 5206
		2: 4330
		3: 4117
		4: dead
		Error: UNCLASSIFIED_PROBLEM: 4: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1143
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:267
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 4: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError


Artifacts: /decommission/mixed-versions
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@andreimatei
Contributor

last failure:

store <no-attributes>=/mnt/data1/cockroach, last used with cockroach version v19.1-1, is too old for running version v20.2 (which requires data from v20.1 or later)

@tbg we still have to backport #55240 to fix that, right? Is there a holdup?

@tbg
Member

tbg commented Nov 11, 2020

Ball's in @itsbilal's court: #55745

@tbg tbg assigned itsbilal and unassigned tbg Nov 11, 2020
@itsbilal
Contributor

Closing now that #55745 is fixed.
