roachtest: inconsistency failed #54005

Closed
cockroach-teamcity opened this issue Sep 7, 2020 · 1 comment · Fixed by #54019
Labels: branch-master, C-test-failure, O-roachtest, O-robot


@cockroach-teamcity
Member

(roachtest).inconsistency failed on master@8b9e8dc32e73cfdfc1999da35d61e5cc9a2b35ec:

test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/inconsistency/run_1
	inconsistency.go:97,test_runner.go:754: dial tcp 35.225.170.169:26257: connect: connection refused

	cluster.go:1651,context.go:135,cluster.go:1640,test_runner.go:823: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2251139-1599459125-22-n3cpu4 --oneshot --ignore-empty-nodes: exit status 1 1: dead
		2: dead
		3: 5692
		Error: UNCLASSIFIED_PROBLEM: 1: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) secondary error attachment
		  | 2: dead
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1143
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:267
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:203
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		  | Wraps: (2) 2: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1143
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:267
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (4) 1: dead
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *withstack.withStack (4) *errutil.leafError


Artifacts: /inconsistency

See this test on roachdash
powered by pkg/cmd/internal/issues

cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, and release-blocker labels Sep 7, 2020
cockroach-teamcity added this to the 20.2 milestone Sep 7, 2020
tbg added a commit to tbg/cockroach that referenced this issue Sep 8, 2020
This test sets up an intentionally corrupted replica and wants its node
to shut down as a result of its detection. When only two of the three
nodes were included in the consistency check, either one of them could
end up terminating (as no obvious majority of healthy replicas can be
determined). Change the test so that we wait for the cluster to come
fully together before setting a low consistency check interval.

Closes cockroachdb#54005.

Release justification: testing
Release note: None
@tbg
Member

tbg commented Sep 8, 2020

It's a funny failure. The test basically sets n1 up with an inconsistency and expects it to fatal. Instead, we see n2 fatal as a result of this consistency check run by n1:

```
E200907 06:50:14.253062 1693 kv/kvserver/replica_consistency.go:163 ⋮ [n2,consistencyChecker,s2,r1/2:‹/{Min-System/NodeL…}›] ‹›
‹(n1,s1):1: checksum 506b30129032ad06811563d41d1a59eeda21121c2e8364aa28448a8bda57ace3c8f8e3e98520b5e63cbe480725813a9e944284eb07029fa74304ab25f6a32fdf›
‹- stats: contains_estimates:0 last_update_nanos:1599461359376307376 intent_age:0 gc_bytes_age:135861 live_bytes:2105 live_count:36 key_bytes:2296 key_count:36 val_bytes:6677 val_count:180 intent_bytes:0 intent_count:0 sys_bytes:451 sys_count:4 abort_span_bytes:0 ›
‹- stats.Sub(recomputation): last_update_nanos:1599461359376307376 sys_bytes:-81 sys_count:-1 ›
‹(n2,s2):2: checksum f39daaaf2cae21dfbc094e3f9ed7a320db8d038637f317282f0d883a70f20c8c5ea97112c97876bc81241ce9a7477af145af339084826a6e1739f7d251f25277 [minority]›
‹- stats: contains_estimates:0 last_update_nanos:1599461359376307376 intent_age:0 gc_bytes_age:135861 live_bytes:2105 live_count:36 key_bytes:2296 key_count:36 val_bytes:6677 val_count:180 intent_bytes:0 intent_count:0 sys_bytes:451 sys_count:4 abort_span_bytes:0 ›
‹- stats.Sub(recomputation): last_update_nanos:1599461359376307376›
E200907 06:50:14.253131 1693 kv/kvserver/replica_consistency.go:287 ⋮ [n2,consistencyChecker,s2,r1/2:‹/{Min-System/NodeL…}›] consistency check failed; fetching details and shutting down minority ‹[(n2,s2):2]›
```

It looks like, for some reason, n3 wasn't included in the check, which can happen if it didn't respond in time or wasn't considered live. I think this is easy to avoid; sending a PR.
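The problem with two participants becomes clear from a simplified majority vote over the reported checksums. The Go sketch below is hypothetical (`replicaChecksum` and `minority` are not CockroachDB code). The real checker evidently still singled out a replica in the two-way case, as the `[minority]` marker on n2 in the log shows; this model only illustrates why such a choice carries no information about which replica is actually corrupt.

```go
package main

import "fmt"

// replicaChecksum pairs a replica with the checksum it reported.
// Hypothetical, simplified model; not a CockroachDB type.
type replicaChecksum struct {
	replica  string
	checksum string
}

// minority returns the replicas whose checksum differs from the one
// reported by a strict majority (more than half) of the participants.
// ok is false when no checksum reaches a strict majority, which is the
// case when exactly two participating replicas disagree.
func minority(results []replicaChecksum) (out []string, ok bool) {
	counts := make(map[string]int)
	for _, r := range results {
		counts[r.checksum]++
	}
	var majority string
	found := false
	for sum, n := range counts {
		if 2*n > len(results) { // at most one checksum can satisfy this
			majority, found = sum, true
		}
	}
	if !found {
		return nil, false
	}
	for _, r := range results {
		if r.checksum != majority {
			out = append(out, r.replica)
		}
	}
	return out, true
}

func main() {
	// All three replicas participate and n1 is corrupted.
	fmt.Println(minority([]replicaChecksum{{"n1", "bad"}, {"n2", "good"}, {"n3", "good"}})) // [n1] true
	// n3 is missing from the check: one vote each, no strict majority.
	fmt.Println(minority([]replicaChecksum{{"n1", "bad"}, {"n2", "good"}})) // [] false
}
```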

tbg self-assigned this Sep 8, 2020
tbg removed the release-blocker label Sep 8, 2020
craig bot pushed a commit that referenced this issue Sep 10, 2020
53991: pgwire: accept non-TLS client conns safely in secure mode r=aaron-crl,irfansharif,bdarnell a=knz

Fixes #44842.
Informs #49532. 
Informs #53404.

This change makes it possible for a DBA / system administrator to
reconfigure individual nodes *in a secure cluster* to accept SQL
client sessions over TCP without mandating a TLS
handshake. Authentication remains mandatory as per the HBA rules.

Motivation: we have at least two high-profile customers who keep their
nodes and client apps in a private secure network (with network-level
encryption / privacy) and who experience client-side TLS as
unnecessary and expensive friction.

Additionally, **this feature is a prerequisite to upgrade an insecure
cluster to secure mode without downtime.**

Why this does not impair security:

- authentication remains mandatory (as per the HBA rules -- [1] [2]).
- the feature is opt-in: the operator must set a command-line flag
  (`--accept-sql-without-tls`), which is not enabled by default.
- there is an interlock: the user must both set up the flag
  and set log-in passwords for their SQL users (by default,
  users get created without a password and thus cannot log
  in without client certs).
- for now, node-node connections still require TLS.
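The two-part interlock in the list above can be sketched in Go as follows. This is a simplified, hypothetical model (`conn`, `user`, and `admit` are not CockroachDB APIs) that assumes the default `cert-password` HBA method applies. It does not model node-to-node connections, which still require TLS.

```go
package main

import (
	"errors"
	"fmt"
)

// conn is a hypothetical, simplified view of an incoming SQL connection.
type conn struct {
	overTLS      bool // client completed a TLS handshake
	clientCertOK bool // client presented a valid cert for the user (TLS only)
	passwordOK   bool // client supplied the user's correct password
}

// user models only whether the SQL user has a password set.
type user struct {
	hasPassword bool
}

// admit sketches the interlock. acceptSQLWithoutTLS stands in for the
// --accept-sql-without-tls flag; the cert-password rule (valid client
// cert or password) is assumed to be the HBA method in effect.
func admit(acceptSQLWithoutTLS bool, c conn, u user) error {
	// Part 1: plain TCP is refused unless the operator opted in.
	if !c.overTLS && !acceptSQLWithoutTLS {
		return errors.New("TLS required")
	}
	// A client cert can only be presented over TLS.
	if c.overTLS && c.clientCertOK {
		return nil
	}
	// Part 2: otherwise a password is mandatory, so users created
	// without a password cannot log in over a non-TLS connection.
	if u.hasPassword && c.passwordOK {
		return nil
	}
	return errors.New("authentication failed")
}

func main() {
	plain := conn{passwordOK: true} // non-TLS connection with a password
	fmt.Println(admit(false, plain, user{hasPassword: true}))  // TLS required
	fmt.Println(admit(true, plain, user{hasPassword: true}))   // <nil>
	fmt.Println(admit(true, conn{}, user{hasPassword: false})) // authentication failed
}
```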

[1]: https://www.postgresql.org/docs/12/auth-pg-hba-conf.html
[2]: https://dr-knz.net/authentication-in-postgresql-and-cockroachdb.html

For context, the default HBA configuration is the following:

```
host  all root all cert-password # fixed rule
host  all all  all cert-password # built-in CockroachDB default
local all all      password      # built-in CockroachDB default
```

The directive `host` covers both TLS and non-TLS incoming TCP
connections (`local` is for the unix socket). The method
`cert-password` means "client cert or password": without a cert, the
password is mandatory.

As before, the user can further secure the configuration by
restricting non-TLS connections to just a subnetwork, for example:

```
host  all all 10.0.0.0/8 password # accept conns on the 10/8 network.
host  all all all        reject   # refuse conns from other nets.
local all all            password
```

Note that this change is limited to the server side: CockroachDB's own
`cockroach` CLI commands do not yet know how to connect to a
CockroachDB server without TLS; such connections are only supported
from `psql` or SQL client drivers in apps. See #53994 for a follow-up.

Release justification: fixes for high-priority or high-severity bugs in existing functionality


54019: roachtest: de-flake 'inconsistent' r=knz a=tbg

This test sets up an intentionally corrupted replica and wants its node
to shut down as a result of its detection. When only two of the three
nodes were included in the consistency check, either one of them could
end up terminating (as no obvious majority of healthy replicas can be
determined). Change the test so that we wait for the cluster to come
fully together before setting a low consistency check interval.

Closes #54005.

Release justification: testing
Release note: None


Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
craig bot closed this as completed in c6e36ac Sep 10, 2020