Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: version/mixed/nodes=3 failed #66522

Closed
cockroach-teamcity opened this issue Jun 16, 2021 · 3 comments · Fixed by #66539
Closed

roachtest: version/mixed/nodes=3 failed #66522

cockroach-teamcity opened this issue Jun 16, 2021 · 3 comments · Fixed by #66539
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@cockroach-teamcity
Copy link
Member

roachtest.version/mixed/nodes=3 failed with artifacts on master @ ec36bc66f5147a8703719a9c890be12dd0c08945:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/version/mixed/nodes=3/run_1
	cluster.go:1870,version.go:139,cluster.go:2462,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start --encrypt=false teamcity-3087613-1623823848-07-n4cpu4:1 returned: context canceled
		(1) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start --encrypt=false teamcity-3087613-1623823848-07-n4cpu4:1 returned
		  | stderr:
		  |
		  | stdout:
		  | teamcity-3087613-1623823848-07-n4cpu4: starting nodes
		Wraps: (2) secondary error attachment
		  | signal: killed
		  | (1) signal: killed
		  | Error types: (1) *exec.ExitError
		Wraps: (3) context canceled
		Error types: (1) *main.withCommandDetails (2) *secondary.withSecondaryError (3) *errors.errorString

	cluster.go:2484,version.go:208,version.go:222,test_runner.go:757: monitor failure: unexpected node event: 1: dead (exit status 0)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2472
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2480
		  | main.registerVersion.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/version.go:208
		  | main.registerVersion.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/version.go:222
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:757
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 1: dead (exit status 0)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh version/mixed/nodes=3

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jun 16, 2021
@tbg
Copy link
Member

tbg commented Jun 16, 2021

My hunch would be that this is fallout from #66414, though I don't understand exactly why. I think what we're seeing here and in the sibling issue #66528 is that the monitor thinks a node is dead (with exit status zero) when the test intentionally stops and restarts a node here:

if err := stop(i); err != nil {
return err
}
c.Put(ctx, cockroach, "./cockroach", c.Node(i))
c.Start(ctx, t, c.Node(i), startArgsDontEncrypt)

I think the problem is that we're doing a graceful stop here, i.e. the process stops with exit status zero, but the monitor now detects this as a dead node. I suspect it didn't prior to #66414.

I think this new behavior makes sense. I'll verify and update the test with an appropriate m.ExpectDeath.

@tbg tbg self-assigned this Jun 16, 2021
@erikgrinaker
Copy link
Contributor

erikgrinaker commented Jun 16, 2021

I think this is because the restart will cause the node to start with a new PID. We intentionally changed the behavior of monitor to consider that a node crash and emit a dead event. Before, we didn't care about PID changes as long as it was up.

We could e.g. change this to emit a restart event instead of a dead event. Or change the callers to expect this.

@tbg
Copy link
Member

tbg commented Jun 16, 2021

Hmm, it has to be something like that, but not quite. I realized stop(i) does call m.ExpectDeath:

stop := func(node int) error {
m.ExpectDeath()
l.Printf("stopping node %d\n", node)
return c.StopCockroachGracefullyOnNode(ctx, node)
}

I see that when we're starting CRDB, there's an extra "dead (exit status 0)" event that shouldn't be there. Will fix in roachprod monitor.

craig bot pushed a commit that referenced this issue Jun 16, 2021
66420: ui: disallow imports from cluster-ui sources r=koorosh a=koorosh

Db Console depends on local `cluster-ui` package that and it was possible to specify
imports to its modules as a paths to source modules or use exported `index` file which
is an entry point for bundled module.
`cluster-ui` package has to be built before imports in Db Console because it has its
own dependencies and build process which isn't compatible with build process of Db Console.
To prevent incorrect imports, this change adds es lint rule to prohibit any imports from
`@cockroachlabs/cluster-ui/src/*` path.

Release note: none

66539: roachprod: avoid extra dead event in roachprod monitor r=erikgrinaker a=tbg

Prior to this commit, we'd get an extra dead event when restarting a
node:

```
// initially running
1: 5762
// issue roachprod stop
1: dead (exit status 137)
// issue roachprod start
1: dead (exit status 0)
1: 6254
```

This would in turn upset roachtest's `monitor`, as it keeps track of
the number of expected death events, and cause bogus test failures.

This was introduced in #66414 and is fixed in this commit.

As a small extra fix, if we don't observe the exit status of the
stopped systemd unit (i.e. if the process cycles rapidly), we now
correctly print that we don't know the exit status, where previously
we would print the "exit" status of the new running incarnation, i.e.
likely zero.

I tested this manually by exacerbating the sleeps in the shell snippet:

```
1: 10885
1: dead (exit status unknown)
1: 11220
```

I also verfied that repeated restarts produce the expected events in
general, i.e. one dead event following one pid event.

Closes #66522
Closes #66528

Release note: None


66540: backupccl: add setting to write files in SQL r=dt a=dt

This adds a setting -- default off -- to force BACKUP to always ask
KV to return files to SQL to write instead of writing them directly.
This is currently what it does for tenants but not for the system
tenant due to the extra network hop, data copy, cpu and mem overhead.

Release note: none.

66542: build: set ROACHPROD_USER in weekly roachtest script r=tbg a=rickystewart

This was lost in 6651a08.

Release note: None

Co-authored-by: Andrii Vorobiov <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
@craig craig bot closed this as completed in b0c459d Jun 16, 2021
erikgrinaker pushed a commit to erikgrinaker/cockroach that referenced this issue Jun 28, 2021
Prior to this commit, we'd get an extra dead event when restarting a
node:

```
// initially running
1: 5762
// issue roachprod stop
1: dead (exit status 137)
// issue roachprod start
1: dead (exit status 0)
1: 6254
```

This would in turn upset roachtest's `monitor`, as it keeps track of
the number of expected death events, and cause bogus test failures.

This was introduced in cockroachdb#66414 and is fixed in this commit.

As a small extra fix, if we don't observe the exit status of the
stopped systemd unit (i.e. if the process cycles rapidly), we now
correctly print that we don't know the exit status, where previously
we would print the "exit" status of the new running incarnation, i.e.
likely zero.

I tested this manually by exacerbating the sleeps in the shell snippet:

```
1: 10885
1: dead (exit status unknown)
1: 11220
```

I also verfied that repeated restarts produce the expected events in
general, i.e. one dead event following one pid event.

Closes cockroachdb#66522
Closes cockroachdb#66528

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants