roachtest: version/mixed/nodes=3 failed #66522

cockroach-teamcity · 2021-06-16T07:00:28Z

roachtest.version/mixed/nodes=3 failed with artifacts on master @ ec36bc66f5147a8703719a9c890be12dd0c08945:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/version/mixed/nodes=3/run_1
	cluster.go:1870,version.go:139,cluster.go:2462,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start --encrypt=false teamcity-3087613-1623823848-07-n4cpu4:1 returned: context canceled
		(1) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start --encrypt=false teamcity-3087613-1623823848-07-n4cpu4:1 returned
		  | stderr:
		  |
		  | stdout:
		  | teamcity-3087613-1623823848-07-n4cpu4: starting nodes
		Wraps: (2) secondary error attachment
		  | signal: killed
		  | (1) signal: killed
		  | Error types: (1) *exec.ExitError
		Wraps: (3) context canceled
		Error types: (1) *main.withCommandDetails (2) *secondary.withSecondaryError (3) *errors.errorString

	cluster.go:2484,version.go:208,version.go:222,test_runner.go:757: monitor failure: unexpected node event: 1: dead (exit status 0)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2472
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2480
		  | main.registerVersion.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/version.go:208
		  | main.registerVersion.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/version.go:222
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:757
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 1: dead (exit status 0)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh version/mixed/nodes=3

/cc @cockroachdb/kv _{This test on roachdash | Improve this report!}

The text was updated successfully, but these errors were encountered:

tbg · 2021-06-16T13:32:46Z

My hunch would be that this is fallout from #66414, though I don't understand exactly why. I think what we're seeing here and in the sibling issue #66528 is that the monitor thinks a node is dead (with exit status zero) when the test intentionally stops and restarts a node here:

cockroach/pkg/cmd/roachtest/version.go

Lines 135 to 139 in 7dc3736

    
           if err := stop(i); err != nil { 
        
           	return err 
        
           } 
        
           c.Put(ctx, cockroach, "./cockroach", c.Node(i)) 
        
           c.Start(ctx, t, c.Node(i), startArgsDontEncrypt)

I think the problem is that we're doing a graceful stop here, i.e. the process stops with exit status zero, but the monitor now detects this as a dead node. I suspect it didn't prior to #66414.

I think this new behavior makes sense. I'll verify and update the test with an appropriate m.ExpectDeath.

erikgrinaker · 2021-06-16T13:52:09Z

I think this is because the restart will cause the node to start with a new PID. We intentionally changed the behavior of monitor to consider that a node crash and emit a dead event. Before, we didn't care about PID changes as long as it was up.

We could e.g. change this to emit a restart event instead of a dead event. Or change the callers to expect this.

tbg · 2021-06-16T14:00:07Z

Hmm, it has to be something like that, but not quite. I realized stop(i) does call m.ExpectDeath:

cockroach/pkg/cmd/roachtest/version.go

Lines 119 to 123 in 7dc3736

    
           stop := func(node int) error { 
        
           	m.ExpectDeath() 
        
           	l.Printf("stopping node %d\n", node) 
        
           	return c.StopCockroachGracefullyOnNode(ctx, node) 
        
           }

I see that when we're starting CRDB, there's an extra "dead (exit status 0)" event that shouldn't be there. Will fix in roachprod monitor.

66420: ui: disallow imports from cluster-ui sources r=koorosh a=koorosh Db Console depends on local `cluster-ui` package that and it was possible to specify imports to its modules as a paths to source modules or use exported `index` file which is an entry point for bundled module. `cluster-ui` package has to be built before imports in Db Console because it has its own dependencies and build process which isn't compatible with build process of Db Console. To prevent incorrect imports, this change adds es lint rule to prohibit any imports from `@cockroachlabs/cluster-ui/src/*` path. Release note: none 66539: roachprod: avoid extra dead event in roachprod monitor r=erikgrinaker a=tbg Prior to this commit, we'd get an extra dead event when restarting a node: ``` // initially running 1: 5762 // issue roachprod stop 1: dead (exit status 137) // issue roachprod start 1: dead (exit status 0) 1: 6254 ``` This would in turn upset roachtest's `monitor`, as it keeps track of the number of expected death events, and cause bogus test failures. This was introduced in #66414 and is fixed in this commit. As a small extra fix, if we don't observe the exit status of the stopped systemd unit (i.e. if the process cycles rapidly), we now correctly print that we don't know the exit status, where previously we would print the "exit" status of the new running incarnation, i.e. likely zero. I tested this manually by exacerbating the sleeps in the shell snippet: ``` 1: 10885 1: dead (exit status unknown) 1: 11220 ``` I also verfied that repeated restarts produce the expected events in general, i.e. one dead event following one pid event. Closes #66522 Closes #66528 Release note: None 66540: backupccl: add setting to write files in SQL r=dt a=dt This adds a setting -- default off -- to force BACKUP to always ask KV to return files to SQL to write instead of writing them directly. This is currently what it does for tenants but not for the system tenant due to the extra network hop, data copy, cpu and mem overhead. Release note: none. 66542: build: set ROACHPROD_USER in weekly roachtest script r=tbg a=rickystewart This was lost in 6651a08. Release note: None Co-authored-by: Andrii Vorobiov <[email protected]> Co-authored-by: Tobias Grieger <[email protected]> Co-authored-by: David Taylor <[email protected]> Co-authored-by: Ricky Stewart <[email protected]>

Prior to this commit, we'd get an extra dead event when restarting a node: ``` // initially running 1: 5762 // issue roachprod stop 1: dead (exit status 137) // issue roachprod start 1: dead (exit status 0) 1: 6254 ``` This would in turn upset roachtest's `monitor`, as it keeps track of the number of expected death events, and cause bogus test failures. This was introduced in cockroachdb#66414 and is fixed in this commit. As a small extra fix, if we don't observe the exit status of the stopped systemd unit (i.e. if the process cycles rapidly), we now correctly print that we don't know the exit status, where previously we would print the "exit" status of the new running incarnation, i.e. likely zero. I tested this manually by exacerbating the sleeps in the shell snippet: ``` 1: 10885 1: dead (exit status unknown) 1: 11220 ``` I also verfied that repeated restarts produce the expected events in general, i.e. one dead event following one pid event. Closes cockroachdb#66522 Closes cockroachdb#66528 Release note: None

tbg self-assigned this Jun 16, 2021

tbg mentioned this issue Jun 16, 2021

roachprod: avoid extra dead event in roachprod monitor #66539

Merged

craig bot closed this as completed in b0c459d Jun 16, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest: version/mixed/nodes=3 failed #66522

roachtest: version/mixed/nodes=3 failed #66522

cockroach-teamcity commented Jun 16, 2021

tbg commented Jun 16, 2021

erikgrinaker commented Jun 16, 2021 •

edited

Loading

tbg commented Jun 16, 2021

roachtest: version/mixed/nodes=3 failed #66522

roachtest: version/mixed/nodes=3 failed #66522

Comments

cockroach-teamcity commented Jun 16, 2021

tbg commented Jun 16, 2021

erikgrinaker commented Jun 16, 2021 • edited Loading

tbg commented Jun 16, 2021

erikgrinaker commented Jun 16, 2021 •

edited

Loading