roachprod: show process exit status with monitor #66414

erikgrinaker · 2021-06-13T14:05:10Z

roachprod: remove netcat references

roachprod monitor used to use netcat to wait for process
termination, but this was replaced by a kill -0 loop back in #37390.
However, the source still contained code and comments related to netcat.

This patch removes the outdated netcat code and references.

Release note: None

roachprod: show process exit status with monitor

This patch changes roachprod monitor to use systemctl to poll
process info on non-local clusters, and outputs the exit status for dead
nodes. On local clusters, it retains the old logic.

Release note: None

During the lifecycle of a cluster (create, start, stop) the output is:

$ roachprod monitor grinaker-mon
2: dead (exit status 0)
3: dead (exit status 0)
1: dead (exit status 0)
1: 9628
2: 9714
3: 9674
1: dead (exit status 137)
2: dead (exit status 137)
3: dead (exit status 137)
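
For readers unfamiliar with the approach, the systemd-based check described above might look roughly like the sketch below. It is not the code from this PR: `describe_unit` and the `SYSTEMCTL` override are hypothetical names, and reading the exit status from the `ExecMainStatus` property is an assumption. Only the `MainPID` query is quoted in the review comments.

```shell
#!/usr/bin/env bash
# Sketch only. SYSTEMCTL and describe_unit are illustrative names, and
# ExecMainStatus is assumed to be the property holding the exit status.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Prints the main PID of the unit, or "dead (exit status N)" when
# systemd reports a MainPID of 0 (no running main process).
describe_unit() {
  local unit="$1" pid status
  pid=$("${SYSTEMCTL}" show "${unit}" --property MainPID --value)
  if [ "${pid}" = "0" ]; then
    status=$("${SYSTEMCTL}" show "${unit}" --property ExecMainStatus --value)
    echo "dead (exit status ${status})"
  else
    echo "${pid}"
  fi
}
```

Calling `describe_unit cockroach` once per poll interval and prefixing the result with the node number would produce lines of the shape shown above. The `SYSTEMCTL` override exists only so the function can be exercised without a real systemd.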

/cc @cockroachdb/test-eng

tbg

Thank you! Good after comments.

Reviewed 1 of 1 files at r1, 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

pkg/cmd/roachprod/install/cluster_synced.go, line 375 at r2 (raw file):

	status=unknown
{{- else }}
	pid=$(systemctl show cockroach --property MainPID --value)

it will always bug me that this is visible in the "outer" scope later but I guess that's just how this works.

pkg/cmd/roachprod/install/cluster_synced.go, line 380 at r2 (raw file):

  if [ "${pid}" != "${lastpid}" ]; then
    if [ "${pid}" == 0 ]; then

Hmm, realizing now that monitor is a bit too forgiving. I think we want it to catch all node terminations. If a node rapidly cycles (i.e. pid changes from X to Y) we basically want to still say that it died.

pkg/cmd/roachprod/install/cluster_synced.go, line 381 at r2 (raw file):

  if [ "${pid}" != "${lastpid}" ]; then
    if [ "${pid}" == 0 ]; then
      echo "dead (exit status ${status})"

you might need to change roachtest.newMonitor as well, it might string match on the output here and I don't know if it's doing it in a way that will work with this change. Probably best to test that too, though it will have to be manual :-/

erikgrinaker

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)

pkg/cmd/roachprod/install/cluster_synced.go, line 375 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

it will always bug me that this is visible in the "outer" scope later but I guess that's just how this works.

Not sure which outer scope you mean here. The global shell script scope? There are no scopes in shell, except for explicit local variables in functions, which seems unnecessary. I don't think it should have any practical impact in this case anyway -- not like it'll leak into any parent or child processes or anything.
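
A small standalone illustration of the scoping behaviour being discussed, with placeholder values that are not taken from the roachprod script: an assignment inside an `if` branch stays visible after `fi`, while `local` confines a variable to a function call.

```shell
#!/usr/bin/env bash

# An if-branch is not a scope: "pid" is still set after "fi".
if true; then
  pid=1234  # placeholder value, stands in for a real lookup
fi
echo "after if: pid=${pid}"

# Inside a function, "local" confines the variable to that call.
lookup() {
  local inner=5678
  leaked=9999  # no "local", so this assignment is global
}
lookup
echo "after function: inner=${inner:-unset} leaked=${leaked}"
```

The second `echo` reports `inner` as unset and `leaked` as 9999, which is the distinction drawn in the reply above.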

pkg/cmd/roachprod/install/cluster_synced.go, line 380 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Hmm, realizing now that monitor is a bit too forgiving. I think we want it to catch all node terminations. If a node rapidly cycles (i.e. pid changes from X to Y) we basically want to still say that it died.

Good point, fixed.

pkg/cmd/roachprod/install/cluster_synced.go, line 381 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

you might need to change roachtest.newMonitor as well, it might string match on the output here and I don't know if it's doing it in a way that will work with this change. Probably best to test that too, though it will have to be manual :-/

It matches on dead:

cockroach/pkg/cmd/roachprod/main.go

Line 1155 in c6d85fe

if msg.Err != nil || strings.Contains(msg.Msg, "dead") {

Tested it manually:

	cluster.go:1740,context.go:140,cluster.go:1729,test_runner.go:882: dead node detection: /home/erik/go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor grinaker-1623660519-01-n4cpu8 --oneshot --ignore-empty-nodes: exit status 1 3: 9199
		2: dead (exit status 137)
		4: skipped
		1: 9447
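
To illustrate why the new output still matches, the same substring check can be sketched in shell. `dead_nodes` is a hypothetical helper, not part of roachprod or roachtest, and it expects one per-node monitor message per input line.

```shell
#!/usr/bin/env bash

# Reads per-node monitor lines on stdin and prints the node number of
# every line containing the substring "dead", mirroring a plain
# substring match. "read" without IFS= trims leading whitespace.
dead_nodes() {
  local line
  while read -r line; do
    case "${line}" in
      *dead*) echo "${line%%:*}" ;;
    esac
  done
}
```

Because the match is on the substring alone, `2: dead (exit status 137)` is caught just as the older bare `dead` message would have been, while lines such as `4: skipped` or `1: 9447` are ignored.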

tbg

Reviewed 1 of 1 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

pkg/cmd/roachprod/install/cluster_synced.go, line 375 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Not sure which outer scope you mean here. The global shell script scope? There are no scopes in shell, except for explicit local variables in functions, which seems unnecessary. I don't think it should have any practical impact in this case anyway -- not like it'll leak into any parent or child processes or anything.

Sorry, didn't want to make you wonder what to do with this comment. I was just venting, the absence of scopes in shell script always weirds me out.

erikgrinaker

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)

pkg/cmd/roachprod/install/cluster_synced.go, line 375 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Sorry, didn't want to make you wonder what to do with this comment. I was just venting, the absence of scopes in shell script always weirds me out.

No worries, was just curious if you maybe meant something else like the template scope.

erikgrinaker · 2021-06-14T11:59:49Z

bors r=tbg

TFTR!

craig · 2021-06-14T12:40:53Z

Build succeeded:

GitHub CI (Cockroach)

Prior to this commit, we'd get an extra dead event when restarting a node:

```
// initially running
1: 5762
// issue roachprod stop
1: dead (exit status 137)
// issue roachprod start
1: dead (exit status 0)
1: 6254
```

This would in turn upset roachtest's `monitor`, as it keeps track of the number of expected death events, and cause bogus test failures. This was introduced in cockroachdb#66414 and is fixed in this commit.

As a small extra fix, if we don't observe the exit status of the stopped systemd unit (i.e. if the process cycles rapidly), we now correctly print that we don't know the exit status, where previously we would print the "exit" status of the new running incarnation, i.e. likely zero. I tested this manually by exacerbating the sleeps in the shell snippet:

```
1: 10885
1: dead (exit status unknown)
1: 11220
```

I also verified that repeated restarts produce the expected events in general, i.e. one dead event following one pid event.

Closes cockroachdb#66522
Closes cockroachdb#66528

Release note: None
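
The intended event sequence can be sketched as a small transition function. This is an illustration under assumptions, not the shell snippet from the commit: `report_transition` is a hypothetical name, an empty previous PID marks the first poll, and a PID of 0 means no process is running.

```shell
#!/usr/bin/env bash

# report_transition LASTPID PID STATUS
# Prints the events for one poll: nothing if the PID is unchanged, a
# single dead event when a running process goes away, and only the new
# PID when a stopped node comes back up.
report_transition() {
  local lastpid="$1" pid="$2" status="$3"
  if [ "${pid}" = "${lastpid}" ]; then
    return 0  # nothing changed since the last poll
  fi
  if [ -n "${lastpid}" ] && [ "${lastpid}" != "0" ] && [ "${pid}" != "0" ]; then
    # X -> Y without an observed stop: the old process died, but any
    # status available now belongs to the new incarnation.
    echo "dead (exit status unknown)"
  fi
  if [ "${pid}" = "0" ]; then
    echo "dead (exit status ${status})"
  else
    echo "${pid}"
  fi
}
```

With this shape, a stop followed by a start yields one dead event and then one PID event, and a rapid restart still yields exactly one dead event before the new PID.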

66420: ui: disallow imports from cluster-ui sources r=koorosh a=koorosh

Db Console depends on the local `cluster-ui` package, and it was possible to import its modules either by the path to a source module or through the exported `index` file, which is the entry point for the bundled module. The `cluster-ui` package has to be built before it is imported in Db Console because it has its own dependencies and a build process that isn't compatible with the build process of Db Console. To prevent incorrect imports, this change adds an ESLint rule to prohibit any imports from the `@cockroachlabs/cluster-ui/src/*` path.

Release note: none

66539: roachprod: avoid extra dead event in roachprod monitor r=erikgrinaker a=tbg

Prior to this commit, we'd get an extra dead event when restarting a node:

```
// initially running
1: 5762
// issue roachprod stop
1: dead (exit status 137)
// issue roachprod start
1: dead (exit status 0)
1: 6254
```

This would in turn upset roachtest's `monitor`, as it keeps track of the number of expected death events, and cause bogus test failures. This was introduced in #66414 and is fixed in this commit.

As a small extra fix, if we don't observe the exit status of the stopped systemd unit (i.e. if the process cycles rapidly), we now correctly print that we don't know the exit status, where previously we would print the "exit" status of the new running incarnation, i.e. likely zero. I tested this manually by exacerbating the sleeps in the shell snippet:

```
1: 10885
1: dead (exit status unknown)
1: 11220
```

I also verified that repeated restarts produce the expected events in general, i.e. one dead event following one pid event.

Closes #66522
Closes #66528

Release note: None

66540: backupccl: add setting to write files in SQL r=dt a=dt

This adds a setting, off by default, to force BACKUP to always ask KV to return files to SQL to write instead of writing them directly. This is currently what it does for tenants but not for the system tenant due to the extra network hop, data copy, CPU and memory overhead.

Release note: none.

66542: build: set ROACHPROD_USER in weekly roachtest script r=tbg a=rickystewart

This was lost in 6651a08.

Release note: None

Co-authored-by: Andrii Vorobiov <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained

pkg/cmd/roachprod/install/cluster_synced.go, line 410 at r1 (raw file):

				return
			}
			// Give the session a valid stdin pipe so that nc won't exit immediately.

Thanks for the first commit removing netcat crap! I was doing the same in #51298, but got stalled. Peter had some comments there (#51298 (review)) about how some comments about the handling of stdout should evolve; would you mind seeing if they make sense to you / if they're still applicable?

erikgrinaker requested a review from tbg June 13, 2021 14:05

erikgrinaker self-assigned this Jun 13, 2021

tbg approved these changes Jun 13, 2021

erikgrinaker force-pushed the roachprod-monitor branch from 8ef43c5 to 5d74ba6 Compare June 14, 2021 09:01

erikgrinaker commented Jun 14, 2021

tbg approved these changes Jun 14, 2021

erikgrinaker commented Jun 14, 2021

roachprod: show process exit status with monitor

e539415

erikgrinaker force-pushed the roachprod-monitor branch from 5d74ba6 to e539415 Compare June 14, 2021 11:18

craig bot merged commit 128e94f into cockroachdb:master Jun 14, 2021

tbg mentioned this pull request Jun 16, 2021

roachtest: version/mixed/nodes=3 failed #66522

Closed

tbg mentioned this pull request Jun 16, 2021

roachprod: avoid extra dead event in roachprod monitor #66539

Merged

andreimatei reviewed Jun 16, 2021

andreimatei mentioned this pull request Jun 16, 2021

roachprod: correct leftover netcat comments in the monitor #51298

Closed

erikgrinaker mentioned this pull request Jun 28, 2021

release-21.1: roachprod: show process exit status with monitor #66952

Merged

erikgrinaker deleted the roachprod-monitor branch June 28, 2021 12:35
