Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachprod: show process exit status with monitor #66414

Merged
merged 2 commits into from
Jun 14, 2021

Conversation

erikgrinaker
Copy link
Contributor

roachprod: remove netcat references

roachprod monitor used to use netcat to wait for process
termination, but this was replaced by a kill -0 loop back in #37390.
However, the code still contained code and comments related to netcat.

This patch removes the outdated netcat code and references.

Release note: None

roachprod: show process exit status with monitor

This patch changes roachprod monitor to use systemctl to poll
process info on non-local clusters, and outputs the exit status for dead
nodes. On local clusters, it retains the old logic.

Release note: None

During the lifecycle of a cluster (create, start, stop) the output is:

$ roachprod monitor grinaker-mon
2: dead (exit status 0)
3: dead (exit status 0)
1: dead (exit status 0)
1: 9628
2: 9714
3: 9674
1: dead (exit status 137)
2: dead (exit status 137)
3: dead (exit status 137)

/cc @cockroachdb/test-eng

`roachprod monitor` used to use `netcat` to wait for process
termination, but this was replaced by a `kill -0` loop back in cockroachdb#37390.
However, the code still contained code and comments related to netcat.

This patch removes the outdated `netcat` code and references.

Release note: None
@erikgrinaker erikgrinaker requested a review from tbg June 13, 2021 14:05
@erikgrinaker erikgrinaker self-assigned this Jun 13, 2021
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Good after comments.

Reviewed 1 of 1 files at r1, 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/cmd/roachprod/install/cluster_synced.go, line 375 at r2 (raw file):

	status=unknown
{{- else }}
	pid=$(systemctl show cockroach --property MainPID --value)

it will always bug me that this is visible in the "outer" scope later but I guess that's just how this works.


pkg/cmd/roachprod/install/cluster_synced.go, line 380 at r2 (raw file):

  if [ "${pid}" != "${lastpid}" ]; then
    if [ "${pid}" == 0 ]; then

Hmm, realizing now that monitor is a bit too forgiving. I think we want it to catch all node terminations. If a node rapidly cycles (i.e. pid changes from X to Y) we basically want to still say that it died.


pkg/cmd/roachprod/install/cluster_synced.go, line 381 at r2 (raw file):

  if [ "${pid}" != "${lastpid}" ]; then
    if [ "${pid}" == 0 ]; then
      echo "dead (exit status ${status})"

you might need to change roachtest.newMonitor as well, it might string match on the output here and I don't know if it's doing it in a way that will work with this change. Probably best to test that too, though it will have to be manual :-/

Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/cmd/roachprod/install/cluster_synced.go, line 375 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

it will always bug me that this is visible in the "outer" scope later but I guess that's just how this works.

Not sure which outer scope you mean here. The global shell script scope? There are no scopes in shell, except for explicit local variables in functions, which seems unnecessary. I don't think it should have any practical impact in this case anyway -- not like it'll leak into any parent or child processes or anything.


pkg/cmd/roachprod/install/cluster_synced.go, line 380 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Hmm, realizing now that monitor is a bit too forgiving. I think we want it to catch all node terminations. If a node rapidly cycles (i.e. pid changes from X to Y) we basically want to still say that it died.

Good point, fixed.


pkg/cmd/roachprod/install/cluster_synced.go, line 381 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

you might need to change roachtest.newMonitor as well, it might string match on the output here and I don't know if it's doing it in a way that will work with this change. Probably best to test that too, though it will have to be manual :-/

It matches on dead:

if msg.Err != nil || strings.Contains(msg.Msg, "dead") {

Tested it manually:

	cluster.go:1740,context.go:140,cluster.go:1729,test_runner.go:882: dead node detection: /home/erik/go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor grinaker-1623660519-01-n4cpu8 --oneshot --ignore-empty-nodes: exit status 1 3: 9199
		2: dead (exit status 137)
		4: skipped
		1: 9447

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/cmd/roachprod/install/cluster_synced.go, line 375 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Not sure which outer scope you mean here. The global shell script scope? There are no scopes in shell, except for explicit local variables in functions, which seems unnecessary. I don't think it should have any practical impact in this case anyway -- not like it'll leak into any parent or child processes or anything.

Sorry, didn't want to make you wonder what to do with this comment. I was just venting, the absence of scopes in shell script always weirds me out.

Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/cmd/roachprod/install/cluster_synced.go, line 375 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Sorry, didn't want to make you wonder what to do with this comment. I was just venting, the absence of scopes in shell script always weirds me out.

No worries, was just curious if you maybe meant something else like the template scope.

This patch changes `roachprod monitor` to use `systemctl` to poll
process info on non-local clusters, and outputs the exit status for dead
nodes. On local clusters, it retains the old logic.

Release note: None
@erikgrinaker
Copy link
Contributor Author

bors r=tbg

TFTR!

@craig
Copy link
Contributor

craig bot commented Jun 14, 2021

Build succeeded:

@craig craig bot merged commit 128e94f into cockroachdb:master Jun 14, 2021
tbg added a commit to tbg/cockroach that referenced this pull request Jun 16, 2021
Prior to this commit, we'd get an extra dead event when restarting a
node:

```
// initially running
1: 5762
// issue roachprod stop
1: dead (exit status 137)
// issue roachprod start
1: dead (exit status 0)
1: 6254
```

This would in turn upset roachtest's `monitor`, as it keeps track of
the number of expected death events, and cause bogus test failures.

This was introduced in cockroachdb#66414 and is fixed in this commit.

As a small extra fix, if we don't observe the exit status of the
stopped systemd unit (i.e. if the process cycles rapidly), we now
correctly print that we don't know the exit status, where previously
we would print the "exit" status of the new running incarnation, i.e.
likely zero.

I tested this manually by exacerbating the sleeps in the shell snippet:

```
1: 10885
1: dead (exit status unknown)
1: 11220
```

I also verfied that repeated restarts produce the expected events in
general, i.e. one dead event following one pid event.

Closes cockroachdb#66522
Closes cockroachdb#66528

Release note: None
craig bot pushed a commit that referenced this pull request Jun 16, 2021
66420: ui: disallow imports from cluster-ui sources r=koorosh a=koorosh

Db Console depends on local `cluster-ui` package that and it was possible to specify
imports to its modules as a paths to source modules or use exported `index` file which
is an entry point for bundled module.
`cluster-ui` package has to be built before imports in Db Console because it has its
own dependencies and build process which isn't compatible with build process of Db Console.
To prevent incorrect imports, this change adds es lint rule to prohibit any imports from
`@cockroachlabs/cluster-ui/src/*` path.

Release note: none

66539: roachprod: avoid extra dead event in roachprod monitor r=erikgrinaker a=tbg

Prior to this commit, we'd get an extra dead event when restarting a
node:

```
// initially running
1: 5762
// issue roachprod stop
1: dead (exit status 137)
// issue roachprod start
1: dead (exit status 0)
1: 6254
```

This would in turn upset roachtest's `monitor`, as it keeps track of
the number of expected death events, and cause bogus test failures.

This was introduced in #66414 and is fixed in this commit.

As a small extra fix, if we don't observe the exit status of the
stopped systemd unit (i.e. if the process cycles rapidly), we now
correctly print that we don't know the exit status, where previously
we would print the "exit" status of the new running incarnation, i.e.
likely zero.

I tested this manually by exacerbating the sleeps in the shell snippet:

```
1: 10885
1: dead (exit status unknown)
1: 11220
```

I also verfied that repeated restarts produce the expected events in
general, i.e. one dead event following one pid event.

Closes #66522
Closes #66528

Release note: None


66540: backupccl: add setting to write files in SQL r=dt a=dt

This adds a setting -- default off -- to force BACKUP to always ask
KV to return files to SQL to write instead of writing them directly.
This is currently what it does for tenants but not for the system
tenant due to the extra network hop, data copy, cpu and mem overhead.

Release note: none.

66542: build: set ROACHPROD_USER in weekly roachtest script r=tbg a=rickystewart

This was lost in 6651a08.

Release note: None

Co-authored-by: Andrii Vorobiov <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachprod/install/cluster_synced.go, line 410 at r1 (raw file):

				return
			}
			// Give the session a valid stdin pipe so that nc won't exit immediately.

Thanks for the first commit removing netcat crap! I was doing the same in #51298, but got stalled. Peter had some comments there (#51298 (review)) about how some comments about the handling of stdout should evolve; would you mind seeing if they make sense to you / if they're still applicable?

erikgrinaker pushed a commit to erikgrinaker/cockroach that referenced this pull request Jun 28, 2021
Prior to this commit, we'd get an extra dead event when restarting a
node:

```
// initially running
1: 5762
// issue roachprod stop
1: dead (exit status 137)
// issue roachprod start
1: dead (exit status 0)
1: 6254
```

This would in turn upset roachtest's `monitor`, as it keeps track of
the number of expected death events, and cause bogus test failures.

This was introduced in cockroachdb#66414 and is fixed in this commit.

As a small extra fix, if we don't observe the exit status of the
stopped systemd unit (i.e. if the process cycles rapidly), we now
correctly print that we don't know the exit status, where previously
we would print the "exit" status of the new running incarnation, i.e.
likely zero.

I tested this manually by exacerbating the sleeps in the shell snippet:

```
1: 10885
1: dead (exit status unknown)
1: 11220
```

I also verfied that repeated restarts produce the expected events in
general, i.e. one dead event following one pid event.

Closes cockroachdb#66522
Closes cockroachdb#66528

Release note: None
@erikgrinaker erikgrinaker deleted the roachprod-monitor branch June 28, 2021 12:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants