-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachprod: show process exit status with monitor #66414
Conversation
`roachprod monitor` used to use `netcat` to wait for process termination, but this was replaced by a `kill -0` loop back in cockroachdb#37390. However, the code still contained code and comments related to netcat. This patch removes the outdated `netcat` code and references. Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Good after comments.
Reviewed 1 of 1 files at r1, 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/cmd/roachprod/install/cluster_synced.go, line 375 at r2 (raw file):
status=unknown {{- else }} pid=$(systemctl show cockroach --property MainPID --value)
it will always bug me that this is visible in the "outer" scope later but I guess that's just how this works.
pkg/cmd/roachprod/install/cluster_synced.go, line 380 at r2 (raw file):
if [ "${pid}" != "${lastpid}" ]; then if [ "${pid}" == 0 ]; then
Hmm, realizing now that monitor
is a bit too forgiving. I think we want it to catch all node terminations. If a node rapidly cycles (i.e. pid changes from X to Y) we basically want to still say that it died.
pkg/cmd/roachprod/install/cluster_synced.go, line 381 at r2 (raw file):
if [ "${pid}" != "${lastpid}" ]; then if [ "${pid}" == 0 ]; then echo "dead (exit status ${status})"
you might need to change roachtest.newMonitor
as well, it might string match on the output here and I don't know if it's doing it in a way that will work with this change. Probably best to test that too, though it will have to be manual :-/
8ef43c5
to
5d74ba6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/cmd/roachprod/install/cluster_synced.go, line 375 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
it will always bug me that this is visible in the "outer" scope later but I guess that's just how this works.
Not sure which outer scope you mean here. The global shell script scope? There are no scopes in shell, except for explicit local variables in functions, which seems unnecessary. I don't think it should have any practical impact in this case anyway -- not like it'll leak into any parent or child processes or anything.
pkg/cmd/roachprod/install/cluster_synced.go, line 380 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Hmm, realizing now that
monitor
is a bit too forgiving. I think we want it to catch all node terminations. If a node rapidly cycles (i.e. pid changes from X to Y) we basically want to still say that it died.
Good point, fixed.
pkg/cmd/roachprod/install/cluster_synced.go, line 381 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
you might need to change
roachtest.newMonitor
as well, it might string match on the output here and I don't know if it's doing it in a way that will work with this change. Probably best to test that too, though it will have to be manual :-/
It matches on dead
:
cockroach/pkg/cmd/roachprod/main.go
Line 1155 in c6d85fe
if msg.Err != nil || strings.Contains(msg.Msg, "dead") { |
Tested it manually:
cluster.go:1740,context.go:140,cluster.go:1729,test_runner.go:882: dead node detection: /home/erik/go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor grinaker-1623660519-01-n4cpu8 --oneshot --ignore-empty-nodes: exit status 1 3: 9199
2: dead (exit status 137)
4: skipped
1: 9447
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/cmd/roachprod/install/cluster_synced.go, line 375 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Not sure which outer scope you mean here. The global shell script scope? There are no scopes in shell, except for explicit local variables in functions, which seems unnecessary. I don't think it should have any practical impact in this case anyway -- not like it'll leak into any parent or child processes or anything.
Sorry, didn't want to make you wonder what to do with this comment. I was just venting, the absence of scopes in shell script always weirds me out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/cmd/roachprod/install/cluster_synced.go, line 375 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Sorry, didn't want to make you wonder what to do with this comment. I was just venting, the absence of scopes in shell script always weirds me out.
No worries, was just curious if you maybe meant something else like the template scope.
This patch changes `roachprod monitor` to use `systemctl` to poll process info on non-local clusters, and outputs the exit status for dead nodes. On local clusters, it retains the old logic. Release note: None
5d74ba6
to
e539415
Compare
bors r=tbg TFTR! |
Build succeeded: |
Prior to this commit, we'd get an extra dead event when restarting a node: ``` // initially running 1: 5762 // issue roachprod stop 1: dead (exit status 137) // issue roachprod start 1: dead (exit status 0) 1: 6254 ``` This would in turn upset roachtest's `monitor`, as it keeps track of the number of expected death events, and cause bogus test failures. This was introduced in cockroachdb#66414 and is fixed in this commit. As a small extra fix, if we don't observe the exit status of the stopped systemd unit (i.e. if the process cycles rapidly), we now correctly print that we don't know the exit status, where previously we would print the "exit" status of the new running incarnation, i.e. likely zero. I tested this manually by exacerbating the sleeps in the shell snippet: ``` 1: 10885 1: dead (exit status unknown) 1: 11220 ``` I also verfied that repeated restarts produce the expected events in general, i.e. one dead event following one pid event. Closes cockroachdb#66522 Closes cockroachdb#66528 Release note: None
66420: ui: disallow imports from cluster-ui sources r=koorosh a=koorosh Db Console depends on local `cluster-ui` package that and it was possible to specify imports to its modules as a paths to source modules or use exported `index` file which is an entry point for bundled module. `cluster-ui` package has to be built before imports in Db Console because it has its own dependencies and build process which isn't compatible with build process of Db Console. To prevent incorrect imports, this change adds es lint rule to prohibit any imports from `@cockroachlabs/cluster-ui/src/*` path. Release note: none 66539: roachprod: avoid extra dead event in roachprod monitor r=erikgrinaker a=tbg Prior to this commit, we'd get an extra dead event when restarting a node: ``` // initially running 1: 5762 // issue roachprod stop 1: dead (exit status 137) // issue roachprod start 1: dead (exit status 0) 1: 6254 ``` This would in turn upset roachtest's `monitor`, as it keeps track of the number of expected death events, and cause bogus test failures. This was introduced in #66414 and is fixed in this commit. As a small extra fix, if we don't observe the exit status of the stopped systemd unit (i.e. if the process cycles rapidly), we now correctly print that we don't know the exit status, where previously we would print the "exit" status of the new running incarnation, i.e. likely zero. I tested this manually by exacerbating the sleeps in the shell snippet: ``` 1: 10885 1: dead (exit status unknown) 1: 11220 ``` I also verfied that repeated restarts produce the expected events in general, i.e. one dead event following one pid event. Closes #66522 Closes #66528 Release note: None 66540: backupccl: add setting to write files in SQL r=dt a=dt This adds a setting -- default off -- to force BACKUP to always ask KV to return files to SQL to write instead of writing them directly. This is currently what it does for tenants but not for the system tenant due to the extra network hop, data copy, cpu and mem overhead. Release note: none. 66542: build: set ROACHPROD_USER in weekly roachtest script r=tbg a=rickystewart This was lost in 6651a08. Release note: None Co-authored-by: Andrii Vorobiov <[email protected]> Co-authored-by: Tobias Grieger <[email protected]> Co-authored-by: David Taylor <[email protected]> Co-authored-by: Ricky Stewart <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/cmd/roachprod/install/cluster_synced.go, line 410 at r1 (raw file):
return } // Give the session a valid stdin pipe so that nc won't exit immediately.
Thanks for the first commit removing netcat crap! I was doing the same in #51298, but got stalled. Peter had some comments there (#51298 (review)) about how some comments about the handling of stdout should evolve; would you mind seeing if they make sense to you / if they're still applicable?
Prior to this commit, we'd get an extra dead event when restarting a node: ``` // initially running 1: 5762 // issue roachprod stop 1: dead (exit status 137) // issue roachprod start 1: dead (exit status 0) 1: 6254 ``` This would in turn upset roachtest's `monitor`, as it keeps track of the number of expected death events, and cause bogus test failures. This was introduced in cockroachdb#66414 and is fixed in this commit. As a small extra fix, if we don't observe the exit status of the stopped systemd unit (i.e. if the process cycles rapidly), we now correctly print that we don't know the exit status, where previously we would print the "exit" status of the new running incarnation, i.e. likely zero. I tested this manually by exacerbating the sleeps in the shell snippet: ``` 1: 10885 1: dead (exit status unknown) 1: 11220 ``` I also verfied that repeated restarts produce the expected events in general, i.e. one dead event following one pid event. Closes cockroachdb#66522 Closes cockroachdb#66528 Release note: None
roachprod: remove netcat references
roachprod monitor
used to usenetcat
to wait for processtermination, but this was replaced by a
kill -0
loop back in #37390.However, the code still contained code and comments related to netcat.
This patch removes the outdated
netcat
code and references.Release note: None
roachprod: show process exit status with monitor
This patch changes
roachprod monitor
to usesystemctl
to pollprocess info on non-local clusters, and outputs the exit status for dead
nodes. On local clusters, it retains the old logic.
Release note: None
During the lifecycle of a cluster (
create
,start
,stop
) the output is:/cc @cockroachdb/test-eng