roachprod: remove monitor netcat command #37390

ajkr · 2019-05-08T06:35:04Z

roachprod monitor assumes nc will exit as soon as Cockroach server
exits. This actually is not the case in later versions of netcat (tested
on Ubuntu 18.04+).

This PR changes to a polling approach calling kill -0 once per second
to monitor the Cockroach server's liveness. This should give us better
portability and we verified the overhead is low (~0.65ms of a CPU core's
time per kill invocation). Tested by running roachprod monitor
locally, gradually killing the nodes, and observing the output:

3: 28342
1: 28176
2: 28257
3: kill exited nonzero
3: dead
2: kill exited nonzero
2: dead
1: kill exited nonzero
1: dead

Fixes #37370.

Release note: None

cockroach-teamcity · 2019-05-08T06:35:10Z

This change is

tbg · 2019-05-08T10:49:12Z

I have bad news -- -d isn't a thing on OSX, and this stuff has to work there too (because of --local).

I'm thinking that this netcat thing is trying to be too clever. It's just trying to achieve "instant" detection of the dead node, but we could just replace it with a "sleep 1" and end up in a functionally similar, but much simpler place. Am I missing a reason for which this isn't viable?

PS great find. Are we using 18.04 in our nightly testing, or where did you run into it?

petermattis

When monitor was originally written, I recall trying to poll with lsof and finding it moderately expensive. nc is such a simple utility. It is sad that this behavior is changing between ubuntu releases.

Note that we know if the machine we're running nc on is local or not. We could optionally add -d based on that, or try to detect if -d is supported.

Reviewable status: complete! 0 of 0 LGTMs obtained

petermattis · 2019-05-08T13:03:30Z

Another thought is to add cockroach debug monitor or cockroach debug wait. I'd guess it is a few dozen lines of Go code to open the network connection and block waiting for it to be closed.

tbg · 2019-05-08T13:40:31Z

@petermattis are you saying you don't just want to rip out nc? That seems like a straightforward way to kill less eng time on this.

Before we write a utility that does low-level nc stuff against an open crdb port, we can just add a GRPC endpoint that lets you wait forever. But then you have all the migration concerns to deal with. I really think we should just remove nc from the equation and loop a little more often.

petermattis · 2019-05-08T13:44:39Z

@petermattis are you saying you don't just want to rip out nc? That seems like a straightforward way to kill less eng time on this.

I'm fine with ripping out nc, but see: "When monitor was originally written, I recall trying to poll with lsof and finding it moderately expensive." For performance tests, we don't want monitor to have an effect on performance.

tbg · 2019-05-08T13:46:10Z

Oh, I see. How about polling with kill -0 $PID? You use lsof to discover and then do a cheap thing frequently after.

petermattis · 2019-05-08T13:49:12Z

Oh, I see. How about polling with kill -0 $PID? You use lsof to discover and then do a cheap thing frequently after.

Seems possible. The irritating thing is testing that this doesn't have an effect on performance.

ajkr · 2019-05-08T17:00:51Z

I have bad news -- -d isn't a thing on OSX, and this stuff has to work there too (because of --local).

That is very bad news!

PS great find. Are we using 18.04 in our nightly testing, or where did you run into it?

Thanks. I'm using 18.04 and was addressing one of Peter's suggestions to make a callback for node deaths. When running with --local on my machine, that callback never got called.

ajkr · 2019-05-08T18:19:25Z

Should we measure how long kill takes? What is a good number? How about sacrificing at most 0.1% of CPU time since this is only used in tests? I'd guess it'll be way less than that since kill is a small program that just sends a signal, and we're talking about running it once a second. But we can reduce the frequency of running it if needed.

tbg · 2019-05-08T18:40:56Z

Sounds reasonable to me. I would be very surprised if it weren't negligible but worth checking once.

…

On Wed, May 8, 2019, 20:19 Andrew Kryczka ***@***.***> wrote: Should we measure how long kill takes? What is a good number? How about sacrificing at most 0.1% of CPU time since this is only used in tests (I'd guess it'll be way less than that since kill is a small program that just sends a signal, and we're talking about running it once a second)? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#37390 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABGXPZFQOGGPU22XWWGWYRDPUMKT7ANCNFSM4HLOZSPA> .

tbg · 2019-05-08T18:43:06Z

PS with -0 it doesn't even send a signal, I guess all it does is check whether the PID exists. Of course there might be the problem that the PID switches out from under you. Wonder if the old code handled that. The concern is theoretical so I wouldn't worry about regressing here.

…

On Wed, May 8, 2019, 20:39 Tobias Grieger ***@***.***> wrote: Sounds reasonable to me. I would be very surprised if it weren't negligible but worth checking once. On Wed, May 8, 2019, 20:19 Andrew Kryczka ***@***.***> wrote: > Should we measure how long kill takes? What is a good number? How about > sacrificing at most 0.1% of CPU time since this is only used in tests (I'd > guess it'll be way less than that since kill is a small program that > just sends a signal, and we're talking about running it once a second)? > > — > You are receiving this because you commented. > Reply to this email directly, view it on GitHub > <#37390 (comment)>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/ABGXPZFQOGGPU22XWWGWYRDPUMKT7ANCNFSM4HLOZSPA> > . >

ajkr · 2019-05-09T04:18:17Z

The kill command takes 0.646 ms so running it once per second is within our 0.1% overhead bound (although not by as large a margin as I would've guessed!).

$ rm -f ./kill-times.out ; for l in $(seq 1000); do perf stat -e cpu-clock kill -0 `pidof sleep` |& grep cpu-clock | awk '{print $1}' >>./kill-times.out ; done ; echo "scale=5; ($(paste -sd+ ./kill-times.out)) / 1000" | bc
.64649

ajkr · 2019-05-09T04:28:49Z

For comparison, here is the same methodology used to measure lsof times. Those were indeed way slower at 36ms per call. That'd be 3.6% of a core during benchmarking which I agree doesn't look good.

$ rm -f ./lsof-times.out ; for l in $(seq 1000); do perf stat -e cpu-clock lsof -i :12345 -sTCP:LISTEN |& grep cpu-clock | awk '{print $1}' >>./lsof-times.out ; done ; echo "scale=5; ($(paste -sd+ ./lsof-times.out)) / 1000" | bc
35.96782

`roachprod monitor` assumes `nc` will exit as soon as Cockroach server exits. This actually is not the case in later versions of netcat (tested on Ubuntu 18.04+). This PR changes to a polling approach calling `kill -0` once per second to monitor the Cockroach server's liveness. This should give us better portability and we verified the overhead is low (~0.65ms of a CPU core's time per `kill` invocation). Tested by running `roachprod monitor` locally, gradually killing the nodes, and observing the output: ``` 3: 28342 1: 28176 2: 28257 3: kill exited nonzero 3: dead 2: kill exited nonzero 2: dead 1: kill exited nonzero 1: dead ``` Fixes cockroachdb#37370. Release note: None

ajkr · 2019-05-09T04:49:49Z

Alright, RFAL.

ajkr · 2019-05-09T15:42:16Z

bors r+

37390: roachprod: remove monitor netcat command r=ajkr a=ajkr `roachprod monitor` assumes `nc` will exit as soon as Cockroach server exits. This actually is not the case in later versions of netcat (tested on Ubuntu 18.04+). This PR changes to a polling approach calling `kill -0` once per second to monitor the Cockroach server's liveness. This should give us better portability and we verified the overhead is low (~0.65ms of a CPU core's time per `kill` invocation). Tested by running `roachprod monitor` locally, gradually killing the nodes, and observing the output: ``` 3: 28342 1: 28176 2: 28257 3: kill exited nonzero 3: dead 2: kill exited nonzero 2: dead 1: kill exited nonzero 1: dead ``` Fixes #37370. Release note: None Co-authored-by: Andrew Kryczka <[email protected]>

craig · 2019-05-09T16:07:33Z

Build succeeded

GitHub CI (Cockroach)

The monitor stopped using netcat in cockroachdb#37390, but a bunch of comments about it were leftover. I think some code is leftover too, but I don't know what to do about it other than put a note on it. Release note: None

`roachprod monitor` used to use `netcat` to wait for process termination, but this was replaced by a `kill -0` loop back in cockroachdb#37390. However, the code still contained code and comments related to netcat. This patch removes the outdated `netcat` code and references. Release note: None

66414: roachprod: show process exit status with monitor r=tbg a=erikgrinaker ### roachprod: remove netcat references `roachprod monitor` used to use `netcat` to wait for process termination, but this was replaced by a `kill -0` loop back in #37390. However, the code still contained code and comments related to netcat. This patch removes the outdated `netcat` code and references. Release note: None ### roachprod: show process exit status with monitor This patch changes `roachprod monitor` to use `systemctl` to poll process info on non-local clusters, and outputs the exit status for dead nodes. On local clusters, it retains the old logic. Release note: None During the lifecycle of a cluster (`create`, `start`, `stop`) the output is: ``` $ roachprod monitor grinaker-mon 2: dead (exit status 0) 3: dead (exit status 0) 1: dead (exit status 0) 1: 9628 2: 9714 3: 9674 1: dead (exit status 137) 2: dead (exit status 137) 3: dead (exit status 137) ``` /cc @cockroachdb/test-eng Co-authored-by: Erik Grinaker <[email protected]>

`roachprod monitor` used to use `netcat` to wait for process termination, but this was replaced by a `kill -0` loop back in cockroachdb#37390. However, the code still contained code and comments related to netcat. This patch removes the outdated `netcat` code and references. Release note: None

ajkr force-pushed the fix-roachprod-monitor-netcat branch from d1d9675 to ca714fb Compare May 8, 2019 06:35

petermattis reviewed May 8, 2019

View reviewed changes

ajkr force-pushed the fix-roachprod-monitor-netcat branch from ca714fb to bb4cdf8 Compare May 9, 2019 04:48

ajkr changed the title ~~roachprod: fix monitor netcat command~~ roachprod: remove monitor netcat command May 9, 2019

tbg approved these changes May 9, 2019

View reviewed changes

craig bot merged commit bb4cdf8 into cockroachdb:master May 9, 2019

andreimatei mentioned this pull request Jul 10, 2020

roachprod: correct leftover netcat comments in the monitor #51298

Closed

erikgrinaker mentioned this pull request Jun 13, 2021

roachprod: show process exit status with monitor #66414

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachprod: remove monitor netcat command #37390

roachprod: remove monitor netcat command #37390

ajkr commented May 8, 2019 •

edited

Loading

cockroach-teamcity commented May 8, 2019

tbg commented May 8, 2019

petermattis left a comment

petermattis commented May 8, 2019

tbg commented May 8, 2019

petermattis commented May 8, 2019

tbg commented May 8, 2019

petermattis commented May 8, 2019

ajkr commented May 8, 2019

ajkr commented May 8, 2019 •

edited

Loading

tbg commented May 8, 2019 via email

tbg commented May 8, 2019 via email

ajkr commented May 9, 2019 •

edited

Loading

ajkr commented May 9, 2019

ajkr commented May 9, 2019

ajkr commented May 9, 2019

craig bot commented May 9, 2019

roachprod: remove monitor netcat command #37390

roachprod: remove monitor netcat command #37390

Conversation

ajkr commented May 8, 2019 • edited Loading

cockroach-teamcity commented May 8, 2019

tbg commented May 8, 2019

petermattis left a comment

Choose a reason for hiding this comment

petermattis commented May 8, 2019

tbg commented May 8, 2019

petermattis commented May 8, 2019

tbg commented May 8, 2019

petermattis commented May 8, 2019

ajkr commented May 8, 2019

ajkr commented May 8, 2019 • edited Loading

tbg commented May 8, 2019 via email

tbg commented May 8, 2019 via email

ajkr commented May 9, 2019 • edited Loading

ajkr commented May 9, 2019

ajkr commented May 9, 2019

ajkr commented May 9, 2019

craig bot commented May 9, 2019

Build succeeded

ajkr commented May 8, 2019 •

edited

Loading

ajkr commented May 8, 2019 •

edited

Loading

ajkr commented May 9, 2019 •

edited

Loading