Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachprod: remove monitor netcat command #37390

Merged
merged 1 commit into from
May 9, 2019

Conversation

ajkr
Copy link
Contributor

@ajkr ajkr commented May 8, 2019

roachprod monitor assumes nc will exit as soon as Cockroach server
exits. This actually is not the case in later versions of netcat (tested
on Ubuntu 18.04+).

This PR changes to a polling approach calling kill -0 once per second
to monitor the Cockroach server's liveness. This should give us better
portability and we verified the overhead is low (~0.65ms of a CPU core's
time per kill invocation). Tested by running roachprod monitor
locally, gradually killing the nodes, and observing the output:

3: 28342
1: 28176
2: 28257
3: kill exited nonzero
3: dead
2: kill exited nonzero
2: dead
1: kill exited nonzero
1: dead

Fixes #37370.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@ajkr ajkr force-pushed the fix-roachprod-monitor-netcat branch from d1d9675 to ca714fb Compare May 8, 2019 06:35
@tbg
Copy link
Member

tbg commented May 8, 2019

I have bad news -- -d isn't a thing on OSX, and this stuff has to work there too (because of --local).

I'm thinking that this netcat thing is trying to be too clever. It's just trying to achieve "instant" detection of the dead node, but we could just replace it with a "sleep 1" and end up in a functionally similar, but much simpler place. Am I missing a reason for which this isn't viable?

PS great find. Are we using 18.04 in our nightly testing, or where did you run into it?

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When monitor was originally written, I recall trying to poll with lsof and finding it moderately expensive. nc is such a simple utility. It is sad that this behavior is changing between ubuntu releases.

Note that we know if the machine we're running nc on is local or not. We could optionally add -d based on that, or try to detect if -d is supported.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@petermattis
Copy link
Collaborator

Another thought is to add cockroach debug monitor or cockroach debug wait. I'd guess it is a few dozen lines of Go code to open the network connection and block waiting for it to be closed.

@tbg
Copy link
Member

tbg commented May 8, 2019

@petermattis are you saying you don't just want to rip out nc? That seems like a straightforward way to kill less eng time on this.

Before we write a utility that does low-level nc stuff against an open crdb port, we can just add a GRPC endpoint that lets you wait forever. But then you have all the migration concerns to deal with. I really think we should just remove nc from the equation and loop a little more often.

@petermattis
Copy link
Collaborator

@petermattis are you saying you don't just want to rip out nc? That seems like a straightforward way to kill less eng time on this.

I'm fine with ripping out nc, but see: "When monitor was originally written, I recall trying to poll with lsof and finding it moderately expensive." For performance tests, we don't want monitor to have an effect on performance.

@tbg
Copy link
Member

tbg commented May 8, 2019

Oh, I see. How about polling with kill -0 $PID? You use lsof to discover and then do a cheap thing frequently after.

@petermattis
Copy link
Collaborator

Oh, I see. How about polling with kill -0 $PID? You use lsof to discover and then do a cheap thing frequently after.

Seems possible. The irritating thing is testing that this doesn't have an effect on performance.

@ajkr
Copy link
Contributor Author

ajkr commented May 8, 2019

I have bad news -- -d isn't a thing on OSX, and this stuff has to work there too (because of --local).

That is very bad news!

PS great find. Are we using 18.04 in our nightly testing, or where did you run into it?

Thanks. I'm using 18.04 and was addressing one of Peter's suggestions to make a callback for node deaths. When running with --local on my machine, that callback never got called.

@ajkr
Copy link
Contributor Author

ajkr commented May 8, 2019

Should we measure how long kill takes? What is a good number? How about sacrificing at most 0.1% of CPU time since this is only used in tests? I'd guess it'll be way less than that since kill is a small program that just sends a signal, and we're talking about running it once a second. But we can reduce the frequency of running it if needed.

@tbg
Copy link
Member

tbg commented May 8, 2019 via email

@tbg
Copy link
Member

tbg commented May 8, 2019 via email

@ajkr
Copy link
Contributor Author

ajkr commented May 9, 2019

The kill command takes 0.646 ms so running it once per second is within our 0.1% overhead bound (although not by as large a margin as I would've guessed!).

$ rm -f ./kill-times.out ; for l in $(seq 1000); do perf stat -e cpu-clock kill -0 `pidof sleep` |& grep cpu-clock | awk '{print $1}' >>./kill-times.out ; done ; echo "scale=5; ($(paste -sd+ ./kill-times.out)) / 1000" | bc
.64649

@ajkr
Copy link
Contributor Author

ajkr commented May 9, 2019

For comparison, here is the same methodology used to measure lsof times. Those were indeed way slower at 36ms per call. That'd be 3.6% of a core during benchmarking which I agree doesn't look good.

$ rm -f ./lsof-times.out ; for l in $(seq 1000); do perf stat -e cpu-clock lsof -i :12345 -sTCP:LISTEN |& grep cpu-clock | awk '{print $1}' >>./lsof-times.out ; done ; echo "scale=5; ($(paste -sd+ ./lsof-times.out)) / 1000" | bc
35.96782

`roachprod monitor` assumes `nc` will exit as soon as Cockroach server
exits. This actually is not the case in later versions of netcat (tested
on Ubuntu 18.04+).

This PR changes to a polling approach calling `kill -0` once per second
to monitor the Cockroach server's liveness. This should give us better
portability and we verified the overhead is low (~0.65ms of a CPU core's
time per `kill` invocation). Tested by running `roachprod monitor`
locally, gradually killing the nodes, and observing the output:

```
3: 28342
1: 28176
2: 28257
3: kill exited nonzero
3: dead
2: kill exited nonzero
2: dead
1: kill exited nonzero
1: dead
```

Fixes cockroachdb#37370.

Release note: None
@ajkr ajkr force-pushed the fix-roachprod-monitor-netcat branch from ca714fb to bb4cdf8 Compare May 9, 2019 04:48
@ajkr ajkr changed the title roachprod: fix monitor netcat command roachprod: remove monitor netcat command May 9, 2019
@ajkr
Copy link
Contributor Author

ajkr commented May 9, 2019

Alright, RFAL.

@ajkr
Copy link
Contributor Author

ajkr commented May 9, 2019

bors r+

craig bot pushed a commit that referenced this pull request May 9, 2019
37390: roachprod: remove monitor netcat command r=ajkr a=ajkr

`roachprod monitor` assumes `nc` will exit as soon as Cockroach server
exits. This actually is not the case in later versions of netcat (tested
on Ubuntu 18.04+).

This PR changes to a polling approach calling `kill -0` once per second
to monitor the Cockroach server's liveness. This should give us better
portability and we verified the overhead is low (~0.65ms of a CPU core's
time per `kill` invocation). Tested by running `roachprod monitor`
locally, gradually killing the nodes, and observing the output:

```
3: 28342
1: 28176
2: 28257
3: kill exited nonzero
3: dead
2: kill exited nonzero
2: dead
1: kill exited nonzero
1: dead
```

Fixes #37370.

Release note: None

Co-authored-by: Andrew Kryczka <[email protected]>
@craig
Copy link
Contributor

craig bot commented May 9, 2019

Build succeeded

@craig craig bot merged commit bb4cdf8 into cockroachdb:master May 9, 2019
andreimatei added a commit to andreimatei/cockroach that referenced this pull request Jul 10, 2020
The monitor stopped using netcat in cockroachdb#37390, but a bunch of comments
about it were leftover. I think some code is leftover too, but I don't
know what to do about it other than put a note on it.

Release note: None
erikgrinaker added a commit to erikgrinaker/cockroach that referenced this pull request Jun 13, 2021
`roachprod monitor` used to use `netcat` to wait for process
termination, but this was replaced by a `kill -0` loop back in cockroachdb#37390.
However, the code still contained code and comments related to netcat.

This patch removes the outdated `netcat` code and references.

Release note: None
craig bot pushed a commit that referenced this pull request Jun 14, 2021
66414: roachprod: show process exit status with monitor r=tbg a=erikgrinaker

### roachprod: remove netcat references

`roachprod monitor` used to use `netcat` to wait for process
termination, but this was replaced by a `kill -0` loop back in #37390.
However, the code still contained code and comments related to netcat.

This patch removes the outdated `netcat` code and references.

Release note: None

### roachprod: show process exit status with monitor

This patch changes `roachprod monitor` to use `systemctl` to poll
process info on non-local clusters, and outputs the exit status for dead
nodes. On local clusters, it retains the old logic.

Release note: None

During the lifecycle of a cluster (`create`, `start`, `stop`) the output is:

```
$ roachprod monitor grinaker-mon
2: dead (exit status 0)
3: dead (exit status 0)
1: dead (exit status 0)
1: 9628
2: 9714
3: 9674
1: dead (exit status 137)
2: dead (exit status 137)
3: dead (exit status 137)
```

/cc @cockroachdb/test-eng 

Co-authored-by: Erik Grinaker <[email protected]>
erikgrinaker added a commit to erikgrinaker/cockroach that referenced this pull request Jun 28, 2021
`roachprod monitor` used to use `netcat` to wait for process
termination, but this was replaced by a `kill -0` loop back in cockroachdb#37390.
However, the code still contained code and comments related to netcat.

This patch removes the outdated `netcat` code and references.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachprod: monitor's dead node detection is netcat version dependent
4 participants