roachprod: remove monitor netcat command #37390
I have bad news -- I'm thinking that this PS great find. Are we using 18.04 in our nightly testing, or where did you run into it? |
When `monitor` was originally written, I recall trying to poll with `lsof` and finding it moderately expensive. `nc` is such a simple utility. It is sad that this behavior is changing between Ubuntu releases.
Note that we know if the machine we're running `nc` on is local or not. We could optionally add `-d` based on that, or try to detect if `-d` is supported.
Another thought is to add |
@petermattis are you saying you don't just want to rip out Before we write a utility that does low-level nc stuff against an open crdb port, we can just add a GRPC endpoint that lets you wait forever. But then you have all the migration concerns to deal with. I really think we should just remove |
I'm fine with ripping out |
Oh, I see. How about polling with |
Seems possible. The irritating thing is testing that this doesn't have an effect on performance. |
That is very bad news!
Thanks. I'm using 18.04 and was addressing one of Peter's suggestions to make a callback for node deaths. When running with |
Should we measure how long |
Sounds reasonable to me. I would be very surprised if it weren't negligible
but worth checking once.
…On Wed, May 8, 2019, 20:19 Andrew Kryczka ***@***.***> wrote:
Should we measure how long kill takes? What is a good number? How about
sacrificing at most 0.1% of CPU time since this is only used in tests (I'd
guess it'll be way less than that since kill is a small program that just
sends a signal, and we're talking about running it once a second)?
PS with -0 it doesn't even send a signal, I guess all it does is check
whether the PID exists. Of course there might be the problem that the PID
switches out from under you. Wonder if the old code handled that. The
concern is theoretical so I wouldn't worry about regressing here.
The kill command takes 0.646 ms so running it once per second is within our 0.1% overhead bound (although not by as large a margin as I would've guessed!).
|
For comparison, here is the same methodology used to measure
|
`roachprod monitor` assumes `nc` will exit as soon as Cockroach server exits. This actually is not the case in later versions of netcat (tested on Ubuntu 18.04+).

This PR changes to a polling approach calling `kill -0` once per second to monitor the Cockroach server's liveness. This should give us better portability and we verified the overhead is low (~0.65ms of a CPU core's time per `kill` invocation).

Tested by running `roachprod monitor` locally, gradually killing the nodes, and observing the output:

```
3: 28342
1: 28176
2: 28257
3: kill exited nonzero
3: dead
2: kill exited nonzero
2: dead
1: kill exited nonzero
1: dead
```

Fixes cockroachdb#37370.

Release note: None
Alright, RFAL. |
bors r+ |
37390: roachprod: remove monitor netcat command r=ajkr a=ajkr

`roachprod monitor` assumes `nc` will exit as soon as Cockroach server exits. This actually is not the case in later versions of netcat (tested on Ubuntu 18.04+).

This PR changes to a polling approach calling `kill -0` once per second to monitor the Cockroach server's liveness. This should give us better portability and we verified the overhead is low (~0.65ms of a CPU core's time per `kill` invocation).

Tested by running `roachprod monitor` locally, gradually killing the nodes, and observing the output:

```
3: 28342
1: 28176
2: 28257
3: kill exited nonzero
3: dead
2: kill exited nonzero
2: dead
1: kill exited nonzero
1: dead
```

Fixes #37370.

Release note: None

Co-authored-by: Andrew Kryczka <[email protected]>
Build succeeded |
The monitor stopped using netcat in cockroachdb#37390, but a bunch of comments about it were leftover. I think some code is leftover too, but I don't know what to do about it other than put a note on it. Release note: None
`roachprod monitor` used to use `netcat` to wait for process termination, but this was replaced by a `kill -0` loop back in cockroachdb#37390. However, the code still contained code and comments related to netcat. This patch removes the outdated `netcat` code and references. Release note: None
66414: roachprod: show process exit status with monitor r=tbg a=erikgrinaker

### roachprod: remove netcat references

`roachprod monitor` used to use `netcat` to wait for process termination, but this was replaced by a `kill -0` loop back in #37390. However, the code still contained code and comments related to netcat.

This patch removes the outdated `netcat` code and references.

Release note: None

### roachprod: show process exit status with monitor

This patch changes `roachprod monitor` to use `systemctl` to poll process info on non-local clusters, and outputs the exit status for dead nodes. On local clusters, it retains the old logic.

Release note: None

During the lifecycle of a cluster (`create`, `start`, `stop`) the output is:

```
$ roachprod monitor grinaker-mon
2: dead (exit status 0)
3: dead (exit status 0)
1: dead (exit status 0)
1: 9628
2: 9714
3: 9674
1: dead (exit status 137)
2: dead (exit status 137)
3: dead (exit status 137)
```

/cc @cockroachdb/test-eng

Co-authored-by: Erik Grinaker <[email protected]>
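As a rough illustration of the `systemctl`-based polling, the Go sketch below parses `KEY=VALUE` output of the kind `systemctl show` prints and renders a monitor line in the format shown above. Several things here are assumptions, not details from the PR: the choice of the `ActiveState`, `MainPID`, and `ExecMainStatus` properties, the sample strings, and the helper names. The sketch does not invoke `systemctl` itself.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseProps parses KEY=VALUE lines into a map, skipping lines without "=".
func parseProps(out string) map[string]string {
	props := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		kv := strings.SplitN(sc.Text(), "=", 2)
		if len(kv) == 2 {
			props[kv[0]] = kv[1]
		}
	}
	return props
}

// status renders one monitor line for a node: the PID while the unit is
// active, otherwise "dead" with the recorded exit status.
func status(node int, props map[string]string) string {
	if props["ActiveState"] == "active" && props["MainPID"] != "0" {
		return fmt.Sprintf("%d: %s", node, props["MainPID"])
	}
	return fmt.Sprintf("%d: dead (exit status %s)", node, props["ExecMainStatus"])
}

func main() {
	// Hypothetical samples of what a poll might return for a running
	// node and for a node killed with SIGKILL (exit status 137).
	running := "ActiveState=active\nMainPID=9628\nExecMainStatus=0\n"
	killed := "ActiveState=failed\nMainPID=0\nExecMainStatus=137\n"

	fmt.Println(status(1, parseProps(running)))
	fmt.Println(status(1, parseProps(killed)))
}
```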