
stability: investigate query dip after 10 min node outage #10972

Closed
petermattis opened this issue Nov 23, 2016 · 10 comments

@petermattis
Collaborator

A 6 min chaos node outage on blue caused a dip in query performance after the node restarted:

[screenshot: screen shot 2016-11-22 at 9 40 58 pm]

The node restarted at 22:47:?, which corresponds to the start of the initial dip in performance. Performance then recovered and dipped again. The queue metrics show:

[screenshot: screen shot 2016-11-22 at 9 41 37 pm]

The initial bump in replicate queue activity after the node had been down for 5 min is expected. But what is that second bump that starts at 22:50? The down node had been up for 2+ min at that point. Per-node metrics show the replica GC queue activity was all on the down node, which is to be expected.

The down node shows the following replicate queue activity:

[screenshot: screen shot 2016-11-22 at 9 47 06 pm]

So it was queueing up replicas for processing but never actually spending any time processing them. Perhaps it wasn't able to acquire the lease.
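For intuition on how that shows up in the metrics, here's a minimal sketch (not the actual replicate queue code; the types and the lease check are illustrative assumptions) of a queue loop that requires the range lease before doing any work. If the lease can never be acquired, the pending count grows while the processing-time metric stays flat:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type replica struct{ rangeID int }

// hasOrAcquiresLease is a stand-in for a lease check; on a node whose
// liveness is suspect it would keep failing (illustrative only).
func hasOrAcquiresLease(r replica) error {
	return errors.New("could not acquire lease")
}

func process(r replica) error {
	// Real rebalancing work would happen here.
	time.Sleep(10 * time.Millisecond)
	return nil
}

func main() {
	pending := []replica{{1}, {2}, {3}}
	var processingTime time.Duration

	for _, r := range pending {
		if err := hasOrAcquiresLease(r); err != nil {
			// Item stays "pending"; no processing time is ever recorded.
			fmt.Printf("r%d: skipping: %v\n", r.rangeID, err)
			continue
		}
		start := time.Now()
		_ = process(r)
		processingTime += time.Since(start)
	}
	fmt.Println("total processing time:", processingTime)
}
```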

@petermattis petermattis self-assigned this Nov 23, 2016
@petermattis petermattis added this to the 1.0 milestone Feb 23, 2017
@petermattis petermattis changed the title stability: investigate query dip after 6min node outage stability: investigate query dip after 10 min node outage Apr 5, 2017
@petermattis
Collaborator Author

A 10 min chaos event on blue caused a dip in performance both while the node was dead and while it was recovering:

[screenshot: screen shot 2017-04-05 at 3 12 50 pm]

Initially, recovery looked good with the under-replicated ranges dropping to 0 around 18:43. But then additional badness started happening at 18:50.

[screenshot: screen shot 2017-04-05 at 3 13 17 pm]

[screenshot: screen shot 2017-04-05 at 3 13 25 pm]

We can also see the badness in the node liveness graph. The resuscitated node started failing node liveness heartbeats at 18:50 and then had a number of other blips before eventually recovering for good.

[screenshot: screen shot 2017-04-05 at 3 14 32 pm]

The replica leaseholders graph showed that the recovering node initially started getting leases, then shed them all rapidly when the liveness heartbeats failed.

[screenshot: screen shot 2017-04-05 at 3 13 03 pm]
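That lease shedding is what you'd expect if lease validity is tied to the node liveness record. Here's a minimal sketch of that relationship, assuming epoch-based leases; the types, fields, and durations are illustrative, not the actual lease code:

```go
package main

import (
	"fmt"
	"time"
)

// liveness is an illustrative stand-in for a node liveness record.
type liveness struct {
	epoch      int64
	expiration time.Time
}

// lease is an illustrative epoch-based lease: it is only usable while the
// holder's liveness record for the same epoch is still live.
type lease struct {
	holderNodeID int
	epoch        int64
}

func leaseValid(l lease, live liveness, now time.Time) bool {
	return l.epoch == live.epoch && now.Before(live.expiration)
}

func main() {
	now := time.Now()
	live := liveness{epoch: 5, expiration: now.Add(9 * time.Second)}
	l := lease{holderNodeID: 3, epoch: 5}

	fmt.Println("valid while heartbeating:", leaseValid(l, live, now))

	// If the node stops heartbeating, its liveness record expires and every
	// lease held at that epoch becomes unusable at once, which would show up
	// as the rapid lease shedding in the graph above.
	later := now.Add(15 * time.Second)
	fmt.Println("valid after missed heartbeats:", leaseValid(l, live, later))
}
```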

Still poking around the graphs to see if anything else jumps out.

@petermattis
Collaborator Author

Looks like there is some sort of RPC problem when the node liveness failures occur. During another chaos event on blue, I see the recovering node is doing fine and then boom, various slow request errors along with:

I170406 15:40:14.301949 422847 vendor/google.golang.org/grpc/server.go:752  grpc: Server.processUnaryRPC failed to write status: stream error: code = DeadlineExceeded desc = "context deadline exceeded"

I wonder if the stream and connection window size settings are proving problematic.

@andreimatei
Contributor

andreimatei commented Apr 6, 2017 via email

@petermattis
Collaborator Author

petermattis commented Apr 6, 2017

> The next thing I was going to do is make the conn win size a multiple of
> the per stream one, or in fact get rid of the per-connection window
> altogether if I can convince Ben.

How are you going to do that? (Get rid of the window, not convince Ben).

@andreimatei
Contributor

andreimatei commented Apr 6, 2017 via email

@petermattis
Collaborator Author

Notice in the Raft Log graph above how the "raft log behind" metric initially shrinks rapidly when the dead node comes back, then has a period of flatness before another spike. I think this is yet more indication of RPC problems.

@petermattis
Collaborator Author

> Just like you've done it... Change the constants in grpc.

I tried setting the connection window size to 8MB. It didn't help.
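For reference, here is a sketch of what tuning those window sizes looks like with the grpc-go options that expose them (at the time, this experiment meant editing the hardcoded constants in the vendored library); the 2 MB stream window and 8 MB connection window below are just the values under discussion, not recommendations:

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

const (
	streamWindowSize = 2 << 20 // 2 MB per-stream window (illustrative)
	connWindowSize   = 8 << 20 // 8 MB per-connection window, as tried above
)

func main() {
	// Server side: raise the initial per-stream and per-connection
	// HTTP/2 flow-control windows.
	srv := grpc.NewServer(
		grpc.InitialWindowSize(streamWindowSize),
		grpc.InitialConnWindowSize(connWindowSize),
	)

	lis, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	go srv.Serve(lis)
	defer srv.Stop()

	// Client side: the dial options mirror the server options.
	conn, err := grpc.Dial(lis.Addr().String(),
		grpc.WithInsecure(),
		grpc.WithInitialWindowSize(streamWindowSize),
		grpc.WithInitialConnWindowSize(connWindowSize),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```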

@petermattis
Collaborator Author

I still haven't tracked down where the RPC problem is coming from. I'm experimenting with rate limiting the bandwidth we use for sending snapshots. Initial testing with a limit of 2 MB/sec shows this significantly lessens the performance impact of the rebalance traffic when a node is declared dead.
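The throttling itself can be as simple as a token bucket in front of the chunked send loop. A sketch using golang.org/x/time/rate, with a hypothetical sendThrottled helper standing in for the real snapshot-sending code:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// sendThrottled copies snapshot data in chunks, waiting on the limiter
// before each chunk so sustained bandwidth stays at the configured rate.
func sendThrottled(ctx context.Context, lim *rate.Limiter, data []byte, chunkSize int, send func([]byte) error) error {
	for len(data) > 0 {
		n := chunkSize
		if n > len(data) {
			n = len(data)
		}
		// Block until the limiter allows another n bytes.
		if err := lim.WaitN(ctx, n); err != nil {
			return err
		}
		if err := send(data[:n]); err != nil {
			return err
		}
		data = data[n:]
	}
	return nil
}

func main() {
	const limit = 2 << 20 // 2 MB/sec, the limit used in the experiment above
	lim := rate.NewLimiter(rate.Limit(limit), limit /* burst: one second's worth */)

	start := time.Now()
	payload := make([]byte, 8<<20) // an 8 MB "snapshot"
	_ = sendThrottled(context.Background(), lim, payload, 256<<10, func(chunk []byte) error {
		return nil // a real implementation would write to the gRPC stream here
	})
	fmt.Printf("sent 8 MB in %s (expect roughly 3s: 2 MB of burst, then 2 MB/sec)\n",
		time.Since(start).Round(time.Second))
}
```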

@petermattis
Collaborator Author

Note to self: my experiment uses a single bandwidth limit for both preemptive and Raft snapshots, but we probably want to allow Raft snapshots to be given more bandwidth.
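A sketch of how two separate budgets could be wired up, using the COCKROACH_PREEMPTIVE_SNAPSHOT_RATE and COCKROACH_RAFT_SNAPSHOT_RATE names from the commits referenced below; the parsing and defaults here are illustrative, not the actual implementation:

```go
package main

import (
	"fmt"
	"os"
	"strconv"

	"golang.org/x/time/rate"
)

// rateFromEnv reads a bytes/sec limit from an environment variable,
// falling back to a default when unset or unparsable.
func rateFromEnv(name string, def int64) *rate.Limiter {
	limit := def
	if v := os.Getenv(name); v != "" {
		if parsed, err := strconv.ParseInt(v, 10, 64); err == nil && parsed > 0 {
			limit = parsed
		}
	}
	return rate.NewLimiter(rate.Limit(limit), int(limit))
}

func main() {
	// Defaults match the limits discussed above: preemptive snapshots are
	// background rebalancing work, while Raft snapshots are needed to catch
	// up live replicas, so they get a larger budget.
	preemptiveLimiter := rateFromEnv("COCKROACH_PREEMPTIVE_SNAPSHOT_RATE", 2<<20)
	raftLimiter := rateFromEnv("COCKROACH_RAFT_SNAPSHOT_RATE", 4<<20)

	fmt.Println("preemptive limit (B/s):", preemptiveLimiter.Limit())
	fmt.Println("raft limit (B/s):", raftLimiter.Limit())
}
```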

@petermattis
Collaborator Author

Here are graphs from a 10 min chaos event with experimental code which limits preemptive snapshots to 2 MB/sec and Raft snapshots to 4 MB/sec:

[screenshot: screen shot 2017-04-06 at 11 07 28 pm]

[screenshot: screen shot 2017-04-06 at 11 07 36 pm]

[screenshot: screen shot 2017-04-06 at 11 07 47 pm]

The throughput impact of the node outage was significantly reduced and the node liveness failures disappeared, which seems to indicate they were related either to overwhelming the gRPC connection or to some bug in gRPC. I'm going to dig into this more tomorrow to try to better understand where the node liveness failures were coming from.

petermattis added a commit to petermattis/cockroach that referenced this issue Apr 7, 2017
Limit the bandwidth used for snapshots. Preemptive snapshots are
throttled to 2 MB/sec (COCKROACH_PREEMPTIVE_SNAPSHOT_RATE) and Raft
snapshots are throttled to 4 MB/sec (COCKROACH_RAFT_SNAPSHOT_RATE). The
effect of limiting the bandwidth is that a preemptive snapshot for a 64
MB range will take ~32s to send and a Raft snapshot will take ~16s. The
benefit is a much smaller impact on foreground traffic.

Fixes cockroachdb#10972
petermattis added a commit to petermattis/cockroach that referenced this issue Apr 10, 2017
Limit the bandwidth used for snapshots. Preemptive snapshots are
throttled to 2 MB/sec (COCKROACH_PREEMPTIVE_SNAPSHOT_RATE) and Raft
snapshots are throttled to 8 MB/sec (COCKROACH_RAFT_SNAPSHOT_RATE). The
effect of limiting the bandwidth is that a preemptive snapshot for a 64
MB range will take ~32s to send and a Raft snapshot will take ~8s. The
benefit is a much smaller impact on foreground traffic.

Fixes cockroachdb#10972