
stability: investigate query dip after 10 min node outage #10972

Closed
petermattis opened this issue Nov 23, 2016 · 10 comments

@petermattis
Collaborator

A 6 min chaos node outage on blue caused a dip in query performance after the node restarted:

[screenshot: screen shot 2016-11-22 at 9 40 58 pm]

The node restarted at 22:47:?, which corresponds to the start of the initial dip in performance. Performance then recovered and dipped again. The queue metrics show:

[screenshot: screen shot 2016-11-22 at 9 41 37 pm]

The initial bump in replicate queue activity after the node had been down for 5 min is expected. But what is that second bump that starts at 22:50? The down node had been up for 2+ min at that point. Per-node metrics show the replica GC queue activity was all on the down node, which is to be expected.

The down node shows the following replicate queue activity:

[screenshot: screen shot 2016-11-22 at 9 47 06 pm]

So it was queueing up replicas for processing but never actually spending any time processing them. Perhaps it wasn't able to acquire the lease.
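For intuition on how that shows up in the metrics, here's a minimal sketch (not the actual replicate queue code; the types and the lease check are illustrative assumptions) of a queue loop that requires the range lease before doing any work. If the lease can never be acquired, the pending count grows while the processing-time metric stays flat:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type replica struct{ rangeID int }

// hasOrAcquiresLease is a stand-in for a lease check; on a node whose
// liveness is suspect it would keep failing (illustrative only).
func hasOrAcquiresLease(r replica) error {
	return errors.New("could not acquire lease")
}

func process(r replica) error {
	// Real rebalancing work would happen here.
	time.Sleep(10 * time.Millisecond)
	return nil
}

func main() {
	pending := []replica{{1}, {2}, {3}}
	var processingTime time.Duration

	for _, r := range pending {
		if err := hasOrAcquiresLease(r); err != nil {
			// Item stays "pending"; no processing time is ever recorded.
			fmt.Printf("r%d: skipping: %v\n", r.rangeID, err)
			continue
		}
		start := time.Now()
		_ = process(r)
		processingTime += time.Since(start)
	}
	fmt.Println("total processing time:", processingTime)
}
```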

@petermattis petermattis self-assigned this Nov 23, 2016
@petermattis petermattis added this to the 1.0 milestone Feb 23, 2017
@petermattis petermattis changed the title stability: investigate query dip after 6min node outage stability: investigate query dip after 10 min node outage Apr 5, 2017
@petermattis
Collaborator Author

A 10 min chaos event on blue caused a dip in performance both while the node was dead and while it was recovering:

[screenshot: screen shot 2017-04-05 at 3 12 50 pm]

Initially, recovery looked good with the under-replicated ranges dropping to 0 around 18:43. But then additional badness started happening at 18:50.

[screenshot: screen shot 2017-04-05 at 3 13 17 pm]

[screenshot: screen shot 2017-04-05 at 3 13 25 pm]

We can also see the badness in the node liveness graph. The resuscitated node started failing node liveness heartbeats at 18:50 and then had a number of other blips before eventually recovering for good.

[screenshot: screen shot 2017-04-05 at 3 14 32 pm]

The replica leaseholders graph showed that the recovering node initially started getting leases, then shed them all rapidly when the liveness heartbeats failed.

[screenshot: screen shot 2017-04-05 at 3 13 03 pm]
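That lease shedding is what you'd expect if lease validity is tied to the node liveness record. Here's a minimal sketch of that relationship, assuming epoch-based leases; the types, fields, and durations are illustrative, not the actual lease code:

```go
package main

import (
	"fmt"
	"time"
)

// liveness is an illustrative stand-in for a node liveness record.
type liveness struct {
	epoch      int64
	expiration time.Time
}

// lease is an illustrative epoch-based lease: it is only usable while the
// holder's liveness record for the same epoch is still live.
type lease struct {
	holderNodeID int
	epoch        int64
}

func leaseValid(l lease, live liveness, now time.Time) bool {
	return l.epoch == live.epoch && now.Before(live.expiration)
}

func main() {
	now := time.Now()
	live := liveness{epoch: 5, expiration: now.Add(9 * time.Second)}
	l := lease{holderNodeID: 3, epoch: 5}

	fmt.Println("valid while heartbeating:", leaseValid(l, live, now))

	// If the node stops heartbeating, its liveness record expires and every
	// lease held at that epoch becomes unusable at once, which would show up
	// as the rapid lease shedding in the graph above.
	later := now.Add(15 * time.Second)
	fmt.Println("valid after missed heartbeats:", leaseValid(l, live, later))
}
```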

Still poking around the graphs to see if anything else jumps out.

@petermattis
Collaborator Author

Looks like there is some sort of RPC problem when the node liveness failures occur. During another chaos event on blue, I see the recovering node is doing fine and then boom, various slow request errors along with:

I170406 15:40:14.301949 422847 vendor/google.golang.org/grpc/server.go:752  grpc: Server.processUnaryRPC failed to write status: stream error: code = DeadlineExceeded desc = "context deadline exceeded"

I wonder if the stream and connection window size settings are proving problematic.

@andreimatei
Contributor

andreimatei commented Apr 6, 2017 via email

@petermattis
Collaborator Author

petermattis commented Apr 6, 2017

> The next thing I was going to do is make the conn win size a multiple of
> the per stream one, or in fact get rid of the per-connection window
> altogether if I can convince Ben.

How are you going to do that? (Get rid of the window, not convince Ben).

@andreimatei
Contributor

andreimatei commented Apr 6, 2017 via email

@petermattis
Collaborator Author

Notice in the Raft Log graph above how the "raft log behind" metric initially shrinks rapidly when the dead node comes back, then has a period of flatness before another spike. I think this is yet more indication of RPC problems.

@petermattis
Collaborator Author

> Just like you've done it... Change the constants in grpc.

I tried setting the connection window size to 8MB. It didn't help.
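For reference, here is a sketch of what tuning those window sizes looks like with the grpc-go options that expose them (at the time, this experiment meant editing the hardcoded constants in the vendored library); the 2 MB stream window and 8 MB connection window below are just the values under discussion, not recommendations:

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

const (
	streamWindowSize = 2 << 20 // 2 MB per-stream window (illustrative)
	connWindowSize   = 8 << 20 // 8 MB per-connection window, as tried above
)

func main() {
	// Server side: raise the initial per-stream and per-connection
	// HTTP/2 flow-control windows.
	srv := grpc.NewServer(
		grpc.InitialWindowSize(streamWindowSize),
		grpc.InitialConnWindowSize(connWindowSize),
	)

	lis, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	go srv.Serve(lis)
	defer srv.Stop()

	// Client side: the dial options mirror the server options.
	conn, err := grpc.Dial(lis.Addr().String(),
		grpc.WithInsecure(),
		grpc.WithInitialWindowSize(streamWindowSize),
		grpc.WithInitialConnWindowSize(connWindowSize),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```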

@petermattis
Collaborator Author

I still haven't tracked down where the RPC problem is coming from. I'm experimenting with rate limiting the bandwidth we use for sending snapshots. Initial testing with a limit of 2 MB/sec shows this significantly lessens the performance impact of the rebalance traffic when a node is declared dead.
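The throttling itself can be as simple as a token bucket in front of the chunked send loop. A sketch using golang.org/x/time/rate, with a hypothetical sendThrottled helper standing in for the real snapshot-sending code:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// sendThrottled copies snapshot data in chunks, waiting on the limiter
// before each chunk so sustained bandwidth stays at the configured rate.
func sendThrottled(ctx context.Context, lim *rate.Limiter, data []byte, chunkSize int, send func([]byte) error) error {
	for len(data) > 0 {
		n := chunkSize
		if n > len(data) {
			n = len(data)
		}
		// Block until the limiter allows another n bytes.
		if err := lim.WaitN(ctx, n); err != nil {
			return err
		}
		if err := send(data[:n]); err != nil {
			return err
		}
		data = data[n:]
	}
	return nil
}

func main() {
	const limit = 2 << 20 // 2 MB/sec, the limit used in the experiment above
	lim := rate.NewLimiter(rate.Limit(limit), limit /* burst: one second's worth */)

	start := time.Now()
	payload := make([]byte, 8<<20) // an 8 MB "snapshot"
	_ = sendThrottled(context.Background(), lim, payload, 256<<10, func(chunk []byte) error {
		return nil // a real implementation would write to the gRPC stream here
	})
	fmt.Printf("sent 8 MB in %s (expect roughly 3s: 2 MB of burst, then 2 MB/sec)\n",
		time.Since(start).Round(time.Second))
}
```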

@petermattis
Collaborator Author

Note to self: my experiment uses a single bandwidth limit for both preemptive and Raft snapshots, but we probably want to allow Raft snapshots to be given more bandwidth.
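A sketch of how two separate budgets could be wired up, using the COCKROACH_PREEMPTIVE_SNAPSHOT_RATE and COCKROACH_RAFT_SNAPSHOT_RATE names from the commits referenced below; the parsing and defaults here are illustrative, not the actual implementation:

```go
package main

import (
	"fmt"
	"os"
	"strconv"

	"golang.org/x/time/rate"
)

// rateFromEnv reads a bytes/sec limit from an environment variable,
// falling back to a default when unset or unparsable.
func rateFromEnv(name string, def int64) *rate.Limiter {
	limit := def
	if v := os.Getenv(name); v != "" {
		if parsed, err := strconv.ParseInt(v, 10, 64); err == nil && parsed > 0 {
			limit = parsed
		}
	}
	return rate.NewLimiter(rate.Limit(limit), int(limit))
}

func main() {
	// Defaults match the limits discussed above: preemptive snapshots are
	// background rebalancing work, while Raft snapshots are needed to catch
	// up live replicas, so they get a larger budget.
	preemptiveLimiter := rateFromEnv("COCKROACH_PREEMPTIVE_SNAPSHOT_RATE", 2<<20)
	raftLimiter := rateFromEnv("COCKROACH_RAFT_SNAPSHOT_RATE", 4<<20)

	fmt.Println("preemptive limit (B/s):", preemptiveLimiter.Limit())
	fmt.Println("raft limit (B/s):", raftLimiter.Limit())
}
```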

@petermattis
Collaborator Author

Here are graphs from a 10 min chaos event with experimental code which limits preemptive snapshots to 2 MB/sec and Raft snapshots to 4 MB/sec:

[screenshot: screen shot 2017-04-06 at 11 07 28 pm]

[screenshot: screen shot 2017-04-06 at 11 07 36 pm]

[screenshot: screen shot 2017-04-06 at 11 07 47 pm]

The throughput impact of the node outage was significantly reduced and the node liveness failures disappeared, which seems to indicate they were related either to overwhelming the gRPC connection or to some bug in gRPC. I'm going to dig into this more tomorrow to try to better understand where the node liveness failures were coming from.

petermattis added a commit to petermattis/cockroach that referenced this issue Apr 7, 2017
Limit the bandwidth used for snapshots. Preemptive snapshots are
throttled to 2 MB/sec (COCKROACH_PREEMPTIVE_SNAPSHOT_RATE) and Raft
snapshots are throttled to 4 MB/sec (COCKROACH_RAFT_SNAPSHOT_RATE). The
effect of limiting the bandwidth is that a preemptive snapshot for a 64
MB range will take ~32s to send and a Raft snapshot will take ~16s. The
benefit is a much smaller impact on foreground traffic.

Fixes cockroachdb#10972
petermattis added a commit to petermattis/cockroach that referenced this issue Apr 10, 2017
Limit the bandwidth used for snapshots. Preemptive snapshots are
throttled to 2 MB/sec (COCKROACH_PREEMPTIVE_SNAPSHOT_RATE) and Raft
snapshots are throttled to 8 MB/sec (COCKROACH_RAFT_SNAPSHOT_RATE). The
effect of limiting the bandwidth is that a preemptive snapshot for a 64
MB range will take ~32s to send and a Raft snapshot will take ~8s. The
benefit is a much smaller impact on foreground traffic.

Fixes cockroachdb#10972