stability: investigate query dip after 10 min node outage #10972
A 10 min chaos event: initially, recovery looked good, with under-replicated ranges dropping to 0 around 18:43. But then additional badness started happening at 18:50. We can also see the badness in the node liveness graph. The resuscitated node started failing node liveness heartbeats at 18:50 and then had a number of other blips before eventually recovering for good. The replica leaseholders graph showed that the recovering node initially started getting leases, then shed them all rapidly when the liveness heartbeats failed. Still poking around the graphs to see if anything else jumps out.
Looks like there is some sort of RPC problem when the node liveness failures occur. During another chaos event on blue, I see the recovering node is doing fine and then boom, various slow request errors along with:
I170406 15:40:14.301949 422847 vendor/google.golang.org/grpc/server.go:752 grpc: Server.processUnaryRPC failed to write status: stream error: code = DeadlineExceeded desc = "context deadline exceeded"
I wonder if the stream and connection window size settings are proving problematic.
I think they are. The context deadline exceeded error is likely just our heartbeats not making it across. I'm fixing that in:
https://reviewable.io/reviews/cockroachdb/cockroach/14424#-
The next thing I was going to do is make the connection window size a multiple of the per-stream one, or in fact get rid of the per-connection window altogether if I can convince Ben.
How are you going to do that? (Get rid of the window, not convince Ben).
Just like you've done it... Change the constants in grpc.
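For reference, a minimal sketch of the kind of flow-control tuning being discussed here, assuming grpc-go's InitialWindowSize and InitialConnWindowSize dial options; at the time this was done by editing constants in the vendored gRPC sources, and the 2 MB / 8 MB values below are illustrative, not the settings actually used.

```go
package main

import (
	"log"

	"google.golang.org/grpc"
)

const (
	// Illustrative values: a per-stream window and a per-connection window
	// that is a multiple of it, as proposed above.
	streamWindowSize = 2 << 20              // 2 MB per stream
	connWindowSize   = 4 * streamWindowSize // 8 MB per connection
)

// dialNode opens a gRPC connection with enlarged HTTP/2 flow-control windows.
func dialNode(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithInsecure(), // illustration only; real nodes use TLS
		grpc.InitialWindowSize(streamWindowSize),
		grpc.InitialConnWindowSize(connWindowSize),
	)
}

func main() {
	conn, err := dialNode("localhost:26257")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```

Keeping the per-connection window larger than the per-stream window means one large transfer (such as a snapshot stream) is less able to starve small, latency-sensitive streams such as liveness heartbeats.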
Notice in the Raft Log graph above how the "raft log behind" metric initially shrinks rapidly when the dead node comes back, then has a period of flatness before another spike. I think this is yet more indication of RPC problems.
I tried setting the connection window size to 8 MB. It didn't help.
I still haven't tracked down where the RPC problem is coming from. I'm experimenting with rate limiting the bandwidth we use for sending snapshots. Initial testing with a limit of 2 MB/sec shows this significantly lessens the performance impact of the rebalance traffic when a node is declared dead.
Note to self: my experiment uses a single bandwidth limit for both preemptive and Raft snapshots, but we probably want to allow Raft snapshots to be given more bandwidth.
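For illustration, a minimal sketch of the throttling approach being experimented with, using golang.org/x/time/rate with a separate limiter per snapshot type; the chunking and helper names here are hypothetical, not the actual CockroachDB snapshot code.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

const (
	preemptiveSnapshotRate = 2 << 20 // 2 MB/sec for preemptive snapshots
	raftSnapshotRate       = 4 << 20 // 4 MB/sec for Raft snapshots
)

// sendSnapshotChunk waits until the limiter allows len(chunk) bytes and then
// hands the chunk to send. With burst set to one second of traffic (below),
// chunks must be no larger than the per-second rate.
func sendSnapshotChunk(ctx context.Context, lim *rate.Limiter, chunk []byte, send func([]byte) error) error {
	if err := lim.WaitN(ctx, len(chunk)); err != nil {
		return err
	}
	return send(chunk)
}

func main() {
	// Separate limiters so Raft snapshots (catching up a follower that has
	// fallen behind) can be given more bandwidth than preemptive snapshots
	// (sent ahead of a rebalance).
	limiters := map[string]*rate.Limiter{
		"preemptive": rate.NewLimiter(preemptiveSnapshotRate, preemptiveSnapshotRate),
		"raft":       rate.NewLimiter(raftSnapshotRate, raftSnapshotRate),
	}

	chunk := make([]byte, 256<<10) // 256 KB chunk
	for name, lim := range limiters {
		err := sendSnapshotChunk(context.Background(), lim, chunk, func(b []byte) error {
			fmt.Printf("%s: sent %d bytes\n", name, len(b))
			return nil
		})
		if err != nil {
			fmt.Println("send failed:", err)
		}
	}
}
```

Setting the burst to one second's worth of bytes keeps each WaitN call valid while still smoothing the send rate chunk by chunk.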
Here are graphs from a 10 min chaos event with experimental code which limits preemptive snapshots to 2 MB/sec and Raft snapshots to 4 MB/sec. The throughput impact of the node outage was significantly reduced and the node liveness failures disappeared, which seems to indicate they were related to either overwhelming the gRPC connection or some bug in gRPC. I'm going to dig into this more tomorrow to try and better understand where the node liveness failures were coming from.
Limit the bandwidth used for snapshots. Preemptive snapshots are throttled to 2 MB/sec (COCKROACH_PREEMPTIVE_SNAPSHOT_RATE) and Raft snapshots are throttled to 4 MB/sec (COCKROACH_RAFT_SNAPSHOT_RATE). The effect of limiting the bandwidth is that a preemptive snapshot for a 64 MB range will take ~32s to send and a Raft snapshot will take ~16s. The benefit is a much smaller impact on foreground traffic. Fixes cockroachdb#10972
Limit the bandwidth used for snapshots. Preemptive snapshots are throttled to 2 MB/sec (COCKROACH_PREEMPTIVE_SNAPSHOT_RATE) and Raft snapshots are throttled to 8 MB/sec (COCKROACH_RAFT_SNAPSHOT_RATE). The effect of limiting the bandwidth is that a preemptive snapshot for a 64 MB range will take ~32s to send and a Raft snapshot will take ~8s. The benefit is a much smaller impact on foreground traffic. Fixes cockroachdb#10972
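As an aside, a minimal sketch of how environment-variable knobs like these could be read with defaults, together with the transfer-time arithmetic from the commit message; envRate is a hypothetical helper rather than CockroachDB's actual env-var utility, and it assumes the values are plain byte counts.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envRate reads a bytes-per-second limit from an environment variable,
// falling back to a default when the variable is unset or malformed.
func envRate(name string, def int64) int64 {
	if v := os.Getenv(name); v != "" {
		if n, err := strconv.ParseInt(v, 10, 64); err == nil {
			return n
		}
	}
	return def
}

func main() {
	preemptiveRate := envRate("COCKROACH_PREEMPTIVE_SNAPSHOT_RATE", 2<<20) // default 2 MB/sec
	raftRate := envRate("COCKROACH_RAFT_SNAPSHOT_RATE", 8<<20)             // default 8 MB/sec

	const rangeSize = 64 << 20 // a full 64 MB range
	// 64 MB / 2 MB/sec ≈ 32s for a preemptive snapshot;
	// 64 MB / 8 MB/sec ≈ 8s for a Raft snapshot.
	fmt.Printf("preemptive: ~%ds, raft: ~%ds\n",
		rangeSize/preemptiveRate, rangeSize/raftRate)
}
```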
A 6 min chaos node outage on blue caused a dip in query performance after the node restarted. The node restarted at around 22:47, which corresponds with the initial dip in performance. Performance then recovered and dipped again. The queue metrics show:
The initial bump in replicate queue activity after the node had been down for 5 min is expected. But what is that second bump that starts at 22:50? The down node has been up for 2+ min at that point. Per-node metrics show the replica GC queue activity was all on the down node, which is to be expected.
The down node shows the following replicate queue activity:
So it was queueing up replicas for processing but never actually spending any time processing them. Perhaps it wasn't able to acquire the lease.