stability: drastically reducing range_max_bytes on a running cluster causes stall #9545

Closed
tamird opened this issue Sep 26, 2016 · 16 comments

@tamird
Contributor

tamird commented Sep 26, 2016

  1. roachdemo -n 3

  2. block_writer

  3. get a coffee

  4. echo "range_max_bytes: 1048577" | cockroach zone set datablocks --file=-

  5. watch block_writer, it will stall

  6. tail cockroach-data/1/logs/cockroach.INFO (the size check behind these messages is sketched right after this list):

    I160926 13:54:56.948066 69 storage/replica_raftstorage.go:290  [n1,s1,r401:/Table/51/1/{177428…-230647…}]: not generating snapshot because replica is too large: 21944985 > 2 * 1048577
    I160926 13:54:56.948152 71 storage/replica_raftstorage.go:290  [n1,s1,r401:/Table/51/1/{177428…-230647…}]: not generating snapshot because replica is too large: 21944985 > 2 * 1048577
    I160926 13:54:57.546802 74 storage/replica_raftstorage.go:290  [n1,s1,r410:/Table/51/1/{756457…-807945…}]: not generating snapshot because replica is too large: 21395821 > 2 * 1048577
    I160926 13:54:57.546940 73 storage/replica_raftstorage.go:290  [n1,s1,r409:/{Table/51/1/8…-Max}]: not generating snapshot because replica is too large: 21410643 > 2 * 1048577
    I160926 13:54:57.547118 74 storage/replica_raftstorage.go:290  [n1,s1,r412:/Table/51/1/{653153…-114755…}]: not generating snapshot because replica is too large: 20730984 > 2 * 1048577
    I160926 13:54:57.793041 75 storage/replica_raftstorage.go:290  [n1,s1,r401:/Table/51/1/{177428…-230647…}]: not generating snapshot because replica is too large: 21944985 > 2 * 1048577
    I160926 13:54:57.793060 75 storage/replica_raftstorage.go:290  [n1,s1,r401:/Table/51/1/{177428…-230647…}]: not generating snapshot because replica is too large: 21944985 > 2 * 1048577
    I160926 13:54:57.949481 73 storage/replica_raftstorage.go:290  [n1,s1,r401:/Table/51/1/{177428…-230647…}]: not generating snapshot because replica is too large: 21944985 > 2 * 1048577
    I160926 13:54:57.949497 68 storage/replica_raftstorage.go:290  [n1,s1,r413:/Table/51/1/{297339…-347337…}]: not generating snapshot because replica is too large: 20984435 > 2 * 1048577
    I160926 13:54:57.949515 73 storage/replica_raftstorage.go:290  [n1,s1,r401:/Table/51/1/{177428…-230647…}]: not generating snapshot because replica is too large: 21944985 > 2 * 1048577
    
  7. Profit
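
For reference, here is a minimal, runnable sketch of the size check behind the "replica is too large" messages in step 6. The real check is Replica.exceedsDoubleSplitSizeLocked in storage/replica_raftstorage.go; the helper below and the assumed 64 MiB default are illustrative only, with the other numbers taken from the log output above:

package main

import "fmt"

// exceedsDoubleSplitSize mirrors the check in replica_raftstorage.go: refuse
// to generate a snapshot while the range is more than twice its configured
// maximum size, on the theory that it should split first.
func exceedsDoubleSplitSize(totalBytes, maxBytes int64) bool {
	return totalBytes > 2*maxBytes
}

func main() {
	const (
		replicaSize = 21944985 // size of r401 reported in the log above
		oldMaxBytes = 64 << 20 // assumed default range_max_bytes (64 MiB)
		newMaxBytes = 1048577  // value set in step 4
	)

	// Before the zone change the range is comfortably under the limit...
	fmt.Println(exceedsDoubleSplitSize(replicaSize, oldMaxBytes)) // false
	// ...but afterwards every existing ~21 MiB range trips the check, so
	// snapshot generation is refused until the range splits back down.
	fmt.Println(exceedsDoubleSplitSize(replicaSize, newMaxBytes)) // true
}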

Seems like we're declining raft-requested snapshots, which is very bad indeed. The following diff seems to fix the problem (but needs tests):

diff --git a/storage/replica_raftstorage.go b/storage/replica_raftstorage.go
index c4e9a20..71b9044 100644
--- a/storage/replica_raftstorage.go
+++ b/storage/replica_raftstorage.go
@@ -269,7 +269,7 @@ func (r *Replica) GetFirstIndex() (uint64, error) {
 // Snapshot implements the raft.Storage interface.
 // Snapshot requires that the replica lock is held.
 func (r *Replica) Snapshot() (raftpb.Snapshot, error) {
-   snap, err := r.SnapshotWithContext(context.Background())
+   snap, err := r.SnapshotWithContext(context.Background(), false)
    if err != nil {
        return raftpb.Snapshot{}, err
    }
@@ -279,10 +279,10 @@ func (r *Replica) Snapshot() (raftpb.Snapshot, error) {
 // SnapshotWithContext is main implementation for Snapshot() but it takes a
 // context to allow tracing. If this method returns without error, callers
 // must eventually call CloseOutSnap to ready this replica for more snapshots.
-func (r *Replica) SnapshotWithContext(ctx context.Context) (*OutgoingSnapshot, error) {
+func (r *Replica) SnapshotWithContext(ctx context.Context, checkSize bool) (*OutgoingSnapshot, error) {
    rangeID := r.RangeID

-   if r.exceedsDoubleSplitSizeLocked() {
+   if checkSize && r.exceedsDoubleSplitSizeLocked() {
        maxBytes := r.mu.maxBytes
        size := r.mu.state.Stats.Total()
        log.Infof(ctx,
@@ -341,7 +341,7 @@ func (r *Replica) GetSnapshot(ctx context.Context) (*OutgoingSnapshot, error) {
        <-doneChan

        r.mu.Lock()
-       snap, err := r.SnapshotWithContext(ctx)
+       snap, err := r.SnapshotWithContext(ctx, true)
        if err == nil {
            r.mu.outSnap.claimed = true
        }
@tamird
Contributor Author

tamird commented Sep 26, 2016

Actually, I'm not sure that diff fixes things. After applying that change, I still see a flurry of messages:

W160926 14:01:07.779983 5231 storage/replica.go:2381  [n1,s1,r652:/Table/51/1/67{26769…-39039…}] range=652: failed to send snapshot: range=652: remote couldn't accept snapshot with error: [n3,s3],r652: cannot apply snapshot: snapshot intersects existing range [n3,s3,r538:/Table/51/1/6{560904…-936519…}]

@jordanlewis
Member

I believe this behavior has existed since 3bdf92c in July. The idea seems to be that the snapshot should be postponed until the range is split. Is that not happening here?

@tamird
Contributor Author

tamird commented Sep 26, 2016

It's easy to imagine a deadlock where splitting a range requires a transaction to commit on the meta ranges, but those ranges themselves need a snapshot (and a split) and so can't make progress. Perhaps it's OK to remove this check now that we live in a streaming-snapshot world?

cc @dt

Also, @petermattis rightly points out that the presence of any snapshots at all is quite alarming. Perhaps this aggressive splitting causes nodes to fall behind log truncation?

@dt
Member

dt commented Sep 26, 2016

I picked off 3bdf92c with only a little of the motivating context (explored more in #7581), but essentially, yes, the goal was that waiting would let it split first.

@tamird
Contributor Author

tamird commented Sep 26, 2016

OK, so it seems that the snapshots are expected in the case of slow-to-apply splits on followers.

@petermattis
Collaborator

I'm able to reproduce badness here on a local 3-node cluster. I'm going to spend some time tracking down what is going on. Probably related to the perf hiccups I'm seeing on #9465.

@petermattis
Collaborator

In one case I just investigated (and then accidentally deleted the logs for), a non-leader node was not receiving messages from the leader and called an election. Eventually a message from the leader got through, which caused the leader to step down. But the non-leader's election failed because it was behind the quorum commit point, and because the leader had stepped down we had to wait an election cycle for the leader to campaign again.

Just added some more logging and will train my fingers not to delete the logs.

@petermattis
Collaborator

We're dropping messages because the per-replica receive queue is full. Trying to track down why this is happening.

@petermattis
Collaborator

Some weird transport stuff is going on. Still working on tracking this down as it happens very sporadically. Here is one instance:

I160927 16:24:32.958843 303 storage/replica.go:1588  [n1,s1,r34:/Table/51/1/6{608332…-859387…}] prepared command 1a3dadf6c68fa0c9: maxLeaseIndex=139 leaseAppliedIndex=138
I160927 16:24:32.958875 303 storage/replica.go:1719  [n1,s1,r34:/Table/51/1/6{608332…-859387…}] proposing command 1a3dadf6c68fa0c9
I160927 16:24:32.960473 102 storage/replica.go:2485  [n1,s1,r34:/Table/51/1/6{608332…-859387…}] processing command 1a3dadf6c68fa0c9: maxLeaseIndex=139
I160927 16:24:32.960679 131 storage/replica.go:2485  [n3,s3,r34:/Table/51/1/6{608332…-859387…}] processing command 1a3dadf6c68fa0c9: maxLeaseIndex=139
I160927 16:24:32.965811 167 storage/replica.go:2485  [n2,s2,r34:/Table/51/1/6{608332…-859387…}] processing command 1a3dadf6c68fa0c9: maxLeaseIndex=139

Above we can see that command 1a3dadf6c68fa0c9 is proposed on n1 and then fairly quickly executed on n3 and n2. The next command is 57aa312e322f59de:

I160927 16:24:32.987504 327 storage/replica.go:1588  [n1,s1,r34:/Table/51/1/6{608332…-859387…}] prepared command 57aa312e322f59de: maxLeaseIndex=140 leaseAppliedIndex=139
I160927 16:24:32.987625 327 storage/replica.go:1719  [n1,s1,r34:/Table/51/1/6{608332…-859387…}] proposing command 57aa312e322f59de
I160927 16:24:32.992452 105 storage/replica.go:2485  [n1,s1,r34:/Table/51/1/6{608332…-859387…}] processing command 57aa312e322f59de: maxLeaseIndex=140
I160927 16:24:32.992674 142 storage/replica.go:2485  [n3,s3,r34:/Table/51/1/6{608332…-859387…}] processing command 57aa312e322f59de: maxLeaseIndex=140
I160927 16:24:44.558447 122 storage/replica.go:2485  [n2,s2,r34:/Table/51/1/6{608332…-859387…}] processing command 57aa312e322f59de: maxLeaseIndex=140

Once again, it is proposed on n1 and fairly quickly executed on n3, but notice the 12-second gap before execution on n2. During this time period, traffic to n2 (as reported by the periodic raft_transport.go stats logging) shows:

I160927 16:24:31.933984 50 storage/raft_transport.go:232  stats:
         qlen   qmax   qdropped client-sent client-recv server-sent server-recv
    1:      0     31          0       200.9         0.0         0.0       288.9
...
I160927 16:24:32.933711 50 storage/raft_transport.go:232  stats:
         qlen   qmax   qdropped client-sent client-recv server-sent server-recv
    1:      0     31          0       303.1         0.0         0.0       377.1
...
I160927 16:24:33.933948 50 storage/raft_transport.go:232  stats:
         qlen   qmax   qdropped client-sent client-recv server-sent server-recv
    1:      0     31          0        52.0         0.0         0.0        98.0
    3:      0      1          0         1.0         0.0         0.0         0.0
...
I160927 16:24:34.938721 50 storage/raft_transport.go:232  stats:
         qlen   qmax   qdropped client-sent client-recv server-sent server-recv
    1:      0     31          0        20.9         0.0         0.0        20.9
    3:      0      1          0         0.0         0.0         0.0         0.0
...
I160927 16:24:35.934188 50 storage/raft_transport.go:232  stats:
         qlen   qmax   qdropped client-sent client-recv server-sent server-recv
    1:      0     31          0         3.0         0.0         0.0         3.0
    3:      0      1          0         0.0         0.0         0.0         0.0
...
I160927 16:24:36.934220 50 storage/raft_transport.go:232  stats:
         qlen   qmax   qdropped client-sent client-recv server-sent server-recv
    1:      0     31          0        11.0         0.0         0.0         5.0
    3:      0      1          0         7.0         0.0         0.0         0.0
...
I160927 16:24:37.934170 50 storage/raft_transport.go:232  stats:
         qlen   qmax   qdropped client-sent client-recv server-sent server-recv
    1:      0     31          0        38.0         0.0         0.0        35.0
    3:      0      2          0         7.0         0.0         0.0         0.0

So n2 is only receiving a trickle of data from n1. Is something wrong with n1? Well, it seems to be sending data to n3 just fine:

I160927 16:24:31.930262 12 storage/raft_transport.go:232  stats:
         qlen   qmax   qdropped client-sent client-recv server-sent server-recv
    2:      0    473          0       166.1         0.0         0.0       282.1
    3:      0     70          0      3076.5         0.0         0.0      5001.4
...
I160927 16:24:32.930323 12 storage/raft_transport.go:232  stats:
         qlen   qmax   qdropped client-sent client-recv server-sent server-recv
    2:      0    473          0       258.0         0.0         0.0       365.0
    3:      0     70          0      2938.8         0.0         0.0      5009.7
...
I160927 16:24:33.932040 12 storage/raft_transport.go:232  stats:
         qlen   qmax   qdropped client-sent client-recv server-sent server-recv
    2:      0    473          0        39.9         0.0         0.0        94.8
    3:      0     70          0      2654.5         0.0         0.0      4663.0
...
I160927 16:24:34.930467 12 storage/raft_transport.go:232  stats:
         qlen   qmax   qdropped client-sent client-recv server-sent server-recv
    2:      0    473          0        21.0         0.0         0.0        21.0
    3:      0     70          0      2435.3         0.0         0.0      4283.5
...
I160927 16:24:35.930926 12 storage/raft_transport.go:232  stats:
         qlen   qmax   qdropped client-sent client-recv server-sent server-recv
    2:      0    473          0         2.0         0.0         0.0         3.0
    3:      0     70          0      2884.2         0.0         0.0      4838.9
...
I160927 16:24:36.930381 12 storage/raft_transport.go:232  stats:
         qlen   qmax   qdropped client-sent client-recv server-sent server-recv
    2:      0    473          0         3.0         0.0         0.0        12.0
    3:      2     70          0      3012.6         0.0         0.0      5121.7

Note that this is a 3-node cluster, so there is no other node that could be taking the traffic: all the traffic being sent to n3 should also be sent to n2. I only had --vmodule=raft=4 enabled, so I can't see whether raft was deciding not to send messages to n2 for some reason. It doesn't seem like the problem is with gRPC, though, because we're not seeing massive queue lengths.

@petermattis
Collaborator

Despite not having dropped any RPCs, I am seeing the following in the logs for n1:

W160927 16:24:30.737227 454 storage/replica.go:2381  [n1,s1,r32:/Table/51/1/3{109360…-398596…}] range=32: failed to send snapshot: range=32: remote couldn't accept snapshot with error: [n2,s2],r32: cannot apply snapshot: snapshot intersects existing range [n2,s2,r18:/Table/51/1/{283946…-339859…}]
I160927 16:24:30.737630 112 raft/raft.go:856  [n1,s1,r7:/Table/51{-/1/49388…}] 1 failed to send message to 2 because it is unreachable [next = 2087, match = 2086, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I160927 16:24:30.737714 104 raft/raft.go:856  [n1,s1,r11:/Table/51/1/8{051548…-616082…}] 1 failed to send message to 2 because it is unreachable [next = 667, match = 666, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I160927 16:24:30.737750 104 raft/raft.go:856  [n1,s1,r11:/Table/51/1/8{051548…-616082…}] 1 failed to send message to 2 because it is unreachable [next = 667, match = 666, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I160927 16:24:30.737824 105 raft/raft.go:856  [n1,s1,r23:/Table/51/1/4{337971…-665431…}] 1 failed to send message to 2 because it is unreachable [next = 77, match = 76, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]

The raft errors are due to calls to RawNode.ReportUnreachable, which we call when a message is dropped. So something is dropping the messages without incrementing the dropped count. It seems like that can only happen if the circuit breaker is open. Does the failed snapshot cause the circuit breaker to open? My train ride is almost over; the answer will have to wait until tomorrow.
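
To make the suspected interaction concrete, here is a toy, self-contained sketch. The breaker and transport types below are illustrative stand-ins rather than CockroachDB code; only SendAsync, RawNode.ReportUnreachable, the per-node circuit breaker, and the qdropped counter are the real pieces referenced above:

package main

import (
	"errors"
	"fmt"
)

var errBreakerOpen = errors.New("breaker open")

// breaker is a hypothetical stand-in for the per-node circuit breaker.
type breaker struct{ open bool }

func (b *breaker) Call(fn func() error) error {
	if b.open {
		return errBreakerOpen // short-circuit: fn never runs
	}
	return fn()
}

type transport struct {
	breaker *breaker
	queue   []string // stand-in for the per-replica receive queue
	dropped int      // stand-in for the qdropped column; never bumped on this path
}

// sendAsync sketches why an open breaker drops messages without bumping the
// dropped counter: the enqueue is skipped entirely, and the caller's only
// recourse is to report the peer unreachable to raft
// (RawNode.ReportUnreachable in the real code).
func (t *transport) sendAsync(msg string, reportUnreachable func()) bool {
	if err := t.breaker.Call(func() error {
		t.queue = append(t.queue, msg)
		return nil
	}); err != nil {
		reportUnreachable()
		return false
	}
	return true
}

func main() {
	t := &transport{breaker: &breaker{}}
	unreachable := func() { fmt.Println("reporting peer unreachable") }

	fmt.Println("sent:", t.sendAsync("MsgApp index=666", unreachable)) // sent: true

	// A failed snapshot RPC trips the shared per-node breaker...
	t.breaker.open = true
	// ...and regular raft traffic to that node is then silently dropped.
	fmt.Println("sent:", t.sendAsync("MsgApp index=667", unreachable)) // sent: false
	fmt.Println("queued:", len(t.queue), "dropped counter:", t.dropped)
}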

@petermattis
Collaborator

Yep, it looks like RaftTransport.SendSnapshot is opening the circuit breaker when the client.RaftSnapshot RPC returns an error. That doesn't seem right.

Cc @jordanlewis, @spencerkimball, @tamird who have all touched this code recently and/or know about our circuit breakers.

@jordanlewis
Member

Yes, this does seem wrong. SendSnapshot needs to propagate only connection-related errors to the circuit breaker; at the moment, even application-level errors are propagated to it.

It looks like the other user of the circuit breaker in the transport also propagates every error it sees, but what I didn't realize when I was following that pattern is that the raft transport doesn't actually produce any application-level errors there, because it's a long-lived connection.

I'll fix this up tomorrow, assuming my understanding here is correct.

@tamird
Contributor Author

tamird commented Sep 28, 2016

A small patch I think might fix this:

diff --git a/storage/raft_transport.go b/storage/raft_transport.go
index 9873dc1..2196a11 100644
--- a/storage/raft_transport.go
+++ b/storage/raft_transport.go
@@ -606,6 +606,26 @@ func (t *RaftTransport) SendAsync(req *RaftMessageRequest) bool {
    }
 }

+type snapshotClientWithBreaker struct {
+   MultiRaft_RaftSnapshotClient
+   breaker *circuit.Breaker
+}
+
+func (c snapshotClientWithBreaker) Send(m *SnapshotRequest) error {
+   return c.breaker.Call(func() error {
+       return c.MultiRaft_RaftSnapshotClient.Send(m)
+   }, 0)
+}
+
+func (c snapshotClientWithBreaker) Recv() (*SnapshotResponse, error) {
+   var m *SnapshotResponse
+   return m, c.breaker.Call(func() error {
+       var err error
+       m, err = c.MultiRaft_RaftSnapshotClient.Recv()
+       return err
+   }, 0)
+}
+
 // SendSnapshot streams the given outgoing snapshot. The caller is responsible for closing the
 // OutgoingSnapshot with snap.Close.
 func (t *RaftTransport) SendSnapshot(
@@ -614,9 +634,10 @@ func (t *RaftTransport) SendSnapshot(
    snap *OutgoingSnapshot,
    newBatch func() engine.Batch,
 ) error {
+   var stream MultiRaft_RaftSnapshotClient
    nodeID := header.RaftMessageRequest.ToReplica.NodeID
    breaker := t.GetCircuitBreaker(nodeID)
-   return breaker.Call(func() error {
+   if err := breaker.Call(func() error {
        addr, err := t.resolver(nodeID)
        if err != nil {
            return err
@@ -626,15 +647,19 @@ func (t *RaftTransport) SendSnapshot(
            return err
        }
        client := NewMultiRaftClient(conn)
-       stream, err := client.RaftSnapshot(ctx)
-       if err != nil {
-           return err
+       stream, err = client.RaftSnapshot(ctx)
+       return err
+   }, 0); err != nil {
+       return err
+   }
+   defer func() {
+       if err := stream.CloseSend(); err != nil {
+           log.Warningf(ctx, "failed to close snapshot stream: %s", err)
        }
-       defer func() {
-           if err := stream.CloseSend(); err != nil {
-               log.Warningf(ctx, "failed to close snapshot stream: %s", err)
-           }
-       }()
-       return sendSnapshot(stream, header, snap, newBatch)
-   }, 0)
+   }()
+
+   return sendSnapshot(snapshotClientWithBreaker{
+       MultiRaft_RaftSnapshotClient: stream,
+       breaker: breaker,
+   }, header, snap, newBatch)
 }

@petermattis
Collaborator

@tamird That patch is still performing all of the snapshot sends and receives within a breaker. That doesn't seem right. Perhaps I'm misunderstanding the patch, but I think we only want to open the breaker on a connection failure.

@tamird
Contributor Author

tamird commented Sep 28, 2016

The problem here is that we're opening the breaker when the remote says "I can't apply this snapshot". That's an application-level error that the breaker need not care about, but errors from Send and Recv are connection-level errors and those do need to open the breaker.
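
To spell out that distinction, here is a minimal sketch; the streamResult type and classify helper are purely illustrative and are not the shape of the actual fix:

package main

import (
	"errors"
	"fmt"
)

// toyBreaker stands in for the per-node circuit breaker: only failures that
// are explicitly recorded count toward opening it.
type toyBreaker struct{ failures int }

func (b *toyBreaker) Fail()      { b.failures++ }
func (b *toyBreaker) Open() bool { return b.failures > 0 }

// streamResult models the two kinds of error discussed above: a
// connection-level error from the stream's Send/Recv, and an application-level
// rejection carried inside the SnapshotResponse payload.
type streamResult struct {
	connErr   error // e.g. the gRPC stream broke
	rejection error // e.g. "cannot apply snapshot: snapshot intersects existing range"
}

// classify records only connection-level errors against the breaker; a
// rejection is returned to the caller but leaves the breaker closed.
func classify(b *toyBreaker, res streamResult) error {
	if res.connErr != nil {
		b.Fail()
		return res.connErr
	}
	return res.rejection
}

func main() {
	var b toyBreaker

	// Remote refuses the snapshot: the caller sees the error, breaker stays closed.
	err := classify(&b, streamResult{rejection: errors.New("cannot apply snapshot: snapshot intersects existing range")})
	fmt.Println(err, "| breaker open:", b.Open())

	// The stream itself fails: that is a connectivity problem and should trip the breaker.
	err = classify(&b, streamResult{connErr: errors.New("rpc error: transport is closing")})
	fmt.Println(err, "| breaker open:", b.Open())
}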

@petermattis
Collaborator

Ah, got it. This looks good then.
