stability: drastically reducing range_max_bytes on a running cluster causes stall #9545
Actually, I'm not sure that diff fixes things. After applying that change, I still see a flurry of messages:
I believe this behavior has existed since 3bdf92c in July. The idea seems to be that the snapshot should be postponed until the range is split. Is that not happening here?
It's easy to imagine a deadlock where splitting a range requires a transaction to commit on the meta ranges, but those ranges need a snapshot (and a split) and so can't make progress. Perhaps it's OK to remove this now that we live in a streaming snapshot world? cc @dt. Also, @petermattis rightly points out that the presence of any snapshots at all is quite alarming. Perhaps this aggressive splitting causes nodes to fall behind log truncation?
OK, so it seems that the snapshots are expected in the case of slow-to-apply splits on followers.
I'm able to reproduce badness here on a local 3-node cluster. I'm going to spend some time tracking down what is going on. Probably related to the perf hiccups I'm seeing on #9465.
In one case I just investigated (and then accidentally deleted the logs for), a non-leader node was not receiving messages from the leader and called an election. Eventually a message from the leader got through, which caused the leader to step down. But the non-leader's election failed because it was behind the quorum commit point. Because the leader stepped down, we had to wait an election cycle for the leader to campaign again. Just added some more logging and will train my fingers not to delete the logs.
We're dropping messages because the per-replica receive queue is full. Trying to track down why this is happening.
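For context on how a full queue turns into dropped messages: the receive side keeps a bounded per-replica queue with a non-blocking enqueue, so when a replica falls behind, the transport drops rather than blocks. A minimal sketch of that pattern follows (the type names and queue size are illustrative, not the actual cockroach code):

package main

import "fmt"

// raftMessage stands in for an incoming raft message; the real type carries much more.
type raftMessage struct{ term uint64 }

// replicaQueue models a bounded per-replica receive queue.
type replicaQueue struct {
	ch chan raftMessage
}

func newReplicaQueue(size int) *replicaQueue {
	return &replicaQueue{ch: make(chan raftMessage, size)}
}

// enqueue is non-blocking: rather than stalling the transport goroutine when a
// replica is not keeping up, it drops the message and reports false.
func (q *replicaQueue) enqueue(m raftMessage) bool {
	select {
	case q.ch <- m:
		return true
	default:
		return false
	}
}

func main() {
	q := newReplicaQueue(2)
	for i := 0; i < 3; i++ {
		if !q.enqueue(raftMessage{term: uint64(i)}) {
			fmt.Printf("dropped message %d: per-replica queue full\n", i)
		}
	}
}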
Some weird transport stuff is going on. Still working on tracking this down as it happens very sporadically. Here is one instance:
Above we can see command
Once again, proposed on
So
Note that this is a 3-node cluster, so it's not like there is another node that could be taking the traffic. All the traffic being sent to
Despite not having dropped any RPCs, I am seeing the following in the logs for
The raft errors are due to calls to
Yep, it looks like Cc @jordanlewis, @spencerkimball, @tamird who have all touched this code recently and/or know about our circuit breakers.
Yes, this seems wrong indeed. It seems like It looks like the other user of the circuit breaker in the transport also propagates all connection errors, but what I didn't realize when I was following that pattern was that the raft transport doesn't actually produce any application errors because it's a long-lived connection. I'll fix this up tomorrow, assuming my understanding here is correct.
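To spell out why routing every stream error through the per-node breaker is a problem, here is a toy sketch with a deliberately simplified breaker (not the circuit package actually in use): once application-level responses are surfaced as errors inside the breaker, a few declined snapshots trip it, and then all traffic to that node fails fast even though the connection is healthy.

package main

import (
	"errors"
	"fmt"
)

// toyBreaker is a simplified stand-in for a per-node circuit breaker:
// after `threshold` consecutive failures it opens and fails fast.
type toyBreaker struct {
	threshold int
	failures  int
}

var errBreakerOpen = errors.New("breaker open: failing fast")

func (b *toyBreaker) Call(fn func() error) error {
	if b.failures >= b.threshold {
		return errBreakerOpen // calls are rejected without ever reaching the node
	}
	if err := fn(); err != nil {
		b.failures++
		return err
	}
	b.failures = 0
	return nil
}

func main() {
	b := &toyBreaker{threshold: 2}

	// A remote replica declining a snapshot is an application-level response,
	// but surfaced as an error inside the breaker it counts as a failure.
	decline := func() error { return errors.New("remote: cannot accept snapshot right now") }
	_ = b.Call(decline)
	_ = b.Call(decline)

	// The connection itself is fine, yet unrelated traffic to the node is now blocked.
	fmt.Println(b.Call(func() error { return nil }))
}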
A small patch I think might fix this:

diff --git a/storage/raft_transport.go b/storage/raft_transport.go
index 9873dc1..2196a11 100644
--- a/storage/raft_transport.go
+++ b/storage/raft_transport.go
@@ -606,6 +606,26 @@ func (t *RaftTransport) SendAsync(req *RaftMessageRequest) bool {
 	}
 }
 
+type snapshotClientWithBreaker struct {
+	MultiRaft_RaftSnapshotClient
+	breaker *circuit.Breaker
+}
+
+func (c snapshotClientWithBreaker) Send(m *SnapshotRequest) error {
+	return c.breaker.Call(func() error {
+		return c.MultiRaft_RaftSnapshotClient.Send(m)
+	}, 0)
+}
+
+func (c snapshotClientWithBreaker) Recv() (*SnapshotResponse, error) {
+	var m *SnapshotResponse
+	return m, c.breaker.Call(func() error {
+		var err error
+		m, err = c.MultiRaft_RaftSnapshotClient.Recv()
+		return err
+	}, 0)
+}
+
 // SendSnapshot streams the given outgoing snapshot. The caller is responsible for closing the
 // OutgoingSnapshot with snap.Close.
 func (t *RaftTransport) SendSnapshot(
@@ -614,9 +634,10 @@ func (t *RaftTransport) SendSnapshot(
 	snap *OutgoingSnapshot,
 	newBatch func() engine.Batch,
 ) error {
+	var stream MultiRaft_RaftSnapshotClient
 	nodeID := header.RaftMessageRequest.ToReplica.NodeID
 	breaker := t.GetCircuitBreaker(nodeID)
-	return breaker.Call(func() error {
+	if err := breaker.Call(func() error {
 		addr, err := t.resolver(nodeID)
 		if err != nil {
 			return err
@@ -626,15 +647,19 @@ func (t *RaftTransport) SendSnapshot(
 			return err
 		}
 		client := NewMultiRaftClient(conn)
-		stream, err := client.RaftSnapshot(ctx)
-		if err != nil {
-			return err
+		stream, err = client.RaftSnapshot(ctx)
+		return err
+	}, 0); err != nil {
+		return err
+	}
+	defer func() {
+		if err := stream.CloseSend(); err != nil {
+			log.Warningf(ctx, "failed to close snapshot stream: %s", err)
 		}
-		defer func() {
-			if err := stream.CloseSend(); err != nil {
-				log.Warningf(ctx, "failed to close snapshot stream: %s", err)
-			}
-		}()
-		return sendSnapshot(stream, header, snap, newBatch)
-	}, 0)
+	}()
+
+	return sendSnapshot(snapshotClientWithBreaker{
+		MultiRaft_RaftSnapshotClient: stream,
+		breaker:                      breaker,
+	}, header, snap, newBatch)
 }
@tamird That patch is still performing all of the snapshot sends and receives within a breaker. That doesn't seem right. Perhaps I'm misunderstanding the patch, but I think we only want to open the breaker on a connection failure.
The problem here is that we're opening the breaker when the remote says “I can't apply this snapshot”. That's an application-level error that the breaker need not care about, but errors from
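Put differently: on a gRPC stream, the remote's refusal travels inside the response message, while a non-nil error from Send or Recv means the stream (and likely the connection) broke, and only the latter should count against the breaker. A rough sketch of that split, using hypothetical field names rather than the real SnapshotResponse proto:

package snapshots

import "fmt"

// snapshotResponse is a stand-in for the real response proto; Declined and
// Message are hypothetical field names used only for illustration.
type snapshotResponse struct {
	Declined bool
	Message  string
}

// snapshotStream abstracts the client side of a snapshot RPC stream.
type snapshotStream interface {
	Recv() (*snapshotResponse, error)
}

// handleResponse separates the two failure modes discussed above:
//   - an error from Recv is a transport problem and should feed the breaker;
//   - a decline inside the response is an application-level answer from a
//     healthy peer and should leave the breaker closed.
func handleResponse(stream snapshotStream, recordFailure func(error)) error {
	resp, err := stream.Recv()
	if err != nil {
		recordFailure(err) // connection-level: let the circuit breaker see it
		return err
	}
	if resp.Declined {
		return fmt.Errorf("snapshot declined by remote: %s", resp.Message)
	}
	return nil
}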
Ah, got it. This looks good then.
1. roachdemo -n 3
2. block_writer
3. get a coffee
4. echo "range_max_bytes: 1048577" | cockroach zone set datablocks --file=-
5. watch block_writer, it will stall
6. tail cockroach-data/1/logs/cockroach.INFO
7. Profit
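For a sense of scale (an aside; the 64 MiB figure is an assumption about the default range_max_bytes at the time, not something stated in the repro): dropping the limit to just over 1 MiB means every full-sized range suddenly needs several rounds of splitting, which is where the burst of splits and follower snapshots comes from.

package main

import (
	"fmt"
	"math"
)

func main() {
	const assumedOldMax = 64 << 20 // assumed prior range_max_bytes (64 MiB)
	const newMax = 1048577         // value set in the repro (1 MiB + 1 byte)

	// Each split roughly halves a range, so a full range needs about
	// log2(oldMax/newMax) rounds of splitting to get under the new limit.
	rounds := math.Ceil(math.Log2(float64(assumedOldMax) / float64(newMax)))
	fmt.Printf("roughly %.0f rounds of splits per full range\n", rounds) // prints 6
}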
Seems like we're declining raft-requested snapshots, which is very bad indeed. The following diff seems to fix the problem (but needs tests):