
storage: add replicas to replica GC queue when remote reports an error #5789

Closed
bdarnell opened this issue Mar 31, 2016 · 10 comments

@bdarnell (Contributor)

From #5467:

Raft never gives a negative answer; either the command commits or you time out (and in general we use periodic retries of raft commands to ensure that it eventually commits). However, if you're on a replica which has been removed from the range, then none of the other replicas will talk to you.

So when a stale range descriptor cache directs a request to a removed replica, we don't handle it well at all. I think we're currently relying on SendNextTimeout to handle this at the rpc layer, but it leaves an orphaned goroutine behind (prior to #5551). Eventually the replicaGCQueue will see that the replica has been removed and cancel all of its outstanding operations (or at least it should), but that takes a long time.

I think we need to add some sort of error messaging to the raft transport, so when a message is rejected for being from an out of date replica, we can send a message to that replica that will trigger a replica GC.
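
To make the intended flow concrete, here is a minimal, self-contained Go sketch of the sender-side reaction. Everything in it (replicaTooOldError, gcQueue, handleRaftError) is made up for illustration and is not the actual cockroach code; it only models "the remote says my replica is gone, so queue it for GC right away":

```go
package main

import "fmt"

// Toy model of the proposed mechanism: the remote rejects a raft message
// because the sender's replica has been removed from the range, and the
// sender reacts by queueing that replica for garbage collection instead
// of waiting for the periodic replica GC scan.

type replica struct {
	rangeID   int64
	replicaID int64
}

// replicaTooOldError stands in for the error the remote would report back
// over the raft transport.
type replicaTooOldError struct{ rangeID int64 }

func (e *replicaTooOldError) Error() string {
	return fmt.Sprintf("replica for range %d has been removed", e.rangeID)
}

// gcQueue stands in for the replica GC queue.
type gcQueue struct{ pending []*replica }

func (q *gcQueue) add(r *replica) { q.pending = append(q.pending, r) }

// handleRaftError is what the sending store would run when the transport
// reports an error for one of its outgoing raft messages.
func handleRaftError(q *gcQueue, r *replica, err error) {
	if _, ok := err.(*replicaTooOldError); ok {
		q.add(r) // GC promptly rather than after the next periodic scan.
	}
}

func main() {
	q := &gcQueue{}
	r := &replica{rangeID: 1, replicaID: 2}
	handleRaftError(q, r, &replicaTooOldError{rangeID: 1})
	fmt.Println("replicas queued for GC:", len(q.pending))
}
```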

@bdarnell (Contributor, Author)

When a command is proposed on a replica which has been removed from its group (due to a stale cached range descriptor), that command currently goes into raft-retry limbo until the replica is garbage collected. This appears to be happening in the CI logs posted to #6118, and is why the system is unable to progress past the increment() call even during the intervals when one of the valid nodes can become leader. This makes it even more important to signal errors back to the originator of a rejected raft message so removed ranges can be garbage-collected promptly.

@vivekmenezes (Contributor)

This will be a very valuable addition. When we added consistency checking and had to retry the consistency check to get better debug output, we had to rely on the replica requesting another consistency check. It would be better for the leader to receive an error and retry the consistency check itself.


@bdarnell (Contributor, Author)

What I'm proposing operates at the level of raft messages and not application-level commands, so it won't really help for the consistency checker (unless you have an alternate design that could encompass both).

@vivekmenezes (Contributor)

Gotcha. I don't have another proposal on the table.


@bdarnell (Contributor, Author)

From #6011: A second use case for this mechanism would be to explicitly acknowledge snapshots, so we can call ReportSnapshot once the snapshot has been applied, instead of optimistically reporting it as done as soon as the MsgSnap has been sent.
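
For reference, a rough sketch of what an explicit acknowledgement could feed into, using etcd/raft's real ReportSnapshot API; the ackSnapshot helper, the package name, and the from argument are assumptions for illustration, not existing cockroach code:

```go
package storage

import "github.com/coreos/etcd/raft"

// ackSnapshot is a hypothetical helper that would run once the remote has
// told us whether it actually applied the snapshot, rather than as soon as
// the MsgSnap was handed to the transport.
func ackSnapshot(n raft.Node, from uint64, applyErr error) {
	if applyErr != nil {
		n.ReportSnapshot(from, raft.SnapshotFailure)
		return
	}
	n.ReportSnapshot(from, raft.SnapshotFinish)
}
```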

@bdarnell (Contributor, Author) commented May 3, 2016

I don't think we can use this to report snapshot status more accurately (raft may internally decide to drop a snapshot on the floor, so detecting whether a snapshot has been applied or rejected is tricky), so let's focus on returning some sort of feedback for messages that are rejected because their replica ID is too old, and ensure that those ranges can be GC'd promptly.

I'm not sure whether it's better to model this as a gRPC return stream (adding plumbing between RaftTransport and Store.handleRaftMessage) or as a separate RaftMessageRequest that doesn't map to an underlying etcd/raft message.
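
As a strawman for the second option, the request could carry a field that never corresponds to an etcd/raft message; the struct and field names below are purely illustrative, not the real storage.RaftMessageRequest:

```go
package storage

// raftMessageRequest is an illustrative stand-in for
// storage.RaftMessageRequest; only the fields relevant to this sketch are
// shown, and replicaTooOld is the hypothetical addition.
type raftMessageRequest struct {
	rangeID       int64
	fromReplicaID int64
	toReplicaID   int64

	// replicaTooOld does not correspond to any etcd/raft message. When set,
	// it tells the recipient that its replica has been removed from the
	// range and should be queued for GC.
	replicaTooOld bool
}
```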

@petermattis (Collaborator)

Per #7507, this is higher priority than initially suspected. @tamird, is there a plan of action here? Or do we need to brainstorm on the approach to take?

@tamird (Contributor) commented Jun 29, 2016

I was just thinking about this; the most straightforward solution is to add a field to storage.RaftMessageRequest that signals "I'm ignoring you, GC yourself", but we could also possibly add a field to storage.RaftMessageResponse. The latter might be better because we can close the stream when we send that message back.


@tbg (Member) commented Jun 29, 2016

Why do you want to close the stream? The stream is per-node, but the message is per-replica.

using RaftMessageResponse sounds more idiomatic, doesn't it? We wouldn't ever send them unless there's information.

@tamird (Contributor) commented Jun 29, 2016

using RaftMessageResponse sounds more idiomatic, doesn't it? We wouldn't ever send them unless there's information.

The only trouble is that the stream is unidirectional: sending one of these back necessarily terminates the stream.
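
To illustrate the constraint: with a client-streaming gRPC RPC there is exactly one response per stream, and the server side that grpc-go generates exposes a Recv/SendAndClose pair, so sending the response is what closes the stream. The sketch below models that shape with made-up types rather than the real transport code:

```go
package transport

import "io"

// All names here are illustrative; the real transport uses gRPC-generated
// types, where the server side of a client-streaming RPC exposes exactly
// this Recv/SendAndClose shape.
type raftMessageRequest struct{ rangeID int64 }

type raftMessageResponse struct{ err error }

type raftMessageServerStream interface {
	Recv() (*raftMessageRequest, error)
	SendAndClose(*raftMessageResponse) error
}

// handleStream demonstrates the constraint: the single response doubles as
// the stream terminator, so reporting one replica's error tears down the
// whole per-node stream.
func handleStream(stream raftMessageServerStream, handle func(*raftMessageRequest) error) error {
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			// Client finished sending; acknowledge and close.
			return stream.SendAndClose(&raftMessageResponse{})
		}
		if err != nil {
			return err
		}
		if err := handle(req); err != nil {
			// Sending a response necessarily terminates the stream.
			return stream.SendAndClose(&raftMessageResponse{err: err})
		}
	}
}
```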

tbg added a commit to tbg/cockroach that referenced this issue Jun 29, 2016
Running the 1-to-3 rebalance test showed that we are still sometimes
stalling for ~30 seconds; this could be caused by the GC queue taking
a bit to actually GC replicas.

In anticipation of cockroachdb#5789, we should soon be able to make the GC queue
mostly event-driven. In the meantime, aggressive queue timings serve
to rule out this factor when investigating upreplication stalls.
tamird changed the title from "storage: Return errors via raft transport" to "storage: add replicas to replica GC queue when remote reports an error" on Jul 5, 2016
petermattis modified the milestones: Q2, Q3 on Jul 11, 2016
tbg unassigned tamird on Aug 1, 2016
petermattis added a commit to petermattis/cockroach that referenced this issue Aug 1, 2016
When the Raft transport stream returns an error we can use that error as
a signal that the replica may need to be GC'd.

Suggested in cockroachdb#8130.
Fixes cockroachdb#5789.
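
A minimal sketch of that idea, with all names made up for illustration: the transport error is only a hint that enqueues the replica, and the GC queue still verifies (against the range descriptor, in the real system) whether the replica has actually been removed before destroying anything:

```go
package main

import (
	"errors"
	"fmt"
)

// Toy model of the merged approach: a failed send on the Raft transport is
// treated as a hint that the local replica might have been removed. The
// hint only enqueues the replica; the GC queue itself decides whether the
// replica is really gone (here modeled by a stillMember callback).
type replica struct{ rangeID int64 }

type replicaGCQueue struct{ pending []*replica }

func (q *replicaGCQueue) maybeAdd(r *replica, sendErr error) {
	if sendErr != nil {
		q.pending = append(q.pending, r)
	}
}

func (q *replicaGCQueue) process(stillMember func(*replica) bool) (gced int) {
	for _, r := range q.pending {
		if !stillMember(r) {
			gced++ // a real implementation would destroy the replica's data
		}
	}
	q.pending = nil
	return gced
}

func main() {
	q := &replicaGCQueue{}
	q.maybeAdd(&replica{rangeID: 7}, errors.New("raft transport stream failed"))
	// Pretend the range descriptor shows we are no longer a member.
	fmt.Println("replicas GC'd:", q.process(func(*replica) bool { return false }))
}
```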