
storage: add replicas to replica GC queue when remote reports an error #5789

Closed
bdarnell opened this issue Mar 31, 2016 · 10 comments

@bdarnell (Contributor)

From #5467:

Raft never gives a negative answer; either the command commits or you time out (and in general we use periodic retries of raft commands to ensure that it eventually commits). However, if you're on a replica which has been removed from the range, then none of the other replicas will talk to you.

So when a stale range descriptor cache directs a request to a removed replica, we don't handle it well at all. I think we're currently relying on SendNextTimeout to handle this at the rpc layer, but it leaves an orphaned goroutine behind (prior to #5551). Eventually the replicaGCQueue will see that the replica has been removed and cancel all of its outstanding operations (or at least it should), but that takes a long time.

I think we need to add some sort of error messaging to the raft transport, so when a message is rejected for being from an out of date replica, we can send a message to that replica that will trigger a replica GC.
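
To make the intended flow concrete, here is a minimal, self-contained Go sketch of the sender-side reaction. Everything in it (replicaTooOldError, gcQueue, handleRaftError) is made up for illustration and is not the actual cockroach code; it only models "the remote says my replica is gone, so queue it for GC right away":

```go
package main

import "fmt"

// Toy model of the proposed mechanism: the remote rejects a raft message
// because the sender's replica has been removed from the range, and the
// sender reacts by queueing that replica for garbage collection instead
// of waiting for the periodic replica GC scan.

type replica struct {
	rangeID   int64
	replicaID int64
}

// replicaTooOldError stands in for the error the remote would report back
// over the raft transport.
type replicaTooOldError struct{ rangeID int64 }

func (e *replicaTooOldError) Error() string {
	return fmt.Sprintf("replica for range %d has been removed", e.rangeID)
}

// gcQueue stands in for the replica GC queue.
type gcQueue struct{ pending []*replica }

func (q *gcQueue) add(r *replica) { q.pending = append(q.pending, r) }

// handleRaftError is what the sending store would run when the transport
// reports an error for one of its outgoing raft messages.
func handleRaftError(q *gcQueue, r *replica, err error) {
	if _, ok := err.(*replicaTooOldError); ok {
		q.add(r) // GC promptly rather than after the next periodic scan.
	}
}

func main() {
	q := &gcQueue{}
	r := &replica{rangeID: 1, replicaID: 2}
	handleRaftError(q, r, &replicaTooOldError{rangeID: 1})
	fmt.Println("replicas queued for GC:", len(q.pending))
}
```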

@bdarnell (Contributor, Author)

When a command is proposed on a replica which has been removed from its group (due to a stale cached range descriptor), that command currently goes into raft-retry limbo until the replica is garbage collected. This appears to be happening in the CI logs posted to #6118, and is why the system is unable to progress past the increment() call even during the intervals when one of the valid nodes can become leader. This makes it even more important to signal errors back to the originator of a rejected raft message so removed ranges can be garbage-collected promptly.

@vivekmenezes (Contributor)

This will be a very valuable addition. When we added consistency checking and had to retry the consistency check to get better debug output, we had to rely on the replica requesting another consistency check. It would be better for the leader to receive an error and retry the consistency check itself.


@bdarnell (Contributor, Author)

What I'm proposing operates at the level of raft messages and not application-level commands, so it won't really help for the consistency checker (unless you have an alternate design that could encompass both).

@vivekmenezes (Contributor)

Gotcha. I don't have another proposal on the table.


@bdarnell (Contributor, Author)

From #6011: A second use case for this mechanism would be to explicitly acknowledge snapshots, so we can call ReportSnapshot once the snapshot has been applied, instead of optimistically reporting it as done as soon as the MsgSnap has been sent.
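
For reference, a rough sketch of what an explicit acknowledgement could feed into, using etcd/raft's real ReportSnapshot API; the ackSnapshot helper, the package name, and the from argument are assumptions for illustration, not existing cockroach code:

```go
package storage

import "github.com/coreos/etcd/raft"

// ackSnapshot is a hypothetical helper that would run once the remote has
// told us whether it actually applied the snapshot, rather than as soon as
// the MsgSnap was handed to the transport.
func ackSnapshot(n raft.Node, from uint64, applyErr error) {
	if applyErr != nil {
		n.ReportSnapshot(from, raft.SnapshotFailure)
		return
	}
	n.ReportSnapshot(from, raft.SnapshotFinish)
}
```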

@bdarnell (Contributor, Author) commented May 3, 2016

I don't think we can use this to report snapshot status more accurately (raft may internally decide to drop a snapshot on the floor, so detecting whether a snapshot has been applied or rejected is tricky), so let's focus on returning some sort of feedback for messages that are rejected because their replica ID is too old, and ensure that those ranges can be GC'd promptly.

I'm not sure whether it's better to model this as a gRPC return stream (adding plumbing between RaftTransport and Store.handleRaftMessage) or as a separate RaftMessageRequest that doesn't map to an underlying etcd/raft message.
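
As a strawman for the second option, the request could carry a field that never corresponds to an etcd/raft message; the struct and field names below are purely illustrative, not the real storage.RaftMessageRequest:

```go
package storage

// raftMessageRequest is an illustrative stand-in for
// storage.RaftMessageRequest; only the fields relevant to this sketch are
// shown, and replicaTooOld is the hypothetical addition.
type raftMessageRequest struct {
	rangeID       int64
	fromReplicaID int64
	toReplicaID   int64

	// replicaTooOld does not correspond to any etcd/raft message. When set,
	// it tells the recipient that its replica has been removed from the
	// range and should be queued for GC.
	replicaTooOld bool
}
```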

@petermattis (Collaborator)

Per #7507, this is higher priority than initially suspected. @tamird, is there a plan of action here? Or do we need to brainstorm on the approach to take?

@tamird (Contributor) commented Jun 29, 2016

I was just thinking about this; the most straightforward solution is to add a field to storage.RaftMessageRequest that signals "I'm ignoring you, GC yourself", but we could also possibly add a field to storage.RaftMessageResponse. The latter might be better because we can close the stream when we send that message back.


@tbg (Member) commented Jun 29, 2016

Why do you want to close the stream? The stream is per-node, but the message is per-replica.

using RaftMessageResponse sounds more idiomatic, doesn't it? We wouldn't ever send them unless there's information.

@tamird (Contributor) commented Jun 29, 2016

using RaftMessageResponse sounds more idiomatic, doesn't it? We wouldn't ever send them unless there's information.

The only trouble is that the stream is unidirectional: sending one of these back necessarily terminates the stream.
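
To illustrate the constraint: with a client-streaming gRPC RPC there is exactly one response per stream, and the server side that grpc-go generates exposes a Recv/SendAndClose pair, so sending the response is what closes the stream. The sketch below models that shape with made-up types rather than the real transport code:

```go
package transport

import "io"

// All names here are illustrative; the real transport uses gRPC-generated
// types, where the server side of a client-streaming RPC exposes exactly
// this Recv/SendAndClose shape.
type raftMessageRequest struct{ rangeID int64 }

type raftMessageResponse struct{ err error }

type raftMessageServerStream interface {
	Recv() (*raftMessageRequest, error)
	SendAndClose(*raftMessageResponse) error
}

// handleStream demonstrates the constraint: the single response doubles as
// the stream terminator, so reporting one replica's error tears down the
// whole per-node stream.
func handleStream(stream raftMessageServerStream, handle func(*raftMessageRequest) error) error {
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			// Client finished sending; acknowledge and close.
			return stream.SendAndClose(&raftMessageResponse{})
		}
		if err != nil {
			return err
		}
		if err := handle(req); err != nil {
			// Sending a response necessarily terminates the stream.
			return stream.SendAndClose(&raftMessageResponse{err: err})
		}
	}
}
```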

tbg added a commit to tbg/cockroach that referenced this issue Jun 29, 2016
Running the 1-to-3 rebalance test showed that we are still sometimes
stalling for ~30 seconds; this could be caused by the GC queue taking
a bit to actually GC replicas.

In anticipation of cockroachdb#5789, we should soon be able to make the GC queue
mostly event-driven. In the meantime, aggressive queue timings serve
to rule out this factor when investigating upreplication stalls.
tamird changed the title from "storage: Return errors via raft transport" to "storage: add replicas to replica GC queue when remote reports an error" on Jul 5, 2016
petermattis modified the milestones: Q2, Q3 on Jul 11, 2016
tbg unassigned tamird on Aug 1, 2016
petermattis added a commit to petermattis/cockroach that referenced this issue Aug 1, 2016
When the Raft transport stream returns an error we can use that error as
a signal that the replica may need to be GC'd.

Suggested in cockroachdb#8130.
Fixes cockroachdb#5789.
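
A minimal sketch of that idea, with all names made up for illustration: the transport error is only a hint that enqueues the replica, and the GC queue still verifies (against the range descriptor, in the real system) whether the replica has actually been removed before destroying anything:

```go
package main

import (
	"errors"
	"fmt"
)

// Toy model of the merged approach: a failed send on the Raft transport is
// treated as a hint that the local replica might have been removed. The
// hint only enqueues the replica; the GC queue itself decides whether the
// replica is really gone (here modeled by a stillMember callback).
type replica struct{ rangeID int64 }

type replicaGCQueue struct{ pending []*replica }

func (q *replicaGCQueue) maybeAdd(r *replica, sendErr error) {
	if sendErr != nil {
		q.pending = append(q.pending, r)
	}
}

func (q *replicaGCQueue) process(stillMember func(*replica) bool) (gced int) {
	for _, r := range q.pending {
		if !stillMember(r) {
			gced++ // a real implementation would destroy the replica's data
		}
	}
	q.pending = nil
	return gced
}

func main() {
	q := &replicaGCQueue{}
	q.maybeAdd(&replica{rangeID: 7}, errors.New("raft transport stream failed"))
	// Pretend the range descriptor shows we are no longer a member.
	fmt.Println("replicas GC'd:", q.process(func(*replica) bool { return false }))
}
```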