
storage: proposal quota and bad replicas #18342

Closed · petermattis opened this issue Sep 7, 2017 · 9 comments

@petermattis (Collaborator)
#17524 revealed a bit of fragility in the use of proposal quota: a replica that is unable to make progress for some reason will freeze up the entire range even though 2 good replicas remain. The specific bug causing a replica to be unable to make progress is now fixed, but we should consider enhancing the proposal quota logic to lessen the impact of such bugs in the future.

The current proposal quota logic "finds the minimum index that active followers have acknowledged" and uses that value to determine when quota should be returned to the pool. An "active follower" is one that is on a healthy node, where health is determined by the per-connection RPC heartbeats. This is an insufficient definition of health if a particular replica on an otherwise good node is having problems. When that occurs, proposal quota is exhausted and the range stops accepting writes.
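To make the mechanism above concrete, here is a minimal, self-contained sketch of computing the minimum acknowledged index over active followers; the names and signatures are illustrative, not CockroachDB's actual code:

```go
package main

import "fmt"

// minAcknowledgedIndex returns the smallest Match index among the followers
// considered active; quota covering entries up to that index can be returned
// to the pool. progress maps replica ID to its acknowledged (Match) index;
// active reports whether a replica currently counts as healthy.
func minAcknowledgedIndex(progress map[uint64]uint64, active func(uint64) bool) (uint64, bool) {
	var min uint64
	found := false
	for id, match := range progress {
		if !active(id) {
			continue // inactive followers don't hold back quota release
		}
		if !found || match < min {
			min = match
			found = true
		}
	}
	return min, found
}

func main() {
	progress := map[uint64]uint64{1: 100, 2: 95, 3: 10}
	// If replica 3 is deemed inactive (e.g. wedged), it no longer drags the
	// minimum down, and quota can be released up through index 95.
	active := func(id uint64) bool { return id != 3 }
	idx, ok := minAcknowledgedIndex(progress, active)
	fmt.Println(idx, ok) // 95 true
}
```

This shows why the definition of "active" matters: a wedged replica that still counts as active pins the minimum at its stale Match index, so quota is never returned.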

Enhancing the proposal quota logic should be relatively straightforward. Rather than simply considering connection health, we could also keep some measure of whether a replica has made any progress at all within a certain time period. For example, we could consider a follower inactive if its progress.Match has not changed in the last 10 seconds.

Cc @nvanbenschoten, @tschottdorf

@petermattis petermattis added this to the 1.2 milestone Sep 7, 2017
@bdarnell (Contributor) commented Sep 7, 2017

Ignoring replicas in this state for proposal quota purposes would effectively leave the range under-replicated, which is dangerous. We need the rebalancer to consider these ranges bad and rebalance away from them. (We probably need to do both, though: the rebalance operations can't proceed if the quota pool is blocking all writes.)

@petermattis (Collaborator, Author)

@bdarnell That's a good point and one that @andreimatei or @tschottdorf mentioned as well in person. Seems feasible to feed the "wedged replica" signal into the rebalancer too.

@a6802739 (Contributor)

@petermattis, could I have a try at this?

@petermattis (Collaborator, Author)

@a6802739 Sure. It is probably worthwhile to sketch out what you're going to do in this issue first.

@a6802739 (Contributor)

Great! Thanks!

@a6802739 a6802739 self-assigned this Oct 11, 2017
@a6802739 (Contributor) commented Oct 26, 2017

@petermattis, the method we currently use to check whether a follower is active looks like this:

```go
if r.store.cfg.Transport.resolver != nil {
	addr, err := r.store.cfg.Transport.resolver(rep.NodeID)
	if err != nil {
		continue
	}
	if err := r.store.cfg.Transport.rpcContext.ConnHealth(addr.String()); err != nil {
		continue
	}
}
```

If we want to change the logic here, how about using the RecentActive field of raft.Progress to validate whether the follower is active?

```go
// Progress represents a follower’s progress in the view of the leader. Leader maintains
// progresses of all followers, and sends entries to the follower based on its progress.
type Progress struct {
	Match, Next uint64
	// State defines how the leader should interact with the follower.
	//
	// When in ProgressStateProbe, leader sends at most one replication message
	// per heartbeat interval. It also probes actual progress of the follower.
	//
	// When in ProgressStateReplicate, leader optimistically increases next
	// to the latest entry sent after sending replication message. This is
	// an optimized state for fast replicating log entries to the follower.
	//
	// When in ProgressStateSnapshot, leader should have sent out snapshot
	// before and stops sending any replication message.
	State ProgressStateType
	// Paused is used in ProgressStateProbe.
	// When Paused is true, raft should pause sending replication message to this peer.
	Paused bool
	// PendingSnapshot is used in ProgressStateSnapshot.
	// If there is a pending snapshot, the pendingSnapshot will be set to the
	// index of the snapshot. If pendingSnapshot is set, the replication process of
	// this Progress will be paused. raft will not resend snapshot until the pending one
	// is reported to be failed.
	PendingSnapshot uint64

	// RecentActive is true if the progress is recently active. Receiving any messages
	// from the corresponding follower indicates the progress is active.
	// RecentActive can be reset to false after an election timeout.
	RecentActive bool

	// inflights is a sliding window for the inflight messages.
	// Each inflight message contains one or more log entries.
	// The max number of entries per message is defined in raft config as MaxSizePerMsg.
	// Thus inflight effectively limits both the number of inflight messages
	// and the bandwidth each Progress can use.
	// When inflights is full, no more message should be sent.
	// When a leader sends out a message, the index of the last
	// entry should be added to inflights. The index MUST be added
	// into inflights in order.
	// When a leader receives a reply, the previous inflights should
	// be freed by calling inflights.freeTo with the index of the last
	// received entry.
	ins *inflights
}
```

> Rather than simply consider connection health, we could also have some measure of whether a replica has made any progress at all within a certain time period. For example, we could consider a follower inactive if its progress.Match has not changed in the last 10 seconds.

Following the method you described above:

  1. We could add a LastUpdateTimestamp field to raft.Progress and check duration(now - LastUpdateTimestamp) < 10s. Much easier, but it would require changing upstream etcd.

  2. We could start a separate goroutine that polls RaftStatus() every 10s. More complicated.

Honestly, I still don't quite understand what the proposal quota is used for. Could you please explain it?

Thank you very much.

@petermattis (Collaborator, Author)

Raft is currently time agnostic; I doubt the maintainers would add a LastUpdateTimestamp field. Instead, we'd have to track each Replica's last update time in a parallel structure.

Proposal quota is used to limit the outstanding Raft proposals to a Range, to prevent a slow Replica from continually falling so far behind that it requires a snapshot to catch up.
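To illustrate the idea, here is a toy model of such a quota pool. It is a sketch only: the names are invented, and for simplicity it fails fast when quota is exhausted rather than blocking waiters as a real pool would.

```go
package main

import "fmt"

// quotaPool models the proposal quota mechanism: each proposal acquires
// quota proportional to its size, and quota is returned once the (active)
// followers have acknowledged the corresponding entries.
type quotaPool struct {
	capacity  int64
	available int64
}

func newQuotaPool(capacity int64) *quotaPool {
	return &quotaPool{capacity: capacity, available: capacity}
}

// acquire reserves quota for a proposal. It returns false when the pool is
// exhausted, which is how a stalled follower that never acknowledges
// entries ends up blocking all writes to the range.
func (q *quotaPool) acquire(n int64) bool {
	if n > q.available {
		return false
	}
	q.available -= n
	return true
}

// release returns quota once followers have acknowledged the entries.
func (q *quotaPool) release(n int64) {
	q.available += n
	if q.available > q.capacity {
		q.available = q.capacity
	}
}

func main() {
	q := newQuotaPool(1000)
	fmt.Println(q.acquire(600)) // true
	fmt.Println(q.acquire(600)) // false: pool exhausted, writes stall
	q.release(600)              // followers caught up; quota returned
	fmt.Println(q.acquire(600)) // true
}
```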

Cc @bdarnell

@a6802739 (Contributor) commented Oct 26, 2017

@petermattis Thank you very much.

Do you mean we should make a map[peerID]LastUpdateTimestamp in the Replica structure?

But how could we update LastUpdateTimestamp for each follower?

@petermattis (Collaborator, Author)

@a6802739 Apologies for the delay in responding. This dropped off my radar.

> Do you mean we should make a map[peerID] LastUpdateTimestamp in Replica structure?

Yes, something like that.

> But how could we update LastUpdateTimestamp for each follower?

I think you have to snoop on the Raft message traffic. See Store.processRaftRequestWithReplica for how we already do this to intercept quiesce messages.
