
storage: proposal quota and bad replicas #18342

Closed · petermattis opened this issue Sep 7, 2017 · 9 comments

@petermattis (Collaborator)
#17524 revealed a bit of fragility in the use of proposal quota: a replica that is unable to make progress for some reason will freeze up the entire range even though 2 good replicas remain. The specific bug causing a replica to be unable to make progress is now fixed, but we should consider enhancing the proposal quota logic to lessen the impact of such bugs in the future.

The current proposal quota logic "finds the minimum index that active followers have acknowledged" and uses that value to determine when quota should be returned to the pool. An "active follower" is one that is on a healthy node, where health is determined by the per-connection RPC heartbeats. This is an insufficient definition of health if a particular replica on an otherwise good node is having problems. When that occurs, proposal quota is exhausted and the range stops accepting writes.
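To make the mechanism above concrete, here is a minimal, self-contained sketch of computing the minimum acknowledged index over active followers; the names and signatures are illustrative, not CockroachDB's actual code:

```go
package main

import "fmt"

// minAcknowledgedIndex returns the smallest Match index among the followers
// considered active; quota covering entries up to that index can be returned
// to the pool. progress maps replica ID to its acknowledged (Match) index;
// active reports whether a replica currently counts as healthy.
func minAcknowledgedIndex(progress map[uint64]uint64, active func(uint64) bool) (uint64, bool) {
	var min uint64
	found := false
	for id, match := range progress {
		if !active(id) {
			continue // inactive followers don't hold back quota release
		}
		if !found || match < min {
			min = match
			found = true
		}
	}
	return min, found
}

func main() {
	progress := map[uint64]uint64{1: 100, 2: 95, 3: 10}
	// If replica 3 is deemed inactive (e.g. wedged), it no longer drags the
	// minimum down, and quota can be released up through index 95.
	active := func(id uint64) bool { return id != 3 }
	idx, ok := minAcknowledgedIndex(progress, active)
	fmt.Println(idx, ok) // 95 true
}
```

This shows why the definition of "active" matters: a wedged replica that still counts as active pins the minimum at its stale Match index, so quota is never returned.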

Enhancing the proposal quota logic should be relatively straightforward. Rather than simply considering connection health, we could also keep some measure of whether a replica has made any progress at all within a certain time period. For example, we could consider a follower inactive if its progress.Match has not changed in the last 10 seconds.

Cc @nvanbenschoten, @tschottdorf

@petermattis petermattis added this to the 1.2 milestone Sep 7, 2017
@bdarnell (Contributor) commented Sep 7, 2017

Ignoring replicas in this state for proposal quota purposes would effectively leave the range under-replicated, which is dangerous. We need the rebalancer to consider these ranges bad and rebalance away from them. (We probably need to do both, though: the rebalance operations can't proceed if the quota pool is blocking all writes.)

@petermattis (Collaborator, Author)

@bdarnell That's a good point and one that @andreimatei or @tschottdorf mentioned as well in person. Seems feasible to feed the "wedged replica" signal into the rebalancer too.

@a6802739 (Contributor)

@petermattis, could I have a try at this?

@petermattis (Collaborator, Author)

@a6802739 Sure. It is probably worthwhile to sketch out what you're going to do in this issue first.

@a6802739 (Contributor)

Great! Thanks!

@a6802739 a6802739 self-assigned this Oct 11, 2017
@a6802739 (Contributor) commented Oct 26, 2017

@petermattis, the method we currently use to check whether a follower is active looks like this:

```go
if r.store.cfg.Transport.resolver != nil {
	addr, err := r.store.cfg.Transport.resolver(rep.NodeID)
	if err != nil {
		continue
	}
	if err := r.store.cfg.Transport.rpcContext.ConnHealth(addr.String()); err != nil {
		continue
	}
}
```

If we want to change the logic here, how about using the RecentActive field of raft.Progress to validate whether the follower is active?

```go
// Progress represents a follower’s progress in the view of the leader. Leader maintains
// progresses of all followers, and sends entries to the follower based on its progress.
type Progress struct {
	Match, Next uint64
	// State defines how the leader should interact with the follower.
	//
	// When in ProgressStateProbe, leader sends at most one replication message
	// per heartbeat interval. It also probes actual progress of the follower.
	//
	// When in ProgressStateReplicate, leader optimistically increases next
	// to the latest entry sent after sending replication message. This is
	// an optimized state for fast replicating log entries to the follower.
	//
	// When in ProgressStateSnapshot, leader should have sent out snapshot
	// before and stops sending any replication message.
	State ProgressStateType
	// Paused is used in ProgressStateProbe.
	// When Paused is true, raft should pause sending replication message to this peer.
	Paused bool
	// PendingSnapshot is used in ProgressStateSnapshot.
	// If there is a pending snapshot, the pendingSnapshot will be set to the
	// index of the snapshot. If pendingSnapshot is set, the replication process of
	// this Progress will be paused. raft will not resend snapshot until the pending one
	// is reported to be failed.
	PendingSnapshot uint64

	// RecentActive is true if the progress is recently active. Receiving any messages
	// from the corresponding follower indicates the progress is active.
	// RecentActive can be reset to false after an election timeout.
	RecentActive bool

	// inflights is a sliding window for the inflight messages.
	// Each inflight message contains one or more log entries.
	// The max number of entries per message is defined in raft config as MaxSizePerMsg.
	// Thus inflight effectively limits both the number of inflight messages
	// and the bandwidth each Progress can use.
	// When inflights is full, no more message should be sent.
	// When a leader sends out a message, the index of the last
	// entry should be added to inflights. The index MUST be added
	// into inflights in order.
	// When a leader receives a reply, the previous inflights should
	// be freed by calling inflights.freeTo with the index of the last
	// received entry.
	ins *inflights
}
```

> Rather than simply consider connection health, we could also have some measure of whether a replica has made any progress at all within a certain time period. For example, we could consider a follower inactive if its progress.Match has not changed in the last 10 seconds.

Following the method you described above:

  1. We could add a LastUpdateTimestamp field to raft.Progress and check duration(now - LastUpdateTimestamp) < 10s. Much easier, but it would require changing upstream etcd.

  2. We could start a separate goroutine that polls RaftStatus() every 10s. More complicated.

Honestly, I still don't quite understand what the proposal quota is used for. Could you please explain it?

Thank you very much.

@petermattis (Collaborator, Author)

Raft is currently time agnostic; I doubt the maintainers would add a LastUpdateTimestamp field. Instead, we'd have to track each Replica's last update time in a parallel structure.

Proposal quota is used to limit the outstanding Raft proposals to a Range, to prevent a slow Replica from continually falling so far behind that it requires a snapshot to catch up.
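To illustrate the idea, here is a toy model of such a quota pool. It is a sketch only: the names are invented, and for simplicity it fails fast when quota is exhausted rather than blocking waiters as a real pool would.

```go
package main

import "fmt"

// quotaPool models the proposal quota mechanism: each proposal acquires
// quota proportional to its size, and quota is returned once the (active)
// followers have acknowledged the corresponding entries.
type quotaPool struct {
	capacity  int64
	available int64
}

func newQuotaPool(capacity int64) *quotaPool {
	return &quotaPool{capacity: capacity, available: capacity}
}

// acquire reserves quota for a proposal. It returns false when the pool is
// exhausted, which is how a stalled follower that never acknowledges
// entries ends up blocking all writes to the range.
func (q *quotaPool) acquire(n int64) bool {
	if n > q.available {
		return false
	}
	q.available -= n
	return true
}

// release returns quota once followers have acknowledged the entries.
func (q *quotaPool) release(n int64) {
	q.available += n
	if q.available > q.capacity {
		q.available = q.capacity
	}
}

func main() {
	q := newQuotaPool(1000)
	fmt.Println(q.acquire(600)) // true
	fmt.Println(q.acquire(600)) // false: pool exhausted, writes stall
	q.release(600)              // followers caught up; quota returned
	fmt.Println(q.acquire(600)) // true
}
```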

Cc @bdarnell

@a6802739 (Contributor) commented Oct 26, 2017

@petermattis Thank you very much.

Do you mean we should make a map[peerID]LastUpdateTimestamp in the Replica structure?

But how could we update LastUpdateTimestamp for each follower?

@petermattis (Collaborator, Author)

@a6802739 Apologies for the delay in responding. This dropped off my radar.

> Do you mean we should make a map[peerID] LastUpdateTimestamp in Replica structure?

Yes, something like that.

> But how could we update LastUpdateTimestamp for each follower?

I think you have to snoop on the Raft message traffic. See Store.processRaftRequestWithReplica for how we already do this to intercept quiesce messages.
