
RFC for segmented storage #1866

Merged 2 commits from segmented-storage into cockroachdb:master on Aug 4, 2015

Conversation

bdarnell
Contributor

No description provided.

@tbg tbg added the PTAL label Jul 29, 2015
have a replica of the same range but doesn't.

Each replica-creation path will need to consider whether the replica
has already been created via the other path (comparaing replica IDs,
Member

s/comparaing/comparing/

@tbg
Member

tbg commented Jul 30, 2015

This RFC solves the correctness issue (at some cost), but it still leaves a performance issue open (kickstarting the new ranges without wasting resources on snapshots). I wonder whether this could be one of those rare occasions in which nailing the performance issue and the correctness issue amounts to the same thing.
The thought is the following:

  • Having an uninitialized range (Raft group) around isn't an issue; it can be GC'ed and doesn't do anything (given proper code; currently that's not quite true: if the range gets created before the splitTrigger runs, there's an error).
  • It gets tricky when ownership of data is shared, and damage is done precisely when a snapshot is applied.
  • A snapshot contains the range descriptor, so the Store can check whether that swath of keyspace currently belongs to any active range. If it does, the snapshot must not be applied (and maybe some other action is taken; TBD).

So all the conflict resolution happens during the snapshotting process.

This is pretty inefficient, but segmented storage is even less efficient. If that handles correctness, though, we've already achieved the same thing. And the key to performance is making sure we don't actually send out snapshots too early, as @bdarnell suggested. If, during splitTrigger, we set the new group's leader to the old one's (a little tricky: we'd have to make sure nobody can vote again in the same term; we could also trigger elections, but that has its own problems) and additionally delay sending any snapshots for that group (maybe on the leader only, but that shouldn't matter; something like transport.DelaySnapshot(raftID, time.Now().Add(time.Minute))?), we should get fairly close. The cherry on top would be queueing up a certain amount of Raft messages while the newly split range hasn't been split-triggered yet, to make the transition truly seamless (this could obviate the need to even delay snapshots if it works reliably).
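For illustration, here is a minimal sketch of the overlap check described above, using hypothetical Store and RangeDescriptor stand-ins rather than the real cockroach types:

```go
package storage

import "bytes"

// RangeDescriptor stands in for the real descriptor; it records the half-open
// span of keyspace a range owns.
type RangeDescriptor struct {
	StartKey, EndKey []byte
}

// overlaps reports whether two half-open spans [StartKey, EndKey) intersect.
func (d RangeDescriptor) overlaps(o RangeDescriptor) bool {
	return bytes.Compare(d.StartKey, o.EndKey) < 0 &&
		bytes.Compare(o.StartKey, d.EndKey) < 0
}

// Store stands in for the real store; it tracks the initialized replicas
// living on this node, keyed by range ID.
type Store struct {
	replicas map[int64]RangeDescriptor
}

// canApplySnapshot checks the descriptor carried in an incoming snapshot
// against the keyspace owned by the ranges already active on this store.
// If another active range claims any of that keyspace, the snapshot must
// not be applied.
func (s *Store) canApplySnapshot(rangeID int64, snapDesc RangeDescriptor) bool {
	for id, desc := range s.replicas {
		if id == rangeID {
			continue // a snapshot for a range we already track is fine
		}
		if desc.overlaps(snapDesc) {
			return false // keyspace already owned by another active range
		}
	}
	return true
}
```

With a check like this, conflict resolution happens entirely at snapshot-application time: the uninitialized right-hand replica never receives data that an active left-hand replica still owns.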

@bdarnell
Contributor Author

How is segmented storage "even less efficient"? Just the cost of putting segment IDs in the keys, or is there something I'm missing?

It's true that discarding snapshots that conflict with an existing range would address the correctness issue. However, if this were our solution we could have availability problems: the replica receiving the snapshot would be a member of the range, but it would be unable to act as one until the conflict had been resolved. In most cases this just means waiting for the left-hand range to catch up and process the split. However, if the left-hand range has been rebalanced away from this node, it won't go away until the replica GC process has deleted it. This could be a long time to go with a non-functioning replica (Does it matter? Can we count on the rebalancer to treat this replica as dead and move the right-hand range away from it too? Or could we accelerate the replica GC process when we detect this conflict?). Segmented storage allows the right-hand replica to become usable immediately even while the left-hand replica lags behind.

has caught up to the point of the split.

# Detailed design

Contributor

It would be valuable to define the notion of a segment and what it is meant to represent. What follows in this discussion is the pure mechanism of how things work; explaining the notion of a segment up front would make this a more valuable read.

Contributor Author

A segment doesn't really represent anything; it's just a way of partitioning the keyspace. There is no conceptual backing here besides the mechanism.
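For concreteness, here is one way "segment IDs in the keys" could look; the prefix encoding below is an assumption for illustration, not the RFC's actual scheme:

```go
package storage

import "encoding/binary"

// segmentKey prefixes a user key with the ID of the segment that owns it, so
// data written under different segments never shares a key prefix.
func segmentKey(segmentID uint64, key []byte) []byte {
	buf := make([]byte, 8, 8+len(key))
	binary.BigEndian.PutUint64(buf, segmentID)
	return append(buf, key...)
}

// rewriteForMerge re-prefixes a key from its old segment to the surviving
// segment; this per-key rewrite is the extra merge cost mentioned later in
// the thread.
func rewriteForMerge(oldSegmentKey []byte, newSegmentID uint64) []byte {
	return segmentKey(newSegmentID, oldSegmentKey[8:])
}
```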

@tbg
Member

tbg commented Jul 31, 2015

How is segmented storage "even less efficient"? Just the cost of putting segment IDs in the keys, or is there something I'm missing?

no, just that and having to rewrite on a merge (complexity aside).

However, if the left-hand range has been rebalanced away from this node, it won't go away until the replica GC process has deleted it. This could be a long time to go with a non-functioning replica (Does it matter? Can we count on the rebalancer to treat this replica as dead and move the right-hand range away from it too? Or could we accelerate the replica GC process when we detect this conflict?). Segmented storage allows the right-hand replica to become usable immediately even while the left-hand replica lags behind.

Hmm, good point. I'll have to think about that.

@bdarnell
Contributor Author

bdarnell commented Aug 3, 2015

So after thinking about this some more, I'm doubtful that segmented storage is worth it. The situation in which it improves availability by a significant amount is pretty rare: when a node has been down long enough for the left-hand range to be moved away, but not long enough for the right-hand range. I think the simpler solution suffices: when an incoming snapshot would cause a conflict, drop it. This will leave the range in an uninitialized state, in which it will vote in raft elections but not participate in committing log entries. We must make sure that this counts as "down" for the repair system, and when a snapshot is dropped we should send the left-hand range to the GC queue (to see if the local replica can be removed, resolving the conflict) and the right-hand range to the repair queue (to see if the local replica can be reassigned somewhere else).

If this proposal is dropped, #1867 also becomes obsolete.
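A minimal sketch of the drop-and-requeue handling described above, assuming a hypothetical replicaQueue interface since the real replica GC and repair queue APIs aren't shown in this thread:

```go
package storage

// replicaQueue is a stand-in for the store's replica GC and repair queues.
type replicaQueue interface {
	MaybeAdd(rangeID int64)
}

// onConflictingSnapshot sketches the reaction to a dropped snapshot: the
// conflicting left-hand replica goes to the GC queue (can it be removed,
// resolving the conflict?) and the uninitialized right-hand replica goes to
// the repair queue (can it be reassigned elsewhere?).
func onConflictingSnapshot(rhsRangeID, lhsRangeID int64, gc, repair replicaQueue) {
	// Dropping the snapshot means simply not applying it; the right-hand
	// replica stays uninitialized (it votes but cannot commit), so it must
	// count as "down" for the repair system.
	gc.MaybeAdd(lhsRangeID)
	repair.MaybeAdd(rhsRangeID)
}
```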

@tbg
Member

tbg commented Aug 4, 2015

Sounds good. I had the same intuition, but I don't trust my estimate of the complexities in resolving the race. If you've reached that conclusion, I'm confident that this is worth going for (worst case, we're back to the above). Wanna discuss and formalize, in a new RFC, the race-detection mechanisms and the measures we can take to ensure a smooth handoff in normal operation (avoiding elections by keeping leadership for the split groups, possibly queuing up Raft messages at the uninitialized range until it receives the split)?

@vivekmenezes
Contributor

Still worth checking this in as a rejected RFC.

bdarnell added a commit that referenced this pull request Aug 4, 2015
@bdarnell bdarnell merged commit e8495c8 into cockroachdb:master Aug 4, 2015
@tbg tbg removed the PTAL label Aug 4, 2015
@bdarnell bdarnell deleted the segmented-storage branch August 4, 2015 20:12