
RFC for segmented storage #1866

Merged 2 commits from segmented-storage into cockroachdb:master on Aug 4, 2015

Conversation

bdarnell
Contributor

No description provided.

@tbg tbg added the PTAL label Jul 29, 2015
have a replica of the same range but doesn't.

Each replica-creation path will need to consider whether the replica
has already been created via the other path (comparaing replica IDs,
Member

s/comparaing/comparing/

@tbg
Member

tbg commented Jul 30, 2015

This RFC solves the correctness issue (at some cost), but it still leaves a performance issue open (kickstarting the new ranges without wasting resources on snapshots). I wonder whether this could be one of those rare occasions in which nailing the performance issue and the correctness issue amounts to the same thing.
The thought is the following:

  • Having an uninitialized range (Raft group) around isn't an issue; it can be GC'ed and doesn't do anything (given proper code; currently that's not quite true: if the range gets created before the splitTrigger runs, there's an error).
  • It gets tricky when ownership of data is shared, and damage is done precisely when a snapshot is applied.
  • A snapshot contains the range descriptor, so the Store can check whether that swath of keyspace currently belongs to any active range. If it does, the snapshot must not be applied (and maybe some other action is taken; TBD).

So all the conflict resolution happens during the snapshotting process.

This is pretty inefficient, but segmented storage is even less efficient. If that handles correctness, though, we've already achieved the same thing. And the key to performance is making sure we don't actually send out snapshots too early, as @bdarnell suggested. If, during splitTrigger, we set the new group's leader to the old one's (a little tricky: we'd have to make sure nobody can vote again in the same term; we could also trigger elections, but that has its own problems) and additionally delay sending any snapshots for that group (maybe on the leader only, but that shouldn't matter; something like transport.DelaySnapshot(raftID, time.Now().Add(time.Minute))?), we should get fairly close. The cherry on top would be queueing up a certain amount of Raft messages while the newly split range hasn't been split-triggered yet, to make the transition truly seamless (this could obviate the need to even delay snapshots if it works reliably).
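For illustration, here is a minimal sketch of the overlap check described above, using hypothetical Store and RangeDescriptor stand-ins rather than the real cockroach types:

```go
package storage

import "bytes"

// RangeDescriptor stands in for the real descriptor; it records the half-open
// span of keyspace a range owns.
type RangeDescriptor struct {
	StartKey, EndKey []byte
}

// overlaps reports whether two half-open spans [StartKey, EndKey) intersect.
func (d RangeDescriptor) overlaps(o RangeDescriptor) bool {
	return bytes.Compare(d.StartKey, o.EndKey) < 0 &&
		bytes.Compare(o.StartKey, d.EndKey) < 0
}

// Store stands in for the real store; it tracks the initialized replicas
// living on this node, keyed by range ID.
type Store struct {
	replicas map[int64]RangeDescriptor
}

// canApplySnapshot checks the descriptor carried in an incoming snapshot
// against the keyspace owned by the ranges already active on this store.
// If another active range claims any of that keyspace, the snapshot must
// not be applied.
func (s *Store) canApplySnapshot(rangeID int64, snapDesc RangeDescriptor) bool {
	for id, desc := range s.replicas {
		if id == rangeID {
			continue // a snapshot for a range we already track is fine
		}
		if desc.overlaps(snapDesc) {
			return false // keyspace already owned by another active range
		}
	}
	return true
}
```

With a check like this, conflict resolution happens entirely at snapshot-application time: the uninitialized right-hand replica never receives data that an active left-hand replica still owns.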

@bdarnell
Contributor Author

How is segmented storage "even less efficient"? Just the cost of putting segment IDs in the keys, or is there something I'm missing?

It's true that discarding snapshots that conflict with an existing range would address the correctness issue. However, if this were our solution we could have availability problems: the replica receiving the snapshot would be a member of the range, but it would be unable to act as one until the conflict had been resolved. In most cases this just means waiting for the left-hand range to catch up and process the split. However, if the left-hand range has been rebalanced away from this node, it won't go away until the replica GC process has deleted it. This could be a long time to go with a non-functioning replica (Does it matter? Can we count on the rebalancer to treat this replica as dead and move the right-hand range away from it too? Or could we accelerate the replica GC process when we detect this conflict?). Segmented storage allows the right-hand replica to become usable immediately even while the left-hand replica lags behind.

has caught up to the point of the split.

# Detailed design

Contributor

It would be valuable to define the notion of a segment and what it is meant to represent. What follows in this discussion is the pure mechanism of how things work; explaining the notion of a segment up front would make this a more valuable read.

Contributor Author

A segment doesn't really represent anything; it's just a way of partitioning the keyspace. There is no conceptual backing here besides the mechanism.
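For concreteness, here is one way "segment IDs in the keys" could look; the prefix encoding below is an assumption for illustration, not the RFC's actual scheme:

```go
package storage

import "encoding/binary"

// segmentKey prefixes a user key with the ID of the segment that owns it, so
// data written under different segments never shares a key prefix.
func segmentKey(segmentID uint64, key []byte) []byte {
	buf := make([]byte, 8, 8+len(key))
	binary.BigEndian.PutUint64(buf, segmentID)
	return append(buf, key...)
}

// rewriteForMerge re-prefixes a key from its old segment to the surviving
// segment; this per-key rewrite is the extra merge cost mentioned later in
// the thread.
func rewriteForMerge(oldSegmentKey []byte, newSegmentID uint64) []byte {
	return segmentKey(newSegmentID, oldSegmentKey[8:])
}
```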

@tbg
Member

tbg commented Jul 31, 2015

How is segmented storage "even less efficient"? Just the cost of putting segment IDs in the keys, or is there something I'm missing?

no, just that and having to rewrite on a merge (complexity aside).

However, if the left-hand range has been rebalanced away from this node, it won't go away until the replica GC process has deleted it. This could be a long time to go with a non-functioning replica (Does it matter? Can we count on the rebalancer to treat this replica as dead and move the right-hand range away from it too? Or could we accelerate the replica GC process when we detect this conflict?). Segmented storage allows the right-hand replica to become usable immediately even while the left-hand replica lags behind.

Hmm, good point. I'll have to think about that.

@bdarnell
Contributor Author

bdarnell commented Aug 3, 2015

So after thinking about this some more, I'm doubtful that segmented storage is worth it. The situation in which it improves availability by a significant amount is pretty rare: when a node has been down long enough for the left-hand range to be moved away, but not long enough for the right-hand range. I think the simpler solution suffices: when an incoming snapshot would cause a conflict, drop it. This will leave the range in an uninitialized state, in which it will vote in raft elections but not participate in committing log entries. We must make sure that this counts as "down" for the repair system, and when a snapshot is dropped we should send the left-hand range to the GC queue (to see if the local replica can be removed, resolving the conflict) and the right-hand range to the repair queue (to see if the local replica can be reassigned somewhere else).

If this proposal is dropped, #1867 also becomes obsolete.
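A minimal sketch of the drop-and-requeue handling described above, assuming a hypothetical replicaQueue interface since the real replica GC and repair queue APIs aren't shown in this thread:

```go
package storage

// replicaQueue is a stand-in for the store's replica GC and repair queues.
type replicaQueue interface {
	MaybeAdd(rangeID int64)
}

// onConflictingSnapshot sketches the reaction to a dropped snapshot: the
// conflicting left-hand replica goes to the GC queue (can it be removed,
// resolving the conflict?) and the uninitialized right-hand replica goes to
// the repair queue (can it be reassigned elsewhere?).
func onConflictingSnapshot(rhsRangeID, lhsRangeID int64, gc, repair replicaQueue) {
	// Dropping the snapshot means simply not applying it; the right-hand
	// replica stays uninitialized (it votes but cannot commit), so it must
	// count as "down" for the repair system.
	gc.MaybeAdd(lhsRangeID)
	repair.MaybeAdd(rhsRangeID)
}
```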

@tbg
Member

tbg commented Aug 4, 2015

Sounds good. I had the same intuition, but I don't trust my estimate of the complexities in resolving the race. If you've reached that conclusion, I'm confident that this is worth going for (worst case, we're back to the above). Wanna discuss and formalize, in a new RFC, the race-detection mechanisms and the measures we can take to ensure a smooth handoff in normal operation (avoiding elections by keeping leadership for the split groups, possibly queuing up Raft messages at the uninitialized range until it receives the split)?

@vivekmenezes
Contributor

Still worth checking this in as a rejected RFC.

bdarnell added a commit that referenced this pull request Aug 4, 2015
@bdarnell bdarnell merged commit e8495c8 into cockroachdb:master Aug 4, 2015
@tbg tbg removed the PTAL label Aug 4, 2015
@bdarnell bdarnell deleted the segmented-storage branch August 4, 2015 20:12