RFC for segmented storage #1866
Conversation
have a replica of the same range but doesn't.

Each replica-creation path will need to consider whether the replica
has already been created via the other path (comparaing replica IDs,
s/comparaing/comparing/
This RFC solves the correctness issue (at some cost), but it still leaves a performance issue open: kickstarting the new ranges without wasting resources on snapshots. I wonder whether this could be one of the rare occasions in which nailing the performance issue and the correctness issue amounts to the same thing.
So all the conflict resolution happens during the snapshotting process. This is pretty inefficient, but segmented storage is even less efficient. If that deals with correctness, though, we've achieved the same thing already. And the key to performance is to make sure we don't actually send out snapshots too early, as @bdarnell suggested. If we set the new group's leader to the old one's (this could be a little tricky; we'd have to make sure nobody can vote again in the same term. We could also trigger votes, but that has its own problems) during
How is segmented storage "even less efficient"? Just the cost of putting segment IDs in the keys, or is there something I'm missing?

Discarding snapshots that conflict with an existing range would address the correctness issue, it is true. However, if this were our solution we could have availability problems: the replica receiving the snapshot would be a member of the range, but it would be unable to act as one until the conflict had been resolved. In most cases this just means waiting for the left-hand range to catch up and process the split. However, if the left-hand range has been rebalanced away from this node, its local replica won't go away until the replica GC process has deleted it. This could be a long time to go with a non-functioning replica. (Does it matter? Can we count on the rebalancer to treat this replica as dead and move the right-hand range away from it too? Or could we accelerate the replica GC process when we detect this conflict?) Segmented storage allows the right-hand replica to become usable immediately, even while the left-hand replica lags behind.
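For concreteness, here's a minimal Go sketch of the conflict check being discussed. All types and names here are hypothetical stand-ins, not the actual cockroach code: an incoming snapshot is rejected when its key span overlaps a replica that already exists on the store.

```go
package main

import (
	"bytes"
	"fmt"
)

// Span is a hypothetical stand-in for a replica's key span.
type Span struct {
	StartKey, EndKey []byte
}

// overlaps reports whether two half-open key spans [StartKey, EndKey) intersect.
func (s Span) overlaps(o Span) bool {
	return bytes.Compare(s.StartKey, o.EndKey) < 0 &&
		bytes.Compare(o.StartKey, s.EndKey) < 0
}

// Store tracks the key spans of replicas already present on this store.
type Store struct {
	replicas []Span
}

// canApplySnapshot returns false when the snapshot's span conflicts with an
// existing replica, e.g. a left-hand range that hasn't yet processed the
// split (or hasn't been GC'd after being rebalanced away).
func (s *Store) canApplySnapshot(snap Span) bool {
	for _, r := range s.replicas {
		if r.overlaps(snap) {
			return false
		}
	}
	return true
}

func main() {
	s := &Store{replicas: []Span{{StartKey: []byte("a"), EndKey: []byte("m")}}}
	// A snapshot for [g, z) overlaps the existing [a, m) replica: reject it.
	fmt.Println(s.canApplySnapshot(Span{StartKey: []byte("g"), EndKey: []byte("z")}))
}
```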
has caught up to the point of the split.

# Detailed design
It would be valuable to define the notion of a segment and what it is meant to represent. What follows in this discussion is the pure mechanism of how things work, and explaining the concept of a segment up front would make for a more valuable read.
A segment doesn't really represent anything; it's just a way of partitioning the keyspace. There is no conceptual backing here besides the mechanism.
No, just that, plus having to rewrite keys on a merge (complexity aside).
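To illustrate the two costs being traded off here, a toy Go sketch. The encoding and names are invented for illustration and are not the RFC's actual scheme: every stored key carries a segment-ID prefix, and merging two segments means rewriting every key of one segment under the other's ID, even though no user data changed.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// segKey prefixes a user key with its segment ID. Every read and write pays
// this encoding cost, plus the extra 8 bytes per key.
func segKey(segID uint64, key []byte) []byte {
	buf := make([]byte, 8, 8+len(key))
	binary.BigEndian.PutUint64(buf, segID)
	return append(buf, key...)
}

// mergeSegments rewrites every key of segment src under segment dst: the
// per-merge rewrite cost mentioned above, O(number of keys in src).
func mergeSegments(kv map[string][]byte, src, dst uint64, keys [][]byte) {
	for _, k := range keys {
		oldKey := string(segKey(src, k))
		kv[string(segKey(dst, k))] = kv[oldKey]
		delete(kv, oldKey)
	}
}

func main() {
	kv := map[string][]byte{}
	kv[string(segKey(7, []byte("apple")))] = []byte("v1")
	mergeSegments(kv, 7, 3, [][]byte{[]byte("apple")})
	for k := range kv {
		fmt.Printf("%x\n", k) // the key now lives under segment 3's prefix
	}
}
```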
Hmm, good point. I'll have to think about that.
So after thinking about this some more, I'm doubtful that segmented storage is worth it. The situation in which it improves availability by a significant amount is pretty rare: when a node has been down long enough for the left-hand range to be moved away, but not long enough for the right-hand range.

I think the simpler solution suffices: when an incoming snapshot would cause a conflict, drop it. This will leave the range in an uninitialized state, in which it will vote in raft elections but not participate in committing log entries. We must make sure that this counts as "down" for the repair system, and when a snapshot is dropped we should send the left-hand range to the GC queue (to see if the local replica can be removed, resolving the conflict) and the right-hand range to the repair queue (to see if the local replica can be reassigned somewhere else).

If this proposal is dropped, #1867 also becomes obsolete.
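A rough Go sketch of the drop-and-queue handling described above. All names are hypothetical; the real store and queue types would differ:

```go
package main

import (
	"errors"
	"fmt"
)

// Span is a hypothetical stand-in for a replica's key span.
type Span struct{ StartKey, EndKey string }

func (s Span) overlaps(o Span) bool {
	return s.StartKey < o.EndKey && o.StartKey < s.EndKey
}

type Snapshot struct {
	RangeID int64
	Span    Span
}

type queue struct{ name string }

func (q *queue) maybeAdd(rangeID int64) { fmt.Printf("%s queue: r%d\n", q.name, rangeID) }

type Store struct {
	replicas             map[int64]Span
	gcQueue, repairQueue queue
}

var errSnapshotConflict = errors.New("snapshot conflicts with existing replica")

// handleSnapshot drops a conflicting snapshot instead of applying it, then
// nudges both sides toward resolution: the overlapping (left-hand) replica
// goes to the GC queue so the conflict can be removed, and the snapshot's
// (right-hand) range goes to the repair queue in case its local replica
// should be reassigned elsewhere.
func (s *Store) handleSnapshot(snap Snapshot) error {
	for id, span := range s.replicas {
		if span.overlaps(snap.Span) {
			s.gcQueue.maybeAdd(id)
			s.repairQueue.maybeAdd(snap.RangeID)
			return errSnapshotConflict
		}
	}
	s.replicas[snap.RangeID] = snap.Span // no conflict: apply the snapshot
	return nil
}

func main() {
	s := &Store{
		replicas:    map[int64]Span{1: {"a", "m"}},
		gcQueue:     queue{"gc"},
		repairQueue: queue{"repair"},
	}
	err := s.handleSnapshot(Snapshot{RangeID: 2, Span: Span{"g", "z"}})
	fmt.Println(err)
}
```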
Sounds good. I had the same intuition, but I didn't trust my estimate of the complexities in resolving the race. If you've reached that conclusion, I'm confident that this is worth going for (worst case, we're back to the above). Wanna discuss and formalize, in a new RFC, the race-detection mechanisms and the measures we can take to ensure a smooth handoff in normal operation (avoiding elections by keeping leadership for the split groups, possibly queuing up Raft messages at the uninitialized Range until it receives the Split)?
Still worth checking this in as a rejected RFC.