Improve how M3DB handles data durability during topology changes #1183
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master    #1183    +/-   ##
========================================
+ Coverage    70.8%    71.1%   +0.2%
========================================
  Files         748      737     -11
  Lines       63055    61944   -1111
========================================
- Hits        44653    44043    -610
+ Misses      15527    15042    -485
+ Partials     2875     2859     -16
========================================
Continue to review full report at Codecov.
// is going on are both handled correctly. In addition, this will ensure that we hold onto both
// sets of data durably after topology changes and that the node can be properly bootstrapped
// from just the filesystem and commitlog in a later portion of the test.
seriesToWriteDuringPeerStreaming := []string{
Should we just add this as a separate test? Seems like we're repurposing this one and changing it quite substantially?
There are two separate tests, they just call into this shared codepath. I think it's fine; if I separated them out completely they'd be almost complete copy-pasta of each other, and the existing test benefits from this additional check as well (even if you don't verify the commitlog behavior, if you're doing a node add you probably want to make sure the joining node keeps track of the data it receives from its peer as well as all the data it's receiving while actually joining).
I see, yeah makes sense.
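(For reference, a rough sketch of the shared-codepath structure being discussed; the test and helper names below are hypothetical, not the actual functions in this PR.)

type topologyChangeTestOptions struct {
    verifyCommitlogCanBootstrap bool
}

// Two thin entry points exercise the same helper so the logic isn't duplicated.
func TestPeersBootstrapNodeAdd(t *testing.T) {
    testPeersBootstrapTopologyChange(t, topologyChangeTestOptions{verifyCommitlogCanBootstrap: false})
}

func TestPeersBootstrapNodeAddCommitlogDurability(t *testing.T) {
    testPeersBootstrapTopologyChange(t, topologyChangeTestOptions{verifyCommitlogCanBootstrap: true})
}

func testPeersBootstrapTopologyChange(t *testing.T, opts topologyChangeTestOptions) {
    // Shared setup: start the cluster, begin the node add, and write
    // seriesToWriteDuringPeerStreaming while shards are being streamed in.
    // When opts.verifyCommitlogCanBootstrap is set, additionally restart the
    // joining node and assert it recovers from the filesystem and commitlog alone.
}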
// We expect consistency errors because we're only running with
// R.F = 2 and one node is leaving and one node is joining for
// each of the shards that is changing hands.
if !client.IsConsistencyResultError(err) {
Hm maybe it's better to make it just RF=3?
I honestly just didn't do it because it would probably take a few hours to rewrite all the sharding logic and fix any little issues that crop up, and it doesn't really make the test any better. I can change it if you feel strongly.
Np, that's fine.
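(For reference, a rough sketch of the tolerance for expected consistency errors discussed in this thread; writeSeries is a hypothetical helper standing in for the test's actual write call.)

// With RF=2 and one node leaving plus one node joining per migrating shard,
// some writes legitimately cannot meet the consistency level, so only
// unexpected errors should fail the test.
for _, series := range seriesToWriteDuringPeerStreaming {
    if err := writeSeries(session, series); err != nil && !client.IsConsistencyResultError(err) {
        t.Fatalf("unexpected write error for %s: %v", series, err)
    }
}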
src/dbnode/storage/database.go
    d.queueBootstrapWithLock()
}

func (d *db) hasReceivedNewShards(incoming sharding.ShardSet) bool {
nit: Rename to hasReceivedNewShardsWithLock?
src/dbnode/storage/database.go
func (d *db) hasReceivedNewShards(incoming sharding.ShardSet) bool {
    var (
        existing    = d.shardSet
        existingSet = map[uint32]struct{}{}
nit: existingSet = make(map[uint32]struct{}, len(existing.AllIDs()))
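(Applying both nits, the method could look roughly like the sketch below; this assumes sharding.ShardSet exposes AllIDs() for the owned shard IDs, as in the diff above.)

func (d *db) hasReceivedNewShardsWithLock(incoming sharding.ShardSet) bool {
    var (
        existing    = d.shardSet
        existingSet = make(map[uint32]struct{}, len(existing.AllIDs()))
    )
    for _, shard := range existing.AllIDs() {
        existingSet[shard] = struct{}{}
    }
    // Any incoming shard not already owned means new shards were received.
    for _, shard := range incoming.AllIDs() {
        if _, ok := existingSet[shard]; !ok {
            return true
        }
    }
    return false
}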
LGTM other than some nits and the comment about only setting the last bootstrapped time if multiErr.FinalError() == nil for the bootstrap.
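(A minimal sketch of that suggestion; lastBootstrapCompletionTime and nowFn are hypothetical names for whatever the bootstrap manager actually records.)

// Only record a successful bootstrap time when the whole bootstrap succeeded,
// so durability checks never treat a partially failed bootstrap as complete.
if err := multiErr.FinalError(); err == nil {
    m.lastBootstrapCompletionTime = m.nowFn()
}
return multiErr.FinalError()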
Previously, M3DB implemented durability during topology changes by forcing the peers bootstrapper to write out a snapshot file for any mutable data that was streamed in from a peer. This ensured that if the node crashed after the topology change, the commitlog bootstrapper could restore all the data the node was expected to have.
We're currently trying to move M3DB to a model where every set of snapshot files indicates that all the data that had been received by the node up until the beginning of the snapshot could be recovered from a combination of the data files, snapshot files, and commitlog files. The existing implementation clashes with that idea because it was writing out individual snapshot files that were not tied to any large snapshot process.
As a result, this PR modifies M3DB to eschew writing out snapshot files for mutable data in the peers bootstrapper, and instead prevents the clustered database from marking shards as available until all bootstrapping for the topology change has completed and the most recent successful snapshot began after that bootstrap finished. This is a much simpler model that is robust against future changes to the database.
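(A rough sketch of that availability check; the function and parameter names are hypothetical stand-ins for the actual clustered database logic.)

// Shards taken on during a topology change are only marked available once all
// bootstrapping for the change has completed AND a snapshot that started after
// that bootstrap finished has completed successfully, guaranteeing the
// peer-streamed data is durable on disk.
func canMarkShardsAvailable(bootstrapping bool, lastBootstrapCompleted, lastSuccessfulSnapshotStart time.Time) bool {
    if bootstrapping {
        return false
    }
    return lastSuccessfulSnapshotStart.After(lastBootstrapCompleted)
}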