-
Notifications
You must be signed in to change notification settings - Fork 454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add operational guides for M3DB (topology and bootstrapping) #924
Merged
Merged
Changes from all commits
Commits
Show all changes
32 commits
Select commit
Hold shift + click to select a range
d5dd6eb
Add topology docs
9325309
Add operational guides for bootstrapping and improve topology guide
93a7fe8
Add links
43a5121
add clarification
5241eb1
Fix typo
781cdb8
rewrite sentence
ffa7e40
Add periods
7ea2a52
Fix punctuation
25ebf86
Clarify filesystem bootstrapper
f653b17
Break into separate paragraph
ad36b8e
Add clarification for commitlog bootstrap
3b2ed8c
Improve introduction
5f3a2f3
Fix typo
05cfba8
more fixes
b6593ce
Clarify default config
1f793ee
Fix typo
2e710de
Fix typo
a74744b
fix typo
1874e84
fix typo
1b2006f
Address feedback changes
be7149f
link to etcd
02bd7cc
Address review feedback
25fdd7f
Update placement doc with better diagrams
e5e31bc
Add another placement picture
3791fc9
Add monopic resources
6f10d8f
replace tables with diagrams
232337e
Add diagrams into block quotes
b56e01d
Update placement doc
de13961
Add new docs to menu
bd59e49
Address review feedback
a2d1b9f
Update bootstrapping guide
2b08628
Fix headings
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
Binary file not shown.
Binary file not shown.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,122 @@ | ||
# Bootstrapping | ||
|
||
## Introduction | ||
|
||
We recommend reading the [placement operational guide](placement.md) before reading the rest of this document. | ||
|
||
When an M3DB node is turned on (goes through a placement change) it needs to go through a bootstrapping process to determine the integrity of data that it has, replay writes from the commit log, and/or stream missing data from its peers. In most cases, as long as you're running with the default and recommended bootstrapper configuration of: `filesystem,commitlog,peers,uninitialized_topology` then you should not need to worry about the bootstrapping process at all and M3DB will take care of doing the right thing such that you don't lose data and consistency guarantees are met. Note that the order of the configured bootstrappers *does* matter. | ||
|
||
Generally speaking, we recommend that operators do not modify the bootstrappers configuration, but in the rare case that you to, this document is designed to help you understand the implications of doing so. | ||
|
||
M3DB currently supports 5 different bootstrappers: | ||
|
||
1. `filesystem` | ||
2. `commitlog` | ||
3. `peers` | ||
4. `uninitialized_topology` | ||
5. `noop_all` | ||
|
||
When the bootstrapping process begins, M3DB nodes need to determine two things: | ||
|
||
1. What shards the bootstrapping node should bootstrap, which can be determined from the cluster placement. | ||
2. What time-ranges the bootstrapping node needs to bootstrap those shards for, which can be determined from the namespace retention. | ||
|
||
For example, imagine a M3DB node that is responsible for shards 1, 5, 13, and 25 according to the cluster placement. In addition, it has a single namespace called "metrics" with a retention of 48 hours. When the M3DB node is started, the node will determine that it needs to bootstrap shards 1, 5, 13, and 25 for the time range starting at the current time and ending 48 hours ago. In order to obtain all this data, it will run the configured bootstrappers in the specified order. Every bootstrapper will notify the bootstrapping process of which shard/ranges it was able to bootstrap and the bootstrapping process will continue working its way through the list of bootstrappers until all the shards/ranges required have been marked as fulfilled. Otherwise the M3DB node will fail to start. | ||
|
||
## Bootstrappers | ||
|
||
### Filesystem Bootstrapper | ||
|
||
The `filesystem` bootstrapper's responsibility is to determine which immutable [Fileset files](../m3db/architecture/storage.md) exist on disk, and if so, mark them as fulfilled. The `filesystem` bootstrapper achieves this by scanning M3DB's directory structure and determining which Fileset files exist on disk. Unlike the other bootstrappers, the `filesystem` bootstrapper does not need to load any data into memory, it simply verifies the checksums of the data on disk and other components of the M3DB node will handle reading (and caching) the data dynamically once it begins to serve reads. | ||
|
||
### Commitlog Bootstrapper | ||
|
||
The `commitlog` bootstrapper's responsibility is to read the commitlog and snapshot (compacted commitlogs) files on disk and recover any data that has not yet been written out as an immutable Fileset file. Unlike the `filesystem` bootstrapper, the commit log bootstrapper cannot simply check which files are on disk in order to determine if it can satisfy a bootstrap request. Instead, the `commitlog` bootstrapper determines whether it can satisfy a bootstrap request using a simple heuristic. | ||
|
||
On a shard-by-shard basis, the `commitlog` bootstrapper will consult the cluster placement to see if the node it is running on has ever achieved the `Available` status for the specified shard. If so, then the commit log bootstrapper should have all the data since the last Fileset file was flushed and will return that it can satisfy any time range for that shard. In other words, the commit log bootstrapper is all-or-nothing for a given shard: it will either return that it can satisfy any time range for a given shard or none at all. In addition, the `commitlog` bootstrapper *assumes* it is running after the `filesystem` bootstrapper. M3DB will not allow you to run with a configuration where the `filesystem` bootstrapper is placed after the `commitlog` bootstrapper, but it will allow you to run the `commitlog` bootstrapper without the `filesystem` bootstrapper which can result in loss of data, depending on the workload. | ||
|
||
### Peers Bootstrapper | ||
|
||
The `peers` bootstrapper's responsibility is to stream in data for shard/ranges from other M3DB nodes (peers) in the cluster. This bootstrapper is only useful in M3DB clusters with more than a single node *and* where the replication factor is set to a value larger than 1. The `peers` bootstrapper will determine whether or not it can satisfy a bootstrap request on a shard-by-shard basis by consulting the cluster placement and determining if there are enough peers to satisfy the bootstrap request. For example, imagine the following M3DB placement where node A is trying to perform a peer bootstrap: | ||
|
||
``` | ||
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ | ||
│ Node A │ │ Node B │ │ Node C │ | ||
────┴─────────────────┴──────────┴─────────────────┴────────┴─────────────────┴─── | ||
┌─────────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐ | ||
│ │ │ │ │ │ | ||
│ │ │ │ │ │ | ||
│ Shard 1: Initializing │ │ Shard 1: Initializing │ │ Shard 1: Available │ | ||
│ Shard 2: Initializing │ │ Shard 2: Initializing │ │ Shard 2: Available │ | ||
│ Shard 3: Initializing │ │ Shard 3: Initializing │ │ Shard 3: Available │ | ||
│ │ │ │ │ │ | ||
│ │ │ │ │ │ | ||
└─────────────────────────┘ └───────────────────────┘ └──────────────────────┘ | ||
``` | ||
|
||
In this case, the `peers` bootstrapper running on node A will not be able to fullfill any requests because node B is in the `Initializing` state for all of its shards and cannot fulfill bootstrap requests. This means that node A's `peers` bootstrapper cannot meet its default consistency level of majority for bootstrapping (1 < 2 which is majority with a replication factor of 3). On the other hand, node A would be able to peer bootstrap its shards in the following placement because its peers (nodes B/C) have sufficient replicas of the shards it needs in the `Available` state: | ||
|
||
``` | ||
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ | ||
│ Node A │ │ Node B │ │ Node C │ | ||
────┴─────────────────┴──────────┴─────────────────┴────────┴─────────────────┴─── | ||
┌─────────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐ | ||
│ │ │ │ │ │ | ||
│ │ │ │ │ │ | ||
│ Shard 1: Initializing │ │ Shard 1: Available │ │ Shard 1: Available │ | ||
│ Shard 2: Initializing │ │ Shard 2: Available │ │ Shard 2: Available │ | ||
│ Shard 3: Initializing │ │ Shard 3: Available │ │ Shard 3: Available │ | ||
│ │ │ │ │ │ | ||
│ │ │ │ │ │ | ||
└─────────────────────────┘ └───────────────────────┘ └──────────────────────┘ | ||
``` | ||
|
||
Note that a bootstrap consistency level of majority is the default value, but can be modified by changing the value of the key "m3db.client.bootstrap-consistency-level" in [etcd](https://coreos.com/etcd/) to one of: "none", "one", "unstrict_majority" (attempt to read from majority, but settle for less if any errors occur), "majority" (strict majority), and "all". For example, if an entire cluster with a replication factor of 3 was restarted simultaneously, all the nodes would get stuck in an infinite loop trying to peer bootstrap from each other and not achieving majority until an operator modified this value. | ||
|
||
**Note**: Any bootstrappers configuration that does not include the `peers` bootstrapper will be unable to handle dynamic placement changes of any kind. | ||
|
||
### Uninitialized Topology Bootstrapper | ||
|
||
The purpose of the `uninitialized_topology` bootstrapper is to succeed bootstraps for all time ranges for shards that have never been completely bootstrapped (at a cluster level). This allows us to run the default bootstrapper configuration of: `filesystem,commitlog,peers,topology_uninitialized` such that the `filesystem` and `commitlog` bootstrappers are used by default in node restarts, the `peers` bootstrapper is used for node adds/removes/replaces, and bootstraps still succeed for brand new placement where both the `commitlog` and `peers` bootstrappers will be unable to succeed any bootstraps. In other words, the `uninitialized_topology` bootstrapper allows us to place the `commitlog` bootstrapper *before* the `peers` bootstrapper and still succeed bootstraps with brand new placements without resorting to using the noop-all bootstrapper which suceeds bootstraps for all shard/time-ranges regardless of the status of the placement. | ||
|
||
The `uninitialized_topology` bootstrapper determines whether a placement is "new" for a given shard by counting the number of nodes in the `Initializing` state and `Leaving` states and there are more `Initializing` than `Leaving`, then it succeeds the bootstrap because that means the placement has never reached a state where all nodes are `Available`. | ||
|
||
### No Operational All Bootstrapper | ||
|
||
The `noop_all` bootstrapper succeeds all bootstraps regardless of requests shards/time ranges. | ||
|
||
## Bootstrappers Configuration | ||
|
||
Now that we've gone over the various bootstrappers, let's consider how M3DB will behave in different configurations. Note that we include `uninitialized_topology` at the end of all the lists of bootstrappers because its required to get a new placement up and running in the first place, but is not required after that (although leaving it in has no detrimental effects). Also note that any configuration that does not include the `peers` bootstrapper will not be able to handle dynamic placement changes like node adds/removes/replaces. | ||
|
||
### filesystem,commitlog,peers,uninitialized_topology (default) | ||
|
||
This is the default bootstrappers configuration for M3DB and will behave "as expected" in the sense that it will maintain M3DB's consistency guarantees at all times, handle node adds/replaces/removes correctly, and still work with brand new placements / topologies. **This is the only configuration that we recommend using in production**. | ||
|
||
In the general case, the node will use only the `filesystem` and `commitlog` bootstrappers on node startup. However, in the case of a node add/remove/replace, the `commitlog` bootstrapper will detect that it is unable to fulfill the bootstrap request (because the node has never reached the `Available` state) and defer to the `peers` bootstrapper to stream in the data. | ||
|
||
Additionally, if it is a brand new placement where even the `peers` bootstrapper cannot fulfill the bootstrap, this will be detected by the `uninitialized_topology` bootstrapper which will succeed the bootstrap. | ||
|
||
### filesystem,peers,uninitialized_topology (default) | ||
|
||
Everytime a node is restarted it will attempt to stream in all of the the data for any blocks that it has never flushed, which is generally the currently active block and possibly the previous block as well. This mode can be useful if you want to improve performance or save disk space by operating nodes without a commitlog, or want to force a repair of any unflushed blocks. This mode can lead to violations of M3DB's consistency guarantees due to the fact that commit logs are being ignored. In addition, if you lose a replication factors worth or more of hosts at the same time, the node will not be able to bootstrap unless an operator modifies the bootstrap consistency level configuration in etcd (see `peers` bootstrap section above). Finally, this mode adds additional network and resource pressure on other nodes in the cluster while one node is peer bootstrapping from them which can be problematic in catastrophic scenarios where all the nodes are trying to stream data from each other. | ||
|
||
### peers,uninitialized_topology | ||
|
||
Every time a node is restarted, it will attempt to stream in *all* of the data that it is responsible for from its peers, completely ignoring the immutable Fileset files it already has on disk. This mode can be useful if you want to improve performance or save disk space by operating nodes without a commitlog, or want to force a repair of all data on an individual node. This mode can lead to violations of M3DB's consistency guarantees due to the fact that the commit logs are being ignored. In addition, if you lose a replication factors worth or more of hosts at the same time, the node will not be able to bootstrap unless an operator modifies the bootstrap consistency level configuration in etcd (see `peers` bootstrap section above). Finally, this mode adds additional network and resource pressure on other nodes in the cluster while one node is peer bootstrapping from them which can be problematic in catastrophic scenarios where all the nodes are trying to stream data from each other. | ||
|
||
## Invalid bootstrappers configuration | ||
|
||
For the sake of completeness, we've included a short discussion below of some bootstrapping configurations that we consider "invalid" in that they are likely to lose data / violate M3DB's consistency guarantees and/or not handle placement changes in a correct way. | ||
|
||
### filesystem,commitlog,uninitialized_topology | ||
|
||
This bootstrapping configuration will work just fine if nodes are never added/replaced/removed, but will fail when attempting a node add/replace/remove. | ||
|
||
### filesystem,uninitialized_topology | ||
|
||
Every time a node is restarted it will utilize the immutable Fileset files its already written out to disk, but any data that it had received since it wrote out the last set of immutable files will be lost. | ||
|
||
### commitlog,uninitialized_topology | ||
|
||
Every time a node is restarted it will read all the commit log and snapshot files it has on disk, but it will ignore all the data in the immutable Fileset files that it has already written. |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
slightly orthogonal - we should probably be consistent about what we been by M3DB node v host v process. Dunno if we are already but something to lookout for thru our docs