diff --git a/docs/m3db/monodraw/placement_monodraw_template.monopic b/docs/m3db/monodraw/placement_monodraw_template.monopic new file mode 100644 index 0000000000..166e4c2c18 Binary files /dev/null and b/docs/m3db/monodraw/placement_monodraw_template.monopic differ diff --git a/docs/m3db/monodraw/placement_state_machine.monopic b/docs/m3db/monodraw/placement_state_machine.monopic new file mode 100644 index 0000000000..a337fd0953 Binary files /dev/null and b/docs/m3db/monodraw/placement_state_machine.monopic differ diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md new file mode 100644 index 0000000000..dbf690613e --- /dev/null +++ b/docs/operational_guide/bootstrapping.md @@ -0,0 +1,122 @@ +# Bootstrapping + +## Introduction + +We recommend reading the [placement operational guide](placement.md) before reading the rest of this document. + +When an M3DB node is turned on (or goes through a placement change) it needs to go through a bootstrapping process to determine the integrity of data that it has, replay writes from the commit log, and/or stream missing data from its peers. In most cases, as long as you're running with the default and recommended bootstrapper configuration of `filesystem,commitlog,peers,uninitialized_topology`, you should not need to worry about the bootstrapping process at all and M3DB will take care of doing the right thing such that you don't lose data and consistency guarantees are met. Note that the order of the configured bootstrappers *does* matter. + +Generally speaking, we recommend that operators do not modify the bootstrappers configuration, but in the rare case that you need to, this document is designed to help you understand the implications of doing so. + +M3DB currently supports 5 different bootstrappers: + +1. `filesystem` +2. `commitlog` +3. `peers` +4. `uninitialized_topology` +5. `noop_all` + +When the bootstrapping process begins, M3DB nodes need to determine two things: + +1. 
What shards the bootstrapping node should bootstrap, which can be determined from the cluster placement. +2. What time-ranges the bootstrapping node needs to bootstrap those shards for, which can be determined from the namespace retention. + +For example, imagine an M3DB node that is responsible for shards 1, 5, 13, and 25 according to the cluster placement. In addition, it has a single namespace called "metrics" with a retention of 48 hours. When the M3DB node is started, the node will determine that it needs to bootstrap shards 1, 5, 13, and 25 for the time range starting 48 hours ago and ending at the current time. In order to obtain all this data, it will run the configured bootstrappers in the specified order. Every bootstrapper will notify the bootstrapping process of which shard/ranges it was able to bootstrap and the bootstrapping process will continue working its way through the list of bootstrappers until all the shards/ranges required have been marked as fulfilled. If the list is exhausted before that happens, the M3DB node will fail to start. + +## Bootstrappers + +### Filesystem Bootstrapper + +The `filesystem` bootstrapper's responsibility is to determine which immutable [Fileset files](../m3db/architecture/storage.md) exist on disk and mark the shard/ranges they cover as fulfilled. The `filesystem` bootstrapper achieves this by scanning M3DB's directory structure. Unlike the other bootstrappers, the `filesystem` bootstrapper does not need to load any data into memory; it simply verifies the checksums of the data on disk, and other components of the M3DB node will handle reading (and caching) the data dynamically once it begins to serve reads. + +### Commitlog Bootstrapper + +The `commitlog` bootstrapper's responsibility is to read the commitlog and snapshot (compacted commitlogs) files on disk and recover any data that has not yet been written out as an immutable Fileset file. 
Unlike the `filesystem` bootstrapper, the commit log bootstrapper cannot simply check which files are on disk in order to determine if it can satisfy a bootstrap request. Instead, the `commitlog` bootstrapper determines whether it can satisfy a bootstrap request using a simple heuristic. + +On a shard-by-shard basis, the `commitlog` bootstrapper will consult the cluster placement to see if the node it is running on has ever achieved the `Available` status for the specified shard. If so, then the commit log bootstrapper should have all the data since the last Fileset file was flushed and will return that it can satisfy any time range for that shard. In other words, the commit log bootstrapper is all-or-nothing for a given shard: it will either return that it can satisfy any time range for a given shard or none at all. In addition, the `commitlog` bootstrapper *assumes* it is running after the `filesystem` bootstrapper. M3DB will not allow you to run with a configuration where the `filesystem` bootstrapper is placed after the `commitlog` bootstrapper, but it will allow you to run the `commitlog` bootstrapper without the `filesystem` bootstrapper which can result in loss of data, depending on the workload. + +### Peers Bootstrapper + +The `peers` bootstrapper's responsibility is to stream in data for shard/ranges from other M3DB nodes (peers) in the cluster. This bootstrapper is only useful in M3DB clusters with more than a single node *and* where the replication factor is set to a value larger than 1. The `peers` bootstrapper will determine whether or not it can satisfy a bootstrap request on a shard-by-shard basis by consulting the cluster placement and determining if there are enough peers to satisfy the bootstrap request. 
For example, imagine the following M3DB placement where node A is trying to perform a peer bootstrap: + +``` + ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ + │ Node A │ │ Node B │ │ Node C │ +────┴─────────────────┴──────────┴─────────────────┴────────┴─────────────────┴─── +┌─────────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐ +│ │ │ │ │ │ +│ │ │ │ │ │ +│ Shard 1: Initializing │ │ Shard 1: Initializing │ │ Shard 1: Available │ +│ Shard 2: Initializing │ │ Shard 2: Initializing │ │ Shard 2: Available │ +│ Shard 3: Initializing │ │ Shard 3: Initializing │ │ Shard 3: Available │ +│ │ │ │ │ │ +│ │ │ │ │ │ +└─────────────────────────┘ └───────────────────────┘ └──────────────────────┘ +``` + +In this case, the `peers` bootstrapper running on node A will not be able to fulfill any requests because node B is in the `Initializing` state for all of its shards and cannot fulfill bootstrap requests. This means that node A's `peers` bootstrapper cannot meet its default consistency level of majority for bootstrapping (only 1 peer is `Available`, which is fewer than the 2 required for a majority with a replication factor of 3). 
On the other hand, node A would be able to peer bootstrap its shards in the following placement because its peers (nodes B/C) have sufficient replicas of the shards it needs in the `Available` state: + +``` + ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ + │ Node A │ │ Node B │ │ Node C │ +────┴─────────────────┴──────────┴─────────────────┴────────┴─────────────────┴─── +┌─────────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐ +│ │ │ │ │ │ +│ │ │ │ │ │ +│ Shard 1: Initializing │ │ Shard 1: Available │ │ Shard 1: Available │ +│ Shard 2: Initializing │ │ Shard 2: Available │ │ Shard 2: Available │ +│ Shard 3: Initializing │ │ Shard 3: Available │ │ Shard 3: Available │ +│ │ │ │ │ │ +│ │ │ │ │ │ +└─────────────────────────┘ └───────────────────────┘ └──────────────────────┘ +``` + +Note that a bootstrap consistency level of majority is the default value, but can be modified by changing the value of the key "m3db.client.bootstrap-consistency-level" in [etcd](https://coreos.com/etcd/) to one of: "none", "one", "unstrict_majority" (attempt to read from majority, but settle for less if any errors occur), "majority" (strict majority), and "all". For example, if an entire cluster with a replication factor of 3 was restarted simultaneously, all the nodes would get stuck in an infinite loop trying to peer bootstrap from each other and not achieving majority until an operator modified this value. + +**Note**: Any bootstrappers configuration that does not include the `peers` bootstrapper will be unable to handle dynamic placement changes of any kind. + +### Uninitialized Topology Bootstrapper + +The purpose of the `uninitialized_topology` bootstrapper is to succeed bootstraps for all time ranges for shards that have never been completely bootstrapped (at a cluster level). 
This allows us to run the default bootstrapper configuration of `filesystem,commitlog,peers,uninitialized_topology` such that the `filesystem` and `commitlog` bootstrappers are used by default in node restarts, the `peers` bootstrapper is used for node adds/removes/replaces, and bootstraps still succeed for brand new placements where both the `commitlog` and `peers` bootstrappers will be unable to succeed any bootstraps. In other words, the `uninitialized_topology` bootstrapper allows us to place the `commitlog` bootstrapper *before* the `peers` bootstrapper and still succeed bootstraps with brand new placements without resorting to using the `noop_all` bootstrapper, which succeeds bootstraps for all shard/time-ranges regardless of the status of the placement. + +The `uninitialized_topology` bootstrapper determines whether a placement is "new" for a given shard by counting the number of nodes in the `Initializing` and `Leaving` states. If there are more `Initializing` than `Leaving`, then it succeeds the bootstrap because that means the placement has never reached a state where all nodes are `Available`. + +### No Operational All Bootstrapper + +The `noop_all` bootstrapper succeeds all bootstraps regardless of the requested shards/time ranges. + +## Bootstrappers Configuration + +Now that we've gone over the various bootstrappers, let's consider how M3DB will behave in different configurations. Note that we include `uninitialized_topology` at the end of all the lists of bootstrappers because it's required to get a new placement up and running in the first place, but is not required after that (although leaving it in has no detrimental effects). Also note that any configuration that does not include the `peers` bootstrapper will not be able to handle dynamic placement changes like node adds/removes/replaces. 
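The way an ordered list of bootstrappers resolves a bootstrap can be sketched with a small model. This is only an illustration of the rules described above, not M3DB's implementation: `ShardView`, `fulfills`, and `bootstrap` are hypothetical names, and the time range is simplified to two buckets ("flushed" and "unflushed") instead of real time ranges.

```python
from dataclasses import dataclass, field

INITIALIZING, AVAILABLE, LEAVING = "Initializing", "Available", "Leaving"
DEFAULT_CHAIN = ["filesystem", "commitlog", "peers", "uninitialized_topology"]


@dataclass
class ShardView:
    """Hypothetical summary of one shard as seen by the bootstrapping node."""
    filesets_on_disk: bool = False   # flushed blocks exist as Fileset files
    ever_available: bool = False     # node reached Available for this shard before
    own_state: str = INITIALIZING    # this node's state for the shard
    peer_states: list = field(default_factory=list)  # states on the other replicas


def fulfills(name, time_range, shard, replication_factor):
    """Return True if bootstrapper `name` claims `time_range` for `shard`."""
    if name == "filesystem":
        return time_range == "flushed" and shard.filesets_on_disk
    if name == "commitlog":
        return shard.ever_available  # all-or-nothing for the shard
    if name == "peers":
        majority = replication_factor // 2 + 1
        return shard.peer_states.count(AVAILABLE) >= majority
    if name == "uninitialized_topology":
        states = shard.peer_states + [shard.own_state]
        return states.count(INITIALIZING) > states.count(LEAVING)
    if name == "noop_all":
        return True
    raise ValueError(f"unknown bootstrapper: {name}")


def bootstrap(chain, shard, replication_factor=3):
    """Run the chain in order and record which bootstrapper claimed each range."""
    remaining = ["flushed", "unflushed"]
    claimed = {}
    for name in chain:
        for time_range in list(remaining):
            if fulfills(name, time_range, shard, replication_factor):
                claimed[time_range] = name
                remaining.remove(time_range)
    if remaining:
        raise RuntimeError(f"no bootstrapper fulfilled: {remaining}")
    return claimed


# A node restart, a node add (seen from the new node), and a brand new placement.
restart = ShardView(True, True, AVAILABLE, [AVAILABLE, AVAILABLE])
node_add = ShardView(peer_states=[LEAVING, AVAILABLE, AVAILABLE])
new_placement = ShardView(peer_states=[INITIALIZING, INITIALIZING])

for label, shard in [("restart", restart), ("node add", node_add),
                     ("new placement", new_placement)]:
    print(label, bootstrap(DEFAULT_CHAIN, shard))
```

In this model the restart is served by `filesystem` and `commitlog`, the node add falls through to `peers`, and the new placement is only resolved by `uninitialized_topology`. Dropping `peers` from the chain makes the node add case raise, which mirrors the note above about placement changes.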
+ +### filesystem,commitlog,peers,uninitialized_topology (default) + +This is the default bootstrappers configuration for M3DB and will behave "as expected" in the sense that it will maintain M3DB's consistency guarantees at all times, handle node adds/replaces/removes correctly, and still work with brand new placements / topologies. **This is the only configuration that we recommend using in production**. + +In the general case, the node will use only the `filesystem` and `commitlog` bootstrappers on node startup. However, in the case of a node add/remove/replace, the `commitlog` bootstrapper will detect that it is unable to fulfill the bootstrap request (because the node has never reached the `Available` state) and defer to the `peers` bootstrapper to stream in the data. + +Additionally, if it is a brand new placement where even the `peers` bootstrapper cannot fulfill the bootstrap, this will be detected by the `uninitialized_topology` bootstrapper which will succeed the bootstrap. + +### filesystem,peers,uninitialized_topology + +Every time a node is restarted it will attempt to stream in all of the data for any blocks that it has never flushed, which is generally the currently active block and possibly the previous block as well. This mode can be useful if you want to improve performance or save disk space by operating nodes without a commitlog, or want to force a repair of any unflushed blocks. This mode can lead to violations of M3DB's consistency guarantees due to the fact that commit logs are being ignored. In addition, if you lose a replication factor's worth or more of hosts at the same time, the node will not be able to bootstrap unless an operator modifies the bootstrap consistency level configuration in etcd (see the `peers` bootstrapper section above). 
Finally, this mode adds network and resource pressure on other nodes in the cluster while one node is peer bootstrapping from them, which can be problematic in catastrophic scenarios where all the nodes are trying to stream data from each other. + +### peers,uninitialized_topology + +Every time a node is restarted, it will attempt to stream in *all* of the data that it is responsible for from its peers, completely ignoring the immutable Fileset files it already has on disk. This mode can be useful if you want to improve performance or save disk space by operating nodes without a commitlog, or want to force a repair of all data on an individual node. This mode can lead to violations of M3DB's consistency guarantees due to the fact that the commit logs are being ignored. In addition, if you lose a replication factor's worth or more of hosts at the same time, the node will not be able to bootstrap unless an operator modifies the bootstrap consistency level configuration in etcd (see the `peers` bootstrapper section above). Finally, this mode adds network and resource pressure on other nodes in the cluster while one node is peer bootstrapping from them, which can be problematic in catastrophic scenarios where all the nodes are trying to stream data from each other. + +## Invalid bootstrappers configuration + +For the sake of completeness, we've included a short discussion below of some bootstrapping configurations that we consider "invalid" in that they are likely to lose data / violate M3DB's consistency guarantees and/or not handle placement changes in a correct way. + +### filesystem,commitlog,uninitialized_topology + +This bootstrapping configuration will work just fine if nodes are never added/replaced/removed, but will fail when attempting a node add/replace/remove. 
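The trade-offs listed for the configurations in this document can be condensed into a small review helper. This is a hypothetical sketch for reasoning about a bootstrappers list before deploying it; `review_bootstrappers` is not an M3DB function, and the warnings only restate what this guide says.

```python
def review_bootstrappers(config):
    """Return warnings for a comma-separated bootstrappers list (illustrative only)."""
    names = [name.strip() for name in config.split(",")]
    warnings = []

    # The guide states M3DB refuses to run with filesystem placed after commitlog.
    if "filesystem" in names and "commitlog" in names:
        if names.index("filesystem") > names.index("commitlog"):
            warnings.append("filesystem is placed after commitlog, which M3DB does not allow")

    # commitlog alone claims every range, so flushed Fileset data is never consulted.
    if "commitlog" in names and "filesystem" not in names:
        warnings.append("data in immutable Fileset files is ignored")

    if "commitlog" not in names:
        if "peers" in names:
            warnings.append("commit logs are ignored; unflushed data is streamed from peers")
        else:
            warnings.append("writes received since the last flush are lost on restart")

    if "peers" not in names:
        warnings.append("node adds/replaces/removes cannot be handled")

    return warnings


for config in ("filesystem,commitlog,peers,uninitialized_topology",
               "filesystem,commitlog,uninitialized_topology",
               "commitlog,uninitialized_topology"):
    print(config, "->", review_bootstrappers(config) or "no warnings")
```

Only the default configuration comes back with no warnings in this sketch; every other list in this guide triggers at least one.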
+ +### filesystem,uninitialized_topology + +Every time a node is restarted it will utilize the immutable Fileset files it has already written out to disk, but any data that it had received since it wrote out the last set of immutable files will be lost. + +### commitlog,uninitialized_topology + +Every time a node is restarted it will read all the commit log and snapshot files it has on disk, but it will ignore all the data in the immutable Fileset files that it has already written. diff --git a/docs/operational_guide/placement.md b/docs/operational_guide/placement.md new file mode 100644 index 0000000000..f49abdc615 --- /dev/null +++ b/docs/operational_guide/placement.md @@ -0,0 +1,220 @@ +# Placement + +## Overview + +**Note**: The words *placement* and *topology* are used interchangeably throughout the M3DB documentation and codebase. + +An M3DB cluster has exactly one placement. That placement maps the cluster's shard replicas to nodes. A cluster also has 0 or more namespaces (analogous to tables in other databases), and each node serves every namespace for the shards it owns. In other words, if the cluster topology states that node A owns shards 1, 2, and 3 then node A will own shards 1, 2, 3 for all configured namespaces in the cluster. + +M3DB stores its placement (mapping of which nodes are responsible for which shards) in [etcd](https://coreos.com/etcd/). There are three possible states that each node/shard pair can be in: + +1. `Initializing` +2. `Available` +3. `Leaving` + +Note that these states are not a reflection of the current status of an M3DB node, but an indication of whether a given node has ever successfully bootstrapped and taken ownership of a given shard (achieved goal state). For example, in a new cluster all the nodes will begin with all of their shards in the `Initializing` state. Once all the nodes finish bootstrapping, they will mark all of their shards as `Available`. 
If all the M3DB nodes are stopped at the same time, the cluster placement will still show all of the shards for all of the nodes as `Available`. + +## Initializing State + +The `Initializing` state is the state in which all new node/shard combinations begin. For example, upon creating a new placement all the node/shard pairs will begin in the `Initializing` state and only once they have successfully bootstrapped will they transition to the `Available` state. + +The `Initializing` state is not limited to new placements, however, as it can also occur during placement changes. For example, during a node add/replace the new node will begin with all of its shards in the `Initializing` state until it can stream the data it is missing from its peers. During a node removal, all of the nodes that receive new shards (as a result of taking over the responsibilities of the node that is leaving) will begin with those shards marked as `Initializing` until they can stream in the data from the node leaving the cluster, or one of its peers. + +## Available State + +Once a node with a shard in the `Initializing` state successfully bootstraps all of the data for that shard, it will mark that shard as `Available` (for the single node) in the cluster placement. + +## Leaving State + +The `Leaving` state indicates that a node has been marked for removal from the cluster. The purpose of this state is to allow the node to remain in the cluster long enough for the nodes that are taking over its responsibilities to stream data from it. + +## Sample Cluster State Transitions - Node Add + +Node adds are performed by adding the new node to the placement. Some portion of the existing shards will be assigned to the new node based on its weight, and they will begin in the `Initializing` state. Similarly, the shards will be marked as `Leaving` on the nodes that are destined to lose ownership of them. 
Once the new node finishes bootstrapping the shards, it will update the placement to indicate that the shards it owns are `Available` and that the `Leaving` node should no longer own that shard in the placement. + +``` +Replication factor: 3 + + ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ + │ Node A │ │ Node B │ │ Node C │ │ Node D │ +┌──────────────────────────┬─────┴─────────────────┴─────┬────┴─────────────────┴────┬───┴─────────────────┴───┬───┴─────────────────┴───┐ +│ │ ┌─────────────────────────┐ │ ┌───────────────────────┐ │ ┌──────────────────────┐│ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ │ Shard 1: Available │ │ │ Shard 1: Available │ │ │ Shard 1: Available ││ │ +│ 1) Initial Placement │ │ Shard 2: Available │ │ │ Shard 2: Available │ │ │ Shard 2: Available ││ │ +│ │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ │ Shard 3: Available ││ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ └─────────────────────────┘ │ └───────────────────────┘ │ └──────────────────────┘│ │ +├──────────────────────────┼─────────────────────────────┼───────────────────────────┼─────────────────────────┼─────────────────────────┤ +│ │ │ │ │ │ +│ │ ┌─────────────────────────┐ │ ┌───────────────────────┐ │ ┌──────────────────────┐│┌──────────────────────┐ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ │ Shard 1: Leaving │ │ │ Shard 1: Available │ │ │ Shard 1: Available │││Shard 1: Initializing │ │ +│ 2) Begin Node Add │ │ Shard 2: Available │ │ │ Shard 2: Leaving │ │ │ Shard 2: Available │││Shard 2: Initializing │ │ +│ │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ │ Shard 3: Leaving │││Shard 3: Initializing │ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ └─────────────────────────┘ │ └───────────────────────┘ │ └──────────────────────┘│└──────────────────────┘ │ +│ │ │ │ │ │ 
+├──────────────────────────┼─────────────────────────────┼───────────────────────────┼─────────────────────────┼─────────────────────────┤ +│ │ │ │ │ │ +│ │ ┌─────────────────────────┐ │ ┌───────────────────────┐ │ ┌──────────────────────┐│┌──────────────────────┐ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ │ Shard 2: Available │ │ │ Shard 1: Available │ │ │ Shard 1: Available │││ Shard 1: Available │ │ +│ 3) Complete Node Add │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ │ Shard 2: Available │││ Shard 2: Available │ │ +│ │ │ │ │ │ │ │ │ │││ Shard 3: Available │ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ └─────────────────────────┘ │ └───────────────────────┘ │ └──────────────────────┘│└──────────────────────┘ │ +│ │ │ │ │ │ +└──────────────────────────┴─────────────────────────────┴───────────────────────────┴─────────────────────────┴─────────────────────────┘ +``` + +## Sample Cluster State Transitions - Node Remove + +Node removes are performed by updating the placement such that all the shards on the node that will be removed from the cluster are marked as `Leaving` and those shards are distributed to the remaining nodes (based on their weight) and assigned a state of `Initializing`. Once the existing nodes that are taking ownership of the leaving node's shards finish bootstrapping, they will update the placement to indicate that the shards that they just acquired are `Available` and that the leaving node should no longer own those shards in the placement. 
+ +``` +Replication factor: 3 + + ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ + │ Node A │ │ Node B │ │ Node C │ │ Node D │ +┌──────────────────────────┬─────┴─────────────────┴─────┬────┴─────────────────┴────┬───┴─────────────────┴───┬───┴─────────────────┴───┐ +│ │ ┌─────────────────────────┐ │ ┌───────────────────────┐ │ ┌──────────────────────┐│┌──────────────────────┐ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ │ Shard 2: Available │ │ │ Shard 1: Available │ │ │ Shard 1: Available │││ Shard 1: Available │ │ +│ 1) Initial Placement │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ │ Shard 2: Available │││ Shard 2: Available │ │ +│ │ │ │ │ │ │ │ │ │││ Shard 3: Available │ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ └─────────────────────────┘ │ └───────────────────────┘ │ └──────────────────────┘│└──────────────────────┘ │ +├──────────────────────────┼─────────────────────────────┼───────────────────────────┼─────────────────────────┼─────────────────────────┤ +│ │ │ │ │ │ +│ │ ┌─────────────────────────┐ │ ┌───────────────────────┐ │┌───────────────────────┐│┌──────────────────────┐ │ +│ │ │ │ │ │ │ ││ │││ │ │ +│ │ │ │ │ │ │ ││ Shard 1: Available │││ │ │ +│ │ │ Shard 1: Initializing │ │ │ Shard 1: Available │ ││ Shard 2: Available │││ Shard 1: Leaving │ │ +│ 2) Begin Node Remove │ │ Shard 2: Available │ │ │ Shard 2: Initializing│ ││ Shard 3: Initializing│││ Shard 2: Leaving │ │ +│ │ │ Shard 3: Available │ │ │ Shard 3: Available │ ││ │││ Shard 3: Leaving │ │ +│ │ │ │ │ │ │ ││ │││ │ │ +│ │ │ │ │ │ │ ││ │││ │ │ +│ │ └─────────────────────────┘ │ └───────────────────────┘ │└───────────────────────┘│└──────────────────────┘ │ +│ │ │ │ │ │ +├──────────────────────────┼─────────────────────────────┼───────────────────────────┼─────────────────────────┼─────────────────────────┤ +│ │ │ │ │ │ +│ │ ┌─────────────────────────┐ │ ┌───────────────────────┐ │ ┌──────────────────────┐│ │ +│ │ │ │ │ │ │ │ 
│ ││ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ │ Shard 1: Available │ │ │ Shard 1: Available │ │ │ Shard 1: Available ││ │ +│ 3) Complete Node Remove │ │ Shard 2: Available │ │ │ Shard 2: Available │ │ │ Shard 2: Available ││ │ +│ │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ │ Shard 3: Available ││ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ └─────────────────────────┘ │ └───────────────────────┘ │ └──────────────────────┘│ │ +│ │ │ │ │ │ +└──────────────────────────┴─────────────────────────────┴───────────────────────────┴─────────────────────────┴─────────────────────────┘ +``` + +## Sample Cluster State Transitions - Node Replace + +Node replaces are performed by updating the placement such that all the shards on the node that will be removed from the cluster are marked as `Leaving` and those shards are all added to the node that is being added and assigned a state of `Initializing`. Once the replacement node finishes bootstrapping, it will update the placement to indicate that the shards that it acquired are `Available` and that the leaving node should no longer own those shards in the placement. 
+ +``` +Replication factor: 3 + + ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ + │ Node A │ │ Node B │ │ Node C │ │ Node D │ +┌──────────────────────────┬─────┴─────────────────┴─────┬────┴─────────────────┴────┬───┴─────────────────┴───┬───┴─────────────────┴───┐ +│ │ ┌─────────────────────────┐ │ ┌───────────────────────┐ │ ┌──────────────────────┐│ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ │ Shard 1: Available │ │ │ Shard 1: Available │ │ │ Shard 1: Available ││ │ +│ 1) Initial Placement │ │ Shard 2: Available │ │ │ Shard 2: Available │ │ │ Shard 2: Available ││ │ +│ │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ │ Shard 3: Available ││ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ └─────────────────────────┘ │ └───────────────────────┘ │ └──────────────────────┘│ │ +├──────────────────────────┼─────────────────────────────┼───────────────────────────┼─────────────────────────┼─────────────────────────┤ +│ │ │ │ │ │ +│ │ ┌─────────────────────────┐ │ ┌───────────────────────┐ │┌───────────────────────┐│┌──────────────────────┐ │ +│ │ │ │ │ │ │ ││ │││ │ │ +│ │ │ │ │ │ │ ││ │││ │ │ +│ │ │ Shard 1: Available │ │ │ Shard 1: Available │ ││ Shard 1: Leaving │││Shard 1: Initializing │ │ +│ 2) Begin Node Replace │ │ Shard 2: Available │ │ │ Shard 2: Available │ ││ Shard 2: Leaving │││Shard 2: Initializing │ │ +│ │ │ Shard 3: Available │ │ │ Shard 3: Available │ ││ Shard 3: Leaving │││Shard 3: Initializing │ │ +│ │ │ │ │ │ │ ││ │││ │ │ +│ │ │ │ │ │ │ ││ │││ │ │ +│ │ └─────────────────────────┘ │ └───────────────────────┘ │└───────────────────────┘│└──────────────────────┘ │ +│ │ │ │ │ │ +├──────────────────────────┼─────────────────────────────┼───────────────────────────┼─────────────────────────┼─────────────────────────┤ +│ │ │ │ │ │ +│ │ ┌─────────────────────────┐ │ ┌───────────────────────┐ │ │┌──────────────────────┐ │ +│ │ │ │ │ │ │ │ ││ │ │ +│ │ │ │ │ │ │ │ ││ │ │ +│ │ │ Shard 1: Available │ │ │ Shard 
1: Available │ │ ││ Shard 1: Available │ │ +│ 3) Complete Node Replace │ │ Shard 2: Available │ │ │ Shard 2: Available │ │ ││ Shard 2: Available │ │ +│ │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ ││ Shard 3: Available │ │ +│ │ │ │ │ │ │ │ ││ │ │ +│ │ │ │ │ │ │ │ ││ │ │ +│ │ └─────────────────────────┘ │ └───────────────────────┘ │ │└──────────────────────┘ │ +│ │ │ │ │ │ +└──────────────────────────┴─────────────────────────────┴───────────────────────────┴─────────────────────────┴─────────────────────────┘ +``` + +## Cluster State Transitions - Placement Updates Initiation + +The diagram below depicts the sequence of events that happen during a node replace and illustrates which entity is performing the placement update (in etcd) at each step. + +``` + ┌────────────────────────────────┐ + │ Node A │ + │ │ + │ Shard 1: Available │ + │ Shard 2: Available │ Operator performs node replace by + │ Shard 3: Available │ updating placement in etcd such + │ │ that shards on node A are marked + └────────────────────────────────┤ Leaving and shards on node B are + │ marked Initializing + └─────────────────────────────────┐ + │ + │ + │ + │ + │ + ▼ + ┌────────────────────────────────┐ + │ Node A │ + │ │ + │ Shard 1: Leaving │ + │ Shard 2: Leaving │ + │ Shard 3: Leaving │ + │ │ + └────────────────────────────────┘ + + ┌────────────────────────────────┐ + │ Node B │ + │ │ + │ Shard 1: Initializing │ +┌────────────────────────────────┐ │ Shard 2: Initializing │ +│ │ │ Shard 3: Initializing │ +│ │ │ │ +│ Node A │ └────────────────────────────────┘ +│ │ │ +│ │ │ +│ │ │ +└────────────────────────────────┘ │ + │ +┌────────────────────────────────┐ │ +│ Node B │ │ +│ │ │ +│ Shard 1: Available │ Node B completes bootstrapping and +│ Shard 2: Available │◀────updates placement (via etcd) to +│ Shard 3: Available │ indicate shard state is Available and +│ │ that Node A should no longer own any shards +└────────────────────────────────┘ +``` diff --git a/mkdocs.yml b/mkdocs.yml index 
a107cbbe56..1c02c9bcf4 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -61,6 +61,9 @@ pages: - "M3DB Single Node Deployment": "how_to/single_node.md" - "M3DB Cluster Deployment, Manually": "how_to/cluster_hard_way.md" - "M3DB on Kubernetes": "how_to/kubernetes.md" + - "Operational Guides": + - "Placement / Topology": "operational_guide/placement.md" + - "Bootstrapping": "operational_guide/bootstrapping.md" - "Integrations": - "Prometheus": "integrations/prometheus.md" - "Troubleshooting": "troubleshooting/index.md"