From d5dd6eb6ef94d06d2ef15e0ba7dc620a503588ba Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Thu, 20 Sep 2018 14:20:38 -0400 Subject: [PATCH 01/32] Add topology docs --- docs/operational_guide/topology.md | 137 +++++++++++++++++++++++++++++ 1 file changed, 137 insertions(+) create mode 100644 docs/operational_guide/topology.md diff --git a/docs/operational_guide/topology.md b/docs/operational_guide/topology.md new file mode 100644 index 0000000000..a1366d8bc0 --- /dev/null +++ b/docs/operational_guide/topology.md @@ -0,0 +1,137 @@ +# Topology + +## Overview + +M3DB stores its topology (mapping of which hosts are responsible for which shards) in EtcD. There are three possible states that each host/shard pair can be in: + +1. Initializing +2. Available +3. Leaving + +Note that these states are not a reflection of the current status of an M3DB node, but an indicating of whether a given node has ever successfully bootstrapped and taken ownership of a given shard. For example, in a new cluster all the nodes will begin with all of their shards in the Initializing state. Once all the nodes finish bootstrapping, they will mark all of their shards as Available. If all the M3DB nodes are stopped at the same time, the cluster topology will still show all of the shards for all of the hosts as Available. + +## Initializing State + +The initializing state is the state in which all new host/shard combinations begin. For example, upon creating a new topology all the host/shard pairs will begin in the "Initializing" state and only once they have successfully bootstrapped will they transition to the "Available" state. + +The Initializing state is not limited to new topology, however, as it can also occur during topology changes. For example, during a node add/replace the new host will begin with all of its shards in the Initializing state until it can stream the data it is missing from its peers. During a node removal, all of the hosts who receive new shards (as a result of taking over the responsibilities of the node that is leaving) will begin with those shards marked as Initializing until they can stream in the data from the node leaving the cluster, or one of its peers. + +## Available State + +Once a node with a shard in the Initializing state successfully bootstraps all of the data for that shard, it will mark that shard as Available (for the single host) in the cluster topology. + +## Leaving State + +The leaving state indicates that a node is attempting to leave the cluster. The purpose of this state is to allow the node to remain in the cluster long enough for thhe nodes that are taking over its responsibilities to stream data from it. 
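For readers who find it easier to reason about the host/shard states in code, the following Go sketch models the three states and the bootstrap-driven transition described above. The type and function names are invented for this illustration and are not part of M3DB's actual API.

```go
package main

import "fmt"

// ShardState is an illustrative model of the three host/shard states
// described above; it is not M3DB's actual type.
type ShardState int

const (
	Initializing ShardState = iota // newly assigned, not yet bootstrapped by this host
	Available                      // this host has bootstrapped the shard at least once
	Leaving                        // this host is handing the shard off to its peers
)

func (s ShardState) String() string {
	return [...]string{"Initializing", "Available", "Leaving"}[s]
}

// markBootstrapped models the only transition a host performs on its own:
// once it finishes bootstrapping a shard it holds in Initializing, it marks
// that shard Available. Available and Leaving shards are left unchanged.
func markBootstrapped(state ShardState) ShardState {
	if state == Initializing {
		return Available
	}
	return state
}

func main() {
	for _, s := range []ShardState{Initializing, Available, Leaving} {
		fmt.Printf("%-12v -> %v\n", s, markBootstrapped(s))
	}
}
```

The tables in the following sections trace exactly these transitions across node add, remove, and replace operations.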
+ + +## Sample Cluster State Transitions - Node Add + +### Initial Topology + +| Node | Shard | State | +|------|-------|-------| +| 1 | 1 | A | +| 1 | 2 | A | +| 2 | 2 | A | +| 2 | 3 | A | +| 3 | 1 | A | +| 3 | 3 | A | + +### Begin Node Add + +| Node | Shard | State | +|------|-------|-------| +| 1 | 1 | A | +| 1 | 2 | L | +| 2 | 2 | A | +| 2 | 3 | L | +| 3 | 1 | A | +| 3 | 3 | A | +| 4 | 1 | I | +| 4 | 3 | I | + +### Complete Node Add + +| Node | Shard | State | +|------|-------|-------| +| 1 | 1 | A | +| 2 | 2 | A | +| 3 | 1 | A | +| 3 | 3 | A | +| 4 | 1 | A | +| 4 | 3 | A | + +## Sample Cluster State Transitions - Node Add + +### Initial Topology + +| Node | Shard | State | +|------|-------|-------| +| 1 | 1 | A | +| 1 | 2 | A | +| 2 | 2 | A | +| 2 | 3 | A | +| 3 | 1 | A | +| 3 | 3 | A | + +### Begin Node Remove + +| Node | Shard | State | +|------|-------|-------| +| 1 | 1 | A | +| 1 | 2 | A | +| 1 | 3 | I | +| 2 | 1 | I | +| 2 | 2 | A | +| 2 | 3 | A | +| 3 | 1 | L | +| 3 | 3 | L | + +### Complete Node Remove + +| Node | Shard | State | +|------|-------|-------| +| 1 | 1 | A | +| 1 | 2 | A | +| 1 | 3 | A | +| 2 | 1 | A | +| 2 | 2 | A | +| 2 | 3 | A | + +## Sample Cluster State Transitions - Node Replace + +### Initial Topology + +| Node | Shard | State | +|------|-------|-------| +| 1 | 1 | A | +| 1 | 2 | A | +| 2 | 2 | A | +| 2 | 3 | A | +| 3 | 1 | A | +| 3 | 3 | A | + +### Begin Node Replace + +| Node | Shard | State | +|------|-------|-------| +| 1 | 1 | L | +| 1 | 2 | L | +| 1 | 3 | A | +| 2 | 1 | A | +| 2 | 2 | A | +| 2 | 3 | A | +| 3 | 1 | I | +| 3 | 3 | I | + +### Complete Node Replace + +| Node | Shard | State | +|------|-------|-------| +| 1 | 3 | A | +| 2 | 1 | A | +| 2 | 2 | A | +| 2 | 3 | A | +| 3 | 1 | A | +| 3 | 3 | A | \ No newline at end of file From 932530991e449e1470c6ee0e1a7da119f7d53c93 Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Thu, 20 Sep 2018 16:06:47 -0400 Subject: [PATCH 02/32] Add operational guides for bootstrapping and improve topology guide --- docs/operational_guide/bootstrapping.md | 104 ++++++++++++++++++++++++ docs/operational_guide/topology.md | 84 +++++++++++++------ 2 files changed, 164 insertions(+), 24 deletions(-) create mode 100644 docs/operational_guide/bootstrapping.md diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md new file mode 100644 index 0000000000..fe7f159172 --- /dev/null +++ b/docs/operational_guide/bootstrapping.md @@ -0,0 +1,104 @@ +# Bootstrapping + +## Introduction + +We recommend reading the topology operational guide(TODO LINK) before reading the rest of this document. + +When an M3DB node is turned on (or experiences a topology change) it needs to go through a bootstrapping process to determine the integrity of data that it has. In most cases, as long as you're running with the default and recommended bootstrapper configuration of: "filesystem,commitlog,peers,uninitialized_topology" then you should not need to worry about the bootstrapping process at all and M3DB will take care of doing the rigth thing such that you don't lose data and its consistency guarantees are met. + +In some rare cases, you may want to modify the bootstrapper configuration. The purpose of this document is to explain how all the different bootstrappers work. and what the implications of the bootstrapping order are. + +M3DB currently supports 5 different bootstrappers: + +1. filesystem +2. commitlog +3. peers +4. uninitialized_topology +5. 
noop_all + +When the bootstrapping process begins, M3DB nodes need to determine two things: + +1. What shards they should bootstrap, which can be determined from the cluster topology +2. What time-ranges they need to bootstrap those shards for, which can be determined from the namespace retention + +For example, imagine an M3DB node that is responsible for shards 1, 5, 13, and 25 according to the cluster topology. In addition, it has a single namespace called "metrics" with a retention of 48 hours. When the M3DB node is started, the node will determine that it needs to bootstrap shards 1, 5, 13, and 25 for the time range starting at the current time and ending 48 hours ago. In order to obtain all this data, it will run the configured bootstrappers in the specified order. Every bootstrapper will notify the bootstrapping process of which shard/ranges it was able to bootstrap and the bootstrapping process will continue working its way through the list of bootstrappers until all the shards/ranges it requires have been marked as fulfilled, otherwise the M3DB node will fail to start. + +### Filesystem Bootstrapper + +The filesystem bootstrapper's responsibility is to determine which immutable filesetfiles(TODO LINK) exist on disk, and if so, mark them as fulfilled. The filesystem bootstrapper achieves this by scanning M3DB's directory structure and determining which fileset files already exist on disk. + +### Commitlog Bootstrapper + +The commitlog bootstrapper's responsibility is to read the commitlog and snapshot (compacted commitlogs) files on disk and recover any data that has not yet been written out as an immutable fileset file. Unlike the filesystem bootstrapper, the commit log bootstrapper cannot simply check which files are on disk in order to determine if it can satisfy a bootstrap request. Instead, the commitlog bootstrapper determines whether it can satisfy a bootstrap request using a simple heuristic. On a shard-by-shard basis, the commitlog bootstrapper will consult the cluster topology to see if the host it is running on has ever achieved the "Available" status for the specified shard. If so, then the commit log bootstrapper should have all the data since the last fileset file was flushed and will return that it can satisfy any time range for that shard. In other words, the commit log bootstrapper is all-or-nothing for a given shard, it will either return that it can satisfy any time range for a given shard or none at all. + +### Peers Bootstrapper + +The peer bootstrapper's responsibility is to stream in data for shard/ranges from other M3DB nodes (peers) in the cluster. This bootstrapper is only useful in M3DB clusters with more than a single node *and* where the replication factor is set to a value larger than 1. The peers bootstrapper will determine whether or not it can satisfy a bootstrap request on a shard-by-shard basis by consulting the cluster topology and determining if there are enough peers to satisfy the bootstrap request. For example, imagine the following M3DB topology where node 1 is trying to perform a peer bootstrap: + +| Node | Shard | State | +|------|-------|-------| +| 1 | 1 | I | +| 1 | 2 | I | +| 1 | 3 | I | +| 2 | 1 | I | +| 2 | 2 | I | +| 2 | 3 | I | +| 3 | 1 | A | +| 3 | 2 | A | +| 3 | 3 | A | + +In this case, the peer bootstrapper running on node 1 will not be able to fullfill any requests because node 2 is in the Initializing state for all of its shards and cannot fulfill bootstrap requests. 
This means that node 1's peer bootstrapper cannot meet its default consistency level of majority for bootstrapping (1 < 2 which is majority with a replication factor of 3). On the otherhand, node 1 would be able to peer bootstrap in the following topology because its peers (nodes 2/3) are available for all of their shards: + +| Node | Shard | State | +|------|-------|-------| +| 1 | 1 | I | +| 1 | 2 | I | +| 1 | 3 | I | +| 2 | 1 | A | +| 2 | 2 | A | +| 2 | 3 | A | +| 3 | 1 | A | +| 3 | 2 | A | +| 3 | 3 | A | + +Note that a bootstrap consistency level of majority is the default value, but can be modified by changing the value of the key "m3db.client.bootstrap-consistency-level" in EtcD to one of: "none", "one", "unstrict_majority" (attempt to read from majority, but settle for less if any errors occur), "majority" (strict majority), and "all". + +**Note**: Any bootstrappers configuration that does not include the peers bootstrapper will be unable to handle dynamic topology changes of any kind. + +### Uninitialized Topology Bootstrapper + +The purpose of the uninitialzied topology bootstrapper is to succeed bootstraps for all time ranges for shards that have never been completely bootstrapped. This allows us to run the default bootstrapper configuration of: filesystem,commitlog,peers,topology_uninitialized such that filesystem and commitlog are used by default in node restarts, peer bootstrapper is only used for node adds/removes/replaces, and bootstraps still succeed for brand new topologies where both the commitlog and peers bootstrappers will be unable to succeed any bootstraps. In other words, the uninitialized topology bootstrapper allows us to place the commitlog bootstrapper *before* the peers bootstrapper and still succeed bootstraps with brand new topologies without resorting to using the noop-all bootstrapper which suceeds bootstraps for all shard/time-ranges regardless of the status of the topology. + +The uninitialized topology bootstrapper determines whether a topology is "new" for a given shard by counting the number of hosts in the Initializing state and Leaving states and if the number of Initializing - Leaving > 0 than it succeeds the bootstrap because that means the topology has never reached a state where all hosts are Available. + +### No Operational All Bootstrapper + +The noop_all bootstrapper succeeds all bootstraps regardless of requests shards/time ranges. + +### Bootstrappers Configuration + +Now that we've gone over the various bootstrappers, lets consider how M3DB will behave in different configurations. Note that we include uninitialized_topology at the end of all the lists of bootstrappers because its required to get a new topology up and running in the first place, but is not required after that (although leaving it in has no detrimental effects). Also note that any configuration that does not include the peers bootstrapper will not be able to handle dynamic topology changes like node adds/removes/replaces. + +#### filesystem,commitlog,peers,uninitialized_topology (default) + +This is the default bootstrappers configuration for M3DB and will behave "as expected" in the sense that it will maintain M3DB's consistency guarantees. I.E Successful quorum writes will always be readable via quorum reads. + +In the general case, the node will use only the filesystem and commitlog bootstrappers on node startup. 
However, in the case of a node add/remove/replace, the commitlog bootstrapper will detect that it is unable to fulfill the bootstrap request (because the node has never reached the Available state) and defer to the peers bootstrapper to stream in the data. + +Additionally, if it is a brand new topology where even the peers bootstrapper cannot fulfill the bootstrap, this will be detected by the uninitialized_topology bootstrapper which will succeed the bootstrap. + +#### filesystem,commitlog,uninitialized_topology + +This bootstrapping configuration will work just fine if nodes are never added/replaced/removed, but will fail when attempting a node add/replace/remove. + +#### peers,uninitialized_topology + +Everytime a node is restarted, it will attempt to stream in *all* of the data that it IS responsible for from its peers, completely ignoring the immutable fileset files it already has on disk. We do not recommend running in this mode as it can lead to violations of M3DB's consistency guarantees due to the fact that the commit logs are being ignored, however, it *can* be useful if you want to repair the data on a node by forcing it to stream from its peers. + +#### filesystem,uninitialized_topology + +Everytime a node is restarted it will utilize the immutable fileset files its already written out to disk, but any data that it had received since it wrote out the last set of immutable files will be lost. + +#### commitlog,uninitialized_topology + +Everytime a node is restarted it will read all the commit log and snapshot files it has on disk, but it will ignore all the data in the immutable fileset files that it has already written. diff --git a/docs/operational_guide/topology.md b/docs/operational_guide/topology.md index a1366d8bc0..b395f8173c 100644 --- a/docs/operational_guide/topology.md +++ b/docs/operational_guide/topology.md @@ -27,66 +27,87 @@ The leaving state indicates that a node is attempting to leave the cluster. 
The ## Sample Cluster State Transitions - Node Add +Replication factor: 3 + ### Initial Topology | Node | Shard | State | |------|-------|-------| | 1 | 1 | A | | 1 | 2 | A | +| 1 | 3 | A | +| 2 | 1 | A | | 2 | 2 | A | | 2 | 3 | A | | 3 | 1 | A | +| 3 | 2 | A | | 3 | 3 | A | ### Begin Node Add | Node | Shard | State | |------|-------|-------| -| 1 | 1 | A | -| 1 | 2 | L | -| 2 | 2 | A | -| 2 | 3 | L | +| 1 | 1 | L | +| 1 | 2 | A | +| 1 | 3 | A | +| 2 | 1 | A | +| 2 | 2 | L | +| 2 | 3 | A | | 3 | 1 | A | -| 3 | 3 | A | +| 3 | 2 | A | +| 3 | 3 | L | | 4 | 1 | I | +| 4 | 2 | I | | 4 | 3 | I | ### Complete Node Add | Node | Shard | State | |------|-------|-------| -| 1 | 1 | A | -| 2 | 2 | A | +| 1 | 2 | A | +| 1 | 3 | A | +| 2 | 1 | A | +| 2 | 3 | A | | 3 | 1 | A | -| 3 | 3 | A | +| 3 | 2 | A | | 4 | 1 | A | +| 4 | 2 | A | | 4 | 3 | A | -## Sample Cluster State Transitions - Node Add +## Sample Cluster State Transitions - Node Remove + +Replication factor: 3 ### Initial Topology | Node | Shard | State | |------|-------|-------| -| 1 | 1 | A | | 1 | 2 | A | -| 2 | 2 | A | +| 1 | 3 | A | +| 2 | 1 | A | | 2 | 3 | A | | 3 | 1 | A | -| 3 | 3 | A | +| 3 | 2 | A | +| 4 | 1 | L | +| 4 | 2 | L | +| 4 | 3 | L | ### Begin Node Remove | Node | Shard | State | |------|-------|-------| -| 1 | 1 | A | +| 1 | 1 | I | | 1 | 2 | A | -| 1 | 3 | I | -| 2 | 1 | I | -| 2 | 2 | A | +| 1 | 3 | A | +| 2 | 1 | A | +| 2 | 2 | I | | 2 | 3 | A | -| 3 | 1 | L | -| 3 | 3 | L | +| 3 | 1 | A | +| 3 | 2 | A | +| 3 | 3 | I | +| 4 | 1 | L | +| 4 | 2 | L | +| 4 | 3 | L | ### Complete Node Remove @@ -98,40 +119,55 @@ The leaving state indicates that a node is attempting to leave the cluster. The | 2 | 1 | A | | 2 | 2 | A | | 2 | 3 | A | +| 3 | 1 | A | +| 3 | 2 | A | +| 3 | 3 | A | ## Sample Cluster State Transitions - Node Replace +Replication factor: 3 + ### Initial Topology | Node | Shard | State | |------|-------|-------| | 1 | 1 | A | | 1 | 2 | A | +| 1 | 3 | A | +| 2 | 1 | A | | 2 | 2 | A | | 2 | 3 | A | | 3 | 1 | A | +| 3 | 2 | A | | 3 | 3 | A | ### Begin Node Replace | Node | Shard | State | |------|-------|-------| -| 1 | 1 | L | -| 1 | 2 | L | +| 1 | 1 | A | +| 1 | 2 | A | | 1 | 3 | A | | 2 | 1 | A | | 2 | 2 | A | | 2 | 3 | A | -| 3 | 1 | I | -| 3 | 3 | I | +| 3 | 1 | L | +| 3 | 2 | L | +| 3 | 3 | L | +| 4 | 1 | I | +| 4 | 2 | I | +| 4 | 3 | I | ### Complete Node Replace | Node | Shard | State | |------|-------|-------| +| 1 | 1 | A | +| 1 | 2 | A | | 1 | 3 | A | | 2 | 1 | A | | 2 | 2 | A | | 2 | 3 | A | -| 3 | 1 | A | -| 3 | 3 | A | \ No newline at end of file +| 4 | 1 | A | +| 4 | 2 | A | +| 4 | 3 | A | \ No newline at end of file From 93a7fe8ebd43c7f690d8f3d88508cfdb38d67d18 Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Thu, 20 Sep 2018 16:08:31 -0400 Subject: [PATCH 03/32] Add links --- docs/operational_guide/bootstrapping.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index fe7f159172..878e6f77ac 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -2,7 +2,7 @@ ## Introduction -We recommend reading the topology operational guide(TODO LINK) before reading the rest of this document. +We recommend reading the [topology operational guide](topology.md) before reading the rest of this document. When an M3DB node is turned on (or experiences a topology change) it needs to go through a bootstrapping process to determine the integrity of data that it has. 
In most cases, as long as you're running with the default and recommended bootstrapper configuration of: "filesystem,commitlog,peers,uninitialized_topology" then you should not need to worry about the bootstrapping process at all and M3DB will take care of doing the rigth thing such that you don't lose data and its consistency guarantees are met. @@ -25,7 +25,7 @@ For example, imagine an M3DB node that is responsible for shards 1, 5, 13, and 2 ### Filesystem Bootstrapper -The filesystem bootstrapper's responsibility is to determine which immutable filesetfiles(TODO LINK) exist on disk, and if so, mark them as fulfilled. The filesystem bootstrapper achieves this by scanning M3DB's directory structure and determining which fileset files already exist on disk. +The filesystem bootstrapper's responsibility is to determine which immutable [fileset files](../m3db/architecture/storage.md) exist on disk, and if so, mark them as fulfilled. The filesystem bootstrapper achieves this by scanning M3DB's directory structure and determining which fileset files already exist on disk. ### Commitlog Bootstrapper From 43a51215b1d4d822967c32f3b5d55dccdcea9487 Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Thu, 20 Sep 2018 16:12:46 -0400 Subject: [PATCH 04/32] add clarification --- docs/operational_guide/bootstrapping.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index 878e6f77ac..4e5883694e 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -4,7 +4,7 @@ We recommend reading the [topology operational guide](topology.md) before reading the rest of this document. -When an M3DB node is turned on (or experiences a topology change) it needs to go through a bootstrapping process to determine the integrity of data that it has. In most cases, as long as you're running with the default and recommended bootstrapper configuration of: "filesystem,commitlog,peers,uninitialized_topology" then you should not need to worry about the bootstrapping process at all and M3DB will take care of doing the rigth thing such that you don't lose data and its consistency guarantees are met. +When an M3DB node is turned on (or experiences a topology change) it needs to go through a bootstrapping process to determine the integrity of data that it has, replay writes from the commit log, and/or stream missing data from its peers. In most cases, as long as you're running with the default and recommended bootstrapper configuration of: "filesystem,commitlog,peers,uninitialized_topology" then you should not need to worry about the bootstrapping process at all and M3DB will take care of doing the rigth thing such that you don't lose data and its consistency guarantees are met. In some rare cases, you may want to modify the bootstrapper configuration. The purpose of this document is to explain how all the different bootstrappers work. and what the implications of the bootstrapping order are. 
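To make the ordering semantics described in these patches concrete, here is a minimal Go sketch of the fulfillment loop: each configured bootstrapper is consulted in order, fulfills whatever shard/time-ranges it can, and hands the remainder to the next bootstrapper; if anything is still unfulfilled at the end, the node fails to start. All type and function names below are invented for this example and do not correspond to M3DB's internal bootstrap interfaces.

```go
package main

import (
	"errors"
	"fmt"
)

// shardRange identifies one shard/time-range that still needs to be
// bootstrapped. These types are illustrative only.
type shardRange struct {
	shard      uint32
	start, end int64 // unix seconds
}

// A bootstrapper takes the pending shard/time-ranges and returns whatever it
// could not fulfill (filesystem, commitlog, peers, ... each behave this way).
type bootstrapper struct {
	name    string
	fulfill func(pending []shardRange) (remaining []shardRange)
}

// bootstrap runs the configured bootstrappers in order, handing whatever is
// still unfulfilled to the next one, and fails if anything is left over --
// which is why the order of the configured list matters.
func bootstrap(order []bootstrapper, pending []shardRange) error {
	for _, b := range order {
		before := len(pending)
		pending = b.fulfill(pending)
		fmt.Printf("%s fulfilled %d range(s), %d remaining\n", b.name, before-len(pending), len(pending))
		if len(pending) == 0 {
			return nil
		}
	}
	return errors.New("bootstrap incomplete: node would fail to start")
}

func main() {
	pending := []shardRange{{shard: 1, start: 0, end: 3600}, {shard: 5, start: 0, end: 3600}}
	order := []bootstrapper{
		{name: "filesystem", fulfill: func(p []shardRange) []shardRange { return p }}, // nothing on disk yet
		{name: "peers", fulfill: func(p []shardRange) []shardRange { return nil }},    // peers stream everything
	}
	if err := bootstrap(order, pending); err != nil {
		fmt.Println(err)
	}
}
```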
From 5241eb1c29ab8d4e384cf7c009df0114e0026fec Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Thu, 20 Sep 2018 16:13:23 -0400 Subject: [PATCH 05/32] Fix typo --- docs/operational_guide/bootstrapping.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index 4e5883694e..aaf0be265f 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -4,7 +4,7 @@ We recommend reading the [topology operational guide](topology.md) before reading the rest of this document. -When an M3DB node is turned on (or experiences a topology change) it needs to go through a bootstrapping process to determine the integrity of data that it has, replay writes from the commit log, and/or stream missing data from its peers. In most cases, as long as you're running with the default and recommended bootstrapper configuration of: "filesystem,commitlog,peers,uninitialized_topology" then you should not need to worry about the bootstrapping process at all and M3DB will take care of doing the rigth thing such that you don't lose data and its consistency guarantees are met. +When an M3DB node is turned on (or experiences a topology change) it needs to go through a bootstrapping process to determine the integrity of data that it has, replay writes from the commit log, and/or stream missing data from its peers. In most cases, as long as you're running with the default and recommended bootstrapper configuration of: "filesystem,commitlog,peers,uninitialized_topology" then you should not need to worry about the bootstrapping process at all and M3DB will take care of doing the right thing such that you don't lose data and consistency guarantees are met. In some rare cases, you may want to modify the bootstrapper configuration. The purpose of this document is to explain how all the different bootstrappers work. and what the implications of the bootstrapping order are. From 781cdb84d093eccac1e63e7a8228e2160529e51c Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Thu, 20 Sep 2018 16:13:47 -0400 Subject: [PATCH 06/32] rewrite sentence --- docs/operational_guide/bootstrapping.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index aaf0be265f..1d6da0613d 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -6,7 +6,7 @@ We recommend reading the [topology operational guide](topology.md) before readin When an M3DB node is turned on (or experiences a topology change) it needs to go through a bootstrapping process to determine the integrity of data that it has, replay writes from the commit log, and/or stream missing data from its peers. In most cases, as long as you're running with the default and recommended bootstrapper configuration of: "filesystem,commitlog,peers,uninitialized_topology" then you should not need to worry about the bootstrapping process at all and M3DB will take care of doing the right thing such that you don't lose data and consistency guarantees are met. -In some rare cases, you may want to modify the bootstrapper configuration. The purpose of this document is to explain how all the different bootstrappers work. and what the implications of the bootstrapping order are. +In some rare cases, you may want to modify the bootstrapper configuration. The purpose of this document is to explain how all the different bootstrappers work. 
and what the implications of changing the bootstrappers order is. M3DB currently supports 5 different bootstrappers: From ffa7e4038dd414098707e2753ccebe9c330f97c7 Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Thu, 20 Sep 2018 16:14:10 -0400 Subject: [PATCH 07/32] Add periods --- docs/operational_guide/bootstrapping.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index 1d6da0613d..1234275d8b 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -18,8 +18,8 @@ M3DB currently supports 5 different bootstrappers: When the bootstrapping process begins, M3DB nodes need to determine two things: -1. What shards they should bootstrap, which can be determined from the cluster topology -2. What time-ranges they need to bootstrap those shards for, which can be determined from the namespace retention +1. What shards they should bootstrap, which can be determined from the cluster topology. +2. What time-ranges they need to bootstrap those shards for, which can be determined from the namespace retention. For example, imagine an M3DB node that is responsible for shards 1, 5, 13, and 25 according to the cluster topology. In addition, it has a single namespace called "metrics" with a retention of 48 hours. When the M3DB node is started, the node will determine that it needs to bootstrap shards 1, 5, 13, and 25 for the time range starting at the current time and ending 48 hours ago. In order to obtain all this data, it will run the configured bootstrappers in the specified order. Every bootstrapper will notify the bootstrapping process of which shard/ranges it was able to bootstrap and the bootstrapping process will continue working its way through the list of bootstrappers until all the shards/ranges it requires have been marked as fulfilled, otherwise the M3DB node will fail to start. From 7ea2a52ce5dd0bf236de88eaaacc62e7c3e44ae9 Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Thu, 20 Sep 2018 16:15:07 -0400 Subject: [PATCH 08/32] Fix punctuation --- docs/operational_guide/bootstrapping.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index 1234275d8b..ec64f23c51 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -21,7 +21,7 @@ When the bootstrapping process begins, M3DB nodes need to determine two things: 1. What shards they should bootstrap, which can be determined from the cluster topology. 2. What time-ranges they need to bootstrap those shards for, which can be determined from the namespace retention. -For example, imagine an M3DB node that is responsible for shards 1, 5, 13, and 25 according to the cluster topology. In addition, it has a single namespace called "metrics" with a retention of 48 hours. When the M3DB node is started, the node will determine that it needs to bootstrap shards 1, 5, 13, and 25 for the time range starting at the current time and ending 48 hours ago. In order to obtain all this data, it will run the configured bootstrappers in the specified order. Every bootstrapper will notify the bootstrapping process of which shard/ranges it was able to bootstrap and the bootstrapping process will continue working its way through the list of bootstrappers until all the shards/ranges it requires have been marked as fulfilled, otherwise the M3DB node will fail to start. 
+For example, imagine an M3DB node that is responsible for shards 1, 5, 13, and 25 according to the cluster topology. In addition, it has a single namespace called "metrics" with a retention of 48 hours. When the M3DB node is started, the node will determine that it needs to bootstrap shards 1, 5, 13, and 25 for the time range starting at the current time and ending 48 hours ago. In order to obtain all this data, it will run the configured bootstrappers in the specified order. Every bootstrapper will notify the bootstrapping process of which shard/ranges it was able to bootstrap and the bootstrapping process will continue working its way through the list of bootstrappers until all the shards/ranges it requires have been marked as fulfilled. Otherwise the M3DB node will fail to start. ### Filesystem Bootstrapper From 25ebf86d1187c9f5c853c96cd8ffc48f9ec53fba Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Thu, 20 Sep 2018 16:16:48 -0400 Subject: [PATCH 09/32] Clarify filesystem bootstrapper --- docs/operational_guide/bootstrapping.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index ec64f23c51..5fdf53d836 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -25,7 +25,7 @@ For example, imagine an M3DB node that is responsible for shards 1, 5, 13, and 2 ### Filesystem Bootstrapper -The filesystem bootstrapper's responsibility is to determine which immutable [fileset files](../m3db/architecture/storage.md) exist on disk, and if so, mark them as fulfilled. The filesystem bootstrapper achieves this by scanning M3DB's directory structure and determining which fileset files already exist on disk. +The filesystem bootstrapper's responsibility is to determine which immutable [fileset files](../m3db/architecture/storage.md) exist on disk, and if so, mark them as fulfilled. The filesystem bootstrapper achieves this by scanning M3DB's directory structure and determining which fileset files already exist on disk. Unlike the other bootstrappers, the filesystem bootstrapper does not need to load any data into memory, it simply verifies the checksums of the data on disk and the M3DB node itself will handle reading (and caching) the data dynamically once it begins to serve reads. ### Commitlog Bootstrapper From f653b170841a0a0c496a2b9ee66b9f724b28dd5d Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Thu, 20 Sep 2018 16:17:29 -0400 Subject: [PATCH 10/32] Break into separate paragraph --- docs/operational_guide/bootstrapping.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index 5fdf53d836..cb3bb9336a 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -29,7 +29,9 @@ The filesystem bootstrapper's responsibility is to determine which immutable [fi ### Commitlog Bootstrapper -The commitlog bootstrapper's responsibility is to read the commitlog and snapshot (compacted commitlogs) files on disk and recover any data that has not yet been written out as an immutable fileset file. Unlike the filesystem bootstrapper, the commit log bootstrapper cannot simply check which files are on disk in order to determine if it can satisfy a bootstrap request. Instead, the commitlog bootstrapper determines whether it can satisfy a bootstrap request using a simple heuristic. 
On a shard-by-shard basis, the commitlog bootstrapper will consult the cluster topology to see if the host it is running on has ever achieved the "Available" status for the specified shard. If so, then the commit log bootstrapper should have all the data since the last fileset file was flushed and will return that it can satisfy any time range for that shard. In other words, the commit log bootstrapper is all-or-nothing for a given shard, it will either return that it can satisfy any time range for a given shard or none at all. +The commitlog bootstrapper's responsibility is to read the commitlog and snapshot (compacted commitlogs) files on disk and recover any data that has not yet been written out as an immutable fileset file. Unlike the filesystem bootstrapper, the commit log bootstrapper cannot simply check which files are on disk in order to determine if it can satisfy a bootstrap request. Instead, the commitlog bootstrapper determines whether it can satisfy a bootstrap request using a simple heuristic. + +On a shard-by-shard basis, the commitlog bootstrapper will consult the cluster topology to see if the host it is running on has ever achieved the "Available" status for the specified shard. If so, then the commit log bootstrapper should have all the data since the last fileset file was flushed and will return that it can satisfy any time range for that shard. In other words, the commit log bootstrapper is all-or-nothing for a given shard, it will either return that it can satisfy any time range for a given shard or none at all. ### Peers Bootstrapper From ad36b8ef63a64718153a8286d46fb67178a80281 Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Thu, 20 Sep 2018 16:19:32 -0400 Subject: [PATCH 11/32] Add clarification for commitlog bootstrap --- docs/operational_guide/bootstrapping.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index cb3bb9336a..4d76432d6e 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -31,7 +31,7 @@ The filesystem bootstrapper's responsibility is to determine which immutable [fi The commitlog bootstrapper's responsibility is to read the commitlog and snapshot (compacted commitlogs) files on disk and recover any data that has not yet been written out as an immutable fileset file. Unlike the filesystem bootstrapper, the commit log bootstrapper cannot simply check which files are on disk in order to determine if it can satisfy a bootstrap request. Instead, the commitlog bootstrapper determines whether it can satisfy a bootstrap request using a simple heuristic. -On a shard-by-shard basis, the commitlog bootstrapper will consult the cluster topology to see if the host it is running on has ever achieved the "Available" status for the specified shard. If so, then the commit log bootstrapper should have all the data since the last fileset file was flushed and will return that it can satisfy any time range for that shard. In other words, the commit log bootstrapper is all-or-nothing for a given shard, it will either return that it can satisfy any time range for a given shard or none at all. +On a shard-by-shard basis, the commitlog bootstrapper will consult the cluster topology to see if the host it is running on has ever achieved the "Available" status for the specified shard. 
If so, then the commit log bootstrapper should have all the data since the last fileset file was flushed and will return that it can satisfy any time range for that shard. In other words, the commit log bootstrapper is all-or-nothing for a given shard: it will either return that it can satisfy any time range for a given shard or none at all. In addition, the commitlog bootstrapper *assumes* it is running after the filesystem bootstrapper. M3DB will not allow you to run with a configuration where the filesystem bootstrappe is placed after the commitlog bootstrapper, but it will allow you to run the commitlog bootstrapper without the filesystem bootstrapper which can result in loss of data, depending on the workload. ### Peers Bootstrapper From 3b2ed8c8a72ead3bb798780077e721103633eb0b Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Thu, 20 Sep 2018 16:20:44 -0400 Subject: [PATCH 12/32] Improve introduction --- docs/operational_guide/bootstrapping.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index 4d76432d6e..2c1ab4cec9 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -6,7 +6,7 @@ We recommend reading the [topology operational guide](topology.md) before readin When an M3DB node is turned on (or experiences a topology change) it needs to go through a bootstrapping process to determine the integrity of data that it has, replay writes from the commit log, and/or stream missing data from its peers. In most cases, as long as you're running with the default and recommended bootstrapper configuration of: "filesystem,commitlog,peers,uninitialized_topology" then you should not need to worry about the bootstrapping process at all and M3DB will take care of doing the right thing such that you don't lose data and consistency guarantees are met. -In some rare cases, you may want to modify the bootstrapper configuration. The purpose of this document is to explain how all the different bootstrappers work. and what the implications of changing the bootstrappers order is. +Generally speaking, we recommend that operators do not modify the bootstrappers configuration, but in the rare case that you need to this document is designed to help you understand the implications of doing so. M3DB currently supports 5 different bootstrappers: From 5f3a2f319fef4b9bbe038b58fc390ddee858e939 Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Thu, 20 Sep 2018 16:22:15 -0400 Subject: [PATCH 13/32] Fix typo --- docs/operational_guide/bootstrapping.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index 2c1ab4cec9..8301fdd9d8 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -49,7 +49,7 @@ The peer bootstrapper's responsibility is to stream in data for shard/ranges fro | 3 | 2 | A | | 3 | 3 | A | -In this case, the peer bootstrapper running on node 1 will not be able to fullfill any requests because node 2 is in the Initializing state for all of its shards and cannot fulfill bootstrap requests. This means that node 1's peer bootstrapper cannot meet its default consistency level of majority for bootstrapping (1 < 2 which is majority with a replication factor of 3). 
On the otherhand, node 1 would be able to peer bootstrap in the following topology because its peers (nodes 2/3) are available for all of their shards: +In this case, the peer bootstrapper running on node 1 will not be able to fullfill any requests because node 2 is in the Initializing state for all of its shards and cannot fulfill bootstrap requests. This means that node 1's peer bootstrapper cannot meet its default consistency level of majority for bootstrapping (1 < 2 which is majority with a replication factor of 3). On the other hand, node 1 would be able to peer bootstrap in the following topology because its peers (nodes 2/3) are available for all of their shards: | Node | Shard | State | |------|-------|-------| From 05cfba8ebe8d45d7694f6655f7f69cb6d2591575 Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Thu, 20 Sep 2018 16:24:43 -0400 Subject: [PATCH 14/32] more fixes --- docs/operational_guide/bootstrapping.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index 8301fdd9d8..3dad07fe51 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -4,7 +4,7 @@ We recommend reading the [topology operational guide](topology.md) before reading the rest of this document. -When an M3DB node is turned on (or experiences a topology change) it needs to go through a bootstrapping process to determine the integrity of data that it has, replay writes from the commit log, and/or stream missing data from its peers. In most cases, as long as you're running with the default and recommended bootstrapper configuration of: "filesystem,commitlog,peers,uninitialized_topology" then you should not need to worry about the bootstrapping process at all and M3DB will take care of doing the right thing such that you don't lose data and consistency guarantees are met. +When an M3DB node is turned on (or experiences a topology change) it needs to go through a bootstrapping process to determine the integrity of data that it has, replay writes from the commit log, and/or stream missing data from its peers. In most cases, as long as you're running with the default and recommended bootstrapper configuration of: `filesystem,commitlog,peers,uninitialized_topology` then you should not need to worry about the bootstrapping process at all and M3DB will take care of doing the right thing such that you don't lose data and consistency guarantees are met. Generally speaking, we recommend that operators do not modify the bootstrappers configuration, but in the rare case that you need to this document is designed to help you understand the implications of doing so. @@ -69,7 +69,7 @@ Note that a bootstrap consistency level of majority is the default value, but ca ### Uninitialized Topology Bootstrapper -The purpose of the uninitialzied topology bootstrapper is to succeed bootstraps for all time ranges for shards that have never been completely bootstrapped. This allows us to run the default bootstrapper configuration of: filesystem,commitlog,peers,topology_uninitialized such that filesystem and commitlog are used by default in node restarts, peer bootstrapper is only used for node adds/removes/replaces, and bootstraps still succeed for brand new topologies where both the commitlog and peers bootstrappers will be unable to succeed any bootstraps. 
In other words, the uninitialized topology bootstrapper allows us to place the commitlog bootstrapper *before* the peers bootstrapper and still succeed bootstraps with brand new topologies without resorting to using the noop-all bootstrapper which suceeds bootstraps for all shard/time-ranges regardless of the status of the topology. +The purpose of the uninitialzied topology bootstrapper is to succeed bootstraps for all time ranges for shards that have never been completely bootstrapped (at a cluster level). This allows us to run the default bootstrapper configuration of: `filesystem,commitlog,peers,topology_uninitialized` such that the filesystem and commitlog bootstrappers are used by default in node restarts, the peer bootstrapper is used for node adds/removes/replaces, and bootstraps still succeed for brand new topologies where both the commitlog and peers bootstrappers will be unable to succeed any bootstraps. In other words, the uninitialized topology bootstrapper allows us to place the commitlog bootstrapper *before* the peers bootstrapper and still succeed bootstraps with brand new topologies without resorting to using the noop-all bootstrapper which suceeds bootstraps for all shard/time-ranges regardless of the status of the topology. The uninitialized topology bootstrapper determines whether a topology is "new" for a given shard by counting the number of hosts in the Initializing state and Leaving states and if the number of Initializing - Leaving > 0 than it succeeds the bootstrap because that means the topology has never reached a state where all hosts are Available. From b6593cec8e62c03c17c8a41af0329963b518c42f Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Thu, 20 Sep 2018 16:25:51 -0400 Subject: [PATCH 15/32] Clarify default config --- docs/operational_guide/bootstrapping.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index 3dad07fe51..db6ab17794 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -83,7 +83,7 @@ Now that we've gone over the various bootstrappers, lets consider how M3DB will #### filesystem,commitlog,peers,uninitialized_topology (default) -This is the default bootstrappers configuration for M3DB and will behave "as expected" in the sense that it will maintain M3DB's consistency guarantees. I.E Successful quorum writes will always be readable via quorum reads. +This is the default bootstrappers configuration for M3DB and will behave "as expected" in the sense that it will maintain M3DB's consistency guarantees at all times, handle node adds/replaces/removes correctly, and still work with brand new topologies. In the general case, the node will use only the filesystem and commitlog bootstrappers on node startup. However, in the case of a node add/remove/replace, the commitlog bootstrapper will detect that it is unable to fulfill the bootstrap request (because the node has never reached the Available state) and defer to the peers bootstrapper to stream in the data. 
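The "new topology" check described in the patches above lends itself to a few lines of code. The sketch below restates the Initializing/Leaving counting heuristic on a per-shard basis; the names are made up for illustration and are not M3DB's actual implementation.

```go
package main

import "fmt"

// hostState mirrors the three placement states discussed in these patches.
// The types below are illustrative only, not M3DB's implementation.
type hostState int

const (
	initializing hostState = iota
	available
	leaving
)

// shardIsNew applies the heuristic described above: if, for a given shard,
// the number of Initializing hosts minus the number of Leaving hosts is
// greater than zero, the topology has never had all of that shard's hosts
// reach Available, so the uninitialized topology bootstrapper may mark the
// shard's time ranges as fulfilled.
func shardIsNew(states []hostState) bool {
	var numInitializing, numLeaving int
	for _, s := range states {
		switch s {
		case initializing:
			numInitializing++
		case leaving:
			numLeaving++
		}
	}
	return numInitializing-numLeaving > 0
}

func main() {
	brandNew := []hostState{initializing, initializing, initializing}
	nodeReplace := []hostState{available, available, leaving, initializing}
	fmt.Println(shardIsNew(brandNew))    // true: the shard has never been fully bootstrapped
	fmt.Println(shardIsNew(nodeReplace)) // false: an equal number of hosts are leaving
}
```

Note how a node add/replace (one Initializing host matched by one Leaving or existing Available owner) does not count as "new", so those cases still fall through to the peers bootstrapper as intended.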
From 1f793ee2535a2b995e687dbc54308950559a1723 Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Thu, 20 Sep 2018 16:26:19 -0400 Subject: [PATCH 16/32] Fix typo --- docs/operational_guide/bootstrapping.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index db6ab17794..d31d3233b0 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -95,7 +95,7 @@ This bootstrapping configuration will work just fine if nodes are never added/re #### peers,uninitialized_topology -Everytime a node is restarted, it will attempt to stream in *all* of the data that it IS responsible for from its peers, completely ignoring the immutable fileset files it already has on disk. We do not recommend running in this mode as it can lead to violations of M3DB's consistency guarantees due to the fact that the commit logs are being ignored, however, it *can* be useful if you want to repair the data on a node by forcing it to stream from its peers. +Everytime a node is restarted, it will attempt to stream in *all* of the data that it is responsible for from its peers, completely ignoring the immutable fileset files it already has on disk. We do not recommend running in this mode as it can lead to violations of M3DB's consistency guarantees due to the fact that the commit logs are being ignored, however, it *can* be useful if you want to repair the data on a node by forcing it to stream from its peers. #### filesystem,uninitialized_topology From 2e710de74c79dfd4e9b92dfbfcf9b085738bb769 Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Thu, 20 Sep 2018 16:27:01 -0400 Subject: [PATCH 17/32] Fix typo --- docs/operational_guide/topology.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/operational_guide/topology.md b/docs/operational_guide/topology.md index b395f8173c..3285da1765 100644 --- a/docs/operational_guide/topology.md +++ b/docs/operational_guide/topology.md @@ -8,7 +8,7 @@ M3DB stores its topology (mapping of which hosts are responsible for which shard 2. Available 3. Leaving -Note that these states are not a reflection of the current status of an M3DB node, but an indicating of whether a given node has ever successfully bootstrapped and taken ownership of a given shard. For example, in a new cluster all the nodes will begin with all of their shards in the Initializing state. Once all the nodes finish bootstrapping, they will mark all of their shards as Available. If all the M3DB nodes are stopped at the same time, the cluster topology will still show all of the shards for all of the hosts as Available. +Note that these states are not a reflection of the current status of an M3DB node, but an indication of whether a given node has ever successfully bootstrapped and taken ownership of a given shard. For example, in a new cluster all the nodes will begin with all of their shards in the Initializing state. Once all the nodes finish bootstrapping, they will mark all of their shards as Available. If all the M3DB nodes are stopped at the same time, the cluster topology will still show all of the shards for all of the hosts as Available. 
## Initializing State From a74744b889e25e60d55e536907a41c6462f2a42f Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Thu, 20 Sep 2018 16:28:19 -0400 Subject: [PATCH 18/32] fix typo --- docs/operational_guide/topology.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/operational_guide/topology.md b/docs/operational_guide/topology.md index 3285da1765..29985d94d4 100644 --- a/docs/operational_guide/topology.md +++ b/docs/operational_guide/topology.md @@ -22,7 +22,7 @@ Once a node with a shard in the Initializing state successfully bootstraps all o ## Leaving State -The leaving state indicates that a node is attempting to leave the cluster. The purpose of this state is to allow the node to remain in the cluster long enough for thhe nodes that are taking over its responsibilities to stream data from it. +The leaving state indicates that a node is attempting to leave the cluster. The purpose of this state is to allow the node to remain in the cluster long enough for the nodes that are taking over its responsibilities to stream data from it. ## Sample Cluster State Transitions - Node Add From 1874e8483e6a90c3e819f99cd67a72281a602db3 Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Thu, 20 Sep 2018 16:35:06 -0400 Subject: [PATCH 19/32] fix typo --- docs/operational_guide/bootstrapping.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index d31d3233b0..ab838b93d2 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -31,7 +31,7 @@ The filesystem bootstrapper's responsibility is to determine which immutable [fi The commitlog bootstrapper's responsibility is to read the commitlog and snapshot (compacted commitlogs) files on disk and recover any data that has not yet been written out as an immutable fileset file. Unlike the filesystem bootstrapper, the commit log bootstrapper cannot simply check which files are on disk in order to determine if it can satisfy a bootstrap request. Instead, the commitlog bootstrapper determines whether it can satisfy a bootstrap request using a simple heuristic. -On a shard-by-shard basis, the commitlog bootstrapper will consult the cluster topology to see if the host it is running on has ever achieved the "Available" status for the specified shard. If so, then the commit log bootstrapper should have all the data since the last fileset file was flushed and will return that it can satisfy any time range for that shard. In other words, the commit log bootstrapper is all-or-nothing for a given shard: it will either return that it can satisfy any time range for a given shard or none at all. In addition, the commitlog bootstrapper *assumes* it is running after the filesystem bootstrapper. M3DB will not allow you to run with a configuration where the filesystem bootstrappe is placed after the commitlog bootstrapper, but it will allow you to run the commitlog bootstrapper without the filesystem bootstrapper which can result in loss of data, depending on the workload. +On a shard-by-shard basis, the commitlog bootstrapper will consult the cluster topology to see if the host it is running on has ever achieved the "Available" status for the specified shard. If so, then the commit log bootstrapper should have all the data since the last fileset file was flushed and will return that it can satisfy any time range for that shard. 
In other words, the commit log bootstrapper is all-or-nothing for a given shard: it will either return that it can satisfy any time range for a given shard or none at all. In addition, the commitlog bootstrapper *assumes* it is running after the filesystem bootstrapper. M3DB will not allow you to run with a configuration where the filesystem bootstrapper is placed after the commitlog bootstrapper, but it will allow you to run the commitlog bootstrapper without the filesystem bootstrapper which can result in loss of data, depending on the workload. ### Peers Bootstrapper From 1b2006f5ef71db111165015a09011e187f465170 Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Thu, 20 Sep 2018 16:36:05 -0400 Subject: [PATCH 20/32] Address feedback changes --- docs/operational_guide/bootstrapping.md | 4 ++-- docs/operational_guide/topology.md | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index ab838b93d2..77a04096e3 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -63,7 +63,7 @@ In this case, the peer bootstrapper running on node 1 will not be able to fullfi | 3 | 2 | A | | 3 | 3 | A | -Note that a bootstrap consistency level of majority is the default value, but can be modified by changing the value of the key "m3db.client.bootstrap-consistency-level" in EtcD to one of: "none", "one", "unstrict_majority" (attempt to read from majority, but settle for less if any errors occur), "majority" (strict majority), and "all". +Note that a bootstrap consistency level of majority is the default value, but can be modified by changing the value of the key "m3db.client.bootstrap-consistency-level" in etcd to one of: "none", "one", "unstrict_majority" (attempt to read from majority, but settle for less if any errors occur), "majority" (strict majority), and "all". **Note**: Any bootstrappers configuration that does not include the peers bootstrapper will be unable to handle dynamic topology changes of any kind. @@ -75,7 +75,7 @@ The uninitialized topology bootstrapper determines whether a topology is "new" f ### No Operational All Bootstrapper -The noop_all bootstrapper succeeds all bootstraps regardless of requests shards/time ranges. +The `noop_all` bootstrapper succeeds all bootstraps regardless of requests shards/time ranges. ### Bootstrappers Configuration diff --git a/docs/operational_guide/topology.md b/docs/operational_guide/topology.md index 29985d94d4..e112794ee3 100644 --- a/docs/operational_guide/topology.md +++ b/docs/operational_guide/topology.md @@ -2,7 +2,7 @@ ## Overview -M3DB stores its topology (mapping of which hosts are responsible for which shards) in EtcD. There are three possible states that each host/shard pair can be in: +M3DB stores its topology (mapping of which hosts are responsible for which shards) in etcd. There are three possible states that each host/shard pair can be in: 1. Initializing 2. 
Available From be7149f7b3568296d8f9eec6cd182f9bab686a7b Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Thu, 20 Sep 2018 16:36:38 -0400 Subject: [PATCH 21/32] link to etcd --- docs/operational_guide/bootstrapping.md | 2 +- docs/operational_guide/topology.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index 77a04096e3..a783caddf4 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -63,7 +63,7 @@ In this case, the peer bootstrapper running on node 1 will not be able to fullfi | 3 | 2 | A | | 3 | 3 | A | -Note that a bootstrap consistency level of majority is the default value, but can be modified by changing the value of the key "m3db.client.bootstrap-consistency-level" in etcd to one of: "none", "one", "unstrict_majority" (attempt to read from majority, but settle for less if any errors occur), "majority" (strict majority), and "all". +Note that a bootstrap consistency level of majority is the default value, but can be modified by changing the value of the key "m3db.client.bootstrap-consistency-level" in [etcd](https://coreos.com/etcd/) to one of: "none", "one", "unstrict_majority" (attempt to read from majority, but settle for less if any errors occur), "majority" (strict majority), and "all". **Note**: Any bootstrappers configuration that does not include the peers bootstrapper will be unable to handle dynamic topology changes of any kind. diff --git a/docs/operational_guide/topology.md b/docs/operational_guide/topology.md index e112794ee3..d0a7ab280c 100644 --- a/docs/operational_guide/topology.md +++ b/docs/operational_guide/topology.md @@ -2,7 +2,7 @@ ## Overview -M3DB stores its topology (mapping of which hosts are responsible for which shards) in etcd. There are three possible states that each host/shard pair can be in: +M3DB stores its topology (mapping of which hosts are responsible for which shards) in [etcd](https://coreos.com/etcd/). There are three possible states that each host/shard pair can be in: 1. Initializing 2. Available From 02bd7cc184f85d8a1d9a2d2ca09d6a9c033c3182 Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Wed, 26 Sep 2018 16:09:56 -0400 Subject: [PATCH 22/32] Address review feedback --- docs/operational_guide/bootstrapping.md | 48 +++++++++---------- .../{topology.md => placement.md} | 31 ++++++------ 2 files changed, 41 insertions(+), 38 deletions(-) rename docs/operational_guide/{topology.md => placement.md} (57%) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index a783caddf4..79503fc681 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -2,40 +2,40 @@ ## Introduction -We recommend reading the [topology operational guide](topology.md) before reading the rest of this document. +We recommend reading the [placement operational guide](placement.md) before reading the rest of this document. -When an M3DB node is turned on (or experiences a topology change) it needs to go through a bootstrapping process to determine the integrity of data that it has, replay writes from the commit log, and/or stream missing data from its peers. 
In most cases, as long as you're running with the default and recommended bootstrapper configuration of: `filesystem,commitlog,peers,uninitialized_topology` then you should not need to worry about the bootstrapping process at all and M3DB will take care of doing the right thing such that you don't lose data and consistency guarantees are met.
+When an M3DB node is turned on (or experiences a placement change) it needs to go through a bootstrapping process to determine the integrity of data that it has, replay writes from the commit log, and/or stream missing data from its peers. In most cases, as long as you're running with the default and recommended bootstrapper configuration of: `filesystem,commitlog,peers,uninitialized_topology` then you should not need to worry about the bootstrapping process at all and M3DB will take care of doing the right thing such that you don't lose data and consistency guarantees are met. Note that the order of the configured bootstrappers *does* matter.
 
-Generally speaking, we recommend that operators do not modify the bootstrappers configuration, but in the rare case that you need to this document is designed to help you understand the implications of doing so.
+Generally speaking, we recommend that operators do not modify the bootstrappers configuration, but in the rare case that you need to, this document is designed to help you understand the implications of doing so.
 
 M3DB currently supports 5 different bootstrappers:
 
-1. filesystem
-2. commitlog
-3. peers
-4. uninitialized_topology
-5. noop_all
+1. `filesystem`
+2. `commitlog`
+3. `peers`
+4. `uninitialized_topology`
+5. `noop_all`
 
 When the bootstrapping process begins, M3DB nodes need to determine two things:
 
-1. What shards they should bootstrap, which can be determined from the cluster topology.
+1. What shards they should bootstrap, which can be determined from the cluster placement.
 2. What time-ranges they need to bootstrap those shards for, which can be determined from the namespace retention.
 
-For example, imagine an M3DB node that is responsible for shards 1, 5, 13, and 25 according to the cluster topology. In addition, it has a single namespace called "metrics" with a retention of 48 hours. When the M3DB node is started, the node will determine that it needs to bootstrap shards 1, 5, 13, and 25 for the time range starting at the current time and ending 48 hours ago. In order to obtain all this data, it will run the configured bootstrappers in the specified order. Every bootstrapper will notify the bootstrapping process of which shard/ranges it was able to bootstrap and the bootstrapping process will continue working its way through the list of bootstrappers until all the shards/ranges it requires have been marked as fulfilled. Otherwise the M3DB node will fail to start.
+For example, imagine a M3DB node that is responsible for shards 1, 5, 13, and 25 according to the cluster placement. In addition, it has a single namespace called "metrics" with a retention starting 48 hours ago and ending at the current time. When the M3DB node is started, the node will determine that it needs to bootstrap shards 1, 5, 13, and 25 for the time range starting at the current time and ending 48 hours ago. In order to obtain all this data, it will run the configured bootstrappers in the specified order.
Every bootstrapper will notify the bootstrapping process of which shard/ranges it was able to bootstrap and the bootstrapping process will continue working its way through the list of bootstrappers until all the shards/ranges required have been marked as fulfilled. Otherwise the M3DB node will fail to start. ### Filesystem Bootstrapper -The filesystem bootstrapper's responsibility is to determine which immutable [fileset files](../m3db/architecture/storage.md) exist on disk, and if so, mark them as fulfilled. The filesystem bootstrapper achieves this by scanning M3DB's directory structure and determining which fileset files already exist on disk. Unlike the other bootstrappers, the filesystem bootstrapper does not need to load any data into memory, it simply verifies the checksums of the data on disk and the M3DB node itself will handle reading (and caching) the data dynamically once it begins to serve reads. +The `filesystem` bootstrapper's responsibility is to determine which immutable [fileset files](../m3db/architecture/storage.md) exist on disk, and if so, mark them as fulfilled. The `filesystem` bootstrapper achieves this by scanning M3DB's directory structure and determining which fileset files already exist on disk. Unlike the other bootstrappers, the `filesystem` bootstrapper does not need to load any data into memory, it simply verifies the checksums of the data on disk and other components of the M3DB node will handle reading (and caching) the data dynamically once it begins to serve reads. ### Commitlog Bootstrapper -The commitlog bootstrapper's responsibility is to read the commitlog and snapshot (compacted commitlogs) files on disk and recover any data that has not yet been written out as an immutable fileset file. Unlike the filesystem bootstrapper, the commit log bootstrapper cannot simply check which files are on disk in order to determine if it can satisfy a bootstrap request. Instead, the commitlog bootstrapper determines whether it can satisfy a bootstrap request using a simple heuristic. +The `commitlog` bootstrapper's responsibility is to read the `commitlog` and snapshot (compacted commitlogs) files on disk and recover any data that has not yet been written out as an immutable fileset file. Unlike the `filesystem` bootstrapper, the commit log bootstrapper cannot simply check which files are on disk in order to determine if it can satisfy a bootstrap request. Instead, the `commitlog` bootstrapper determines whether it can satisfy a bootstrap request using a simple heuristic. -On a shard-by-shard basis, the commitlog bootstrapper will consult the cluster topology to see if the host it is running on has ever achieved the "Available" status for the specified shard. If so, then the commit log bootstrapper should have all the data since the last fileset file was flushed and will return that it can satisfy any time range for that shard. In other words, the commit log bootstrapper is all-or-nothing for a given shard: it will either return that it can satisfy any time range for a given shard or none at all. In addition, the commitlog bootstrapper *assumes* it is running after the filesystem bootstrapper. M3DB will not allow you to run with a configuration where the filesystem bootstrapper is placed after the commitlog bootstrapper, but it will allow you to run the commitlog bootstrapper without the filesystem bootstrapper which can result in loss of data, depending on the workload. 
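
Since the safety of the `commitlog` bootstrapper depends on it running after the `filesystem` bootstrapper, it can help to see what the recommended ordering looks like in configuration. The snippet below is a minimal sketch only; the exact keys (`bootstrap`, `bootstrappers`) are illustrative and should be checked against the m3dbnode configuration reference for your version.

```yaml
# Illustrative sketch only: key names may differ between M3DB versions.
db:
  bootstrap:
    # Order matters: each bootstrapper is consulted in turn and gets the
    # first chance to fulfill the remaining shard/time ranges.
    bootstrappers:
      - filesystem
      - commitlog
      - peers
      - uninitialized_topology
```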
+On a shard-by-shard basis, the `commitlog` bootstrapper will consult the cluster placement to see if the host it is running on has ever achieved the `Available` status for the specified shard. If so, then the commit log bootstrapper should have all the data since the last fileset file was flushed and will return that it can satisfy any time range for that shard. In other words, the commit log bootstrapper is all-or-nothing for a given shard: it will either return that it can satisfy any time range for a given shard or none at all. In addition, the `commitlog` bootstrapper *assumes* it is running after the `filesystem` bootstrapper. M3DB will not allow you to run with a configuration where the `filesystem` bootstrapper is placed after the `commitlog` bootstrapper, but it will allow you to run the `commitlog` bootstrapper without the `filesystem` bootstrapper which can result in loss of data, depending on the workload. ### Peers Bootstrapper -The peer bootstrapper's responsibility is to stream in data for shard/ranges from other M3DB nodes (peers) in the cluster. This bootstrapper is only useful in M3DB clusters with more than a single node *and* where the replication factor is set to a value larger than 1. The peers bootstrapper will determine whether or not it can satisfy a bootstrap request on a shard-by-shard basis by consulting the cluster topology and determining if there are enough peers to satisfy the bootstrap request. For example, imagine the following M3DB topology where node 1 is trying to perform a peer bootstrap: +The peer bootstrapper's responsibility is to stream in data for shard/ranges from other M3DB nodes (peers) in the cluster. This bootstrapper is only useful in M3DB clusters with more than a single node *and* where the replication factor is set to a value larger than 1. The `peers` bootstrapper will determine whether or not it can satisfy a bootstrap request on a shard-by-shard basis by consulting the cluster placement and determining if there are enough peers to satisfy the bootstrap request. For example, imagine the following M3DB placement where node 1 is trying to perform a peer bootstrap: | Node | Shard | State | |------|-------|-------| @@ -49,7 +49,7 @@ The peer bootstrapper's responsibility is to stream in data for shard/ranges fro | 3 | 2 | A | | 3 | 3 | A | -In this case, the peer bootstrapper running on node 1 will not be able to fullfill any requests because node 2 is in the Initializing state for all of its shards and cannot fulfill bootstrap requests. This means that node 1's peer bootstrapper cannot meet its default consistency level of majority for bootstrapping (1 < 2 which is majority with a replication factor of 3). On the other hand, node 1 would be able to peer bootstrap in the following topology because its peers (nodes 2/3) are available for all of their shards: +In this case, the peer bootstrapper running on node 1 will not be able to fullfill any requests because node 2 is in the `Initializing` state for all of its shards and cannot fulfill bootstrap requests. This means that node 1's peer bootstrapper cannot meet its default consistency level of majority for bootstrapping (1 < 2 which is majority with a replication factor of 3). 
On the other hand, node 1 would be able to peer bootstrap in the following placement because its peers (nodes 2/3) are `Available` for all of their shards: | Node | Shard | State | |------|-------|-------| @@ -65,13 +65,13 @@ In this case, the peer bootstrapper running on node 1 will not be able to fullfi Note that a bootstrap consistency level of majority is the default value, but can be modified by changing the value of the key "m3db.client.bootstrap-consistency-level" in [etcd](https://coreos.com/etcd/) to one of: "none", "one", "unstrict_majority" (attempt to read from majority, but settle for less if any errors occur), "majority" (strict majority), and "all". -**Note**: Any bootstrappers configuration that does not include the peers bootstrapper will be unable to handle dynamic topology changes of any kind. +**Note**: Any bootstrappers configuration that does not include the peers bootstrapper will be unable to handle dynamic placement changes of any kind. ### Uninitialized Topology Bootstrapper -The purpose of the uninitialzied topology bootstrapper is to succeed bootstraps for all time ranges for shards that have never been completely bootstrapped (at a cluster level). This allows us to run the default bootstrapper configuration of: `filesystem,commitlog,peers,topology_uninitialized` such that the filesystem and commitlog bootstrappers are used by default in node restarts, the peer bootstrapper is used for node adds/removes/replaces, and bootstraps still succeed for brand new topologies where both the commitlog and peers bootstrappers will be unable to succeed any bootstraps. In other words, the uninitialized topology bootstrapper allows us to place the commitlog bootstrapper *before* the peers bootstrapper and still succeed bootstraps with brand new topologies without resorting to using the noop-all bootstrapper which suceeds bootstraps for all shard/time-ranges regardless of the status of the topology. +The purpose of the uninitialized topology bootstrapper is to succeed bootstraps for all time ranges for shards that have never been completely bootstrapped (at a cluster level). This allows us to run the default bootstrapper configuration of: `filesystem,commitlog,peers,topology_uninitialized` such that the `filesystem` and `commitlog` bootstrappers are used by default in node restarts, the peer bootstrapper is used for node adds/removes/replaces, and bootstraps still succeed for brand new placement where both the `commitlog` and `peers` bootstrappers will be unable to succeed any bootstraps. In other words, the uninitialized topology bootstrapper allows us to place the `commitlog` bootstrapper *before* the `peers` bootstrapper and still succeed bootstraps with brand new placements without resorting to using the noop-all bootstrapper which suceeds bootstraps for all shard/time-ranges regardless of the status of the placement. -The uninitialized topology bootstrapper determines whether a topology is "new" for a given shard by counting the number of hosts in the Initializing state and Leaving states and if the number of Initializing - Leaving > 0 than it succeeds the bootstrap because that means the topology has never reached a state where all hosts are Available. 
+The uninitialized topology bootstrapper determines whether a placement is "new" for a given shard by counting the number of hosts in the `Initializing` state and `Leaving` states and there are more `Initializing` than `Leaving`, then it succeeds the bootstrap because that means the placement has never reached a state where all hosts are `Available`. ### No Operational All Bootstrapper @@ -79,15 +79,15 @@ The `noop_all` bootstrapper succeeds all bootstraps regardless of requests shard ### Bootstrappers Configuration -Now that we've gone over the various bootstrappers, lets consider how M3DB will behave in different configurations. Note that we include uninitialized_topology at the end of all the lists of bootstrappers because its required to get a new topology up and running in the first place, but is not required after that (although leaving it in has no detrimental effects). Also note that any configuration that does not include the peers bootstrapper will not be able to handle dynamic topology changes like node adds/removes/replaces. +Now that we've gone over the various bootstrappers, let's consider how M3DB will behave in different configurations. Note that we include `uninitialized_topology` at the end of all the lists of bootstrappers because its required to get a new placement up and running in the first place, but is not required after that (although leaving it in has no detrimental effects). Also note that any configuration that does not include the `peers` bootstrapper will not be able to handle dynamic placement changes like node adds/removes/replaces. #### filesystem,commitlog,peers,uninitialized_topology (default) This is the default bootstrappers configuration for M3DB and will behave "as expected" in the sense that it will maintain M3DB's consistency guarantees at all times, handle node adds/replaces/removes correctly, and still work with brand new topologies. -In the general case, the node will use only the filesystem and commitlog bootstrappers on node startup. However, in the case of a node add/remove/replace, the commitlog bootstrapper will detect that it is unable to fulfill the bootstrap request (because the node has never reached the Available state) and defer to the peers bootstrapper to stream in the data. +In the general case, the node will use only the `filesystem` and `commitlog` bootstrappers on node startup. However, in the case of a node add/remove/replace, the `commitlog` bootstrapper will detect that it is unable to fulfill the bootstrap request (because the node has never reached the `Available` state) and defer to the `peers` bootstrapper to stream in the data. -Additionally, if it is a brand new topology where even the peers bootstrapper cannot fulfill the bootstrap, this will be detected by the uninitialized_topology bootstrapper which will succeed the bootstrap. +Additionally, if it is a brand new placement where even the peers bootstrapper cannot fulfill the bootstrap, this will be detected by the `uninitialized_topology` bootstrapper which will succeed the bootstrap. #### filesystem,commitlog,uninitialized_topology @@ -95,12 +95,12 @@ This bootstrapping configuration will work just fine if nodes are never added/re #### peers,uninitialized_topology -Everytime a node is restarted, it will attempt to stream in *all* of the data that it is responsible for from its peers, completely ignoring the immutable fileset files it already has on disk. 
We do not recommend running in this mode as it can lead to violations of M3DB's consistency guarantees due to the fact that the commit logs are being ignored, however, it *can* be useful if you want to repair the data on a node by forcing it to stream from its peers. +Every time a node is restarted, it will attempt to stream in *all* of the data that it is responsible for from its peers, completely ignoring the immutable fileset files it already has on disk. We do not recommend running in this mode as it can lead to violations of M3DB's consistency guarantees due to the fact that the commit logs are being ignored, however, it *can* be useful if you want to repair the data on a node by forcing it to stream from its peers. #### filesystem,uninitialized_topology -Everytime a node is restarted it will utilize the immutable fileset files its already written out to disk, but any data that it had received since it wrote out the last set of immutable files will be lost. +Every time a node is restarted it will utilize the immutable fileset files its already written out to disk, but any data that it had received since it wrote out the last set of immutable files will be lost. #### commitlog,uninitialized_topology -Everytime a node is restarted it will read all the commit log and snapshot files it has on disk, but it will ignore all the data in the immutable fileset files that it has already written. +Every time a node is restarted it will read all the commit log and snapshot files it has on disk, but it will ignore all the data in the immutable fileset files that it has already written. diff --git a/docs/operational_guide/topology.md b/docs/operational_guide/placement.md similarity index 57% rename from docs/operational_guide/topology.md rename to docs/operational_guide/placement.md index d0a7ab280c..4a630b7ea6 100644 --- a/docs/operational_guide/topology.md +++ b/docs/operational_guide/placement.md @@ -1,35 +1,38 @@ -# Topology +# Placement ## Overview -M3DB stores its topology (mapping of which hosts are responsible for which shards) in [etcd](https://coreos.com/etcd/). There are three possible states that each host/shard pair can be in: +**Note**: The words *placement* and *topology* are used interchangeably throughout the M3DB documentation and codebase. -1. Initializing -2. Available -3. Leaving +A M3DB cluster has exactly one Placement. That placement maps the cluster's shard replicas to hosts. A cluster also has 0 or more namespaces, and each host serves every namespace for the shards it owns. In other words, if the cluster topology states that host A owns shards 1, 2, and 3 then host A will own shards 1, 2, 3 for all configured namespaces in the cluster. -Note that these states are not a reflection of the current status of an M3DB node, but an indication of whether a given node has ever successfully bootstrapped and taken ownership of a given shard. For example, in a new cluster all the nodes will begin with all of their shards in the Initializing state. Once all the nodes finish bootstrapping, they will mark all of their shards as Available. If all the M3DB nodes are stopped at the same time, the cluster topology will still show all of the shards for all of the hosts as Available. +M3DB stores its placement (mapping of which hosts are responsible for which shards) in [etcd](https://coreos.com/etcd/). There are three possible states that each host/shard pair can be in: + +1. `Initializing` +2. `Available` +3. 
`Leaving` + +Note that these states are not a reflection of the current status of an M3DB node, but an indication of whether a given node has ever successfully bootstrapped and taken ownership of a given shard (achieved goal state). For example, in a new cluster all the nodes will begin with all of their shards in the `Initializing` state. Once all the nodes finish bootstrapping, they will mark all of their shards as `Available`. If all the M3DB nodes are stopped at the same time, the cluster placement will still show all of the shards for all of the hosts as `Available`. ## Initializing State -The initializing state is the state in which all new host/shard combinations begin. For example, upon creating a new topology all the host/shard pairs will begin in the "Initializing" state and only once they have successfully bootstrapped will they transition to the "Available" state. +The `Initializing` state is the state in which all new host/shard combinations begin. For example, upon creating a new placement all the host/shard pairs will begin in the `Initializing` state and only once they have successfully bootstrapped will they transition to the `Available`` state. -The Initializing state is not limited to new topology, however, as it can also occur during topology changes. For example, during a node add/replace the new host will begin with all of its shards in the Initializing state until it can stream the data it is missing from its peers. During a node removal, all of the hosts who receive new shards (as a result of taking over the responsibilities of the node that is leaving) will begin with those shards marked as Initializing until they can stream in the data from the node leaving the cluster, or one of its peers. +The `Initializing` state is not limited to new placement, however, as it can also occur during placement changes. For example, during a node add/replace the new host will begin with all of its shards in the `Initializing` state until it can stream the data it is missing from its peers. During a node removal, all of the hosts who receive new shards (as a result of taking over the responsibilities of the node that is leaving) will begin with those shards marked as `Initializing` until they can stream in the data from the node leaving the cluster, or one of its peers. ## Available State -Once a node with a shard in the Initializing state successfully bootstraps all of the data for that shard, it will mark that shard as Available (for the single host) in the cluster topology. +Once a node with a shard in the `Initializing` state successfully bootstraps all of the data for that shard, it will mark that shard as `Available` (for the single host) in the cluster placement. ## Leaving State -The leaving state indicates that a node is attempting to leave the cluster. The purpose of this state is to allow the node to remain in the cluster long enough for the nodes that are taking over its responsibilities to stream data from it. - +The `Leaving` state indicates that a node has been marked for removal from the cluster. The purpose of this state is to allow the node to remain in the cluster long enough for the nodes that are taking over its responsibilities to stream data from it. 
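
To make the three states concrete, the sketch below shows roughly how shard ownership might look in a placement while a topology change is in flight. This is an illustrative YAML rendering only; the actual placement stored in etcd uses a different, compact encoding, and the field names here are not the exact wire format.

```yaml
# Illustrative only: field names and layout are not the real stored format.
instances:
  host3:
    weight: 100
    shards:
      - id: 1
        state: AVAILABLE      # previously bootstrapped and still owned
      - id: 2
        state: LEAVING        # being handed off to another host
  host4:
    weight: 100
    shards:
      - id: 2
        state: INITIALIZING   # streaming shard 2 from host3 or its peers
```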
## Sample Cluster State Transitions - Node Add Replication factor: 3 -### Initial Topology +### Initial Placement | Node | Shard | State | |------|-------|-------| @@ -78,7 +81,7 @@ Replication factor: 3 Replication factor: 3 -### Initial Topology +### Initial Placement | Node | Shard | State | |------|-------|-------| @@ -127,7 +130,7 @@ Replication factor: 3 Replication factor: 3 -### Initial Topology +### Initial Placement | Node | Shard | State | |------|-------|-------| From 25fdd7f6dd98ce507b5434aa73df5ea187e5022f Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Wed, 26 Sep 2018 17:49:05 -0400 Subject: [PATCH 23/32] Update placement doc with better diagrams --- docs/operational_guide/placement.md | 244 +++++++++++++--------------- 1 file changed, 112 insertions(+), 132 deletions(-) diff --git a/docs/operational_guide/placement.md b/docs/operational_guide/placement.md index 4a630b7ea6..de51043249 100644 --- a/docs/operational_guide/placement.md +++ b/docs/operational_guide/placement.md @@ -32,145 +32,125 @@ The `Leaving` state indicates that a node has been marked for removal from the c Replication factor: 3 -### Initial Placement - -| Node | Shard | State | -|------|-------|-------| -| 1 | 1 | A | -| 1 | 2 | A | -| 1 | 3 | A | -| 2 | 1 | A | -| 2 | 2 | A | -| 2 | 3 | A | -| 3 | 1 | A | -| 3 | 2 | A | -| 3 | 3 | A | - -### Begin Node Add - -| Node | Shard | State | -|------|-------|-------| -| 1 | 1 | L | -| 1 | 2 | A | -| 1 | 3 | A | -| 2 | 1 | A | -| 2 | 2 | L | -| 2 | 3 | A | -| 3 | 1 | A | -| 3 | 2 | A | -| 3 | 3 | L | -| 4 | 1 | I | -| 4 | 2 | I | -| 4 | 3 | I | - -### Complete Node Add - -| Node | Shard | State | -|------|-------|-------| -| 1 | 2 | A | -| 1 | 3 | A | -| 2 | 1 | A | -| 2 | 3 | A | -| 3 | 1 | A | -| 3 | 2 | A | -| 4 | 1 | A | -| 4 | 2 | A | -| 4 | 3 | A | + ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ + │ Host 1 │ │ Host 2 │ │ Host 3 │ │ Host 4 │ +┌──────────────────────────┬─────┴─────────────────┴─────┬────┴─────────────────┴────┬───┴─────────────────┴───┬───┴─────────────────┴───┐ +│ │ ┌─────────────────────────┐ │ ┌───────────────────────┐ │ ┌──────────────────────┐│ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ │ Shard 1: Available │ │ │ Shard 1: Available │ │ │ Shard 1: Available ││ │ +│ Initial Placement │ │ Shard 2: Available │ │ │ Shard 2: Available │ │ │ Shard 2: Available ││ │ +│ │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ │ Shard 3: Available ││ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ └─────────────────────────┘ │ └───────────────────────┘ │ └──────────────────────┘│ │ +├──────────────────────────┼─────────────────────────────┼───────────────────────────┼─────────────────────────┼─────────────────────────┤ +│ │ │ │ │ │ +│ │ ┌─────────────────────────┐ │ ┌───────────────────────┐ │ ┌──────────────────────┐│┌──────────────────────┐ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ │ Shard 1: Leaving │ │ │ Shard 1: Available │ │ │ Shard 1: Available │││Shard 1: Initializing │ │ +│ Begin Node Add │ │ Shard 2: Available │ │ │ Shard 2: Leaving │ │ │ Shard 2: Available │││Shard 2: Initializing │ │ +│ │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ │ Shard 3: Leaving │││Shard 3: Initializing │ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ └─────────────────────────┘ │ └───────────────────────┘ │ └──────────────────────┘│└──────────────────────┘ │ +│ │ │ │ │ │ 
+├──────────────────────────┼─────────────────────────────┼───────────────────────────┼─────────────────────────┼─────────────────────────┤ +│ │ │ │ │ │ +│ │ ┌─────────────────────────┐ │ ┌───────────────────────┐ │ ┌──────────────────────┐│┌──────────────────────┐ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ │ Shard 2: Available │ │ │ Shard 1: Available │ │ │ Shard 1: Available │││ Shard 1: Available │ │ +│ Complete Node Add │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ │ Shard 2: Available │││ Shard 2: Available │ │ +│ │ │ │ │ │ │ │ │ │││ Shard 3: Available │ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ └─────────────────────────┘ │ └───────────────────────┘ │ └──────────────────────┘│└──────────────────────┘ │ +│ │ │ │ │ │ +└──────────────────────────┴─────────────────────────────┴───────────────────────────┴─────────────────────────┴─────────────────────────┘ + ## Sample Cluster State Transitions - Node Remove Replication factor: 3 -### Initial Placement - -| Node | Shard | State | -|------|-------|-------| -| 1 | 2 | A | -| 1 | 3 | A | -| 2 | 1 | A | -| 2 | 3 | A | -| 3 | 1 | A | -| 3 | 2 | A | -| 4 | 1 | L | -| 4 | 2 | L | -| 4 | 3 | L | - -### Begin Node Remove - -| Node | Shard | State | -|------|-------|-------| -| 1 | 1 | I | -| 1 | 2 | A | -| 1 | 3 | A | -| 2 | 1 | A | -| 2 | 2 | I | -| 2 | 3 | A | -| 3 | 1 | A | -| 3 | 2 | A | -| 3 | 3 | I | -| 4 | 1 | L | -| 4 | 2 | L | -| 4 | 3 | L | - -### Complete Node Remove - -| Node | Shard | State | -|------|-------|-------| -| 1 | 1 | A | -| 1 | 2 | A | -| 1 | 3 | A | -| 2 | 1 | A | -| 2 | 2 | A | -| 2 | 3 | A | -| 3 | 1 | A | -| 3 | 2 | A | -| 3 | 3 | A | + ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ + │ Host 1 │ │ Host 2 │ │ Host 3 │ │ Host 4 │ +┌──────────────────────────┬─────┴─────────────────┴─────┬────┴─────────────────┴────┬───┴─────────────────┴───┬───┴─────────────────┴───┐ +│ │ ┌─────────────────────────┐ │ ┌───────────────────────┐ │ ┌──────────────────────┐│┌──────────────────────┐ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ │ Shard 2: Available │ │ │ Shard 1: Available │ │ │ Shard 1: Available │││ Shard 1: Available │ │ +│ Initial Placement │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ │ Shard 2: Available │││ Shard 2: Available │ │ +│ │ │ │ │ │ │ │ │ │││ Shard 3: Available │ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ │ │ │ │ │ │ │ │││ │ │ +│ │ └─────────────────────────┘ │ └───────────────────────┘ │ └──────────────────────┘│└──────────────────────┘ │ +├──────────────────────────┼─────────────────────────────┼───────────────────────────┼─────────────────────────┼─────────────────────────┤ +│ │ │ │ │ │ +│ │ ┌─────────────────────────┐ │ ┌───────────────────────┐ │┌───────────────────────┐│┌──────────────────────┐ │ +│ │ │ │ │ │ │ ││ │││ │ │ +│ │ │ │ │ │ │ ││ Shard 1: Available │││ │ │ +│ │ │ Shard 1: Initializing │ │ │ Shard 1: Available │ ││ Shard 2: Available │││ Shard 1: Leaving │ │ +│ Begin Node Remove │ │ Shard 2: Available │ │ │ Shard 2: Initializing│ ││ Shard 3: Initializing│││ Shard 2: Leaving │ │ +│ │ │ Shard 3: Available │ │ │ Shard 3: Available │ ││ │││ Shard 3: Leaving │ │ +│ │ │ │ │ │ │ ││ │││ │ │ +│ │ │ │ │ │ │ ││ │││ │ │ +│ │ └─────────────────────────┘ │ └───────────────────────┘ │└───────────────────────┘│└──────────────────────┘ │ +│ │ │ │ │ │ +├──────────────────────────┼─────────────────────────────┼───────────────────────────┼─────────────────────────┼─────────────────────────┤ +│ │ │ │ │ │ +│ │ 
┌─────────────────────────┐ │ ┌───────────────────────┐ │ ┌──────────────────────┐│ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ │ Shard 1: Avaiable │ │ │ Shard 1: Available │ │ │ Shard 1: Available ││ │ +│ Complete Node Add │ │ Shard 2: Available │ │ │ Shard 2: Available │ │ │ Shard 2: Available ││ │ +│ │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ │ Shard 3: Available ││ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ └─────────────────────────┘ │ └───────────────────────┘ │ └──────────────────────┘│ │ +│ │ │ │ │ │ +└──────────────────────────┴─────────────────────────────┴───────────────────────────┴─────────────────────────┴─────────────────────────┘ ## Sample Cluster State Transitions - Node Replace Replication factor: 3 -### Initial Placement - -| Node | Shard | State | -|------|-------|-------| -| 1 | 1 | A | -| 1 | 2 | A | -| 1 | 3 | A | -| 2 | 1 | A | -| 2 | 2 | A | -| 2 | 3 | A | -| 3 | 1 | A | -| 3 | 2 | A | -| 3 | 3 | A | - -### Begin Node Replace - -| Node | Shard | State | -|------|-------|-------| -| 1 | 1 | A | -| 1 | 2 | A | -| 1 | 3 | A | -| 2 | 1 | A | -| 2 | 2 | A | -| 2 | 3 | A | -| 3 | 1 | L | -| 3 | 2 | L | -| 3 | 3 | L | -| 4 | 1 | I | -| 4 | 2 | I | -| 4 | 3 | I | - -### Complete Node Replace - -| Node | Shard | State | -|------|-------|-------| -| 1 | 1 | A | -| 1 | 2 | A | -| 1 | 3 | A | -| 2 | 1 | A | -| 2 | 2 | A | -| 2 | 3 | A | -| 4 | 1 | A | -| 4 | 2 | A | -| 4 | 3 | A | \ No newline at end of file + ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ + │ Host 1 │ │ Host 2 │ │ Host 3 │ │ Host 4 │ +┌──────────────────────────┬─────┴─────────────────┴─────┬────┴─────────────────┴────┬───┴─────────────────┴───┬───┴─────────────────┴───┐ +│ │ ┌─────────────────────────┐ │ ┌───────────────────────┐ │ ┌──────────────────────┐│ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ │ Shard 1: Available │ │ │ Shard 1: Available │ │ │ Shard 1: Available ││ │ +│ Initial Placement │ │ Shard 2: Available │ │ │ Shard 2: Available │ │ │ Shard 2: Available ││ │ +│ │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ │ Shard 3: Available ││ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ │ │ │ │ │ │ │ ││ │ +│ │ └─────────────────────────┘ │ └───────────────────────┘ │ └──────────────────────┘│ │ +├──────────────────────────┼─────────────────────────────┼───────────────────────────┼─────────────────────────┼─────────────────────────┤ +│ │ │ │ │ │ +│ │ ┌─────────────────────────┐ │ ┌───────────────────────┐ │┌───────────────────────┐│┌──────────────────────┐ │ +│ │ │ │ │ │ │ ││ │││ │ │ +│ │ │ │ │ │ │ ││ │││ │ │ +│ │ │ Shard 1: Available │ │ │ Shard 1: Available │ ││ Shard 1: Leaving │││Shard 1: Initializing │ │ +│ Begin Node Remove │ │ Shard 2: Available │ │ │ Shard 2: Available │ ││ Shard 2: Leaving │││Shard 2: Initializing │ │ +│ │ │ Shard 3: Available │ │ │ Shard 3: Available │ ││ Shard 3: Leaving │││Shard 3: Initializing │ │ +│ │ │ │ │ │ │ ││ │││ │ │ +│ │ │ │ │ │ │ ││ │││ │ │ +│ │ └─────────────────────────┘ │ └───────────────────────┘ │└───────────────────────┘│└──────────────────────┘ │ +│ │ │ │ │ │ +├──────────────────────────┼─────────────────────────────┼───────────────────────────┼─────────────────────────┼─────────────────────────┤ +│ │ │ │ │ │ +│ │ ┌─────────────────────────┐ │ ┌───────────────────────┐ │ │┌──────────────────────┐ │ +│ │ │ │ │ │ │ │ ││ │ │ +│ │ │ │ │ │ │ │ ││ │ │ +│ │ │ Shard 1: Avaiable │ │ │ Shard 1: Available │ │ ││ Shard 1: Available │ │ +│ Complete Node Add │ │ Shard 2: Available │ │ │ Shard 2: Available │ │ 
││ Shard 2: Available │ │ +│ │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ ││ Shard 3: Available │ │ +│ │ │ │ │ │ │ │ ││ │ │ +│ │ │ │ │ │ │ │ ││ │ │ +│ │ └─────────────────────────┘ │ └───────────────────────┘ │ │└──────────────────────┘ │ +│ │ │ │ │ │ +└──────────────────────────┴─────────────────────────────┴───────────────────────────┴─────────────────────────┴─────────────────────────┘ \ No newline at end of file From e5e31bc24aa08af60ab1a52a0e5d1cb279ba9fee Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Wed, 26 Sep 2018 17:58:16 -0400 Subject: [PATCH 24/32] Add another placement picture --- docs/operational_guide/placement.md | 51 ++++++++++++++++++++++++++++- 1 file changed, 50 insertions(+), 1 deletion(-) diff --git a/docs/operational_guide/placement.md b/docs/operational_guide/placement.md index de51043249..6e4273bfcc 100644 --- a/docs/operational_guide/placement.md +++ b/docs/operational_guide/placement.md @@ -153,4 +153,53 @@ Replication factor: 3 │ │ │ │ │ │ │ │ ││ │ │ │ │ └─────────────────────────┘ │ └───────────────────────┘ │ │└──────────────────────┘ │ │ │ │ │ │ │ -└──────────────────────────┴─────────────────────────────┴───────────────────────────┴─────────────────────────┴─────────────────────────┘ \ No newline at end of file +└──────────────────────────┴─────────────────────────────┴───────────────────────────┴─────────────────────────┴─────────────────────────┘ + +## Cluster State Transitions - Placement Updates Initiation + + ┌────────────────────────────────┐ + │ Host 1 │ + │ │ + │ Shard 1: Available │ + │ Shard 2: Available │ Operator performs node replace by + │ Shard 3: Available │ updating placement in etcd such + │ │ that shards on host 1 are marked + └────────────────────────────────┤ Leaving and shards on host 2 are + │ marked Initializing + └─────────────────────────────────┐ + │ + │ + │ + │ + │ + ▼ + ┌────────────────────────────────┐ + │ Host 1 │ + │ │ + │ Shard 1: Leaving │ + │ Shard 2: Leaving │ + │ Shard 3: Leaving │ + │ │ + └────────────────────────────────┘ + + ┌────────────────────────────────┐ + │ Host 2 │ + │ │ + │ Shard 1: Initializing │ +┌────────────────────────────────┐ │ Shard 2: Initializing │ +│ │ │ Shard 3: Initializing │ +│ │ │ │ +│ Host 1 │ └────────────────────────────────┘ +│ │ │ +│ │ │ +│ │ │ +└────────────────────────────────┘ │ + │ +┌────────────────────────────────┐ │ +│ Host 2 │ │ +│ │ │ +│ Shard 1: Available │ Host 2 completes bootstrapping and +│ Shard 2: Available │◀────updates placement (via etcd) to +│ Shard 3: Available │ indicate shard state is Available +│ │ +└────────────────────────────────┘ \ No newline at end of file From 3791fc97d65d261899ecd1978eb41234bc094c44 Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Wed, 26 Sep 2018 17:59:45 -0400 Subject: [PATCH 25/32] Add monopic resources --- .../monodraw/placement_monodraw_template.monopic | Bin 0 -> 3118 bytes .../monodraw/placement_state_machine.monopic | Bin 0 -> 1636 bytes 2 files changed, 0 insertions(+), 0 deletions(-) create mode 100644 docs/m3db/monodraw/placement_monodraw_template.monopic create mode 100644 docs/m3db/monodraw/placement_state_machine.monopic diff --git a/docs/m3db/monodraw/placement_monodraw_template.monopic b/docs/m3db/monodraw/placement_monodraw_template.monopic new file mode 100644 index 0000000000000000000000000000000000000000..166e4c2c182ddea78105fe1a4c78639f8d93f75d GIT binary patch literal 3118 zcmV+}4AJxdO;1iwP)S1pABzY8000000u$|B-EQN$5q_0G*ABqpKXZ4^xoLreqG&JM zg`wC=;vUCNZD-H!SuD_}>4Wu2Dv~B8Ia1d883m4V<5jfMh#Zo`nQwmhE!p1wD_b1) 
z$;~%=c@0L|%jI@4-(`2njlY@9lI3c*&ko5AnI(s>PuYC6Ol|;X$dFwZvW?K;a}xJ%djY?iFIi{HwBh6n9iPwo@X?suF2<|H3h%VoBS zkMAGS<@Wa%L^tVUHqVx;L+RdPyE$Z=!#sa*v$XdQyXUOb0pY0Sf7m<84L!DHfB3r2 z|73X!cB}hU{^^9wz~o#wr1!36_G9_oU-PH0XkxjrUmaH4jb&QS%YQgFaGrR6a1kzK zxKQCjC!g{jX6bUi*&b3$!?ok(0PLT4cj+RV=btM2?)$UP?jLP8^G;S9Ad1g@vJ`H8 zZ;#J>Dj0NFeR)iuN?#NtiJacv-MOL9nPBfa7I}uT_{_@E$$9WKx#96uRg1&JZu@-y5C<(i9MZ*u?V#`9aZ(9z)1$=b&{WIjS0}7vuWxZ>`jR7QFy@Do zewVM3dH8z%t}GLxr5SLyPVe(BLTI+Kq>D#&U0WTa+!{P&Uk>x*sQ4jrcibv-WA^<> zS5qXqx-~1znXCXxTGOWI(&m=*?RK}!c7EQ?4?92gTu0OO>VEU+IuFfq!9Kalr|~k2 zOY6lQ`@%U$q@cH}}u$bT{`8xRr6YvtlaUEFQMIG8d?o3>89l+$^CJVfm7K zC*sj|BLCmNKcu_m72I6?^f_IvZFB#*@%L!WJyyDB8 zQZkMlB~8M_^JBmI!-`WXW+thXQw=gdjAJCq9MK9NcSk-}=f?5oQI5aHtpEMzzpjaP!|=!*~;vM$J*9G z@XmU2(Q%2?aj$gXblfW*Jq-i^JWW6gw9ZrDcoM*CUc=7@#wv78!;;UN?Hhc2O?I5% z%FI|8XJ(=y_+)kewR`PVn&IKq@=J1a4P_;QAE_FFHV70hT8&g{U@h{`o19JQ`s%Oi zd>eQDelz6|?2ge(IV9bb!#WbU$W_{H_ zM?6Ss=(+_$vG)Tf0}uhcy-sxOB;*@Cu%ep`t%Z((byMJ#>GncUm1nA~#og+qOTEM@ zOZDo5<#~LRfc4PXiG1_h(s8}Afae|m4ftL#`@)hBD?S>0VZj4LV=&r+NkmYB_82ar zH5rY`XiEmlG*G62G7XezpiBd08Yo9QHJYi^;h-hBmAu@vPPeEyM;dqHqXhpY-F!~> zS5O1geu^szN&BMe-C^O^bf(&3;VcY23q#Ms(6cb9SeR5iOezkD=y=f*-4-3^Yy@vu!jzFy(GTtA0X947DQP?zrK@G4zY2=vH)j;MY@)IWh1s%_M5a}{(N zeh#Nu59Q~e1WDZTj?RL7LoS(=3i@yHN_C?_Wv==LQA_|2Ww(^m^;0f?GcU^hjN51b zo^PHXZ?m1_Jh=ao?w31-le%Bwr*r&N-jap0fhx<0B+>dI97%g*NMQ;LtHDI!>3&S$ z(UwR~G}+`ll>MED49Vs;SPn(Wx(YB|#>AxTWCMT!wE1JO%tvXO%mS(V$88}=0ZQK; zXv29UXah~4jS4E23L2FPm|kwi;vqA|=XhIW#@U+4$cz%LkQt-UeBFXI#YED+Ym4gl zhg)*eehtK+DHbdZ7FGqLnCK*_e5K7AU>tu9FxF-b(2E8wD3A}(BB8u5XaR2vT2SxU z3erHanOoaqfff`BS6TFxMK>IcemEOrGzrjxI{)HOpoQlBEqc+UtBU@rD!FXj)+Jed z#HaTaf%J`WTVX2k=`gMc^s~rc#UKL_qYOk0GZ1 z=^t(>N%tU_h-GE_t&J^atLg9H2JZNO9TY`kdiDh1T^4?FByGyqs7r4BI( zovZ$w0!1-WMlQU$akhG_E?d4jbnqjJ=hCBid?aBhif2@}a)g*6Ld-U6DqKaOOUzW< z$}pgS<+PJKl533k9K{9@$O^HV0-<YF~3PKo)W=zVaa5ULJ9L<-;t<7m>{fT3Fn>#T3 z+-YX?0A3@5g|_ndm&pnpGFe5i%tY8FW!NMoW}Us_bI{fWE=Q=dC)?H8GaAjB(FCuU ztH3Z1FFeUYQw%c!rfMd@`DTKyKP@USZY9K6D%r;7g{vsucv0MY zB1Hv&%RzIz@N6bqvk@u2{Gx@V3yu1y5PRtrVtsN-DGFpvA7u=$GsOuAQBaa7C<&d2 zP;~i0g9$}}XQy(8bs`!TVTy+baMQ^ZteVhC)gPP`h`PN-3W}sK#7R5CVC~i%jinSA z;yHjwI5mSIg1`W&p6UBL%|&&9lKqhm;n;^FdM{iVL^!$pMbj2gwn7FSQMM%oT0qK7 zbvX;CiZZ0Uk+fpjI!e_`rNys)bNnlk_=ZGaj0LydEk&qhEtq!~f!~l2Fc#2bh>w)9d zwRwlts%3|KcVARL6y7h!v^dMHNYS0eaYaqvgT{dpY8we8WOEZB}WCR zzL!Qk_+82V3GkF49)WV%b)0N;B@*!cYnFbl`UOw!uy|+6X*oY)Q!D(N>66lr8!YM&I7KwMQK0?c}z~-{Oi!iW|{bl?3 zw9XFM)nB&D?CPiG@=e3P#G8bF)$!mwG3je#M3vW!n=2kuLt$E(C1eU*);kHEWAU+RU3^ ziAP&oMy)IsrUNNZ?QdH8Vq5!Lj(|(r!w}op-wd(-1TlEo#JXv{ss8z6cm*#cmdup2 zF@*~Hr%<6}JAfg0*=pTFJq$<89Ms)`t^)44Xk$SSedy88S-zooE%|SmRnsPDEY$Jf zJYzwd6JudiNF^qS#QYE)P!t^^67)pQEiz;LRSTJaX2n8Q;XDi{p=6)sf)@FjZyA^298AKrkWwS79HE}Ky+r5xq!eL<^5-65RgNw})$mMthxH)XY` z*h+QQ+LLM6lT(OVPEn>GwT3HN2sKt#rp{|Xb=c%t#(3L7&q4kBb17s1gS6c43MQEd zmSri0tmtl+^L0RZ7VLK!)&J}Ynlf#g;Jdq=227^e_kk>foM(BedIZV_RTc=TpK}rX zct`WX%wMTY(~s+_-1BT9?{~@~P#ITlRLgqPs^oNHT)%jnj-=~w9;t8j@*-F5FludH zu2zESxkQsFw3ZY37d1+V$ONf2v0tjr>Q!YDWv}0F+n4(TS&#RGc1&J1Chta3)5z86 z&$8>kfB(~Jo0b*VjweS2j8xJ1FV;GgRld%7Z=R3oxfD)dw-fEeYpw?qxx6<)w=30S zUUJ7wjlhhG%~8ZdxHI0^o6k~x9{ScOIwsPuwLX+9+5a|ZL43eJo#U+S@5*Ly1TOpE- z(yKrw9F2)gAbQu(1b`3b1ODKD@ZSt?0rMkcv2Adm4b8`4HjP1F3M2TnF!CB=^Q75^ zpah(O5)gkCa&wF-^8oDwvJcokkn4aOFnulDXk`zrJ(YD%b1};+>j*Z)F#mnja(G@kqeP0 zeV-1DGuINsX*^Mmx5|;13BhQhGTwS*Hi8g1y*OgXGb09r5%fbQSzPS1k~D#P zRQQHbyQTqmsK`z|xp?ZybEl5HSrv8jFgarcCB2qwH6q?7i(VLq37)7BZiEo%O+SMD zAOVT|rOwok$~f`1su_)ooU(iAO~ZuU7^XH8IzviNH&0f@j(V}y>SR~Z-)C7oxuQsd=OT%a#-k** z4^xi;Z=mv2-7*E4F zGwgHMC*6+jCa+0_;gIMBUvunb!J%2nY`liHZf%#L`g_)VCt+JXx;@Fr9+dR!EUmlF 
zVnLJ2KT&>1#o)4VYggWPUA)3GFUNM@idQc>l^e5oMUY#(x&ae&gz{G9D{Rvg+dL(a zeXx}+wKKSi@)f}&mak^-8g3x*4>#^{nu9n)&1gQ_;oJWQcZThE3G|@Or2rC5jN9J| zdZ1+v&{79zxdXJQ1e5E49t6g$oR3lV1O5tV&|ldG_$#tOfAulosn~=(ZF-=mdLE3u z700lrLk{4to(E^%j*_syvo%E8;?Yi*5EL{=g51HHoSrU9i6ve7{|?$~Y33zhd)=&8M0=JU9~7aH-inb`~eKS+%zEcUXvg z<$dC*-tOr{kG$`#xu1I_S9>$$+4L31y`3{m-g4(`x5wC4#&BC2v$ZjNB^N*Nm=`~Y zzbh1Yj01aw&Qz(*+0%2Y)X`9t8hg6pk-YDVdo%wR_nJGvP937xg4&jESM~-Cc1hs@ zqF&*_+~9k&D|@!O0Bp?@+L|Y{pboF3<_X8Fd6MY>_a_J>0M8N>gJ}c-9D(~2U;LJw iL3WtD7TM#`_n(J)P@QHE;hkRN4*vmr%+r*sHUI#)k0zP` literal 0 HcmV?d00001 From 6f10d8fe676ff69161a8134fc0fbec5d268e3732 Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Wed, 26 Sep 2018 18:02:45 -0400 Subject: [PATCH 26/32] replace tables with diagrams --- docs/operational_guide/bootstrapping.md | 46 +++++++++++++------------ 1 file changed, 24 insertions(+), 22 deletions(-) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index 79503fc681..2e8ffe6269 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -37,31 +37,33 @@ On a shard-by-shard basis, the `commitlog` bootstrapper will consult the cluster The peer bootstrapper's responsibility is to stream in data for shard/ranges from other M3DB nodes (peers) in the cluster. This bootstrapper is only useful in M3DB clusters with more than a single node *and* where the replication factor is set to a value larger than 1. The `peers` bootstrapper will determine whether or not it can satisfy a bootstrap request on a shard-by-shard basis by consulting the cluster placement and determining if there are enough peers to satisfy the bootstrap request. For example, imagine the following M3DB placement where node 1 is trying to perform a peer bootstrap: -| Node | Shard | State | -|------|-------|-------| -| 1 | 1 | I | -| 1 | 2 | I | -| 1 | 3 | I | -| 2 | 1 | I | -| 2 | 2 | I | -| 2 | 3 | I | -| 3 | 1 | A | -| 3 | 2 | A | -| 3 | 3 | A | + ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ + │ Host 1 │ │ Host 2 │ │ Host 3 │ +────┴─────────────────┴──────────┴─────────────────┴────────┴─────────────────┴─── +┌─────────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐ +│ │ │ │ │ │ +│ │ │ │ │ │ +│ Shard 1: Initializing │ │ Shard 1: Initializing │ │ Shard 1: Available │ +│ Shard 2: Initializing │ │ Shard 2: Initializing │ │ Shard 2: Available │ +│ Shard 3: Initializing │ │ Shard 3: Initializing │ │ Shard 3: Available │ +│ │ │ │ │ │ +│ │ │ │ │ │ +└─────────────────────────┘ └───────────────────────┘ └──────────────────────┘ In this case, the peer bootstrapper running on node 1 will not be able to fullfill any requests because node 2 is in the `Initializing` state for all of its shards and cannot fulfill bootstrap requests. This means that node 1's peer bootstrapper cannot meet its default consistency level of majority for bootstrapping (1 < 2 which is majority with a replication factor of 3). 
On the other hand, node 1 would be able to peer bootstrap in the following placement because its peers (nodes 2/3) are `Available` for all of their shards: -| Node | Shard | State | -|------|-------|-------| -| 1 | 1 | I | -| 1 | 2 | I | -| 1 | 3 | I | -| 2 | 1 | A | -| 2 | 2 | A | -| 2 | 3 | A | -| 3 | 1 | A | -| 3 | 2 | A | -| 3 | 3 | A | + ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ + │ Host 1 │ │ Host 2 │ │ Host 3 │ +────┴─────────────────┴──────────┴─────────────────┴────────┴─────────────────┴─── +┌─────────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐ +│ │ │ │ │ │ +│ │ │ │ │ │ +│ Shard 1: Initializing │ │ Shard 1: Available │ │ Shard 1: Available │ +│ Shard 2: Initializing │ │ Shard 2: Available │ │ Shard 2: Available │ +│ Shard 3: Initializing │ │ Shard 3: Available │ │ Shard 3: Available │ +│ │ │ │ │ │ +│ │ │ │ │ │ +└─────────────────────────┘ └───────────────────────┘ └──────────────────────┘ Note that a bootstrap consistency level of majority is the default value, but can be modified by changing the value of the key "m3db.client.bootstrap-consistency-level" in [etcd](https://coreos.com/etcd/) to one of: "none", "one", "unstrict_majority" (attempt to read from majority, but settle for less if any errors occur), "majority" (strict majority), and "all". From 232337e2b65e8c21c3b9c11aa7d89aa2a41d636d Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Wed, 26 Sep 2018 18:18:04 -0400 Subject: [PATCH 27/32] Add diagrams into block quotes --- docs/operational_guide/bootstrapping.md | 4 ++++ docs/operational_guide/placement.md | 8 +++++++- 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index 2e8ffe6269..24ff61970b 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -37,6 +37,7 @@ On a shard-by-shard basis, the `commitlog` bootstrapper will consult the cluster The peer bootstrapper's responsibility is to stream in data for shard/ranges from other M3DB nodes (peers) in the cluster. This bootstrapper is only useful in M3DB clusters with more than a single node *and* where the replication factor is set to a value larger than 1. The `peers` bootstrapper will determine whether or not it can satisfy a bootstrap request on a shard-by-shard basis by consulting the cluster placement and determining if there are enough peers to satisfy the bootstrap request. For example, imagine the following M3DB placement where node 1 is trying to perform a peer bootstrap: +``` ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ Host 1 │ │ Host 2 │ │ Host 3 │ ────┴─────────────────┴──────────┴─────────────────┴────────┴─────────────────┴─── @@ -49,9 +50,11 @@ The peer bootstrapper's responsibility is to stream in data for shard/ranges fro │ │ │ │ │ │ │ │ │ │ │ │ └─────────────────────────┘ └───────────────────────┘ └──────────────────────┘ +``` In this case, the peer bootstrapper running on node 1 will not be able to fullfill any requests because node 2 is in the `Initializing` state for all of its shards and cannot fulfill bootstrap requests. This means that node 1's peer bootstrapper cannot meet its default consistency level of majority for bootstrapping (1 < 2 which is majority with a replication factor of 3). 
On the other hand, node 1 would be able to peer bootstrap in the following placement because its peers (nodes 2/3) are `Available` for all of their shards: +``` ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ Host 1 │ │ Host 2 │ │ Host 3 │ ────┴─────────────────┴──────────┴─────────────────┴────────┴─────────────────┴─── @@ -64,6 +67,7 @@ In this case, the peer bootstrapper running on node 1 will not be able to fullfi │ │ │ │ │ │ │ │ │ │ │ │ └─────────────────────────┘ └───────────────────────┘ └──────────────────────┘ +``` Note that a bootstrap consistency level of majority is the default value, but can be modified by changing the value of the key "m3db.client.bootstrap-consistency-level" in [etcd](https://coreos.com/etcd/) to one of: "none", "one", "unstrict_majority" (attempt to read from majority, but settle for less if any errors occur), "majority" (strict majority), and "all". diff --git a/docs/operational_guide/placement.md b/docs/operational_guide/placement.md index 6e4273bfcc..9ea9792201 100644 --- a/docs/operational_guide/placement.md +++ b/docs/operational_guide/placement.md @@ -75,6 +75,7 @@ Replication factor: 3 Replication factor: 3 +``` ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ Host 1 │ │ Host 2 │ │ Host 3 │ │ Host 4 │ ┌──────────────────────────┬─────┴─────────────────┴─────┬────┴─────────────────┴────┬───┴─────────────────┴───┬───┴─────────────────┴───┐ @@ -112,11 +113,13 @@ Replication factor: 3 │ │ └─────────────────────────┘ │ └───────────────────────┘ │ └──────────────────────┘│ │ │ │ │ │ │ │ └──────────────────────────┴─────────────────────────────┴───────────────────────────┴─────────────────────────┴─────────────────────────┘ +``` ## Sample Cluster State Transitions - Node Replace Replication factor: 3 +``` ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ Host 1 │ │ Host 2 │ │ Host 3 │ │ Host 4 │ ┌──────────────────────────┬─────┴─────────────────┴─────┬────┴─────────────────┴────┬───┴─────────────────┴───┬───┴─────────────────┴───┐ @@ -154,9 +157,11 @@ Replication factor: 3 │ │ └─────────────────────────┘ │ └───────────────────────┘ │ │└──────────────────────┘ │ │ │ │ │ │ │ └──────────────────────────┴─────────────────────────────┴───────────────────────────┴─────────────────────────┴─────────────────────────┘ +``` ## Cluster State Transitions - Placement Updates Initiation +``` ┌────────────────────────────────┐ │ Host 1 │ │ │ @@ -202,4 +207,5 @@ Replication factor: 3 │ Shard 2: Available │◀────updates placement (via etcd) to │ Shard 3: Available │ indicate shard state is Available │ │ -└────────────────────────────────┘ \ No newline at end of file +└────────────────────────────────┘ +``` \ No newline at end of file From b56e01d47dcb5a1b7e6024e702266c37c43ea7e0 Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Wed, 26 Sep 2018 18:35:45 -0400 Subject: [PATCH 28/32] Update placement doc --- docs/operational_guide/placement.md | 33 ++++++++++++++++++----------- 1 file changed, 21 insertions(+), 12 deletions(-) diff --git a/docs/operational_guide/placement.md b/docs/operational_guide/placement.md index 9ea9792201..f01c34e14f 100644 --- a/docs/operational_guide/placement.md +++ b/docs/operational_guide/placement.md @@ -30,6 +30,9 @@ The `Leaving` state indicates that a node has been marked for removal from the c ## Sample Cluster State Transitions - Node Add +Node adds are performed by adding the new host to the placement. 
Some portion of the existing shards will be assigned to the new node based on its weight, and they will begin in the `Initializing` state. Similarly, the shards will be marked as `Leaving` on the node that are destined to lose ownership of them. Once the new node finishes bootstrapping the shards, it will update the placement to indicate that the shards it owns are `Available` and that the `Leaving` host should no longer own that shard in the placement. + +``` Replication factor: 3 ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ @@ -39,7 +42,7 @@ Replication factor: 3 │ │ │ │ │ │ │ │ │ ││ │ │ │ │ │ │ │ │ │ │ ││ │ │ │ │ Shard 1: Available │ │ │ Shard 1: Available │ │ │ Shard 1: Available ││ │ -│ Initial Placement │ │ Shard 2: Available │ │ │ Shard 2: Available │ │ │ Shard 2: Available ││ │ +│ 1) Initial Placement │ │ Shard 2: Available │ │ │ Shard 2: Available │ │ │ Shard 2: Available ││ │ │ │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ │ Shard 3: Available ││ │ │ │ │ │ │ │ │ │ │ ││ │ │ │ │ │ │ │ │ │ │ ││ │ @@ -50,7 +53,7 @@ Replication factor: 3 │ │ │ │ │ │ │ │ │ │││ │ │ │ │ │ │ │ │ │ │ │ │││ │ │ │ │ │ Shard 1: Leaving │ │ │ Shard 1: Available │ │ │ Shard 1: Available │││Shard 1: Initializing │ │ -│ Begin Node Add │ │ Shard 2: Available │ │ │ Shard 2: Leaving │ │ │ Shard 2: Available │││Shard 2: Initializing │ │ +│ 2) Begin Node Add │ │ Shard 2: Available │ │ │ Shard 2: Leaving │ │ │ Shard 2: Available │││Shard 2: Initializing │ │ │ │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ │ Shard 3: Leaving │││Shard 3: Initializing │ │ │ │ │ │ │ │ │ │ │ │││ │ │ │ │ │ │ │ │ │ │ │ │││ │ │ @@ -62,20 +65,22 @@ Replication factor: 3 │ │ │ │ │ │ │ │ │ │││ │ │ │ │ │ │ │ │ │ │ │ │││ │ │ │ │ │ Shard 2: Available │ │ │ Shard 1: Available │ │ │ Shard 1: Available │││ Shard 1: Available │ │ -│ Complete Node Add │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ │ Shard 2: Available │││ Shard 2: Available │ │ +│ 3) Complete Node Add │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ │ Shard 2: Available │││ Shard 2: Available │ │ │ │ │ │ │ │ │ │ │ │││ Shard 3: Available │ │ │ │ │ │ │ │ │ │ │ │││ │ │ │ │ │ │ │ │ │ │ │ │││ │ │ │ │ └─────────────────────────┘ │ └───────────────────────┘ │ └──────────────────────┘│└──────────────────────┘ │ │ │ │ │ │ │ └──────────────────────────┴─────────────────────────────┴───────────────────────────┴─────────────────────────┴─────────────────────────┘ - +``` ## Sample Cluster State Transitions - Node Remove -Replication factor: 3 +Node removes are performed by updating the placement such that all the shards on the host that will be removed from the cluster are marked as `Leaving` and those shards are distributed to the remaining nodes (based on their weight) and assigned a state of `Initializing`. Once the existing nodes that are taking ownership of the leaving nodes shards finish bootstrapping, they will update the placement to indicate that the shards that they just acquired are `Available` and that the leaving host should no longer own those shards in the placement. 
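
The phrase "based on their weight" above matters: the leaving node's shards are not handed to a single survivor, they are spread across the remaining hosts in proportion to their configured weights. The sketch below uses made-up numbers to show one possible redistribution for the three-shard example, and the fenced diagram that follows walks through the same transition state by state.

```yaml
# Illustrative only: hosts, weights, and assignments are made-up numbers.
leaving_host: host4            # owned shards 1, 2, 3, now marked LEAVING
remaining_hosts:
  host1: { weight: 100, gains: [1] }   # shard 1 becomes INITIALIZING here
  host2: { weight: 100, gains: [2] }   # shard 2 becomes INITIALIZING here
  host3: { weight: 100, gains: [3] }   # shard 3 becomes INITIALIZING here
# A host with a higher weight would tend to receive proportionally more of
# the leaving host's shards.
```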
``` +Replication factor: 3 + ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ Host 1 │ │ Host 2 │ │ Host 3 │ │ Host 4 │ ┌──────────────────────────┬─────┴─────────────────┴─────┬────┴─────────────────┴────┬───┴─────────────────┴───┬───┴─────────────────┴───┐ @@ -83,7 +88,7 @@ Replication factor: 3 │ │ │ │ │ │ │ │ │ │││ │ │ │ │ │ │ │ │ │ │ │ │││ │ │ │ │ │ Shard 2: Available │ │ │ Shard 1: Available │ │ │ Shard 1: Available │││ Shard 1: Available │ │ -│ Initial Placement │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ │ Shard 2: Available │││ Shard 2: Available │ │ +│ 1) Initial Placement │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ │ Shard 2: Available │││ Shard 2: Available │ │ │ │ │ │ │ │ │ │ │ │││ Shard 3: Available │ │ │ │ │ │ │ │ │ │ │ │││ │ │ │ │ │ │ │ │ │ │ │ │││ │ │ @@ -94,7 +99,7 @@ Replication factor: 3 │ │ │ │ │ │ │ ││ │││ │ │ │ │ │ │ │ │ │ ││ Shard 1: Available │││ │ │ │ │ │ Shard 1: Initializing │ │ │ Shard 1: Available │ ││ Shard 2: Available │││ Shard 1: Leaving │ │ -│ Begin Node Remove │ │ Shard 2: Available │ │ │ Shard 2: Initializing│ ││ Shard 3: Initializing│││ Shard 2: Leaving │ │ +│ 2) Begin Node Remove │ │ Shard 2: Available │ │ │ Shard 2: Initializing│ ││ Shard 3: Initializing│││ Shard 2: Leaving │ │ │ │ │ Shard 3: Available │ │ │ Shard 3: Available │ ││ │││ Shard 3: Leaving │ │ │ │ │ │ │ │ │ ││ │││ │ │ │ │ │ │ │ │ │ ││ │││ │ │ @@ -106,7 +111,7 @@ Replication factor: 3 │ │ │ │ │ │ │ │ │ ││ │ │ │ │ │ │ │ │ │ │ ││ │ │ │ │ Shard 1: Avaiable │ │ │ Shard 1: Available │ │ │ Shard 1: Available ││ │ -│ Complete Node Add │ │ Shard 2: Available │ │ │ Shard 2: Available │ │ │ Shard 2: Available ││ │ +│ 3) Complete Node Add │ │ Shard 2: Available │ │ │ Shard 2: Available │ │ │ Shard 2: Available ││ │ │ │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ │ Shard 3: Available ││ │ │ │ │ │ │ │ │ │ │ ││ │ │ │ │ │ │ │ │ │ │ ││ │ @@ -117,9 +122,11 @@ Replication factor: 3 ## Sample Cluster State Transitions - Node Replace -Replication factor: 3 +Node replaces are performed by updating the placement such that all the shards on the host that will be removed from the cluster are marked as `Leaving` and those shards are all added to the host that is being added and assigned a state of `Initializing`. Once the replacement node finishes bootstrapping, it will update the placement to indicate that the shards that it acquired are `Available` and that the leaving host should no longer own those shards in the placement. 
``` +Replication factor: 3 + ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ Host 1 │ │ Host 2 │ │ Host 3 │ │ Host 4 │ ┌──────────────────────────┬─────┴─────────────────┴─────┬────┴─────────────────┴────┬───┴─────────────────┴───┬───┴─────────────────┴───┐ @@ -127,7 +134,7 @@ Replication factor: 3 │ │ │ │ │ │ │ │ │ ││ │ │ │ │ │ │ │ │ │ │ ││ │ │ │ │ Shard 1: Available │ │ │ Shard 1: Available │ │ │ Shard 1: Available ││ │ -│ Initial Placement │ │ Shard 2: Available │ │ │ Shard 2: Available │ │ │ Shard 2: Available ││ │ +│ 1) Initial Placement │ │ Shard 2: Available │ │ │ Shard 2: Available │ │ │ Shard 2: Available ││ │ │ │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ │ Shard 3: Available ││ │ │ │ │ │ │ │ │ │ │ ││ │ │ │ │ │ │ │ │ │ │ ││ │ @@ -138,7 +145,7 @@ Replication factor: 3 │ │ │ │ │ │ │ ││ │││ │ │ │ │ │ │ │ │ │ ││ │││ │ │ │ │ │ Shard 1: Available │ │ │ Shard 1: Available │ ││ Shard 1: Leaving │││Shard 1: Initializing │ │ -│ Begin Node Remove │ │ Shard 2: Available │ │ │ Shard 2: Available │ ││ Shard 2: Leaving │││Shard 2: Initializing │ │ +│ 2) Begin Node Remove │ │ Shard 2: Available │ │ │ Shard 2: Available │ ││ Shard 2: Leaving │││Shard 2: Initializing │ │ │ │ │ Shard 3: Available │ │ │ Shard 3: Available │ ││ Shard 3: Leaving │││Shard 3: Initializing │ │ │ │ │ │ │ │ │ ││ │││ │ │ │ │ │ │ │ │ │ ││ │││ │ │ @@ -150,7 +157,7 @@ Replication factor: 3 │ │ │ │ │ │ │ │ ││ │ │ │ │ │ │ │ │ │ │ ││ │ │ │ │ │ Shard 1: Avaiable │ │ │ Shard 1: Available │ │ ││ Shard 1: Available │ │ -│ Complete Node Add │ │ Shard 2: Available │ │ │ Shard 2: Available │ │ ││ Shard 2: Available │ │ +│ 3) Complete Node Add │ │ Shard 2: Available │ │ │ Shard 2: Available │ │ ││ Shard 2: Available │ │ │ │ │ Shard 3: Available │ │ │ Shard 3: Available │ │ ││ Shard 3: Available │ │ │ │ │ │ │ │ │ │ ││ │ │ │ │ │ │ │ │ │ │ ││ │ │ @@ -161,6 +168,8 @@ Replication factor: 3 ## Cluster State Transitions - Placement Updates Initiation +The diagram below depicts the sequence of events that happen during a node replace and illustrates which entity is performing the placement update (in etcd) at each step. 
+ ``` ┌────────────────────────────────┐ │ Host 1 │ From de139612632d6c4b222b25cfd3ea7c6d8ee252c9 Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Wed, 26 Sep 2018 18:43:00 -0400 Subject: [PATCH 29/32] Add new docs to menu --- docs/operational_guide/placement.md | 2 +- mkdocs.yml | 3 +++ 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/docs/operational_guide/placement.md b/docs/operational_guide/placement.md index f01c34e14f..497d1e57d8 100644 --- a/docs/operational_guide/placement.md +++ b/docs/operational_guide/placement.md @@ -217,4 +217,4 @@ The diagram below depicts the sequence of events that happen during a node repla │ Shard 3: Available │ indicate shard state is Available │ │ └────────────────────────────────┘ -``` \ No newline at end of file +``` diff --git a/mkdocs.yml b/mkdocs.yml index a107cbbe56..1c02c9bcf4 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -61,6 +61,9 @@ pages: - "M3DB Single Node Deployment": "how_to/single_node.md" - "M3DB Cluster Deployment, Manually": "how_to/cluster_hard_way.md" - "M3DB on Kubernetes": "how_to/kubernetes.md" + - "Operational Guides": + - "Placement / Topology": "operational_guide/placement.md" + - "Bootstrapping": "operational_guide/bootstrapping.md" - "Integrations": - "Prometheus": "integrations/prometheus.md" - "Troubleshooting": "troubleshooting/index.md" From bd59e49b019bc6f2e208f4eff811837a2559e0c0 Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Wed, 26 Sep 2018 19:09:25 -0400 Subject: [PATCH 30/32] Address review feedback --- docs/operational_guide/bootstrapping.md | 38 ++++++++++----------- docs/operational_guide/placement.md | 44 ++++++++++++------------- 2 files changed, 41 insertions(+), 41 deletions(-) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index 24ff61970b..ab8100f346 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -4,7 +4,7 @@ We recommend reading the [placement operational guide](placement.md) before reading the rest of this document. -When an M3DB node is turned on (or experiences a placement change) it needs to go through a bootstrapping process to determine the integrity of data that it has, replay writes from the commit log, and/or stream missing data from its peers. In most cases, as long as you're running with the default and recommended bootstrapper configuration of: `filesystem,commitlog,peers,uninitialized_topology` then you should not need to worry about the bootstrapping process at all and M3DB will take care of doing the right thing such that you don't lose data and consistency guarantees are met. Note that the order of the configured bootstrappers *does* matter. +When an M3DB node is turned on (goes through a placement change) it needs to go through a bootstrapping process to determine the integrity of data that it has, replay writes from the commit log, and/or stream missing data from its peers. In most cases, as long as you're running with the default and recommended bootstrapper configuration of: `filesystem,commitlog,peers,uninitialized_topology` then you should not need to worry about the bootstrapping process at all and M3DB will take care of doing the right thing such that you don't lose data and consistency guarantees are met. Note that the order of the configured bootstrappers *does* matter. 
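To make that flow concrete, here is a minimal Go sketch of the general shape of the process: the configured bootstrappers are consulted in order, each one fulfills whatever shards it can, and the node refuses to start if anything is left unfulfilled. The types, function names, and shard numbers below are invented for illustration and are not M3DB's actual bootstrap interfaces.

```go
package main

import "fmt"

// shardTimeRanges tracks, per shard, whether its retention window still
// needs to be bootstrapped. A real implementation tracks explicit time
// ranges; a boolean is enough to illustrate the flow.
type shardTimeRanges map[uint32]bool

// bootstrapper is a toy stand-in for one entry in the configured list
// (filesystem, commitlog, peers, uninitialized_topology).
type bootstrapper struct {
	name       string
	canFulfill func(shard uint32) bool
}

// runBootstrap walks the bootstrappers in the configured order, marking
// shards as fulfilled until nothing is pending or the list is exhausted.
func runBootstrap(order []bootstrapper, pending shardTimeRanges) error {
	for _, b := range order {
		for shard, needed := range pending {
			if needed && b.canFulfill(shard) {
				fmt.Printf("%s fulfilled shard %d\n", b.name, shard)
				pending[shard] = false
			}
		}
	}
	for shard, needed := range pending {
		if needed {
			return fmt.Errorf("shard %d unfulfilled, node cannot start", shard)
		}
	}
	return nil
}

func main() {
	// The order matters: earlier bootstrappers get the first chance to
	// fulfill a shard, later ones only see what is still pending.
	order := []bootstrapper{
		{name: "filesystem", canFulfill: func(shard uint32) bool { return shard == 1 }},
		{name: "commitlog", canFulfill: func(shard uint32) bool { return shard == 5 }},
		{name: "peers", canFulfill: func(uint32) bool { return true }},
		{name: "uninitialized_topology", canFulfill: func(uint32) bool { return true }},
	}
	if err := runBootstrap(order, shardTimeRanges{1: true, 5: true, 13: true, 25: true}); err != nil {
		fmt.Println(err)
	}
}
```

Because earlier entries win, reordering the list changes which source of data is trusted first, which is why the ordering warning above matters.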
Generally speaking, we recommend that operators do not modify the bootstrappers configuration, but in the rare case that you need to, this document is designed to help you understand the implications of doing so.

@@ -18,28 +18,28 @@ M3DB currently supports 5 different bootstrappers:

When the bootstrapping process begins, M3DB nodes need to determine two things:

-1. What shards they should bootstrap, which can be determined from the cluster placement.
-2. What time-ranges they need to bootstrap those shards for, which can be determined from the namespace retention.
+1. What shards the bootstrapping node should bootstrap, which can be determined from the cluster placement.
+2. What time-ranges the bootstrapping node needs to bootstrap those shards for, which can be determined from the namespace retention.

-For example, imagine a M3DB node that is responsible for shards 1, 5, 13, and 25 according to the cluster placement. In addition, it has a single namespace called "metrics" with a retention starting 48 hours ago and ending at the current time. When the M3DB node is started, the node will determine that it needs to bootstrap shards 1, 5, 13, and 25 for the time range starting at the current time and ending 48 hours ago. In order to obtain all this data, it will run the configured bootstrappers in the specified order. Every bootstrapper will notify the bootstrapping process of which shard/ranges it was able to bootstrap and the bootstrapping process will continue working its way through the list of bootstrappers until all the shards/ranges required have been marked as fulfilled. Otherwise the M3DB node will fail to start.
+For example, imagine a M3DB node that is responsible for shards 1, 5, 13, and 25 according to the cluster placement. In addition, it has a single namespace called "metrics" with a retention of 48 hours. When the M3DB node is started, the node will determine that it needs to bootstrap shards 1, 5, 13, and 25 for the time range starting at the current time and ending 48 hours ago. In order to obtain all this data, it will run the configured bootstrappers in the specified order. Every bootstrapper will notify the bootstrapping process of which shard/ranges it was able to bootstrap and the bootstrapping process will continue working its way through the list of bootstrappers until all the shards/ranges required have been marked as fulfilled. Otherwise the M3DB node will fail to start.

### Filesystem Bootstrapper

-The `filesystem` bootstrapper's responsibility is to determine which immutable [fileset files](../m3db/architecture/storage.md) exist on disk, and if so, mark them as fulfilled. The `filesystem` bootstrapper achieves this by scanning M3DB's directory structure and determining which fileset files already exist on disk. Unlike the other bootstrappers, the `filesystem` bootstrapper does not need to load any data into memory, it simply verifies the checksums of the data on disk and other components of the M3DB node will handle reading (and caching) the data dynamically once it begins to serve reads.
+The `filesystem` bootstrapper's responsibility is to determine which immutable [Fileset files](../m3db/architecture/storage.md) exist on disk, and if so, mark them as fulfilled. The `filesystem` bootstrapper achieves this by scanning M3DB's directory structure and determining which Fileset files exist on disk.
Unlike the other bootstrappers, the `filesystem` bootstrapper does not need to load any data into memory, it simply verifies the checksums of the data on disk and other components of the M3DB node will handle reading (and caching) the data dynamically once it begins to serve reads. ### Commitlog Bootstrapper -The `commitlog` bootstrapper's responsibility is to read the `commitlog` and snapshot (compacted commitlogs) files on disk and recover any data that has not yet been written out as an immutable fileset file. Unlike the `filesystem` bootstrapper, the commit log bootstrapper cannot simply check which files are on disk in order to determine if it can satisfy a bootstrap request. Instead, the `commitlog` bootstrapper determines whether it can satisfy a bootstrap request using a simple heuristic. +The `commitlog` bootstrapper's responsibility is to read the commitlog and snapshot (compacted commitlogs) files on disk and recover any data that has not yet been written out as an immutable Fileset file. Unlike the `filesystem` bootstrapper, the commit log bootstrapper cannot simply check which files are on disk in order to determine if it can satisfy a bootstrap request. Instead, the `commitlog` bootstrapper determines whether it can satisfy a bootstrap request using a simple heuristic. -On a shard-by-shard basis, the `commitlog` bootstrapper will consult the cluster placement to see if the host it is running on has ever achieved the `Available` status for the specified shard. If so, then the commit log bootstrapper should have all the data since the last fileset file was flushed and will return that it can satisfy any time range for that shard. In other words, the commit log bootstrapper is all-or-nothing for a given shard: it will either return that it can satisfy any time range for a given shard or none at all. In addition, the `commitlog` bootstrapper *assumes* it is running after the `filesystem` bootstrapper. M3DB will not allow you to run with a configuration where the `filesystem` bootstrapper is placed after the `commitlog` bootstrapper, but it will allow you to run the `commitlog` bootstrapper without the `filesystem` bootstrapper which can result in loss of data, depending on the workload. +On a shard-by-shard basis, the `commitlog` bootstrapper will consult the cluster placement to see if the node it is running on has ever achieved the `Available` status for the specified shard. If so, then the commit log bootstrapper should have all the data since the last Fileset file was flushed and will return that it can satisfy any time range for that shard. In other words, the commit log bootstrapper is all-or-nothing for a given shard: it will either return that it can satisfy any time range for a given shard or none at all. In addition, the `commitlog` bootstrapper *assumes* it is running after the `filesystem` bootstrapper. M3DB will not allow you to run with a configuration where the `filesystem` bootstrapper is placed after the `commitlog` bootstrapper, but it will allow you to run the `commitlog` bootstrapper without the `filesystem` bootstrapper which can result in loss of data, depending on the workload. ### Peers Bootstrapper -The peer bootstrapper's responsibility is to stream in data for shard/ranges from other M3DB nodes (peers) in the cluster. This bootstrapper is only useful in M3DB clusters with more than a single node *and* where the replication factor is set to a value larger than 1. 
The `peers` bootstrapper will determine whether or not it can satisfy a bootstrap request on a shard-by-shard basis by consulting the cluster placement and determining if there are enough peers to satisfy the bootstrap request. For example, imagine the following M3DB placement where node 1 is trying to perform a peer bootstrap: +The `peers` bootstrapper's responsibility is to stream in data for shard/ranges from other M3DB nodes (peers) in the cluster. This bootstrapper is only useful in M3DB clusters with more than a single node *and* where the replication factor is set to a value larger than 1. The `peers` bootstrapper will determine whether or not it can satisfy a bootstrap request on a shard-by-shard basis by consulting the cluster placement and determining if there are enough peers to satisfy the bootstrap request. For example, imagine the following M3DB placement where node A is trying to perform a peer bootstrap: ``` ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ - │ Host 1 │ │ Host 2 │ │ Host 3 │ + │ Node A │ │ Node B │ │ Node C │ ────┴─────────────────┴──────────┴─────────────────┴────────┴─────────────────┴─── ┌─────────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐ │ │ │ │ │ │ @@ -52,11 +52,11 @@ The peer bootstrapper's responsibility is to stream in data for shard/ranges fro └─────────────────────────┘ └───────────────────────┘ └──────────────────────┘ ``` -In this case, the peer bootstrapper running on node 1 will not be able to fullfill any requests because node 2 is in the `Initializing` state for all of its shards and cannot fulfill bootstrap requests. This means that node 1's peer bootstrapper cannot meet its default consistency level of majority for bootstrapping (1 < 2 which is majority with a replication factor of 3). On the other hand, node 1 would be able to peer bootstrap in the following placement because its peers (nodes 2/3) are `Available` for all of their shards: +In this case, the `peers` bootstrapper running on node A will not be able to fullfill any requests because node B is in the `Initializing` state for all of its shards and cannot fulfill bootstrap requests. This means that node A's `peers` bootstrapper cannot meet its default consistency level of majority for bootstrapping (1 < 2 which is majority with a replication factor of 3). On the other hand, node A would be able to peer bootstrap its shards in the following placement because its peers (nodes B/C) have sufficient replicas of the shards it needs in the `Available` state: ``` ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ - │ Host 1 │ │ Host 2 │ │ Host 3 │ + │ Node A │ │ Node B │ │ Node C │ ────┴─────────────────┴──────────┴─────────────────┴────────┴─────────────────┴─── ┌─────────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐ │ │ │ │ │ │ @@ -69,15 +69,15 @@ In this case, the peer bootstrapper running on node 1 will not be able to fullfi └─────────────────────────┘ └───────────────────────┘ └──────────────────────┘ ``` -Note that a bootstrap consistency level of majority is the default value, but can be modified by changing the value of the key "m3db.client.bootstrap-consistency-level" in [etcd](https://coreos.com/etcd/) to one of: "none", "one", "unstrict_majority" (attempt to read from majority, but settle for less if any errors occur), "majority" (strict majority), and "all". 
+Note that a bootstrap consistency level of majority is the default value, but can be modified by changing the value of the key "m3db.client.bootstrap-consistency-level" in [etcd](https://coreos.com/etcd/) to one of: "none", "one", "unstrict_majority" (attempt to read from majority, but settle for less if any errors occur), "majority" (strict majority), and "all". For example, if an entire cluster with a replication factor of 3 was restarted simultaneously, all the nodes would get stuck in an infinite loop trying to peer bootstrap from each other and not achieving majority until an operator modified this value.

-**Note**: Any bootstrappers configuration that does not include the peers bootstrapper will be unable to handle dynamic placement changes of any kind.
+**Note**: Any bootstrappers configuration that does not include the `peers` bootstrapper will be unable to handle dynamic placement changes of any kind.

### Uninitialized Topology Bootstrapper

-The purpose of the uninitialized topology bootstrapper is to succeed bootstraps for all time ranges for shards that have never been completely bootstrapped (at a cluster level). This allows us to run the default bootstrapper configuration of: `filesystem,commitlog,peers,topology_uninitialized` such that the `filesystem` and `commitlog` bootstrappers are used by default in node restarts, the peer bootstrapper is used for node adds/removes/replaces, and bootstraps still succeed for brand new placement where both the `commitlog` and `peers` bootstrappers will be unable to succeed any bootstraps. In other words, the uninitialized topology bootstrapper allows us to place the `commitlog` bootstrapper *before* the `peers` bootstrapper and still succeed bootstraps with brand new placements without resorting to using the noop-all bootstrapper which suceeds bootstraps for all shard/time-ranges regardless of the status of the placement.
+The purpose of the `uninitialized_topology` bootstrapper is to succeed bootstraps for all time ranges for shards that have never been completely bootstrapped (at a cluster level). This allows us to run the default bootstrapper configuration of: `filesystem,commitlog,peers,uninitialized_topology` such that the `filesystem` and `commitlog` bootstrappers are used by default in node restarts, the `peers` bootstrapper is used for node adds/removes/replaces, and bootstraps still succeed for brand new placements where both the `commitlog` and `peers` bootstrappers will be unable to succeed any bootstraps. In other words, the `uninitialized_topology` bootstrapper allows us to place the `commitlog` bootstrapper *before* the `peers` bootstrapper and still succeed bootstraps with brand new placements without resorting to using the noop-all bootstrapper which succeeds bootstraps for all shard/time-ranges regardless of the status of the placement.

-The uninitialized topology bootstrapper determines whether a placement is "new" for a given shard by counting the number of hosts in the `Initializing` state and `Leaving` states and there are more `Initializing` than `Leaving`, then it succeeds the bootstrap because that means the placement has never reached a state where all hosts are `Available`.
+The `uninitialized_topology` bootstrapper determines whether a placement is "new" for a given shard by counting the number of nodes in the `Initializing` and `Leaving` states; if there are more `Initializing` than `Leaving`, then it succeeds the bootstrap because that means the placement has never reached a state where all nodes are `Available`.

### No Operational All Bootstrapper

@@ -93,7 +93,7 @@ This is the default bootstrappers configuration for M3DB and will behave "as exp

In the general case, the node will use only the `filesystem` and `commitlog` bootstrappers on node startup. However, in the case of a node add/remove/replace, the `commitlog` bootstrapper will detect that it is unable to fulfill the bootstrap request (because the node has never reached the `Available` state) and defer to the `peers` bootstrapper to stream in the data.

-Additionally, if it is a brand new placement where even the peers bootstrapper cannot fulfill the bootstrap, this will be detected by the `uninitialized_topology` bootstrapper which will succeed the bootstrap.
+Additionally, if it is a brand new placement where even the `peers` bootstrapper cannot fulfill the bootstrap, this will be detected by the `uninitialized_topology` bootstrapper which will succeed the bootstrap.

@@ -101,12 +101,12 @@ This bootstrapping configuration will work just fine if nodes are never added/re

#### peers,uninitialized_topology

-Every time a node is restarted, it will attempt to stream in *all* of the data that it is responsible for from its peers, completely ignoring the immutable fileset files it already has on disk. We do not recommend running in this mode as it can lead to violations of M3DB's consistency guarantees due to the fact that the commit logs are being ignored, however, it *can* be useful if you want to repair the data on a node by forcing it to stream from its peers.
+Every time a node is restarted, it will attempt to stream in *all* of the data that it is responsible for from its peers, completely ignoring the immutable Fileset files it already has on disk. We do not recommend running in this mode as it can lead to violations of M3DB's consistency guarantees due to the fact that the commit logs are being ignored, however, it *can* be useful if you want to repair the data on a node by forcing it to stream from its peers.

#### filesystem,uninitialized_topology

-Every time a node is restarted it will utilize the immutable fileset files its already written out to disk, but any data that it had received since it wrote out the last set of immutable files will be lost.
+Every time a node is restarted it will utilize the immutable Fileset files it has already written out to disk, but any data that it had received since it wrote out the last set of immutable files will be lost.

#### commitlog,uninitialized_topology

-Every time a node is restarted it will read all the commit log and snapshot files it has on disk, but it will ignore all the data in the immutable fileset files that it has already written.
+Every time a node is restarted it will read all the commit log and snapshot files it has on disk, but it will ignore all the data in the immutable Fileset files that it has already written.
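As a rough illustration of the `uninitialized_topology` heuristic described above, the following Go sketch counts `Initializing` versus `Leaving` replicas for a single shard. The types and function are invented for this example and are not M3DB's actual placement code.

```go
package main

import "fmt"

// shardState mirrors the three states a node/shard pair can be in within
// the placement stored in etcd.
type shardState int

const (
	initializing shardState = iota
	available
	leaving
)

// isNewPlacementForShard applies the heuristic: if a shard has more
// replicas in the Initializing state than in the Leaving state across the
// whole placement, it has never been fully bootstrapped at the cluster
// level, so its bootstrap can be succeeded with no data.
func isNewPlacementForShard(replicaStates []shardState) bool {
	var numInitializing, numLeaving int
	for _, s := range replicaStates {
		switch s {
		case initializing:
			numInitializing++
		case leaving:
			numLeaving++
		}
	}
	return numInitializing > numLeaving
}

func main() {
	// Brand new placement: every replica of the shard is Initializing.
	fmt.Println(isNewPlacementForShard([]shardState{initializing, initializing, initializing})) // true

	// Node replace on an established cluster: one replica Leaving, one
	// Initializing, one Available, so the shard is not considered "new".
	fmt.Println(isNewPlacementForShard([]shardState{leaving, initializing, available})) // false
}
```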
diff --git a/docs/operational_guide/placement.md b/docs/operational_guide/placement.md
index 497d1e57d8..f49abdc615 100644
--- a/docs/operational_guide/placement.md
+++ b/docs/operational_guide/placement.md
@@ -4,25 +4,25 @@

**Note**: The words *placement* and *topology* are used interchangeably throughout the M3DB documentation and codebase.

-A M3DB cluster has exactly one Placement. That placement maps the cluster's shard replicas to hosts. A cluster also has 0 or more namespaces, and each host serves every namespace for the shards it owns. In other words, if the cluster topology states that host A owns shards 1, 2, and 3 then host A will own shards 1, 2, 3 for all configured namespaces in the cluster.
+A M3DB cluster has exactly one Placement. That placement maps the cluster's shard replicas to nodes. A cluster also has 0 or more namespaces (analogous to tables in other databases), and each node serves every namespace for the shards it owns. In other words, if the cluster topology states that node A owns shards 1, 2, and 3 then node A will own shards 1, 2, 3 for all configured namespaces in the cluster.

-M3DB stores its placement (mapping of which hosts are responsible for which shards) in [etcd](https://coreos.com/etcd/). There are three possible states that each host/shard pair can be in:
+M3DB stores its placement (mapping of which nodes are responsible for which shards) in [etcd](https://coreos.com/etcd/). There are three possible states that each node/shard pair can be in:

1. `Initializing`
2. `Available`
3. `Leaving`

-Note that these states are not a reflection of the current status of an M3DB node, but an indication of whether a given node has ever successfully bootstrapped and taken ownership of a given shard (achieved goal state). For example, in a new cluster all the nodes will begin with all of their shards in the `Initializing` state. Once all the nodes finish bootstrapping, they will mark all of their shards as `Available`. If all the M3DB nodes are stopped at the same time, the cluster placement will still show all of the shards for all of the hosts as `Available`.
+Note that these states are not a reflection of the current status of an M3DB node, but an indication of whether a given node has ever successfully bootstrapped and taken ownership of a given shard (achieved goal state). For example, in a new cluster all the nodes will begin with all of their shards in the `Initializing` state. Once all the nodes finish bootstrapping, they will mark all of their shards as `Available`. If all the M3DB nodes are stopped at the same time, the cluster placement will still show all of the shards for all of the nodes as `Available`.

## Initializing State

-The `Initializing` state is the state in which all new host/shard combinations begin. For example, upon creating a new placement all the host/shard pairs will begin in the `Initializing` state and only once they have successfully bootstrapped will they transition to the `Available`` state.
+The `Initializing` state is the state in which all new node/shard combinations begin. For example, upon creating a new placement all the node/shard pairs will begin in the `Initializing` state and only once they have successfully bootstrapped will they transition to the `Available` state.

-The `Initializing` state is not limited to new placement, however, as it can also occur during placement changes.
For example, during a node add/replace the new host will begin with all of its shards in the `Initializing` state until it can stream the data it is missing from its peers. During a node removal, all of the hosts who receive new shards (as a result of taking over the responsibilities of the node that is leaving) will begin with those shards marked as `Initializing` until they can stream in the data from the node leaving the cluster, or one of its peers. +The `Initializing` state is not limited to new placement, however, as it can also occur during placement changes. For example, during a node add/replace the new node will begin with all of its shards in the `Initializing` state until it can stream the data it is missing from its peers. During a node removal, all of the nodes who receive new shards (as a result of taking over the responsibilities of the node that is leaving) will begin with those shards marked as `Initializing` until they can stream in the data from the node leaving the cluster, or one of its peers. ## Available State -Once a node with a shard in the `Initializing` state successfully bootstraps all of the data for that shard, it will mark that shard as `Available` (for the single host) in the cluster placement. +Once a node with a shard in the `Initializing` state successfully bootstraps all of the data for that shard, it will mark that shard as `Available` (for the single node) in the cluster placement. ## Leaving State @@ -30,13 +30,13 @@ The `Leaving` state indicates that a node has been marked for removal from the c ## Sample Cluster State Transitions - Node Add -Node adds are performed by adding the new host to the placement. Some portion of the existing shards will be assigned to the new node based on its weight, and they will begin in the `Initializing` state. Similarly, the shards will be marked as `Leaving` on the node that are destined to lose ownership of them. Once the new node finishes bootstrapping the shards, it will update the placement to indicate that the shards it owns are `Available` and that the `Leaving` host should no longer own that shard in the placement. +Node adds are performed by adding the new node to the placement. Some portion of the existing shards will be assigned to the new node based on its weight, and they will begin in the `Initializing` state. Similarly, the shards will be marked as `Leaving` on the node that are destined to lose ownership of them. Once the new node finishes bootstrapping the shards, it will update the placement to indicate that the shards it owns are `Available` and that the `Leaving` node should no longer own that shard in the placement. ``` Replication factor: 3 ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ - │ Host 1 │ │ Host 2 │ │ Host 3 │ │ Host 4 │ + │ Node A │ │ Node B │ │ Node C │ │ Node D │ ┌──────────────────────────┬─────┴─────────────────┴─────┬────┴─────────────────┴────┬───┴─────────────────┴───┬───┴─────────────────┴───┐ │ │ ┌─────────────────────────┐ │ ┌───────────────────────┐ │ ┌──────────────────────┐│ │ │ │ │ │ │ │ │ │ │ ││ │ @@ -76,13 +76,13 @@ Replication factor: 3 ## Sample Cluster State Transitions - Node Remove -Node removes are performed by updating the placement such that all the shards on the host that will be removed from the cluster are marked as `Leaving` and those shards are distributed to the remaining nodes (based on their weight) and assigned a state of `Initializing`. 
Once the existing nodes that are taking ownership of the leaving nodes shards finish bootstrapping, they will update the placement to indicate that the shards that they just acquired are `Available` and that the leaving host should no longer own those shards in the placement. +Node removes are performed by updating the placement such that all the shards on the node that will be removed from the cluster are marked as `Leaving` and those shards are distributed to the remaining nodes (based on their weight) and assigned a state of `Initializing`. Once the existing nodes that are taking ownership of the leaving nodes shards finish bootstrapping, they will update the placement to indicate that the shards that they just acquired are `Available` and that the leaving node should no longer own those shards in the placement. ``` Replication factor: 3 ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ - │ Host 1 │ │ Host 2 │ │ Host 3 │ │ Host 4 │ + │ Node A │ │ Node B │ │ Node C │ │ Node D │ ┌──────────────────────────┬─────┴─────────────────┴─────┬────┴─────────────────┴────┬───┴─────────────────┴───┬───┴─────────────────┴───┐ │ │ ┌─────────────────────────┐ │ ┌───────────────────────┐ │ ┌──────────────────────┐│┌──────────────────────┐ │ │ │ │ │ │ │ │ │ │ │││ │ │ @@ -122,13 +122,13 @@ Replication factor: 3 ## Sample Cluster State Transitions - Node Replace -Node replaces are performed by updating the placement such that all the shards on the host that will be removed from the cluster are marked as `Leaving` and those shards are all added to the host that is being added and assigned a state of `Initializing`. Once the replacement node finishes bootstrapping, it will update the placement to indicate that the shards that it acquired are `Available` and that the leaving host should no longer own those shards in the placement. +Node replaces are performed by updating the placement such that all the shards on the node that will be removed from the cluster are marked as `Leaving` and those shards are all added to the node that is being added and assigned a state of `Initializing`. Once the replacement node finishes bootstrapping, it will update the placement to indicate that the shards that it acquired are `Available` and that the leaving node should no longer own those shards in the placement. 
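To make those two placement updates concrete, here is a purely illustrative Go sketch. The `placement` type, function names, and node/shard values are made up for this example and do not reflect M3DB's actual placement API.

```go
package main

import "fmt"

type shardState string

const (
	initializing shardState = "Initializing"
	available    shardState = "Available"
	leaving      shardState = "Leaving"
)

// placement maps node -> shard ID -> state, mirroring the tables above.
type placement map[string]map[uint32]shardState

// beginNodeReplace marks every shard on the leaving node as Leaving and
// assigns the same shards to the joining node as Initializing.
func beginNodeReplace(p placement, leavingNode, joiningNode string) {
	p[joiningNode] = map[uint32]shardState{}
	for shard := range p[leavingNode] {
		p[leavingNode][shard] = leaving
		p[joiningNode][shard] = initializing
	}
}

// completeNodeReplace runs once the joining node finishes bootstrapping:
// its shards become Available and the leaving node is removed entirely.
func completeNodeReplace(p placement, leavingNode, joiningNode string) {
	for shard := range p[joiningNode] {
		p[joiningNode][shard] = available
	}
	delete(p, leavingNode)
}

func main() {
	// A hypothetical replication factor 3 cluster with three shards.
	p := placement{
		"Node A": {1: available, 2: available, 3: available},
		"Node B": {1: available, 2: available, 3: available},
		"Node C": {1: available, 2: available, 3: available},
	}
	beginNodeReplace(p, "Node C", "Node D")
	completeNodeReplace(p, "Node C", "Node D")
	fmt.Println(p) // Node D now owns shards 1-3 as Available; Node C is gone.
}
```

The important property is that the cluster passes through an intermediate state in which both the leaving and joining nodes appear in the placement, so reads and writes can continue while data is streamed.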
``` Replication factor: 3 ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ - │ Host 1 │ │ Host 2 │ │ Host 3 │ │ Host 4 │ + │ Node A │ │ Node B │ │ Node C │ │ Node D │ ┌──────────────────────────┬─────┴─────────────────┴─────┬────┴─────────────────┴────┬───┴─────────────────┴───┬───┴─────────────────┴───┐ │ │ ┌─────────────────────────┐ │ ┌───────────────────────┐ │ ┌──────────────────────┐│ │ │ │ │ │ │ │ │ │ │ ││ │ @@ -172,13 +172,13 @@ The diagram below depicts the sequence of events that happen during a node repla ``` ┌────────────────────────────────┐ - │ Host 1 │ + │ Node A │ │ │ │ Shard 1: Available │ │ Shard 2: Available │ Operator performs node replace by │ Shard 3: Available │ updating placement in etcd such - │ │ that shards on host 1 are marked - └────────────────────────────────┤ Leaving and shards on host 2 are + │ │ that shards on node A are marked + └────────────────────────────────┤ Leaving and shards on node B are │ marked Initializing └─────────────────────────────────┐ │ @@ -188,7 +188,7 @@ The diagram below depicts the sequence of events that happen during a node repla │ ▼ ┌────────────────────────────────┐ - │ Host 1 │ + │ Node A │ │ │ │ Shard 1: Leaving │ │ Shard 2: Leaving │ @@ -197,24 +197,24 @@ The diagram below depicts the sequence of events that happen during a node repla └────────────────────────────────┘ ┌────────────────────────────────┐ - │ Host 2 │ + │ Node B │ │ │ │ Shard 1: Initializing │ ┌────────────────────────────────┐ │ Shard 2: Initializing │ │ │ │ Shard 3: Initializing │ │ │ │ │ -│ Host 1 │ └────────────────────────────────┘ +│ Node A │ └────────────────────────────────┘ │ │ │ │ │ │ │ │ │ └────────────────────────────────┘ │ │ ┌────────────────────────────────┐ │ -│ Host 2 │ │ +│ Node B │ │ │ │ │ -│ Shard 1: Available │ Host 2 completes bootstrapping and +│ Shard 1: Available │ Node B completes bootstrapping and │ Shard 2: Available │◀────updates placement (via etcd) to -│ Shard 3: Available │ indicate shard state is Available -│ │ +│ Shard 3: Available │ indicate shard state is Available and +│ │ that Node A should no longer own any shards └────────────────────────────────┘ ``` From a2d1b9fcd2dafda429ad1f7d84a256098a3f897b Mon Sep 17 00:00:00 2001 From: Richard Artoul Date: Wed, 26 Sep 2018 19:29:05 -0400 Subject: [PATCH 31/32] Update bootstrapping guide --- docs/operational_guide/bootstrapping.md | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md index ab8100f346..77fc14f13a 100644 --- a/docs/operational_guide/bootstrapping.md +++ b/docs/operational_guide/bootstrapping.md @@ -89,19 +89,27 @@ Now that we've gone over the various bootstrappers, let's consider how M3DB will #### filesystem,commitlog,peers,uninitialized_topology (default) -This is the default bootstrappers configuration for M3DB and will behave "as expected" in the sense that it will maintain M3DB's consistency guarantees at all times, handle node adds/replaces/removes correctly, and still work with brand new topologies. +This is the default bootstrappers configuration for M3DB and will behave "as expected" in the sense that it will maintain M3DB's consistency guarantees at all times, handle node adds/replaces/removes correctly, and still work with brand new placements / topologies. **This is the only configuration that we recommend using in production**. 
In the general case, the node will use only the `filesystem` and `commitlog` bootstrappers on node startup. However, in the case of a node add/remove/replace, the `commitlog` bootstrapper will detect that it is unable to fulfill the bootstrap request (because the node has never reached the `Available` state) and defer to the `peers` bootstrapper to stream in the data.

Additionally, if it is a brand new placement where even the `peers` bootstrapper cannot fulfill the bootstrap, this will be detected by the `uninitialized_topology` bootstrapper which will succeed the bootstrap.

-#### filesystem,commitlog,uninitialized_topology
+#### filesystem,peers,uninitialized_topology

-This bootstrapping configuration will work just fine if nodes are never added/replaced/removed, but will fail when attempting a node add/replace/remove.
+Every time a node is restarted it will attempt to stream in all of the data for any blocks that it has never flushed, which is generally the currently active block and possibly the previous block as well. This mode can be useful if you want to improve performance or save disk space by operating nodes without a commitlog, or want to force a repair of any unflushed blocks. This mode can lead to violations of M3DB's consistency guarantees due to the fact that commit logs are being ignored. In addition, if you lose a replication factor's worth or more of hosts at the same time, the node will not be able to bootstrap unless an operator modifies the bootstrap consistency level configuration in etcd (see `peers` bootstrap section above). Finally, this mode adds additional network and resource pressure on other nodes in the cluster while one node is peer bootstrapping from them which can be problematic in catastrophic scenarios where all the nodes are trying to stream data from each other.

#### peers,uninitialized_topology

-Every time a node is restarted, it will attempt to stream in *all* of the data that it is responsible for from its peers, completely ignoring the immutable Fileset files it already has on disk. We do not recommend running in this mode as it can lead to violations of M3DB's consistency guarantees due to the fact that the commit logs are being ignored, however, it *can* be useful if you want to repair the data on a node by forcing it to stream from its peers.
+Every time a node is restarted, it will attempt to stream in *all* of the data that it is responsible for from its peers, completely ignoring the immutable Fileset files it already has on disk. This mode can be useful if you want to improve performance or save disk space by operating nodes without a commitlog, or want to force a repair of all data on an individual node. This mode can lead to violations of M3DB's consistency guarantees due to the fact that the commit logs are being ignored. In addition, if you lose a replication factor's worth or more of hosts at the same time, the node will not be able to bootstrap unless an operator modifies the bootstrap consistency level configuration in etcd (see `peers` bootstrap section above). Finally, this mode adds additional network and resource pressure on other nodes in the cluster while one node is peer bootstrapping from them which can be problematic in catastrophic scenarios where all the nodes are trying to stream data from each other.
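The majority arithmetic referenced above is simple, and the Go sketch below shows why a replication factor of 3 needs two `Available` peer replicas before a peer bootstrap at the default consistency level can proceed. The function names are invented for illustration and are not the actual M3DB client code.

```go
package main

import "fmt"

// majority returns the number of replicas needed to satisfy a "majority"
// consistency level for a given replication factor.
func majority(replicationFactor int) int {
	return replicationFactor/2 + 1
}

// canPeerBootstrap reports whether a shard can be streamed from peers at
// the majority consistency level, given how many peer replicas of that
// shard are currently Available.
func canPeerBootstrap(replicationFactor, availablePeers int) bool {
	return availablePeers >= majority(replicationFactor)
}

func main() {
	// With a replication factor of 3, majority is 2. If only one peer is
	// Available (for example, the whole cluster restarted at once), the
	// peer bootstrap cannot proceed until the level is lowered in etcd.
	fmt.Println(majority(3))            // 2
	fmt.Println(canPeerBootstrap(3, 1)) // false
	fmt.Println(canPeerBootstrap(3, 2)) // true
}
```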
+
+### Invalid bootstrappers configuration
+
+For the sake of completeness, we've included a short discussion below of some bootstrapping configurations that we consider "invalid" in that they are likely to lose data / violate M3DB's consistency guarantees and/or not handle placement changes in a correct way.
+
+#### filesystem,commitlog,uninitialized_topology
+
+This bootstrapping configuration will work just fine if nodes are never added/replaced/removed, but will fail when attempting a node add/replace/remove.

#### filesystem,uninitialized_topology

Every time a node is restarted it will utilize the immutable Fileset files it has already written out to disk, but any data that it had received since it wrote out the last set of immutable files will be lost.

#### commitlog,uninitialized_topology

Every time a node is restarted it will read all the commit log and snapshot files it has on disk, but it will ignore all the data in the immutable Fileset files that it has already written.

From 2b0862872784e144ab7b1e048b6b7bdea6021a19 Mon Sep 17 00:00:00 2001
From: Richard Artoul
Date: Wed, 26 Sep 2018 19:46:00 -0400
Subject: [PATCH 32/32] Fix headings

---
 docs/operational_guide/bootstrapping.md | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/docs/operational_guide/bootstrapping.md b/docs/operational_guide/bootstrapping.md
index 77fc14f13a..dbf690613e 100644
--- a/docs/operational_guide/bootstrapping.md
+++ b/docs/operational_guide/bootstrapping.md
@@ -23,6 +23,8 @@ When the bootstrapping process begins, M3DB nodes need to determine two things:

For example, imagine a M3DB node that is responsible for shards 1, 5, 13, and 25 according to the cluster placement. In addition, it has a single namespace called "metrics" with a retention of 48 hours. When the M3DB node is started, the node will determine that it needs to bootstrap shards 1, 5, 13, and 25 for the time range starting at the current time and ending 48 hours ago. In order to obtain all this data, it will run the configured bootstrappers in the specified order. Every bootstrapper will notify the bootstrapping process of which shard/ranges it was able to bootstrap and the bootstrapping process will continue working its way through the list of bootstrappers until all the shards/ranges required have been marked as fulfilled. Otherwise the M3DB node will fail to start.

+## Bootstrappers

### Filesystem Bootstrapper

The `filesystem` bootstrapper's responsibility is to determine which immutable [Fileset files](../m3db/architecture/storage.md) exist on disk, and if so, mark them as fulfilled. The `filesystem` bootstrapper achieves this by scanning M3DB's directory structure and determining which Fileset files exist on disk. Unlike the other bootstrappers, the `filesystem` bootstrapper does not need to load any data into memory, it simply verifies the checksums of the data on disk and other components of the M3DB node will handle reading (and caching) the data dynamically once it begins to serve reads.

@@ -83,11 +85,11 @@ The `uninitialized_topology` bootstrapper determines whether a placement is "new

The `noop_all` bootstrapper succeeds all bootstraps regardless of the requested shards/time ranges.

-### Bootstrappers Configuration
+## Bootstrappers Configuration
-#### filesystem,commitlog,peers,uninitialized_topology (default) +### filesystem,commitlog,peers,uninitialized_topology (default) This is the default bootstrappers configuration for M3DB and will behave "as expected" in the sense that it will maintain M3DB's consistency guarantees at all times, handle node adds/replaces/removes correctly, and still work with brand new placements / topologies. **This is the only configuration that we recommend using in production**. @@ -95,26 +97,26 @@ In the general case, the node will use only the `filesystem` and `commitlog` boo Additionally, if it is a brand new placement where even the `peers` bootstrapper cannot fulfill the bootstrap, this will be detected by the `uninitialized_topology` bootstrapper which will succeed the bootstrap. -#### filesystem,peers,uninitialized_topology (default) +### filesystem,peers,uninitialized_topology (default) Everytime a node is restarted it will attempt to stream in all of the the data for any blocks that it has never flushed, which is generally the currently active block and possibly the previous block as well. This mode can be useful if you want to improve performance or save disk space by operating nodes without a commitlog, or want to force a repair of any unflushed blocks. This mode can lead to violations of M3DB's consistency guarantees due to the fact that commit logs are being ignored. In addition, if you lose a replication factors worth or more of hosts at the same time, the node will not be able to bootstrap unless an operator modifies the bootstrap consistency level configuration in etcd (see `peers` bootstrap section above). Finally, this mode adds additional network and resource pressure on other nodes in the cluster while one node is peer bootstrapping from them which can be problematic in catastrophic scenarios where all the nodes are trying to stream data from each other. -#### peers,uninitialized_topology +### peers,uninitialized_topology Every time a node is restarted, it will attempt to stream in *all* of the data that it is responsible for from its peers, completely ignoring the immutable Fileset files it already has on disk. This mode can be useful if you want to improve performance or save disk space by operating nodes without a commitlog, or want to force a repair of all data on an individual node. This mode can lead to violations of M3DB's consistency guarantees due to the fact that the commit logs are being ignored. In addition, if you lose a replication factors worth or more of hosts at the same time, the node will not be able to bootstrap unless an operator modifies the bootstrap consistency level configuration in etcd (see `peers` bootstrap section above). Finally, this mode adds additional network and resource pressure on other nodes in the cluster while one node is peer bootstrapping from them which can be problematic in catastrophic scenarios where all the nodes are trying to stream data from each other. -### Invalid bootstrappers configuration +## Invalid bootstrappers configuration For the sake of completeness, we've included a short discussion below of some bootstrapping configurations that we consider "invalid" in that they are likely to lose data / violate M3DB's consistency guarantees and/or not handle placement changes in a correct way. 
-#### filesystem,commitlog,uninitialized_topology
+### filesystem,commitlog,uninitialized_topology

This bootstrapping configuration will work just fine if nodes are never added/replaced/removed, but will fail when attempting a node add/replace/remove.

-#### filesystem,uninitialized_topology
+### filesystem,uninitialized_topology

Every time a node is restarted it will utilize the immutable Fileset files it has already written out to disk, but any data that it had received since it wrote out the last set of immutable files will be lost.

-#### commitlog,uninitialized_topology
+### commitlog,uninitialized_topology

Every time a node is restarted it will read all the commit log and snapshot files it has on disk, but it will ignore all the data in the immutable Fileset files that it has already written.
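As a closing illustration, the Go sketch below is a toy lint of a bootstrappers list based only on the trade-offs described in this guide; the function and messages are invented for the example and are not validation that M3DB itself performs.

```go
package main

import "fmt"

// reviewBootstrappers flags the obvious gaps in a bootstrappers list,
// following the trade-offs documented above.
func reviewBootstrappers(configured []string) []string {
	has := map[string]bool{}
	for _, name := range configured {
		has[name] = true
	}
	var warnings []string
	if !has["peers"] {
		warnings = append(warnings, "no peers: node adds/replaces/removes cannot be bootstrapped")
	}
	if !has["commitlog"] {
		warnings = append(warnings, "no commitlog: writes since the last flush can only be recovered from peers, or are lost")
	}
	if !has["filesystem"] {
		warnings = append(warnings, "no filesystem: flushed Fileset files on disk are ignored")
	}
	return warnings
}

func main() {
	// An "invalid" configuration from the list above.
	for _, w := range reviewBootstrappers([]string{"filesystem", "commitlog", "uninitialized_topology"}) {
		fmt.Println(w)
	}
}
```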