diff --git a/docs/content/code_assets/README.md b/docs/code_assets/README.md similarity index 100% rename from docs/content/code_assets/README.md rename to docs/code_assets/README.md diff --git a/docs/content/code_assets/commitlog/queue.monopic b/docs/code_assets/commitlog/queue.monopic similarity index 100% rename from docs/content/code_assets/commitlog/queue.monopic rename to docs/code_assets/commitlog/queue.monopic diff --git a/docs/content/glossary/_index.md b/docs/content/glossary/_index.md deleted file mode 100644 index feaf453fb6..0000000000 --- a/docs/content/glossary/_index.md +++ /dev/null @@ -1,47 +0,0 @@ ---- -title: "Glossary" -weight: 11 -chapter: true ---- - -- **Bootstrapping**: Process by which an M3DB node is brought up. Bootstrapping consists of determining - the integrity of data that the node has, replay writes from the commit log, and/or stream missing data - from its peers. - -- **Cardinality**: The number of unique metrics within the M3DB index. Cardinality increases with the - number of unique tag/value combinations that are being emitted. - -- **Datapoint**: A single timestamp/value. Timeseries are composed of multiple datapoints and a series - of tag/value pairs. - -- **Labels**: Pairs of descriptive words that give meaning to a metric. `Tags` and `Labels` are interchangeable terms. - -- **Metric**: A collection of uniquely identifiable tags. - -- **M3**: Highly scalable, distributed metrics platform that is comprised of a native, distributed time - series database, a highly-dynamic and performant aggregation service, a query engine, and other - supporting infrastructure. - -- **M3Coordinator**: A service within M3 that coordinates reads and writes between upstream systems, - such as Prometheus, and downstream systems, such as M3DB. - -- **M3DB**: Distributed time series database influenced by [Gorilla](http://www.vldb.org/pvldb/vol8/p1816-teller.pdf) - and [Cassandra](http://cassandra.apache.org/) released as open source by Uber Technologies. - -- **M3Query**: A distributed query engine for M3DB. Unlike M3Coordinator, M3Query only provides supports for reads. - -- **Namespace**: Similar to a table in other types of databases, namespaces in M3DB have a unique name - and a set of configuration options, such as data retention and block size. - -- **Placement**: Map of the M3DB cluster's shard replicas to nodes. Each M3DB cluster has only one placement. - `Placement` and `Topology` are interchangeable terms. - -- **Shard**: Effectively the same as a "virtual shard" in Cassandra in that it provides an arbitrary - distribution of time series data via a simple hash of the series ID. - -- **Tags**: Pairs of descriptive words that give meaning to a metric. `Tags` and `Labels` are interchangeable terms. - -- **Timeseries**: A series of data points tracking a particular metric over time. - -- **Topology**: Map of the M3DB cluster's shard replicas to nodes. Each M3DB cluster has only one placement. - `Placement` and `Topology` are interchangeable terms. diff --git a/docs/content/glossary/glossary.md b/docs/content/glossary/glossary.md deleted file mode 100644 index a3628526fb..0000000000 --- a/docs/content/glossary/glossary.md +++ /dev/null @@ -1,35 +0,0 @@ ---- -title: "Glossary" -date: 2020-04-21T20:45:40-04:00 -draft: true ---- - -1. **Bootstrapping:** Process by which an M3DB node is brought up. Bootstrapping consists of determining the integrity of data that the node has, replay writes from the commit log, and/or stream missing data from its peers. - -2. 
**Cardinality:** The number of unique metrics within the M3DB index. Cardinality increases with the number of unique tag/value combinations that are being emitted. - -3. **Datapoint:** A single timestamp/value. Timeseries are composed of multiple datapoints and a series of tag/value pairs. - -4. **Labels:** Pairs of descriptive words that give meaning to a metric. Tags and Labels are interchangeable terms. - -5. **Metric:** A collection of uniquely identifiable tags. - -6. **M3:** Highly scalable, distributed metrics platform that is comprised of a native, distributed time series database, a highly-dynamic and performant aggregation service, a query engine, and other supporting infrastructure. - -7. **M3Coordinator:** A service within M3 that coordinates reads and writes between upstream systems, such as Prometheus, and downstream systems, such as M3DB. - -8. **M3DB:** Distributed time series database influenced by Gorilla and Cassandra released as open source by Uber Technologies. - -9. **M3Query:** A distributed query engine for M3DB. Unlike M3Coordinator, M3Query only provides supports for reads. - -10. **Namespace:** Similar to a table in other types of databases, namespaces in M3DB have a unique name and a set of configuration options, such as data retention and block size. - -11. **Placement:** Map of the M3DB cluster's shard replicas to nodes. Each M3DB cluster has only one placement. Placement and Topology are interchangeable terms. - -12. **Shard:** Effectively the same as a "virtual shard" in Cassandra in that it provides an arbitrary distribution of time series data via a simple hash of the series ID. - -13. **Tags:** Pairs of descriptive words that give meaning to a metric. Tags and Labels are interchangeable terms. - -14. **Timeseries:** A series of data points tracking a particular metric over time. - -15. **Topology:** Map of the M3DB cluster's shard replicas to nodes. Each M3DB cluster has only one placement. Placement and Topology are interchangeable terms. diff --git a/docs/content/reference/glossary/index.md b/docs/content/glossary/index.md similarity index 100% rename from docs/content/reference/glossary/index.md rename to docs/content/glossary/index.md diff --git a/docs/content/reference/glossary/test.md b/docs/content/glossary/test.md similarity index 100% rename from docs/content/reference/glossary/test.md rename to docs/content/glossary/test.md diff --git a/docs/content/reference_docs/_index.md b/docs/content/reference_docs/_index.md deleted file mode 100644 index f5f8537f03..0000000000 --- a/docs/content/reference_docs/_index.md +++ /dev/null @@ -1,12 +0,0 @@ -+++ -title = "Reference Documentation" -date = 2020-04-21T21:00:26-04:00 -weight = 8 -chapter = true -+++ - -### Chapter X - -# Some Chapter title - -Lorem Ipsum. 
\ No newline at end of file diff --git a/docs/content/reference_docs/architecture/_index.md b/docs/content/reference_docs/architecture/_index.md deleted file mode 100644 index b4466e7a88..0000000000 --- a/docs/content/reference_docs/architecture/_index.md +++ /dev/null @@ -1,6 +0,0 @@ ---- -title: "Architecture" -date: 2020-04-21T21:04:11-04:00 -draft: true ---- - diff --git a/docs/content/reference_docs/architecture/aggregator.md b/docs/content/reference_docs/architecture/aggregator.md deleted file mode 100644 index 2b22c1080d..0000000000 --- a/docs/content/reference_docs/architecture/aggregator.md +++ /dev/null @@ -1,6 +0,0 @@ ---- -title: "M3 Aggregator" -date: 2020-04-21T21:01:14-04:00 -draft: true ---- - diff --git a/docs/content/reference_docs/architecture/coordinator.md b/docs/content/reference_docs/architecture/coordinator.md deleted file mode 100644 index dac573cf5e..0000000000 --- a/docs/content/reference_docs/architecture/coordinator.md +++ /dev/null @@ -1,5 +0,0 @@ ---- -title: "M3 Coordinator" -date: 2020-04-21T21:01:05-04:00 -draft: true ---- \ No newline at end of file diff --git a/docs/content/reference_docs/architecture/m3db.md b/docs/content/reference_docs/architecture/m3db.md deleted file mode 100644 index af67c020d8..0000000000 --- a/docs/content/reference_docs/architecture/m3db.md +++ /dev/null @@ -1,45 +0,0 @@ ---- -title: "M3DB" -date: 2020-04-21T21:00:55-04:00 -draft: true ---- - -M3DB is a persistent database with durable storage, but it is best understood via the boundary between its in-memory object layout and on-disk representations. -In-Memory Object Layout - ┌────────────────────────────────────────────────────────────┐ - │ Database │ - ├────────────────────────────────────────────────────────────┤ - │ │ - │ ┌────────────────────────────────────────────────────┐ │ - │ │ Namespaces │ │ - │ ├────────────────────────────────────────────────────┤ │ - │ │ │ │ - │ │ ┌────────────────────────────────────────────┐ │ │ - │ │ │ Shards │ │ │ - │ │ ├────────────────────────────────────────────┤ │ │ - │ │ │ │ │ │ - │ │ │ ┌────────────────────────────────────┐ │ │ │ - │ │ │ │ Series │ │ │ │ - │ │ │ ├────────────────────────────────────┤ │ │ │ - │ │ │ │ │ │ │ │ - │ │ │ │ ┌────────────────────────────┐ │ │ │ │ - │ │ │ │ │ Buffer │ │ │ │ │ - │ │ │ │ └────────────────────────────┘ │ │ │ │ - │ │ │ │ │ │ │ │ - │ │ │ │ │ │ │ │ - │ │ │ │ ┌────────────────────────────┐ │ │ │ │ - │ │ │ │ │ Cached blocks │ │ │ │ │ - │ │ │ │ └────────────────────────────┘ │ │ │ │ - │ │ │ │ ... │ │ │ │ - │ │ │ │ │ │ │ │ - │ │ │ └────────────────────────────────────┘ │ │ │ - │ │ │ ... │ │ │ - │ │ │ │ │ │ - │ │ └────────────────────────────────────────────┘ │ │ - │ │ ... │ │ - │ │ │ │ - │ └────────────────────────────────────────────────────┘ │ - │ ... │ - │ │ - └────────────────────────────────────────────────────────────┘ - diff --git a/docs/content/reference_docs/architecture/query.md b/docs/content/reference_docs/architecture/query.md deleted file mode 100644 index 6640c24297..0000000000 --- a/docs/content/reference_docs/architecture/query.md +++ /dev/null @@ -1,54 +0,0 @@ ---- -title: "M3 Query" -date: 2020-04-21T21:00:59-04:00 -draft: true ---- - -### Overview -M3 Query and M3 Coordinator are written entirely in Go, M3 Query is as a query engine for M3DB and M3 Coordinator is a remote read/write endpoint for Prometheus and M3DB. To learn more about Prometheus's remote endpoints and storage, see here. - -### Blocks -Please note: This documentation is a work in progress and more detail is required. 
- -#### Overview -The fundamental data structures that M3 Query uses are Blocks. Blocks are what get created from the series iterators that M3DB returns. A Block is associated with a start and end time. It contains data from multiple time series stored in columnar format. -Most transformations within M3 Query will be applied across different series for each time interval. Therefore, having data stored in columnar format helps with the memory locality of the data. Moreover, most transformations within M3 Query can work in parallel on different blocks which can significantly increase the computation speed. - -#### Diagram -Below is a visual representation of a set of Blocks. On top is the M3QL query that gets executed, and on the bottom, are the results of the query containing 3 different Blocks. - ┌───────────────────────────────────────────────────────────────────────┐ - │ │ - │ fetch name:sign_up city_id:{new_york,san_diego,toronto} os:* │ - │ │ - └───────────────────────────────────────────────────────────────────────┘ - │ │ │ - │ │ │ - │ │ │ - ▼ ▼ ▼ - ┌────────────┐ ┌────────────┐ ┌─────────────┐ - │ Block One │ │ Block Two │ │ Block Three │ - └────────────┘ └────────────┘ └─────────────┘ - ┌──────┬──────┬──────┐ ┌──────┬──────┬──────┐ ┌──────┬──────┬──────┐ - │ t │ t+1 │ t+2 │ │ t+3 │ t+4 │ t+5 │ │ t+6 │ t+7 │ t+8 │ - ├──────┼──────┼──────▶ ├──────┼──────┼──────▶ ├──────┼──────┼──────▶ -┌───────────────────────────┐ │ │ │ │ │ │ │ │ │ │ │ │ -│ name:sign_up │ │ │ │ │ │ │ │ │ │ │ │ │ -│ city_id:new_york os:ios │ │ 5 │ 2 │ 10 │ │ 10 │ 2 │ 10 │ │ 5 │ 3 │ 5 │ -└───────────────────────────┘ │ │ │ │ │ │ │ │ │ │ │ │ - ├──────┼──────┼──────▶ ├──────┼──────┼──────▶ ├──────┼──────┼──────▶ -┌───────────────────────────┐ │ │ │ │ │ │ │ │ │ │ │ │ -│ name:sign_up │ │ │ │ │ │ │ │ │ │ │ │ │ -│city_id:new_york os:android│ │ 10 │ 8 │ 5 │ │ 20 │ 4 │ 5 │ │ 10 │ 8 │ 5 │ -└───────────────────────────┘ │ │ │ │ │ │ │ │ │ │ │ │ - ├──────┼──────┼──────▶ ├──────┼──────┼──────▶ ├──────┼──────┼──────▶ -┌───────────────────────────┐ │ │ │ │ │ │ │ │ │ │ │ │ -│ name:sign_up │ │ │ │ │ │ │ │ │ │ │ │ │ -│ city_id:san_diego os:ios │ │ 10 │ 5 │ 10 │ │ 2 │ 5 │ 10 │ │ 8 │ 6 │ 6 │ -└───────────────────────────┘ │ │ │ │ │ │ │ │ │ │ │ │ - ├──────┼──────┼──────▶ ├──────┼──────┼──────▶ ├──────┼──────┼──────▶ -┌───────────────────────────┐ │ │ │ │ │ │ │ │ │ │ │ │ -│ name:sign_up │ │ │ │ │ │ │ │ │ │ │ │ │ -│ city_id:toronto os:ios │ │ 2 │ 5 │ 10 │ │ 2 │ 5 │ 10 │ │ 2 │ 5 │ 10 │ -└───────────────────────────┘ │ │ │ │ │ │ │ │ │ │ │ │ - └──────┴──────┴──────┘ └──────┴──────┴──────┘ └──────┴──────┴──────┘ - diff --git a/docs/content/reference_docs/configurations/_index.md b/docs/content/reference_docs/configurations/_index.md deleted file mode 100644 index 092b2329a0..0000000000 --- a/docs/content/reference_docs/configurations/_index.md +++ /dev/null @@ -1,6 +0,0 @@ ---- -title: "Configurations" -date: 2020-04-21T21:03:58-04:00 -draft: true ---- - diff --git a/docs/content/reference_docs/configurations/annotated_config.md b/docs/content/reference_docs/configurations/annotated_config.md deleted file mode 100644 index 1ea54572aa..0000000000 --- a/docs/content/reference_docs/configurations/annotated_config.md +++ /dev/null @@ -1,8 +0,0 @@ ---- -title: "Annotated configuration file" -date: 2020-04-21T21:01:32-04:00 -draft: true ---- - -Link to Yaml: https://github.com/chronosphereio/collector/blob/master/config/chronocollector/config.yml - diff --git a/docs/content/reference_docs/configurations/apis/_index.md 
b/docs/content/reference_docs/configurations/apis/_index.md deleted file mode 100644 index ebe4fb5781..0000000000 --- a/docs/content/reference_docs/configurations/apis/_index.md +++ /dev/null @@ -1,343 +0,0 @@ ---- -title: "Apis" -date: 2020-05-08T12:41:49-04:00 -draft: true ---- - -### M3 Coordinator, API for reading/writing metrics and M3 management -M3 Coordinator is a service that coordinates reads and writes between upstream systems, such as Prometheus, and downstream systems, such as M3DB. -It also provides management APIs to setup and configure different parts of M3. -The coordinator is generally a bridge for read and writing different types of metrics formats and a management layer for M3. - -### API -The M3 Coordinator implements the Prometheus Remote Read and Write HTTP endpoints, they also can be used however as general purpose metrics write and read APIs. Any metrics that are written to the remote write API can be queried using PromQL through the query APIs as well as being able to be read back by the Prometheus Remote Read endpoint. - -### Remote Write -Write a Prometheus Remote write query to M3. -URL -/api/v1/prom/remote/write -Method -POST -URL Params -None. -Header Params -Optional -M3-Metrics-Type: -If this header is set, it determines what type of metric to store this metric value as. Otherwise by default, metrics will be stored in all namespaces that are configured. You can also disable this default behavior by setting downsample options to all: false for a namespace in the coordinator config, for more see disabling automatic aggregation. - -Must be one of: -unaggregated: Write metrics directly to configured unaggregated namespace. -aggregated: Write metrics directly to a configured aggregated namespace (bypassing any aggregation), this requires the M3-Storage-Policy header to be set to resolve which namespace to write metrics to. - - -### M3-Storage-Policy: -If this header is set, it determines which aggregated namespace to read/write metrics directly to/from (bypassing any aggregation). -The value of the header must be in the format of resolution:retention in duration shorthand. e.g. 1m:48h specifices 1 minute resolution and 48 hour retention. Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h". - -Here is an example of querying metrics from a specific namespace. - -### Data Params -Binary snappy compressed Prometheus WriteRequest protobuf message. -Available Tuning Params -Refer here for an up to date list of remote tuning parameters. - -#### Sample Call -There isn't a straightforward way to Snappy compress and marshal a Prometheus WriteRequest protobuf message using just shell, so this example uses a specific command line utility instead. -This sample call is made using promremotecli which is a command line tool that uses a Go client to Prometheus Remote endpoints. For more information visit the GitHub repository. -There is also a Java client that can be used to make requests to the endpoint. -Each -t parameter specifies a label (dimension) to add to the metric. -The -h parameter can be used as many times as necessary to add headers to the outgoing request in the form of "Header-Name: HeaderValue". 
- -Here is an example of writing the datapoint at the current unix timestamp with value 123.456: -docker run -it --rm \ - quay.io/m3db/prometheus_remote_client_golang:latest \ - -u http://host.docker.internal:7201/api/v1/prom/remote/write \ - -t __name__:http_requests_total \ - -t code:200 \ - -t handler:graph \ - -t method:get \ - -d $(date +"%s"),123.456 -promremotecli_log 2019/06/25 04:13:56 writing datapoint [2019-06-25 04:13:55 +0000 UTC 123.456] -promremotecli_log 2019/06/25 04:13:56 labelled [[__name__ http_requests_total] [code 200] [handler graph] [method get]] -promremotecli_log 2019/06/25 04:13:56 writing to http://host.docker.internal:7201/api/v1/prom/remote/write -{"success":true,"statusCode":200} -promremotecli_log 2019/06/25 04:13:56 write success - -# If you are paranoid about image tags being hijacked/replaced with nefarious code, you can use this SHA256 tag: -# quay.io/m3db/prometheus_remote_client_golang@sha256:fc56df819bff9a5a087484804acf3a584dd4a78c68900c31a28896ed66ca7e7b - -For more details on querying data in PromQL that was written using this endpoint, see the query API documentation. -Remote Read -Read Prometheus metrics from M3. -URL -/api/v1/prom/remote/read -Method -POST -URL Params -None. -Header Params -Optional - -### M3-Metrics-Type: -If this header is set, it determines what type of metric to store this metric value as. Otherwise by default, metrics will be stored in all namespaces that are configured. You can also disable this default behavior by setting downsample options to all: false for a namespace in the coordinator config, for more see disabling automatic aggregation. - -#### Must be one of: -unaggregated: Write metrics directly to configured unaggregated namespace. -aggregated: Write metrics directly to a configured aggregated namespace (bypassing any aggregation), this requires the M3-Storage-Policy header to be set to resolve which namespace to write metrics to. - - -### M3-Storage-Policy: -If this header is set, it determines which aggregated namespace to read/write metrics directly to/from (bypassing any aggregation). -The value of the header must be in the format of resolution:retention in duration shorthand. e.g. 1m:48h specifices 1 minute resolution and 48 hour retention. Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h". - -Here is an example of querying metrics from a specific namespace. -Data Params -Binary snappy compressed Prometheus WriteRequest protobuf message. - -### Query Engine - -API -Please note: This documentation is a work in progress and more detail is required. -Query using PromQL -Query using PromQL and returns JSON datapoints compatible with the Prometheus Grafana plugin. -URL -/api/v1/query_range -Method -GET -URL Params -Required -start=[time in RFC3339Nano] -end=[time in RFC3339Nano] -step=[time duration] -target=[string] -Optional -debug=[bool] -lookback=[string|time duration]: This sets the per request lookback duration to something other than the default set in config, can either be a time duration or the string "step" which sets the lookback to the same as the step request parameter. -Header Params -Optional - -### M3-Metrics-Type: -If this header is set, it determines what type of metric to store this metric value as. Otherwise by default, metrics will be stored in all namespaces that are configured. You can also disable this default behavior by setting downsample options to all: false for a namespace in the coordinator config, for more see disabling automatic aggregation. 
- -#### Must be one of: -unaggregated: Write metrics directly to configured unaggregated namespace. -aggregated: Write metrics directly to a configured aggregated namespace (bypassing any aggregation), this requires the M3-Storage-Policy header to be set to resolve which namespace to write metrics to. - - -### M3-Storage-Policy: -If this header is set, it determines which aggregated namespace to read/write metrics directly to/from (bypassing any aggregation). -The value of the header must be in the format of resolution:retention in duration shorthand. e.g. 1m:48h specifices 1 minute resolution and 48 hour retention. Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h". - -#### Here is an example of querying metrics from a specific namespace. -Tag Mutation -The M3-Map-Tags-JSON header enables dynamically mutating tags in Prometheus write request. See 2254 for more background. -Currently only write is supported. As an example, the following header would unconditionally cause globaltag=somevalue to be added to all metrics in a write request: -M3-Map-Tags-JSON: '{"tagMappers":[{"write":{"tag":"globaltag","value":"somevalue"}}]}' - -#### Data Params -None. -Sample Call -curl 'http://localhost:7201/api/v1/query_range?query=abs(http_requests_total)&start=1530220860&end=1530220900&step=15s' -{ - "status": "success", - "data": { - "resultType": "matrix", - "result": [ - { - "metric": { - "code": "200", - "handler": "graph", - "method": "get" - }, - "values": [ - [ - 1530220860, - "6" - ], - [ - 1530220875, - "6" - ], - [ - 1530220890, - "6" - ] - ] - }, - { - "metric": { - "code": "200", - "handler": "label_values", - "method": "get" - }, - "values": [ - [ - 1530220860, - "6" - ], - [ - 1530220875, - "6" - ], - [ - 1530220890, - "6" - ] - ] - } - ] - } -} - - - -### ClusterCondition -ClusterCondition represents various conditions the cluster can be in. - -Field Description Scheme Required -type Type of cluster condition. ClusterConditionType false -status Status of the condition (True, False, Unknown). corev1.ConditionStatus false -lastUpdateTime Last time this condition was updated. string false -lastTransitionTime Last time this condition transitioned from one status to another. string false -reason Reason this condition last changed. string false -message Human-friendly message about this condition. string false -Back to TOC - -### ClusterSpec -ClusterSpec defines the desired state for a M3 cluster to be converge to. - -Field Description Scheme Required -image Image specifies which docker image to use with the cluster string false -replicationFactor ReplicationFactor defines how many replicas int32 false -numberOfShards NumberOfShards defines how many shards in total int32 false -isolationGroups IsolationGroups specifies a map of key-value pairs. Defines which isolation groups to deploy persistent volumes for data nodes []IsolationGroup false -namespaces Namespaces specifies the namespaces this cluster will hold. []Namespace false -etcdEndpoints EtcdEndpoints defines the etcd endpoints to use for service discovery. Must be set if no custom configmap is defined. If set, etcd endpoints will be templated in to the default configmap template. []string false -keepEtcdDataOnDelete KeepEtcdDataOnDelete determines whether the operator will remove cluster metadata (placement + namespaces) in etcd when the cluster is deleted. Unless true, etcd data will be cleared when the cluster is deleted. 
bool false -enableCarbonIngester EnableCarbonIngester enables the listener port for the carbon ingester bool false -configMapName ConfigMapName specifies the ConfigMap to use for this cluster. If unset a default configmap with template variables for etcd endpoints will be used. See \"Configuring M3DB\" in the docs for more. *string false -podIdentityConfig PodIdentityConfig sets the configuration for pod identity. If unset only pod name and UID will be used. *PodIdentityConfig false -containerResources Resources defines memory / cpu constraints for each container in the cluster. corev1.ResourceRequirements false -dataDirVolumeClaimTemplate DataDirVolumeClaimTemplate is the volume claim template for an M3DB instance's data. It claims PersistentVolumes for cluster storage, volumes are dynamically provisioned by when the StorageClass is defined. *corev1.PersistentVolumeClaim false -podSecurityContext PodSecurityContext allows the user to specify an optional security context for pods. *corev1.PodSecurityContext false -securityContext SecurityContext allows the user to specify a container-level security context. *corev1.SecurityContext false -imagePullSecrets ImagePullSecrets will be added to every pod. []corev1.LocalObjectReference false -envVars EnvVars defines custom environment variables to be passed to M3DB containers. []corev1.EnvVar false -labels Labels sets the base labels that will be applied to resources created by the cluster. // TODO(schallert): design doc on labeling scheme. map[string]string false -annotations Annotations sets the base annotations that will be applied to resources created by the cluster. map[string]string false -tolerations Tolerations sets the tolerations that will be applied to all M3DB pods. []corev1.Toleration false -priorityClassName PriorityClassName sets the priority class for all M3DB pods. string false -nodeEndpointFormat NodeEndpointFormat allows overriding of the endpoint used for a node in the M3DB placement. Defaults to \"{{ .PodName }}.{{ .M3DBService }}:{{ .Port }}\". Useful if access to the cluster from other namespaces is desired. See \"Node Endpoint\" docs for full variables available. string false -hostNetwork HostNetwork indicates whether M3DB pods should run in the same network namespace as the node its on. This option should be used sparingly due to security concerns outlined in the linked documentation. https://kubernetes.io/docs/concepts/policy/pod-security-policy/#host-namespaces bool false -dnsPolicy DNSPolicy allows the user to set the pod's DNSPolicy. 
This is often used in conjunction with HostNetwork.+optional *corev1.DNSPolicy false -externalCoordinatorSelector Specify a \"controlling\" coordinator for the cluster It is expected that there is a separate standalone coordinator cluster It is externally managed - not managed by this operator It is expected to have a service endpoint Setup this db cluster, but do not assume a co-located coordinator Instead provide a selector here so we can point to a separate coordinator service Specify here the labels required for the selector map[string]string false -initContainers Custom setup for db nodes can be done via initContainers Provide the complete spec for the initContainer here If any storage volumes are needed in the initContainer see InitVolumes below []corev1.Container false -initVolumes If the InitContainers require any storage volumes Provide the complete specification for the required Volumes here []corev1.Volume false -podMetadata PodMetadata is for any Metadata that is unique to the pods, and does not belong on any other objects, such as Prometheus scrape tags metav1.ObjectMeta false -Back to TOC - -### IsolationGroup -IsolationGroup defines the name of zone as well attributes for the zone configuration - -Field Description Scheme Required -name Name is the value that will be used in StatefulSet labels, pod labels, and M3DB placement \"isolationGroup\" fields. string true -nodeAffinityTerms NodeAffinityTerms is an array of NodeAffinityTerm requirements, which are ANDed together to indicate what nodes an isolation group can be assigned to. []NodeAffinityTerm false -numInstances NumInstances defines the number of instances. int32 true -storageClassName StorageClassName is the name of the StorageClass to use for this isolation group. This allows ensuring that PVs will be created in the same zone as the pinned statefulset on Kubernetes < 1.12 (when topology aware volume scheduling was introduced). Only has effect if the clusters dataDirVolumeClaimTemplate is non-nil. If set, the volume claim template will have its storageClassName field overridden per-isolationgroup. If unset the storageClassName of the volumeClaimTemplate will be used. string false -Back to TOC - -### M3DBCluster -M3DBCluster defines the cluster - -Field Description Scheme Required -metadata metav1.ObjectMeta false -type string true -spec ClusterSpec true -status M3DBStatus false -Back to TOC - -### M3DBClusterList -M3DBClusterList represents a list of M3DB Clusters - -Field Description Scheme Required -metadata metav1.ListMeta false -items []M3DBCluster true -Back to TOC - -### M3DBStatus -M3DBStatus contains the current state the M3DB cluster along with a human readable message - -Field Description Scheme Required -state State is a enum of green, yellow, and red denoting the health of the cluster M3DBState false -conditions Various conditions about the cluster. []ClusterCondition false -message Message is a human readable message indicating why the cluster is in it's current state string false -observedGeneration ObservedGeneration is the last generation of the cluster the controller observed. Kubernetes will automatically increment metadata.Generation every time the cluster spec is changed. int64 false -Back to TOC - -### NodeAffinityTerm -NodeAffinityTerm represents a node label and a set of label values, any of which can be matched to assign a pod to a node. - -### Field Description Scheme Required -key Key is the label of the node. 
string true -values Values is an array of values, any of which a node can have for a pod to be assigned to it. []string true -Back to TOC - -### IndexOptions -IndexOptions defines parameters for indexing. - -### Field Description Scheme Required -enabled Enabled controls whether metric indexing is enabled. bool false -blockSize BlockSize controls the index block size. string false -Back to TOC - -### Namespace -Namespace defines an M3DB namespace or points to a preset M3DB namespace. - -Field Description Scheme Required -name Name is the namespace name. string false -preset Preset indicates preset namespace options. string false -options Options points to optional custom namespace configuration. *NamespaceOptions false -Back to TOC - -### NamespaceOptions -NamespaceOptions defines parameters for an M3DB namespace. See https://m3db.github.io/m3/operational_guide/namespace_configuration/ for more details. - -Field Description Scheme Required -bootstrapEnabled BootstrapEnabled control if bootstrapping is enabled. bool false -flushEnabled FlushEnabled controls whether flushing is enabled. bool false -writesToCommitLog WritesToCommitLog controls whether commit log writes are enabled. bool false -cleanupEnabled CleanupEnabled controls whether cleanups are enabled. bool false -repairEnabled RepairEnabled controls whether repairs are enabled. bool false -snapshotEnabled SnapshotEnabled controls whether snapshotting is enabled. bool false -retentionOptions RetentionOptions sets the retention parameters. RetentionOptions false -indexOptions IndexOptions sets the indexing parameters. IndexOptions false -Back to TOC - -### RetentionOptions -RetentionOptions defines parameters for data retention. - -Field Description Scheme Required -retentionPeriod RetentionPeriod controls how long data for the namespace is retained. string false -blockSize BlockSize controls the block size for the namespace. string false -bufferFuture BufferFuture controls how far in the future metrics can be written. string false -bufferPast BufferPast controls how far in the past metrics can be written. string false -blockDataExpiry BlockDataExpiry controls the block expiry. bool false -blockDataExpiryAfterNotAccessPeriod BlockDataExpiry controls the not after access period for expiration. string false -Back to TOC - -### PodIdentity -PodIdentity contains all the fields that may be used to identify a pod's identity in the M3DB placement. Any non-empty fields will be used to identity uniqueness of a pod for the purpose of M3DB replace operations. - -Field Description Scheme Required -name string false -uid string false -nodeName string false -nodeExternalID string false -nodeProviderID string false -Back to TOC - -### PodIdentityConfig -PodIdentityConfig contains cluster-level configuration for deriving pod identity. - -Field Description Scheme Required -sources Sources enumerates the sources from which to derive pod identity. Note that a pod's name will always be used. If empty, defaults to pod name and UID. 
[]PodIdentitySource true \ No newline at end of file diff --git a/docs/content/reference_docs/configurations/apis/ingest.md b/docs/content/reference_docs/configurations/apis/ingest.md deleted file mode 100644 index d888da7c02..0000000000 --- a/docs/content/reference_docs/configurations/apis/ingest.md +++ /dev/null @@ -1,6 +0,0 @@ ---- -title: "Ingest APIs" -date: 2020-05-08T12:42:14-04:00 -draft: true ---- - diff --git a/docs/content/reference_docs/configurations/apis/operator.md b/docs/content/reference_docs/configurations/apis/operator.md deleted file mode 100644 index f0d96bfe74..0000000000 --- a/docs/content/reference_docs/configurations/apis/operator.md +++ /dev/null @@ -1,175 +0,0 @@ ---- -title: "Operator API" -date: 2020-05-08T12:42:20-04:00 -draft: true ---- - -API Docs -This document enumerates the Custom Resource Definitions used by the M3DB Operator. It is auto-generated from code comments. - -Table of Contents -ClusterCondition -ClusterSpec -IsolationGroup -M3DBCluster -M3DBClusterList -M3DBStatus -NodeAffinityTerm -IndexOptions -Namespace -NamespaceOptions -RetentionOptions -PodIdentity -PodIdentityConfig -ClusterCondition -ClusterCondition represents various conditions the cluster can be in. - -Field Description Scheme Required -type Type of cluster condition. ClusterConditionType false -status Status of the condition (True, False, Unknown). corev1.ConditionStatus false -lastUpdateTime Last time this condition was updated. string false -lastTransitionTime Last time this condition transitioned from one status to another. string false -reason Reason this condition last changed. string false -message Human-friendly message about this condition. string false -Back to TOC - -ClusterSpec -ClusterSpec defines the desired state for a M3 cluster to be converge to. - -Field Description Scheme Required -image Image specifies which docker image to use with the cluster string false -replicationFactor ReplicationFactor defines how many replicas int32 false -numberOfShards NumberOfShards defines how many shards in total int32 false -isolationGroups IsolationGroups specifies a map of key-value pairs. Defines which isolation groups to deploy persistent volumes for data nodes []IsolationGroup false -namespaces Namespaces specifies the namespaces this cluster will hold. []Namespace false -etcdEndpoints EtcdEndpoints defines the etcd endpoints to use for service discovery. Must be set if no custom configmap is defined. If set, etcd endpoints will be templated in to the default configmap template. []string false -keepEtcdDataOnDelete KeepEtcdDataOnDelete determines whether the operator will remove cluster metadata (placement + namespaces) in etcd when the cluster is deleted. Unless true, etcd data will be cleared when the cluster is deleted. bool false -enableCarbonIngester EnableCarbonIngester enables the listener port for the carbon ingester bool false -configMapName ConfigMapName specifies the ConfigMap to use for this cluster. If unset a default configmap with template variables for etcd endpoints will be used. See \"Configuring M3DB\" in the docs for more. *string false -podIdentityConfig PodIdentityConfig sets the configuration for pod identity. If unset only pod name and UID will be used. *PodIdentityConfig false -containerResources Resources defines memory / cpu constraints for each container in the cluster. corev1.ResourceRequirements false -dataDirVolumeClaimTemplate DataDirVolumeClaimTemplate is the volume claim template for an M3DB instance's data. 
It claims PersistentVolumes for cluster storage, volumes are dynamically provisioned by when the StorageClass is defined. *corev1.PersistentVolumeClaim false -podSecurityContext PodSecurityContext allows the user to specify an optional security context for pods. *corev1.PodSecurityContext false -securityContext SecurityContext allows the user to specify a container-level security context. *corev1.SecurityContext false -imagePullSecrets ImagePullSecrets will be added to every pod. []corev1.LocalObjectReference false -envVars EnvVars defines custom environment variables to be passed to M3DB containers. []corev1.EnvVar false -labels Labels sets the base labels that will be applied to resources created by the cluster. // TODO(schallert): design doc on labeling scheme. map[string]string false -annotations Annotations sets the base annotations that will be applied to resources created by the cluster. map[string]string false -tolerations Tolerations sets the tolerations that will be applied to all M3DB pods. []corev1.Toleration false -priorityClassName PriorityClassName sets the priority class for all M3DB pods. string false -nodeEndpointFormat NodeEndpointFormat allows overriding of the endpoint used for a node in the M3DB placement. Defaults to \"{{ .PodName }}.{{ .M3DBService }}:{{ .Port }}\". Useful if access to the cluster from other namespaces is desired. See \"Node Endpoint\" docs for full variables available. string false -hostNetwork HostNetwork indicates whether M3DB pods should run in the same network namespace as the node its on. This option should be used sparingly due to security concerns outlined in the linked documentation. https://kubernetes.io/docs/concepts/policy/pod-security-policy/#host-namespaces bool false -dnsPolicy DNSPolicy allows the user to set the pod's DNSPolicy. This is often used in conjunction with HostNetwork.+optional *corev1.DNSPolicy false -externalCoordinatorSelector Specify a \"controlling\" coordinator for the cluster It is expected that there is a separate standalone coordinator cluster It is externally managed - not managed by this operator It is expected to have a service endpoint Setup this db cluster, but do not assume a co-located coordinator Instead provide a selector here so we can point to a separate coordinator service Specify here the labels required for the selector map[string]string false -initContainers Custom setup for db nodes can be done via initContainers Provide the complete spec for the initContainer here If any storage volumes are needed in the initContainer see InitVolumes below []corev1.Container false -initVolumes If the InitContainers require any storage volumes Provide the complete specification for the required Volumes here []corev1.Volume false -podMetadata PodMetadata is for any Metadata that is unique to the pods, and does not belong on any other objects, such as Prometheus scrape tags metav1.ObjectMeta false -parallelPodManagement ParallelPodManagement sets StatefulSets created by the operator to have Parallel pod management instead of OrderedReady. This is an EXPERIMENTAL flag and subject to deprecation in a future release. This has not been tested in production and users should not depend on it without validating it for their own use case. bool true -Back to TOC - -IsolationGroup -IsolationGroup defines the name of zone as well attributes for the zone configuration - -Field Description Scheme Required -name Name is the value that will be used in StatefulSet labels, pod labels, and M3DB placement \"isolationGroup\" fields. 
string true -nodeAffinityTerms NodeAffinityTerms is an array of NodeAffinityTerm requirements, which are ANDed together to indicate what nodes an isolation group can be assigned to. []NodeAffinityTerm false -numInstances NumInstances defines the number of instances. int32 true -storageClassName StorageClassName is the name of the StorageClass to use for this isolation group. This allows ensuring that PVs will be created in the same zone as the pinned statefulset on Kubernetes < 1.12 (when topology aware volume scheduling was introduced). Only has effect if the clusters dataDirVolumeClaimTemplate is non-nil. If set, the volume claim template will have its storageClassName field overridden per-isolationgroup. If unset the storageClassName of the volumeClaimTemplate will be used. string false -Back to TOC - -M3DBCluster -M3DBCluster defines the cluster - -Field Description Scheme Required -metadata metav1.ObjectMeta false -type string true -spec ClusterSpec true -status M3DBStatus false -Back to TOC - -M3DBClusterList -M3DBClusterList represents a list of M3DB Clusters - -Field Description Scheme Required -metadata metav1.ListMeta false -items []M3DBCluster true -Back to TOC - -M3DBStatus -M3DBStatus contains the current state the M3DB cluster along with a human readable message - -Field Description Scheme Required -state State is a enum of green, yellow, and red denoting the health of the cluster M3DBState false -conditions Various conditions about the cluster. []ClusterCondition false -message Message is a human readable message indicating why the cluster is in it's current state string false -observedGeneration ObservedGeneration is the last generation of the cluster the controller observed. Kubernetes will automatically increment metadata.Generation every time the cluster spec is changed. int64 false -Back to TOC - -NodeAffinityTerm -NodeAffinityTerm represents a node label and a set of label values, any of which can be matched to assign a pod to a node. - -Field Description Scheme Required -key Key is the label of the node. string true -values Values is an array of values, any of which a node can have for a pod to be assigned to it. []string true -Back to TOC - -IndexOptions -IndexOptions defines parameters for indexing. - -Field Description Scheme Required -enabled Enabled controls whether metric indexing is enabled. bool false -blockSize BlockSize controls the index block size. string false -Back to TOC - -Namespace -Namespace defines an M3DB namespace or points to a preset M3DB namespace. - -Field Description Scheme Required -name Name is the namespace name. string false -preset Preset indicates preset namespace options. string false -options Options points to optional custom namespace configuration. *NamespaceOptions false -Back to TOC - -NamespaceOptions -NamespaceOptions defines parameters for an M3DB namespace. See https://m3db.github.io/m3/operational_guide/namespace_configuration/ for more details. - -Field Description Scheme Required -bootstrapEnabled BootstrapEnabled control if bootstrapping is enabled. bool false -flushEnabled FlushEnabled controls whether flushing is enabled. bool false -writesToCommitLog WritesToCommitLog controls whether commit log writes are enabled. bool false -cleanupEnabled CleanupEnabled controls whether cleanups are enabled. bool false -repairEnabled RepairEnabled controls whether repairs are enabled. bool false -snapshotEnabled SnapshotEnabled controls whether snapshotting is enabled. 
bool false -retentionOptions RetentionOptions sets the retention parameters. RetentionOptions false -indexOptions IndexOptions sets the indexing parameters. IndexOptions false -Back to TOC - -RetentionOptions -RetentionOptions defines parameters for data retention. - -Field Description Scheme Required -retentionPeriod RetentionPeriod controls how long data for the namespace is retained. string false -blockSize BlockSize controls the block size for the namespace. string false -bufferFuture BufferFuture controls how far in the future metrics can be written. string false -bufferPast BufferPast controls how far in the past metrics can be written. string false -blockDataExpiry BlockDataExpiry controls the block expiry. bool false -blockDataExpiryAfterNotAccessPeriod BlockDataExpiry controls the not after access period for expiration. string false -Back to TOC - -PodIdentity -PodIdentity contains all the fields that may be used to identify a pod's identity in the M3DB placement. Any non-empty fields will be used to identity uniqueness of a pod for the purpose of M3DB replace operations. - -Field Description Scheme Required -name string false -uid string false -nodeName string false -nodeExternalID string false -nodeProviderID string false -Back to TOC - -PodIdentityConfig -PodIdentityConfig contains cluster-level configuration for deriving pod identity. - -Field Description Scheme Required -sources Sources enumerates the sources from which to derive pod identity. Note that a pod's name will always be used. If empty, defaults to pod name and UID. []PodIdentitySource true -Back to TOC \ No newline at end of file diff --git a/docs/content/reference_docs/configurations/apis/query.md b/docs/content/reference_docs/configurations/apis/query.md deleted file mode 100644 index 5b7b5c402e..0000000000 --- a/docs/content/reference_docs/configurations/apis/query.md +++ /dev/null @@ -1,6 +0,0 @@ ---- -title: "Query APIs" -date: 2020-05-08T12:42:09-04:00 -draft: true ---- - diff --git a/docs/content/reference_docs/configurations/availability.md b/docs/content/reference_docs/configurations/availability.md deleted file mode 100644 index 9f52fa65af..0000000000 --- a/docs/content/reference_docs/configurations/availability.md +++ /dev/null @@ -1,100 +0,0 @@ ---- -title: "Availability, consistency, and durability" -date: 2020-04-21T21:02:08-04:00 -draft: true ---- - -### Consistency Levels -M3DB provides variable consistency levels for read and write operations, as well as cluster connection operations. These consistency levels are handled at the client level. -#### Write consistency levels -- One: Corresponds to a single node succeeding for an operation to succeed. -- Majority: Corresponds to the majority of nodes succeeding for an operation to succeed. -- All: Corresponds to all nodes succeeding for an operation to succeed. - -#### Read consistency levels -- One: Corresponds to reading from a single node to designate success. -- UnstrictMajority: Corresponds to reading from the majority of nodes but relaxing the constraint when it cannot be met, falling back to returning success when reading from at least a single node after attempting reading from the majority of nodes. -- Majority: Corresponds to reading from the majority of nodes to designate success. -- All: Corresponds to reading from all of the nodes to designate success. - -#### Connect consistency levels -Connect consistency levels are used to determine when a client session is deemed as connected before operations can be attempted. 
-- Any: Corresponds to connecting to any number of nodes for all shards; this strategy will attempt to connect to all, then the majority, then one, and finally fall back to none, and as such will always succeed. -- None: Corresponds to connecting to no nodes for all shards and as such will always succeed. -- One: Corresponds to connecting to a single node for all shards. -- Majority: Corresponds to connecting to the majority of nodes for all shards. -- All: Corresponds to connecting to all of the nodes for all shards. - - -### Tuning Availability, Consistency, and Durability -#### Overview -M3DB is designed as a high availability (HA) system because it doesn't use a consensus protocol like Raft or Paxos to enforce strong consensus and consistency guarantees. However, even within the category of HA systems, there is a broad spectrum of consistency and durability guarantees that a database can provide. To address as many use cases as possible, M3DB can be tuned to achieve the desired balance between performance, availability, durability, and consistency. - -Generally speaking, the default and example configuration for M3DB favors performance and availability, as that is well-suited for M3DB's most common metrics and observability use cases. Database operators running workloads that require stricter consistency and durability guarantees should consider tuning the default configuration as described in the "Tuning for Consistency and Durability" section to better suit their use case. -The rest of this document describes the various configuration options available to M3DB operators to make such tradeoffs. While reading it, we recommend referring to the default configuration file (which has every possible configuration value set) to see how the described values fit into M3DB's configuration as a whole. - -### Tuning for Performance and Availability -#### Client Write and Read consistency -We recommend running the client with writeConsistencyLevel set to majority and readConsistencyLevel set to unstrict_majority. This means that all writes must be acknowledged by a quorum of nodes in order to be considered successful, and that reads will attempt to achieve quorum but will return the data from a single node if they are unable to achieve quorum. This ensures that reads are normally consistent, while degraded conditions will not cause reads to fail outright as long as at least a single node can satisfy the request. -You can read about the consistency levels in more detail in the Consistency Levels section. - -#### Commitlog Configuration -We recommend running M3DB with an asynchronous commitlog. This means that writes will be reported as successful by the client even though the data may not have been flushed to disk yet. - -For example, consider the default configuration: -commitlog: - flushMaxBytes: 524288 - flushEvery: 1s - queue: - calculationType: fixed - size: 2097152 - -This configuration states that the commitlog should be flushed whenever either of the following is true: -524288 or more bytes have been written since the last time M3DB flushed the commitlog. -One or more seconds have elapsed since the last time M3DB flushed the commitlog. -In addition, the configuration states that M3DB should allow up to 2097152 writes to be buffered in the commitlog queue before the node begins rejecting incoming writes so that it can drain the queue and catch up. Increasing the size of this queue can often increase the write throughput of an M3DB node at the cost of potentially losing more data if the node experiences a sudden failure like a hard crash or power loss.
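As an illustration of that trade-off, a sketch of a tuned-up configuration might look like the following (the field names mirror the default block above; the larger queue size is purely illustrative, not a recommendation):

commitlog:
  flushMaxBytes: 524288
  flushEvery: 1s
  queue:
    calculationType: fixed
    # Roughly double the default of 2097152: more headroom during write spikes,
    # at the cost of more buffered (and therefore losable) writes on a hard crash.
    size: 4194304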
- -### Writing New Series Asynchronously -The default M3DB YAML configuration will contain the following as a top-level key under the db section: -writeNewSeriesAsync: true - -This instructs M3DB to handle writes for new timeseries (for a given time block) asynchronously. Creating a new timeseries in memory is much more expensive than simply appending a new write to an existing series, so the default configuration of creating them asynchronously improves M3DB's write throughput significantly when many new series are being created all at once. - -However, since new time series are created asynchronously, there may be a brief delay between when a write is acknowledged by the client and when that series becomes available for subsequent reads. - -M3DB also allows operators to rate limit the number of new series that can be created per second via the following configuration: -writeNewSeriesLimitPerSecond: 1048576 - -This value can be set much lower than the default for workloads in which a significant increase in cardinality usually indicates a misbehaving caller. - -### Ignoring Corrupt Commitlogs on Bootstrap -If M3DB is shut down gracefully (i.e. via SIGTERM), it will ensure that all pending writes are flushed to the commitlog on disk before the process exits. However, in situations where the process crashed or exited unexpectedly, or the node itself experienced a sudden failure, the tail end of the commitlog may be corrupt. In such situations, M3DB will read as much of the commitlog as possible in an attempt to recover the maximum amount of data. However, it then needs to make a decision: it can either (a) come up successfully and tolerate an ostensibly minor amount of data loss, or (b) attempt to stream the missing data from its peers. This behavior is controlled by the following default configuration: -bootstrap: - commitlog: - returnUnfulfilledForCorruptCommitLogFiles: false - -In the situation where only a single node fails, the optimal outcome is for the node to attempt to repair itself from one of its peers. However, if a quorum of nodes fail and encounter corrupt commitlog files, they will deadlock while attempting to stream data from each other, as no nodes will be able to make progress due to a lack of quorum. This issue requires an operator with significant M3DB operational experience to manually bootstrap the cluster; thus the official recommendation is to set returnUnfulfilledForCorruptCommitLogFiles: false to avoid this issue altogether. In most cases, a small amount of data loss is preferable to a quorum of nodes that crash and fail to start back up automatically. - -### Tuning for Consistency and Durability -#### Client Write and Read consistency -The most important thing to understand is that if you want to guarantee that you will be able to read the result of every successful write, then both writes and reads must be done with majority consistency. This means that both writes and reads will fail if a quorum of nodes are unavailable for a given shard.
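As a sketch, these stricter settings might appear in the client configuration as follows (the enclosing client key and exact placement are assumptions for illustration; only the two consistency level values are taken from the text above):

client:
  # Strict read-your-writes behavior: both levels set to majority.
  writeConsistencyLevel: majority
  readConsistencyLevel: majority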
You can read about the consistency levels in more detail in the Consistency Levels section - -#### Commitlog Configuration -M3DB supports running the commitlog synchronously such that every write is flushed to disk and fsync'd before the client receives a successful acknowledgement, but this is not currently exposed to users in the YAML configuration and generally leads to a massive performance degradation. We only recommend operating M3DB this way for workloads where data consistency and durability is strictly required, and even then there may be better alternatives such as running M3DB with the bootstrapping configuration: filesystem,peers,uninitialized_topology as described in our bootstrapping operational guide. - -#### Writing New Series Asynchronously -If you want to guarantee that M3DB will immediately allow you to read data for writes that have been acknowledged by the client, including the situation where the previous write was for a brand new timeseries, then you will need to change the default M3DB configuration to set writeNewSeriesAsync: false as a top-level key under the db section: -writeNewSeriesAsync: false - -This instructs M3DB to handle writes for new timeseries (for a given time block) synchronously. Creating a new timeseries in memory is much more expensive than simply appending a new write to an existing series, so this configuration could have an adverse effect on performance when many new timeseries are being inserted into M3DB concurrently. -Since this operation is so expensive, M3DB allows operator to rate limit the number of new series that can be created per second via the following configuration (also a top-level key under the db section): -writeNewSeriesLimitPerSecond: 1048576 - -### Ignoring Corrupt Commitlogs on Bootstrap -As described in the "Tuning for Performance and Availability" section, we recommend configuring M3DB to ignore corrupt commitlog files on bootstrap. However, if you want to avoid any amount of inconsistency or data loss, no matter how minor, then you should configure M3DB to return unfulfilled when the commitlog bootstrapper encounters corrupt commitlog files. You can do so by modifying your configuration to look like this: -bootstrap: - commitlog: - returnUnfulfilledForCorruptCommitLogFiles: true - -This will force your M3DB nodes to attempt to repair corrupted commitlog files on bootstrap by streaming the data from their peers. In most situations this will be transparent to the operator and the M3DB node will finish bootstrapping without trouble. However, in the scenario where a quorum of nodes for a given shard failed in unison, the nodes will deadlock while attempting to stream data from each other, as no nodes will be able to make progress due to a lack of quorum. This issue requires an operator with significant M3DB operational experience to manually bootstrap the cluster; thus the official recommendation is to avoid configuring M3DB in this way unless data consistency and durability are of utmost importance. - diff --git a/docs/content/reference_docs/configurations/bootstrapping.md b/docs/content/reference_docs/configurations/bootstrapping.md deleted file mode 100644 index 9d58c8d1c1..0000000000 --- a/docs/content/reference_docs/configurations/bootstrapping.md +++ /dev/null @@ -1,127 +0,0 @@ ---- -title: "Bootstrapping" -date: 2020-04-21T21:02:17-04:00 -draft: true ---- - -### Bootstrapping & Crash Recovery -#### Introduction -We recommend reading the placement operational guide before reading the rest of this document. 
When an M3DB node is turned on (goes through a placement change) it needs to go through a bootstrapping process to determine the integrity of data that it has, replay writes from the commit log, and/or stream missing data from its peers. In most cases, as long as you're running with the default and recommended bootstrapper configuration of filesystem,commitlog,peers,uninitialized_topology, you should not need to worry about the bootstrapping process at all: M3DB will take care of doing the right thing such that you don't lose data and consistency guarantees are met. Note that the order of the configured bootstrappers does matter.

Generally speaking, we recommend that operators do not modify the bootstrappers configuration, but in the rare case that you need to, this document is designed to help you understand the implications of doing so.

M3DB currently supports five different bootstrappers:

- filesystem
- commitlog
- peers
- uninitialized_topology
- noop-all

When the bootstrapping process begins, M3DB nodes need to determine two things:

1. What shards the bootstrapping node should bootstrap, which can be determined from the cluster placement.
2. What time ranges the bootstrapping node needs to bootstrap those shards for, which can be determined from the namespace retention.

For example, imagine an M3DB node that is responsible for shards 1, 5, 13, and 25 according to the cluster placement. In addition, it has a single namespace called "metrics" with a retention of 48 hours. When the M3DB node is started, it will determine that it needs to bootstrap shards 1, 5, 13, and 25 for the time range starting at the current time and ending 48 hours ago. In order to obtain all this data, it will run the configured bootstrappers in the specified order. Every bootstrapper will notify the bootstrapping process of which shard/ranges it was able to bootstrap, and the bootstrapping process will continue working its way through the list of bootstrappers until all the required shards/ranges have been marked as fulfilled; if any remain unfulfilled after every bootstrapper has run, the M3DB node will fail to start.

### Bootstrappers

#### Filesystem Bootstrapper

The filesystem bootstrapper's responsibility is to determine which immutable Fileset files exist on disk and, for the ones it finds, mark the corresponding shard/time ranges as fulfilled. The filesystem bootstrapper achieves this by scanning M3DB's directory structure and determining which Fileset files exist on disk. Unlike the other bootstrappers, the filesystem bootstrapper does not need to load any data into memory; it simply verifies the checksums of the data on disk, and other components of the M3DB node handle reading (and caching) the data dynamically once it begins to serve reads.

#### Commitlog Bootstrapper

The commitlog bootstrapper's responsibility is to read the commitlog and snapshot (compacted commitlog) files on disk and recover any data that has not yet been written out as an immutable Fileset file. Unlike the filesystem bootstrapper, the commitlog bootstrapper cannot simply check which files are on disk in order to determine if it can satisfy a bootstrap request. Instead, the commitlog bootstrapper determines whether it can satisfy a bootstrap request using a simple heuristic.

On a shard-by-shard basis, the commitlog bootstrapper will consult the cluster placement to see if the node it is running on has ever achieved the Available status for the specified shard.
If so, then the commit log bootstrapper should have all the data since the last Fileset file was flushed and will return that it can satisfy any time range for that shard. In other words, the commit log bootstrapper is all-or-nothing for a given shard: it will either return that it can satisfy any time range for a given shard or none at all. In addition, the commitlog bootstrapper assumes it is running after the filesystem bootstrapper. M3DB will not allow you to run with a configuration where the filesystem bootstrapper is placed after the commitlog bootstrapper, but it will allow you to run the commitlog bootstrapper without the filesystem bootstrapper which can result in loss of data, depending on the workload. - -#### Peers Bootstrapper -The peers bootstrapper's responsibility is to stream in data for shard/ranges from other M3DB nodes (peers) in the cluster. This bootstrapper is only useful in M3DB clusters with more than a single node and where the replication factor is set to a value larger than 1. The peers bootstrapper will determine whether or not it can satisfy a bootstrap request on a shard-by-shard basis by consulting the cluster placement and determining if there are enough peers to satisfy the bootstrap request. For example, imagine the following M3DB placement where node A is trying to perform a peer bootstrap: - ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ - │ Node A │ │ Node B │ │ Node C │ -────┴─────────────────┴──────────┴─────────────────┴────────┴─────────────────┴─── -┌─────────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐ -│ │ │ │ │ │ -│ │ │ │ │ │ -│ Shard 1: Initializing │ │ Shard 1: Initializing │ │ Shard 1: Available │ -│ Shard 2: Initializing │ │ Shard 2: Initializing │ │ Shard 2: Available │ -│ Shard 3: Initializing │ │ Shard 3: Initializing │ │ Shard 3: Available │ -│ │ │ │ │ │ -│ │ │ │ │ │ -└─────────────────────────┘ └───────────────────────┘ └──────────────────────┘ - -In this case, the peers bootstrapper running on node A will not be able to fullfill any requests because node B is in the Initializing state for all of its shards and cannot fulfill bootstrap requests. This means that node A's peers bootstrapper cannot meet its default consistency level of majority for bootstrapping (1 < 2 which is majority with a replication factor of 3). On the other hand, node A would be able to peer bootstrap its shards in the following placement because its peers (nodes B/C) have sufficient replicas of the shards it needs in the Available state: - ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ - │ Node A │ │ Node B │ │ Node C │ -────┴─────────────────┴──────────┴─────────────────┴────────┴─────────────────┴─── -┌─────────────────────────┐ ┌───────────────────────┐ ┌──────────────────────┐ -│ │ │ │ │ │ -│ │ │ │ │ │ -│ Shard 1: Initializing │ │ Shard 1: Available │ │ Shard 1: Available │ -│ Shard 2: Initializing │ │ Shard 2: Available │ │ Shard 2: Available │ -│ Shard 3: Initializing │ │ Shard 3: Available │ │ Shard 3: Available │ -│ │ │ │ │ │ -│ │ │ │ │ │ -└─────────────────────────┘ └───────────────────────┘ └──────────────────────┘ - -Note that a bootstrap consistency level of majority is the default value, but can be modified by changing the value of the key m3db.client.bootstrap-consistency-level in etcd to one of: none, one, unstrict_majority (attempt to read from majority, but settle for less if any errors occur), majority (strict majority), and all. 
For example, if an entire cluster with a replication factor of 3 was restarted simultaneously, all the nodes would get stuck in an infinite loop trying to peer bootstrap from each other and never achieving majority until an operator modified this value. Note that this can happen even if all the shards were in the Available state, because M3DB nodes will reject all read requests for a shard until they have bootstrapped that shard (which has to happen every time the node is restarted).

Note: Any bootstrappers configuration that does not include the peers bootstrapper will be unable to handle dynamic placement changes of any kind.

#### Uninitialized Topology Bootstrapper

The purpose of the uninitialized_topology bootstrapper is to succeed bootstraps for all time ranges for shards that have never been completely bootstrapped (at a cluster level). This allows us to run the default bootstrapper configuration of filesystem,commitlog,peers,uninitialized_topology such that the filesystem and commitlog bootstrappers are used by default on node restarts, the peers bootstrapper is used for node adds/removes/replaces, and bootstraps still succeed for brand new placements where both the commitlog and peers bootstrappers will be unable to succeed any bootstraps. In other words, the uninitialized_topology bootstrapper allows us to place the commitlog bootstrapper before the peers bootstrapper and still succeed bootstraps with brand new placements without resorting to the noop-all bootstrapper, which succeeds bootstraps for all shard/time-ranges regardless of the status of the placement.

The uninitialized_topology bootstrapper determines whether a placement is "new" for a given shard by counting the number of nodes in the Initializing and Leaving states; if there are more Initializing than Leaving nodes, it succeeds the bootstrap, because that means the placement has never reached a state where all nodes are Available.

#### No-Op All Bootstrapper

The noop-all bootstrapper succeeds all bootstraps regardless of the requested shards/time ranges.

### Bootstrappers Configuration

Now that we've gone over the various bootstrappers, let's consider how M3DB will behave in different configurations. Note that we include uninitialized_topology at the end of all the lists of bootstrappers because it's required to get a new placement up and running in the first place, but is not required after that (although leaving it in has no detrimental effects). Also note that any configuration that does not include the peers bootstrapper will not be able to handle dynamic placement changes like node adds/removes/replaces.

filesystem,commitlog,peers,uninitialized_topology (default)

This is the default bootstrappers configuration for M3DB and will behave "as expected" in the sense that it will maintain M3DB's consistency guarantees at all times, handle node adds/replaces/removes correctly, and still work with brand new placements / topologies. This is the only configuration that we recommend using in production.

In the general case, the node will use only the filesystem and commitlog bootstrappers on node startup. However, in the case of a node add/remove/replace, the commitlog bootstrapper will detect that it is unable to fulfill the bootstrap request (because the node has never reached the Available state) and defer to the peers bootstrapper to stream in the data.
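For reference, this default ordering is what the bootstrap section of an m3dbnode configuration is expected to express. The sketch below assumes the list is exposed as a bootstrappers key under the db section's bootstrap block, as in older open-source example configurations; check the configuration reference for your M3DB version before relying on the exact key names.

```yaml
# Sketch only: the default bootstrapper ordering expressed in m3dbnode YAML.
db:
  bootstrap:
    bootstrappers:              # order matters; earlier entries are consulted first
      - filesystem              # reuse immutable fileset files already on disk
      - commitlog               # replay commitlog / snapshot files for unflushed data
      - peers                   # stream anything still missing from replica peers
      - uninitialized_topology  # succeed bootstraps for brand new placements
    commitlog:
      returnUnfulfilledForCorruptCommitLogFiles: false
```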
Additionally, if it is a brand new placement where even the peers bootstrapper cannot fulfill the bootstrap, this will be detected by the uninitialized_topology bootstrapper, which will succeed the bootstrap.

filesystem,peers,uninitialized_topology

Every time a node is restarted it will attempt to stream in all of the data for any blocks that it has never flushed, which is generally the currently active block and possibly the previous block as well. This mode can be useful if you want to improve performance or save disk space by operating nodes without a commitlog, or want to force a repair of any unflushed blocks. This mode can lead to violations of M3DB's consistency guarantees due to the fact that commit logs are being ignored. In addition, if you lose a replication factor's worth or more of hosts at the same time, the node will not be able to bootstrap unless an operator modifies the bootstrap consistency level configuration in etcd (see the peers bootstrapper section above). Finally, this mode adds additional network and resource pressure on other nodes in the cluster while one node is peer bootstrapping from them, which can be problematic in catastrophic scenarios where all the nodes are trying to stream data from each other.

peers,uninitialized_topology

Every time a node is restarted, it will attempt to stream in all of the data that it is responsible for from its peers, completely ignoring the immutable Fileset files it already has on disk. This mode can be useful if you want to improve performance or save disk space by operating nodes without a commitlog, or want to force a repair of all data on an individual node. This mode can lead to violations of M3DB's consistency guarantees due to the fact that the commit logs are being ignored. In addition, if you lose a replication factor's worth or more of hosts at the same time, the node will not be able to bootstrap unless an operator modifies the bootstrap consistency level configuration in etcd (see the peers bootstrapper section above). Finally, this mode adds additional network and resource pressure on other nodes in the cluster while one node is peer bootstrapping from them, which can be problematic in catastrophic scenarios where all the nodes are trying to stream data from each other.

#### Invalid bootstrappers configurations

For the sake of completeness, we've included a short discussion below of some bootstrapping configurations that we consider "invalid" in that they are likely to lose data / violate M3DB's consistency guarantees and/or not handle placement changes in a correct way.

filesystem,commitlog,uninitialized_topology

This bootstrapping configuration will work just fine if nodes are never added/replaced/removed, but will fail when attempting a node add/replace/remove.

filesystem,uninitialized_topology

Every time a node is restarted it will utilize the immutable Fileset files it's already written out to disk, but any data that it had received since it wrote out the last set of immutable files will be lost.

commitlog,uninitialized_topology

Every time a node is restarted it will read all the commitlog and snapshot files it has on disk, but it will ignore all the data in the immutable Fileset files that it has already written.

### Crash Recovery

NOTE: These steps should not be necessary in most cases, especially if using the default bootstrappers configuration of filesystem,commitlog,peers,uninitialized_topology.
However in the case the configuration is non-default or the cluster has been down for a prolonged period of time these steps may be necessary. A good indicator would be log messages related to failing to bootstrap from peers due to consistency issues. -M3DB may require manual intervention to recover in the event of a prolonged loss of quorum. This is because the Peers Boostrapper must read from a majority of nodes owning a shard to bootstrap. - -To relax this bootstrapping constraint, a value stored in etcd must be modified that corresponds to the m3db.client.bootstrap-consistency-level runtime flag. Until the coordinator supports an API for this, this must be done manually. The M3 contributors are aware of how cumbersome this is and are working on this API. -To update this value in etcd, first determine the environment the M3DB node is using. For example in this configuration, it is default_env. If using the M3DB Operator, the value will be $KUBE_NAMESPACE/$CLUSTER_NAME, where $KUBE_NAMESPACE is the name of the Kubernetes namespace the cluster is located in and $CLUSTER_NAME is the name you have assigned the cluster (such as default/my-test-cluster). - -The following base64-encoded string represents a Protobuf-serialized message containing the string unstrict_majority: ChF1bnN0cmljdF9tYWpvcml0eQ==. Decode this string and place it in the following etcd key, where $ENV is the value determined above: -_kv/$ENV/m3db.client.bootstrap-consistency-level - -Note that on MacOS, base64 requires the -D flag to decode, whereas elsewhere it is likely -d. Also note the use of echo -n to ensure removal of newlines if your shell does not support the <<:/api/v1/openapi or our online API documentation. -Additionally, the following headers can be used in the namespace operations: -- Cluster-Environment-Name: -This header is used to specify the cluster environment name. If not set, the default default_env is used. -- Cluster-Zone-Name: -This header is used to specify the cluster zone name. If not set, the default embedded is used. - -### Adding a Namespace -Recommended (Easy way) -The recommended way to add a namespace to M3DB is to use our api/v1/database/namespace/create endpoint. This API abstracts over a lot of the complexity of configuring a namespace and requires only two pieces of configuration to be provided: the name of the namespace, as well as its retention. -For example, the following cURL: -curl -X POST :/api/v1/database/namespace/create -d '{ - "namespaceName": "default_unaggregated", - "retentionTime": "24h" -}' - -will create a namespace called default_unaggregated with a retention of 24 hours. All of the other namespace options will either use reasonable default values or be calculated based on the provided retentionTime. - -Adding a namespace does not require restarting M3DB, but will require modifying the M3Coordinator configuration to include the new namespace, and then restarting it. - -If you feel the need to configure the namespace options yourself (for performance or other reasons), read the Advanced section below. - -### Advanced (Hard Way) - -The "advanced" API allows you to configure every aspect of the namespace that you're adding which can sometimes be helpful for development, debugging, and tuning clusters for maximum performance. Adding a namespace is a simple as using the POST api/v1/namespace API on an M3Coordinator instance. 
-curl -X POST :/api/v1/namespace -d '{ - "name": "default_unaggregated", - "options": { - "bootstrapEnabled": true, - "flushEnabled": true, - "writesToCommitLog": true, - "cleanupEnabled": true, - "snapshotEnabled": true, - "repairEnabled": false, - "retentionOptions": { - "retentionPeriod": "2d", - "blockSize": "2h", - "bufferFuture": "10m", - "bufferPast": "10m", - "blockDataExpiry": true, - "blockDataExpiryAfterNotAccessedPeriod": "5m" - }, - "indexOptions": { - "enabled": true, - "blockSize": "2h" - } - } -}' - -Adding a namespace does not require restarting M3DB, but will require modifying the M3Coordinator configuration to include the new namespace, and then restarting it. - -### Deleting a Namespace -Deleting a namespace is a simple as using the DELETE /api/v1/namespace API on an M3Coordinator instance. -curl -X DELETE :/api/v1/namespace/ -Note that deleting a namespace will not have any effect on the M3DB nodes until they are all restarted. In addition, the namespace will need to be removed from the M3Coordinator configuration and then the M3Coordinator node will need to be restarted. - -### Modifying a Namespace -There is currently no atomic namespace modification endpoint. Instead, you will need to delete a namespace and then add it back again with the same name, but modified settings. Review the individual namespace settings above to determine whether or not a given setting is safe to modify. For example, it is never safe to modify the blockSize of a namespace. -Also, be very careful not to restart the M3DB nodes after deleting the namespace, but before adding it back. If you do this, the M3DB nodes may detect the existing data files on disk and delete them since they are not configured to retain that namespace. - -### Viewing a Namespace -In order to view a namespace and its attributes, use the GET /api/v1/namespace API on a M3Coordinator instance. Additionally, for readability/debugging purposes, you can add the debug=true parameter to the URL to view block sizes, buffer sizes, etc. in duration format as opposed to nanoseconds (default). - -#### Namespace Attributes -bootstrapEnabled -This controls whether M3DB will attempt to bootstrap the namespace on startup. This value should always be set to true unless you have a very good reason to change it as setting it to false can cause data loss when restarting nodes. -- Can be modified without creating a new namespace: yes -flushEnabled -This controls whether M3DB will periodically flush blocks to disk once they become immutable. This value should always be set to true unless you have a very good reason to change it as setting it to false will cause increased memory utilization and potential data loss when restarting nodes. -- Can be modified without creating a new namespace: yes -writesToCommitlog -This controls whether M3DB will includes writes to this namespace in the commitlog. This value should always be set to true unless you have a very good reason to change it as setting it to false will cause potential data loss when restarting nodes. -- Can be modified without creating a new namespace: yes -snapshotEnabled -This controls whether M3DB will periodically write out snapshot files for this namespace which act as compacted commitlog files. 
This value should always be set to true unless you have a very good reason to change it, as setting it to false will increase bootstrapping times (reading commitlog files is slower than reading snapshot files) and increase disk utilization (snapshot files are compressed but commitlog files are uncompressed).
- Can be modified without creating a new namespace: yes

repairEnabled
If enabled, the M3DB nodes will attempt to compare the data they own with the data of their peers and emit metrics about any discrepancies. This feature is experimental and we do not recommend enabling it under any circumstances.

retentionOptions

retentionPeriod
This controls the duration of time that M3DB will retain data for the namespace. For example, if this is set to 30 days, then data within this namespace will be available for querying up to 30 days after it is written. Note that this retention operates at the block level, not the write level, so it's possible for individual datapoints to only be available for less than the specified retention. For example, if the blockSize was set to 24 hours and the retention was set to 30 days, then a write that arrived at the very end of a 24 hour block would only be available for 29 days, but the node itself would always support querying the last 30 days' worth of data.
- Can be modified without creating a new namespace: yes

blockSize
This is the most important value to consider when tuning the performance of an M3DB namespace. Read the storage engine documentation for more details, but the basic idea is that larger blockSizes will use more memory but achieve higher compression. Similarly, smaller blockSizes will use less memory but have worse compression. In testing, good compression occurs with blockSizes containing around 720 samples per timeseries.
- Can be modified without creating a new namespace: no

Below are recommendations for block size based on resolution:

| Resolution | Block Size |
|------------|------------|
| 5s         | 60m        |
| 15s        | 3h         |
| 30s        | 6h         |
| 1m         | 12h        |
| 5m         | 60h        |

bufferFuture and bufferPast
These values control how far into the future and the past (compared to the system time on an M3DB node) writes for the namespace will be accepted. For example, consider the following configuration:

    bufferPast: 10m
    bufferFuture: 20m
    currentSystemTime: 2:35:00PM

Now consider the following writes (all of which arrive at 2:35:00PM system time, but include datapoints with the specified timestamps):

- 2:25:00PM - Accepted, within the 10m bufferPast
- 2:24:59PM - Rejected, outside the 10m bufferPast
- 2:55:00PM - Accepted, within the 20m bufferFuture
- 2:55:01PM - Rejected, outside the 20m bufferFuture

While it may be tempting to configure bufferPast and bufferFuture to very large values to prevent writes from being rejected, this may cause performance issues. M3DB is a timeseries database that is optimized for realtime data. Out of order writes, as well as writes for times that are very far into the future or past, are much more expensive and will cause additional CPU / memory pressure. In addition, M3DB cannot evict a block from memory until it is no longer mutable, and large bufferPast and bufferFuture values effectively increase the amount of time that a block is mutable, which means that it must be kept in memory for a longer period of time.
- Can be modified without creating a new namespace: yes

Index Options

enabled
Whether to use the built-in indexing. Must be true.
- Can be modified without creating a new namespace: no

blockSize
The size of blocks (in duration) that the index uses.
Should match the databases blocksize for optimal memory usage. -- Can be modified without creating a new namespace: no diff --git a/docs/content/reference_docs/configurations/operator/_index.md b/docs/content/reference_docs/configurations/operator/_index.md deleted file mode 100644 index 79ce8e30c0..0000000000 --- a/docs/content/reference_docs/configurations/operator/_index.md +++ /dev/null @@ -1,24 +0,0 @@ ---- -title: "Operator" -date: 2020-05-08T12:43:53-04:00 -draft: true ---- - -Introduction -Welcome to the documentation for the M3DB operator, a Kubernetes operator for running the open-source timeseries database M3DB on Kubernetes. - -Please note that this is alpha software, and as such its APIs and behavior are subject to breaking changes. While we aim to produce thoroughly tested reliable software there may be undiscovered bugs. - -For more background on the M3DB operator, see our KubeCon keynote on its origins and usage at Uber. - -Philosophy -The M3DB operator aims to automate everyday tasks around managing M3DB. Specifically, it aims to automate: - -Creating M3DB clusters -Destroying M3DB clusters -Expanding clusters (adding instances) -Shrinking clusters (removing instances) -Replacing failed instances -It explicitly does not try to automate every single edge case a user may ever run into. For example, it does not aim to automate disaster recovery if an entire cluster is taken down. Such use cases may still require human intervention, but the operator will aim to not conflict with such operations a human may have to take on a cluster. - -Generally speaking, the operator's philosophy is if it would be unclear to a human what action to take, we will not try to guess. \ No newline at end of file diff --git a/docs/content/reference_docs/configurations/operator/configuration/_index.md b/docs/content/reference_docs/configurations/operator/configuration/_index.md deleted file mode 100644 index aca74e5688..0000000000 --- a/docs/content/reference_docs/configurations/operator/configuration/_index.md +++ /dev/null @@ -1,21 +0,0 @@ ---- -title: "Configuration" -date: 2020-05-08T12:49:38-04:00 -draft: true ---- - -Configuring M3DB -By default the operator will apply a configmap with basic M3DB options and settings for the coordinator to direct Prometheus reads/writes to the cluster. This template can be found here. - -To apply custom a configuration for the M3DB cluster, one can set the configMapName parameter of the cluster spec to an existing configmap. - -Environment Warning -If providing a custom config map, the env you specify in your config must be $NAMESPACE/$NAME, where $NAMESPACE is the Kubernetes namespace your cluster is in and $NAME is the name of the cluster. For example, with the following cluster: - -apiVersion: operator.m3db.io/v1alpha1 -kind: M3DBCluster -metadata: - name: cluster-a - namespace: production -... -The value of env in your config MUST be production/cluster-a. This restriction allows multiple M3DB clusters to safely share the same etcd cluster. 
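To make the requirement concrete, the fragment below sketches where that env value sits in a custom M3DB config map for the example cluster above. The config.service.* shape mirrors the etcd client configuration shown elsewhere in these docs; the surrounding keys are assumptions, so confirm the exact nesting against the m3dbnode configuration reference for your version.

```yaml
# Sketch only: service discovery fragment of a custom config for cluster-a in namespace production.
config:
  service:
    env: production/cluster-a      # must be $NAMESPACE/$NAME of the M3DBCluster
    zone: embedded
    service: m3db
    cacheDir: /var/lib/m3kv
    etcdClusters:
      - zone: embedded
        endpoints:
          - http://etcd-0.etcd:2379
          - http://etcd-1.etcd:2379
          - http://etcd-2.etcd:2379
```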
\ No newline at end of file
diff --git a/docs/content/reference_docs/configurations/operator/configuration/managing_nodes.md b/docs/content/reference_docs/configurations/operator/configuration/managing_nodes.md
deleted file mode 100644
index 770df5ba2f..0000000000
--- a/docs/content/reference_docs/configurations/operator/configuration/managing_nodes.md
+++ /dev/null
@@ -1,42 +0,0 @@
---
title: "Managing nodes"
date: 2020-05-08T12:47:10-04:00
draft: true
---

Pod Identity

Motivation
M3DB assumes that if a process is started and owns sealed shards marked as Available, its data for those shards is valid and does not have to be fetched from peers. Consequently, it will begin serving reads for that data. For more background on M3DB topology, see the M3DB topology docs.

In most environments in which M3DB has been deployed in production, it has been on a set of hosts predetermined by whomever is managing the cluster. This means that an M3DB instance is identified in a topology by its hostname, and that when an M3DB process comes up and finds its hostname in the cluster with Available shards, it can serve reads for those shards.

This does not work on Kubernetes, particularly when working with StatefulSets, as a pod may be rescheduled on a new node or with new storage attached but its name may stay the same. If we were to naively use an instance's hostname (pod name), and it were to get rescheduled on a new node with no data, it could assume that the absence of data is valid and begin returning empty results for read requests.

To account for this, the M3DB Operator determines an M3DB instance's identity in the topology based on a configurable set of metadata about the pod.

Configuration
The M3DB operator uses a configurable set of metadata about a pod to determine its identity in the M3DB placement. This is encapsulated in the PodIdentityConfig field of a cluster's spec. In addition to the configured sources, a pod's name will always be included.

Every pod in an M3DB cluster is annotated with its identity, which is passed to the M3DB instance via a downward API volume.

Sources
This section will be filled out as a number of pending PRs land.

Recommendations

No Persistent Storage
If not using PVs, you should set sources to PodUID:

    podIdentityConfig:
      sources:
        - PodUID

This way, whenever a container is rescheduled, the operator will initiate a replace and the pod will stream data from its peers before serving reads. Note that not having persistent storage is not a recommended way to run M3DB.

Remote Persistent Storage
If using remote storage you do not need to set sources, as it will default to just the pod's name. The data for an M3DB instance will move around with its container.

Local Persistent Storage
If using persistent local volumes, you should set sources to NodeName. In this configuration M3DB will consider a pod to be the same so long as it's on the same node. Replaces will only be triggered if a pod with the same name is moved to a new host.

Note that if using local SSDs on GKE, node names may stay the same even though a VM has been recreated. We also support ProviderID, which will use the underlying VM's unique ID number in GCE to identify host uniqueness.
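For the local persistent storage case, the corresponding cluster spec fragment would look something like the sketch below, mirroring the PodUID example above (NodeName is one of the identity sources described in this section; treat the exact spelling as something to verify against the operator's API reference):

```yaml
# Sketch only: identify pods by the node they run on when using local PVs.
podIdentityConfig:
  sources:
    - NodeName
```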
- - diff --git a/docs/content/reference_docs/configurations/operator/configuration/namespace.md b/docs/content/reference_docs/configurations/operator/configuration/namespace.md deleted file mode 100644 index 2bddf07502..0000000000 --- a/docs/content/reference_docs/configurations/operator/configuration/namespace.md +++ /dev/null @@ -1,225 +0,0 @@ ---- -title: "Namespace" -date: 2020-05-08T12:46:59-04:00 -draft: true ---- - -Namespaces -M3DB uses the concept of namespaces to determine how metrics are stored and retained. The M3DB operator allows a user to define their own namespaces, or to use a set of presets we consider to be suitable for production use cases. - -Namespaces are configured as part of an m3dbcluster spec. - -Presets -10s:2d -This preset will store metrics at 10 second resolution for 2 days. For example, in your cluster spec: - -spec: -... - namespaces: - - name: metrics-short-term - preset: 10s:2d -1m:40d -This preset will store metrics at 1 minute resolution for 40 days. - -spec: -... - namespaces: - - name: metrics-long-term - preset: 1m:40d -Custom Namespaces -You can also define your own custom namespaces by setting the NamespaceOptions within a cluster spec. The API lists all available fields. As an example, a namespace to store 7 days of data may look like: - -... -spec: -... - namespaces: - - name: custom-7d - options: - bootstrapEnabled: true - flushEnabled: true - writesToCommitLog: true - cleanupEnabled: true - snapshotEnabled: true - repairEnabled: false - retentionOptions: - retentionPeriod: 168h - blockSize: 12h - bufferFuture: 20m - bufferPast: 20m - blockDataExpiry: true - blockDataExpiryAfterNotAccessPeriod: 5m - indexOptions: - enabled: true - blockSize: 12h - - -Node Affinity & Cluster Topology -Node Affinity -Kubernetes allows pods to be assigned to nodes based on various critera through node affinity. - -M3DB was built with failure tolerance as a core feature. M3DB's isolation groups allow shards to be placed across failure domains such that the loss of no single domain can cause the cluster to lose quorum. More details on M3DB's resiliency can be found in the deployment docs. - -By leveraging Kubernetes' node affinity and M3DB's isolation groups, the operator can guarantee that M3DB pods are distributed across failure domains. For example, in a Kubernetes cluster spread across 3 zones in a cloud region, the isolationGroups configuration below would guarantee that no single zone failure could degrade the M3DB cluster. - -M3DB is unaware of the underlying zone topology: it just views the isolation groups as group1, group2, group3 in its placement. Thanks to the Kubernetes scheduler, however, these groups are actually scheduled across separate failure domains. - -apiVersion: operator.m3db.io/v1alpha1 -kind: M3DBCluster -... -spec: - replicationFactor: 3 - isolationGroups: - - name: group1 - numInstances: 3 - nodeAffinityTerms: - - key: failure-domain.beta.kubernetes.io/zone - values: - - us-east1-b - - name: group2 - numInstances: 3 - nodeAffinityTerms: - - key: failure-domain.beta.kubernetes.io/zone - values: - - us-east1-c - - name: group3 - numInstances: 3 - nodeAffinityTerms: - - key: failure-domain.beta.kubernetes.io/zone - values: - - us-east1-d -Tolerations -In addition to allowing pods to be assigned to certain nodes via node affinity, Kubernetes allows pods to be repelled from nodes through taints if they don't tolerate the taint. For example, the following config would ensure: - -Pods are spread across zones. 
- -Pods are only assigned to nodes in the m3db-dedicated-pool pool. - -No other pods could be assigned to those nodes (assuming they were tainted with the taint m3db-dedicated-taint). - -apiVersion: operator.m3db.io/v1alpha1 -kind: M3DBCluster -... -spec: - replicationFactor: 3 - isolationGroups: - - name: group1 - numInstances: 3 - nodeAffinityTerms: - - key: failure-domain.beta.kubernetes.io/zone - values: - - us-east1-b - - key: nodepool - values: - - m3db-dedicated-pool - - name: group2 - numInstances: 3 - nodeAffinityTerms: - - key: failure-domain.beta.kubernetes.io/zone - values: - - us-east1-c - - key: nodepool - values: - - m3db-dedicated-pool - - name: group3 - numInstances: 3 - nodeAffinityTerms: - - key: failure-domain.beta.kubernetes.io/zone - values: - - us-east1-d - - key: nodepool - values: - - m3db-dedicated-pool - tolerations: - - key: m3db-dedicated - effect: NoSchedule - operator: Exists -Example Affinity Configurations -Zonal Cluster -The examples so far have focused on multi-zone Kubernetes clusters. Some users may only have a cluster in a single zone and accept the reduced fault tolerance. The following configuration shows how to configure the operator in a zonal cluster. - -apiVersion: operator.m3db.io/v1alpha1 -kind: M3DBCluster -... -spec: - replicationFactor: 3 - isolationGroups: - - name: group1 - numInstances: 3 - nodeAffinityTerms: - - key: failure-domain.beta.kubernetes.io/zone - values: - - us-east1-b - - name: group2 - numInstances: 3 - nodeAffinityTerms: - - key: failure-domain.beta.kubernetes.io/zone - values: - - us-east1-b - - name: group3 - numInstances: 3 - nodeAffinityTerms: - - key: failure-domain.beta.kubernetes.io/zone - values: - - us-east1-b -6 Zone Cluster -In the above examples we created clusters with 1 isolation group in each of 3 zones. Because values within a single NodeAffinityTerm are OR'd, we can also spread an isolationgroup across multiple zones. For example, if we had 6 zones available to us: - -apiVersion: operator.m3db.io/v1alpha1 -kind: M3DBCluster -... -spec: - replicationFactor: 3 - isolationGroups: - - name: group1 - numInstances: 3 - nodeAffinityTerms: - - key: failure-domain.beta.kubernetes.io/zone - values: - - us-east1-a - - us-east1-b - - name: group2 - numInstances: 3 - nodeAffinityTerms: - - key: failure-domain.beta.kubernetes.io/zone - values: - - us-east1-c - - us-east1-d - - name: group3 - numInstances: 3 - nodeAffinityTerms: - - key: failure-domain.beta.kubernetes.io/zone - values: - - us-east1-e - - us-east1-f -No Affinity -If there are no failure domains available, one can have a cluster with no affinity where the pods will be scheduled however Kubernetes would place them by default: - -apiVersion: operator.m3db.io/v1alpha1 -kind: M3DBCluster -... -spec: - replicationFactor: 3 - isolationGroups: - - name: group1 - numInstances: 3 - - name: group2 - numInstances: 3 - - name: group3 - numInstances: 3 - -Node Endpoint -M3DB stores an endpoint field on placement instances that is used for communication between DB nodes and from other components such as the coordinator. - -The operator allows customizing the format of this endpoint by setting the nodeEndpointFormat field on a cluster spec. 
The format of this field uses Go templates, with the following template fields currently supported:

| Field        | Description                        |
|--------------|------------------------------------|
| PodName      | Name of the pod                    |
| M3DBService  | Name of the generated M3DB service |
| PodNamespace | Namespace the pod is in            |
| Port         | Port M3DB is serving RPCs on       |

The default format is:

    {{ .PodName }}.{{ .M3DBService }}:{{ .Port }}

As an example of an override, to expose an M3DB cluster to containers in other Kubernetes namespaces, nodeEndpointFormat can be set to:

    {{ .PodName }}.{{ .M3DBService }}.{{ .PodNamespace }}:{{ .Port }}
\ No newline at end of file
diff --git a/docs/content/reference_docs/configurations/operator/getting_started/_index.md b/docs/content/reference_docs/configurations/operator/getting_started/_index.md
deleted file mode 100644
index 62e7b28ea8..0000000000
--- a/docs/content/reference_docs/configurations/operator/getting_started/_index.md
+++ /dev/null
@@ -1,22 +0,0 @@
---
title: "Getting Started"
date: 2020-05-08T12:49:48-04:00
draft: true
---

Requirements

Kubernetes Versions
The M3DB operator currently targets Kubernetes 1.11 and 1.12. Given the operator's current production use cases at Uber, we typically target the two most recent minor Kubernetes versions supported by GKE. We welcome community contributions to support more recent versions while meeting the aforementioned GKE targets!

Multi-Zone Kubernetes Cluster
The M3DB operator is intended to be used with Kubernetes clusters that span at least 3 zones within a region to create highly available clusters and maintain quorum in the event of zone failures. Instructions for creating regional clusters on GKE can be found here.

Etcd
M3DB stores its cluster topology and all other runtime metadata in etcd.

For testing / non-production use cases, we provide simple manifests for running etcd on Kubernetes in our example manifests: one for running ephemeral etcd containers and one for running etcd using basic persistent volumes. If using the etcd-pd yaml manifest, we recommend a modification to use a StorageClass equivalent to your cloud provider's fastest remote disk (such as pd-ssd on GCP).

For production use cases, we recommend running etcd (in order of preference):

1. External to your Kubernetes cluster to avoid circular dependencies.
2. Using the etcd operator.
\ No newline at end of file
diff --git a/docs/content/reference_docs/configurations/operator/getting_started/install.md b/docs/content/reference_docs/configurations/operator/getting_started/install.md
deleted file mode 100644
index dd90b0a9bf..0000000000
--- a/docs/content/reference_docs/configurations/operator/getting_started/install.md
+++ /dev/null
@@ -1,20 +0,0 @@
---
title: "Install"
date: 2020-05-08T12:46:04-04:00
draft: true
---

Installation
Be sure to take a look at the requirements before installing the operator.

Helm
Add the m3db-operator repo:

    helm repo add m3db https://m3-helm-charts.storage.googleapis.com/stable

Install the m3db-operator chart:

    helm install m3db/m3db-operator --namespace m3db-operator

Note: If uninstalling an instance of the operator that was installed with Helm, some resources such as the ClusterRole, ClusterRoleBinding, and ServiceAccount may need to be deleted manually.
- -Manually -Install the bundled operator manifests in the current namespace: - -kubectl apply -f https://raw.githubusercontent.com/m3db/m3db-operator/master/bundle.yaml diff --git a/docs/content/reference_docs/configurations/operator/getting_started/managing_cluster.md b/docs/content/reference_docs/configurations/operator/getting_started/managing_cluster.md deleted file mode 100644 index 194250bb80..0000000000 --- a/docs/content/reference_docs/configurations/operator/getting_started/managing_cluster.md +++ /dev/null @@ -1,163 +0,0 @@ ---- -title: "Managing cluster" -date: 2020-05-08T12:46:31-04:00 -draft: true ---- - -Creating a Cluster -Once you've installed the M3DB operator and read over the requirements, you can start creating some M3DB clusters! - -Basic Cluster -The following creates an M3DB cluster spread across 3 zones, with each M3DB instance being able to store up to 350gb of data using your Kubernetes cluster's default storage class. For examples of different cluster topologies, such as zonal clusters, see the docs on node affinity. - -Etcd -Create an etcd cluster with persistent volumes: - -kubectl apply -f https://raw.githubusercontent.com/m3db/m3db-operator/v0.6.0/example/etcd/etcd-pd.yaml -We recommend modifying the storageClassName in the manifest to one that matches your cloud provider's fastest remote storage option, such as pd-ssd on GCP. - -M3DB -apiVersion: operator.m3db.io/v1alpha1 -kind: M3DBCluster -metadata: - name: persistent-cluster -spec: - image: quay.io/m3db/m3dbnode:latest - replicationFactor: 3 - numberOfShards: 256 - isolationGroups: - - name: group1 - numInstances: 1 - nodeAffinityTerms: - - key: failure-domain.beta.kubernetes.io/zone - values: - - - - name: group2 - numInstances: 1 - nodeAffinityTerms: - - key: failure-domain.beta.kubernetes.io/zone - values: - - - - name: group3 - numInstances: 1 - nodeAffinityTerms: - - key: failure-domain.beta.kubernetes.io/zone - values: - - - etcdEndpoints: - - http://etcd-0.etcd:2379 - - http://etcd-1.etcd:2379 - - http://etcd-2.etcd:2379 - podIdentityConfig: - sources: [] - namespaces: - - name: metrics-10s:2d - preset: 10s:2d - dataDirVolumeClaimTemplate: - metadata: - name: m3db-data - spec: - accessModes: - - ReadWriteOnce - resources: - requests: - storage: 350Gi - limits: - storage: 350Gi -Ephemeral Cluster -WARNING: This setup is not intended for production-grade clusters, but rather for "kicking the tires" with the operator and M3DB. It is intended to work across almost any Kubernetes environment, and as such has as few dependencies as possible (namely persistent storage). See below for instructions on creating a more durable cluster. - -Etcd -Create an etcd cluster in the same namespace your M3DB cluster will be created in. If you don't have persistent storage available, this will create a cluster that will not use persistent storage and will likely become unavailable if any of the pods die: - -kubectl apply -f https://raw.githubusercontent.com/m3db/m3db-operator/v0.6.0/example/etcd/etcd-basic.yaml - -# Verify etcd health once pods available -kubectl exec etcd-0 -- env ETCDCTL_API=3 etcdctl endpoint health -# 127.0.0.1:2379 is healthy: successfully committed proposal: took = 2.94668ms -If you have remote storage available and would like to jump straight to using it, apply the following manifest for etcd instead: - -kubectl apply -f https://raw.githubusercontent.com/m3db/m3db-operator/v0.6.0/example/etcd/etcd-pd.yaml -M3DB -Once etcd is available, you can create an M3DB cluster. 
An example of a very basic M3DB cluster definition is as follows: - -apiVersion: operator.m3db.io/v1alpha1 -kind: M3DBCluster -metadata: - name: simple-cluster -spec: - image: quay.io/m3db/m3dbnode:latest - replicationFactor: 3 - numberOfShards: 256 - etcdEndpoints: - - http://etcd-0.etcd:2379 - - http://etcd-1.etcd:2379 - - http://etcd-2.etcd:2379 - isolationGroups: - - name: group1 - numInstances: 1 - nodeAffinityTerms: - - key: failure-domain.beta.kubernetes.io/zone - values: - - - - name: group2 - numInstances: 1 - nodeAffinityTerms: - - key: failure-domain.beta.kubernetes.io/zone - values: - - - - name: group3 - numInstances: 1 - nodeAffinityTerms: - - key: failure-domain.beta.kubernetes.io/zone - values: - - - podIdentityConfig: - sources: - - PodUID - namespaces: - - name: metrics-10s:2d - preset: 10s:2d -This will create a highly available cluster with RF=3 spread evenly across the three given zones within a region. A pod's UID will be used for its identity. The cluster will have 1 namespace that stores metrics for 2 days at 10s resolution. - -Next, apply your manifest: - -$ kubectl apply -f example/simple-cluster.yaml -m3dbcluster.operator.m3db.io/simple-cluster created -Shortly after all pods are created you should see the cluster ready! - -$ kubectl get po -l operator.m3db.io/app=m3db -NAME READY STATUS RESTARTS AGE -simple-cluster-rep0-0 1/1 Running 0 1m -simple-cluster-rep1-0 1/1 Running 0 56s -simple-cluster-rep2-0 1/1 Running 0 37s -We can verify that the cluster has finished streaming data by peers by checking that an instance has bootstrapped: - -$ kubectl exec simple-cluster-rep2-0 -- curl -sSf localhost:9002/health -{"ok":true,"status":"up","bootstrapped":true} - -Deleting a Cluster -Delete your M3DB cluster with kubectl: - -kubectl delete m3dbcluster simple-cluster -By default, the operator will delete the placement and namespaces associated with a cluster before the CRD resource deleted. If you do not want this behavior, set keepEtcdDataOnDelete to true on your cluster spec. - -Under the hood, the operator uses Kubernetes finalizers to ensure the cluster CRD is not deleted until the operator has had a chance to do cleanup. - -Debugging Stuck Cluster Deletion -If for some reason the operator is unable to delete the placement and namespace for the cluster, the cluster CRD itself will be stuck in a state where it can not be deleted, due to the way finalizers work in Kubernetes. The operator might be unable to clean up the data for many reasons, for example if the M3DB cluster itself is not available to serve the APIs for cleanup or if etcd is down and cannot fulfill the deleted. - -To allow the CRD to be deleted, you can kubectl edit m3dbcluster $CLUSTER and remove the operator.m3db.io/etcd-deletion finalizer. For example, in the following cluster you'd remove the finalizer from metadata.finalizers: - -apiVersion: operator.m3db.io/v1alpha1 -kind: M3DBCluster -metadata: - ... - finalizers: - - operator.m3db.io/etcd-deletion - name: m3db-cluster -... -Note that if you do this, you'll have to manually remove the relevant data in etcd. 
For a cluster in namespace $NS with name $CLUSTER, the keys are: - -_sd.placement/$NS/$CLUSTER/m3db -_kv/$NS/$CLUSTER/m3db.node.namespaces diff --git a/docs/content/reference_docs/configurations/operator/getting_started/monitoring.md b/docs/content/reference_docs/configurations/operator/getting_started/monitoring.md deleted file mode 100644 index ef9b3ce37d..0000000000 --- a/docs/content/reference_docs/configurations/operator/getting_started/monitoring.md +++ /dev/null @@ -1,11 +0,0 @@ ---- -title: "Monitoring" -date: 2020-05-08T12:46:15-04:00 -draft: true ---- - -Monitoring -M3DB exposes metrics via a Prometheus endpoint. If using the Prometheus Operator, you can apply a ServiceMonitor to have your M3DB pods automatically scraped by Prometheus: - -kubectl apply -f https://raw.githubusercontent.com/m3db/m3db-operator/master/example/prometheus-servicemonitor.yaml -You can visit the "targets" page of the Prometheus UI to verify the pods are being scraped. To view these metrics using Grafana, follow the M3 docs to install the M3DB Grafana dashboard. \ No newline at end of file diff --git a/docs/content/reference_docs/configurations/replication.md b/docs/content/reference_docs/configurations/replication.md deleted file mode 100644 index 7980f50e82..0000000000 --- a/docs/content/reference_docs/configurations/replication.md +++ /dev/null @@ -1,172 +0,0 @@ ---- -title: "Replication" -date: 2020-04-21T21:01:57-04:00 -draft: true ---- - -### Sharding -Timeseries keys are hashed to a fixed set of virtual shards. Virtual shards are then assigned to physical nodes. M3DB can be configured to use any hashing function and a configured number of shards. By default murmur3 is used as the hashing function and 4096 virtual shards are configured. - -#### Benefits -Shards provide a variety of benefits throughout the M3DB stack: -They make horizontal scaling easier and adding / removing nodes without downtime trivial at the cluster level. -They provide more fine grained lock granularity at the memory level. -They inform the filesystem organization in that data belonging to the same shard will be used / dropped together and can be kept in the same file. - - -### Replication -Logical shards are placed per virtual shard per replica with configurable isolation (zone aware, rack aware, etc). For instance, when using rack aware isolation, the set of datacenter racks that locate a replica’s data is distinct to the racks that locate all other replicas’ data. -Replication is synchronization during a write and depending on the consistency level configured will notify the client on whether a write succeeded or failed with respect to the consistency level and replication achieved. -#### Replica -Each replica has its own assignment of a single logical shard per virtual shard. -Conceptually it can be defined as: -Replica { - id uint32 - shards []Shard -} - -Shard state -Each shard can be conceptually defined as: -Shard { - id uint32 - assignments []ShardAssignment -} - -ShardAssignment { - host Host - state ShardState -} - -enum ShardState { - INITIALIZING, - AVAILABLE, - LEAVING -} - -### Shard assignment -The assignment of shards is stored in etcd. When adding, removing or replacing a node shard goal states are assigned for each shard assigned. -For a write to appear as successful for a given replica it must succeed against all assigned hosts for that shard. 
That means if there is a given shard with a host assigned as LEAVING and another host assigned as INITIALIZING for a given replica writes to both these hosts must appear as successful to return success for a write to that given replica. Currently however only AVAILABLE shards count towards consistency, the work to group the LEAVING and INITIALIZING shards together when calculating a write success/error is not complete, see issue 417. -It is up to the nodes themselves to bootstrap shards when the assignment of new shards to it are discovered in the INITIALIZING state and to transition the state to AVAILABLE once bootstrapped by calling the cluster management APIs when done. Using a compare and set this atomically removes the LEAVING shard still assigned to the node that previously owned it and transitions the shard state on the new node from INITIALIZING state to AVAILABLE. -Nodes will not start serving reads for the new shard until it is AVAILABLE, meaning not until they have bootstrapped data for those shards. - - - -### Replication and Deployment in Zones -#### Overview -M3DB supports both deploying across multiple zones in a region or deploying to a single zone with rack-level isolation. It can also be deployed across multiple regions for a global view of data, though both latency and bandwidth costs may increase as a result. -In addition, M3DB has support for automatically replicating data between isolated M3DB clusters (potentially running in different zones / regions). More details can be found in the Replication between clusters operational guide. - -#### Replication -A replication factor of at least 3 is highly recommended for any M3DB deployment, due to the consistency levels (for both reads and writes) that require quorum in order to complete an operation. For more information on consistency levels, see the documentation concerning tuning availability, consistency and durability. -M3DB will do its best to distribute shards evenly among the availability zones while still taking each individual node's weight into account, but if some of the availability zones have less available hosts than others then each host in that zone will be responsible for more shards than hosts in the other zones and will thus be subjected to heavier load. - -#### Replication Factor Recommendations -Running with RF=1 or RF=2 is not recommended for any multi-node use cases (testing or production). In the future such topologies may be rejected by M3DB entirely. It is also recommended to only run with an odd number of replicas. -- RF=1 is not recommended as it is impossible to perform a safe upgrade or tolerate any node failures: as soon as one node is down, all writes destined for the shards it owned will fail. If the node's storage is lost (e.g. the disk fails), the data is gone forever. -- RF=2, despite having an extra replica, entails many of the same problems RF=1 does. When M3DB is configured to perform quorum writes and reads (the default), as soon as a single node is down (for planned maintenance or an unplanned disruption) clients will be unable to read or write (as the quorum of 2 nodes is 2). Even if clients relax their consistency guarantees and read from the remaining serving node, users may experience flapping results depending on whether one node had data for a time window that the other did not. - -Finally, it is only recommended to run with an odd number of replicas. 
Because the quorum size of an even RF N is (N/2)+1, any cluster with an even replica factor N has the same failure tolerance as a cluster with RF=N-1. "Failure tolerance" is defined as the number of isolation groups you can concurrently lose nodes across. The following table demonstrates the quorum size and failure tolerance of various RFs, inspired by etcd's failure tolerance documentation.

| Replica Factor | Quorum Size | Failure Tolerance |
|----------------|-------------|-------------------|
| 1              | 1           | 0                 |
| 2              | 2           | 0                 |
| 3              | 2           | 1                 |
| 4              | 3           | 1                 |
| 5              | 3           | 2                 |
| 6              | 4           | 2                 |
| 7              | 4           | 3                 |

Upgrading hosts in a deployment
When an M3DB node is restarted it has to perform a bootstrap process before it can serve reads. During this time the node will continue to accept writes, but will not be available for reads.
Obviously, there is also a small window of time between when the process is stopped and then started again where it will also be unavailable for writes.

Deployment across multiple availability zones in a region
For deployment in a region, it is recommended to set the isolationGroup host attribute to the name of the availability zone a host is in.
In this configuration, shards are distributed among hosts such that each will not be placed more than once in the same availability zone. This allows an entire availability zone to be lost at any given time, as it is guaranteed to only affect one replica of data.
For example, in a multi-zone deployment with four shards spread over three availability zones:
Typically, deployments have many more than four shards - this is a simple example that illustrates how M3DB maintains availability while losing an availability zone, as two of three replicas are still intact.

Deployment in a single zone
For deployment in a single zone, it is recommended to set the isolationGroup host attribute to the name of the rack a host is in or another logical unit that separates groups of hosts in your zone.
In this configuration, shards are distributed among hosts such that each will not be placed more than once in the same defined rack or logical unit. This allows an entire unit to be lost at any given time, as it is guaranteed to only affect one replica of data.
For example, in a single-zone deployment with three shards spread over four racks:
Typically, deployments have many more than three shards - this is a simple example that illustrates how M3DB maintains availability while losing a single rack, as two of three replicas are still intact.

Deployment across multiple regions
For deployment across regions, it is recommended to set the isolationGroup host attribute to the name of the region a host is in.
As mentioned previously, latency and bandwidth costs may increase when using clusters that span regions.
In this configuration, shards are distributed among hosts such that each will not be placed more than once in the same region. This allows an entire region to be lost at any given time, as it is guaranteed to only affect one replica of data.
For example, in a multi-region deployment with four shards spread over five regions:
Typically, deployments have many more than four shards - this is a simple example that illustrates how M3DB maintains availability while losing up to two regions, as three of five replicas are still intact.

Replication between clusters (beta)

Overview
M3DB clusters can be configured to passively replicate data from other clusters.
### Replication between clusters (beta)

#### Overview

M3DB clusters can be configured to passively replicate data from other clusters. This feature is most commonly used when operators wish to run two (or more) regional clusters that function independently while passively replicating data from each other in an eventually consistent manner.

The cross-cluster replication feature is built on top of the background repairs feature. As a result, it has all the same caveats and limitations. Specifically, it does not currently work with clusters that use M3DB's indexing feature, and the replication delay between two clusters will be at least (block size + bufferPast) for data written at the beginning of a block for a given namespace. For use cases where a large replication delay is unacceptable, the current recommendation is to dual-write to both clusters in parallel and then rely upon the cross-cluster replication feature to repair any discrepancies between the clusters caused by failed dual-writes. This recommendation is likely to change in the future once support for low-latency replication is added to M3DB in the form of commitlog tailing.

While cross-cluster replication is built on top of the background repairs feature, background repairs do not need to be enabled for cross-cluster replication to be enabled. In other words, clusters can be configured such that:

- Background repairs (within a cluster) are disabled and replication is also disabled.
- Background repairs (within a cluster) are enabled, but replication is disabled.
- Background repairs (within a cluster) are disabled, but replication is enabled.
- Background repairs (within a cluster) are enabled and replication is also enabled.

#### Configuration

Important: All M3DB clusters involved in the cross-cluster replication process must be configured such that they have the exact same:

- Number of shards
- Replication factor
- Namespace configuration

The replication feature can be enabled by adding the following configuration to m3dbnode.yml under the db section (the angle-bracketed values are placeholders for your etcd environment, zone, service name, and endpoints):

```yaml
db:
  ... (other configuration)
  replication:
    clusters:
      - name: "some-other-cluster"
        repairEnabled: true
        client:
          config:
            service:
              env: <ENV>
              zone: <ZONE>
              service: <SERVICE>
              cacheDir: /var/lib/m3kv
              etcdClusters:
                - zone: <ZONE>
                  endpoints:
                    - <ETCD_HOST>:<ETCD_PORT>
```

Note that the repairEnabled field in the configuration above is independent of the enabled field under the repair section. For example, the configuration above will enable replication of data from some-other-cluster, but will not perform background repairs within the cluster the M3DB node belongs to.

However, the following configuration:

```yaml
db:
  ... (other configuration)
  repair:
    enabled: true

  replication:
    clusters:
      - name: "some-other-cluster"
        repairEnabled: true
        client:
          config:
            service:
              env: <ENV>
              zone: <ZONE>
              service: <SERVICE>
              cacheDir: /var/lib/m3kv
              etcdClusters:
                - zone: <ZONE>
                  endpoints:
                    - <ETCD_HOST>:<ETCD_PORT>
```

would enable both replication of data from some-other-cluster as well as background repairs within the cluster that the M3DB node belongs to.

diff --git a/docs/content/reference_docs/configurations/topology_config.md b/docs/content/reference_docs/configurations/topology_config.md
deleted file mode 100644
index 7aa47cccc3..0000000000
--- a/docs/content/reference_docs/configurations/topology_config.md
+++ /dev/null
@@ -1,92 +0,0 @@
---
title: "Topology and placement"
date: 2020-04-21T21:01:48-04:00
draft: true
---

### Placement

#### Overview

Note: The words placement and topology are used interchangeably throughout the M3DB documentation and codebase.

An M3DB cluster has exactly one placement. That placement maps the cluster's shard replicas to nodes.
A cluster also has 0 or more namespaces (analogous to tables in other databases), and each node serves every namespace for the shards it owns. In other words, if the cluster topology states that node A owns shards 1, 2, and 3, then node A will own shards 1, 2, and 3 for all configured namespaces in the cluster.

M3DB stores its placement (the mapping of which nodes are responsible for which shards) in etcd. There are three possible states that each node/shard pair can be in:

- Initializing
- Available
- Leaving

Note that these states are not a reflection of the current status of an M3DB node, but an indication of whether a given node has ever successfully bootstrapped and taken ownership of a given shard (achieved its goal state). For example, in a new cluster all the nodes will begin with all of their shards in the Initializing state. Once all the nodes finish bootstrapping, they will mark all of their shards as Available. If all the M3DB nodes are stopped at the same time, the cluster placement will still show all of the shards for all of the nodes as Available.

### Initializing State

The Initializing state is the state in which all new node/shard combinations begin. For example, upon creating a new placement all the node/shard pairs will begin in the Initializing state, and only once they have successfully bootstrapped will they transition to the Available state.

The Initializing state is not limited to new placements, however, as it can also occur during placement changes. For example, during a node add/replace the new node will begin with all of its shards in the Initializing state until it can stream the data it is missing from its peers. During a node removal, all of the nodes that receive new shards (as a result of taking over the responsibilities of the node that is leaving) will begin with those shards marked as Initializing until they can stream in the data from the node leaving the cluster, or from one of its peers.

### Available State

Once a node with a shard in the Initializing state successfully bootstraps all of the data for that shard, it will mark that shard as Available (for that single node) in the cluster placement.

### Leaving State

The Leaving state indicates that a node has been marked for removal from the cluster. The purpose of this state is to allow the node to remain in the cluster long enough for the nodes that are taking over its responsibilities to stream data from it.

### Sample Cluster State Transitions - Node Add

Node adds are performed by adding the new node to the placement. Some portion of the existing shards will be assigned to the new node based on its weight, and those shards will begin in the Initializing state on the new node. Similarly, the same shards will be marked as Leaving on the nodes that are destined to lose ownership of them. Once the new node finishes bootstrapping the shards, it will update the placement to indicate that the shards it owns are Available and that the Leaving nodes should no longer own those shards in the placement.
```
Replication factor: 3
(- = the node does not own the shard at this step)

                      Node A          Node B          Node C          Node D
1) Initial Placement
   Shard 1            Available       Available       Available       -
   Shard 2            Available       Available       Available       -
   Shard 3            Available       Available       Available       -

2) Begin Node Add
   Shard 1            Leaving         Available       Available       Initializing
   Shard 2            Available       Leaving         Available       Initializing
   Shard 3            Available       Available       Leaving         Initializing

3) Complete Node Add
   Shard 1            -               Available       Available       Available
   Shard 2            Available       -               Available       Available
   Shard 3            Available       Available       -               Available
```
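In step 3 ("Complete Node Add"), the new node performs the compare-and-set update described earlier: in a single atomic placement update it marks its copy of the shard as Available and removes the Leaving assignment from the node that previously owned it. The sketch below illustrates that transition with hypothetical types and function names; it is not M3DB's actual placement or etcd API.

```go
package main

import (
	"errors"
	"fmt"
)

type shardState string

const (
	initializing shardState = "Initializing"
	available    shardState = "Available"
	leaving      shardState = "Leaving"
)

// placement is a hypothetical, simplified placement: shard -> node -> state.
// The version field stands in for the compare-and-set revision stored in etcd.
type placement struct {
	version int
	shards  map[uint32]map[string]shardState
}

// completeNodeAdd marks the shard Available on newNode and drops the Leaving
// assignment from oldNode, failing if the placement changed underneath us.
func completeNodeAdd(p *placement, expectedVersion int, shard uint32, newNode, oldNode string) error {
	if p.version != expectedVersion {
		return errors.New("placement changed concurrently; retry the compare-and-set")
	}
	assignments := p.shards[shard]
	if assignments[newNode] != initializing || assignments[oldNode] != leaving {
		return errors.New("unexpected shard states; refusing to transition")
	}
	assignments[newNode] = available
	delete(assignments, oldNode)
	p.version++
	return nil
}

func main() {
	p := &placement{
		version: 7,
		shards: map[uint32]map[string]shardState{
			1: {"node-a": leaving, "node-d": initializing},
		},
	}
	fmt.Println(completeNodeAdd(p, 7, 1, "node-d", "node-a")) // <nil>
	fmt.Println(p.shards[1])                                  // map[node-d:Available]
}
```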
### Overview

M3DB was designed from the ground up to be a distributed (clustered) database that is availability zone or rack aware (through isolation groups). Clusters will seamlessly scale with your data, and you can start with a small number of nodes and grow to several hundred nodes with no downtime or expensive migrations.

Before reading the rest of this document, we recommend familiarizing yourself with the M3DB placement documentation.

Note: The primary limiting factor for the maximum size of an M3DB cluster is the number of shards. Picking an appropriate number of shards is more of an art than a science, but our recommendation is as follows:

The number of shards that M3DB uses is configurable, and there are a couple of key points to note when deciding the number to use. The more nodes you have, the more shards you want, because you want the shards to be distributed evenly among your nodes. However, because each shard requires more files to be created, you also don't want to have too many shards per node. Note as well that the number of shards must be decided up front and cannot be changed once the cluster is created: changing it would require each bit of data to be repartitioned and moved around the cluster (i.e. every bit of data would need to be moved all at once). Below are some guidelines depending on how many nodes you will eventually have in your cluster.

| Number of Nodes | Number of Shards |
| --------------- | ---------------- |
| 3               | 64               |
| 6               | 128              |
| 12              | 256              |
| 24              | 512              |
| 48              | 1024             |
| 128+            | 4096             |
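As a rough mental model for the table above: each series ID hashes to one of the fixed number of shards, and those shards (times the replication factor) are spread across the nodes, so the shard count bounds both how evenly load can spread and how far the cluster can grow. The sketch below uses a generic FNV hash purely as a stand-in; M3DB's actual sharding function and shard assignment logic may differ.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a series ID to a shard via a simple hash.
// FNV is only a stand-in here; M3DB's real sharding function may differ.
func shardFor(seriesID string, numShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(seriesID))
	return h.Sum32() % numShards
}

func main() {
	const (
		numShards         = 128
		replicationFactor = 3
		numNodes          = 6
	)

	fmt.Println(shardFor("cpu.utilization{host=web-01}", numShards))

	// With 128 shards, RF=3 and 6 nodes, each node owns roughly
	// 128 * 3 / 6 = 64 shard replicas.
	perNode := numShards * replicationFactor / numNodes
	fmt.Printf("~%d shard replicas per node\n", perNode)
}
```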