diff --git a/README.md b/README.md
index b10e4a22..c0c78bed 100644
--- a/README.md
+++ b/README.md
@@ -5,111 +5,213 @@
 [![PyPI version](https://badge.fury.io/py/slo-generator.svg)](https://badge.fury.io/py/slo-generator)
 
 `slo-generator` is a tool to compute and export **Service Level Objectives**
 ([SLOs](https://landing.google.com/sre/sre-book/chapters/service-level-objectives/)),
-**Error Budgets** and **Burn Rates**, using policies written in JSON or YAML format.
+**Error Budgets** and **Burn Rates**, using configurations written in YAML (or JSON) format.
+
+## Table of contents
+
+- [Description](#description)
+- [Local usage](#local-usage)
+  - [Requirements](#requirements)
+  - [Installation](#installation)
+  - [CLI usage](#cli-usage)
+  - [API usage](#api-usage)
+- [Configuration](#configuration)
+  - [SLO configuration](#slo-configuration)
+  - [Shared configuration](#shared-configuration)
+- [More documentation](#more-documentation)
+  - [Build an SLO achievements report with BigQuery and DataStudio](#build-an-slo-achievements-report-with-bigquery-and-datastudio)
+  - [Deploy the SLO Generator in Cloud Run](#deploy-the-slo-generator-in-cloud-run)
+  - [Deploy the SLO Generator in Kubernetes (Alpha)](#deploy-the-slo-generator-in-kubernetes-alpha)
+  - [Deploy the SLO Generator in a CloudBuild pipeline](#deploy-the-slo-generator-in-a-cloudbuild-pipeline)
+  - [DEPRECATED: Deploy the SLO Generator on Google Cloud Functions (Terraform)](#deprecated-deploy-the-slo-generator-on-google-cloud-functions-terraform)
+  - [Contribute to the SLO Generator](#contribute-to-the-slo-generator)
 
 ## Description
 
-`slo-generator` will query metrics backend and compute the following metrics:
+The `slo-generator` runs backend queries to compute **Service Level Indicators**,
+compares them with the defined **Service Level Objectives**, and generates a
+report by computing important metrics:
 
-* **Service Level Objective** defined as `SLO (%) = GOOD_EVENTS / VALID_EVENTS`
-* **Error Budget** defined as `ERROR_BUDGET = 100 - SLO (%)`
-* **Burn Rate** defined as `BURN_RATE = ERROR_BUDGET / ERROR_BUDGET_TARGET`
+* **Service Level Indicator** (SLI) defined as **SLI = N<sub>good_events</sub> / N<sub>valid_events</sub>**
+* **Error Budget** (EB) defined as **EB = 1 - SLI**
+* **Error Budget Burn Rate** (EBBR) defined as **EBBR = EB / EB<sub>target</sub>**
+* **... and more**, see the [example SLO report](./tests/unit/fixtures/slo_report_v2.json).
+
+The **Error Budget Burn Rate** is often used for [**alerting on SLOs**](https://sre.google/workbook/alerting-on-slos/),
+as it has proven in practice to be more **reliable** and **stable** than
+alerting directly on metrics or on **SLI > SLO** thresholds.
 
 ## Local usage
 
-**Requirements**
+### Requirements
 
-* Python 3
+* `python3.7` and above
+* `pip3`
 
-**Installation**
+### Installation
 
-`slo-generator` is published on PyPI. To install it, run:
+`slo-generator` is a Python library published on [PyPI](https://pypi.org). To install it, run:
 
 ```sh
 pip3 install slo-generator
 ```
 
-**Run the `slo-generator`**
-
-```
-slo-generator -f <SLO_CONFIG_PATH> -b <ERROR_BUDGET_POLICY> --export
-```
- * `<SLO_CONFIG_PATH>` is the [SLO config](#slo-configuration) file or folder.
-   If a folder path is passed, the SLO configs filenames should match the pattern `slo_*.yaml` to be loaded.
-
- * `<ERROR_BUDGET_POLICY>` is the [Error Budget Policy](#error-budget-policy) file.
-
- * `--export` enables exporting data using the `exporters` defined in the SLO
-   configuration file.
-
-Use `slo-generator --help` to list all available arguments.
- ***Notes:***
+* To install **[providers](./docs/providers)**, use
+`pip3 install slo-generator[<PROVIDER_1>, <PROVIDER_2>, ...]`.
+
+### CLI usage
+
+To compute an SLO report using the CLI, run:
+
+```sh
+slo-generator compute -f <SLO_CONFIG_PATH> -c <SHARED_CONFIG_PATH> --export
+```
+where:
+ * `<SLO_CONFIG_PATH>` is the [SLO configuration](#slo-configuration) file or folder path.
-* **SLO metadata**:
-  * `slo_name`: Name of this SLO.
-  * `slo_description`: Description of this SLO.
-  * `slo_target`: SLO target (between 0 and 1).
-  * `service_name`: Name of the monitored service.
-  * `feature_name`: Name of the monitored subsystem.
-  * `metadata`: Dict of user metadata.
+ * `<SHARED_CONFIG_PATH>` is the [Shared configuration](#shared-configuration) file path.
+ * `--export` | `-e` enables exporting data using the `exporters` specified in the SLO
+   configuration file.
-* **SLI configuration**:
-  * `backend`: Specific documentation and examples are available for each supported backends:
-    * [Stackdriver Monitoring](docs/providers/stackdriver.md#backend)
-    * [Stackdriver Service Monitoring](docs/providers/stackdriver_service_monitoring.md#backend)
-    * [Prometheus](docs/providers/prometheus.md#backend)
-    * [ElasticSearch](docs/providers/elasticsearch.md#backend)
-    * [Datadog](docs/providers/datadog.md#backend)
-    * [Dynatrace](docs/providers/dynatrace.md#backend)
-    * [Custom](docs/providers/custom.md#backend)
+
+Use `slo-generator compute --help` to list all available arguments.
-- **Exporter configuration**:
-  * `exporters`: A list of exporters to export results to. Specific documentation is available for each supported exporters:
-    * [Cloud Pub/Sub](docs/providers/pubsub.md#exporter) to stream SLO reports.
-    * [BigQuery](docs/providers/bigquery.md#exporter) to export SLO reports to BigQuery for historical analysis and DataStudio reporting.
-    * [Stackdriver Monitoring](docs/providers/stackdriver.md#exporter) to export metrics to Stackdriver Monitoring.
-    * [Prometheus](docs/providers/prometheus.md#exporter) to export metrics to Prometheus.
-    * [Datadog](docs/providers/datadog.md#exporter) to export metrics to Datadog.
-    * [Dynatrace](docs/providers/dynatrace.md#exporter) to export metrics to Dynatrace.
-    * [Custom](docs/providers/custom.md#exporter) to export SLO data or metrics to a custom destination.
+### API usage
-***Note:*** *you can use environment variables in your SLO configs by using `${MY_ENV_VAR}` syntax to avoid having sensitive data in version control. Environment variables will be replaced at run time.*
+On top of the CLI, the `slo-generator` can also be run as an API using the Cloud
+Functions Framework SDK (Flask):
+```
+slo-generator api -c <SHARED_CONFIG_PATH>
+```
+where:
+ * `<SHARED_CONFIG_PATH>` is the [Shared configuration](#shared-configuration) file path or GCS URL.
-==> An example SLO configuration file is available [here](samples/stackdriver/slo_gae_app_availability.yaml).
+Once the API is up and running, you can `HTTP POST` SLO configurations to it.
-#### Error Budget policy
+***Notes:***
+* The API responds to HTTP requests by default. An alternative mode is to
+respond to [`CloudEvents`](https://cloudevents.io/) instead, by setting
+`--signature-type cloudevent`.
-The **Error Budget policy** (JSON or YAML) is a list of multiple error budgets, each one composed of the following fields:
+* Use `--target export` to run the API in export-only mode (formerly `slo-pipeline`).
-* `window`: Rolling time window for this error budget.
-* `alerting_burn_rate_threshold`: Target burnrate threshold over which alerting is needed.
-* `urgent_notification`: boolean whether violating this error budget should trigger a page.
-* `overburned_consequence_message`: message to show when the error budget is above the target.
-* `achieved_consequence_message`: message to show when the error budget is within the target.
+## Configuration
-==> An example Error Budget policy is available [here](samples/error_budget_policy.yaml).
+The `slo-generator` requires two configuration files to run: an **SLO configuration**
+file, describing your SLO, and the **Shared configuration** file (common
+configuration for all SLOs).
+
+### SLO configuration
+
+The **SLO configuration** (JSON or YAML) follows the Kubernetes resource format
+and is composed of the following fields:
+
+* `apiVersion`: `sre.google.com/v2`
+* `kind`: `ServiceLevelObjective`
+* `metadata`:
+  * `name`: [**required**] *string* - Full SLO name (**MUST** be unique).
+  * `labels`: [*optional*] *map* - Metadata labels, **for example**:
+    * `slo_name`: SLO name (e.g `availability`, `latency128ms`, ...).
+    * `service_name`: Monitored service (to group SLOs by service).
+    * `feature_name`: Monitored feature (to group SLOs by feature).
+
+* `spec`:
+  * `description`: [**required**] *string* - Description of this SLO.
+  * `goal`: [**required**] *float* - SLO goal (or target) (**MUST** be between 0 and 1).
+  * `backend`: [**required**] *string* - Backend name (**MUST** exist in the [Shared configuration](#shared-configuration)).
+  * `method`: [**required**] *string* - Backend method used to compute the SLI (e.g. `good_bad_ratio`).
+  * `service_level_indicator`: [**required**] *map* - SLI configuration. The content of this section is
+  specific to each provider, see [`docs/providers`](./docs/providers).
+  * `error_budget_policy`: [*optional*] *string* - Error budget policy name
+  (**MUST** exist in the [Shared configuration](#shared-configuration)). If not specified, defaults to `default`.
+  * `exporters`: [*optional*] *list* - List of exporter names (**MUST** exist in the [Shared configuration](#shared-configuration)).
+
+***Note:*** *you can use environment variables in your SLO configs by using
+`${MY_ENV_VAR}` syntax to avoid having sensitive data in version control.
+Environment variables will be replaced automatically at run time.*
+
+**→ See [example SLO configuration](samples/cloud_monitoring/slo_gae_app_availability.yaml).**
+
+### Shared configuration
+
+The shared configuration (JSON or YAML) configures the `slo-generator` and acts
+as a shared config for all SLO configs. It is composed of the following fields:
+
+* `backends`: [**required**] *map* - Data backend configurations. Each backend
+  alias is defined as a key `<backend_name>/<suffix>`, and a configuration map.
+  ```yaml
+  backends:
+    cloud_monitoring/dev:
+      project_id: proj-cm-dev-a4b7
+    datadog/test:
+      app_key: ${APP_SECRET_KEY}
+      api_key: ${API_SECRET_KEY}
+  ```
+  See the specific provider documentation for detailed configuration:
+  * [`cloud_monitoring`](docs/providers/cloud_monitoring.md#backend)
+  * [`cloud_service_monitoring`](docs/providers/cloud_service_monitoring.md#backend)
+  * [`prometheus`](docs/providers/prometheus.md#backend)
+  * [`elasticsearch`](docs/providers/elasticsearch.md#backend)
+  * [`datadog`](docs/providers/datadog.md#backend)
+  * [`dynatrace`](docs/providers/dynatrace.md#backend)
+  * [`<custom_class>`](docs/providers/custom.md#backend)
+
+* `exporters`: [*optional*] *map* - A map of exporters to export results to. Each exporter
+  is defined as a key formatted as `<exporter_name>/<suffix>`, and a map value detailing the
+  exporter configuration.
+  ```yaml
+  exporters:
+    bigquery/dev:
+      project_id: proj-bq-dev-a4b7
+      dataset_id: my-test-dataset
+      table_id: my-test-table
+    prometheus/test:
+      url: ${PROMETHEUS_URL}
+  ```
+  See the specific provider documentation for detailed configuration:
+  * [`pubsub`](docs/providers/pubsub.md#exporter) to stream SLO reports.
+  * [`bigquery`](docs/providers/bigquery.md#exporter) to export SLO reports to BigQuery for historical analysis and DataStudio reporting.
+  * [`cloud_monitoring`](docs/providers/cloud_monitoring.md#exporter) to export metrics to Cloud Monitoring.
+  * [`prometheus`](docs/providers/prometheus.md#exporter) to export metrics to Prometheus.
+  * [`datadog`](docs/providers/datadog.md#exporter) to export metrics to Datadog.
+  * [`dynatrace`](docs/providers/dynatrace.md#exporter) to export metrics to Dynatrace.
+  * [`<custom_class>`](docs/providers/custom.md#exporter) to export SLO data or metrics to a custom destination.
+
+* `error_budget_policies`: [**required**] *map* - A map of error budget policies.
+  * `<ebp_name>`: Name of the error budget policy.
+    * `steps`: List of error budget policy steps, each containing the following fields:
+      * `name`: Name of the error budget policy step.
+      * `window`: Rolling time window for this error budget, in seconds.
+      * `burn_rate_threshold`: Burn rate threshold over which alerting is needed.
+      * `alert`: Boolean; whether violating this error budget should trigger a page.
+      * `message_alert`: Message to show when the error budget is above the target.
+      * `message_ok`: Message to show when the error budget is within the target.
+
+  ```yaml
+  error_budget_policies:
+    default:
+      steps:
+      - name: 1 hour
+        burn_rate_threshold: 9
+        alert: true
+        message_alert: Page to defend the SLO
+        message_ok: Last hour on track
+        window: 3600
+      - name: 12 hours
+        burn_rate_threshold: 3
+        alert: true
+        message_alert: Page to defend the SLO
+        message_ok: Last 12 hours on track
+        window: 43200
+  ```
+
+**→ See [example Shared configuration](samples/config.yaml).**
 
 ## More documentation
 
 To go further with the SLO Generator, you can read:
 
-* [Build an SLO achievements report with BigQuery and DataStudio](docs/deploy/datastudio_slo_report.md)
-
-* [Deploy the SLO Generator on Google Cloud Functions (Terraform)](docs/deploy/cloudfunctions.md)
-
-* [Deploy the SLO Generator on Kubernetes (Alpha)](docs/deploy/kubernetes.md)
-
-* [Deploy the SLO Generator in a CloudBuild pipeline](docs/deploy/cloudbuild.md)
-
-* [Contribute to the SLO Generator](CONTRIBUTING.md)
+### [Build an SLO achievements report with BigQuery and DataStudio](docs/deploy/datastudio_slo_report.md)
+### [Deploy the SLO Generator in Cloud Run](docs/deploy/cloudrun.md)
+### [Deploy the SLO Generator in Kubernetes (Alpha)](docs/deploy/kubernetes.md)
+### [Deploy the SLO Generator in a CloudBuild pipeline](docs/deploy/cloudbuild.md)
+### [DEPRECATED: Deploy the SLO Generator on Google Cloud Functions (Terraform)](docs/deploy/cloudfunctions.md)
+### [Contribute to the SLO Generator](CONTRIBUTING.md)
diff --git a/docs/deploy/cloudfunctions.md b/docs/deploy/cloudfunctions.md
index 301043d8..72c032c4 100644
--- a/docs/deploy/cloudfunctions.md
+++ b/docs/deploy/cloudfunctions.md
@@ -9,8 +9,8 @@ Other components can be added to make results available to other destinations:
 
-* A **Cloud Function** to export SLO reports (e.g: to BigQuery and Stackdriver Monitoring), running `slo-generator`.
-* A **Stackdriver Monitoring Policy** to alert on high budget Burn Rates.
+* A **Cloud Function** to export SLO reports (e.g. to BigQuery and Cloud Monitoring), running `slo-generator`.
+* A **Cloud Monitoring Policy** to alert on high budget Burn Rates.
 
 Below is a diagram of what this pipeline looks like:
 
@@ -22,9 +22,9 @@ Below is a diagram of what this pipeline looks like:
 
 * **Historical analytics** by analyzing SLO data in Bigquery.
-* **Real-time alerting** by setting up Stackdriver Monitoring alerts based on +* **Real-time alerting** by setting up Cloud Monitoring alerts based on wanted SLOs. * **Real-time, daily, monthly, yearly dashboards** by streaming BigQuery SLO reports to DataStudio (see [here](datastudio_slo_report.md)) and building dashboards. -An example of pipeline automation with Terraform can be found in the corresponding [Terraform module](https://github.com/terraform-google-modules/terraform-google-slo/tree/master/examples/simple_example). +An example of pipeline automation with Terraform can be found in the corresponding [Terraform module](https://github.com/terraform-google-modules/terraform-google-slo/tree/master/examples/slo-generator/simple_example). diff --git a/docs/providers/stackdriver.md b/docs/providers/cloud_monitoring.md similarity index 52% rename from docs/providers/stackdriver.md rename to docs/providers/cloud_monitoring.md index 2e107d6c..26054ad2 100644 --- a/docs/providers/stackdriver.md +++ b/docs/providers/cloud_monitoring.md @@ -1,16 +1,23 @@ -# Stackdriver Monitoring +# Cloud Monitoring ## Backend -Using the `Stackdriver` backend class, you can query any metrics available in -Stackdriver Monitoring to create an SLO. +Using the `cloud_monitoring` backend class, you can query any metrics available +in `Cloud Monitoring` to create an SLO. -The following methods are available to compute SLOs with the `Stackdriver` +```yaml +backends: + cloud_monitoring: + project_id: "${WORKSPACE_PROJECT_ID}" +``` + +The following methods are available to compute SLOs with the `cloud_monitoring` backend: * `good_bad_ratio` for metrics of type `DELTA`, `GAUGE`, or `CUMULATIVE`. * `distribution_cut` for metrics of type `DELTA` and unit `DISTRIBUTION`. + ### Good / bad ratio The `good_bad_ratio` method is used to compute the ratio between two metrics: @@ -23,84 +30,75 @@ SLO. This method is often used for availability SLOs, but can be used for other purposes as well (see examples). -**Config example:** +**SLO config blob:** ```yaml -backend: - class: Stackdriver - project_id: "${STACKDRIVER_HOST_PROJECT_ID}" - method: good_bad_ratio - measurement: - filter_good: > - project="${GAE_PROJECT_ID}" - metric.type="appengine.googleapis.com/http/server/response_count" - metric.labels.response_code >= 200 - metric.labels.response_code < 500 - filter_valid: > - project="${GAE_PROJECT_ID}" - metric.type="appengine.googleapis.com/http/server/response_count" +backend: cloud_monitoring +method: good_bad_ratio +service_level_indicator: + filter_good: > + project="${GAE_PROJECT_ID}" + metric.type="appengine.googleapis.com/http/server/response_count" + metric.labels.response_code >= 200 + metric.labels.response_code < 500 + filter_valid: > + project="${GAE_PROJECT_ID}" + metric.type="appengine.googleapis.com/http/server/response_count" ``` You can also use the `filter_bad` field which identifies bad events instead of the `filter_valid` field which identifies all valid events. -**→ [Full SLO config](../../samples/stackdriver/slo_gae_app_availability.yaml)** +**→ [Full SLO config](../../samples/cloud_monitoring/slo_gae_app_availability.yaml)** ### Distribution cut -The `distribution_cut` method is used for Stackdriver distribution-type metrics, -which are usually used for latency metrics. +The `distribution_cut` method is used for Cloud Monitoring distribution-type +metrics, which are usually used for latency metrics. 
 A distribution metric records the **statistical distribution of the extracted
 values** in **histogram buckets**. The extracted values are not recorded
 individually, but their distribution across the configured buckets are
 recorded, along with the `count`, `mean`, and `sum` of squared deviation of the
 values.
 
-In `Stackdriver Monitoring`, there are three different ways to specify bucket
+In Cloud Monitoring, there are three different ways to specify bucket
 boundaries:
 
 * **Linear:** Every bucket has the same width.
 * **Exponential:** Bucket widths increases for higher values, using an
 exponential growth factor.
 * **Explicit:** Bucket boundaries are set for each bucket using a bounds array.
 
-**Config example:**
+**SLO config blob:**
 
 ```yaml
-backend:
-  class: Stackdriver
-  project_id: ${STACKDRIVER_HOST_PROJECT_ID}
-  method: exponential_distribution_cut
-  measurement:
-    filter_valid: >
-      project=${GAE_PROJECT_ID} AND
-      metric.type=appengine.googleapis.com/http/server/response_latencies AND
-      metric.labels.response_code >= 200 AND
-      metric.labels.response_code < 500
-    good_below_threshold: true
-    threshold_bucket: 19
+backend: cloud_monitoring
+method: distribution_cut
+service_level_indicator:
+  filter_valid: >
+    project=${GAE_PROJECT_ID} AND
+    metric.type=appengine.googleapis.com/http/server/response_latencies AND
+    metric.labels.response_code >= 200 AND
+    metric.labels.response_code < 500
+  good_below_threshold: true
+  threshold_bucket: 19
 ```
 
-**→ [Full SLO config](../../samples/stackdriver/slo_gae_app_latency.yaml)**
+**→ [Full SLO config](../../samples/cloud_monitoring/slo_gae_app_latency.yaml)**
 
 The `threshold_bucket` number to reach our 724ms target latency will depend on
 how the buckets boundaries are set. Learn how to [inspect your distribution metrics](https://cloud.google.com/logging/docs/logs-based-metrics/distribution-metrics#inspecting_distribution_metrics) to figure out the bucketization.
 
 ## Exporter
 
-The `Stackdriver` exporter allows to export SLO metrics to Cloud Monitoring API.
-
-**Example config:**
-
-The following configuration will create the custom metric
-`error_budget_burn_rate` in `Stackdriver Monitoring`:
+The `cloud_monitoring` exporter allows exporting SLO metrics to the Cloud Monitoring API.
 
 ```yaml
 exporters:
-  - class: Stackdriver
-    project_id: "${STACKDRIVER_HOST_PROJECT_ID}"
+  cloud_monitoring:
+    project_id: "${WORKSPACE_PROJECT_ID}"
 ```
 
 Optional fields:
-  * `metrics`: List of metrics to export ([see docs](../shared/metrics.md)). Defaults to [`custom:error_budget_burn_rate`, `custom:sli_measurement`].
+  * `metrics`: [*optional*] `list` - List of metrics to export ([see docs](../shared/metrics.md)).
 
-**→ [Full SLO config](../../samples/stackdriver/slo_lb_request_availability.yaml)**
+**→ [Full SLO config](../../samples/cloud_monitoring/slo_lb_request_availability.yaml)**
 
 ## Alerting
 
@@ -109,6 +107,7 @@ being able to alert on them is simply useless.
 
 **Too many alerts** can be daunting, and page your SRE engineers for no valid
 reasons.
+
 **Too little alerts** can mean that your applications are not monitored at all
 (no application have 100% reliability).
 
@@ -117,24 +116,24 @@ reduce the noise and page only when it's needed.
 
 **Example:**
 
-We will define a `Stackdriver Monitoring` alert that we will **filter out on the
+We will define a `Cloud Monitoring` alert that we will **filter out on the
 corresponding error budget step**.
-Consider the following error budget policy config: +Consider the following error budget policy step config: ```yaml -- error_budget_policy_step_name: 1 hour - measurement_window_seconds: 3600 - alerting_burn_rate_threshold: 9 - urgent_notification: true - overburned_consequence_message: Page the SRE team to defend the SLO - achieved_consequence_message: Last hour on track +- name: 1 hour + window: 3600 + burn_rate_threshold: 9 + alert: true + message_alert: Page the SRE team to defend the SLO + message_ok: Last hour on track ``` -Using Stackdriver UI, let's set up an alert when our error budget burn rate is -burning **9X faster** than it should in the last hour: +Using Cloud Monitoring UI, let's set up an alert when our error budget burn rate +is burning **9X faster** than it should in the last hour: -* Open `Stackdriver Monitoring` and click on `Alerting > Create Policy` +* Open `Cloud Monitoring` and click on `Alerting > Create Policy` * Fill the alert name and click on `Add Condition`. @@ -163,5 +162,5 @@ differentiate the alert messages. ## Examples -Complete SLO samples using `Stackdriver` are available in -[samples/stackdriver](../../samples/stackdriver). Check them out ! +Complete SLO samples using Cloud Monitoring are available in +[samples/cloud_monitoring](../../samples/cloud_monitoring). Check them out ! diff --git a/docs/providers/stackdriver_service_monitoring.md b/docs/providers/cloud_service_monitoring.md similarity index 51% rename from docs/providers/stackdriver_service_monitoring.md rename to docs/providers/cloud_service_monitoring.md index b72d71c6..bea7d835 100644 --- a/docs/providers/stackdriver_service_monitoring.md +++ b/docs/providers/cloud_service_monitoring.md @@ -1,15 +1,21 @@ -# Stackdriver Service Monitoring +# Cloud Service Monitoring ## Backend -Using the `StackdriverServiceMonitoring` backend class, you can use the -`Stackdriver Service Monitoring API` to manage your SLOs. +Using the `cloud_service_monitoring` backend, you can use the +`Cloud Service Monitoring API` to manage your SLOs. -SLOs are created from standard metrics available in Stackdriver Monitoring and -the data is stored in `Stackdriver Service Monitoring API` (see +```yaml +backends: + cloud_service_monitoring: + project_id: "${WORKSPACE_PROJECT_ID}" +``` + +SLOs are created from standard metrics available in Cloud Monitoring and +the data is stored in `Cloud Service Monitoring API` (see [docs](https://cloud.google.com/monitoring/service-monitoring/using-api)). -The following methods are available to compute SLOs with the `Stackdriver` +The following methods are available to compute SLOs with the `cloud_service_monitoring` backend: * `basic` to create standard SLOs for Google App Engine, Google Kubernetes @@ -17,91 +23,84 @@ Engine, and Cloud Endpoints. * `good_bad_ratio` for metrics of type `DELTA` or `CUMULATIVE`. * `distribution_cut` for metrics of type `DELTA` and unit `DISTRIBUTION`. + ### Basic -The `basic` method is used to let the `Stackdriver Service Monitoring API` +The `basic` method is used to let the `Cloud Service Monitoring API` automatically generate standardized SLOs for the following GCP services: * **Google App Engine** * **Google Kubernetes Engine** (with Istio) * **Google Cloud Endpoints** -The SLO configuration uses Stackdriver +The SLO configuration uses Cloud Monitoring [GCP metrics](https://cloud.google.com/monitoring/api/metrics_gcp) and only requires minimal configuration compared to custom SLOs. 
**Example config (App Engine availability):** ```yaml -backend: - class: StackdriverServiceMonitoring - method: basic - project_id: ${STACKDRIVER_HOST_PROJECT_ID} - measurement: - app_engine: - project_id: ${GAE_PROJECT_ID} - module_id: ${GAE_MODULE_ID} - availability: {} +backend: cloud_service_monitoring +method: basic +service_level_indicator: + app_engine: + project_id: ${GAE_PROJECT_ID} + module_id: ${GAE_MODULE_ID} + availability: {} ``` For details on filling the `app_engine` fields, see [AppEngine](https://cloud.google.com/monitoring/api/ref_v3/rest/v3/services#appengine) spec. -**→ [Full SLO config](../../samples/stackdriver_service_monitoring/slo_gae_app_availability_basic.yaml)** +**→ [Full SLO config](../../samples/cloud_service_monitoring/slo_gae_app_availability_basic.yaml)** **Example config (Cloud Endpoint latency):** ```yaml -backend: - class: StackdriverServiceMonitoring - method: basic - project_id: ${STACKDRIVER_HOST_PROJECT_ID} - measurement: - cloud_endpoints: - service: ${ENDPOINT_URL} - latency: - threshold: 724 # ms +backend: cloud_service_monitoring +method: basic +service_level_indicator: + cloud_endpoints: + service_name: ${ENDPOINT_URL} + latency: + threshold: 724 # ms ``` For details on filling the `cloud_endpoints` fields, see [CloudEndpoint](https://cloud.google.com/monitoring/api/ref_v3/rest/v3/services#cloudendpoints) spec. -**Example config (Istio service latency) [NOT YET RELEASED]:** +**Example config (Istio service latency):** ```yaml -backend: - class: StackdriverServiceMonitoring - method: basic - project_id: ${STACKDRIVER_HOST_PROJECT_ID} - measurement: - mesh_istio: - mesh_uid: ${GKE_MESH_UID} - service_namespace: ${GKE_SERVICE_NAMESPACE} - service_name: ${GKE_SERVICE_NAME} - latency: - threshold: 500 # ms +backend: cloud_service_monitoring +method: basic +service_level_indicator: + mesh_istio: + mesh_uid: ${GKE_MESH_UID} + service_namespace: ${GKE_SERVICE_NAMESPACE} + service_name: ${GKE_SERVICE_NAME} + latency: + threshold: 500 # ms ``` For details on filling the `mesh_istio` fields, see [MeshIstio](https://cloud.google.com/monitoring/api/ref_v3/rest/v3/services#meshistio) spec. -**→ [Full SLO config](../../samples/stackdriver_service_monitoring/slo_gke_app_latency_basic.yaml)** +**→ [Full SLO config](../../samples/cloud_service_monitoring/slo_gke_app_latency_basic.yaml)** -**Example config (Istio service latency) [DEPRECATED SOON]:** +**Example config (Istio service latency) [DEPRECATED]:** ```yaml -backend: - class: StackdriverServiceMonitoring - method: basic - project_id: ${STACKDRIVER_HOST_PROJECT_ID} - measurement: - cluster_istio: - project_id: ${GKE_PROJECT_ID} - location: ${GKE_LOCATION} - cluster_name: ${GKE_CLUSTER_NAME} - service_namespace: ${GKE_SERVICE_NAMESPACE} - service_name: ${GKE_SERVICE_NAME} - latency: - threshold: 500 # ms +backend: cloud_service_monitoring +method: basic +service_level_indicator: + cluster_istio: + project_id: ${GKE_PROJECT_ID} + location: ${GKE_LOCATION} + cluster_name: ${GKE_CLUSTER_NAME} + service_namespace: ${GKE_SERVICE_NAMESPACE} + service_name: ${GKE_SERVICE_NAME} + latency: + threshold: 500 # ms ``` For details on filling the `cluster_istio` fields, see [ClusterIstio](https://cloud.google.com/monitoring/api/ref_v3/rest/v3/services#clusteristio) spec. 
-**→ [Full SLO config](../../samples/stackdriver_service_monitoring/slo_gke_app_latency_basic_deprecated.yaml)**
+**→ [Full SLO config](../../samples/cloud_service_monitoring/slo_gke_app_latency_basic_deprecated.yaml)**
 
 ### Good / bad ratio
 
@@ -118,30 +117,28 @@ purposes as well (see examples).
 
 **Example config:**
 
 ```yaml
-backend:
-  class: StackdriverServiceMonitoring
-  project_id: ${STACKDRIVER_HOST_PROJECT_ID}
-  method: good_bad_ratio
-  measurement:
-    filter_good: >
-      project="${GAE_PROJECT_ID}"
-      metric.type="appengine.googleapis.com/http/server/response_count"
-      resource.type="gae_app"
-      metric.labels.response_code >= 200
-      metric.labels.response_code < 500
-    filter_valid: >
-      project="${GAE_PROJECT_ID}"
-      metric.type="appengine.googleapis.com/http/server/response_count"
+backend: cloud_service_monitoring
+method: good_bad_ratio
+service_level_indicator:
+  filter_good: >
+    project="${GAE_PROJECT_ID}"
+    metric.type="appengine.googleapis.com/http/server/response_count"
+    resource.type="gae_app"
+    metric.labels.response_code >= 200
+    metric.labels.response_code < 500
+  filter_valid: >
+    project="${GAE_PROJECT_ID}"
+    metric.type="appengine.googleapis.com/http/server/response_count"
 ```
 
 You can also use the `filter_bad` field which identifies bad events instead of
 the `filter_valid` field which identifies all valid events.
 
-**→ [Full SLO config](../../samples/stackdriver_service_monitoring/slo_gae_app_availability.yaml)**
+**→ [Full SLO config](../../samples/cloud_service_monitoring/slo_gae_app_availability.yaml)**
 
 ## Distribution cut
 
-The `distribution_cut` method is used for Stackdriver distribution-type metrics,
+The `distribution_cut` method is used for Cloud Monitoring distribution-type metrics,
 which are usually used for latency metrics.
 
 A distribution metric records the **statistical distribution of the extracted
@@ -152,30 +149,28 @@ along with the `count`, `mean`, and `sum` of squared deviation of the values.
 
 **Example config:**
 
 ```yaml
-backend:
-  class: StackdriverServiceMonitoring
-  project_id: ${STACKDRIVER_HOST_PROJECT_ID}
-  method: distribution_cut
-  measurement:
-    filter_valid: >
-      project=${GAE_PROJECT_ID}
-      metric.type=appengine.googleapis.com/http/server/response_latencies
-      metric.labels.response_code >= 200
-      metric.labels.response_code < 500
-    range_min: 0
-    range_max: 724 # ms
+backend: cloud_service_monitoring
+method: distribution_cut
+service_level_indicator:
+  filter_valid: >
+    project=${GAE_PROJECT_ID}
+    metric.type=appengine.googleapis.com/http/server/response_latencies
+    metric.labels.response_code >= 200
+    metric.labels.response_code < 500
+  range_min: 0
+  range_max: 724 # ms
 ```
 
 The `range_min` and `range_max` are used to specify the latency range that we
 consider 'good'.
 
-**→ [Full SLO config](../../samples/stackdriver_service_monitoring/slo_gae_app_latency.yaml)**
+**→ [Full SLO config](../../samples/cloud_service_monitoring/slo_gae_app_latency.yaml)**
 
 ## Service Monitoring API considerations
 
 ### Tracking objects
 
-Since `Stackdriver Service Monitoring API` persists `Service` and
+Since the `Cloud Service Monitoring API` persists `Service` and
 `ServiceLevelObjective` objects, we need ways to keep our local SLO YAML
 configuration synced with the remote objects.
@@ -214,7 +209,7 @@ unique id to an auto-imported `Service`:
 * **Cluster Istio [DEPRECATED SOON]:**
   ```
-  ist:{project_id}-zone-{location}-{cluster_name}-{service_namespace}-{service_name}
+  ist:{project_id}-{suffix}-{location}-{cluster_name}-{service_namespace}-{service_name}
   ```
   → *Make sure that the `cluster_istio` block in your config has the correct
   fields corresponding to your Istio service.*
 
@@ -225,19 +220,19 @@ random id.
 
 **Custom**
 
 Custom services are the ones you create yourself using the
-`Service Monitoring API` and the `slo-generator`.
+`Cloud Service Monitoring API` and the `slo-generator`.
 
 The following conventions are used by the `slo-generator` to give a unique id
 to a custom `Service` and `Service Level Objective` objects:
 
-* `service_id = ${service_name}-${feature_name}`
+* `service_id = ${metadata.service_name}-${metadata.feature_name}`
 
-* `slo_id = ${service_name}-${feature_name}-${slo_name}-${window}`
+* `slo_id = ${metadata.service_name}-${metadata.feature_name}-${metadata.slo_name}-${window}`
 
 To keep track of those, **do not update any of the following fields** in your
 configs:
 
-  * `service_name`, `feature_name` and `slo_name` in the SLO config.
+  * `metadata.service_name`, `metadata.feature_name` and `metadata.slo_name` in the SLO config.
 
   * `window` in the Error Budget Policy.
 
@@ -246,15 +241,12 @@ If you need to make updates to any of those fields, first run the
 [#deleting-objects](#deleting-objects)), then re-run normally.
 
 To import existing custom `Service` objects, find out your service id from
-the API and fill the `service_id` in the SLO configuration.
-
-You cannot import an existing custom `ServiceLevelObjective` unless it complies
-to the naming convention.
+the API and fill the `service_id` in the `service_level_indicator` configuration.
 
 ### Deleting objects
 
-To delete an SLO object in `Stackdriver Monitoring API` using the
-`StackdriverServiceMonitoringBackend` class, run the `slo-generator` with the
+To delete an SLO object in the `Cloud Monitoring API` using the
+`cloud_service_monitoring` backend, run the `slo-generator` with the
 `-d` (or `--delete`) flag:
 
 ```
 slo-generator -f <SLO_CONFIG_PATH> -b <ERROR_BUDGET_POLICY> --delete
 ```
 
 ## Alerting
 
-See the Stackdriver Service Monitoring [docs](https://cloud.google.com/monitoring/service-monitoring/alerting-on-budget-burn-rate)
+See the Cloud Service Monitoring [docs](https://cloud.google.com/monitoring/service-monitoring/alerting-on-budget-burn-rate)
 for instructions on alerting.
 
 ### Examples
 
-Complete SLO samples using `Stackdriver Service Monitoring` are available in [ samples/stackdriver_service_monitoring](../../samples/stackdriver_service_monitoring).
+Complete SLO samples using `Cloud Service Monitoring` are available in [samples/cloud_service_monitoring](../../samples/cloud_service_monitoring).
 Check them out !
diff --git a/docs/providers/custom.md b/docs/providers/custom.md
index 8da87640..bb180a58 100644
--- a/docs/providers/custom.md
+++ b/docs/providers/custom.md
@@ -41,14 +41,20 @@ class CustomBackend:
 
 In order to call the `good_bad_ratio` method in the custom backend above, the
-`backend` block would look like this:
+`backends` block would look like this:
 
 ```yaml
-backend:
-  class: custom.custom_backend.CustomBackend # relative Python path to the backend. Make sure __init__.py is created in subdirectories for this to work.
-  method: good_bad_ratio # name of the method to run
-  arg_1: test_arg_1 # passed to kwargs in __init__
-  arg_2: test_arg_2 # passed to kwargs in __init__
+backends:
+  custom.custom_backend.CustomBackend: # relative Python path to the backend. Make sure __init__.py is created in subdirectories for this to work.
+    arg_1: test_arg_1 # passed to kwargs in __init__
+    arg_2: test_arg_2 # passed to kwargs in __init__
+```
+
+The `spec` section in the SLO config would look like:
+```yaml
+backend: custom.custom_backend.CustomBackend
+method: good_bad_ratio # name of the method to run
+service_level_indicator: {}
 ```
 
 **→ [Full SLO config](../../samples/custom/slo_custom_app_availability_ratio.yaml)**
 
@@ -92,10 +98,17 @@ class CustomExporter:
 
-and the corresponding `exporters` section in your SLO config:
+The `exporters` block in the shared config would look like this:
+
 ```yaml
 exporters:
-- class: custom.custom_exporter.CustomExporter
-  arg_1: test
+  custom.custom_exporter.CustomExporter: # relative Python path to the exporter. Make sure __init__.py is created in subdirectories for this to work.
+    arg_1: test_arg_1 # passed to kwargs in __init__
+```
+
+The `spec` section in the SLO config would look like:
+```yaml
+exporters: [custom.custom_exporter.CustomExporter]
 ```
 
 ### Metrics
 
@@ -103,7 +116,9 @@ exporters:
 
 A metrics exporter:
 * must inherit from `slo_generator.exporters.base.MetricsExporter`.
-* must implement the `export_metric` method which exports **one** metric as a dict like:
+* must implement the `export_metric` method, which exports **one** metric.
+The `export_metric` function takes a metric dict as input, such as:
+
 ```py
 {
     "name": <metric_name>,
@@ -129,13 +144,13 @@ class CustomExporter(MetricsExporter): # derive from base class
     """Custom exporter."""
 
     def export_metric(self, data):
-        """Export data to Stackdriver Monitoring.
+        """Export data to Custom Monitoring API.
 
         Args:
             data (dict): Metric data.
 
         Returns:
-            object: Stackdriver Monitoring API result.
+            object: Custom Monitoring API result.
         """
         # implement how to export 1 metric here...
         return {
@@ -144,11 +159,12 @@ class CustomExporter(MetricsExporter): # derive from base class
 }
 ```
 
-and the exporters section in your SLO config:
+The `exporters` block in the shared config would look like this:
+
 ```yaml
 exporters:
-  - class: custom.custom_exporter.CustomExporter
-    arg_1: test
+  custom.custom_exporter.CustomExporter: # relative Python path to the exporter. Make sure __init__.py is created in subdirectories for this to work.
+    arg_1: test_arg_1 # passed to kwargs in __init__
 ```
 
 **Note:**
diff --git a/docs/providers/datadog.md b/docs/providers/datadog.md
index fb6b2128..f6e13f61 100644
--- a/docs/providers/datadog.md
+++ b/docs/providers/datadog.md
@@ -2,16 +2,28 @@
 
 ## Backend
 
-Using the `Datadog` backend class, you can query any metrics available in
+Using the `datadog` backend class, you can query any metrics available in
 Datadog to create an SLO.
 
+```yaml
+backends:
+  datadog:
+    api_key: ${DATADOG_API_KEY}
+    app_key: ${DATADOG_APP_KEY}
+```
+
-The following methods are available to compute SLOs with the `Datadog`
+The following methods are available to compute SLOs with the `datadog`
 backend:
 
 * `good_bad_ratio` for computing good / bad metrics ratios.
 * `query_sli` for computing SLIs directly with Datadog.
 * `query_slo` for getting SLO value from Datadog SLO endpoint.
+Optional arguments to configure Datadog are documented in the Datadog
+`initialize` method [here](https://github.com/DataDog/datadogpy/blob/058114cc3d65483466684c96a5c23e36c3aa052e/datadog/__init__.py#L33).
+You can pass them in the `backend` section, such as specifying
+`api_host: api.datadoghq.eu` in order to use the EU site.
+
 ### Good / bad ratio
 
 The `good_bad_ratio` method is used to compute the ratio between two metrics:
 
@@ -27,45 +39,29 @@ purposes as well (see examples).
 
 **Config example:**
 
 ```yaml
-backend:
-  class: Datadog
-  method: good_bad_ratio
-  api_key: ${DATADOG_API_KEY}
-  app_key: ${DATADOG_APP_KEY}
-  measurement:
-    filter_good: app.requests.count{http.path:/, http.status_code_class:2xx}
-    filter_valid: app.requests.count{http.path:/}
+backend: datadog
+method: good_bad_ratio
+service_level_indicator:
+  filter_good: app.requests.count{http.path:/, http.status_code_class:2xx}
+  filter_valid: app.requests.count{http.path:/}
 ```
 
 **→ [Full SLO config](../../samples/datadog/slo_dd_app_availability_ratio.yaml)**
 
-Optional arguments to configure Datadog are documented in the Datadog
-`initialize` method [here](https://github.com/DataDog/datadogpy/blob/058114cc3d65483466684c96a5c23e36c3aa052e/datadog/__init__.py#L33).
-You can pass them in the `backend` section, such as specifying
-`api_host: api.datadoghq.eu` in order to use the EU site.
-
 ### Query SLI
 
 The `query_sli` method is used to directly query the needed SLI with Datadog:
 Datadog's query language is powerful enough that it can do ratios natively.
 
-This method makes it more flexible to input any `Datadog` SLI computation and
+This method makes it more flexible to input any `datadog` SLI computation and
 eventually reduces the number of queries made to Datadog.
 
 ```yaml
-backend:
-  class: Datadog
-  method: query_sli
-  api_key: ${DATADOG_API_KEY}
-  app_key: ${DATADOG_APP_KEY}
-  measurement:
-    expression: sum:app.requests.count{http.path:/, http.status_code_class:2xx} / sum:app.requests.count{http.path:/}
+backend: datadog
+method: query_sli
+service_level_indicator:
+  expression: sum:app.requests.count{http.path:/, http.status_code_class:2xx} / sum:app.requests.count{http.path:/}
 ```
 
-Optional arguments to configure Datadog are documented in the Datadog
-`initialize` method [here](https://github.com/DataDog/datadogpy/blob/058114cc3d65483466684c96a5c23e36c3aa052e/datadog/__init__.py#L33).
-You can pass them in the `backend` section, such as specifying
-`api_host: api.datadoghq.eu` in order to use the EU site.
-
 **→ [Full SLO config](../../samples/datadog/slo_dd_app_availability_query_sli.yaml)**
 
 ### Query SLO
 
@@ -73,46 +69,43 @@ You can pass them in the `backend` section, such as specifying
 
 The `query_slo` method is used to directly query the needed SLO with Datadog:
 indeed, Datadog has SLO objects that you can directly refer to in your config
 by inputing their `slo_id`.
 
-This method makes it more flexible to input any `Datadog` SLI computation and
+This method makes it more flexible to input any `datadog` SLI computation and
 eventually reduces the number of queries made to Datadog.
 
-To query the value from Datadog SLO, simply add a `slo_id` field in the
-`measurement` section:
+To query the value from Datadog SLO, simply add a `slo_id` field in the
+`service_level_indicator` section:
 
 ```yaml
-...
-backend:
-  class: Datadog
-  method: query_slo
-  api_key: ${DATADOG_API_KEY}
-  app_key: ${DATADOG_APP_KEY}
-  measurement:
-    slo_id: ${DATADOG_SLO_ID}
+backend: datadog
+method: query_slo
+service_level_indicator:
+  slo_id: ${DATADOG_SLO_ID}
 ```
 
 **→ [Full SLO config](../../samples/datadog/slo_dd_app_availability_query_slo.yaml)**
 
 ### Examples
 
-Complete SLO samples using `Datadog` are available in
+Complete SLO samples using `datadog` are available in
 [samples/datadog](../../samples/datadog). Check them out!
 
 ## Exporter
 
-The `Datadog` exporter allows to export SLO metrics to the Datadog API.
-
-**Example config:**
+The `datadog` exporter allows exporting SLO metrics to the Datadog API.
 
 ```yaml
 exporters:
-  - class: Datadog
+  datadog:
     api_key: ${DATADOG_API_KEY}
     app_key: ${DATADOG_APP_KEY}
 ```
 
+Optional arguments to configure Datadog are documented in the Datadog
+`initialize` method [here](https://github.com/DataDog/datadogpy/blob/058114cc3d65483466684c96a5c23e36c3aa052e/datadog/__init__.py#L33).
+You can pass them in the `exporters` section, such as specifying
+`api_host: api.datadoghq.eu` in order to use the EU site.
 
 Optional fields:
-  * `metrics`: List of metrics to export ([see docs](../shared/metrics.md)). Defaults to [`custom:error_budget_burn_rate`, `custom:sli_measurement`].
-
+  * `metrics`: [*optional*] `list` - List of metrics to export ([see docs](../shared/metrics.md)).
 
 **→ [Full SLO config](../../samples/datadog/slo_dd_app_availability_ratio.yaml)**
diff --git a/docs/providers/dynatrace.md b/docs/providers/dynatrace.md
index c2fcbbac..90f1a3db 100644
--- a/docs/providers/dynatrace.md
+++ b/docs/providers/dynatrace.md
@@ -2,10 +2,17 @@
 
 ## Backend
 
-Using the `Dynatrace` backend class, you can query any metrics available in
+Using the `dynatrace` backend class, you can query any metrics available in
 Dynatrace to create an SLO.
 
+```yaml
+backends:
+  dynatrace:
+    api_token: ${DYNATRACE_API_TOKEN}
+    api_url: ${DYNATRACE_API_URL}
+```
+
-The following methods are available to compute SLOs with the `Dynatrace`
+The following methods are available to compute SLOs with the `dynatrace`
 backend:
 
 * `good_bad_ratio` for computing good / bad metrics ratios.
@@ -25,16 +32,13 @@ purposes as well (see examples).
 
 **Config example:**
 
 ```yaml
-backend:
-  class: Dynatrace
-  method: good_bad_ratio
-  api_token: ${DYNATRACE_API_TOKEN}
-  api_url: ${DYNATRACE_API_URL}
-  measurement:
-    query_good:
-      metric_selector: ext:app.request_count:filter(and(eq(app,test_app),eq(env,prod),eq(status_code_class,2xx)))
-      entity_selector: type(HOST)
-    query_valid:
+backend: dynatrace
+method: good_bad_ratio
+service_level_indicator:
+  query_good:
+    metric_selector: ext:app.request_count:filter(and(eq(app,test_app),eq(env,prod),eq(status_code_class,2xx)))
+    entity_selector: type(HOST)
+  query_valid:
     metric_selector: ext:app.request_count:filter(and(eq(app,test_app),eq(env,prod)))
     entity_selector: type(HOST)
 ```
@@ -52,16 +56,13 @@ This method can be used for latency SLOs, by defining a latency threshold.
 **Config example:**
 
 ```yaml
-backend:
-  class: Dynatrace
-  method: threshold
-  api_token: ${DYNATRACE_API_TOKEN}
-  api_url: ${DYNATRACE_API_URL}
-  measurement:
-    query_valid:
-      metric_selector: ext:app.request_latency:filter(and(eq(app,test_app),eq(env,prod),eq(status_code_class,2xx)))
-      entity_selector: type(HOST)
-    threshold: 40000 # us
+backend: dynatrace
+method: threshold
+service_level_indicator:
+  query_valid:
+    metric_selector: ext:app.request_latency:filter(and(eq(app,test_app),eq(env,prod),eq(status_code_class,2xx)))
+    entity_selector: type(HOST)
+  threshold: 40000 # us
 ```
 
 **→ [Full SLO config](../../samples/dynatrace/slo_dt_app_latency_threshold.yaml)**
 
 Optional fields:
 
@@ -71,24 +72,22 @@
 
 ### Examples
 
-Complete SLO samples using `Dynatrace` are available in
+Complete SLO samples using `dynatrace` are available in
 [samples/dynatrace](../../samples/dynatrace). Check them out!
 
 ## Exporter
 
-The `Dynatrace` exporter allows to export SLO metrics to Dynatrace API.
-
-**Example config:**
+The `dynatrace` exporter allows exporting SLO metrics to the Dynatrace API.
 
 ```yaml
 exporters:
-  - class: Dynatrace
-    api_token: ${DYNATRACE_API_TOKEN}
-    api_url: ${DYNATRACE_API_URL}
+  dynatrace:
+    api_token: ${DYNATRACE_API_TOKEN}
+    api_url: ${DYNATRACE_API_URL}
 ```
 
 Optional fields:
   * `metrics`: List of metrics to export ([see docs](../shared/metrics.md)). Defaults to [`custom:error_budget_burn_rate`, `custom:sli_measurement`].
 
 **→ [Full SLO config](../../samples/dynatrace/slo_dt_app_availability_ratio.yaml)**
diff --git a/docs/providers/elasticsearch.md b/docs/providers/elasticsearch.md
index 5e160bc3..c30b1dbb 100644
--- a/docs/providers/elasticsearch.md
+++ b/docs/providers/elasticsearch.md
@@ -2,10 +2,17 @@
 
 ## Backend
 
-Using the `Elasticsearch` backend class, you can query any metrics available in
+Using the `elasticsearch` backend class, you can query any metrics available in
 Elasticsearch to create an SLO.
 
+```yaml
+backends:
+  elasticsearch:
+    url: ${ELASTICSEARCH_URL}
+```
+
-The following methods are available to compute SLOs with the `Elasticsearch`
+The following methods are available to compute SLOs with the `elasticsearch`
 backend:
 
 * `good_bad_ratio` for computing good / bad metrics ratios.
@@ -81,5 +88,5 @@ look like:
 
 ### Examples
 
-Complete SLO samples using the `Elasticsearch` backend are available in
+Complete SLO samples using the `elasticsearch` backend are available in
 [samples/elasticsearch](../../samples/elasticsearch). Check them out !
diff --git a/docs/providers/prometheus.md b/docs/providers/prometheus.md
index 1ceb4a62..1120b777 100644
--- a/docs/providers/prometheus.md
+++ b/docs/providers/prometheus.md
@@ -2,10 +2,22 @@
 
 ## Backend
 
-Using the `Prometheus` backend class, you can query any metrics available in
+Using the `prometheus` backend class, you can query any metrics available in
 Prometheus to create an SLO.
 
+```yaml
+backends:
+  prometheus:
+    url: http://localhost:9090
+    # headers:
+    #   Content-Type: application/json
+    #   Authorization: Basic b2s6cGFzcW==
+```
+
+Optional fields:
+* `headers` allows specifying Basic Authentication credentials if needed.
+
-The following methods are available to compute SLOs with the `Prometheus`
+The following methods are available to compute SLOs with the `prometheus`
 backend:
 
 * `good_bad_ratio` for computing good / bad metrics ratios.
@@ -26,24 +38,16 @@ purposes as well (see examples).
 
 **Config example:**
 
 ```yaml
-backend:
-  class: Prometheus
-  method: good_bad_ratio
-  url: http://localhost:9090
-  # headers:
-  #   Content-Type: application/json
-  #   Authorization: Basic b2s6cGFzcW==
-  measurement:
-    filter_good: http_requests_total{handler="/metrics", code=~"2.."}[window]
-    filter_valid: http_requests_total{handler="/metrics"}[window]
-    # operators: ['sum', 'rate']
+backend: prometheus
+method: good_bad_ratio
+service_level_indicator:
+  filter_good: http_requests_total{handler="/metrics", code=~"2.."}[window]
+  filter_valid: http_requests_total{handler="/metrics"}[window]
+  # operators: ['sum', 'rate']
 ```
 
 * The `window` placeholder is needed in the query and will be replaced by the
 corresponding `window` field set in each step of the Error Budget Policy.
 
-* The `headers` section (commented) allows to specify Basic Authentication
-credentials if needed.
-
 * The `operators` section defines which PromQL functions to apply on the
 timeseries. The default is to compute `sum(increase([METRIC_NAME][window]))`
 to get an accurate count of good and bad events. Be aware that changing will likely
@@ -64,26 +68,20 @@ eventually reduces the number of queries made to Prometheus.
 
 See Bitnami's [article](https://engineering.bitnami.com/articles/implementing-slos-using-prometheus.html)
 on engineering SLOs with Prometheus.
 
+**Config example:**
+
 ```yaml
-backend:
-  class: Prometheus
-  method: query_sli
-  url: ${PROMETHEUS_URL}
-  # headers:
-  #   Content-Type: application/json
-  #   Authorization: Basic b2s6cGFzcW==
-  measurement:
-    expression: >
-      sum(rate(http_requests_total{handler="/metrics", code=~"2.."}[window]))
-      /
-      sum(rate(http_requests_total{handler="/metrics"}[window]))
+backend: prometheus
+method: query_sli
+service_level_indicator:
+  expression: >
+    sum(rate(http_requests_total{handler="/metrics", code=~"2.."}[window]))
+    /
+    sum(rate(http_requests_total{handler="/metrics"}[window]))
 ```
 
 * The `window` placeholder is needed in the query and will be replaced by the
 corresponding `window` field set in each step of the Error Budget Policy.
 
-* The `headers` section (commented) allows to specify Basic Authentication
-credentials if needed.
-
 **→ [Full SLO config (availability)](../../samples/prometheus/slo_prom_metrics_availability_query_sli.yaml)**
 
 **→ [Full SLO config (latency)](../../samples/prometheus/slo_prom_metrics_latency_query_sli.yaml)**
 
@@ -121,13 +119,11 @@ expressing it, as shown in the config example below.
 
 **Config example:**
 
 ```yaml
-backend:
-  class: Prometheus
-  project_id: ${STACKDRIVER_HOST_PROJECT_ID}
-  method: distribution_cut
-  measurement:
-    expression: http_requests_duration_bucket{path='/', code=~"2.."}
-    threshold_bucket: 0.25 # corresponds to 'le' attribute in Prometheus histograms
+backend: prometheus
+method: distribution_cut
+service_level_indicator:
+  expression: http_requests_duration_bucket{path='/', code=~"2.."}
+  threshold_bucket: 0.25 # corresponds to 'le' attribute in Prometheus histograms
 ```
 
 **→ [Full SLO config](../../samples/prometheus/slo_prom_metrics_latency_distribution_cut.yaml)**
 
@@ -137,31 +133,29 @@ set for your metric. Learn more in the [Prometheus docs](https://prometheus.io/d
 
 ## Exporter
 
-The `Prometheus` exporter allows to export SLO metrics to the
+The `prometheus` exporter allows exporting SLO metrics to the
 [Prometheus Pushgateway](https://prometheus.io/docs/practices/pushing/) which
 needs to be running.
-`Prometheus` needs to be setup to **scrape metrics from `Pushgateway`** (see
- [documentation](https://github.com/prometheus/pushgateway) for more details).
-
-**Example config:**
-
 ```yaml
 exporters:
-  - class: Prometheus
-    url: ${PUSHGATEWAY_URL}
+  prometheus:
+    url: ${PUSHGATEWAY_URL}
 ```
 
 Optional fields:
   * `metrics`: List of metrics to export ([see docs](../shared/metrics.md)). Defaults to [`error_budget_burn_rate`, `sli_measurement`].
   * `username`: Username for Basic Auth.
   * `password`: Password for Basic Auth.
   * `job`: Name of `Pushgateway` job. Defaults to `slo-generator`.
 
+***Note:*** `prometheus` needs to be set up to **scrape metrics from `Pushgateway`**
+(see the [documentation](https://github.com/prometheus/pushgateway) for more details).
+
 **→ [Full SLO config](../../samples/prometheus/slo_prom_metrics_availability_query_sli.yaml)**
 
 ### Examples
 
-Complete SLO samples using `Prometheus` are available in
+Complete SLO samples using `prometheus` are available in
 [samples/prometheus](../../samples/prometheus). Check them out !
diff --git a/docs/providers/pubsub.md b/docs/providers/pubsub.md
index 197274f4..0f9f2437 100644
--- a/docs/providers/pubsub.md
+++ b/docs/providers/pubsub.md
@@ -2,18 +2,16 @@
 
 ## Exporter
 
-The `Pubsub` exporter will export SLO reports to a Pub/Sub topic, in JSON format.
-
-This allows teams to consume SLO reports in real-time, and take appropriate
-actions when they see a need.
-
-**Example config:**
+The `pubsub` exporter will export SLO reports to a Pub/Sub topic, in JSON format.
 
 ```yaml
 exporters:
-  - class: Pubsub
+  pubsub:
     project_id: "${PUBSUB_PROJECT_ID}"
     topic_name: "${PUBSUB_TOPIC_NAME}"
 ```
 
-**→ [Full SLO config](../../samples/stackdriver/slo_pubsub_subscription_throughput.yaml)**
+This allows teams to consume SLO reports in real-time, and take appropriate
+actions when they see a need.
+
+**→ [Full SLO config](../../samples/cloud_monitoring/slo_pubsub_subscription_throughput.yaml)**
diff --git a/docs/shared/metrics.md b/docs/shared/metrics.md
index a6cecf4d..b50c95ed 100644
--- a/docs/shared/metrics.md
+++ b/docs/shared/metrics.md
@@ -63,23 +63,23 @@ metrics:
 ```
 where:
-* `name`: name of the [SLO Report](../../tests/unit/fixtures/slo_report.json)
+* `name`: name of the [SLO Report](../../tests/unit/fixtures/slo_report_v2.json)
 field to export as a metric. The field MUST exist in the SLO report.
 * `description`: description of the metric (if the metrics exporter supports it).
 * `alias` (optional): rename the metric before writing to the monitoring
 backend.
 * `additional_labels` (optional) allow you to specify other labels to the
 timeseries written. Each label name must correspond to a field of the
-[SLO Report](../../tests/unit/fixtures/slo_report.json).
+[SLO Report](../../tests/unit/fixtures/slo_report_v2.json).
 
 ## Metric exporters
 
 Some metrics exporters have a specific `prefix` that is prepended to the
 metric name:
-* `StackdriverExporter` prefix: `custom.googleapis.com/`
-* `DatadogExporter` prefix: `custom:`
+* `cloud_monitoring` exporter prefix: `custom.googleapis.com/`
+* `datadog` exporter prefix: `custom:`
 
 Some metrics exporters have a limit of `labels` that can be written to their
 metrics timeseries:
-* `StackdriverExporter` labels limit: `10`.
+* `cloud_monitoring` exporter labels limit: `10`.
 
 Those are standards and cannot be modified.
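+
+For instance, a minimal sketch of a `metrics` block for the `cloud_monitoring`
+exporter could look like the following (this assumes `sli_measurement` and
+`good_events_count` are fields present in your SLO reports):
+
+```yaml
+exporters:
+  cloud_monitoring:
+    project_id: ${WORKSPACE_PROJECT_ID}
+    metrics:
+      # written as custom.googleapis.com/error_budget_burn_rate (exporter prefix)
+      - name: error_budget_burn_rate
+        description: Speed at which the error budget is consumed.
+        additional_labels:
+          - good_events_count  # assumed SLO report field
+      # exported under an alias instead of its SLO report field name
+      - name: sli_measurement
+        alias: sli
+```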
diff --git a/docs/shared/migration.md b/docs/shared/migration.md
new file mode 100644
index 00000000..65f9ef72
--- /dev/null
+++ b/docs/shared/migration.md
@@ -0,0 +1,33 @@
+# Migrating `slo-generator` to the next major version
+
+## v1 to v2
+
+Version `v2` of the slo-generator introduces some changes to the structure of
+the SLO configurations.
+
+To migrate your SLO configurations from v1 to v2, please execute the following
+instructions:
+
+**Upgrade `slo-generator`:**
+```
+pip3 install slo-generator -U # upgrades slo-generator to the latest version
+```
+
+**Run the `slo-generator-migrate` command:**
+```
+slo-generator-migrate -s <SOURCE_FOLDER> -t <TARGET_FOLDER> -b <ERROR_BUDGET_POLICY_PATH>
+```
+where:
+* `<SOURCE_FOLDER>` is the source folder containing SLO configurations in v1 format.
+This folder can have nested subfolders containing SLOs. The subfolder structure
+will be reproduced in the target folder.
+
+* `<TARGET_FOLDER>` is the target folder to write the SLO configurations in v2
+format to. If the target folder is identical to the source folder, the existing SLO
+configurations will be updated in place.
+
+* `<ERROR_BUDGET_POLICY_PATH>` is the path to your error budget policy configuration.
+
+**Follow the printed instructions to finish the migration:**
+This includes committing the resulting files to git and updating your Terraform
+modules to the version that supports the v2 configuration format.
diff --git a/docs/shared/troubleshooting.md b/docs/shared/troubleshooting.md
index 45fa562b..29d6eef5 100644
--- a/docs/shared/troubleshooting.md
+++ b/docs/shared/troubleshooting.md
@@ -2,7 +2,7 @@
 
 ## Problem
 
-**`StackdriverExporter`: Labels limit (10) reached.**
+**`cloud_monitoring` exporter: Labels limit (10) reached.**
 
 ```
 The new labels would cause the metric custom.googleapis.com/slo_target to have over 10 labels.: timeSeries[0]"
diff --git a/samples/README.md b/samples/README.md
index e7198ba8..201181fe 100644
--- a/samples/README.md
+++ b/samples/README.md
@@ -14,17 +14,17 @@ running it.
 
 The following lists all environment variables found in the SLO configs,
 per backend:
 
-`stackdriver/`:
-  - `STACKDRIVER_HOST_PROJECT_ID`: Stackdriver host project id.
-  - `STACKDRIVER_LOG_METRIC_NAME`: Stackdriver log-based metric name.
+`cloud_monitoring/`:
+  - `WORKSPACE_PROJECT_ID`: Cloud Monitoring host project id.
+  - `LOG_METRIC_NAME`: Cloud Logging log-based metric name.
   - `GAE_PROJECT_ID`: Google App Engine application project id.
   - `GAE_MODULE_ID`: Google App Engine application module id.
   - `PUBSUB_PROJECT_ID`: Pub/Sub project id.
   - `PUBSUB_TOPIC_NAME`: Pub/Sub topic name.
 
-`stackdriver_service_monitoring/`:
-  - `STACKDRIVER_HOST_PROJECT_ID`: Stackdriver host project id.
-  - `STACKDRIVER_LOG_METRIC_NAME`: Stackdriver log-based metric name.
+`cloud_service_monitoring/`:
+  - `WORKSPACE_PROJECT_ID`: Cloud Monitoring host project id.
+  - `LOG_METRIC_NAME`: Cloud Logging log-based metric name.
   - `GAE_PROJECT_ID`: Google App Engine application project id.
   - `GAE_MODULE_ID`: Google App Engine application module id.
   - `PUBSUB_PROJECT_ID`: Pub/Sub project id.
@@ -50,7 +50,7 @@ you're pointing to need to exist.
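+For example, before running the `cloud_monitoring/` samples, you could set
+(hypothetical values shown; substitute your own project ids):
+
+```sh
+export WORKSPACE_PROJECT_ID=my-monitoring-project  # hypothetical values
+export GAE_PROJECT_ID=my-gae-project
+export GAE_MODULE_ID=default
+```
+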
 To run one sample:
 ```
-slo-generator -f samples/stackdriver/<slo_config_file>.yaml
+slo-generator -f samples/cloud_monitoring/<slo_config_file>.yaml
 ```
 
 To run all the samples for a backend:
@@ -68,14 +68,14 @@ slo-generator -f samples/<backend> -b samples/<error_budget_policy>
 
 ### Examples
 
-##### Stackdriver
+##### Cloud Monitoring
 ```
-slo-generator -f samples/stackdriver -b error_budget_policy.yaml
+slo-generator -f samples/cloud_monitoring -b error_budget_policy.yaml
 ```
 
-##### Stackdriver Service Monitoring
+##### Cloud Service Monitoring
 ```
-slo-generator -f samples/stackdriver_service_monitoring -b error_budget_policy_ssm.yaml
+slo-generator -f samples/cloud_service_monitoring -b error_budget_policy_ssm.yaml
 ```
 
 ***Note:*** *the Error Budget Policy is different for this backend, because it only