From 822953cc793bf4ff2502cca0fd3a1c024497df57 Mon Sep 17 00:00:00 2001 From: Andreas Thaler Date: Thu, 21 Sep 2023 16:55:44 +0200 Subject: [PATCH 1/7] docu: Add operation sections, move telemetry status docu --- apis/operator/v1alpha1/telemetry_types.go | 5 +- .../operator.kyma-project.io_telemetries.yaml | 6 ++- docs/user/02-logs.md | 13 +++++ docs/user/03-traces.md | 13 +++++ docs/user/04-metrics.md | 11 ++++ docs/user/README.md | 8 +-- docs/user/resources/01-telemetry.md | 53 +++++++++++++++++-- 7 files changed, 100 insertions(+), 9 deletions(-) diff --git a/apis/operator/v1alpha1/telemetry_types.go b/apis/operator/v1alpha1/telemetry_types.go index 5ca913345..93931f255 100644 --- a/apis/operator/v1alpha1/telemetry_types.go +++ b/apis/operator/v1alpha1/telemetry_types.go @@ -50,18 +50,21 @@ type TelemetryStatus struct { // If all Conditions are met, State is expected to be in StateReady. Conditions []metav1.Condition `json:"conditions,omitempty"` - // GatewayEndpoints for trace and metric gateway + // endpoints for trace and metric gateway. // +nullable GatewayEndpoints GatewayEndpoints `json:"endpoints,omitempty"` // add other fields to status subresource here } type GatewayEndpoints struct { + //traces contains the endpoints for trace gateway supporting OTLP. Traces *OTLPEndpoints `json:"traces,omitempty"` } type OTLPEndpoints struct { + //GRPC endpoint for OTLP. GRPC string `json:"grpc,omitempty"` + //HTTP endpoint for OTLP. HTTP string `json:"http,omitempty"` } diff --git a/config/crd/bases/operator.kyma-project.io_telemetries.yaml b/config/crd/bases/operator.kyma-project.io_telemetries.yaml index 55abee001..7f27c13a7 100644 --- a/config/crd/bases/operator.kyma-project.io_telemetries.yaml +++ b/config/crd/bases/operator.kyma-project.io_telemetries.yaml @@ -120,14 +120,18 @@ spec: type: object type: array endpoints: - description: GatewayEndpoints for trace and metric gateway + description: endpoints for trace and metric gateway. nullable: true properties: traces: + description: traces contains the endpoints for trace gateway supporting + OTLP. properties: grpc: + description: GRPC endpoint for OTLP. type: string http: + description: HTTP endpoint for OTLP. type: string type: object type: object diff --git a/docs/user/02-logs.md b/docs/user/02-logs.md index 6c3beb7bd..179c93315 100644 --- a/docs/user/02-logs.md +++ b/docs/user/02-logs.md @@ -430,6 +430,19 @@ The record **after** applying the JSON parser: As per the LogPipeline definition, a dedicated [rewrite_tag](https://docs.fluentbit.io/manual/pipeline/filters/rewrite-tag) filter is introduced. The filter brings a dedicated filesystem buffer for the outputs defined in the related pipeline, and with that, ensures a shipment of the logs isolated from outputs of other pipelines. As a consequence, each pipeline runs on its own [tag](https://docs.fluentbit.io/manual/concepts/key-concepts#tag). +## Operations + +A LogPipeline will result in a DaemonSet running one FluentBit instance per Node in your cluster. That instances will collect and ship application logs to the configured backend. The module will assure that the FluentBit instances are operational at any time and running healthy. It will try to deliver the log data to the backend using typical patterns like buffering and retries (see the Limitations section). However, there are scenarios where the instances will drop logs as the backend is either not reachable for some duration or cannot handle the log load and is causing backpressure. 
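+
+If you run a Prometheus-based monitoring stack, you can turn the thresholds listed in the table below into alerts. The following is only a sketch: it assumes that the Prometheus Operator's `PrometheusRule` CRD is available and that Prometheus already scrapes the two metric Services mentioned below; the rule names and durations are illustrative.
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: telemetry-fluent-bit-alerts # illustrative name
+  namespace: kyma-system
+spec:
+  groups:
+    - name: telemetry-logs
+      rules:
+        - alert: FluentBitBufferAlmostFull
+          # Persistent buffer above 90% of the 1 GB limit; logs are dropped once the limit is reached.
+          expr: (telemetry_fsbuffer_usage_bytes / 1000000000) * 100 > 90
+          for: 5m
+        - alert: FluentBitDroppingLogs
+          # Any dropped records within the last 5 minutes point to non-retryable rejections by the backend.
+          expr: increase(fluentbit_output_dropped_records_total[5m]) > 0
+          for: 5m
+```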
+ +To avoid and detect these situations, you should monitor the instances by collecting relevant metrics. For that, two Services `telemetry-fluent-bit-metrics` and `telemetry-fluent-bit-exporter-metrics` are located in the `kyma-system` namespace being annotated with `prometheus.io` annotations so that a discovery of the metrics is possible. + +The relevant metrics are: +| Name | Threshold | Description | +|---|---|---| +| telemetry_fsbuffer_usage_bytes | (bytes/1000000000) * 100 > 90 | The metric indicates the current size of the persistent log buffer in bytes running on each instance. If the size reaches 1GB, logs will start getting dropped at that instance. At 90% buffer size an alert should get raised. | +| fluentbit_output_dropped_records_total| total[5m] > 0 | The metric indicates that the instance is actively dropping logs. That typically happens when a log message got rejected with a un-retryable status code like a 400. Any occurence of such drop should be alerted. | + + ## Limitations Currently, there are the following limitations for LogPipelines that are served by Fluent Bit: diff --git a/docs/user/03-traces.md b/docs/user/03-traces.md index d18a22f95..d0bacd435 100644 --- a/docs/user/03-traces.md +++ b/docs/user/03-traces.md @@ -411,6 +411,19 @@ The Kyma [Eventing](https://kyma-project.io/#/01-overview/eventing/README) compo ### Serverless By default, all engines for the [Serverless](https://kyma-project.io/#/serverless-manager/user/README) module integrate the [Open Telemetry SDK](https://opentelemetry.io/docs/reference/specification/metrics/sdk/). With that, trace propagation no longer is your concern, because the used middlewares are configured to automatically propagate the context for chained calls. Because the Telemetry endpoints are configured by default, Serverless also reports custom spans for incoming and outgoing requests. You can [customize Function traces](https://kyma-project.io/#/03-tutorials/00-serverless/svls-12-customize-function-traces) to add more spans as part of your Serverless source code. +## Operations + +A TracePipeline will result in a Deployment running OTel Collector instances in your cluster. That instances will serve OTLP endpoints and ship received data to the configured backend. The module will assure that the instances are operational at any time and running healthy. It will try to deliver the data to the backend using typical patterns like buffering and retries (see the Limitations section). However, there are scenarios where the instances will drop data as the backend is either not reachable for some duration or cannot handle the load and is causing backpressure. + +To avoid and detect these situations, you should monitor the instances by collecting relevant metrics. For that, a service `telemetry-trace-collector-metrics` is located in the `kyma-system` namespace being annotated with `prometheus.io` annotations so that a discovery of the metrics is possible. + +The relevant metrics are: +| Name | Threshold | Description | +|---|---|---| +| otelcol_exporter_enqueue_failed_spans | total[5m] > 0 | Indicates that new or retried items could not be added to the exporter buffer anymore as the buffer is exhausted. That usually happens when the configured backend cannot handle the load on time and is causing backpressure. 
| +| otelcol_exporter_send_failed_spans | total[5m] > 0 | Indicates that items are refused in an non-retryable way like a 400 status | +| otelcol_processor_refused_spans | total[5m] > 0 | Indicates that items cannot be received anymore as a processor refuses them. That usually happens when memory of the collector is exhausted as too much data is arriving, then a throttling will start. | + ## Limitations The trace gateway setup is designed using the following assumptions: diff --git a/docs/user/04-metrics.md b/docs/user/04-metrics.md index 15ebaa850..1c4072608 100644 --- a/docs/user/04-metrics.md +++ b/docs/user/04-metrics.md @@ -382,6 +382,17 @@ You activated a MetricPipeline and metrics start streaming to your backend. To v NAME STATUS AGE backend Ready 44s +A MetricPipeline will result in a Deployment running OTel Collector instances in your cluster. That instances will serve OTLP endpoints and ship received data to the configured backend. The module will assure that the instances are operational at any time and running healthy. It will try to deliver the data to the backend using typical patterns like buffering and retries (see the Limitations section). However, there are scenarios where the instances will drop data as the backend is either not reachable for some duration or cannot handle the load and is causing backpressure. + +To avoid and detect these situations, you should monitor the instances by collecting relevant metrics. For that, a service `telemetry-metric-gateway-metrics` is located in the `kyma-system` namespace being annotated with `prometheus.io` annotations so that a discovery of the metrics is possible. + +The relevant metrics are: +| Name | Threshold | Description | +|---|---|---| +| otelcol_exporter_enqueue_failed_spans | total[5m] > 0 | Indicates that new or retried items could not be added to the exporter buffer anymore as the buffer is exhausted. That usually happens when the configured backend cannot handle the load on time and is causing backpressure. | +| otelcol_exporter_send_failed_spans | total[5m] > 0 | Indicates that items are refused in an non-retryable way like a 400 status | +| otelcol_processor_refused_spans | total[5m] > 0 | Indicates that items cannot be received anymore as a processor refuses them. That usually happens when memory of the collector is exhausted as too much data is arriving, then a throttling will start. | + ## Limitations The metric gateway setup is based on the following assumptions: diff --git a/docs/user/README.md b/docs/user/README.md index 5d5d7f188..698b7ea09 100644 --- a/docs/user/README.md +++ b/docs/user/README.md @@ -20,9 +20,9 @@ Kyma's Telemetry module focuses exactly on the aspects of instrumentation, colle To support telemetry for your applications, Kyma's Telemetry module provides the following features: -- Guidance for the instrumentation: Based on [Open Telemetry](https://opentelemetry.io/), you get community samples on how to instrument your code using the [Open Telemetry SDKs](https://opentelemetry.io/docs/instrumentation/) in nearly every programming language. -- Tooling for collection, filtering, and shipment: Based on the [Open Telemetry Collector](https://opentelemetry.io/docs/collector/), you can configure basic pipelines to filter and ship telemetry data. -- Integration in a vendor-neutral way to a vendor-specific observability system: Based on the [OpenTelemetry protocol (OTLP)](https://opentelemetry.io/docs/reference/specification/protocol/), you can integrate backend systems. 
+- Tooling for collection, filtering, and shipment: Based on the [Open Telemetry Collector](https://opentelemetry.io/docs/collector/) and [Fluent Bit](https://fluentbit.io/), you can configure basic pipelines to filter and ship telemetry data. +- Integration in a vendor-neutral way to a vendor-specific observability system (traces and metrics only): Based on the [OpenTelemetry protocol (OTLP)](https://opentelemetry.io/docs/reference/specification/protocol/), you can integrate backend systems. +- Guidance for the instrumentation (traces and metrics only): Based on [Open Telemetry](https://opentelemetry.io/), you get community samples on how to instrument your code using the [Open Telemetry SDKs](https://opentelemetry.io/docs/instrumentation/) in nearly every programming language. - Opt-out from features for advanced scenarios: At any time, you can opt out for each data type, and use custom tooling to collect and ship the telemetry data. - SAP BTP as first-class integration: Integration into BTP Observability services is prioritized. @@ -59,6 +59,8 @@ For details, see [Traces](03-traces.md). ### Metric Gateway/Agent +> **NOTE:** The feature is not available yet. To understand the current progress, watch this [epic](https://github.com/kyma-project/kyma/issues/13079). + The metric gateway and agent are based on an [OTel Collector](https://opentelemetry.io/docs/collector/) [Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/) and a [DaemonSet](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/). The gateway provides an [OTLP-based](https://opentelemetry.io/docs/reference/specification/protocol/) endpoint to which applications can push the metric signals. The agent scrapes annotated Prometheus-based workloads. According to a MetricPipeline configuration, the gateway processes and ships the metric data to a target system. For more information, see [Metrics](04-metrics.md). diff --git a/docs/user/resources/01-telemetry.md b/docs/user/resources/01-telemetry.md index a980b1b39..c48be23b8 100644 --- a/docs/user/resources/01-telemetry.md +++ b/docs/user/resources/01-telemetry.md @@ -74,10 +74,55 @@ For details, see the [Telemetry specification file](https://github.com/kyma-proj | **conditions.​reason** (required) | string | reason contains a programmatic identifier indicating the reason for the condition's last transition. Producers of specific condition types may define expected values and meanings for this field, and whether the values are considered a guaranteed API. The value should be a CamelCase string. This field may not be empty. | | **conditions.​status** (required) | string | status of the condition, one of True, False, Unknown. | | **conditions.​type** (required) | string | type of condition in CamelCase or in foo.example.com/CamelCase. --- Many .condition.type values are consistent across resources like Available, but because arbitrary conditions can be useful (see .node.status.conditions), the ability to deconflict is important. The regex it matches is (dns1123SubdomainFmt/)?(qualifiedNameFmt) | -| **endpoints** | object | GatewayEndpoints for trace and metric gateway | -| **endpoints.​traces** | object | | -| **endpoints.​traces.​grpc** | string | | -| **endpoints.​traces.​http** | string | | +| **endpoints** | object | endpoints for trace and metric gateway. | +| **endpoints.​traces** | object | traces contains the endpoints for trace gateway supporting OTLP. | +| **endpoints.​traces.​grpc** | string | GRPC endpoint for OTLP. 
| +| **endpoints.​traces.​http** | string | HTTP endpoint for OTLP. | | **state** (required) | string | State signifies current state of Module CR. Value can be one of these three: "Ready", "Deleting", or "Warning". | + +The `state` attribute of the Telemetry CR is derived from the combined state of all the subcomponents, namely, from the condition types `LogComponentsHealthy`, `TraceComponentsHealthy` and `MetricComponentsHealthy`. + +### Log Components State + +The state of the log components is determined by the status condition of type `LogComponentsHealthy`: + +| Condition status | Condition reason | Message | +|------------------|----------------------------|-------------------------------------------------| +| True | NoPipelineDeployed | No pipelines have been deployed | +| True | FluentBitDaemonSetReady | Fluent Bit DaemonSet is ready | +| False | ReferencedSecretMissing | One or more referenced Secrets are missing | +| False | FluentBitDaemonSetNotReady | Fluent Bit DaemonSet is not ready | +| False | LogResourceBlocksDeletion | One or more LogPipelines/LogParsers still exist | + +### Trace Components State + +The state of the trace components is determined by the status condition of type `TraceComponentsHealthy`: + +| Condition status | Condition reason | Message | +|------------------|--------------------------------|--------------------------------------------| +| True | NoPipelineDeployed | No pipelines have been deployed | +| True | TraceGatewayDeploymentReady | Trace gateway Deployment is ready | +| False | ReferencedSecretMissing | One or more referenced Secrets are missing | +| False | TraceGatewayDeploymentNotReady | Trace gateway Deployment is not ready | +| False | TraceResourceBlocksDeletion | One or more TracePipelines still exist | + +### Metric Components State + +The state of the metric components is determined by the status condition of type `MetricComponentsHealthy`: + +| Condition status | Condition reason | Message | +|------------------|---------------------------------|--------------------------------------------| +| True | NoPipelineDeployed | No pipelines have been deployed | +| True | MetricGatewayDeploymentReady | Metric gateway Deployment is ready | +| False | ReferencedSecretMissing | One or more referenced Secrets are missing | +| False | MetricGatewayDeploymentNotReady | Metric gateway Deployment is not ready | +| False | MetricResourceBlocksDeletion | One or more MetricPipelines still exist | + + +### Telemetry CR State + +- 'Ready': Only if all the subcomponent conditions (LogComponentsHealthy, TraceComponentsHealthy, and MetricComponentsHealthy) have a status of 'True.' +- 'Warning': If any of these conditions are not 'True'. +- 'Deleting': When a Telemetry CR is being deleted. From b3734e1d5fa7d764a8ced777024847c933c805ee Mon Sep 17 00:00:00 2001 From: Andreas Thaler Date: Thu, 21 Sep 2023 17:14:28 +0200 Subject: [PATCH 2/7] fix --- docs/user/04-metrics.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/docs/user/04-metrics.md b/docs/user/04-metrics.md index 1c4072608..dc5881678 100644 --- a/docs/user/04-metrics.md +++ b/docs/user/04-metrics.md @@ -382,6 +382,8 @@ You activated a MetricPipeline and metrics start streaming to your backend. To v NAME STATUS AGE backend Ready 44s +## Operations + A MetricPipeline will result in a Deployment running OTel Collector instances in your cluster. That instances will serve OTLP endpoints and ship received data to the configured backend. 
The module will assure that the instances are operational at any time and running healthy. It will try to deliver the data to the backend using typical patterns like buffering and retries (see the Limitations section). However, there are scenarios where the instances will drop data as the backend is either not reachable for some duration or cannot handle the load and is causing backpressure. To avoid and detect these situations, you should monitor the instances by collecting relevant metrics. For that, a service `telemetry-metric-gateway-metrics` is located in the `kyma-system` namespace being annotated with `prometheus.io` annotations so that a discovery of the metrics is possible. @@ -389,9 +391,9 @@ To avoid and detect these situations, you should monitor the instances by collec The relevant metrics are: | Name | Threshold | Description | |---|---|---| -| otelcol_exporter_enqueue_failed_spans | total[5m] > 0 | Indicates that new or retried items could not be added to the exporter buffer anymore as the buffer is exhausted. That usually happens when the configured backend cannot handle the load on time and is causing backpressure. | -| otelcol_exporter_send_failed_spans | total[5m] > 0 | Indicates that items are refused in an non-retryable way like a 400 status | -| otelcol_processor_refused_spans | total[5m] > 0 | Indicates that items cannot be received anymore as a processor refuses them. That usually happens when memory of the collector is exhausted as too much data is arriving, then a throttling will start. | +| otelcol_exporter_enqueue_failed_metric_points | total[5m] > 0 | Indicates that new or retried items could not be added to the exporter buffer anymore as the buffer is exhausted. That usually happens when the configured backend cannot handle the load on time and is causing backpressure. | +| otelcol_exporter_send_failed_metric_points | total[5m] > 0 | Indicates that items are refused in an non-retryable way like a 400 status | +| otelcol_processor_refused_metric_points | total[5m] > 0 | Indicates that items cannot be received anymore as a processor refuses them. That usually happens when memory of the collector is exhausted as too much data is arriving, then a throttling will start. 
| ## Limitations From cde4b41b128b6d4be0da25d3652d887c8cc2c6c6 Mon Sep 17 00:00:00 2001 From: Andreas Thaler Date: Fri, 22 Sep 2023 11:33:05 +0200 Subject: [PATCH 3/7] fix --- docs/user/resources/01-telemetry.md | 6 ------ 1 file changed, 6 deletions(-) diff --git a/docs/user/resources/01-telemetry.md b/docs/user/resources/01-telemetry.md index c48be23b8..1de3355ae 100644 --- a/docs/user/resources/01-telemetry.md +++ b/docs/user/resources/01-telemetry.md @@ -23,12 +23,6 @@ Status: grpc: http://telemetry-otlp-traces.kyma-system:4317 http: http://telemetry-otlp-traces.kyma-system:4318 conditions: - - lastTransitionTime: "2023-09-01T15:11:09Z" - message: installation is ready and resources can be used - observedGeneration: 2 - reason: Ready - status: "True" - type: Installation - lastTransitionTime: "2023-09-01T15:28:28Z" message: Fluent Bit DaemonSet is ready observedGeneration: 2 From a4ae3bea51b54d4e2041aac12c01dbc08c7f8523 Mon Sep 17 00:00:00 2001 From: Andreas Thaler Date: Tue, 26 Sep 2023 12:38:42 +0200 Subject: [PATCH 4/7] fix --- docs/user/02-logs.md | 2 +- docs/user/03-traces.md | 2 +- docs/user/04-metrics.md | 4 ++-- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/user/02-logs.md b/docs/user/02-logs.md index 179c93315..d5e16c2cd 100644 --- a/docs/user/02-logs.md +++ b/docs/user/02-logs.md @@ -432,7 +432,7 @@ As per the LogPipeline definition, a dedicated [rewrite_tag](https://docs.fluent ## Operations -A LogPipeline will result in a DaemonSet running one FluentBit instance per Node in your cluster. That instances will collect and ship application logs to the configured backend. The module will assure that the FluentBit instances are operational at any time and running healthy. It will try to deliver the log data to the backend using typical patterns like buffering and retries (see the Limitations section). However, there are scenarios where the instances will drop logs as the backend is either not reachable for some duration or cannot handle the log load and is causing backpressure. +A LogPipeline creates a DaemonSet running one Fluent Bit instance per Node in your cluster. That instance collects and ships application logs to the configured backend. The Telemetry module assures that the Fluent Bit instances are operational and healthy at any time. The Telemetry module delivers the data to the backend using typical patterns like buffering and retries (see [Limitations](#limitations)). However, there are scenarios where the instances will drop logs because the backend is either not reachable for some duration, or cannot handle the log load and is causing back pressure. To avoid and detect these situations, you should monitor the instances by collecting relevant metrics. For that, two Services `telemetry-fluent-bit-metrics` and `telemetry-fluent-bit-exporter-metrics` are located in the `kyma-system` namespace being annotated with `prometheus.io` annotations so that a discovery of the metrics is possible. diff --git a/docs/user/03-traces.md b/docs/user/03-traces.md index d0bacd435..cac3558fb 100644 --- a/docs/user/03-traces.md +++ b/docs/user/03-traces.md @@ -413,7 +413,7 @@ By default, all engines for the [Serverless](https://kyma-project.io/#/serverles ## Operations -A TracePipeline will result in a Deployment running OTel Collector instances in your cluster. That instances will serve OTLP endpoints and ship received data to the configured backend. The module will assure that the instances are operational at any time and running healthy. 
It will try to deliver the data to the backend using typical patterns like buffering and retries (see the Limitations section). However, there are scenarios where the instances will drop data as the backend is either not reachable for some duration or cannot handle the load and is causing backpressure. +A TracePipeline creates a Deployment running OTel Collector instances in your cluster. That instances will serve OTLP endpoints and ship received data to the configured backend. The Telemetry module assures that the OTel Collector instances are operational and healthy at any time. The Telemetry module delivers the data to the backend using typical patterns like buffering and retries (see [Limitations](#limitations)). However, there are scenarios where the instances will drop logs because the backend is either not reachable for some duration, or cannot handle the log load and is causing back pressure. To avoid and detect these situations, you should monitor the instances by collecting relevant metrics. For that, a service `telemetry-trace-collector-metrics` is located in the `kyma-system` namespace being annotated with `prometheus.io` annotations so that a discovery of the metrics is possible. diff --git a/docs/user/04-metrics.md b/docs/user/04-metrics.md index dc5881678..e2f74f467 100644 --- a/docs/user/04-metrics.md +++ b/docs/user/04-metrics.md @@ -384,7 +384,7 @@ You activated a MetricPipeline and metrics start streaming to your backend. To v ## Operations -A MetricPipeline will result in a Deployment running OTel Collector instances in your cluster. That instances will serve OTLP endpoints and ship received data to the configured backend. The module will assure that the instances are operational at any time and running healthy. It will try to deliver the data to the backend using typical patterns like buffering and retries (see the Limitations section). However, there are scenarios where the instances will drop data as the backend is either not reachable for some duration or cannot handle the load and is causing backpressure. +A MetricPipeline creates a Deployment running OTel Collector instances in your cluster. That instances will serve OTLP endpoints and ship received data to the configured backend. The Telemetry module assures that the OTel Collector instances are operational and healthy at any time. The Telemetry module delivers the data to the backend using typical patterns like buffering and retries (see [Limitations](#limitations)). However, there are scenarios where the instances will drop logs because the backend is either not reachable for some duration, or cannot handle the log load and is causing back pressure. To avoid and detect these situations, you should monitor the instances by collecting relevant metrics. For that, a service `telemetry-metric-gateway-metrics` is located in the `kyma-system` namespace being annotated with `prometheus.io` annotations so that a discovery of the metrics is possible. @@ -432,7 +432,7 @@ Cause: The backend is not reachable or wrong authentication credentials are used Remedy: -1. To check the `telemetry-trace-collector` Pods for error logs, call `kubectl logs -n kyma-system {POD_NAME}`. +1. To check the `telemetry-metric-gateway` Pods for error logs, call `kubectl logs -n kyma-system {POD_NAME}`. 2. Fix the errors. 
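+
+Backend connectivity problems also show up in the gateway's own metrics described in the Operations section: `otelcol_exporter_send_failed_metric_points` and `otelcol_exporter_enqueue_failed_metric_points` start to increase. As an illustration only, a minimal alerting-rule sketch for these counters, assuming the Prometheus Operator's `PrometheusRule` CRD (names and durations are examples):
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: telemetry-metric-gateway-alerts # illustrative name
+  namespace: kyma-system
+spec:
+  groups:
+    - name: telemetry-metrics
+      rules:
+        - alert: MetricGatewayExporterSendFailed
+          # Items were refused by the backend in a non-retryable way (for example, HTTP 400).
+          expr: increase(otelcol_exporter_send_failed_metric_points[5m]) > 0
+        - alert: MetricGatewayExporterBufferExhausted
+          # New or retried items no longer fit into the exporter buffer, typically caused by back pressure.
+          expr: increase(otelcol_exporter_enqueue_failed_metric_points[5m]) > 0
+```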
### Only Istio metrics arrive at the destination From 7e294a4d586e80bafcf907a52573f7c3c7404ad4 Mon Sep 17 00:00:00 2001 From: Andreas Thaler Date: Tue, 26 Sep 2023 12:40:13 +0200 Subject: [PATCH 5/7] fix --- docs/user/02-logs.md | 2 +- docs/user/03-traces.md | 2 +- docs/user/04-metrics.md | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/user/02-logs.md b/docs/user/02-logs.md index d5e16c2cd..05d9fff90 100644 --- a/docs/user/02-logs.md +++ b/docs/user/02-logs.md @@ -434,7 +434,7 @@ As per the LogPipeline definition, a dedicated [rewrite_tag](https://docs.fluent A LogPipeline creates a DaemonSet running one Fluent Bit instance per Node in your cluster. That instance collects and ships application logs to the configured backend. The Telemetry module assures that the Fluent Bit instances are operational and healthy at any time. The Telemetry module delivers the data to the backend using typical patterns like buffering and retries (see [Limitations](#limitations)). However, there are scenarios where the instances will drop logs because the backend is either not reachable for some duration, or cannot handle the log load and is causing back pressure. -To avoid and detect these situations, you should monitor the instances by collecting relevant metrics. For that, two Services `telemetry-fluent-bit-metrics` and `telemetry-fluent-bit-exporter-metrics` are located in the `kyma-system` namespace being annotated with `prometheus.io` annotations so that a discovery of the metrics is possible. +To avoid and detect these scenarios, you must monitor the instances by collecting relevant metrics. For that, two Services `telemetry-fluent-bit-metrics` and `telemetry-fluent-bit-exporter-metrics` are located in the `kyma-system` namespace. For easier discovery, they have the `prometheus.io` annotation. The relevant metrics are: | Name | Threshold | Description | diff --git a/docs/user/03-traces.md b/docs/user/03-traces.md index cac3558fb..addf085fd 100644 --- a/docs/user/03-traces.md +++ b/docs/user/03-traces.md @@ -415,7 +415,7 @@ By default, all engines for the [Serverless](https://kyma-project.io/#/serverles A TracePipeline creates a Deployment running OTel Collector instances in your cluster. That instances will serve OTLP endpoints and ship received data to the configured backend. The Telemetry module assures that the OTel Collector instances are operational and healthy at any time. The Telemetry module delivers the data to the backend using typical patterns like buffering and retries (see [Limitations](#limitations)). However, there are scenarios where the instances will drop logs because the backend is either not reachable for some duration, or cannot handle the log load and is causing back pressure. -To avoid and detect these situations, you should monitor the instances by collecting relevant metrics. For that, a service `telemetry-trace-collector-metrics` is located in the `kyma-system` namespace being annotated with `prometheus.io` annotations so that a discovery of the metrics is possible. +To avoid and detect these scenarios, you must monitor the instances by collecting relevant metrics. For that, a service `telemetry-trace-collector-metrics` is located in the `kyma-system` namespace. For easier discovery, they have the `prometheus.io` annotation. 
The relevant metrics are: | Name | Threshold | Description | diff --git a/docs/user/04-metrics.md b/docs/user/04-metrics.md index e2f74f467..8263fcb8a 100644 --- a/docs/user/04-metrics.md +++ b/docs/user/04-metrics.md @@ -386,7 +386,7 @@ You activated a MetricPipeline and metrics start streaming to your backend. To v A MetricPipeline creates a Deployment running OTel Collector instances in your cluster. That instances will serve OTLP endpoints and ship received data to the configured backend. The Telemetry module assures that the OTel Collector instances are operational and healthy at any time. The Telemetry module delivers the data to the backend using typical patterns like buffering and retries (see [Limitations](#limitations)). However, there are scenarios where the instances will drop logs because the backend is either not reachable for some duration, or cannot handle the log load and is causing back pressure. -To avoid and detect these situations, you should monitor the instances by collecting relevant metrics. For that, a service `telemetry-metric-gateway-metrics` is located in the `kyma-system` namespace being annotated with `prometheus.io` annotations so that a discovery of the metrics is possible. +To avoid and detect these scenarios, you must monitor the instances by collecting relevant metrics. For that, a service `telemetry-metric-gateway-metrics` is located in the `kyma-system` namespace. For easier discovery, they have the `prometheus.io` annotation. The relevant metrics are: | Name | Threshold | Description | From 58b140a5291284b2948ed4235fcd3ec6dcf685fc Mon Sep 17 00:00:00 2001 From: Andreas Thaler Date: Tue, 26 Sep 2023 12:43:09 +0200 Subject: [PATCH 6/7] Apply suggestions from code review Co-authored-by: Nina Hingerl <76950046+NHingerl@users.noreply.github.com> --- docs/user/02-logs.md | 4 ++-- docs/user/03-traces.md | 4 ++-- docs/user/04-metrics.md | 4 ++-- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/user/02-logs.md b/docs/user/02-logs.md index 05d9fff90..be3ece1c1 100644 --- a/docs/user/02-logs.md +++ b/docs/user/02-logs.md @@ -439,8 +439,8 @@ To avoid and detect these scenarios, you must monitor the instances by collectin The relevant metrics are: | Name | Threshold | Description | |---|---|---| -| telemetry_fsbuffer_usage_bytes | (bytes/1000000000) * 100 > 90 | The metric indicates the current size of the persistent log buffer in bytes running on each instance. If the size reaches 1GB, logs will start getting dropped at that instance. At 90% buffer size an alert should get raised. | -| fluentbit_output_dropped_records_total| total[5m] > 0 | The metric indicates that the instance is actively dropping logs. That typically happens when a log message got rejected with a un-retryable status code like a 400. Any occurence of such drop should be alerted. | +| telemetry_fsbuffer_usage_bytes | (bytes/1000000000) * 100 > 90 | The metric indicates the current size (in bytes) of the persistent log buffer running on each instance. If the size reaches 1GB, logs are dropped at that instance. At 90% buffer size, an alert should be raised. | +| fluentbit_output_dropped_records_total| total[5m] > 0 | The metric indicates that the instance is actively dropping logs. That typically happens when a log message was rejected with a un-retryable status code like a 400. If logs are dropped, an alert should be raised. 
| ## Limitations diff --git a/docs/user/03-traces.md b/docs/user/03-traces.md index addf085fd..6c84d7377 100644 --- a/docs/user/03-traces.md +++ b/docs/user/03-traces.md @@ -420,9 +420,9 @@ To avoid and detect these scenarios, you must monitor the instances by collectin The relevant metrics are: | Name | Threshold | Description | |---|---|---| -| otelcol_exporter_enqueue_failed_spans | total[5m] > 0 | Indicates that new or retried items could not be added to the exporter buffer anymore as the buffer is exhausted. That usually happens when the configured backend cannot handle the load on time and is causing backpressure. | +| otelcol_exporter_enqueue_failed_spans | total[5m] > 0 | Indicates that new or retried items could not be added to the exporter buffer because the buffer is exhausted. Typically, that happens when the configured backend cannot handle the load on time and is causing back pressure. | | otelcol_exporter_send_failed_spans | total[5m] > 0 | Indicates that items are refused in an non-retryable way like a 400 status | -| otelcol_processor_refused_spans | total[5m] > 0 | Indicates that items cannot be received anymore as a processor refuses them. That usually happens when memory of the collector is exhausted as too much data is arriving, then a throttling will start. | +| otelcol_processor_refused_spans | total[5m] > 0 | Indicates that items cannot be received anymore because a processor refuses them. Typically, that happens when memory of the collector is exhausted because too much data arrived and throttling started. | ## Limitations diff --git a/docs/user/04-metrics.md b/docs/user/04-metrics.md index 8263fcb8a..24003071f 100644 --- a/docs/user/04-metrics.md +++ b/docs/user/04-metrics.md @@ -391,9 +391,9 @@ To avoid and detect these scenarios, you must monitor the instances by collectin The relevant metrics are: | Name | Threshold | Description | |---|---|---| -| otelcol_exporter_enqueue_failed_metric_points | total[5m] > 0 | Indicates that new or retried items could not be added to the exporter buffer anymore as the buffer is exhausted. That usually happens when the configured backend cannot handle the load on time and is causing backpressure. | +| otelcol_exporter_enqueue_failed_metric_points | total[5m] > 0 | Indicates that new or retried items could not be added to the exporter buffer because the buffer is exhausted. Typically, that happens when the configured backend cannot handle the load on time and is causing back pressure. | | otelcol_exporter_send_failed_metric_points | total[5m] > 0 | Indicates that items are refused in an non-retryable way like a 400 status | -| otelcol_processor_refused_metric_points | total[5m] > 0 | Indicates that items cannot be received anymore as a processor refuses them. That usually happens when memory of the collector is exhausted as too much data is arriving, then a throttling will start. | +| otelcol_processor_refused_metric_points | total[5m] > 0 | Indicates that items cannot be received because a processor refuses them. That usually happens when memory of the collector is exhausted because too much data arrived and throttling started.. 
| ## Limitations From 7e6df4bf3f626faa9262bb1ba9492737b843c080 Mon Sep 17 00:00:00 2001 From: Andreas Thaler Date: Wed, 27 Sep 2023 10:32:24 +0200 Subject: [PATCH 7/7] merge --- docs/user/06-conditions.md | 48 ----------------------------- docs/user/resources/01-telemetry.md | 9 ++++-- 2 files changed, 6 insertions(+), 51 deletions(-) delete mode 100644 docs/user/06-conditions.md diff --git a/docs/user/06-conditions.md b/docs/user/06-conditions.md deleted file mode 100644 index a1da0b95f..000000000 --- a/docs/user/06-conditions.md +++ /dev/null @@ -1,48 +0,0 @@ -# Telemetry CR conditions - -This section describes the possible states of the Telemetry CR. -The state of the Telemetry CR is derived from the combined state of all the subcomponents, namely, from the condition types `LogComponentsHealthy`, `TraceComponentsHealthy` and `MetricComponentsHealthy`. - -## Log Components State - -The state of the log components is determined by the status condition of type `LogComponentsHealthy`: - -| Condition status | Condition reason | Message | -|------------------|-------------------------|-------------------------------------------------| -| True | NoPipelineDeployed | No pipelines have been deployed | -| True | FluentBitDaemonSetReady | Fluent Bit DaemonSet is ready | -| False | ReferencedSecretMissing | One or more referenced Secrets are missing | -| False | FluentBitDaemonSetNotReady | Fluent Bit DaemonSet is not ready | -| False | ResourceBlocksDeletion | The deletion of the module is blocked. To unblock the deletion, delete the following resources: LogPipelines (resource-1, resource-2,...), LogParsers (resource-1, resource-2,...) | - - -## Trace Components State - -The state of the trace components is determined by the status condition of type `TraceComponentsHealthy`: - -| Condition status | Condition reason | Message | -|------------------|---------------------------|--------------------------------------------| -| True | NoPipelineDeployed | No pipelines have been deployed | -| True | TraceGatewayDeploymentReady | Trace gateway Deployment is ready | -| False | ReferencedSecretMissing | One or more referenced Secrets are missing | -| False | TraceGatewayDeploymentNotReady | Trace gateway Deployment is not ready | -| False | ResourceBlocksDeletion | The deletion of the module is blocked. To unblock the deletion, delete the following resources: TracePipelines (resource-1, resource-2,...) | - -## Metric Components State - -The state of the metric components is determined by the status condition of type `MetricComponentsHealthy`: - -| Condition status | Condition reason | Message | -|------------------|---------------------------|--------------------------------------------| -| True | NoPipelineDeployed | No pipelines have been deployed | -| True | MetricGatewayDeploymentReady | Metric gateway Deployment is ready | -| False | ReferencedSecretMissing | One or more referenced Secrets are missing | -| False | MetricGatewayDeploymentNotReady | Metric gateway Deployment is not ready | -| False | ResourceBlocksDeletion | The deletion of the module is blocked. To unblock the deletion, delete the following resources: MetricPipelines (resource-1, resource-2,...) | - - -## Telemetry CR State - -- 'Ready': Only if all the subcomponent conditions (LogComponentsHealthy, TraceComponentsHealthy, and MetricComponentsHealthy) have a status of 'True.' -- 'Warning': If any of these conditions are not 'True'. -- 'Deleting': When a Telemetry CR is being deleted. 
diff --git a/docs/user/resources/01-telemetry.md b/docs/user/resources/01-telemetry.md index 1de3355ae..e887f67f9 100644 --- a/docs/user/resources/01-telemetry.md +++ b/docs/user/resources/01-telemetry.md @@ -88,7 +88,8 @@ The state of the log components is determined by the status condition of type `L | True | FluentBitDaemonSetReady | Fluent Bit DaemonSet is ready | | False | ReferencedSecretMissing | One or more referenced Secrets are missing | | False | FluentBitDaemonSetNotReady | Fluent Bit DaemonSet is not ready | -| False | LogResourceBlocksDeletion | One or more LogPipelines/LogParsers still exist | +| False | ResourceBlocksDeletion | The deletion of the module is blocked. To unblock the deletion, delete the following resources: LogPipelines (resource-1, resource-2,...), LogParsers (resource-1, resource-2,...) | + ### Trace Components State @@ -100,7 +101,8 @@ The state of the trace components is determined by the status condition of type | True | TraceGatewayDeploymentReady | Trace gateway Deployment is ready | | False | ReferencedSecretMissing | One or more referenced Secrets are missing | | False | TraceGatewayDeploymentNotReady | Trace gateway Deployment is not ready | -| False | TraceResourceBlocksDeletion | One or more TracePipelines still exist | +| False | ResourceBlocksDeletion | The deletion of the module is blocked. To unblock the deletion, delete the following resources: TracePipelines (resource-1, resource-2,...) | + ### Metric Components State @@ -112,7 +114,8 @@ The state of the metric components is determined by the status condition of type | True | MetricGatewayDeploymentReady | Metric gateway Deployment is ready | | False | ReferencedSecretMissing | One or more referenced Secrets are missing | | False | MetricGatewayDeploymentNotReady | Metric gateway Deployment is not ready | -| False | MetricResourceBlocksDeletion | One or more MetricPipelines still exist | +| False | ResourceBlocksDeletion | The deletion of the module is blocked. To unblock the deletion, delete the following resources: MetricPipelines (resource-1, resource-2,...) | + ### Telemetry CR State
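+
+For illustration, the status of a healthy Telemetry resource combines the conditions described above roughly as follows. The excerpt is condensed (timestamps and observed generations omitted); the endpoint values correspond to the example at the top of this page.
+
+```yaml
+status:
+  state: Ready
+  conditions:
+    - type: LogComponentsHealthy
+      status: "True"
+      reason: FluentBitDaemonSetReady
+      message: Fluent Bit DaemonSet is ready
+    - type: TraceComponentsHealthy
+      status: "True"
+      reason: TraceGatewayDeploymentReady
+      message: Trace gateway Deployment is ready
+    - type: MetricComponentsHealthy
+      status: "True"
+      reason: MetricGatewayDeploymentReady
+      message: Metric gateway Deployment is ready
+  endpoints:
+    traces:
+      grpc: http://telemetry-otlp-traces.kyma-system:4317
+      http: http://telemetry-otlp-traces.kyma-system:4318
+```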