docs: Add operation sections, move telemetry status docu #423

Merged · 7 commits · Sep 27, 2023
5 changes: 4 additions & 1 deletion apis/operator/v1alpha1/telemetry_types.go
@@ -50,18 +50,21 @@ type TelemetryStatus struct {
// If all Conditions are met, State is expected to be in StateReady.
Conditions []metav1.Condition `json:"conditions,omitempty"`

// GatewayEndpoints for trace and metric gateway
// endpoints for trace and metric gateway.
// +nullable
GatewayEndpoints GatewayEndpoints `json:"endpoints,omitempty"`
// add other fields to status subresource here
}

type GatewayEndpoints struct {
// traces contains the endpoints for trace gateway supporting OTLP.
Traces *OTLPEndpoints `json:"traces,omitempty"`
}

type OTLPEndpoints struct {
// GRPC endpoint for OTLP.
GRPC string `json:"grpc,omitempty"`
// HTTP endpoint for OTLP.
HTTP string `json:"http,omitempty"`
}

@@ -120,14 +120,18 @@ spec:
type: object
type: array
endpoints:
description: GatewayEndpoints for trace and metric gateway
description: endpoints for trace and metric gateway.
nullable: true
properties:
traces:
description: traces contains the endpoints for trace gateway supporting
OTLP.
properties:
grpc:
description: GRPC endpoint for OTLP.
type: string
http:
description: HTTP endpoint for OTLP.
type: string
type: object
type: object
13 changes: 13 additions & 0 deletions docs/user/02-logs.md
@@ -430,6 +430,19 @@ The record **after** applying the JSON parser:

As per the LogPipeline definition, a dedicated [rewrite_tag](https://docs.fluentbit.io/manual/pipeline/filters/rewrite-tag) filter is introduced. The filter brings a dedicated filesystem buffer for the outputs defined in the related pipeline and, with that, ensures that logs are shipped in isolation from the outputs of other pipelines. As a consequence, each pipeline runs on its own [tag](https://docs.fluentbit.io/manual/concepts/key-concepts#tag).

## Operations

A LogPipeline creates a DaemonSet running one Fluent Bit instance per Node in your cluster. Each instance collects and ships application logs to the configured backend. The Telemetry module ensures that the Fluent Bit instances are operational and healthy at all times, and delivers the data to the backend using typical patterns like buffering and retries (see [Limitations](#limitations)). However, the instances drop logs if the backend is either not reachable for some duration or cannot handle the log load and is causing back pressure.

To detect and avoid these scenarios, you must monitor the instances by collecting relevant metrics. For that, two Services, `telemetry-fluent-bit-metrics` and `telemetry-fluent-bit-exporter-metrics`, are located in the `kyma-system` namespace. For easier discovery, they have the `prometheus.io` annotation.

The relevant metrics are:

| Name | Threshold | Description |
|---|---|---|
| telemetry_fsbuffer_usage_bytes | (bytes/1000000000) * 100 > 90 | The metric indicates the current size (in bytes) of the persistent log buffer on each instance. If the size reaches 1 GB, logs are dropped at that instance. At 90% buffer usage, an alert should be raised. |
| fluentbit_output_dropped_records_total | total[5m] > 0 | The metric indicates that the instance is actively dropping logs. That typically happens when a log message was rejected with a non-retryable status code, such as 400. If logs are dropped, an alert should be raised. |
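
If you run the [Prometheus Operator](https://prometheus-operator.dev/) in your cluster, the two thresholds from the table can be expressed as alerting rules. The following is a minimal sketch; the resource and alert names are placeholders, and `increase(...[5m]) > 0` is one way to express the `total[5m] > 0` threshold:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: telemetry-fluent-bit-alerts   # placeholder name
  namespace: kyma-system
spec:
  groups:
    - name: fluent-bit
      rules:
        # Filesystem buffer exceeds 90% of the 1 GB limit at which logs are dropped.
        - alert: LogBufferAlmostFull
          expr: telemetry_fsbuffer_usage_bytes / 1000000000 * 100 > 90
          for: 5m
          labels:
            severity: warning
        # Fluent Bit actively dropped log records during the last 5 minutes.
        - alert: LogRecordsDropped
          expr: increase(fluentbit_output_dropped_records_total[5m]) > 0
          labels:
            severity: critical
```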


## Limitations

Currently, there are the following limitations for LogPipelines that are served by Fluent Bit:
13 changes: 13 additions & 0 deletions docs/user/03-traces.md
@@ -411,6 +411,19 @@ The Kyma [Eventing](https://kyma-project.io/#/01-overview/eventing/README) compo
### Serverless
By default, all engines for the [Serverless](https://kyma-project.io/#/serverless-manager/user/README) module integrate the [Open Telemetry SDK](https://opentelemetry.io/docs/reference/specification/metrics/sdk/). With that, trace propagation is no longer your concern, because the used middlewares are configured to automatically propagate the context for chained calls. Because the Telemetry endpoints are configured by default, Serverless also reports custom spans for incoming and outgoing requests. You can [customize Function traces](https://kyma-project.io/#/03-tutorials/00-serverless/svls-12-customize-function-traces) to add more spans as part of your Serverless source code.

## Operations

A TracePipeline creates a Deployment running OTel Collector instances in your cluster. Those instances serve OTLP endpoints and ship the received data to the configured backend. The Telemetry module ensures that the OTel Collector instances are operational and healthy at all times, and delivers the data to the backend using typical patterns like buffering and retries (see [Limitations](#limitations)). However, the instances drop trace data if the backend is either not reachable for some duration or cannot handle the load and is causing back pressure.

To detect and avoid these scenarios, you must monitor the instances by collecting relevant metrics. For that, a Service called `telemetry-trace-collector-metrics` is located in the `kyma-system` namespace. For easier discovery, it has the `prometheus.io` annotation.

The relevant metrics are:

| Name | Threshold | Description |
|---|---|---|
| otelcol_exporter_enqueue_failed_spans | total[5m] > 0 | Indicates that new or retried items could not be added to the exporter buffer because the buffer is exhausted. Typically, that happens when the configured backend cannot handle the load on time and is causing back pressure. |
| otelcol_exporter_send_failed_spans | total[5m] > 0 | Indicates that items were rejected in a non-retryable way, for example with a 400 status code. |
| otelcol_processor_refused_spans | total[5m] > 0 | Indicates that items can no longer be received because a processor refuses them. Typically, that happens when the memory of the collector is exhausted because too much data arrived and throttling started. |
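
If you use a plain Prometheus instance instead of the Prometheus Operator, the annotated Service can be discovered with a scrape configuration along the following lines. This is a sketch that assumes the common `prometheus.io/scrape` and `prometheus.io/port` annotation keys; verify the actual annotations set on `telemetry-trace-collector-metrics`:

```yaml
# prometheus.yml (excerpt): discover annotated Services in kyma-system
scrape_configs:
  - job_name: telemetry-trace-collector
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [kyma-system]
    relabel_configs:
      # Keep only targets whose Service opts in via the scrape annotation.
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Rewrite the target address to the port given in the annotation.
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
```

The span metrics listed above can then be turned into alerts, for example with `increase(otelcol_exporter_enqueue_failed_spans[5m]) > 0` as the expression for the first row.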

## Limitations

The trace gateway setup is designed using the following assumptions:
15 changes: 14 additions & 1 deletion docs/user/04-metrics.md
@@ -382,6 +382,19 @@ You activated a MetricPipeline and metrics start streaming to your backend. To v
NAME STATUS AGE
backend Ready 44s

## Operations

A MetricPipeline creates a Deployment running OTel Collector instances in your cluster. Those instances serve OTLP endpoints and ship the received data to the configured backend. The Telemetry module ensures that the OTel Collector instances are operational and healthy at all times, and delivers the data to the backend using typical patterns like buffering and retries (see [Limitations](#limitations)). However, the instances drop metric data if the backend is either not reachable for some duration or cannot handle the load and is causing back pressure.

To detect and avoid these scenarios, you must monitor the instances by collecting relevant metrics. For that, a Service called `telemetry-metric-gateway-metrics` is located in the `kyma-system` namespace. For easier discovery, it has the `prometheus.io` annotation.

The relevant metrics are:

| Name | Threshold | Description |
|---|---|---|
| otelcol_exporter_enqueue_failed_metric_points | total[5m] > 0 | Indicates that new or retried items could not be added to the exporter buffer because the buffer is exhausted. Typically, that happens when the configured backend cannot handle the load on time and is causing back pressure. |
| otelcol_exporter_send_failed_metric_points | total[5m] > 0 | Indicates that items were rejected in a non-retryable way, for example with a 400 status code. |
| otelcol_processor_refused_metric_points | total[5m] > 0 | Indicates that items cannot be received because a processor refuses them. Typically, that happens when the memory of the collector is exhausted because too much data arrived and throttling started. |
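
The same thresholds can be captured in a plain Prometheus rule file. The following is a minimal sketch with placeholder alert names, using `increase(...[5m]) > 0` as the interpretation of `total[5m] > 0`:

```yaml
# rules.yaml (sketch)
groups:
  - name: telemetry-metric-gateway
    rules:
      # The exporter buffer is exhausted; the backend is likely causing back pressure.
      - alert: MetricPointsEnqueueFailed
        expr: increase(otelcol_exporter_enqueue_failed_metric_points[5m]) > 0
      # The backend rejects metric points in a non-retryable way.
      - alert: MetricPointsSendFailed
        expr: increase(otelcol_exporter_send_failed_metric_points[5m]) > 0
      # A processor refuses metric points, typically because collector memory is exhausted.
      - alert: MetricPointsRefused
        expr: increase(otelcol_processor_refused_metric_points[5m]) > 0
```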

## Limitations

The metric gateway setup is based on the following assumptions:
@@ -419,7 +432,7 @@ Cause: The backend is not reachable or wrong authentication credentials are used

Remedy:

1. To check the `telemetry-trace-collector` Pods for error logs, call `kubectl logs -n kyma-system {POD_NAME}`.
1. To check the `telemetry-metric-gateway` Pods for error logs, call `kubectl logs -n kyma-system {POD_NAME}`.
2. Fix the errors.

### Only Istio metrics arrive at the destination
48 changes: 0 additions & 48 deletions docs/user/06-conditions.md

This file was deleted.

8 changes: 5 additions & 3 deletions docs/user/README.md
@@ -20,9 +20,9 @@ Kyma's Telemetry module focuses exactly on the aspects of instrumentation, colle

To support telemetry for your applications, Kyma's Telemetry module provides the following features:

- Guidance for the instrumentation: Based on [Open Telemetry](https://opentelemetry.io/), you get community samples on how to instrument your code using the [Open Telemetry SDKs](https://opentelemetry.io/docs/instrumentation/) in nearly every programming language.
- Tooling for collection, filtering, and shipment: Based on the [Open Telemetry Collector](https://opentelemetry.io/docs/collector/), you can configure basic pipelines to filter and ship telemetry data.
- Integration in a vendor-neutral way to a vendor-specific observability system: Based on the [OpenTelemetry protocol (OTLP)](https://opentelemetry.io/docs/reference/specification/protocol/), you can integrate backend systems.
- Tooling for collection, filtering, and shipment: Based on the [Open Telemetry Collector](https://opentelemetry.io/docs/collector/) and [Fluent Bit](https://fluentbit.io/), you can configure basic pipelines to filter and ship telemetry data.
- Integration in a vendor-neutral way to a vendor-specific observability system (traces and metrics only): Based on the [OpenTelemetry protocol (OTLP)](https://opentelemetry.io/docs/reference/specification/protocol/), you can integrate backend systems.
- Guidance for the instrumentation (traces and metrics only): Based on [Open Telemetry](https://opentelemetry.io/), you get community samples on how to instrument your code using the [Open Telemetry SDKs](https://opentelemetry.io/docs/instrumentation/) in nearly every programming language.
- Opt-out from features for advanced scenarios: At any time, you can opt out for each data type, and use custom tooling to collect and ship the telemetry data.
- SAP BTP as first-class integration: Integration into BTP Observability services is prioritized.

@@ -59,6 +59,8 @@ For details, see [Traces](03-traces.md).

### Metric Gateway/Agent

> **NOTE:** The feature is not available yet. To understand the current progress, watch this [epic](https://github.com/kyma-project/kyma/issues/13079).

The metric gateway and agent are based on an [OTel Collector](https://opentelemetry.io/docs/collector/) [Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/) and a [DaemonSet](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/). The gateway provides an [OTLP-based](https://opentelemetry.io/docs/reference/specification/protocol/) endpoint to which applications can push the metric signals. The agent scrapes annotated Prometheus-based workloads. According to a MetricPipeline configuration, the gateway processes and ships the metric data to a target system.

For more information, see [Metrics](04-metrics.md).