Improve telemetry documentation #3962

Merged
merged 12 commits into from
Oct 17, 2023
45 changes: 36 additions & 9 deletions docs/source/configuration/metrics.mdx
@@ -56,11 +56,38 @@ apollo_router_http_request_duration_seconds_bucket{le="0.9"} 1

> Note that if you haven't run a query against the router yet, you'll see a blank page because no metrics have been generated!
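
As a rough illustration of how the `_bucket{le="..."}` lines in the Prometheus output can be read, the sketch below parses bucket lines and converts the cumulative counts into per-bucket counts. Prometheus histogram buckets are cumulative; the helper names and all sample values except the `le="0.9"` line are made up for this example.

```python
import re

# Matches exposition lines such as: name_bucket{le="0.9"} 1
BUCKET_LINE = re.compile(r'^(?P<name>\w+)_bucket\{(?P<labels>[^}]*)\}\s+(?P<count>\S+)$')
LE_LABEL = re.compile(r'le="([^"]+)"')


def parse_buckets(text, metric):
    """Return sorted (upper_bound, cumulative_count) pairs for one histogram."""
    buckets = []
    for line in text.splitlines():
        match = BUCKET_LINE.match(line.strip())
        if not match or match.group("name") != metric:
            continue
        le = LE_LABEL.search(match.group("labels"))
        if le is None:
            continue
        # float() accepts "+Inf", which Prometheus uses for the last bucket.
        buckets.append((float(le.group(1)), float(match.group("count"))))
    return sorted(buckets)


def per_bucket_counts(buckets):
    """Convert cumulative bucket counts into counts per individual bucket."""
    previous = 0.0
    result = []
    for upper, cumulative in buckets:
        result.append((upper, cumulative - previous))
        previous = cumulative
    return result


# Hypothetical sample; only the le="0.9" line appears in the docs above.
SAMPLE = """\
apollo_router_http_request_duration_seconds_bucket{le="0.5"} 0
apollo_router_http_request_duration_seconds_bucket{le="0.9"} 1
apollo_router_http_request_duration_seconds_bucket{le="+Inf"} 3
"""

buckets = parse_buckets(SAMPLE, "apollo_router_http_request_duration_seconds")
print(per_bucket_counts(buckets))
```

Real router output typically carries additional labels on each line; the sketch only looks for `le` and ignores the rest.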

### Available metrics
## Datadog via OTLP

To use Datadog, you must configure the Datadog agent to accept OTLP metrics. This can be done by adding the following to your `datadog.yaml`:

```yaml title="datadog.yaml"
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: 127.0.0.1:4317
```

The router must also be configured to send metrics to the Datadog agent:

```yaml title="router.yaml"
telemetry:
  metrics:
    otlp:
      enabled: true
      # Temporality MUST be set to delta. Failure to do this will result in incorrect metrics.
      temporality: delta
      # Optional endpoint, either 'default' or a URL (Defaults to http://127.0.0.1:4317)
      endpoint: default
```

See [Datadog Agent configuration](https://docs.datadoghq.com/opentelemetry/otlp_ingest_in_the_agent/?tab=host) for more details.
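
To make the `temporality: delta` requirement more concrete, here is a minimal sketch of the difference between cumulative and delta reporting for a counter. It does not describe the Datadog agent's internals. It only shows, under the assumption that a consumer adds up the values it receives, why running totals interpreted as deltas would overcount. The function name and sample values are hypothetical.

```python
def cumulative_to_delta(samples):
    """Turn running totals into the increase observed in each interval."""
    deltas = []
    previous = 0
    for total in samples:
        deltas.append(total - previous)
        previous = total
    return deltas


# Hypothetical request counter, exported once per interval as a running total.
cumulative = [5, 12, 12, 20]
deltas = cumulative_to_delta(cumulative)

print(deltas)           # per-interval increases
print(sum(deltas))      # matches the final running total
print(sum(cumulative))  # what a delta consumer would compute from running totals
```

Summing the deltas recovers the true total of 20, whereas summing the cumulative samples as if they were deltas yields 49.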

## Available metrics

The following metrics are available for Prometheus and OpenTelemetry. Attributes are listed where applicable.

#### HTTP
### HTTP

- `apollo_router_http_request_duration_seconds_bucket` - HTTP router request duration
- `apollo_router_http_request_duration_seconds_bucket` - HTTP subgraph request duration, attributes:
@@ -71,12 +98,12 @@ The following metrics are available for Prometheus and OpenTelemetry. Attributes
- `subgraph`: The subgraph being queried
- `status`: If the retry was aborted (`aborted`)

#### Session
### Session

- `apollo_router_session_count_total` - Number of currently connected clients
- `apollo_router_session_count_active` - Number of in-flight GraphQL requests

#### Cache
### Cache

- `apollo_router_cache_size` - Number of entries in the cache
- `apollo_router_cache_hit_count` - Number of cache hits
@@ -89,7 +116,7 @@ All cache metrics listed above have the following attributes:
- `kind`: the cache being queried (`apq`, `query planner`, `introspection`)
- `storage`: The backend storage of the cache (`memory`, `redis`)

#### Coprocessor
### Coprocessor

- `apollo_router_operations_coprocessor_total` - Total operations with coprocessors enabled.
- `apollo_router_operations_coprocessor.duration` - Time spent waiting for the coprocessor to answer, in seconds.
@@ -99,14 +126,14 @@ The coprocessor operations metric has the following attributes:
- `coprocessor.stage`: string (`RouterRequest`, `RouterResponse`, `SubgraphRequest`, `SubgraphResponse`)
- `coprocessor.succeeded`: bool

#### Performance
### Performance

- `apollo_router_processing_time` - Time spent processing a request (outside of waiting for external or subgraph requests) in seconds.
- `apollo_router_query_planning_time` - Time spent planning queries in seconds.
- `apollo_router_query_planning_warmup_duration` - Time spent warming up the query planner in seconds.
- `apollo_router_schema_load_duration` - Time spent loading the schema in seconds.

#### Uplink
### Uplink

- `apollo_router_uplink_fetch_duration_seconds_bucket` - Uplink request duration, attributes:

@@ -122,13 +149,13 @@ The coprocessor operations metric has the following attributes:

Note that the initial call to Uplink during router startup is not reflected in metrics.

#### Subscription
### Subscription

- `apollo_router_opened_subscriptions` - Number of distinct open subscriptions (not the number of clients with an open subscription, since subscriptions may be deduplicated)
- `apollo_router_deduplicated_subscriptions_total` - Number of subscriptions that have been deduplicated
- `apollo_router_skipped_event_count` - Number of subscription events that have been skipped because too many events were received from the subgraph but not yet sent to the client.

#### Batching
### Batching

- `apollo_router.operations.batching` - A counter of the number of query batches received by the router.
- `apollo_router.operations.batching.size` - A histogram tracking the number of queries contained within a query batch.
79 changes: 67 additions & 12 deletions docs/source/configuration/tracing.mdx
@@ -149,7 +149,7 @@ telemetry:

You will need to experiment to find the settings that are appropriate for your use case.

## Using Datadog
## Using Datadog (native)

The Apollo Router can be configured to connect to either the default agent address or a URL.

@@ -158,7 +158,7 @@ telemetry:
tracing:
datadog:
enabled: true
# Either 'default' or a URL (example: 'http://127.0.0.1:8126')
# Optional endpoint, either 'default' or a URL (Defaults to http://localhost:8126/v0.4/traces)
endpoint: default
```

@@ -211,7 +211,9 @@ Instead when `enable_span_mapping` is set to `true` the following trace will be
```


## Using Jaeger
## Using Jaeger (native)

> :warning: [Jaeger native is deprecated](https://opentelemetry.io/blog/2022/jaeger-native-otlp/) and will be removed in a future release of the Apollo Router. Instead, [OTLP](#jaeger) should be used.

The Apollo Router can be configured to export tracing data to Jaeger via either an agent or an HTTP collector.

@@ -226,14 +228,12 @@ telemetry:
enabled: true
# Optional agent configuration,
agent:
# Optional endpoint, either 'default' or a socket address (Defaults to 127.0.0.1:6831)
endpoint: docker_jaeger:14268
# Optional endpoint, either 'default' or a socket address (Defaults to 127.0.0.1:6832)
endpoint: default
```

### Collector config

If you're using Kubernetes, you can inject your secrets into configuration via environment variables:

```yaml title="router.yaml"
telemetry:
tracing:
@@ -242,16 +242,20 @@ telemetry:
# Optional collector configuration,
collector:
# Optional endpoint, either 'default' or a URL (Defaults to http://127.0.0.1:14268/api/traces)
endpoint: "http://my-jaeger-collector"
endpoint: default
username: "${env.JAEGER_USERNAME}"
password: "${env.JAEGER_PASSWORD}"
```

## OpenTelemetry Collector via OTLP
Remember: If you're using Kubernetes, you can inject your secrets into configuration via environment variables.

[OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) is a horizontally scalable collector that you can use to receive, process, and export your telemetry data in a pluggable way.
## OTLP

If you find that the built-in telemetry features of the Apollo Router are missing some desired functionality (e.g., [exporting to Kafka](https://opentelemetry.io/docs/collector/configuration/#exporters)), then it's worth considering this option.
OTLP (OpenTelemetry Protocol) is the native protocol for OpenTelemetry. It can be used to export traces to a variety of backends including:
* OpenTelemetry Collector
* Datadog
* Honeycomb
* Lightstep

```yaml title="router.yaml"
telemetry:
@@ -282,6 +286,57 @@ telemetry:

Remember that the `file.` and `env.` prefixes can be used for expansion in the YAML config, e.g. `${file.ca.txt}`.
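
The expansion mentioned above can be pictured with the following minimal sketch. It is an illustration only, not the router's implementation: it assumes `${env.NAME}` is replaced by the value of the environment variable `NAME` and `${file.PATH}` by the contents of the file at `PATH`, and it handles no other syntax. The `expand` helper and the trimming of file contents are assumptions made for this example.

```python
import os
import re
from pathlib import Path

# Matches ${env.NAME} and ${file.PATH} placeholders.
PLACEHOLDER = re.compile(r"\$\{(env|file)\.([^}]+)\}")


def expand(text, environ=None):
    """Replace ${env.X} and ${file.X} placeholders in a config string."""
    environ = os.environ if environ is None else environ

    def replace(match):
        kind, key = match.group(1), match.group(2)
        if kind == "env":
            # Raises KeyError if the variable is not set.
            return environ[key]
        # Assumption: file contents are inserted with surrounding whitespace trimmed.
        return Path(key).read_text().strip()

    return PLACEHOLDER.sub(replace, text)


# The variable name comes from the Jaeger collector example; the value is made up.
print(expand('username: "${env.JAEGER_USERNAME}"', {"JAEGER_USERNAME": "alice"}))
```

Passing the environment as a dictionary keeps the sketch easy to try without changing real environment variables.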

### Datadog

To use Datadog, you must configure the Datadog agent to accept OTLP traces. This can be done by adding the following to your `datadog.yaml`:

```yaml title="datadog.yaml"
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: 127.0.0.1:4317
```

The router must also be configured to send traces to the Datadog agent:

```yaml title="router.yaml"
telemetry:
  tracing:
    otlp:
      enabled: true

      # Optional endpoint, either 'default' or a URL (Defaults to http://127.0.0.1:4317)
      endpoint: default
```

See [Datadog Agent configuration](https://docs.datadoghq.com/opentelemetry/otlp_ingest_in_the_agent/?tab=host) for more details.

### Jaeger

Since 1.35.0, [Jaeger supports native OTLP ingestion](https://medium.com/jaegertracing/introducing-native-support-for-opentelemetry-in-jaeger-eb661be8183c).

When running Jaeger in Docker, ensure that port 4317 is exposed and that `COLLECTOR_OTLP_ENABLED` is set to `true`:
```bash
docker run --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:1.35
```

Then configure the router to send traces via OTLP:
```yaml title="router.yaml"
telemetry:
  tracing:
    otlp:
      enabled: true

      # Optional endpoint, either 'default' or a URL (Defaults to http://127.0.0.1:4317)
      endpoint: default
```
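
Before sending traces, it can help to confirm that something is listening on the OTLP port. The sketch below is a plain TCP reachability check using Python's standard library. It only shows that the port accepts connections, not that the listener speaks OTLP. The `port_is_open` helper is hypothetical, and the host and port are the defaults quoted above.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # 127.0.0.1:4317 is the default OTLP gRPC endpoint mentioned in the config comment.
    print(port_is_open("127.0.0.1", 4317))
```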

## Using Zipkin

The Apollo Router can be configured to export tracing data to either the default collector address or a URL:
@@ -293,5 +348,5 @@ telemetry:
enabled: true

# Optional endpoint, either 'default' or a URL (Defaults to http://127.0.0.1:9411/api/v2/spans)
endpoint: http://my_zipkin_collector.dev
endpoint: default
```
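
Several exporters in this file share the same "either `default` or a URL" convention for `endpoint`. The sketch below shows one way that convention could be resolved, using the default addresses quoted in the config comments. The dictionary keys and the `resolve_endpoint` function are hypothetical names for illustration, not part of the router.

```python
# Defaults as quoted in the configuration comments; the keys are illustrative only.
DEFAULT_ENDPOINTS = {
    "datadog": "http://localhost:8126/v0.4/traces",
    "jaeger_collector": "http://127.0.0.1:14268/api/traces",
    "otlp": "http://127.0.0.1:4317",
    "zipkin": "http://127.0.0.1:9411/api/v2/spans",
}


def resolve_endpoint(exporter, configured=None):
    """Return the exporter's default when unset or 'default', else the given URL."""
    if configured is None or configured == "default":
        return DEFAULT_ENDPOINTS[exporter]
    return configured


print(resolve_endpoint("zipkin", "default"))
print(resolve_endpoint("zipkin", "http://my_zipkin_collector.dev"))
```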