Improve telemetry documentation #3962

Merged
merged 12 commits into from
Oct 17, 2023
45 changes: 36 additions & 9 deletions docs/source/configuration/metrics.mdx
@@ -56,11 +56,38 @@ apollo_router_http_request_duration_seconds_bucket{le="0.9"} 1

> Note that if you haven't run a query against the router yet, you'll see a blank page because no metrics have been generated!

### Available metrics
## Datadog via OTLP

To use Datadog, you must configure the Datadog agent to accept OTLP metrics. This can be done by adding the following to your `datadog.yaml`:

```yaml title="datadog.yaml"
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: localhost:4317
```

The router must also be configured to send metrics to the Datadog agent:

```yaml title="router.yaml"
telemetry:
  metrics:
    otlp:
      enabled: true
      # Temporality MUST be set to delta. Failure to do this will result in incorrect metrics.
      temporality: delta
      # Set the endpoint of the Datadog agent
      endpoint: http://<datadog-agent>:4317
```

See [Datadog Agent configuration](https://docs.datadoghq.com/opentelemetry/otlp_ingest_in_the_agent/?tab=host) for more details.

## Available metrics

The following metrics are available for Prometheus and OpenTelemetry. Attributes are listed where applicable.

#### HTTP
### HTTP

- `apollo_router_http_request_duration_seconds_bucket` - HTTP router request duration
- `apollo_router_http_request_duration_seconds_bucket` - HTTP subgraph request duration, attributes:
@@ -71,12 +98,12 @@ The following metrics are available for Prometheus and OpenTelemetry. Attributes
- `subgraph`: The subgraph being queried
- `status`: If the retry was aborted (`aborted`)
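
Because the HTTP duration metric is exposed as a histogram (`_bucket` series with an `le` label), latency percentiles have to be derived at query time. As a rough sketch, a Prometheus recording rule could precompute the 95th percentile. The group name, rule name, and 5-minute window below are illustrative assumptions, not part of this PR:

```yaml title="router-rules.yaml"
groups:
  # Hypothetical group name
  - name: apollo_router
    rules:
      # Hypothetical rule name; p95 of HTTP request duration over the last 5 minutes
      - record: apollo_router:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(apollo_router_http_request_duration_seconds_bucket[5m]))
          )
```

Since router and subgraph requests share the same metric name, adding `subgraph` to the `sum by (...)` clause should separate per-subgraph latency from the aggregate.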

#### Session
### Session

- `apollo_router_session_count_total` - Number of currently connected clients
- `apollo_router_session_count_active` - Number of in-flight GraphQL requests

#### Cache
### Cache

- `apollo_router_cache_size` - Number of entries in the cache
- `apollo_router_cache_hit_count` - Number of cache hits
@@ -89,7 +116,7 @@ All cache metrics listed above have the following attributes:
- `kind`: The cache being queried (`apq`, `query planner`, `introspection`)
- `storage`: The backend storage of the cache (`memory`, `redis`)

#### Coprocessor
### Coprocessor

- `apollo_router_operations_coprocessor_total` - Total operations with coprocessors enabled.
- `apollo_router_operations_coprocessor.duration` - Time spent waiting for the coprocessor to answer, in seconds.
@@ -99,14 +126,14 @@ The coprocessor operations metric has the following attributes:
- `coprocessor.stage`: string (`RouterRequest`, `RouterResponse`, `SubgraphRequest`, `SubgraphResponse`)
- `coprocessor.succeeded`: bool

#### Performance
### Performance

- `apollo_router_processing_time` - Time spent processing a request (outside of waiting for external or subgraph requests) in seconds.
- `apollo_router_query_planning_time` - Time spent planning queries in seconds.
- `apollo_router_query_planning_warmup_duration` - Time spent warming up the query planner in seconds.
- `apollo_router_schema_load_duration` - Time spent loading the schema in seconds.

#### Uplink
### Uplink

- `apollo_router_uplink_fetch_duration_seconds_bucket` - Uplink request duration, attributes:

@@ -122,13 +149,13 @@ The coprocessor operations metric has the following attributes:

Note that the initial call to Uplink during router startup will not be reflected in metrics.

#### Subscription
### Subscription

- `apollo_router_opened_subscriptions` - Number of distinct open subscriptions (when subscriptions are deduplicated, this is not the number of clients with an open subscription)
- `apollo_router_deduplicated_subscriptions_total` - Number of subscriptions that have been deduplicated
- `apollo_router_skipped_event_count` - Number of subscription events that have been skipped because too many events were received from the subgraph but not yet sent to the client.

#### Batching
### Batching

- `apollo_router.operations.batching` - A counter of the number of query batches received by the router.
- `apollo_router.operations.batching.size` - A histogram tracking the number of queries contained within a query batch.
55 changes: 49 additions & 6 deletions docs/source/configuration/tracing.mdx
@@ -149,7 +149,7 @@ telemetry:

You will need to experiment to find the settings that are appropriate for your use case.

## Using Datadog
## Using Datadog (native)

The Apollo Router can be configured to connect to either the default agent address or a URL.

@@ -211,7 +211,9 @@ Instead when `enable_span_mapping` is set to `true` the following trace will be
```


## Using Jaeger
## Using Jaeger (native)

> :warning: [Jaeger native is deprecated](https://opentelemetry.io/blog/2022/jaeger-native-otlp/) and will be removed in a future Router release. Instead, the [OpenTelemetry Collector](#jaeger-via-opentelemetry-collector) should be used.

The Apollo Router can be configured to export tracing data to Jaeger either via an agent or an HTTP collector.

@@ -247,11 +249,13 @@ telemetry:
password: "${env.JAEGER_PASSWORD}"
```

## OpenTelemetry Collector via OTLP

[OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) is a horizontally scalable collector that you can use to receive, process, and export your telemetry data in a pluggable way.
## OTLP

If you find that the built-in telemetry features of the Apollo Router are missing some desired functionality (e.g., [exporting to Kafka](https://opentelemetry.io/docs/collector/configuration/#exporters)), then it's worth considering this option.
OTLP is the native protocol for OpenTelemetry. It can be used to export traces to a variety of backends, including:
* OpenTelemetry Collector
* Datadog
* Honeycomb
* Lightstep

```yaml title="router.yaml"
telemetry:
@@ -282,6 +286,45 @@ telemetry:

Remember that the `file.` and `env.` prefixes can be used for expansion in the config YAML, e.g. `${file.ca.txt}`.
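
For instance, expansion can keep the exporter address out of the file. This is only a sketch: `OTLP_ENDPOINT` is a hypothetical environment variable name that would need to be set before the router starts.

```yaml title="router.yaml"
telemetry:
  tracing:
    otlp:
      enabled: true
      # OTLP_ENDPOINT is a hypothetical variable, e.g. http://localhost:4317
      endpoint: "${env.OTLP_ENDPOINT}"
```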

### Datadog

To use Datadog, you must configure the Datadog agent to accept OTLP traces. This can be done by adding the following to your `datadog.yaml`:

```yaml title="datadog.yaml"
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: localhost:4317
```

The router must also be configured to send traces to the Datadog agent:

```yaml title="router.yaml"
telemetry:
  tracing:
    otlp:
      enabled: true

      # Send to Datadog agent
      endpoint: http://<datadog-agent>:4317
```

See [Datadog Agent configuration](https://docs.datadoghq.com/opentelemetry/otlp_ingest_in_the_agent/?tab=host) for more details.
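
When one Datadog agent receives both signals, the metrics and tracing settings from this PR can presumably live in the same `router.yaml`. The sketch below combines only keys that appear in the two Datadog snippets; `<datadog-agent>` remains a placeholder for the agent host.

```yaml title="router.yaml"
telemetry:
  metrics:
    otlp:
      enabled: true
      # Delta temporality, as the metrics documentation requires for Datadog
      temporality: delta
      endpoint: http://<datadog-agent>:4317
  tracing:
    otlp:
      enabled: true
      endpoint: http://<datadog-agent>:4317
```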

### Jaeger via OpenTelemetry Collector

Users wishing to use Jaeger should use the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) via OTLP.

```yaml title="otel-collector.yaml"
exporters:
  # Data sources: traces
  otlp/jaeger:
    endpoint: jaeger-all-in-one:4317
```

See [https://opentelemetry.io/docs/collector/configuration/#exporters](https://opentelemetry.io/docs/collector/configuration/#exporters) for more information.
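
An exporter only takes effect once a pipeline references it. The following is a minimal sketch of a complete collector file, assuming the router sends OTLP over gRPC to the collector on port 4317 and that Jaeger accepts OTLP at `jaeger-all-in-one:4317` without TLS. The receiver address and the `tls.insecure` setting are assumptions suited to local testing, not part of this PR.

```yaml title="otel-collector.yaml"
receivers:
  otlp:
    protocols:
      grpc:
        # Assumed listen address for traces coming from the router
        endpoint: 0.0.0.0:4317

exporters:
  otlp/jaeger:
    endpoint: jaeger-all-in-one:4317
    tls:
      # Assumption: plaintext connection for local testing only
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
```

The router's `telemetry.tracing.otlp.endpoint` would then point at the collector rather than at Jaeger directly.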

## Using Zipkin

The Apollo Router can be configured to export tracing data to either the default collector address or a URL: