Improve metrics and tracing docs
1. Deprecate Jaeger native.
2. Improve OTLP docs.
3. Add specific documentation for Datadog via OTLP.
bryn committed Oct 3, 2023
1 parent ec6d2dd commit 80ac895
Showing 2 changed files with 85 additions and 15 deletions.
45 changes: 36 additions & 9 deletions docs/source/configuration/metrics.mdx
@@ -56,11 +56,38 @@ apollo_router_http_request_duration_seconds_bucket{le="0.9"} 1

> Note that if you haven't run a query against the router yet, you'll see a blank page because no metrics have been generated!
### Available metrics
## Datadog via OTLP

To use Datadog, you must configure the Datadog agent to accept OTLP metrics. To do so, add the following to your `datadog.yaml`:

```yaml title="datadog.yaml"
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: localhost:4317
```
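
The `localhost:4317` endpoint above accepts connections only from the agent's own host. That works when the router runs alongside the agent. If the router runs on a different host or in a separate container, the receiver must listen on an address the router can reach. The sketch below assumes you want the agent to accept OTLP over gRPC on all interfaces. Restrict access with your own firewall or network rules as appropriate.

```yaml title="datadog.yaml"
otlp_config:
  receiver:
    protocols:
      grpc:
        # Listen on all interfaces so that a router on another host
        # or in another container can reach the agent (assumption: your
        # network setup restricts who can connect to this port)
        endpoint: 0.0.0.0:4317
```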

The router must also be configured to send metrics to the Datadog agent:

```yaml title="router.yaml"
telemetry:
  metrics:
    otlp:
      enabled: true
      # Temporality MUST be set to delta. Failure to do this will result in incorrect metrics.
      temporality: delta
      # Set the endpoint of the Datadog agent
      endpoint: http://<datadog-agent>:4317
```

See [Datadog Agent configuration](https://docs.datadoghq.com/opentelemetry/otlp_ingest_in_the_agent/?tab=host) for more details.

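
You can send metrics and traces to the same agent. The following is a minimal `router.yaml` sketch that enables both. It assumes the agent's OTLP gRPC receiver listens on port 4317, and it uses the same `<datadog-agent>` placeholder as above. It uses only the `telemetry.metrics.otlp` and `telemetry.tracing.otlp` keys described in these docs.

```yaml title="router.yaml"
telemetry:
  metrics:
    otlp:
      enabled: true
      # Delta temporality is required for correct metrics in Datadog
      temporality: delta
      endpoint: http://<datadog-agent>:4317
  tracing:
    otlp:
      enabled: true
      # Same agent receiver as the metrics exporter
      endpoint: http://<datadog-agent>:4317
```

Both exporters point at the same receiver. Temporality is a metrics concept, so only the metrics exporter needs the `temporality: delta` setting.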
## Available metrics
The following metrics are available for Prometheus and OpenTelemetry. Attributes are listed where applicable.
#### HTTP
### HTTP
- `apollo_router_http_request_duration_seconds_bucket` - HTTP router request duration
- `apollo_router_http_request_duration_seconds_bucket` - HTTP subgraph request duration, attributes:
@@ -71,12 +98,12 @@ The following metrics are available for Prometheus and OpenTelemetry. Attributes
- `subgraph`: The subgraph being queried
- `status`: If the retry was aborted (`aborted`)

#### Session
### Session

- `apollo_router_session_count_total` - Number of currently connected clients
- `apollo_router_session_count_active` - Number of in-flight GraphQL requests

#### Cache
### Cache

- `apollo_router_cache_size` - Number of entries in the cache
- `apollo_router_cache_hit_count` - Number of cache hits
@@ -89,7 +116,7 @@ All cache metrics listed above have the following attributes:
- `kind`: The cache being queried (`apq`, `query planner`, `introspection`)
- `storage`: The backend storage of the cache (`memory`, `redis`)

#### Coprocessor
### Coprocessor

- `apollo_router_operations_coprocessor_total` - Total operations with coprocessors enabled.
- `apollo_router_operations_coprocessor.duration` - Time spent waiting for the coprocessor to answer, in seconds.
@@ -99,14 +126,14 @@ The coprocessor operations metric has the following attributes:
- `coprocessor.stage`: string (`RouterRequest`, `RouterResponse`, `SubgraphRequest`, `SubgraphResponse`)
- `coprocessor.succeeded`: bool

#### Performance
### Performance

- `apollo_router_processing_time` - Time spent processing a request (outside of waiting for external or subgraph requests) in seconds.
- `apollo_router_query_planning_time` - Time spent planning queries in seconds.
- `apollo_router_query_planning_warmup_duration` - Time spent planning queries during query planner warm-up, in seconds.
- `apollo_router_schema_load_duration` - Time spent loading the schema in seconds.

#### Uplink
### Uplink

- `apollo_router_uplink_fetch_duration_seconds_bucket` - Uplink request duration, attributes:

@@ -122,13 +149,13 @@ The coprocessor operations metric has the following attributes:

The initial call to Uplink during router startup is not reflected in metrics.

#### Subscription
### Subscription

- `apollo_router_opened_subscriptions` - Number of distinct open subscriptions. Because subscriptions can be deduplicated, this is not the same as the number of clients with an open subscription.
- `apollo_router_deduplicated_subscriptions_total` - Number of subscriptions that have been deduplicated
- `apollo_router_skipped_event_count` - Number of subscription events that have been skipped because too many events received from the subgraph were still waiting to be sent to the client.

#### Batching
### Batching

- `apollo_router.operations.batching` - A counter of the number of query batches received by the router.
- `apollo_router.operations.batching.size` - A histogram tracking the number of queries contained within a query batch.
55 changes: 49 additions & 6 deletions docs/source/configuration/tracing.mdx
@@ -149,7 +149,7 @@ telemetry:

You will need to experiment to find the settings that are appropriate for your use case.

## Using Datadog
## Using Datadog (native)

The Apollo Router can be configured to connect to either the default agent address or a URL.

@@ -211,7 +211,9 @@ Instead when `enable_span_mapping` is set to `true` the following trace will be
```


## Using Jaeger
## Using Jaeger (native)

> :warning: [Jaeger native is deprecated](https://opentelemetry.io/blog/2022/jaeger-native-otlp/) and will be removed in a future Router release. Instead, use the [OpenTelemetry Collector via OTLP](#jaeger-via-opentelemetry-collector).

The Apollo Router can be configured to export tracing data to Jaeger either via an agent or via an HTTP collector.

@@ -247,11 +249,13 @@ telemetry:
password: "${env.JAEGER_PASSWORD}"
```

## OpenTelemetry Collector via OTLP

[OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) is a horizontally scalable collector that you can use to receive, process, and export your telemetry data in a pluggable way.
## OTLP

If you find that the built-in telemetry features of the Apollo Router are missing some desired functionality (e.g., [exporting to Kafka](https://opentelemetry.io/docs/collector/configuration/#exporters)), then it's worth considering this option.
OTLP is the native protocol for OpenTelemetry. You can use it to export traces to a variety of backends, including:
* OpenTelemetry Collector
* Datadog
* Honeycomb
* Lightstep

```yaml title="router.yaml"
telemetry:
@@ -282,6 +286,45 @@ telemetry:

Remember that you can use the `file.` and `env.` prefixes for variable expansion in the YAML config, for example `${file.ca.txt}`.
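
The following sketch shows that expansion in an OTLP tracing config. It makes several assumptions:

- The endpoint comes from an environment variable, hypothetically named `OTLP_ENDPOINT`.
- A CA certificate is stored in `ca.txt`.
- The backend authenticates with a gRPC metadata header. The header name and `OTLP_API_KEY` are hypothetical.

Check the `grpc.ca` and `grpc.metadata` keys against the OTLP configuration example above for your router version.

```yaml title="router.yaml"
telemetry:
  tracing:
    otlp:
      enabled: true
      # Expanded from the (hypothetical) OTLP_ENDPOINT environment variable
      endpoint: "${env.OTLP_ENDPOINT}"
      grpc:
        # Expanded to the contents of the file ca.txt
        ca: "${file.ca.txt}"
        metadata:
          # Hypothetical header name; use whatever your backend expects
          x-api-key: "${env.OTLP_API_KEY}"
```

Keeping secrets in environment variables or files means that `router.yaml` can be committed without credentials.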

### Datadog

To use Datadog, you must configure the Datadog agent to accept OTLP traces. To do so, add the following to your `datadog.yaml`:

```yaml title="datadog.yaml"
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: localhost:4317
```

The router must also be configured to send traces to the Datadog agent:

```yaml title="router.yaml"
telemetry:
  tracing:
    otlp:
      enabled: true
      # Send to the Datadog agent
      endpoint: http://<datadog-agent>:4317
```

See [Datadog Agent configuration](https://docs.datadoghq.com/opentelemetry/otlp_ingest_in_the_agent/?tab=host) for more details.

### Jaeger via OpenTelemetry Collector

To use Jaeger, send traces from the router via OTLP to an [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/), and configure the collector to export them to Jaeger:

```yaml title="otel-collector.yaml"
exporters:
  # Data sources: traces
  otlp/jaeger:
    endpoint: jaeger-all-in-one:4317
```
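
The exporter fragment above is only part of a collector configuration. The collector also needs an OTLP receiver for the router to send to, and a traces pipeline that connects the receiver to the exporter. The following minimal sketch makes these assumptions:

- The router sends OTLP over gRPC to port 4317 on the collector.
- Jaeger is reachable at the `jaeger-all-in-one` hostname.
- Jaeger accepts plaintext gRPC, as a local all-in-one instance typically does.

```yaml title="otel-collector.yaml"
receivers:
  otlp:
    protocols:
      grpc:
        # Where the router's OTLP exporter sends traces
        endpoint: 0.0.0.0:4317

exporters:
  otlp/jaeger:
    endpoint: jaeger-all-in-one:4317
    tls:
      # Assumption: Jaeger is not using TLS (e.g. a local all-in-one instance)
      insecure: true

service:
  pipelines:
    traces:
      # Route traces received from the router to Jaeger
      receivers: [otlp]
      exporters: [otlp/jaeger]
```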

See [https://opentelemetry.io/docs/collector/configuration/#exporters](https://opentelemetry.io/docs/collector/configuration/#exporters) for more information.
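
On the router side, migrating away from the deprecated native Jaeger exporter amounts to replacing the `jaeger` block with an `otlp` block that points at the collector. The following sketch assumes the collector's OTLP gRPC receiver listens on port 4317. `<otel-collector>` is a placeholder for your collector's host.

```yaml title="router.yaml"
telemetry:
  tracing:
    otlp:
      enabled: true
      # Placeholder: replace <otel-collector> with your collector's host
      endpoint: http://<otel-collector>:4317
```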

## Using Zipkin

The Apollo Router can be configured to export tracing data to either the default collector address or a URL: