Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add documentation for exposed metrics #661

Merged
merged 2 commits into from
Nov 29, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
45 changes: 1 addition & 44 deletions spiceaidocs/docs/clients/Datadog/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,50 +7,7 @@ sidebar_position: 3
pagination_next: null
---

Spice can be monitored with [Datadog](https://www.datadoghq.com/) using the [Spice Metrics Endpoint](https://prometheus.io/docs/instrumenting/exposition_formats/#basic-info).

## Metrics Endpoint Configuration

The metrics endpoint uses port `9090` by default. The metrics endpoint configuration is logged at startup.

```bash
2024-07-15T21:48:00.158267Z INFO spiced: Metrics listening on 127.0.0.1:9090
```

Pass the `--metrics` parameter to bind to a specific port. For example, to bind to port `9091`:

```bash
spiced --metrics 0.0.0.0:9091
```

or when using Docker:

```Dockerfile
FROM spiceai/spiceai:latest

# Docker configuration ...

# Configure the metrics endpoint on port 9090
CMD ["--metrics", "0.0.0.0:9090"]
EXPOSE 9090
```

Configuration of the metrics endpoint can be verified using a HTTP GET request, for example:

```bash
curl http://localhost:9090/metrics

# TYPE spiced_runtime_http_server_start counter
spiced_runtime_http_server_start 1

# TYPE spiced_runtime_flight_server_start counter
spiced_runtime_flight_server_start 1

# TYPE datasets_count gauge
datasets_count{engine="None"} 1
datasets_count{engine="arrow"} 1
...
```
Spice can be monitored with [Datadog](https://www.datadoghq.com/) using the [Spice Metrics Endpoint](/features/monitoring/) and pre-built dashboards available in the [Spice repository](https://github.com/spiceai/spiceai/tree/trunk/monitoring).

## Datadog Agent Configuration

Expand Down
2 changes: 1 addition & 1 deletion spiceaidocs/docs/clients/grafana/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ pagination_prev: 'clients/index'
pagination_next: null
---

Spice can be monitored with [Grafana](https://grafana.com/grafana/) using the [Spice Metrics Endpoint](https://prometheus.io/docs/instrumenting/exposition_formats/#basic-info) and pre-built dashboards available in the [Spice repository](https://github.com/spiceai/spiceai/tree/trunk/monitoring).
Spice can be monitored with [Grafana](https://grafana.com/grafana/) using the [Spice Metrics Endpoint](/features/monitoring/) and pre-built dashboards available in the [Spice repository](https://github.com/spiceai/spiceai/tree/trunk/monitoring).

## Import Grafana Dashboard

Expand Down
108 changes: 108 additions & 0 deletions spiceaidocs/docs/features/monitoring/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
---
title: 'Monitoring'
sidebar_label: 'Monitoring'
description: 'Learn how to use Spice telemetry.'
sidebar_position: 9
pagination_prev: null
pagination_next: null
---

Spice can be monitored using the [Spice Prometheus-compatible Metrics Endpoint](https://prometheus.io/docs/instrumenting/exposition_formats/#basic-info). Monitoring clients configuration:

- [Grafana](/clients/grafana/)
- [Datadog](/clients/Datadog/)

## Spice Metrics Endpoint Configuration

The metrics endpoint uses port `9090` by default. The metrics endpoint configuration is logged at startup.

```bash
2024-11-28T19:48:10.942003Z INFO runtime::metrics_server: Spice Runtime Metrics listening on 127.0.0.1:9090
```

Pass the `--metrics` parameter to bind to a specific port. For example, to bind to port `9091`:

```bash
spiced --metrics 0.0.0.0:9091
```

or when using Docker:

```Dockerfile
FROM spiceai/spiceai:latest

# Docker configuration ...

# Configure the metrics endpoint on port 9090
CMD ["--metrics", "0.0.0.0:9090"]
EXPOSE 9090
```

Configuration of the metrics endpoint can be verified using a HTTP GET request, for example:

```bash
curl http://localhost:9090/metrics

# HELP runtime_flight_server_started Indicates the runtime Flight server has started.
# TYPE runtime_flight_server_started counter
runtime_flight_server_started 1
# HELP runtime_http_server_started Indicates the runtime HTTP server has started.
# TYPE runtime_http_server_started counter
runtime_http_server_started 1

# HELP dataset_load_state Status of the dataset. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing.
# TYPE dataset_load_state gauge
dataset_load_state{dataset="taxi_trips"} 2
dataset_load_state{dataset="taxi_trips_accelerated"} 2

# HELP dataset_active_count Number of currently loaded datasets.
# TYPE dataset_active_count gauge
dataset_active_count{engine="None"} 1
dataset_active_count{engine="duckdb"} 1
...
```

## Metrics

| Metric | Description |
|----------------------------------------------------|---------------------------------------------------------------------------------------------------|
| `accelerated_ready_state_federated_fallback`<br/>*(count)* | Number of times the federated table was queried due to the accelerated table loading the initial data. |
| `catalog_load_errors`<br/>*(count)* | Number of errors loading the catalog provider. |
| `catalog_load_state`<br/>*(gauge)* | Status of the catalog provider. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing. |
| `dataset_acceleration_last_refresh_time_ms`<br/>*(gauge)* | Unix timestamp in seconds when the last refresh completed. |
| `dataset_acceleration_refresh_duration_ms`<br/>*(histogram)* | Duration in milliseconds to load a full or appended refresh data. |
| `dataset_acceleration_refresh_errors`<br/>*(count)* | Number of errors refreshing the dataset. |
| `dataset_active_count`<br/>*(gauge)* | Number of currently loaded datasets. |
| `dataset_load_state`<br/>*(gauge)* | Status of the dataset. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing. |
| `dataset_unavailable_time_ms`<br/>*(gauge)* | Time dataset went offline in milliseconds. |
| `embeddings_active_count`<br/>*(gauge)* | Number of currently loaded embeddings. |
| `embeddings_load_errors`<br/>*(count)* | Number of errors loading the embedding. |
| `embeddings_load_state`<br/>*(gauge)* | Status of the embedding. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing. |
| `flight_request_duration_ms`<br/>*(histogram)* | Measures the duration of Flight requests in milliseconds. |
| `flight_requests`<br/>*(count)* | Total number of Flight requests. |
| `http_requests_duration_ms`<br/>*(histogram)* | Measures the duration of HTTP requests in milliseconds. |
| `http_requests_total`<br/>*(count)* | Total number of HTTP requests. |
| `llm_load_state`<br/>*(gauge)* | Status of the LLM model. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing. |
| `model_active_count`<br/>*(gauge)* | Number of currently loaded models. |
| `model_load_duration_ms`<br/>*(histogram)* | Duration in milliseconds to load the model. |
| `model_load_errors`<br/>*(count)* | Number of errors loading the model. |
| `model_load_state`<br/>*(gauge)* | Status of the model. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing. |
| `query_duration_ms`<br/>*(histogram)* | The total amount of time spent planning and executing queries in milliseconds. |
| `query_execution_duration_ms`<br/>*(histogram)* | The total amount of time spent only executing queries (0 for cached queries). |
| `query_executions`<br/>*(count)* | Number of query executions. |
| `query_failures`<br/>*(count)* | Number of query failures. |
| `query_processed_bytes`<br/>*(count)* | Number of bytes processed by the runtime. |
| `query_returned_bytes`<br/>*(count)* | Number of bytes returned to query clients. |
| `results_cache_max_size_bytes`<br/>*(gauge)* | Maximum allowed size of the cache in bytes. |
| `results_cache_requests`<br/>*(count)* | Number of requests to get a key from the cache. |
| `results_cache_hits`<br/>*(count)* | Cache hit count. |
| `results_cache_items_count`<br/>*(gauge)* | Number of items currently in the cache. |
| `results_cache_size_bytes`<br/>*(gauge)* | Size of the cache in bytes. |
| `runtime_flight_server_started`<br/>*(count)* | Indicates the runtime Flight server has started. |
| `runtime_http_server_started`<br/>*(count)* | Indicates the runtime HTTP server has started. |
| `secrets_store_load_duration_ms`<br/>*(histogram)* | Duration in milliseconds to load the secret stores. |
| `tool_active_count`<br/>*(gauge)* | Number of currently loaded LLM tools. |
| `tool_load_errors`<br/>*(count)* | Number of errors loading the LLM tool. |
| `tool_load_state`<br/>*(gauge)* | Status of the LLM tools. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing. |
| `view_load_errors`<br/>*(count)* | Number of errors loading the view. |
| `view_load_state`<br/>*(gauge)* | Status of the views. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing. |