From a0613b95afd2245b87a7f1e441835eddcb4b7b53 Mon Sep 17 00:00:00 2001
From: sgrebnov <sergei.grebnov@gmail.com>
Date: Thu, 28 Nov 2024 13:07:13 -0800
Subject: [PATCH] Add documentation for exposed metrics

---
 spiceaidocs/docs/clients/Datadog/index.md     |  45 +-------
 spiceaidocs/docs/clients/grafana/index.md     |   2 +-
 spiceaidocs/docs/features/monitoring/index.md | 108 ++++++++++++++++++
 3 files changed, 110 insertions(+), 45 deletions(-)
 create mode 100644 spiceaidocs/docs/features/monitoring/index.md

diff --git a/spiceaidocs/docs/clients/Datadog/index.md b/spiceaidocs/docs/clients/Datadog/index.md
index fbd39927..70e0b26e 100644
--- a/spiceaidocs/docs/clients/Datadog/index.md
+++ b/spiceaidocs/docs/clients/Datadog/index.md
@@ -7,50 +7,7 @@ sidebar_position: 3
 pagination_next: null
 ---
 
-Spice can be monitored with [Datadog](https://www.datadoghq.com/) using the [Spice Metrics Endpoint](https://prometheus.io/docs/instrumenting/exposition_formats/#basic-info).
-
-## Metrics Endpoint Configuration
-
-The metrics endpoint uses port `9090` by default. The metrics endpoint configuration is logged at startup.
-
-```bash
-2024-07-15T21:48:00.158267Z  INFO spiced: Metrics listening on 127.0.0.1:9090
-```
-
-Pass the `--metrics` parameter to bind to a specific port. For example, to bind to port `9091`:
-
-```bash
- spiced --metrics 0.0.0.0:9091
-```
-
-or when using Docker:
-
-```Dockerfile
-FROM spiceai/spiceai:latest
-
-# Docker configuration ...
-
-# Configure the metrics endpoint on port 9090
-CMD ["--metrics", "0.0.0.0:9090"]
-EXPOSE 9090
-```
-
-Configuration of the metrics endpoint can be verified using a HTTP GET request, for example:
-
-```bash
-curl http://localhost:9090/metrics
-
-# TYPE spiced_runtime_http_server_start counter
-spiced_runtime_http_server_start 1
-
-# TYPE spiced_runtime_flight_server_start counter
-spiced_runtime_flight_server_start 1
-
-# TYPE datasets_count gauge
-datasets_count{engine="None"} 1
-datasets_count{engine="arrow"} 1
-...
-```
+Spice can be monitored with [Datadog](https://www.datadoghq.com/) using the [Spice Metrics Endpoint](/features/monitoring/) and pre-built dashboards available in the [Spice repository](https://github.com/spiceai/spiceai/tree/trunk/monitoring).
 
 ## Datadog Agent Configuration
 
diff --git a/spiceaidocs/docs/clients/grafana/index.md b/spiceaidocs/docs/clients/grafana/index.md
index a0da464f..bafccb59 100644
--- a/spiceaidocs/docs/clients/grafana/index.md
+++ b/spiceaidocs/docs/clients/grafana/index.md
@@ -6,7 +6,7 @@ pagination_prev: 'clients/index'
 pagination_next: null
 ---
 
-Spice can be monitored with [Grafana](https://grafana.com/grafana/) using the [Spice Metrics Endpoint](https://prometheus.io/docs/instrumenting/exposition_formats/#basic-info) and pre-built dashboards available in the [Spice repository](https://github.com/spiceai/spiceai/tree/trunk/monitoring).
+Spice can be monitored with [Grafana](https://grafana.com/grafana/) using the [Spice Metrics Endpoint](/features/monitoring/) and pre-built dashboards available in the [Spice repository](https://github.com/spiceai/spiceai/tree/trunk/monitoring).
 
 ## Import Grafana Dashboard
 
diff --git a/spiceaidocs/docs/features/monitoring/index.md b/spiceaidocs/docs/features/monitoring/index.md
new file mode 100644
index 00000000..6c411994
--- /dev/null
+++ b/spiceaidocs/docs/features/monitoring/index.md
@@ -0,0 +1,108 @@
+---
+title: 'Monitoring'
+sidebar_label: 'Monitoring'
+description: 'Learn how to use Spice telemetry.'
+sidebar_position: 9
+pagination_prev: null
+pagination_next: null
+---
+
+Spice can be monitored using the Spice Metrics Endpoint, which serves metrics in the [Prometheus exposition format](https://prometheus.io/docs/instrumenting/exposition_formats/#basic-info). To configure a specific monitoring tool, see:
+
+- [Grafana](/clients/grafana/)
+- [Datadog](/clients/datadog/)
+
+## Spice Metrics Endpoint Configuration
+
+The metrics endpoint listens on port `9090` by default. The address it binds to is logged at startup:
+
+```bash
+2024-11-28T19:48:10.942003Z  INFO runtime::metrics_server: Spice Runtime Metrics listening on 127.0.0.1:9090
+```
+
+Pass the `--metrics` parameter to bind to a specific address and port. For example, to bind to port `9091` on all interfaces:
+
+```bash
+spiced --metrics 0.0.0.0:9091
+```
+
+or when using Docker:
+
+```Dockerfile
+FROM spiceai/spiceai:latest
+
+# Docker configuration ...
+
+# Configure the metrics endpoint on port 9090
+CMD ["--metrics", "0.0.0.0:9090"]
+EXPOSE 9090
+```
+
+The metrics endpoint configuration can be verified with an HTTP GET request, for example:
+
+```bash
+curl http://localhost:9090/metrics
+
+# HELP runtime_flight_server_started Indicates the runtime Flight server has started.
+# TYPE runtime_flight_server_started counter
+runtime_flight_server_started 1
+# HELP runtime_http_server_started Indicates the runtime HTTP server has started.
+# TYPE runtime_http_server_started counter
+runtime_http_server_started 1
+
+# HELP dataset_load_state Status of the dataset. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing.
+# TYPE dataset_load_state gauge
+dataset_load_state{dataset="taxi_trips"} 2
+dataset_load_state{dataset="taxi_trips_accelerated"} 2
+
+# HELP dataset_active_count Number of currently loaded datasets.
+# TYPE dataset_active_count gauge
+dataset_active_count{engine="None"} 1
+dataset_active_count{engine="duckdb"} 1
+...
+```
+
+## Metrics
+
+| Metric                                             | Description                                                                                       |
+|----------------------------------------------------|---------------------------------------------------------------------------------------------------|
+| `accelerated_ready_state_federated_fallback`<br/>*(count)*   | Number of times the federated table was queried because the accelerated table was still loading its initial data. |
+| `catalog_load_errors`<br/>*(count)*                | Number of errors loading the catalog provider.                                                   |
+| `catalog_load_state`<br/>*(gauge)*                 | Status of the catalog provider. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing.      |
+| `dataset_acceleration_last_refresh_time_ms`<br/>*(gauge)* | Unix timestamp in seconds when the last refresh completed.                                       |
+| `dataset_acceleration_refresh_duration_ms`<br/>*(histogram)* | Duration in milliseconds to load data for a full or append refresh.                              |
+| `dataset_acceleration_refresh_errors`<br/>*(count)*   | Number of errors refreshing the dataset.                                                         |
+| `dataset_active_count`<br/>*(gauge)*               | Number of currently loaded datasets.                                                             |
+| `dataset_load_state`<br/>*(gauge)*                 | Status of the dataset. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing.               |
+| `dataset_unavailable_time_ms`<br/>*(gauge)*        | Time the dataset went offline, in milliseconds.                                                  |
+| `embeddings_active_count`<br/>*(gauge)*            | Number of currently loaded embeddings.                                                           |
+| `embeddings_load_errors`<br/>*(count)*             | Number of errors loading the embedding.                                                          |
+| `embeddings_load_state`<br/>*(gauge)*              | Status of the embedding. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing.             |
+| `flight_request_duration_ms`<br/>*(histogram)*     | Measures the duration of Flight requests in milliseconds.                                        |
+| `flight_requests`<br/>*(count)*                    | Total number of Flight requests.                                                                 |
+| `http_requests_duration_ms`<br/>*(histogram)*      | Measures the duration of HTTP requests in milliseconds.                                          |
+| `http_requests_total`<br/>*(count)*                | Total number of HTTP requests.                                                                   |
+| `llm_load_state`<br/>*(gauge)*                     | Status of the LLM. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing.                   |
+| `model_active_count`<br/>*(gauge)*                 | Number of currently loaded models.                                                               |
+| `model_load_duration_ms`<br/>*(histogram)*         | Duration in milliseconds to load the model.                                                      |
+| `model_load_errors`<br/>*(count)*                  | Number of errors loading the model.                                                              |
+| `model_load_state`<br/>*(gauge)*                   | Status of the model. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing.                 |
+| `query_duration_ms`<br/>*(histogram)*              | The total amount of time spent planning and executing queries in milliseconds.                   |
+| `query_execution_duration_ms`<br/>*(histogram)*    | The total amount of time spent only executing queries in milliseconds (0 for cached queries).    |
+| `query_executions`<br/>*(count)*                   | Number of query executions.                                                                      |
+| `query_failures`<br/>*(count)*                     | Number of query failures.                                                                        |
+| `query_processed_bytes`<br/>*(count)*              | Number of bytes processed by the runtime.                                                        |
+| `query_returned_bytes`<br/>*(count)*               | Number of bytes returned to query clients.                                                       |
+| `results_cache_max_size_bytes`<br/>*(gauge)*       | Maximum allowed size of the cache in bytes.                                                      |
+| `results_cache_requests`<br/>*(count)*             | Number of requests to get a key from the cache.                                                  |
+| `results_cache_hits`<br/>*(count)*                 | Cache hit count.                                                                                 |
+| `results_cache_items_count`<br/>*(gauge)*          | Number of items currently in the cache.                                                          |
+| `results_cache_size_bytes`<br/>*(gauge)*           | Size of the cache in bytes.                                                                      |
+| `runtime_flight_server_started`<br/>*(count)*      | Indicates the runtime Flight server has started.                                                 |
+| `runtime_http_server_started`<br/>*(count)*        | Indicates the runtime HTTP server has started.                                                   |
+| `secrets_store_load_duration_ms`<br/>*(histogram)* | Duration in milliseconds to load the secret stores.                                              |
+| `tool_active_count`<br/>*(gauge)*                  | Number of currently loaded LLM tools.                                                            |
+| `tool_load_errors`<br/>*(count)*                   | Number of errors loading the LLM tool.                                                           |
+| `tool_load_state`<br/>*(gauge)*                    | Status of the LLM tool. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing.              |
+| `view_load_errors`<br/>*(count)*                   | Number of errors loading the view.                                                               |
+| `view_load_state`<br/>*(gauge)*                    | Status of the view. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing.                  |