From a0613b95afd2245b87a7f1e441835eddcb4b7b53 Mon Sep 17 00:00:00 2001 From: sgrebnov Date: Thu, 28 Nov 2024 13:07:13 -0800 Subject: [PATCH] Add documentation for exposed metrics --- spiceaidocs/docs/clients/Datadog/index.md | 45 +------- spiceaidocs/docs/clients/grafana/index.md | 2 +- spiceaidocs/docs/features/monitoring/index.md | 108 ++++++++++++++++++ 3 files changed, 110 insertions(+), 45 deletions(-) create mode 100644 spiceaidocs/docs/features/monitoring/index.md diff --git a/spiceaidocs/docs/clients/Datadog/index.md b/spiceaidocs/docs/clients/Datadog/index.md index fbd39927..70e0b26e 100644 --- a/spiceaidocs/docs/clients/Datadog/index.md +++ b/spiceaidocs/docs/clients/Datadog/index.md @@ -7,50 +7,7 @@ sidebar_position: 3 pagination_next: null --- -Spice can be monitored with [Datadog](https://www.datadoghq.com/) using the [Spice Metrics Endpoint](https://prometheus.io/docs/instrumenting/exposition_formats/#basic-info). - -## Metrics Endpoint Configuration - -The metrics endpoint uses port `9090` by default. The metrics endpoint configuration is logged at startup. - -```bash -2024-07-15T21:48:00.158267Z INFO spiced: Metrics listening on 127.0.0.1:9090 -``` - -Pass the `--metrics` parameter to bind to a specific port. For example, to bind to port `9091`: - -```bash - spiced --metrics 0.0.0.0:9091 -``` - -or when using Docker: - -```Dockerfile -FROM spiceai/spiceai:latest - -# Docker configuration ... - -# Configure the metrics endpoint on port 9090 -CMD ["--metrics", "0.0.0.0:9090"] -EXPOSE 9090 -``` - -Configuration of the metrics endpoint can be verified using a HTTP GET request, for example: - -```bash -curl http://localhost:9090/metrics - -# TYPE spiced_runtime_http_server_start counter -spiced_runtime_http_server_start 1 - -# TYPE spiced_runtime_flight_server_start counter -spiced_runtime_flight_server_start 1 - -# TYPE datasets_count gauge -datasets_count{engine="None"} 1 -datasets_count{engine="arrow"} 1 -... -``` +Spice can be monitored with [Datadog](https://www.datadoghq.com/) using the [Spice Metrics Endpoint](/features/monitoring/) and pre-built dashboards available in the [Spice repository](https://github.com/spiceai/spiceai/tree/trunk/monitoring). ## Datadog Agent Configuration diff --git a/spiceaidocs/docs/clients/grafana/index.md b/spiceaidocs/docs/clients/grafana/index.md index a0da464f..bafccb59 100644 --- a/spiceaidocs/docs/clients/grafana/index.md +++ b/spiceaidocs/docs/clients/grafana/index.md @@ -6,7 +6,7 @@ pagination_prev: 'clients/index' pagination_next: null --- -Spice can be monitored with [Grafana](https://grafana.com/grafana/) using the [Spice Metrics Endpoint](https://prometheus.io/docs/instrumenting/exposition_formats/#basic-info) and pre-built dashboards available in the [Spice repository](https://github.com/spiceai/spiceai/tree/trunk/monitoring). +Spice can be monitored with [Grafana](https://grafana.com/grafana/) using the [Spice Metrics Endpoint](/features/monitoring/) and pre-built dashboards available in the [Spice repository](https://github.com/spiceai/spiceai/tree/trunk/monitoring). ## Import Grafana Dashboard diff --git a/spiceaidocs/docs/features/monitoring/index.md b/spiceaidocs/docs/features/monitoring/index.md new file mode 100644 index 00000000..6c411994 --- /dev/null +++ b/spiceaidocs/docs/features/monitoring/index.md @@ -0,0 +1,108 @@ +--- +title: 'Monitoring' +sidebar_label: 'Monitoring' +description: 'Learn how to use Spice telemetry.' +sidebar_position: 9 +pagination_prev: null +pagination_next: null +--- + +Spice can be monitored using the [Spice Prometheus-compatible Metrics Endpoint](https://prometheus.io/docs/instrumenting/exposition_formats/#basic-info). Monitoring tools configuration: + +- [Grafana](/clients/grafana/) +- [Datadog](/clients/datadog/) + +## Spice Metrics Endpoint Configuration + +The metrics endpoint uses port `9090` by default. The metrics endpoint configuration is logged at startup. + +```bash +2024-11-28T19:48:10.942003Z INFO runtime::metrics_server: Spice Runtime Metrics listening on 127.0.0.1:9090 +``` + +Pass the `--metrics` parameter to bind to a specific port. For example, to bind to port `9091`: + +```bash + spiced --metrics 0.0.0.0:9091 +``` + +or when using Docker: + +```Dockerfile +FROM spiceai/spiceai:latest + +# Docker configuration ... + +# Configure the metrics endpoint on port 9090 +CMD ["--metrics", "0.0.0.0:9090"] +EXPOSE 9090 +``` + +Configuration of the metrics endpoint can be verified using a HTTP GET request, for example: + +```bash +curl http://localhost:9090/metrics + +# HELP runtime_flight_server_started Indicates the runtime Flight server has started. +# TYPE runtime_flight_server_started counter +runtime_flight_server_started 1 +# HELP runtime_http_server_started Indicates the runtime HTTP server has started. +# TYPE runtime_http_server_started counter +runtime_http_server_started 1 + +# HELP dataset_load_state Status of the dataset. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing. +# TYPE dataset_load_state gauge +dataset_load_state{dataset="taxi_trips"} 2 +dataset_load_state{dataset="taxi_trips_accelerated"} 2 + +# HELP dataset_active_count Number of currently loaded datasets. +# TYPE dataset_active_count gauge +dataset_active_count{engine="None"} 1 +dataset_active_count{engine="duckdb"} 1 +... +``` + +## Metrics + +| Metric | Description | +|----------------------------------------------------|---------------------------------------------------------------------------------------------------| +| `accelerated_ready_state_federated_fallback`
*(count)* | Number of times the federated table was queried due to the accelerated table loading the initial data. | +| `catalog_load_errors`
*(count)* | Number of errors loading the catalog provider. | +| `catalog_load_state`
*(gauge)* | Status of the catalog provider. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing. | +| `dataset_acceleration_last_refresh_time_ms`
*(gauge)* | Unix timestamp in seconds when the last refresh completed. | +| `dataset_acceleration_refresh_duration_ms`
*(histogram)* | Duration in milliseconds to load a full or appended refresh data. | +| `dataset_acceleration_refresh_errors`
*(count)* | Number of errors refreshing the dataset. | +| `dataset_active_count`
*(gauge)* | Number of currently loaded datasets. | +| `dataset_load_state`
*(gauge)* | Status of the dataset. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing. | +| `dataset_unavailable_time_ms`
*(gauge)* | Time dataset went offline in milliseconds. | +| `embeddings_active_count`
*(gauge)* | Number of currently loaded embeddings. | +| `embeddings_load_errors`
*(count)* | Number of errors loading the embedding. | +| `embeddings_load_state`
*(gauge)* | Status of the embedding. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing. | +| `flight_request_duration_ms`
*(histogram)* | Measures the duration of Flight requests in milliseconds. | +| `flight_requests`
*(count)* | Total number of Flight requests. | +| `http_requests_duration_ms`
*(histogram)* | Measures the duration of HTTP requests in milliseconds. | +| `http_requests_total`
*(count)* | Total number of HTTP requests. | +| `llm_load_state`
*(gauge)* | Status of the LLM model. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing. | +| `model_active_count`
*(gauge)* | Number of currently loaded models. | +| `model_load_duration_ms`
*(histogram)* | Duration in milliseconds to load the model. | +| `model_load_errors`
*(count)* | Number of errors loading the model. | +| `model_load_state`
*(gauge)* | Status of the model. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing. | +| `query_duration_ms`
*(histogram)* | The total amount of time spent planning and executing queries in milliseconds. | +| `query_execution_duration_ms`
*(histogram)* | The total amount of time spent only executing queries (0 for cached queries). | +| `query_executions`
*(count)* | Number of query executions. | +| `query_failures`
*(count)* | Number of query failures. | +| `query_processed_bytes`
*(count)* | Number of bytes processed by the runtime. | +| `query_returned_bytes`
*(count)* | Number of bytes returned to query clients. | +| `results_cache_max_size_bytes`
*(gauge)* | Maximum allowed size of the cache in bytes. | +| `results_cache_requests`
*(count)* | Number of requests to get a key from the cache. | +| `results_cache_hits`
*(count)* | Cache hit count. | +| `results_cache_items_count`
*(gauge)* | Number of items currently in the cache. | +| `results_cache_size_bytes`
*(gauge)* | Size of the cache in bytes. | +| `runtime_flight_server_started`
*(count)* | Indicates the runtime Flight server has started. | +| `runtime_http_server_started`
*(count)* | Indicates the runtime HTTP server has started. | +| `secrets_store_load_duration_ms`
*(histogram)* | Duration in milliseconds to load the secret stores. | +| `tool_active_count`
*(gauge)* | Number of currently loaded LLM tools. | +| `tool_load_errors`
*(count)* | Number of errors loading the LLM tool. | +| `tool_load_state`
*(gauge)* | Status of the LLM tools. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing. | +| `view_load_errors`
*(count)* | Number of errors loading the view. | +| `view_load_state`
*(gauge)* | Status of the views. 1=Initializing, 2=Ready, 3=Disabled, 4=Error, 5=Refreshing. |