TL;DR: I propose improving observability for Valkey with features such as built-in RED time-series metrics.
Overview
This proposal outlines a new OBSERVE command to improve Valkey’s observability capabilities. By enabling advanced time-series metrics, custom gathering pipelines, and in-server data aggregation, OBSERVE will equip Valkey users with first-class monitoring commands for granular insight into server behavior and performance.
Background
After discussions with Irfan Ahmad, an attendee at the '24 Valkey Summit, I developed this initial proposal to introduce native observability pipelines within Valkey. Currently, Valkey lacks comprehensive, customizable observability tools embedded directly within the server, and this proposal aims to fill that gap.
Note: This proposal is a work in progress. Feedback on the overall approach and any preliminary design concerns would be greatly appreciated.
Current Observability Limitations in Valkey
Currently, Valkey’s observability relies on commands like MONITOR, SLOWLOG, and INFO.
While useful, these commands have limitations:
MONITOR: Streams every command, generating high data volume that may overload production environments.
SLOWLOG: Logs only commands exceeding a set execution time, omitting quick operations and general command patterns.
INFO: Provides server statistics but lacks detailed command- and key-specific insights.
These commands lack the flexibility for in-depth, customizable observability exposed directly within the valkey-server instance,
such as filtering specific events, sampling data, executing custom processing steps, or aggregating metrics over time windows.
Feature proposal
Problem statement and goals
The proposed OBSERVE command suite will bring observability as a core Valkey feature. Through user-defined “observability pipelines,” Valkey instances can produce detailed insights in a structured, efficient manner. These pipelines will be customizable to support diverse use cases, providing users with foundational building blocks for monitoring without overwhelming server resources. This new functionality could be enhanced by integration with tools like Prometheus and Grafana for visualization or alerting, though its primary purpose is fully customizable in-server analysis.
Proposed solution -- Commands
The OBSERVE command set introduces the concept of observability pipelines — user-defined workflows for collecting, filtering, aggregating, and storing metrics.
Core Commands
OBSERVE CREATE <pipeline_name> <configuration>
Creates an observability pipeline with a specified configuration. Configuration details, specified in the next section, define steps such as filtering, partitioning, sampling, and aggregation.
The pipeline and its configuration are kept only in runtime memory (i.e. the user needs to re-create the pipeline after a server restart).
OBSERVE START <pipeline_name>
Starts data collection for the specified pipeline.
OBSERVE STOP <pipeline_name>
Stops data collection for the specified pipeline.
OBSERVE DELETE <pipeline_name>
Deletes the pipeline and its configuration.
OBSERVE RETRIEVE <pipeline_name>
Retrieves collected data. Alternatively, GET could potentially serve this function, but further design discussion is needed.
OBSERVE LOADSTEPF <step_name> <lua_code>
Allows defining custom processing steps in Lua, for cases where the built-in steps do not meet the user's requirements.
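As an illustrative sketch of the intended workflow (my_pipeline and count_large_values are placeholder names, and the final syntax is open for discussion), a pipeline would typically be managed as follows:

    OBSERVE LOADSTEPF count_large_values "<lua_code>"   # optional: register a custom Lua step
    OBSERVE CREATE my_pipeline "<configuration>"        # define the pipeline (configuration format below)
    OBSERVE START my_pipeline                           # begin collecting data
    OBSERVE RETRIEVE my_pipeline                        # poll the collected results
    OBSERVE STOP my_pipeline                            # pause collection
    OBSERVE DELETE my_pipeline                          # remove the pipeline and its configuration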
Pipeline configuration
Pipelines are configured as chains of data processing stages, including filtering, aggregation, and output buffering. The format is similar to Unix piping.
Key stages in this pipeline model include:
filter(f): Filters events based on defined conditions (e.g., command type).
partition(f): Partitions events according to a function (e.g., by key prefix).
sample(f): Samples events at a specified rate.
map(f): Transforms each event with a specified function.
window(f): Aggregates data within defined time windows.
reduce(f): Reduces data over a window via an aggregation function.
output(f): Directs output to specified sinks.
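Example configuration syntax (an illustrative sketch that mirrors the Error Tracking use case later in this proposal; the step and function names are not final):

    OBSERVE CREATE error_tracking_pipeline "
        filter(filter_event_type('error')) |
        window(window_duration(1m)) |
        reduce(count) |
        output(timeseries_to_key('total_errors', buffer_size=1000))
    "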
The goal is to capture time-series metrics within the defined pipeline outputs. For example, for the pipeline above the output would be structured as follows:
[<timestamp1, errors_count1>, <timestamp2, errors_count2>, ...] // capped at 1000 items
It remains uncertain whether storing output data in a format compatible with direct retrieval via GET (or another existing command) will be feasible. Consequently, we might need to introduce an OBSERVE RETRIEVE <since_offset> command for clients to poll result data. This command would provide:
{
current_offset: <latest_returned_offset as a number>,
data: [ ... result items ],
lag_detected: <true or false> // true if `since_offset` points to data that’s been removed, signaling potential data loss.
}
Here, offset represents the sequence number of items produced by the pipeline, including any items removed due to buffer constraints. This approach allows clients to poll for results while adjusting their polling frequency based on the lag_detected flag. If lag_detected is true, clients would be advised to increase polling frequency to reduce data loss.
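A minimal polling sketch, assuming RETRIEVE accepts the pipeline name followed by since_offset (the offsets and reply values below are purely illustrative):

    OBSERVE RETRIEVE error_tracking_pipeline 0     # first poll; suppose the reply carries current_offset=120, lag_detected=false
    OBSERVE RETRIEVE error_tracking_pipeline 120   # next poll resumes from the last returned offset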
Use-Case Examples
Below are examples of how the proposed OBSERVE command and pipeline configurations could be used to address various
observability needs.
Counting Specific Commands Per Minute with Buffer Size
Use Case: Count the number of GET commands executed per minute.
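Pipeline Creation (an illustrative sketch; get_count_pipeline and filter_command are hypothetical names, while the output key and buffer size match the explanation below):

    OBSERVE CREATE get_count_pipeline "
        filter(filter_command('GET')) |
        window(window_duration(1m)) |
        reduce(count) |
        output(timeseries_to_key('get_command_count', buffer_size=1440))
    "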
Explanation: This pipeline filters for GET commands, counts them every minute, and stores the counts
in a time-series key get_command_count with a buffer size of 1440 (i.e., one day's worth of minute-level data).
Hot Key Analysis
Use Case: Identify and monitor the most frequently accessed keys within a certain time window, allowing for proactive load management and identification of potential bottlenecks.
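Pipeline Creation (an illustrative sketch; hot_keys_pipeline, filter_command, and extract_key are hypothetical names, while map_top_keys, the sample rate, the output key, and the buffer size follow the explanation below):

    OBSERVE CREATE hot_keys_pipeline "
        filter(filter_command('GET')) |
        sample(0.005) |
        partition(extract_key) |
        window(window_duration(1m)) |
        reduce(count) |
        map(map_top_keys(10)) |
        output(timeseries_to_key('hot_keys', buffer_size=60))
    "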
Explanation: This pipeline filters for GET commands, samples 0.5% of them, partitions events by the accessed key, and aggregates their counts in one-minute intervals.
The map_top_keys(10) step then selects the top 10 most frequently accessed keys in each interval along with the access counts.
The result is stored as a time-series in hot_keys with a buffer size of 60, retaining one hour of hot key data.
Average Latency Per Time Window with Buffer
Use Case: Monitor the average latency of SET commands per minute.
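Pipeline Creation (an illustrative sketch; set_latency_pipeline, filter_command, extract_latency, avg, and the output key set_latency_avg are hypothetical names; the buffer size matches the explanation below):

    OBSERVE CREATE set_latency_pipeline "
        filter(filter_command('SET')) |
        map(extract_latency) |
        window(window_duration(1m)) |
        reduce(avg) |
        output(timeseries_to_key('set_latency_avg', buffer_size=720))
    "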
Explanation: This pipeline filters for SET commands, extracts their latency, aggregates the average latency every
minute, and stores it with a buffer size of 720 (e.g., 12 hours of minute-level data).
Client Statistics
Use Case: Gather command counts per client for GET and SET commands, sampled at 5%.
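Pipeline Creation (an illustrative sketch; client_stats_pipeline, filter_command, and extract_client are hypothetical names, while the sample rate, output key, and buffer size match the explanation below):

    OBSERVE CREATE client_stats_pipeline "
        filter(filter_command('GET', 'SET')) |
        sample(0.05) |
        partition(extract_client) |
        window(window_duration(1m)) |
        reduce(count) |
        output(timeseries_to_key('client_stats', buffer_size=1440))
    "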
Explanation: This pipeline filters for GET and SET commands, samples 5% of them, extracts client information, counts commands per client every minute, and stores the data under client_stats with a buffer size of 1440.
Error Tracking
Use Case: Monitor the number of errors occurring per minute.
Pipeline Creation:
OBSERVE CREATE error_tracking_pipeline "
    filter(filter_event_type('error')) |  # likely the filter for errors would have to be more advanced
    window(window_duration(1m)) |
    reduce(count) |
    output(timeseries_to_key('total_errors', buffer_size=1440))
"
Explanation: This pipeline filters events of type 'error', counts them every minute, and stores the totals in total_errors with a buffer size of 1440.
TTL Analysis
Use Case: Analyze the average TTL of keys set with the SETEX command, per minute.
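Pipeline Creation (an illustrative sketch; ttl_analysis_pipeline, filter_command, extract_ttl, and avg are hypothetical names; the output key and buffer size match the explanation below):

    OBSERVE CREATE ttl_analysis_pipeline "
        filter(filter_command('SETEX')) |
        map(extract_ttl) |
        window(window_duration(1m)) |
        reduce(avg) |
        output(timeseries_to_key('average_ttl', buffer_size=1440))
    "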
Explanation: This pipeline filters for SETEX commands, extracts the TTL values, calculates the average TTL every
minute, and stores it in average_ttl with a buffer size of 1440.
Distribution of Key and Value Sizes
Use Case: Create a histogram of value sizes for SET commands.
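Pipeline Creation (an illustrative sketch; value_size_pipeline, filter_command, extract_value_size, the histogram step with its bucket boundaries, and the output key are all hypothetical; the buffer size matches the explanation below):

    OBSERVE CREATE value_size_pipeline "
        filter(filter_command('SET')) |
        map(extract_value_size) |
        window(window_duration(1m)) |
        reduce(histogram(buckets=[64, 256, 1024, 4096])) |
        output(timeseries_to_key('value_size_histogram', buffer_size=1440))
    "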
Explanation: This pipeline filters for SET commands, extracts the size of the values, aggregates them into histogram buckets every minute, and stores the distributions with a buffer size of 1440.
Feedback Request
Feedback is requested on the following points:
Feature Scope: Does the proposed OBSERVE command align with your vision for Valkey’s observability?
Command Design: Are there any suggestions for the OBSERVE command syntax and structure?
Let's first reach consensus on the 'Feature Scope'. If the answer is yes, we can discuss the designs.
I am ready to commit to building this feature as soon as the designs are accepted, even in draft form.
Thank you for your time and consideration. I look forward to discussing this proposal further.
I like the concepts and directionality here. I think it would be profitable to split this into two subsections. One section would be to get more specific on the events that would feed into the observability framework.
A second section would focus on the representation and processing of those events. I mention this because there's quite a bit of overlap in the functionality of the second section and any implementation of a timestream processing module. In other words, can we get both timestream processing of the observability event stream (part 1) and more generic timestream data processing capabilities in the same development effort or conversely split the development effort into two parts that cooperate?