[NEW] OBSERVE command for enhanced observability in Valkey #1167

Open
mwarzynski opened this issue Oct 14, 2024 · 1 comment
Comments


mwarzynski commented Oct 14, 2024

TL;DR: I propose improving Valkey's observability with features such as built-in RED (Rate, Errors, Duration) time-series metrics.

Overview

This proposal outlines a new OBSERVE command to improve Valkey’s observability capabilities. By enabling advanced time-series metrics, custom gathering pipelines, and in-server data aggregation, OBSERVE will equip Valkey users with first-class monitoring commands for granular insight into server behavior and performance.

Background

After discussions with Irfan Ahmad, an attendee at the '24 Valkey Summit, I developed this initial proposal to introduce native observability pipelines within Valkey. Currently, Valkey lacks comprehensive, customizable observability tools embedded directly within the server, and this proposal aims to fill that gap.

Note: This proposal is a work in progress. Feedback on the overall approach and any preliminary design concerns would be greatly appreciated.


Current Observability Limitations in Valkey

Currently, Valkey’s observability relies on commands like MONITOR, SLOWLOG, and INFO.

While useful, these commands have limitations:

  • MONITOR: Streams every command, generating high data volume that may overload production environments.
  • SLOWLOG: Logs only commands exceeding a set execution time, omitting quick operations and general command patterns.
  • INFO: Provides server statistics but lacks detailed command- and key-specific insights.

These commands lack the flexibility for in-depth, customizable observability exposed directly by the valkey-server instance, such as filtering specific events, sampling data, executing custom processing steps, and aggregating metrics over time windows.
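
For illustration, a typical workflow with the existing commands looks roughly like this (a sketch; exact output and thresholds depend on the deployment):

# Only commands slower than the configured threshold (here 10 ms) are captured:
CONFIG SET slowlog-log-slower-than 10000
SLOWLOG GET 10

# Aggregate per-command counters, but no per-key or time-windowed breakdown:
INFO commandstats

# Streams every command from every client, with no filtering or sampling:
MONITOR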

Feature proposal

Problem statement and goals

The proposed OBSERVE command suite will make observability a core Valkey feature. Through user-defined “observability pipelines,” Valkey instances can produce detailed insights in a structured, efficient manner. These pipelines will be customizable to support diverse use cases, providing users with foundational building blocks for monitoring without overwhelming server resources. This new functionality could be enhanced by integration with tools like Prometheus and Grafana for visualization or alerting, though it is fully customizable and its primary purpose is in-server analysis.

Proposed solution -- Commands

The OBSERVE command set introduces the concept of observability pipelines — user-defined workflows for collecting, filtering, aggregating, and storing metrics.

Core Commands

  • OBSERVE CREATE <pipeline_name> <configuration>
    Creates an observability pipeline with a specified configuration. Configuration details, specified in the next section, define steps such as filtering, partitioning, sampling, and aggregation.
    The pipeline and its configuration are kept only in runtime memory (i.e., the user needs to re-create the pipeline after a server restart).

  • OBSERVE START <pipeline_name>
    Starts data collection for the specified pipeline.

  • OBSERVE STOP <pipeline_name>
    Stops data collection for the specified pipeline.

  • OBSERVE DELETE <pipeline_name>
    Deletes the pipeline and its configuration.

  • OBSERVE RETRIEVE <pipeline_name>
    Retrieves collected data. Alternatively, GET could potentially serve this function, but further design discussion is needed.

  • OBSERVE LOADSTEPF <step_name> <lua_code>
    Allows defining custom processing steps in Lua, for cases where the built-in steps do not cover a use case.
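
Taken together, a typical pipeline lifecycle could look like this (a sketch; the configuration string format is described in the next section):

OBSERVE CREATE get_errors_pipeline "<configuration>"
OBSERVE START get_errors_pipeline
# ... let the pipeline collect data for some time ...
OBSERVE RETRIEVE get_errors_pipeline
OBSERVE STOP get_errors_pipeline
OBSERVE DELETE get_errors_pipeline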

Pipeline configuration

Pipelines are configured as chains of data-processing stages, including filtering, aggregation, and output buffering. The format is similar to Unix pipes.

Key stages in this pipeline model include:

  • filter(f): Filters events based on defined conditions (e.g., command type).
  • partition(f): Partitions events according to a function (e.g., by key prefix).
  • sample(f): Samples events at a specified rate.
  • map(f): Transforms each event with a specified function.
  • window(f): Aggregates data within defined time windows.
  • reduce(f): Reduces data over a window via an aggregation function.
  • output(f): Directs output to specified sinks.

Example configuration syntax:

OBSERVE CREATE get_errors_pipeline "
filter(filter_by_commands(['GET'])) |
filter(filter_for_errors) |
window(window_duration(1m)) |
reduce(count) |
output(output_timeseries_to_key('get_errors_count', max_length=1000))
"

Output

The goal is to capture time-series metrics in the defined pipeline outputs; for example, for the pipeline above the output would be structured as follows:

[<timestamp1, errors_count1>, <timestamp2, errors_count2>, ...] // capped at 1000 items

It remains uncertain whether storing output data in a format compatible with direct retrieval via GET (or another existing command) will be feasible. Consequently, we might need to introduce an OBSERVE RETRIEVE <pipeline_name> <since_offset> form for clients polling for result data. This command would provide:

{
    current_offset: <latest_returned_offset as a number>,
    data: [ ... result items ],
    lag_detected: <true or false> // true if `since_offset` points to data that’s been removed, signaling potential data loss.
}

Here, offset represents the sequence number of items produced by the pipeline, including any items removed due to buffer constraints. This approach allows clients to poll for results while adjusting their polling frequency based on the lag_detected flag. If lag_detected is true, clients would be advised to increase polling frequency to reduce data loss.
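
A polling interaction could then look like this (a sketch with made-up offsets and counts; the exact reply encoding is an open design question):

OBSERVE RETRIEVE get_errors_pipeline 1280

{
    current_offset: 1295,
    data: [ <timestamp1, errors_count1>, ..., <timestamp15, errors_count15> ],
    lag_detected: false
}

The client stores current_offset (1295) and passes it as since_offset on the next call, receiving only items produced after that point.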


Use-Case Examples

Below are examples of how the proposed OBSERVE command and pipeline configurations could be used to address various
observability needs.

  1. Counting Specific Commands Per Minute with Buffer Size

    Use Case: Count the number of GET commands executed per minute.

    Pipeline Creation:

    OBSERVE CREATE get_commands_per_minute "
    filter(filter_by_commands(['GET'])) |
    window(window_duration(1m)) |
    reduce(reduce_count) |
    output(output_timeseries_to_key('get_command_count', buffer_size=1440))
    "
    

    Explanation: This pipeline filters for GET commands, counts them per minute, and stores the counts
    in a time-series key get_command_count with a buffer size of 1440 (e.g., one day's worth of minute-level data).

  2. Hot Key Analysis

    Use Case: Identify and monitor the most frequently accessed keys within a certain time window, allowing for proactive load management and identification of potential bottlenecks.

    Pipeline Creation:

    OBSERVE CREATE hot_keys_analysis "
    filter(filter_by_commands(['GET'])) |
    sample(sample_percentage(0.005)) |
    partition(partition_by_key()) |
    window(window_duration(1m)) |
    reduce(reduce_count) |
    map(map_top_keys(10)) |
    output(output_timeseries_to_key('hot_keys', buffer_size=60))
    "
    

    Explanation: This pipeline filters for GET commands, samples 0.5% of them, partitions events by the accessed key, and aggregates their counts in one-minute intervals.
    The map_top_keys(10) step then selects the top 10 most frequently accessed keys in each interval along with the access counts.
    The result is stored as a time-series in hot_keys with a buffer size of 60, retaining one hour of hot key data.

  3. Average Latency Per Time Window with Buffer

    Use Case: Monitor average latency of SET commands per minute.

    Pipeline Creation:

    OBSERVE CREATE set_latency_monitor "
    filter(filter_by_commands(['SET'])) |
    sample(sample_percentage(0.005)) |
    window(window_duration(1m)) |
    map(map_get_latency) |
    reduce(average) |
    output(timeseries_to_key('set_average_latency', buffer_size=720))
    "
    

    Explanation: This pipeline filters for SET commands, extracts their latency, aggregates the average latency every
    minute, and stores it with a buffer size of 720 (e.g., 12 hours of minute-level data).

  4. Client Statistics

    Use Case: Gather command counts per client for GET and SET commands, sampled at 5%.

    Pipeline Creation:

    OBSERVE CREATE client_stats_per_minute "
    filter(filter_by_commands(['GET', 'SET'])) |
    sample(sample_percentage(0.05)) |
    map(map_client_info) |
    window(window_duration(1m)) |
    reduce(count_by_client) |
    output(timeseries_to_key('client_stats', buffer_size=1440))
    "

    Explanation: This pipeline filters for GET and SET commands, samples 5% of them, extracts client information,
    counts commands per client every minute, and stores the data under client_stats with a buffer size of 1440.

  5. Error Tracking

    Use Case: Monitor the number of errors occurring per minute.

    Pipeline Creation:

    OBSERVE CREATE error_tracking_pipeline "
    filter(filter_event_type('error')) | # the filter for errors would likely need to be more advanced
    window(window_duration(1m)) |
    reduce(count) |
    output(timeseries_to_key('total_errors', buffer_size=1440))
    "

    Explanation: This pipeline filters events of type 'error', counts them every minute, and stores the totals in total_errors with a buffer size of 1440.

  6. TTL Analysis

    Use Case: Analyze the average TTL of keys set with SETEX command per minute.

    Pipeline Creation:

    OBSERVE CREATE ttl_analysis_pipeline "
    filter(filter_by_commands(['SETEX'])) |
    map(map_extract_ttl) |
    window(window_duration(1m)) |
    reduce(average) |
    output(timeseries_to_key('average_ttl', buffer_size=1440))
    "

    Explanation: This pipeline filters for SETEX commands, extracts the TTL values, calculates the average TTL every
    minute, and stores it in average_ttl with a buffer size of 1440.

  7. Distribution of Key and Value Sizes

    Use Case: Create a histogram of value sizes for SET commands.

    Pipeline Creation:

    OBSERVE CREATE value_size_distribution "
    filter(command('SET')) |
    map(extract_value_size) |
    window(window_duration(1m)) |
    reduce(histogram(buckets([0, 64, 256, 1024, 4096, 16384]))) |
    output(timeseries_to_key('value_size_distribution', buffer_size=1440))
    "

    Explanation: This pipeline filters for SET commands, extracts the size of the values, aggregates them into
    histogram buckets every minute, and stores the distributions with a buffer size of 1440.


Feedback Request

Feedback is requested on the following points:

  1. Feature Scope: Does the proposed OBSERVE command align with your vision for Valkey’s observability?
  2. Command Design: Are there any suggestions for the OBSERVE command syntax and structure?

Let's first reach consensus on the 'Feature Scope'. If the answer is yes, we can discuss the design.
I am ready to commit to building this feature as soon as the design is accepted, even in draft form.


Thank you for your time and consideration. I look forward to discussing this proposal further.

@allenss-amazon

I like the concepts and directionality here. I think it would be profitable to split this into two subsections. One section would be to get more specific on the events that would feed into the observability framework.

A second section would focus on the representation and processing of those events. I mention this because there's quite a bit of overlap in the functionality of the second section and any implementation of a timestream processing module. In other words, can we get both timestream processing of the observability event stream (part 1) and more generic timestream data processing capabilities in the same development effort or conversely split the development effort into two parts that cooperate?
