[RFC]Observability Correlation Zones #123
Labels
documentation: Improvements or additions to documentation
enhancement: New feature or request
integration: integration related content
visualization: a visual widget for a specific purpose
Correlation Zones RFC
This RFC presents an observability-driven framework that aims to automate and simplify the daily tasks of a Site Reliability Engineer (SRE).
It helps an SRE identify critical issues in a system and attend to them as quickly and simply as possible.
The Problem Domain
In the observability domain the amount of data collected by different agents is overwhelming. Various types of observability data are collected into the system and need to be viewed in a single projection.
The simplistic approach of directly monitoring individual fields and metrics is inefficient, cannot scale, and requires an ongoing monitoring effort that costs time and resources.
We are well aware that researching an anomaly or an incident in a system is a difficult and time-consuming task.
That is why we set out to pre-collect and prioritize these anomalies and incidents into a dedicated research area that we call correlation zones.
Our goal in this RFC is very specific:
“How do we simplify the SRE’s daily system monitoring workflow in an effective and proactive way?”
As an example, imagine a cloud-based e-commerce system composed of a few hundred services, deployed in multiple regions and covering multiple geographical domains and time zones.
The observability data collected from such a system amounts to a few dozen terabytes a day and includes multiple types of data points for different features of the system.
Solution Objectives
Our goal in this initiative is to drastically simplify and automate the SRE’s daily workflow and to introduce a preemptive approach to monitoring and problem detection.
Building on the standard observability protocol and data collection pipeline, together with the insight we gathered from interactions with large customers, we are introducing the correlation zones concept, framework and dashboards to help the observability community and industry.
The correlation zones concept and framework are intended to fundamentally change the way SREs work with observability data and to make room for efficient usage patterns that hold up at scale.
One key goal of the correlation zones framework is to reduce the SRE’s interaction with the observability monitoring system to the minimal extent, so that the engineer focuses only on the significant segments that may help solve a potential problem.
Another main requirement for this framework is to be independent of any specific OpenSearch version.
It should be deployable without upgrading OpenSearch or the dashboard, and be available as content for download directly from our website.
Decoupling content from code provides the following advantages:
Observability Structured Schema
As part of the broad community and industry effort to handle the large variety of collected observability data, we use the OpenTelemetry protocol, which consolidates the different signals arriving from various collection agents.
With the Observability Simple Schema the collected data is structured in advance, which helps to build dedicated structured dashboards and workflows around this information.
Integration infrastructure
During the past few months OpenSearch has introduced the concept of pre-defined, opinionated visualizations that are built with the vision of creating ready-for-action dashboards and applications.
An integration is a set of service-oriented assets bundled together to represent a specific resource whose output is later ingested and analyzed inside OpenSearch.
An integration relies on the well-structured notion of the observability domain and protocol to assist in monitoring and visualizing the system's different elements.
The correlation zones infrastructure leverages integrations to assemble the various parts of the workflows. These workflows consist of both background tasks (preparing the data for visualization) and visual assets that assist the engineer in the daily tasks.
Data Preparation And Transformations
Using the knowledge we obtained about the typical daily workflows of an SRE, we are defining automation patterns and transformations that are the fundamental steps of the correlation zones.
The next paragraphs present these steps and detail how we approach the goal of simplifying the workflow:
1) Partitioning Traces By Features
In some cases dozens of different trace data producers are unrelated to one another, and it makes no sense to view them in a single pane.
Using the Simple Schema naming convention we can separately ingest and visualize traces that belong to different data perspectives, whether they differ by application ID, by domain tenant, or by any other feature by which the customer would like to partition the trace data.
The correlation zones solution addresses this partitioning with the built-in Simple Schema naming convention patterns, which provide a mechanism for partitioning data according to custom user-defined rules.
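The naming-convention partitioning described above can be sketched as a small routing function. This is a minimal sketch, assuming the Simple Schema index pattern has the form ss4o_{type}-{dataset}-{namespace}; the attribute keys application.id and tenant, and the helper names target_index and _slug, are hypothetical and only illustrate how user-defined rules could map a trace to its partition.

```python
import re
from typing import Any, Mapping

# Assumed pattern: ss4o_{type}-{dataset}-{namespace}.
# Characters outside this set are replaced so that values cannot
# collide with the "-" separator used by the pattern.
_UNSAFE = re.compile(r"[^a-z0-9_.]+")


def _slug(value: Any, default: str = "default") -> str:
    """Normalize an attribute value into a safe index-name segment."""
    cleaned = _UNSAFE.sub("_", str(value or "").strip().lower()).strip("_")
    return cleaned or default


def target_index(
    signal_type: str,
    attributes: Mapping[str, Any],
    dataset_key: str = "application.id",  # hypothetical attribute key
    namespace_key: str = "tenant",        # hypothetical attribute key
) -> str:
    """Derive the partition (index name) for one signal from its attributes."""
    dataset = _slug(attributes.get(dataset_key))
    namespace = _slug(attributes.get(namespace_key))
    return f"ss4o_{signal_type}-{dataset}-{namespace}"


if __name__ == "__main__":
    span_attributes = {"application.id": "Checkout", "tenant": "EU-West"}
    print(target_index("traces", span_attributes))
```

Signals that lack the partitioning attributes fall back to a default dataset and namespace, so unclassified traces still land in a predictable index instead of being dropped.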
Here is some guidance on how ingestion routing can provide this pipeline-based partitioning:
Data Prepper’s ingestion routing allows the customer to partition data in the pipeline itself.
The OTEL collector’s transform processor allows configuring multiple context statements for traces, metrics, and logs. The value of context specifies which OTTL Context to use when interpreting the associated statements.
2) Service Based Pivoting
The main entry point for the observability engineer will be the Services projection. This concept creates the abstract notion of a high-level Service entity that can be monitored and alerted on using different metrics and measurements.
A service has the following attributes and dimensions:
3) Correlation Zones indices
As noted above, researching an anomaly or an incident in a system is a difficult and time-consuming task.
That is why we pre-collect and prioritize these anomalies and incidents into a dedicated research area that we call correlation zones.
These correlation zones are deliberately constructed with the following agenda:
4) Priorities Based Rules
One key goal of the correlation zones framework is to reduce the SRE’s interaction with the observability monitoring system to the minimal extent, so that the engineer is involved only when intervention is actually needed to solve a potential problem.
We are providing a new mechanism that lets the system differentiate between the indications it finds and score each one according to a rule.
These rules determine the size of the working queue and the importance of its entries, in accordance with the SLOs and monitoring objectives of the different services.
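One way to picture rule-based scoring is shown below. This is a sketch under stated assumptions, not the framework's actual rule format: the Indication fields, the two rule functions and build_queue are hypothetical names, and the score is simply how far a measurement exceeds its SLO target.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class Indication:
    """One finding reported by the system for a service (hypothetical shape)."""
    service: str
    error_rate: float       # fraction of failed requests in the window
    p95_latency_ms: float   # observed latency in the window
    slo_error_rate: float   # SLO target for the error rate
    slo_latency_ms: float   # SLO target for latency


Rule = Callable[[Indication], float]


def error_rate_rule(ind: Indication) -> float:
    """Score grows with the relative excess of the error rate over its SLO."""
    if ind.slo_error_rate <= 0:
        return 0.0
    return max(0.0, ind.error_rate / ind.slo_error_rate - 1.0)


def latency_rule(ind: Indication) -> float:
    """Score grows with the relative excess of latency over its SLO."""
    if ind.slo_latency_ms <= 0:
        return 0.0
    return max(0.0, ind.p95_latency_ms / ind.slo_latency_ms - 1.0)


def build_queue(
    indications: Iterable[Indication],
    rules: Sequence[Rule],
    max_size: int = 10,
) -> List[Tuple[float, Indication]]:
    """Score every indication, drop those within SLO, keep the top entries."""
    scored = [(sum(rule(ind) for rule in rules), ind) for ind in indications]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:max_size]
```

The max_size parameter bounds the working queue, and the per-service SLO fields make the same rule produce different priorities for services with different objectives.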
Correlation Zone Dashboard
The SRE will interact with these zones through a dedicated dashboard that reflects the distinct correlation zone attributes and capabilities.
The Priority List - the list the engineer engages with; it reflects the daily working queue of pending investigations.
The specific Correlation investigation details dialog:
The Prioritized list of the daily investigation alerts / incidents
This is a Vega-based dashboard composed of multiple Vega visualization widgets correlated through the distinct correlation zone.
Once the SRE has chosen to start the investigation of a specific row (correlation zone), the Correlation investigation details dialog opens and the following dashboard is displayed:
In addition to the pre-built visualizations, the correlation dialog will include a query bar that allows an advanced user to query and join different indices and data sources for further investigation, or even to transform data, using the PPL query language.
##TODO - add image of the visualization
Building the Correlation Zones
Generating a correlation zone requires a set of steps involving different operations; these are described in detail in the next paragraphs.
Routing Observability Signals
The routing technique will use the OTEL collector ingestion pipeline as the mechanism and infrastructure for determining the target indices that will hold the different information.
*** TODO - add examples ***
Preprocessing Services Aggregated Data
The preprocessing step will take advantage of the Transformation API to prepare and pre-aggregate the relevant data, in order to optimize query-time performance and reduce storage and compute cost.
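As an illustration of such a pre-aggregation, the sketch below builds a request body in the shape used by the OpenSearch index transform jobs API (PUT _plugins/_transform/<transform_id>), grouping spans per service, status code and hour. It only constructs the body and does not call the cluster. The field names serviceName, status.code, startTime, spanId and durationInNanos, the index names, and the function build_red_transform are assumptions that depend on the ingested trace mapping, and the body should be checked against the transform API of the deployed OpenSearch version.

```python
import json
from typing import Any, Dict


def build_red_transform(
    source_index: str,
    target_index: str,
    start_time_ms: int,
) -> Dict[str, Any]:
    """Build a transform job body that buckets spans per service, status and hour."""
    return {
        "transform": {
            "enabled": True,
            "description": "Hourly per-service pre-aggregation for RED metrics",
            "schedule": {
                "interval": {"period": 1, "unit": "Hours", "start_time": start_time_ms}
            },
            "source_index": source_index,
            "target_index": target_index,
            "data_selection_query": {"match_all": {}},
            "page_size": 1000,
            # Assumed field names; adjust to the actual trace mapping.
            "groups": [
                {"terms": {"source_field": "serviceName", "target_field": "service"}},
                {"terms": {"source_field": "status.code", "target_field": "status_code"}},
                {
                    "date_histogram": {
                        "source_field": "startTime",
                        "target_field": "hour",
                        "fixed_interval": "1h",
                    }
                },
            ],
            "aggregations": {
                "request_count": {"value_count": {"field": "spanId"}},
                "avg_duration_nanos": {"avg": {"field": "durationInNanos"}},
                "max_duration_nanos": {"max": {"field": "durationInNanos"}},
            },
        }
    }


if __name__ == "__main__":
    body = build_red_transform("ss4o_traces-checkout-prod", "correlation_red_hourly", 0)
    print(json.dumps(body, indent=2))
```

Grouping by status code is one way to derive the error part of RED from per-bucket counts: the request rate is the sum of request_count over all status codes in an hour, and the error rate is the share of that sum carried by the error status.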
*** TODO - add examples ***
Trigger based data processing
The existing Alert triggering mechanism will provide the tool for defining a rule that triggers and executes the workflow of collecting data for the correlation zone.
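The triggering logic can be illustrated independently of the Alerting plugin. The following is a plain Python sketch of a threshold rule evaluated over hourly per-service buckets; it does not use the Alerting API, and HourlyBucket, triggered_buckets and collection_requests are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Union


@dataclass(frozen=True)
class HourlyBucket:
    """Pre-aggregated counts for one service and one hour (hypothetical shape)."""
    service: str
    hour: str
    request_count: int
    error_count: int


def error_rate(bucket: HourlyBucket) -> float:
    """Fraction of failed requests; 0.0 for an empty bucket."""
    if bucket.request_count == 0:
        return 0.0
    return bucket.error_count / bucket.request_count


def triggered_buckets(
    buckets: Iterable[HourlyBucket],
    max_error_rate: float,
    min_requests: int = 1,
) -> List[HourlyBucket]:
    """Return the buckets whose error rate exceeds the threshold."""
    return [
        b for b in buckets
        if b.request_count >= min_requests and error_rate(b) > max_error_rate
    ]


def collection_requests(
    buckets: Iterable[HourlyBucket],
    max_error_rate: float,
) -> List[Dict[str, Union[str, float]]]:
    """Describe which service and hour the collection workflow should gather."""
    return [
        {"service": b.service, "hour": b.hour, "error_rate": error_rate(b)}
        for b in triggered_buckets(buckets, max_error_rate)
    ]
```

Each returned entry identifies a service and time window for which the data collection workflow would run; buckets with no traffic never trigger, which avoids creating zones for idle services.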
*** TODO - add examples ***
Using Integrations / Visualizations as building blocks
The existing capability of dynamically loading different integrations will give customers a mechanism to configure and optimize a workflow that customizes the construction and behaviour of the correlation zone.
*** TODO - add examples ***
We are currently building the fundamental blocks that will make up the different steps in the workflow. Feel free to comment on these elements and to request new ones if needed using the "new-issue/Integration-suggestion"
Workflow based integrations
A workflow that transforms and prepares the data for the investigation process takes multiple steps.
Each step is described by a specific integration that defines the step's actions, parameters and API template.
These steps can be used separately, but they bring the utmost value when used as an opinionated workflow that delivers the value of the correlation-zone infrastructure directly and intuitively to the SRE.
Some of the available steps:
_reindex
API call for preparing the different trace partitions, including additional filter-by parameters
_transform
API call for aggregating the services hourly into optimal time buckets for RED metrics analysis
_alert
API call for preparing a threshold-based query that filters the problems in the services' behaviour
_reindex
API call for collecting the correlation zone data and preparing the suspected incidents in one index
Additional Context
This RFC represents ongoing work on improving and refactoring the Observability capabilities of the OpenSearch codebase.