[Feature request] Add field mapping correlation type metadata concept #7082

Open
YANG-DB opened this issue Apr 10, 2023 · 2 comments
Labels
discuss Issues intended to help drive brainstorming and decision making enhancement Enhancement or improvement to existing feature or request feature New feature or request Indexing Indexing, Bulk Indexing and anything related to indexing

Comments

@YANG-DB
Member

YANG-DB commented Apr 10, 2023

Is your feature request related to a problem?
As part of the Integration campaign and the Integration RFC (https://github.com/opensearch-project/OpenSearch-Dashboards/issues/3412), we have introduced the SimpleSchema for the Observability domain. It is based on the concept of a well-structured index that is backed by a schema.

Schema
A schema is associated with an index through the mapping configuration.

This mapping structure is also composable through the composed_of template capability, which is used extensively to assemble the various log types.

Another concept behind the schema is the ability to reflect relationships.
This representation is currently defined in a proprietary way, by adding the information to
the index mapping template's metadata.

In the Observability domain, a log entity's relationship to a trace entity, (:log)-[:associated]-(:trace), using the traceId correlation field is described in the _meta section of the log's mapping:

"_meta": {
  "description": "Simple Schema For Observability",
  "catalog": "observability",
  "type": "logs",
  "correlations": [
    {
      "field": "spanId",
      "foreign-schema": "traces",
      "foreign-field": "spanId"
    },
    {
      "field": "traceId",
      "foreign-schema": "traces",
      "foreign-field": "traceId"
    }
  ]
}
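
As a rough illustration of how a consumer could read this metadata, the following Python sketch extracts the correlations from a _meta section shaped like the one above. The helper name list_correlations is hypothetical and is not part of any OpenSearch API; the input is simply the example metadata.

```python
import json

# The _meta section from the example above, as a JSON string.
META_JSON = """
{
  "description": "Simple Schema For Observability",
  "catalog": "observability",
  "type": "logs",
  "correlations": [
    {"field": "spanId", "foreign-schema": "traces", "foreign-field": "spanId"},
    {"field": "traceId", "foreign-schema": "traces", "foreign-field": "traceId"}
  ]
}
"""


def list_correlations(meta):
    """Return (source_type, field, foreign_schema, foreign_field) tuples.

    Hypothetical helper: it only reads the keys used in the example _meta.
    """
    source = meta.get("type")
    return [
        (source, c["field"], c["foreign-schema"], c["foreign-field"])
        for c in meta.get("correlations", [])
    ]


correlations = list_correlations(json.loads(META_JSON))
for source, field, schema, foreign_field in correlations:
    print(f"{source}.{field} -> {schema}.{foreign_field}")
```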

Screenshot 2023-04-04 at 10 19 22 AM

What solution would you like?

I would like the field mapping API to be extended with this metadata information.

Recently, OpenSearch has seen large extensions to how it operates conceptually as a search engine.
These extensions include:

The evolution of a knowledge layer on top of the data layer is an existing trend, both in OpenSearch and in other storage engines.

A key part of any knowledge layer is the concept of relationships between the different entities.

P1 - The First Step

This step includes the introduction of the correlations concept into the field mapping.

Even though the concept of index relationships does exist today:

Both options imply a physical, explicit relationship between indices, which has a strong side effect on physical index storage and query time.
In addition, the specific field mapping does not reflect this join, which is only present at the higher index mapping level.

The new field-mapping-correlation feature addresses the metadata aspect of the relationship between well-structured
entities residing in different indices.

A correlation is a weaker constraint: it does not impose a relational-database-style foreign key constraint, but rather implies that such a correlation exists and may be joined
using a query engine.

Another difference from the existing join fields is that this correlation will at first be a declarative metadata definition that is not enforced against the
actual data inside the indices. Only the mapping correlation metadata will be enforced, as detailed below.

New Correlation Section in Field mapping

Field mapping for a field that has a relationship to a foreign field in the target entity's index:
GET logs/_mapping/field/traceId

Will respond with:

{
  "logs": {
    ...
    "mappings": {
      ...
      "traceId": {
        "ignore_above": 256,
        "type": "keyword"
      },
      "spanId": {
        "ignore_above": 256,
        "type": "keyword"
      },
      "traceIdFk": {
        "type": "correlation",
        "path": "traceId",
        "target_schema": "traces",
        "target_field": "traceId"
      },
      "spanIdFk": {
        "type": "correlation",
        "path": "spanId",
        "target_schema": "traces",
        "target_field": "spanId"
      }
    }
  }
}

This metadata will be used by the SQL / PPL query engine to allow explicit correlation between different data streams or data sources.
Having this information explicitly will allow better understanding and enhance investigation capabilities.

Once a SQL / PPL correlation (join) query is submitted to the corresponding index, the engine will create a regular SQL join query.
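
The proposal does not specify the exact SQL that the engine would generate. As a minimal sketch, assuming the target_schema value can be used directly as the index name to join against, a correlation field mapping could be turned into a join clause like this. The function build_join_sql and the table aliases s and t are hypothetical.

```python
def build_join_sql(source_index, correlation):
    """Build a plain SQL join from a 'correlation' field mapping.

    Assumption: 'target_schema' names the index to join against. The
    proposal does not define the generated SQL; this is illustrative only.
    """
    if correlation.get("type") != "correlation":
        raise ValueError("not a correlation field mapping")
    return (
        f"SELECT * FROM {source_index} AS s "
        f"JOIN {correlation['target_schema']} AS t "
        f"ON s.{correlation['path']} = t.{correlation['target_field']}"
    )


# The traceIdFk mapping from the example response.
trace_id_fk = {
    "type": "correlation",
    "path": "traceId",
    "target_schema": "traces",
    "target_field": "traceId",
}

sql = build_join_sql("logs", trace_id_fk)
print(sql)
```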

Enforcement

In the first step (P1), the mapping API would enforce the following when a field mapping correlation is requested:

  • Validate that the target index schema (foreign-schema) mapping exists (in the above example, "foreign-schema": "traces" implies that an index template named traces must exist).
  • Validate that the target field (foreign-field) mapping exists (in the above example, "foreign-field": "traceId" implies that a field named traceId must exist in the target schema).
  • Validate that the field type is the same for the source and target fields.

The correlations field may accept multiple correlations, including correlations to remote indices, remote tables, and data sources.
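
A minimal sketch of these three checks, assuming the index templates are available as a plain in-memory dictionary. The function validate_correlation and the templates structure are hypothetical stand-ins and do not correspond to an existing OpenSearch API.

```python
def validate_correlation(source_fields, correlation, templates):
    """Return a list of error strings; an empty list means valid.

    source_fields: field name -> field mapping of the source index.
    templates: schema name -> {field name -> field mapping}; a hypothetical
               in-memory stand-in for the registered index templates.
    """
    schema = correlation["target_schema"]
    target = templates.get(schema)
    if target is None:
        # Check 1: the target schema (index template) must exist.
        return [f"target schema '{schema}' does not exist"]

    field = correlation["target_field"]
    target_field = target.get(field)
    if target_field is None:
        # Check 2: the target field must exist in the target schema.
        return [f"target field '{field}' does not exist in '{schema}'"]

    source_field = source_fields.get(correlation["path"])
    if source_field is None:
        return [f"source field '{correlation['path']}' does not exist"]
    if source_field["type"] != target_field["type"]:
        # Check 3: source and target field types must match.
        return [
            f"type mismatch: {source_field['type']} vs {target_field['type']}"
        ]
    return []


logs_fields = {"traceId": {"type": "keyword", "ignore_above": 256}}
templates = {"traces": {"traceId": {"type": "keyword"}}}
trace_id_fk = {
    "type": "correlation",
    "path": "traceId",
    "target_schema": "traces",
    "target_field": "traceId",
}

print(validate_correlation(logs_fields, trace_id_fk, templates))
```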

P2 - The Next Step

The next phase of the correlation capability would include actually precomputing the correlated data using auxiliary data structures or indices.
The auxiliary data structure may take the form of an eager correlation task that precomputes the join and materializes it into secondary storage.
An additional skipping index can be introduced to further optimize filter-based queries using a Bloom filter or another probabilistic data sketch.

With these auxiliary structures, SQL queries would return much faster, enabling fast, investigation-driven use cases on top of huge indices and event data-lake-based
correlations.
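
To make the skipping-index idea more concrete, here is a small Python sketch of a Bloom filter kept per partition, so that a lookup by traceId can skip partitions that definitely do not contain the value. The partition names, filter size, and hash scheme are all assumptions chosen for illustration; the proposal does not prescribe a design.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; a Python int serves as the bit array."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# Hypothetical partitions of a traces index and the traceIds they hold.
partitions = {
    "traces-000001": ["t1", "t2"],
    "traces-000002": ["t3"],
}

skipping_index = {}
for name, trace_ids in partitions.items():
    bloom = BloomFilter()
    for trace_id in trace_ids:
        bloom.add(trace_id)
    skipping_index[name] = bloom


def candidate_partitions(trace_id):
    """Return the partitions that may contain trace_id; others are skipped."""
    return [n for n, bf in skipping_index.items() if bf.might_contain(trace_id)]


print(candidate_partitions("t3"))
```

A Bloom filter never produces false negatives, so a skipped partition is guaranteed not to hold the value, while a returned partition still has to be checked.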

What alternatives have you considered?
A clear and concise description of any alternative solutions or features you've considered.

Do you have any additional context?

@YANG-DB YANG-DB added enhancement Enhancement or improvement to existing feature or request untriaged labels Apr 10, 2023
@RyanL1997 RyanL1997 added the feature New feature or request label Apr 27, 2023
@saratvemulapalli
Member

@YANG-DB are you looking for feedback or would contribute these changes?

@YANG-DB
Member Author

YANG-DB commented May 5, 2023

I wanted to get feedback on this suggestion and how it fits with the current correlation initiative

@mch2 mch2 added the discuss Issues intended to help drive brainstorming and decision making label May 9, 2023
@anasalkouz anasalkouz added Indexing Indexing, Bulk Indexing and anything related to indexing and removed untriaged labels Jun 1, 2023
@github-project-automation github-project-automation bot moved this to Planned work items in OpenSearch Roadmap May 31, 2024
Projects
Status: New
Development

No branches or pull requests

5 participants