0009: Data stream fields

  • Stage: 2 (candidate)
  • Date: 2021-04-19

When introducing the new indexing strategy for Elastic Agent which uses data streams, we found that adding a few constant_keyword fields corresponding to the central components in the new indexing strategy would be advantageous.

Fields

This RFC proposes to introduce a new fieldset called "data_stream". The fieldset consists of the following fields:

| Field | Mapping type | Description |
|---|---|---|
| data_stream.type | constant_keyword | An overarching type for the data stream. Currently allowed values include "logs", "metrics". We expect to also add "traces" and "synthetics" in the near future. |
| data_stream.dataset | constant_keyword | The field can contain anything that makes sense to signify the source of the data. Examples include nginx.access, prometheus, endpoint etc. For data streams that otherwise fit, but that do not have dataset set we use the value "generic" for the dataset value. event.dataset should have the same value as data_stream.dataset. |
| data_stream.namespace | constant_keyword | A user defined namespace. Namespaces are useful to allow grouping of data. Many of our customers already organize their indices this way, and now we are providing this best practice as a default. Many people will use default as the value. |

In the new indexing strategy, the values of the data stream fields combine to form the name of the actual data stream in the following manner: {data_stream.type}-{data_stream.dataset}-{data_stream.namespace}. This means the fields can only contain characters that are valid in data stream names.
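The naming scheme above can be sketched in a few lines of Python. The helper name `data_stream_name` is hypothetical and exists only for illustration; it assumes the document already carries all three data_stream fields.

```python
def data_stream_name(doc: dict) -> str:
    """Build the data stream name from a document's data_stream fields.

    Follows the {type}-{dataset}-{namespace} scheme described above.
    Hypothetical helper, not part of any Elastic library.
    """
    ds = doc["data_stream"]
    return f'{ds["type"]}-{ds["dataset"]}-{ds["namespace"]}'


example = {
    "data_stream": {
        "type": "logs",
        "dataset": "nginx.access",
        "namespace": "default",
    }
}
print(data_stream_name(example))  # prints: logs-nginx.access-default
```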

The fields can be found in rfcs/text/0009/data_stream.yml.

Restrictions on values

Because the values of the data_stream fields make up the data stream name, the restrictions on data stream names also apply to these values. For example, they cannot include \, /, *, ?, ", <, >, |, space, comma, or #. See the Elasticsearch reference for the full restrictions on index and data stream names. The following additional restrictions apply to the data stream fields:

data_stream.type

data_stream.type is restricted to logs or metrics for now.

Any future values for data_stream.type should also adhere to the following restrictions (these are derived from the Elasticsearch index restrictions):

  • Must not contain -
  • Must not start with + or _

data_stream.dataset

  • Must not contain -
  • No longer than 100 chars

data_stream.namespace

  • No longer than 100 chars
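As a rough illustration, the restrictions listed in this section could be checked as follows. This is a sketch that covers only the rules spelled out here, not the full set in the Elasticsearch reference; the function and constant names are hypothetical. The second helper shows one consequence of forbidding - in the type and dataset: a data stream name can be split back into its three parts unambiguously, even when the namespace contains a dash.

```python
from typing import List, Tuple

# Characters named in this section as invalid in data stream names.
FORBIDDEN_CHARS = set('\\/*?"<>| ,#')
ALLOWED_TYPES = {"logs", "metrics"}  # current values per this RFC
MAX_LEN = 100


def validate_data_stream_fields(ds_type: str, dataset: str, namespace: str) -> List[str]:
    """Return a list of rule violations; an empty list means no violation found."""
    errors = []
    for field, value in (("type", ds_type), ("dataset", dataset), ("namespace", namespace)):
        bad = sorted(FORBIDDEN_CHARS.intersection(value))
        if bad:
            errors.append(f"data_stream.{field} contains forbidden characters: {bad}")
    if ds_type not in ALLOWED_TYPES:
        errors.append("data_stream.type must be 'logs' or 'metrics'")
    if "-" in dataset:
        errors.append("data_stream.dataset must not contain '-'")
    if len(dataset) > MAX_LEN:
        errors.append("data_stream.dataset is longer than 100 characters")
    if len(namespace) > MAX_LEN:
        errors.append("data_stream.namespace is longer than 100 characters")
    return errors


def split_data_stream_name(name: str) -> Tuple[str, str, str]:
    """Split a name into (type, dataset, namespace).

    Only the namespace may contain '-', so splitting on the first two
    dashes is enough.
    """
    parts = name.split("-", 2)
    if len(parts) != 3:
        raise ValueError(f"not a {{type}}-{{dataset}}-{{namespace}} name: {name!r}")
    return parts[0], parts[1], parts[2]


print(validate_data_stream_fields("logs", "nginx.access", "billing-app"))
print(split_data_stream_name("logs-nginx.access-billing-app"))
```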

On the use of Constant Keyword fields

The new indexing strategy results in users having many more indices than they used to. Elasticsearch is very good at searching for specific documents across indices, but for some common queries we can make it even better by using constant_keyword fields. For example, it's often the case that you'd want to find only documents that contain logs from a certain service or logs from a given namespace. For a query such as data_stream.type: logs AND data_stream.namespace: billing-app Elasticsearch can quickly determine that only a small subset of the indices are relevant to search through.

Usage

Data stream fields are already in use in Elastic Agent. The data stream fields described here allow users to filter by a specific data type (logs, metrics etc.), dataset (nginx.access, prometheus) or namespace. The following are examples of common queries pertaining to specific data types, datasets or namespaces:

  • data_stream.type: logs
  • data_stream.dataset: nginx.access
  • data_stream.type: logs AND data_stream.namespace: web-frontend

As previously described, mapping fields as constant_keyword allows Elasticsearch to drastically optimize queries involving those fields. See the Elasticsearch documentation on constant_keyword for more information.
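The example queries above use a compact query-string form. As a sketch, the same filters can be expressed as an Elasticsearch query DSL body using term queries in a bool filter. The helper name `data_stream_filter` is hypothetical, and the body is only built and printed here, not sent to a cluster.

```python
import json
from typing import Optional


def data_stream_filter(
    ds_type: Optional[str] = None,
    dataset: Optional[str] = None,
    namespace: Optional[str] = None,
) -> dict:
    """Build a search body that filters on whichever data_stream fields are given."""
    filters = []
    for field, value in (
        ("data_stream.type", ds_type),
        ("data_stream.dataset", dataset),
        ("data_stream.namespace", namespace),
    ):
        if value is not None:
            # One term query per supplied field; all of them must match.
            filters.append({"term": {field: value}})
    return {"query": {"bool": {"filter": filters}}}


# Equivalent of: data_stream.type: logs AND data_stream.namespace: web-frontend
body = data_stream_filter(ds_type="logs", namespace="web-frontend")
print(json.dumps(body, indent=2))
```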

Source data

Today, Elastic Agent adds the data_stream fields to all documents it ingests. It's also possible to use the fields in data from other data sources. Elasticsearch 7.9+ ships with built-in index template mappings which ensure that documents indexed into data streams matching logs-*-* or metrics-*-* get the fields mapped correctly as constant_keyword.
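For data coming from other sources, it can help to check up front whether a target data stream name falls under the built-in patterns. The sketch below uses Python's `fnmatch` for the wildcard check and shows what a constant_keyword mapping for the three fields looks like. The mapping fragment is illustrative only and is not a copy of the built-in template; all names are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns named in this RFC as covered by the built-in templates.
BUILT_IN_PATTERNS = ("logs-*-*", "metrics-*-*")

# Illustrative mapping fragment (an assumption, not the actual built-in template).
DATA_STREAM_MAPPING = {
    "properties": {
        "data_stream": {
            "properties": {
                "type": {"type": "constant_keyword"},
                "dataset": {"type": "constant_keyword"},
                "namespace": {"type": "constant_keyword"},
            }
        }
    }
}


def covered_by_built_in_templates(name: str) -> bool:
    """True if the data stream name matches logs-*-* or metrics-*-*."""
    return any(fnmatchcase(name, pattern) for pattern in BUILT_IN_PATTERNS)


print(covered_by_built_in_templates("metrics-system.process_summary-default"))
print(covered_by_built_in_templates("traces-apm-default"))
```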

Here are two example events, one for logs and one for metrics. Some fields have been removed for readability.

Example source document of type metrics:

{
  "@timestamp": "2020-12-23T10:10:45.704Z",
  "event": {
    "dataset": "system.process_summary",
    "module": "system",
    "duration": 34693020
  },
  "service": {
    "type": "system"
  },
  "system": {
    "process": {
      "summary": {
        "dead": 0,
        "total": 236,
        "sleeping": 49,
        "running": 0,
        "idle": 95,
        "stopped": 0,
        "zombie": 0,
        "unknown": 92
      }
    }
  },
  "data_stream": {
    "dataset": "system.process_summary",
    "namespace": "default",
    "type": "metrics"
  }
}

Example source document of type logs:

{
  "@timestamp": "2020-12-23T10:17:35.902Z",
  "log.level": "debug",
  "log.logger": "processors",
  "log.origin": {
    "file.name": "processing/processors.go",
    "file.line": 203
  },
  "message": "Hello world ECS",
  "input": {
    "type": "log"
  },
  "event": {
    "dataset": "elastic_agent.metricbeat"
  },
  "log": {
    "file": {
      "path": "/opt/Elastic/Agent/data/elastic-agent-1da173/logs/default/metricbeat-json.log"
    },
    "offset": 685026
  },
  "data_stream": {
    "dataset": "elastic_agent.metricbeat",
    "namespace": "default",
    "type": "logs"
  }
}

Using data_stream fields with regular indices

data_stream fields only make sense when indexing into data streams. They should not be used for regular indices.

Scope of impact

  • We've described that generic is a valid value for data_stream.dataset in some cases. Since event.dataset should always have the same value, this will also apply to event.dataset. We should update the documentation on event.dataset to reflect this.
  • Since data_stream.dataset and event.dataset should contain the same value, the restrictions imposed on data_stream.dataset might affect the event.dataset value. This means users may need to translate their custom dataset values (e.g. event.dataset: firewall/config) to an equivalent legal dataset, according to the character restrictions imposed by the use of the value in data_stream.dataset, for example data_stream.dataset: firewall.config.
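The translation mentioned in the last point could look roughly like the following. Mapping / to . comes from the example above; treating every other disallowed character the same way, collapsing runs, and falling back to "generic" for an empty result are assumptions made for this sketch, and `to_legal_dataset` is a hypothetical name.

```python
import re

# Characters that data_stream.dataset may not contain per this RFC,
# including '-' (the trailing '-' in the class is a literal dash).
_ILLEGAL = re.compile(r'[\\/*?"<>| ,#-]+')


def to_legal_dataset(value: str) -> str:
    """Translate a custom dataset value into one usable as data_stream.dataset."""
    # Replace each run of illegal characters with a single dot,
    # then drop dots left over at either end.
    cleaned = _ILLEGAL.sub(".", value).strip(".")
    # Enforce the 100 character limit.
    cleaned = cleaned[:100]
    # Assumption: use "generic" when nothing usable remains.
    return cleaned or "generic"


print(to_legal_dataset("firewall/config"))  # prints: firewall.config
```

Applying the same function to both event.dataset and data_stream.dataset keeps the two values identical, as this RFC expects.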

Concerns

Relation to event.* fields

Concerns have been raised about how these fields relate to the event fields, specifically event.type, event.kind, event.category etc. In particular, data_stream.type seems closer to event.kind than to event.type. There are other inconsistencies as well, and we did not find a way to resolve this concern at the time. It was decided to move forward with the data_stream fields for now and consider them to be unrelated to the event fields. event.dataset and data_stream.dataset, however, should contain the same value.

Real-world implementations

Elastic Agent already uses the data_stream fields.

Additionally, as previously described, beginning in version 7.9, Elasticsearch ships with built-in index templates for data streams which automatically ensure that data_stream fields are mapped correctly when the data stream name matches logs-*-* or metrics-*-*.

People

The following are the people who consulted on the contents of this RFC.

  • @roncohen | author, sponsor
  • @ruflin | author, sponsor, subject matter expert

References

RFC Pull Requests