- Stage: 2 (candidate)
- Date: 2021-04-19
When introducing the new indexing strategy for Elastic Agent which uses data streams, we found that adding a few constant_keyword fields corresponding to the central components in the new indexing strategy would be advantageous.
This RFC proposes to introduce a new fieldset called "data_stream". The fieldset consists of the following fields:
Field | Mapping type | Description |
---|---|---|
data_stream.type | constant_keyword | An overarching type for the data stream. Currently allowed values are "logs" and "metrics". We expect to also add "traces" and "synthetics" in the near future. |
data_stream.dataset | constant_keyword | The field can contain anything that makes sense to signify the source of the data. Examples include `nginx.access`, `prometheus`, `endpoint`, etc. For data streams that otherwise fit but do not have a dataset set, we use the value "generic". `event.dataset` should have the same value as `data_stream.dataset`. |
data_stream.namespace | constant_keyword | A user-defined namespace. Namespaces are useful to allow grouping of data. Many of our customers already organize their indices this way, and now we are providing this best practice as a default. Many people will use `default` as the value. |
In the new indexing strategy, the values of the data stream fields combine into the name of the actual data stream as follows: `{data_stream.type}-{data_stream.dataset}-{data_stream.namespace}`. This means the fields can only contain characters that are valid in data stream names.
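As a minimal sketch of the naming rule above, the data stream name can be derived from a document's `data_stream` object. The helper name `data_stream_name` is hypothetical and used only for illustration; it is not part of Elastic Agent or Elasticsearch.

```python
def data_stream_name(doc: dict) -> str:
    """Compose {type}-{dataset}-{namespace} from a document's data_stream fields."""
    ds = doc["data_stream"]
    return f"{ds['type']}-{ds['dataset']}-{ds['namespace']}"


# Illustrative document containing only the data_stream fields.
doc = {
    "data_stream": {
        "type": "logs",
        "dataset": "nginx.access",
        "namespace": "default",
    }
}
print(data_stream_name(doc))  # logs-nginx.access-default
```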
The fields can be found in `rfcs/text/0009/data_stream.yml`.
Because the values of the `data_stream` fields make up the data stream name, the restrictions on data stream names also apply to those values. For example, they cannot include `\`, `/`, `*`, `?`, `"`, `<`, `>`, `|`, ` ` (space), `,`, or `#`. Please see the Elasticsearch reference for restrictions on index/data stream names. The following additional restrictions apply to the data stream fields:
data_stream.type
`data_stream.type` is restricted to `logs` or `metrics` for now.
Any future values for `data_stream.type` should also adhere to the following restrictions (these are derived from the Elasticsearch index restrictions):
- Must not contain `-`
- Must not start with `+` or `_`
data_stream.dataset
- Must not contain `-`
- No longer than 100 characters
data_stream.namespace
- No longer than 100 characters
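The restrictions above could be checked before indexing with something like the following. This is a sketch under stated assumptions: the function name `validate_data_stream_fields` and the error messages are hypothetical, and only the rules listed in this RFC are checked, not every Elasticsearch naming rule.

```python
from typing import List

# Characters this RFC lists as invalid in data stream names.
FORBIDDEN_CHARS = set('\\/*?"<>| ,#')
# Values currently allowed for data_stream.type per this RFC.
ALLOWED_TYPES = {"logs", "metrics"}
MAX_LENGTH = 100


def validate_data_stream_fields(type_: str, dataset: str, namespace: str) -> List[str]:
    """Return a list of violations; an empty list means all values pass."""
    errors = []
    for field, value in (("type", type_), ("dataset", dataset), ("namespace", namespace)):
        bad = sorted(FORBIDDEN_CHARS & set(value))
        if bad:
            errors.append(f"data_stream.{field} contains forbidden characters: {bad}")
    if type_ not in ALLOWED_TYPES:
        errors.append("data_stream.type must be 'logs' or 'metrics'")
    if "-" in dataset:
        errors.append("data_stream.dataset must not contain '-'")
    if len(dataset) > MAX_LENGTH:
        errors.append("data_stream.dataset is longer than 100 characters")
    if len(namespace) > MAX_LENGTH:
        errors.append("data_stream.namespace is longer than 100 characters")
    return errors


print(validate_data_stream_fields("logs", "nginx.access", "default"))  # []
```

Because `data_stream.type` is limited to two fixed values for now, the rules for future type values (no `-`, no leading `+` or `_`) are satisfied implicitly and are not checked separately here.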
The new indexing strategy results in users having many more indices than they used to. Elasticsearch is very good at searching for specific documents across indices, but for some common queries we can make it even better by using `constant_keyword` fields. For example, you often want to find only documents that contain logs from a certain service or from a given namespace. For a query such as `data_stream.type: logs AND data_stream.namespace: billing-app`, Elasticsearch can quickly determine that only a small subset of the indices is relevant to the search.
Data stream fields are already in use in Elastic Agent. Leveraging the data stream fields described here allows users to filter by a specific data type (logs, metrics, etc.), dataset (nginx.access, prometheus) or namespace. The following are examples of common queries pertaining to specific data types, datasets or namespaces:
data_stream.type: logs
data_stream.dataset: nginx.access
data_stream.type: logs AND data_stream.namespace: web-frontend
As previously described, mapping fields as `constant_keyword` allows Elasticsearch to drastically optimize queries involving those fields. See the Elasticsearch documentation on `constant_keyword` for more information.
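The example queries above use query-string syntax. As an illustration, the same filters can be expressed as an Elasticsearch Query DSL request body built from `term` clauses inside a `bool` filter. The helper `data_stream_filter` is a hypothetical name, and the sketch only builds the request body; it does not send it.

```python
import json


def data_stream_filter(**fields: str) -> dict:
    """Build a bool/filter query with one term clause per data_stream field."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {f"data_stream.{name}": value}}
                    for name, value in fields.items()
                ]
            }
        }
    }


# Equivalent of: data_stream.type: logs AND data_stream.namespace: web-frontend
body = data_stream_filter(type="logs", namespace="web-frontend")
print(json.dumps(body, indent=2))
```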
Today, Elastic Agent adds the `data_stream` fields to all documents it ingests. It's also possible to use the fields in data from other data sources. Elasticsearch 7.9+ ships with built-in index template mappings which ensure that documents indexed into data streams matching `logs-*-*` or `metrics-*-*` get the fields mapped correctly as `constant_keyword`.
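For data that does not go through the built-in templates, an equivalent mapping would have to be declared by hand. The following sketch shows what such a mappings fragment might look like; it is an assumption for illustration, not a copy of the built-in template, and `data_stream_mappings` is a hypothetical helper name.

```python
import json


def data_stream_mappings() -> dict:
    """Mappings fragment declaring the three data_stream fields as constant_keyword."""
    return {
        "properties": {
            "data_stream": {
                "properties": {
                    name: {"type": "constant_keyword"}
                    for name in ("type", "dataset", "namespace")
                }
            }
        }
    }


print(json.dumps(data_stream_mappings(), indent=2))
```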
Here are two example events, one for logs and one for metrics. Some fields have been removed for readability.
Example source document of type metrics:
{
  "@timestamp": "2020-12-23T10:10:45.704Z",
  "event": {
    "dataset": "system.process_summary",
    "module": "system",
    "duration": 34693020
  },
  "service": {
    "type": "system"
  },
  "system": {
    "process": {
      "summary": {
        "dead": 0,
        "total": 236,
        "sleeping": 49,
        "running": 0,
        "idle": 95,
        "stopped": 0,
        "zombie": 0,
        "unknown": 92
      }
    }
  },
  "data_stream": {
    "dataset": "system.process_summary",
    "namespace": "default",
    "type": "metrics"
  }
}
Example source document of type logs:
{
  "@timestamp": "2020-12-23T10:17:35.902Z",
  "log.level": "debug",
  "log.logger": "processors",
  "log.origin": {
    "file.name": "processing/processors.go",
    "file.line": 203
  },
  "message": "Hello world ECS",
  "input": {
    "type": "log"
  },
  "event": {
    "dataset": "elastic_agent.metricbeat"
  },
  "log": {
    "file": {
      "path": "/opt/Elastic/Agent/data/elastic-agent-1da173/logs/default/metricbeat-json.log"
    },
    "offset": 685026
  },
  "data_stream": {
    "dataset": "elastic_agent.metricbeat",
    "namespace": "default",
    "type": "logs"
  }
}
`data_stream` fields only make sense when indexing into data streams. They should not be used for regular indices.
- We've described that `generic` is a valid value for `data_stream.dataset` in some cases. Since `event.dataset` should always have the same value, this will also apply to `event.dataset`. We should update the documentation on `event.dataset` to reflect this.
- Since `data_stream.dataset` and `event.dataset` should contain the same value, the restrictions imposed on `data_stream.dataset` might affect the `event.dataset` value. This means users may need to translate their custom dataset values (e.g. `event.dataset: firewall/config`) to an equivalent legal dataset, according to the character restrictions imposed by the use of the value in `data_stream.dataset`, for example `data_stream.dataset: firewall.config`.
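One possible way to perform the translation described above is to replace every disallowed character with a dot, matching the `firewall/config` to `firewall.config` example. This is only a sketch: the function name `to_legal_dataset` and the choice of `.` as the default replacement are assumptions, and users may prefer a different mapping.

```python
import re

# Characters this RFC lists as invalid in data stream names,
# plus "-", which data_stream.dataset must not contain.
_INVALID = re.compile(r'[\\/*?"<>| ,#-]')


def to_legal_dataset(event_dataset: str, replacement: str = ".") -> str:
    """Replace disallowed characters and cap the result at 100 characters."""
    return _INVALID.sub(replacement, event_dataset)[:100]


print(to_legal_dataset("firewall/config"))  # firewall.config
```

The translated value would then be written to both `event.dataset` and `data_stream.dataset` so that the two fields stay equal.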
Concerns have been raised about how these fields relate to the event fields, in particular `event.type`, `event.kind`, `event.category`, etc. For example, `data_stream.type` seems closer to `event.kind` than to `event.type`. There are other inconsistencies here, and we didn't find a way to resolve this concern at the time. It was decided to move forward with the `data_stream` fields for now and consider them to be unrelated to the event fields. `event.dataset` and `data_stream.dataset`, however, should contain the same value.
Elastic Agent already uses the data_stream fields.
Additionally, as previously described, beginning in version 7.9, Elasticsearch ships with built-in index templates for data streams which automatically ensure that the `data_stream` fields get mapped correctly when the data stream name matches `logs-*-*` or `metrics-*-*`.
The following are the people that consulted on the contents of this RFC.
- @roncohen | author, sponsor
- @ruflin | author, sponsor, subject matter expert
- Elasticsearch documentation on the constant_keyword mapping type
- https://www.elastic.co/guide/en/elasticsearch/reference/current/faster-filtering-with-constant-keyword.html
- Previous discussion on dataset fields
- Discussion on field value restrictions
- Restrictions on index names
- Blog post: An introduction to the Elastic data stream naming scheme
- Elasticsearch documentation on data stream naming scheme