[Discover] Implement logs data source context resolution #184079

davismcphee · 2024-05-23T02:59:23Z

📓 Summary

The first data source profile supported by One Discover will be "logs". This issue covers the initial implementation of a logs DataSourceProfileProvider, primarily its resolve method. The aim should be to identify and broadly categorize logs as a data source category by inspecting the current ES|QL query or data view. Associated extension point implementations will be added later under separate issues.

Some ideas of how we might implement this:

ES|QL queries and data views both have index patterns which we can rely on in the context resolution process.
We can check for the existence of specific fields based on the index pattern through the field caps API.
We can have a list of known logs indices which we can check the index pattern against for a match.

✔️ Acceptance criteria

Define a set of heuristics to identify and categorize logs data sources.
Create and register a logs DataSourceProfileProvider with a resolve method based on the defined heuristics.
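
As a rough illustration of the second criterion, a provider with a `resolve` method could look like the sketch below. The types here are simplified, hypothetical stand-ins written for this example, not the actual One Discover interfaces, and the `logs-` prefix check is only a placeholder for whatever heuristics get defined.

```typescript
// Hypothetical, simplified shapes: the real One Discover interfaces may differ.
type DataSourceCategory = 'logs' | 'default';

interface ResolveParams {
  // Index pattern taken from the current ES|QL query or data view.
  indexPattern?: string;
}

type ResolveResult =
  | { isMatch: true; context: { category: DataSourceCategory } }
  | { isMatch: false };

interface DataSourceProfileProvider {
  profileId: string;
  resolve(params: ResolveParams): ResolveResult;
}

// Placeholder heuristic (an assumption): every comma-separated part starts with "logs-".
const looksLikeLogs = (indexPattern: string): boolean =>
  indexPattern.split(',').every((part) => part.trim().startsWith('logs-'));

const logsDataSourceProfileProvider: DataSourceProfileProvider = {
  profileId: 'logs-data-source-profile',
  resolve: ({ indexPattern }) =>
    indexPattern && looksLikeLogs(indexPattern)
      ? { isMatch: true, context: { category: 'logs' } }
      : { isMatch: false },
};
```

Returning a discriminated union keeps the "no match" case explicit, so callers can fall back to a default profile without inspecting the context.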


elasticmachine · 2024-05-23T02:59:25Z

Pinging @elastic/obs-ux-logs-team (Team:obs-ux-logs)

elasticmachine · 2024-05-23T02:59:25Z

Pinging @elastic/kibana-data-discovery (Team:DataDiscovery)

tonyghiani · 2024-05-31T08:28:33Z

Initial criteria for classifying a log data source

1. We can inspect the passed index pattern to understand if it matches a well-known list of log sources.
   - Anything that starts with logs-, e.g. logs-foo-bar.
   - Anything that starts with filebeat-, winlogbeat-, logstash- or auditbeat-.
2. Any CCS prefix to one of the above patterns should be supported.
   - remote_cluster:logs-* should be resolved as a logs data source.
   - remote_cluster:metrics-* should NOT be resolved as a logs data source.
3. A pattern of multiple comma-separated indices should be classified as a logs data source only if EVERY index matches criteria 1 or 2.
   - remote_cluster:logs-*,auditbeat-* should be resolved as a logs data source.
   - remote_cluster:metrics-*,logs-*  should NOT be resolved as a logs data source, as it's not a logs-only data source and we'll match logs at an individual document level.
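
The criteria above can be sketched roughly as follows. This is a minimal illustration with a hardcoded prefix list, not the existing Logs Explorer code; `LOG_PREFIXES`, `stripCcsPrefix`, `isLogsIndex` and `isLogsIndexPattern` are names made up for this example.

```typescript
// Hardcoded base prefixes taken from the criteria in this comment (an assumption;
// a real implementation might read them from a setting instead).
const LOG_PREFIXES = ['logs', 'filebeat', 'winlogbeat', 'logstash', 'auditbeat'];

// Drops an optional "cluster_name:" CCS prefix, e.g. "remote_cluster:logs-*" -> "logs-*".
const stripCcsPrefix = (index: string): string => {
  const separator = index.indexOf(':');
  return separator === -1 ? index : index.slice(separator + 1);
};

// A single index (or wildcard) counts as logs if it starts with "<prefix>-".
const isLogsIndex = (index: string): boolean => {
  const name = stripCcsPrefix(index.trim());
  return LOG_PREFIXES.some((prefix) => name.startsWith(`${prefix}-`));
};

// A comma-separated pattern counts as logs only if EVERY part does.
const isLogsIndexPattern = (indexPattern: string): boolean => {
  const parts = indexPattern.split(',').filter((part) => part.trim() !== '');
  return parts.length > 0 && parts.every(isLogsIndex);
};
```

With this sketch, `isLogsIndexPattern('remote_cluster:logs-*,auditbeat-*')` is `true`, while `isLogsIndexPattern('remote_cluster:metrics-*,logs-*')` is `false` because one part is not a logs index.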

The above logic is already implemented for Logs Explorer and can be extracted for reuse into a package:

buildIndexPatternRegExp.
It relies on a list of base patterns from a kibana setting (currently defaults to logs, auditbeat, filebeat, winlogbeat).
Still, we probably want to improve this and rely on more specific settings for logs used across Kibana, since this setting was experimental and intended for internal use only.

Warning

We might want to rely on a hardcoded value initially, as this setting is specifically registered by an Observability project (logs_explorer)

ruflin · 2024-05-31T13:54:45Z

On the index patterns, I would like to extend it to *filebeat* and the other beats as there are quite a few cases where users prefix the beat indices. Potentially we should apply the same for logs, so if you have foo-logs-{date} it also matches logs.
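
One possible reading of this looser matching is sketched below. It assumes beat names may appear anywhere in the index name, while "logs" has to be a whole segment followed by a dash; both the `BEATS` list and the `matchesLoosely` helper are hypothetical and only meant to make the idea concrete.

```typescript
// Beat names to look for anywhere in the index name (assumed list for illustration).
const BEATS = ['filebeat', 'winlogbeat', 'auditbeat'];

const matchesLoosely = (indexName: string): boolean => {
  // Beat names may be prefixed or suffixed, mirroring a *filebeat* wildcard.
  if (BEATS.some((beat) => indexName.includes(beat))) {
    return true;
  }
  // "logs" must start the name or follow a separator, and be followed by "-",
  // so "foo-logs-2024.05.31" matches but "catalogs-1" does not.
  return /(^|[-_.])logs-/.test(indexName);
};
```

Requiring a separator before "logs" is a design choice in this sketch to avoid accidental matches on unrelated names that merely end in "logs".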

For the specific fields, I suggest we lean on log.level and message. We can't rely on @timestamp or the resource field as these will also show up for metrics, traces, synthetics. But having the two above in addition should be a strong indicator. It can happen that a metric event has a message field but in this scenario, hopefully the index name already resolved it.

Last, we can take all the ECS logs fields as a strong indicator: https://www.elastic.co/guide/en/ecs/current/ecs-log.html

@davismcphee What happens with the field caps if we have mixed data? For example if only part of the data has a message field, we should fall back to the default view as it could be a mix between logs and metrics as an example.

ruflin · 2024-05-31T13:57:42Z

BTW the comment here from @weltenwort covers the fields I have mentioned above in a nice summary: #184080 (comment)

Let's start simple, we can always add more.

tonyghiani · 2024-05-31T14:33:47Z

> On the index patterns, I would like to extend it to *filebeat* and the other beats as there are quite a few cases where users prefix the beat indices. Potentially we should apply the same for logs, so if you have foo-logs-{date} it also matches logs.

This makes sense, we can extend the current index pattern-matching capability 👌

Regarding the fields, it might be more difficult to tell whether a field applies to everything matching the pattern or only to part of it, because AFAIK the data view fields are merged and return everything matching the pattern without any grouping (e.g. logs-*,metrics-* creates a single fields map).

The index pattern will be our strongest indicator at first, and then we can rely on ECS logs fields to double-check the match.

davismcphee · 2024-05-31T18:12:09Z

> @davismcphee What happens with the field caps if we have mixed data? For example if only part of the data has a message field, we should fall back to the default view as it could be a mix between logs and metrics as an example.

@ruflin I looked into this today, and I believe there's a way through field caps, although it's a bit convoluted and I need verification from ES that my understanding is correct (I reached out to them in Slack for confirmation).

In any case, this info isn't included by default in the data view field list, and adding it by default would cause a significant increase in data view field caps data transfer. I would suggest we create a utility to use in context resolution that sends a bespoke field caps request checking for only the fields we care about during context resolution, e.g. `await fieldsExistInMatchingIndices('my-index-pattern-*', ['message', 'log.level']);`. This should keep the request fast and the response size small without affecting the main data view field list fetching. I'll investigate a bit further and follow up.
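
To make the idea concrete, here is a minimal sketch of the response-checking half of such a utility. It assumes the field caps request was restricted to the fields of interest and sent with `include_unmapped=true`, in which case a field that is missing from some matching indices gets an `unmapped` entry in the response. The `allFieldsMapped` helper and the trimmed-down response types are hypothetical, and the overall approach still needs the confirmation from ES mentioned above.

```typescript
// Trimmed-down view of a field caps response (only the parts this sketch reads).
interface FieldCapsEntry {
  type: string;
  indices?: string[];
}

interface FieldCapsResponse {
  indices: string[];
  // field name -> type name (e.g. "keyword", "unmapped") -> capabilities
  fields: Record<string, Record<string, FieldCapsEntry>>;
}

// Returns true only if every requested field is mapped in all matching indices.
const allFieldsMapped = (response: FieldCapsResponse, fieldNames: string[]): boolean =>
  response.indices.length > 0 &&
  fieldNames.every((name) => {
    const caps = response.fields[name];
    // Field absent from the response: no matching index maps it.
    if (!caps || Object.keys(caps).length === 0) {
      return false;
    }
    // With include_unmapped=true, an "unmapped" entry lists indices lacking the field.
    return !('unmapped' in caps);
  });
```

Keeping the check as a pure function over the response makes it easy to unit test separately from the request itself.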

flash1293 · 2024-06-03T07:53:32Z

> For example if only part of the data has a message field, we should fall back to the default view as it could be a mix between logs and metrics as an example

The danger of that is that a single bad index can "poison" a query - at the field caps level we don't know whether an index will actually contribute to the shown result, so a one-off ill-configured index in a broad index pattern could yank us out of logs mode for no good reason.

One approach would be to only go by the index pattern for now and see how well that works in practice - especially if the field-level check is harder to do with an additional request.

flash1293 · 2024-06-03T08:06:19Z

Related to this - we plan to introduce a central observability setting to define the "logs index patterns" which will be used by entity resolution as well. Should it be part of this as well?

cc @tommyers-elastic - did this get created already?

tonyghiani · 2024-06-03T08:59:30Z

> One approach would be to only go by the index pattern for now and see how well that works in practice - especially if the field-level check is harder to do with an additional request.

I agree with @flash1293. Given an index pattern match, we know we are dealing with logs (the strictness of the pattern matching on logs-, filebeat-, etc. should be enough to know the data view is about logs), so in this case a field caps request shouldn't even be necessary.

In case the index pattern doesn't strictly match our criteria, it can be really anything, particularly:

| Index pattern | Could be logs? |
| --- | --- |
| `logs-*,metrics-*` | ❌ |
| `logs-*,maybe-logs-pattern-*` | Maybe, we could assert via a field caps check |

Even in the case the pattern might only refer to logs, it's non-trivial to know if the fields apply to all the patterns.
I'd rather start as well with the pattern only and explore if we can add more specific assertions against the data type.

> Related to this - we plan to introduce a central observability setting to define the "logs index patterns" which will be used by entity resolution as well.

This is surely useful because it also allows users to specify their own logs patterns on top of the criteria defined above, which should help make a pattern like logs-*,maybe-logs-pattern-* definitely a logs one.

> Should it be part of this as well?

I would keep that work separate from this implementation, as it covers both the definition/implementation of the setting and its usage in Discover. We can use it in a follow-up step as soon as it's ready, so that it doesn't block this work.

flash1293 · 2024-06-03T15:47:15Z

Agreed @tonyghiani - opened a follow-up issue here: #184670

davismcphee · 2024-06-03T22:20:06Z

If the Logs UX team is aligned on this, it also works for me 👍 Starting simple and seeing how far we can get with the index pattern alone makes sense to me. We can always revisit later if it ends up being insufficient.

tonyghiani self-assigned this May 31, 2024

This was referenced May 31, 2024

[Discover] Add logs source and document contexts #184601

Merged

[Discover] Implement log document context resolution #184080

Closed

flash1293 mentioned this issue Jun 3, 2024

[Discover] Extend logs data source context resolution by observability logs settings #184670

Closed

tonyghiani closed this as completed in #184601 Jun 18, 2024

tonyghiani closed this as completed in 051b91a Jun 18, 2024
