[Discover] Implement logs data source context resolution #184079

Closed
davismcphee opened this issue May 23, 2024 · 12 comments · Fixed by #184601
Labels: enhancement, Feature:Discover, Project:OneDiscover, Team:DataDiscovery, Team:obs-ux-logs

Comments

@davismcphee
Contributor

📓 Summary

The first data source profile supported by One Discover will be "logs". This issue covers the initial implementation of a logs DataSourceProfileProvider, primarily its resolve method. The aim should be to identify and broadly categorize logs as a data source category by inspecting the current ES|QL query or data view. Associated extension point implementations will be added later under separate issues.

Some ideas of how we might implement this:

  • ES|QL queries and data views both have index patterns which we can rely on in the context resolution process.
  • We can check for the existence of specific fields based on the index pattern through the field caps API.
  • We can have a list of known logs indices which we can check the index pattern against for a match.

✔️ Acceptance criteria

  • Define a set of heuristics to identify and categorize logs data sources.
  • Create and register a logs DataSourceProfileProvider with a resolve method based on the defined heuristics.
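
A minimal sketch of what such a provider could look like. All types here (`ResolveParams`, `ResolveResult`, `DataSourceProfileProvider`) and the `profileId` value are simplified, hypothetical stand-ins rather than the actual Discover interfaces, and the index pattern is assumed to have already been extracted from the ES|QL query or data view. The matching heuristic is injected so it can evolve independently.

```typescript
// Hypothetical, simplified types; the real Discover profile types may differ.
type DataSourceCategory = 'logs' | 'default';

interface ResolveParams {
  // Index pattern taken from the ES|QL query or the data view (assumed pre-extracted).
  indexPattern?: string;
}

type ResolveResult =
  | { isMatch: true; context: { category: DataSourceCategory } }
  | { isMatch: false };

interface DataSourceProfileProvider {
  profileId: string;
  resolve: (params: ResolveParams) => ResolveResult;
}

// Builds a logs provider around whatever heuristic is passed in.
const createLogsDataSourceProfileProvider = (
  isLogsIndexPattern: (pattern: string) => boolean
): DataSourceProfileProvider => ({
  profileId: 'logs-data-source-profile', // illustrative id only
  resolve: ({ indexPattern }) => {
    if (!indexPattern || !isLogsIndexPattern(indexPattern)) {
      return { isMatch: false };
    }
    return { isMatch: true, context: { category: 'logs' } };
  },
});
```

Injecting the heuristic keeps `resolve` trivial and lets the set of heuristics be defined and tested on its own.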
@davismcphee davismcphee added Feature:Discover Discover Application enhancement New value added to drive a business result Team:DataDiscovery Discover, search (e.g. data plugin and KQL), data views, saved searches. For ES|QL, use Team:ES|QL. Team:obs-ux-logs Observability Logs User Experience Team Project:OneDiscover Enrich Discover with contextual awareness labels May 23, 2024
@elasticmachine
Contributor

Pinging @elastic/obs-ux-logs-team (Team:obs-ux-logs)

@elasticmachine
Contributor

Pinging @elastic/kibana-data-discovery (Team:DataDiscovery)

@tonyghiani tonyghiani self-assigned this May 31, 2024
@tonyghiani
Contributor

Initial criteria for classifying a log data source

  1. We can inspect the passed index pattern to understand if it matches a well-known list of log sources.
    • Anything that starts with logs-: e.g. logs-foo-bar.
    • Anything that starts with filebeat-, winlogbeat-, logstash- or auditbeat-
  2. Any CCS (cross-cluster search) prefix on one of the above patterns should be supported.
    • remote_cluster:logs-* should be resolved as a logs data source
    • remote_cluster:metrics-* should NOT be resolved as a logs data source
  3. A pattern of multiple comma-separated indices should be classified as a logs data source only if EVERY index matches criteria 1 or 2.
    • remote_cluster:logs-*,auditbeat-* should be resolved as a logs data source
    • remote_cluster:metrics-*,logs-* should NOT be resolved as a logs data source, as it's not a logs-only data source and we'll match logs at an individual document level.
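
The three criteria above can be sketched roughly as follows. This is an illustrative split-and-check version, not the existing Logs Explorer implementation; the `LOGS_PREFIXES` constant and all function names are assumptions made for the example, and the prefix list is hardcoded from criterion 1.

```typescript
// Assumed base prefixes, copied from criterion 1 above.
const LOGS_PREFIXES = ['logs-', 'filebeat-', 'winlogbeat-', 'logstash-', 'auditbeat-'];

// Criterion 2: drop an optional cross-cluster search prefix such as "remote_cluster:".
const stripCcsPrefix = (index: string): string => {
  const separator = index.indexOf(':');
  return separator === -1 ? index : index.slice(separator + 1);
};

// Criterion 1: a single index (or wildcard) must start with a known logs prefix.
const isLogsIndex = (index: string): boolean => {
  const name = stripCcsPrefix(index.trim()).toLowerCase();
  return LOGS_PREFIXES.some((prefix) => name.indexOf(prefix) === 0);
};

// Criterion 3: a comma-separated pattern matches only if EVERY entry matches.
const isLogsIndexPattern = (pattern: string): boolean => {
  const entries = pattern
    .split(',')
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
  return entries.length > 0 && entries.every((entry) => isLogsIndex(entry));
};
```

With this sketch, `isLogsIndexPattern('remote_cluster:logs-*,auditbeat-*')` is true, while `isLogsIndexPattern('remote_cluster:metrics-*,logs-*')` is false because one entry fails the check.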

The above logic is already implemented for Logs Explorer in buildIndexPatternRegExp and can be extracted into a package for reuse.
It relies on a list of base patterns from a Kibana setting (currently defaults to logs, auditbeat, filebeat, winlogbeat).
Still, we probably want to improve this and rely on more specific settings for logs used across Kibana, as this was an experimental setting intended for internal use only.

Warning

We might want to rely on a hardcoded value initially, as this setting is specifically registered by an Observability project (logs_explorer)

@ruflin
Contributor

ruflin commented May 31, 2024

On the index patterns, I would like to extend it to *filebeat* and the other beats as there are quite a few cases where users prefix the beat indices. Potentially we should apply the same for logs, so if you have foo-logs-{date} it also matches logs.

For the specific fields, I suggest we lean on log.level and message. We can't rely on @timestamp or the resource field as these will also show up for metrics, traces, synthetics. But having the two above in addition should be a strong indicator. It can happen that a metric event has a message field but in this scenario, hopefully the index name already resolved it.

Last, we can take all the ECS logs fields as a strong indicator: https://www.elastic.co/guide/en/ecs/current/ecs-log.html

@davismcphee What happens with the field caps if we have mixed data? For example if only part of the data has a message field, we should fall back to the default view as it could be a mix between logs and metrics as an example.

@ruflin
Contributor

ruflin commented May 31, 2024

BTW the comment here from @weltenwort covers the fields I have mentioned above in a nice summary: #184080 (comment)

Let's start simple, we can always add more.

@tonyghiani
Contributor

> On the index patterns, I would like to extend it to *filebeat* and the other beats as there are quite a few cases where users prefix the beat indices. Potentially we should apply the same for logs, so if you have foo-logs-{date} it also matches logs.

This makes sense, we can extend the current index pattern-matching capability 👌
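
One possible way to relax the matching for prefixed indices is sketched below. Matching on whole name tokens, rather than on a raw substring, is an assumption of this example (it keeps a name like catalogs-2024 from matching "logs"); `LOGS_BASE_NAMES` and `hasLogsToken` are hypothetical names.

```typescript
// Assumed base names; a token must equal one of these to count as a match.
const LOGS_BASE_NAMES = ['logs', 'filebeat', 'winlogbeat', 'logstash', 'auditbeat'];

// Checks a single index name or wildcard (not a comma-separated list).
const hasLogsToken = (index: string): boolean => {
  // Drop an optional CCS prefix; indexOf returns -1 when absent, so slice(0) keeps the name.
  const name = index.slice(index.indexOf(':') + 1).toLowerCase();
  // Split on anything that is not a letter or digit: "-", "_", ".", "*", and so on.
  const tokens = name.split(/[^a-z0-9]+/).filter((token) => token.length > 0);
  return tokens.some((token) => LOGS_BASE_NAMES.indexOf(token) !== -1);
};
```

Here `hasLogsToken('foo-logs-2024.05.31')` and `hasLogsToken('*filebeat*')` are both true, while `hasLogsToken('catalogs-2024')` is false. For comma-separated patterns, each entry would still need to be checked individually.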

Regarding the fields, it might be more difficult to tell whether a field applies to everything matching the pattern or only to part of it, because AFAIK the data view fields are merged and returned for everything matching the pattern without any grouping (e.g. logs-*,metrics-* produces a single merged fields map).

The index pattern will be our strongest indicator at first, and then we can rely on ECS logs fields to double-check the match.

@davismcphee
Contributor Author

> @davismcphee What happens with the field caps if we have mixed data? For example if only part of the data has a message field, we should fall back to the default view as it could be a mix between logs and metrics as an example.

@ruflin I looked into this today, and I believe there's a way through field caps, although it's a bit convoluted and I need verification from ES that my understanding is correct (I reached out to them in Slack for confirmation).

In any case, this info isn't included by default in the data view field list, and adding it by default would cause a significant increase in data view field caps data transfer. I would suggest we create a utility to use in context resolution that sends a bespoke field caps request checking for only the fields we care about during context resolution, e.g. await fieldsExistInMatchingIndices('my-index-pattern-*', ['message', 'log.level']);. This should keep the request fast and the response size small without affecting the main data view field list fetching. I'll investigate a bit further and follow up.
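
A rough sketch of the checking half of such a utility, operating on an already fetched field caps response. The response shape below is simplified, and it assumes the request was sent with include_unmapped=true so that indices lacking a field are reported under an "unmapped" entry; that behavior is exactly the part that still needs confirmation. `FieldCapsEntry`, `FieldCapsResponse`, and `fieldsExistInAllIndices` are hypothetical names.

```typescript
// Simplified, assumed shape of a field caps response (include_unmapped=true).
interface FieldCapsEntry {
  type: string;
  // Indices this entry applies to; assumed omitted when it applies to all of them.
  indices?: string[];
}

interface FieldCapsResponse {
  indices: string[];
  fields: Record<string, Record<string, FieldCapsEntry>>;
}

// Returns true only if every requested field is mapped in every matching index.
const fieldsExistInAllIndices = (
  response: FieldCapsResponse,
  fieldNames: string[]
): boolean =>
  response.indices.length > 0 &&
  fieldNames.every((fieldName) => {
    const capsByType = response.fields[fieldName];
    // Absent entirely: no matching index maps this field.
    if (!capsByType) {
      return false;
    }
    // Assumption: an "unmapped" entry means at least one index lacks the field.
    return !('unmapped' in capsByType);
  });
```

Restricting the request to the handful of fields of interest keeps the response small, and this all-or-nothing check is what would trigger the fallback to the default view for mixed data.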

@flash1293
Contributor

> For example if only part of the data has a message field, we should fall back to the default view as it could be a mix between logs and metrics as an example

The danger of that is that a single bad index can "poison" a query. At the field caps level we don't know whether an index will actually contribute to the shown result, so a one-off ill-configured index in a broad index pattern could yank us out of logs mode for no good reason.

One approach would be to only go by the index pattern for now and see how well that works in practice - especially if the field-level check is harder to do with an additional request.

@flash1293
Contributor

Related to this - we plan to introduce a central observability setting to define the "logs index patterns" which will be used by entity resolution as well. Should it be part of this as well?

cc @tommyers-elastic - did this get created already?

@tonyghiani
Contributor

> One approach would be to only go by the index pattern for now and see how well that works in practice - especially if the field-level check is harder to do with an additional request.

I agree with @flash1293. Given an index pattern match, we can be confident we're dealing with logs (strict matching on logs-, filebeat-, etc. should be enough to know the data view is about logs), so in this case a field caps request shouldn't even be necessary.

In case the index pattern doesn't strictly match our criteria, it can be really anything, particularly:

Index pattern                | Could be logs?
logs-*,metrics-*             |
logs-*,maybe-logs-pattern-*  | Maybe, we could assert via a field caps check

Even when the pattern might only refer to logs, it's non-trivial to know whether the fields apply to all the patterns.
I'd also rather start with the pattern only and then explore whether we can add more specific assertions against the data type.

> Related to this - we plan to introduce a central observability setting to define the "logs index patterns" which will be used by entity resolution as well.

This is surely useful because it also allows users to specify their own logs patterns on top of the criteria defined above, which should help make a pattern like logs-*,maybe-logs-pattern-* definitely a logs one.

> Should it be part of this as well?

I would keep that work separate from this implementation, since it covers both the definition/implementation of the setting and its usage in Discover. To avoid blocking this work, we can adopt the setting in a follow-up step as soon as it's ready.

@flash1293
Contributor

Agreed @tonyghiani - opened a follow-up issue here: #184670

@davismcphee
Contributor Author

If the Logs UX team is aligned on this, it also works for me 👍 Starting simple and seeing how far we can get with the index pattern alone makes sense to me. We can always revisit later if it ends up being insufficient.
