[meta] Add support for pipeline details in ECS #940

Open
webmat opened this issue Aug 18, 2020 · 4 comments

@webmat
Contributor

webmat commented Aug 18, 2020

We'd like to define how to capture pipeline details in ECS.

Pipelines can come in many shapes:

  • Agent to Elasticsearch
  • All the way from syslog => agent => Logstash => queue => Logstash => Elasticsearch ingest node => Elasticsearch

We'd like to define an approach, or at least provide guidance, on how to capture information about various kinds of pipelines. The information folks usually want to track falls into a few categories, usually across each step of their pipeline:

  • Technology (product, version)
  • Host name, address
  • Timing: plain timestamps at each step, processing duration per step
  • Pipeline name that processed the event
  • Pipeline error handling (where should information be populated within an ECS event?)
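To make the wish list concrete, here's a minimal sketch (Python) of one step recording the categories above, assuming a hypothetical pipeline.steps array; none of these field names exist in ECS:

from datetime import datetime, timezone

def record_pipeline_step(event, name, product, version, host):
    """Append one processing-step record to a hypothetical pipeline.steps
    array on the event (all field names here are invented, not ECS)."""
    event.setdefault("pipeline", {}).setdefault("steps", []).append({
        "name": name,        # pipeline name that processed the event
        "product": product,  # technology: product
        "version": version,  # technology: version
        "host": host,        # host name or address
        # plain timestamp at this step; per-step duration can be derived later
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    return event

event = record_pipeline_step({}, "syslog-ingest", "logstash", "7.9.0", "ls-edge-01")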

Past discussions around this:

#8, #40, #76, #154, #315, #453, #700, #730, #1027, #1059

@dainperkins
Contributor

Adding agent.ip, and maybe even agent.hostname, seems like a good plan to allow full identification of, e.g., the host Filebeat is running on in a syslog or API-pull scenario.

While I've used agent to describe Logstash (e.g. for NetFlow/syslog data), it seems like another field set to identify a "collector", and possibly a "queue", might make sense for end-to-end descriptions of all entities involved in data ingest?

Though that would not easily handle the dual-Logstash setup with an agent (assuming the agent => Logstash hop is necessary), which would suggest an array of hosts/IPs?
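For illustration only, one hypothetical shape for that dual-Logstash chain, where every hop appends itself to an ordered array (all field names here are invented, not ECS):

event = {
    "pipeline": {
        "hops": [  # ordered: first hop to last
            {"type": "agent",     "name": "filebeat-01",   "ip": "10.0.0.5"},
            {"type": "collector", "name": "logstash-edge", "ip": "10.0.1.7"},
            {"type": "queue",     "name": "kafka-eu",      "ip": "10.0.2.9"},
            {"type": "collector", "name": "logstash-core", "ip": "10.0.3.3"},
        ]
    }
}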

@ypid-geberit
Contributor

ypid-geberit commented Sep 6, 2021

I am proposing that something like agent.config_version be added as well. Background: pipeline config should be tracked in git. When the parsing changes, a similar log event may then be parsed differently, with no apparent reason for the end user (in case they look at event.original and which fields are populated). It could be useful, at least when developing/testing new log types/pipelines, to communicate the config/pipeline version to (test) users. As the value, I would suggest the output of git rev-parse --short HEAD, for example.
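A small sketch of how an agent could stamp that value, assuming the pipeline config lives in a git checkout (agent.config_version is only a proposal, and the path below is made up):

import subprocess

def pipeline_config_version(repo_dir):
    """Short commit hash of the pipeline-config repo, per the proposal."""
    return subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout.strip()

# Stamp every outgoing event, e.g. {"agent": {"config_version": "1a2b3c4"}}.
event = {"agent": {"config_version": pipeline_config_version("/etc/logstash/conf.d")}}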

@rsk0

rsk0 commented Aug 12, 2022

Articulated Arbitrarily-Structured Pipeline Details

Regarding timing, and considering just the example of arrival timestamps...

It's very important to be able to see latency throughout logging infrastructure. ECS currently offers only three timestamps for this purpose: event occurrence (@timestamp), record picked up by the pipeline (event.created, a confusing name), and record received into the data store (event.ingested). This small set of fields is too coarse and inflexible to handle sophisticated or large-scale operations well.

In order to support arbitrary pipeline arrangements, what if we had two related arrays, one for juncture name and one for juncture arrival time?

@timestamp = 1660329591000
event.created = 1660329592000
foo.pipeline.junctures = [ "fluent bit source", "kafka amsterdam", "message classification processor", "logstash us-east-1" ]
foo.pipeline.arrivals = [ 1660329593000, 1660329594000, 1660329595000, 1660329596000 ]
event.ingested = 1660329597000

Alternatively, a list of objects.

@timestamp = 1660329591000
event.created = 1660329592000
foo.pipeline.times = [
  { "junction_name": "fluent bit source", "arrival_time": 1660329593000 },
  { "junction_name": "kafka amsterdam", "arrival_time": 1660329594000 },
  { "junction_name": "message classification processor", "arrival_time": 1660329595000 },
  { "junction_name": "logstash us-east-1", "arrival_time": 1660329595000 }
]
event.ingested = 1660329597000

Either way, I don't think Kibana would be able to visualize this data? Still, the information would be there for operators to use via other methods.
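One payoff of either shape is that per-hop latency falls out with a few lines of arithmetic. A sketch against the array-based example above (epoch milliseconds):

junctures = ["fluent bit source", "kafka amsterdam",
             "message classification processor", "logstash us-east-1"]
arrivals = [1660329593000, 1660329594000, 1660329595000, 1660329596000]

# Bracket the junctures with the existing ECS timestamps.
points = ([("event.created", 1660329592000)]
          + list(zip(junctures, arrivals))
          + [("event.ingested", 1660329597000)])

for (prev, prev_ts), (cur, cur_ts) in zip(points, points[1:]):
    print(f"{prev} -> {cur}: {cur_ts - prev_ts} ms")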

ECS's Values-Agnostic Philosophy

I think ECS development has so far shied away from specifying values, i.e. the content of fields, and I understand this is important for fostering adoption by staying open to various sources and implementations. (Let me know if I'm reading things correctly?) However, if that's a firm philosophical stance for ECS development, I suspect the issue of recording articulated pipeline details can't be well handled via ECS per se.

I think there's an interoperability cost if ECS doesn't at least make recommendations about values. As a specific example, my company is having to devise its own severity-level values, and we likely won't do a better job than numerous companies collaborating around ECS, and certainly can't be as effective at encouraging broad adoption of the scheme, and thus interoperability, as Elastic/ECS would be.

Non-binding recommendations about values could continue ECS's non-specificity tack, avoiding any block to adoption, while at the same time helping foster interoperability via a kind of "proto-standard". I'm thinking of something like RFC 2119 "MAY" / "OPTIONAL".

log.level: The textual severity level of the original event. It can be whatever you please, but -- as one non-binding possibility, no pressure -- it MAY be one of these values, with these meanings:
"trace": blah
"debug": blah
"information": blah
...
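To sketch what a "MAY"-level recommendation could enable in practice, here's a hedged normalization example (the target vocabulary and aliases are invented for illustration, not an actual ECS recommendation):

RECOMMENDED_LEVELS = {"trace", "debug", "information", "warning", "error", "critical"}

# Invented aliases from common vendor spellings to the recommended set.
VENDOR_ALIASES = {
    "informational": "information",  # e.g. syslog severity 6
    "info": "information",
    "warn": "warning",
    "err": "error",
    "fatal": "critical",
}

def normalize_log_level(raw):
    level = raw.strip().lower()
    if level in RECOMMENDED_LEVELS:
        return level
    # "MAY", not "MUST": fall back to the original value when unmapped.
    return VENDOR_ALIASES.get(level, level)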

Maybe, if ECS stewardship wants to maintain a firm stance on values agnosticism, we could benefit from a consortium of ECS-using companies developing a values recommendation addendum to ECS.

@rsk0

rsk0 commented Aug 17, 2023

Quoting the issue description: "Timing: plain timestamps at each step, processing duration per step"

Just a note that issue #1059 (roughly, "fix the event.ingested and event.created fields") rolls up into this issue -- to make sure those bugs get addressed when this one is.
