
Add event.ingested as the ingest timestamp #582

Merged (6 commits) on Nov 19, 2019

Conversation

cwurm commented Oct 9, 2019

Especially in a security use case, the ingest timestamp is important. Unlike @timestamp, which contains the timestamp of when the event originally occurred (e.g. a process was started), event.ingested is meant to contain the timestamp of when an event arrived in the central data store (usually Elasticsearch).

This is an important timestamp to have since event and ingest can be far apart for various reasons:

  • An endpoint device (e.g. a laptop) is running and collecting data, but is unable to send it (e.g. because it is used offline for a while, such as on a plane).
  • The ingestion pipeline is backed up, or temporarily broken (e.g. too many events are coming in, so backpressure is applied through the system and events are spooled to disk in a messaging queue like Redis or Kafka, to be processed at a later time when the peak is over).
  • An attacker has gained control over a machine and has changed the system time to the distant past or future to hide security events from this machine from the usual scrutiny (e.g. queries and dashboards looking at the last N minutes/hours of data only).

Having the ingest timestamp in addition to the event timestamp makes it possible to ensure that all data is processed as it arrives in Elasticsearch (e.g. by a scheduled query on the ingest timestamp).
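For illustration, such a scheduled query could be little more than a range filter on the new field. A minimal sketch (the logs-* index pattern and the 15-minute window are placeholders, not anything defined in this PR):

GET logs-*/_search
{
  "query": {
    "range": {
      "event.ingested": {
        "gte": "now-15m"
      }
    }
  }
}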

It will also make it possible to analyze the relation between the event and ingest timestamps, e.g. to find events where the event timestamp is after the ingest timestamp, or significantly before it. This is useful both for security purposes (e.g. to find attackers manipulating system time) and for operational ones (e.g. there might be a problem in the ingest pipeline, an endpoint might be misconfigured, or an NTP server is having a bad day).

The easiest way to fill the ingest timestamp is to use an ingest processor in Elasticsearch like this (docs):

"set": {
  "field": "event.ingested",
  "value": "{{_ingest.timestamp}}"
}
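
For context, here is what a complete pipeline around that processor could look like. This is only a sketch: the pipeline name add-event-ingested is a hypothetical example, not something defined by this PR.

PUT _ingest/pipeline/add-event-ingested
{
  "description": "Stamp event.ingested with the time the document is indexed",
  "processors": [
    {
      "set": {
        "field": "event.ingested",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}

Documents can then be sent through it explicitly (?pipeline=add-event-ingested on the index request) or by pointing an index's index.default_pipeline setting at it.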

This PR adds event.ingested as a core field, the expectation being that it should be filled by all data sources (especially security/auditing data sources), so that queries which depend on seeing all data (especially infosec queries) can rely on it.

@willemdh (Contributor)

@cwurm So, together with event.created, in a normal situation this would be the order of the timestamps:

@timestamp => This is the date/time extracted from the event
event.created => Contains the date/time when the event was first read by an agent, or by your pipeline.
event.ingested => Contains the timestamp when an event arrived in Elasticsearch

Just some thoughts about a Logstash pipeline I'm using. What if an event only has second granularity? As the source of those events spawns a huge amount of events, I need to use the @timestamp generated by Elasticsearch, which has millisecond granularity, and not the timestamp in the original event. Do you also consider this auto-generated @timestamp to be put in event.ingested? Or in event.created?

cwurm (Author) commented Oct 11, 2019

@timestamp => This is the date/time extracted from the event
event.created => Contains the date/time when the event was first read by an agent, or by your pipeline.
event.ingested => Contains the timestamp when an event arrived in Elasticsearch

@willemdh Correct. Out of these three, I personally see @timestamp and event.ingested as essential, event.created maybe less so. As it stands, few data sources seem to fill event.created (I see it in Winlogbeat, Filebeat suricata.eve, Filebeat googlecloud.vpcflow).

Are you collecting it? Do you see a lot of value in it?

Just some thoughts about a Logstash pipeline I'm using. What if an event only has second granularity? As the source of those events spawns a huge amount of events, I need to use the @timestamp generated by Elasticsearch, which has millisecond granularity, and not the timestamp in the original event. Do you also consider this auto-generated @timestamp to be put in event.ingested? Or in event.created?

I'm not sure I follow. What do you mean by "second granularity"?

willemdh (Contributor) commented Oct 11, 2019

@cwurm

Are you collecting it? Do you see a lot of value in it?

Actually I do use event.created in multiple custom-made pipelines, but maybe not always in the way it was meant to be used.

With second granularity I mean that the log source is sometimes not able to log the timestamp with milliseconds, for example:

2019-10-11 14:56:23

When that is the case with high-speed logging sources, I need to use the @timestamp created at ingestion by Elasticsearch, otherwise all events within a second have the same @timestamp.
This is the case, for example, in our F5 BIG-IP syslog:

<134>Oct 11 15:02:00 slot1/f5extpr info tmm[12871]: Authplatform 2.0 (Auth_Platform_2_dispatch Policy Selection) 1570794527 10.22.61.82:50607 Data Group List lookup for 'ap.stad.be' yielded 'weak' as result

So where do I put the Oct 11 15:02:00 then?

I can't put it in @timestamp as I put the auto-generated timestamp (with millisecond granularity) there. But I still need it, so I created f5.timestamp. An ECS place to put the original log timestamp would be useful too, imho.

Also, what if I am able to use the raw log timestamp, but I'd like to keep the auto-generated @timestamp? For example, for debugging the latency between the timestamp the log was created and the timestamp it was finally inserted into Elasticsearch. When playing around with Logstash workers for high-speed logging sources (such as a perimeter Palo Alto firewall), that can be quite useful.

Sorry if I was unclear. Imho there are even more usable timestamps and we should have a place for all of them.

log.timestamp => original raw log timestamp

log.syslog.timestamp => original raw log syslog timestamp (yes, I do have multiple log sources which have 2 timestamps, such as Brocade syslog and ESXi)

Example:

<190>Oct 11 14:21:57 MYSASWITCH raslogd: AUDIT, 2019/10/11-14:21:56 (CEST), [SRM-1001], INFO, RAS, NONE/root/NONE/None/CLI, ad_0/BrocadeG620/CHASSIS, 8.2.1a, , , , , , , System is alive.

In the above case, Oct 11 14:21:57 would go into log.syslog.timestamp and 2019/10/11-14:21:56 (CEST) would go into log.timestamp.

event.created => Contains the date/time when the event was first read by an agent, or by your pipeline. But what if, for example, Filebeat is sending to a Logstash pipeline? I can imagine reasons you'd want to know when Filebeat picked it up, but also when Logstash started the pipeline.

event.ingested => Contains the timestamp when an event arrived in Elasticsearch

Imho, having an ECS timestamp for every piece of the ingestion process would clear up confusion for everyone.

Grtz

cwurm (Author) commented Oct 14, 2019

@willemdh Thanks for the details!

When that is the case with high-speed logging sources, I need to use the @timestamp created at ingestion by Elasticsearch, otherwise all events within a second have the same @timestamp.

Makes sense. I think that's exactly where event.ingested would be used. @timestamp can be from the original event (even if not precise), then the user can choose between looking at event time (where several events might share the same timestamp because of missing granularity) or ingestion time (which should be more unique - though still not 100% in a high-frequency data source with sub-millisecond differences between subsequent events).

event.created => Contains the date/time when the event was first read by an agent, or by your pipeline. But what if, for example, Filebeat is sending to a Logstash pipeline? I can imagine reasons you'd want to know when Filebeat picked it up, but also when Logstash started the pipeline.

I can see how it could be useful to know when each event reached each step of the ingest pipeline. I think it would be hard to define these fields so they work well in most cases: many ingestion pipelines have many steps, e.g. Beat -> Logstash -> Kafka -> Logstash -> Elasticsearch is a lot of timestamps.

In the "standard" case, I think @timestamp for the original event time and event.ingested for the time when it was indexed into Elasticsearch will often be enough. Any other timestamp fields filled somewhere during ingestion can be added as custom fields on a case-by-case basis as needed.

@willemdh (Contributor)

Adding event.ingested is definitely needed, please continue with this PR. :)

But I still believe there should be an official syslog timestamp field. It's the only missing field to match all syslog-related info, now that we have finally cleared up the syslog priority / facility / severity. So something like log.syslog.timestamp.

I'd make a new GitHub issue for that, but if no one else cares then maybe I shouldn't.. 😃

@jordansissel

What if an event only has second granularity? As the source of those events spawns a huge amount of events, I need to use the @timestamp generated by Elasticsearch

In this scenario, @timestamp is the same as event.created because you are not using the original log's declared timestamp.

@willemdh (Contributor)

@jordansissel Not necessarily, as event.created is described as the time when the event was first read by an agent or by a pipeline. Some Beats modules add the field, but I'm not adding it in my Logstash pipeline with 'now', so the ingest timestamp is my best option. @timestamp would be equal to event.ingested (if it existed yet)...

cwurm (Author) commented Nov 6, 2019

@webmat The PR to fill event.ingested in Libbeat is progressing (elastic/beats#14001), so it would be good to get feedback on this soon.

webmat (Contributor) left a comment

Thanks for the PR @cwurm, and as usual thanks for the detailed thoughts @willemdh :-) An additional issue to discuss original timestamp granularity some more is welcome; thanks for bringing that to our attention.

Adding event.ingested is definitely needed, please continue with this PR

Agreed, I don't think there are any outstanding issues with the PR, other than my comment below.

In a surprising turn of events, Christoph will be out for a little bit, and he asked me to take over. So I'm providing feedback on a PR that I will finish up myself 😂

I'll see if I have access to push to Christoph's branch; I think not. I may have to start from this PR and create a new one.

      short: Ingest timestamp
      description: >
        Time when the event was ingested. This is different from `@timestamp`
        which is when the event originally occurred.
Contributor

I think we should be a little more explicit about the relationship with event.created here as well.

Here's what I'm thinking:

      short: Timestamp when an event arrived in the central data store
      description: >
        Timestamp when an event arrived in the central data store.

        This is different from `@timestamp`, which is when the event originally occurred.
        It's also different from `event.created`, which is meant to capture the first time an agent saw the event.

        In normal conditions, assuming no tampering, the timestamps should chronologically look like this: `@timestamp` < `event.created` < `event.ingested`.

Just like for @timestamp and event.created, the description is pretty verbose in order to clarify the relationships. I think we should fast-track this PR, and we may not necessarily need to adjust the @timestamp and event.created descriptions for now.
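
As an illustration of why that ordering is worth spelling out: a violation of it (e.g. @timestamp later than event.ingested, which could point to a tampered clock) could be flagged with a script query. This is only a sketch, with an arbitrary logs-* index pattern, and it assumes both fields are mapped as date and present on every document:

GET logs-*/_search
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "lang": "painless",
            "source": "doc['@timestamp'].value.toInstant().toEpochMilli() > doc['event.ingested'].value.toInstant().toEpochMilli()"
          }
        }
      }
    }
  }
}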

webmat (Contributor) commented Nov 13, 2019

@andrewkroh With Christoph out, can I elect you to double-check me on finishing up his PR? I think this is ready to merge if there's nothing major sticking out.

This relates to the Beats PR elastic/beats#14001, so cc @tsg as a reviewer on the Beats PR.

webmat requested review from andrewkroh and removed the review request for MikePaquette (Nov 13, 2019, 21:48)
ruflin (Member) left a comment

LGTM

@@ -276,3 +277,18 @@

This is mainly useful if you use more than one system that assigns
risk scores, and you want to see a normalized value across all systems.

- name: ingested
  level: core
Member

extended?

Contributor

I would like to introduce it directly as core. From the feedback we've seen, this timestamp is considered more useful than event.created.

A few reasons why that is so:

  • It's another system's timestamp, which can help detect tampering of the clock on the monitored machine
  • It can also be used to detect slowdowns in the overall pipeline, assuming no tampering

PRs such as elastic/beats#14001 could also help populate it broadly and reliably, without having to revisit all modules or all Beats.

If you have strong feelings and would really prefer to start by introducing it as extended, I can go with that in order to get this in quickly. But I think it would send the wrong message with regard to this timestamp's importance vs event.created.

Contributor

I'll wait for your response on this, and would like to merge this tomorrow if possible

Member

SGTM
