
Add event.original setting to data stream #100320

Open
ruflin opened this issue Oct 5, 2023 · 3 comments
Labels
:Data Management/Data streams Data streams and their lifecycles Team:Data Management Meta label for data/management team

Comments


ruflin commented Oct 5, 2023

event.original is an ECS field that is useful in many scenarios, especially in the security context. Currently, many integrations add it as part of their ingest pipeline. Fleet also offers an option to opt into having the field, but it must be implemented in each integration individually. For more details, see elastic/integrations#4733

There are several problems with the current approach:

  • It only works if there is an integration for the dataset
  • Each integration developer must add the config option to their integration
  • If something on the edge, like Elastic Agent or Logstash, has already added the field, the integrations need additional logic to deal with it.

Instead of repeating the same logic in many places, I propose adding a setting to data streams that controls whether the field is added, something like:

"data_stream": {
  "event.original": true
}

This means it is no longer the integration that decides whether event.original is captured; instead, it is set on the data stream. Many integrations can be used for observability or security. If the use case is security, the event.original setting can be turned on for all datasets without having to modify any integrations.

In scenarios where data is routed, this would also ensure that event.original contains the data as it was before routing, provided event.original: true is set on the data stream that triggers the routing.

Expected behaviour

The behaviour of the setting would be as follows:

  • If event.original does not exist, message is copied to event.original as the first action, before any ingest pipeline is applied.
  • If event.original already exists, nothing is done.
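A minimal sketch of this behaviour, expressed as an Elasticsearch ingest `set` processor (this is illustrative only; the proposal is for the data stream to apply the equivalent logic automatically, not for integrations to add this processor themselves):

```json
{
  "set": {
    "field": "event.original",
    "copy_from": "message",
    "if": "ctx.event?.original == null",
    "ignore_empty_value": true
  }
}
```

The `if` condition implements the "do nothing if the field already exists" rule, and `copy_from` copies message without modifying it.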

Change in integrations

At the moment, in integrations where we add event.original manually (1, 2), the integrations rename message to event.original and then perform all processing on event.original. I'm proposing to change this so that all processing stays on message, since integrations would now always have to assume event.original might not be present.
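For illustration, the current pattern in such integrations looks roughly like the following `rename` processor at the start of the pipeline (a sketch of the described behaviour, not a verbatim copy of any specific integration):

```json
{
  "rename": {
    "field": "message",
    "target_field": "event.original",
    "ignore_missing": true
  }
}
```

Under the proposed change, this rename would go away and subsequent processors would continue to read from message.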

Questions

  • How does event.original work in combination with TSDB / synthetic source?

Links

@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Oct 5, 2023

ruflin commented Oct 5, 2023

I had a good conversation with @P1llus about this issue. Currently, many integrations use event.original, not the message field, as the source for all processing. To ensure the event.original field does not stick around after processing and use up lots of storage, the final pipeline has a remove processor that checks for the tag preserve_original_event and removes event.original if the tag is not set.
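The remove processor described above could look roughly like this (a sketch of the pattern, assuming the tag check is done via a Painless condition on ctx.tags; the exact condition in the shipped final pipeline may differ):

```json
{
  "remove": {
    "field": "event.original",
    "if": "ctx.tags == null || !(ctx.tags.contains('preserve_original_event'))",
    "ignore_missing": true,
    "ignore_failure": true
  }
}
```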

The above could mean that two config options are needed:

  • A config option that adds event.original to enable the processing
  • A config option that decides whether event.original should be kept
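Purely as an illustration of the two options (the names enabled and keep below are hypothetical, not an existing or proposed API), the data stream setting could be extended like:

```json
"data_stream": {
  "event.original": {
    "enabled": true,
    "keep": false
  }
}
```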

I'm challenging whether event.original should be used as the source for processing instead of message, but it seems to be the current default in many integrations and, as @P1llus mentioned, it also has advantages: the data could be reindexed if needed and the same pipeline would still work.

@P1llus In the scenarios where the original event is not in message, how do integrations handle this at the moment? Pick the field where it is and put it into event.original?

@nik9000 nik9000 added the :Data Management/Data streams Data streams and their lifecycles label Oct 5, 2023
@elasticsearchmachine elasticsearchmachine added Team:Data Management Meta label for data/management team and removed needs:triage Requires assignment of a team area label labels Oct 5, 2023
@elasticsearchmachine

Pinging @elastic/es-data-management (Team:Data Management)

zez3 commented Aug 28, 2024

What I usually do is copy the message at the Beat level using the copy field + string(contains) processor, but that means you have to know exactly which part fails, which is not always the case.

I think the failure store discussed in #108559 and #95534 would be a better approach.
