
[Logs+] Add pipeline that parses JSON log events into top-level fields #96083

Merged
merged 41 commits into elastic:main from json-parse-logs on May 23, 2023

Conversation

eyalkoren
Contributor

Closes #95522

Note for reviewer

I added validation for pipeline dependencies, similar to the validation we have for composable index templates, so that a pipeline can only be installed if the pipelines it refers to are already installed.
Due to the complexities of fully automated resolution of pipeline dependencies, I simplified the solution by letting each registry define the required dependencies manually. I think this makes sense, as the pipelines required by another pipeline are part of the specific information known to the declaring registry, similar to the pipeline's ID and its related file.

NOTE that the validation doesn't take the versions of pipeline dependencies into account: it is satisfied if the required pipelines are installed, based on their IDs, regardless of their version. I think this is a sufficient requirement, since it prevents race conditions and it doesn't affect the consistency of ingested data in rolling upgrades (meaning: if the referenced pipeline changes, ingested documents may be inconsistent either way).
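For illustration, this is roughly what a pipeline dependency looks like at the Elasticsearch level: the referring pipeline calls another pipeline via a pipeline processor, and the registry-level validation ensures that the referenced pipeline is installed first. The request below is a hypothetical sketch, not one of the PR's actual resource files.

# hypothetical referring pipeline; "logs@json-message" is its dependency and must already be installed
PUT _ingest/pipeline/my-logs-default-pipeline
{
  "processors": [
    {
      "pipeline": {
        "name": "logs@json-message"
      }
    }
  ]
}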

@eyalkoren eyalkoren requested review from jbaiera and felixbarny May 14, 2023 14:06
@eyalkoren eyalkoren self-assigned this May 14, 2023
@elasticsearchmachine elasticsearchmachine added labels Team:Data Management (Meta label for data/management team) and external-contributor (Pull request authored by a developer outside the Elasticsearch team) May 14, 2023
@felixbarny
Member

There shouldn't be parsing-related rejections.

Right, it's "just" mapping issues that could happen as a result of adding more fields, parsed out of the JSON message, to the mapping.

As long as we don't map anything explicitly to object types, does this PR increase risk related to subobjects: false?

What I meant is that the JSON may contain conflicting keys, such as "foo" and "foo.bar", or that some of the keys in the JSON may conflict with metadata fields added by Filebeat, such as "host" vs "host.name".
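For illustration only (a hypothetical event, not taken from the PR), a message like this would be problematic once expanded to top-level fields:

{"foo": 1, "foo.bar": 2, "host": "my-hostname"}

Here "foo" would have to be mapped both as a number and as an object (to hold "foo.bar"), and the string "host" clashes with the "host" object that Filebeat populates with fields like "host.name".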

Maybe this is what you mean, but we could for now already add the JSON pipeline with the registry and just not call it from the logs-*-* pipeline. So if users want to use it, they can opt in by calling it from the logs@custom pipeline.

It would be a trivial change to call the JSON pipeline by default once we have more safety measures in place.
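For illustration, opting in would look roughly like this: users define (or extend) the logs@custom pipeline so that it calls the JSON pipeline via a pipeline processor. The pipeline name below is the one settled on later in this thread; treat the exact definition as a sketch rather than the PR's documented usage.

PUT _ingest/pipeline/logs@custom
{
  "processors": [
    {
      "pipeline": {
        "name": "logs@json-message"
      }
    }
  ]
}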

@eyalkoren
Contributor Author

eyalkoren commented May 17, 2023

Maybe this is what you mean, but we could for now already add the JSON pipeline with the registry and just not call it from the logs-*-* pipeline. So if users want to use it, they can opt in by calling it from the logs@custom pipeline.

Yes, that is what I meant. For logs users and for anyone else who is interested in this capability.
EDIT: sorry, this is specific to logs atm, as it only looks for the message field. Maybe there's a way to generalize it somehow, but that's definitely out of scope at this stage.
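To make the scope concrete, a minimal sketch of a pipeline that only handles a JSON-formatted message field might look like the following. This is an assumption-laden illustration, not the exact pipeline added in this PR (the real definition lives in the PR's resource files).

PUT _ingest/pipeline/logs@json-message
{
  "description": "Sketch: parse a JSON-formatted message field into top-level fields",
  "processors": [
    {
      "json": {
        "field": "message",
        "add_to_root": true,
        "if": "ctx.message instanceof String && ctx.message.startsWith('{')",
        "on_failure": [
          {
            "set": {
              "field": "error.message",
              "value": "message field could not be parsed as JSON"
            }
          }
        ]
      }
    }
  ]
}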

eyalkoren and others added 2 commits May 17, 2023 07:05
Co-authored-by: Felix Barnsteiner <[email protected]>
Co-authored-by: Felix Barnsteiner <[email protected]>
@ruflin
Contributor

ruflin commented May 17, 2023

Maybe we can split this up into 2 phases. First, have this "json" pipeline in place. Think of it like the ECS templates that can easily be embedded and that we keep optimising. In a second phase, we can enable it by default. By then, the pipeline is also tested. Or instead, we might make it a config option on the data stream.

@@ -0,0 +1,48 @@
{
Contributor

We need to discuss the naming of this pipeline, because at the moment it would conflict with a pipeline for the dataset: json and namespace: pipeline. We need to come up with a non-conflicting naming convention for these global assets. Ideally, the component templates and pipelines follow the same logic. @kpollich @joshdover you might have ideas here too.

Contributor

What's the point of this being a separate pipeline at all? Are we going to be reusing it elsewhere?

Contributor Author

@joshdover I think this separation has just proved useful, given the concerns raised by @felixbarny: it now allows us to remove it from the default pipeline while still letting users easily opt in by calling it from the logs@custom pipeline.

Contributor Author

Demonstrated now in this PR's test

@eyalkoren
Contributor Author

eyalkoren commented May 17, 2023

I think I don't have any more input for this.
@ruflin @felixbarny let me know once you decide on the following, so I can bring this to completion:

  • do we merge it, or block until other safety valves are in place?
  • if merging, do we disable it by default for now?
  • how to name the pipeline?

@felixbarny
Member

do we merge it, or block until other safety valves are in place?
if merging, do we disable it by default for now?

Let's merge it but make it opt-in for now.

how to name the pipeline?

Taking inspiration from pipeline names used in the elastic/integrations repo, I'd suggest pipeline-json-message.

@eyalkoren
Contributor Author

The pipeline is now disabled by default, with an easy opt-in option, as the test shows.
So we are only left with a decision about the pipeline naming - @ruflin

Taking inspiration from pipeline names used in the elastic/integrations repo, I'd suggest pipeline-json-message.

I don't think it matches the other config files around it very well.

@felixbarny
Member

Because at the moment it would conflict with a pipeline for the dataset: json and namespace: pipeline.

I think that's more of a theoretical concern, isn't it? Package-provided pipelines have the structure <type>-<dataset>-<version>, so I don't see the conflict with logs-json-pipeline. But if we still want to avoid that association, we should probably prefix pipelines that we set up in ES with pipeline-, such as pipeline-logs-json or pipeline-json-message.

@ruflin
Contributor

ruflin commented May 22, 2023

I think that's more of a theoretical concern, isn't it?

Not sure, and that is why I'm not comfortable with it. All the current integrations have a version prefixed, but what about the integrations that are built in Kibana? I'd rather be safe on this one.

I expect that over the coming months we will keep adding more reusable assets to Elasticsearch, like ECS templates and other ingest pipelines. For a user, it should be very easy to understand that these are assets loaded by the system and globally available, ideally with easy-to-remember names / a convention. As this is the first asset of this kind to make it in, we should come up with the convention.

Here some thoughts:

  • We don't need the pipeline part in the name; it is already a pipeline.
  • Should we prefix with the type these things are for? logs@json? If for all signals, signals@json? {type}@{processor}
    • How do the ECS templates play into this? logs@ecs-core, ecs-core@signals. But this is broader? ecs@system, ecs@managed?
  • The above should work for assets both with and without a version identifier. For ECS mappings, I expect these to have a version, but some base templates / pipelines will not.

@eyalkoren
Contributor Author

As agreed offline: we'll change the name to logs@json-message and follow up with a proper definition of the built-in pipeline naming convention.

@eyalkoren eyalkoren requested review from ruflin and felixbarny May 22, 2023 14:32
@eyalkoren eyalkoren merged commit 7d57731 into elastic:main May 23, 2023
@eyalkoren eyalkoren deleted the json-parse-logs branch May 23, 2023 03:22
@felixbarny felixbarny changed the title [Logs+] Automatically parse JSON log events into top-level fields [Logs+] Add pipeline that parses JSON log events into top-level fields Jun 7, 2023
Labels
:Data Management/Data streams (Data streams and their lifecycles), >enhancement, external-contributor (Pull request authored by a developer outside the Elasticsearch team), Team:Data Management (Meta label for data/management team), v8.9.0
Development

Successfully merging this pull request may close these issues.

[Logs+] Add JSON parsing pipeline