
[Discuss] Avoiding duplication of ECS field definitions #63

Closed · andrewkroh opened this issue Oct 15, 2020 · 57 comments

@andrewkroh (Member)

Currently each dataset duplicates the definitions of the ECS fields that it uses (data type, descriptions, examples, etc.). This puts a burden on package maintainers to copy the ECS definitions into the dataset and keep them in sync with ECS.

It would be simpler to develop and maintain a package if the dataset only required listing the names of the ECS fields that it uses (and the ECS version). When the package is built, the full field definitions for the specified fields can be imported from ECS.

The dataset would declare:

  • ECS version
  • List of ECS fields (maybe allowing for patterns like host.*, but that's part of the discussion)

The build step would create field declarations in YAML format that conform to the package spec.
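For illustration, the dataset-side declaration and the generated result might look like this (file names, keys, and the ECS version are hypothetical; the exact format is what this discussion is about):

    # Declared by the dataset (hypothetical file)
    ecs_version: 1.6.0
    fields:
      - destination.ip
      - http.request.method

    # Generated at build time (full definitions imported from ECS)
    - name: destination.ip
      type: ip
      description: IP address of the destination (IPv4 or IPv6).
    - name: http.request.method
      type: keyword
      description: HTTP request method.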

@mtojek (Contributor) commented Oct 15, 2020

/cc @ruflin @ph @ycombinator

Let me transfer this issue to the package-spec repository, which is the right place to talk about the format changes.

mtojek transferred this issue from elastic/integrations Oct 15, 2020
@andrewkroh (Member, Author) commented Oct 15, 2020

I'm not sure this affects the specification. Assuming that the build step used in elastic/integrations continues to generate packages that comply with the package-spec, then I assume no changes are necessary.

@mtojek (Contributor) commented Oct 15, 2020

Depending on the results of the discussion, we may consider changing the fields file format to contain references (pointers?) to ECS fields instead of duplicated entries.

Regarding the build step, the long-term plan is to get rid of the Magefile in the integrations repo and depend 100% on https://github.com/elastic/elastic-package . Ideally, integrations will remain a repository of config files without any Go code inside. The builder tools will be totally separated from the source of the integrations.

@mtojek (Contributor) commented Oct 15, 2020

@ycombinator and I had a discussion about possible mitigation of the ECS duplication. One of the ideas we talked about is based on introducing a new JSON/YAML file in the <data_stream>/_dev directory listing the ECS fields. This file would be processed during the build step and would result in a generated fields/ecs.yml with ECS content, using the versioned fields from the ECS repository.

The _dev directory is an extension to the package specification format that contains resources used to build or test packages, not strictly package assets like fields, configs, images, etc.

@ruflin (Contributor) commented Oct 16, 2020

If we add this to the package spec, I think it should not be a requirement. A dev that still wants to create all fields inside the dataset should be able to. I'm OK with making it a generic feature, but would first iterate on it in integrations.

@mtojek (Contributor) commented Oct 16, 2020

If we add this to the package spec, I think it should not be a requirement.

It must be in the package-spec to preserve the standard across all packages. It's not strictly the package specification, but the _dev extension. We decided to define _dev together with the package specification and not to create a separate one that depends on the package-spec (KISS).

A dev that still wants to create all fields inside the dataset should be able to

Regarding the requirement, I'm rather a fan of hard requirements that make the standard stricter. Let's say the ecs.yml is restricted and only generated during the build step. I'm open to discussion, but this would help us solve the original problem, unless developers need to modify/overwrite the ECS field definitions.

I'm ok to make it a generic feature but would at first iterate in integrations on it.

The feature will appear only in integrations, as the _dev folder is not copied to the package-storage.

@ruflin (Contributor) commented Oct 16, 2020

I think we are talking about the same thing? If a developer builds a package outside integrations in repo "foo", elastic-package can be used, but no _dev directory must be added; it is optional? That dev just has to define all their fields in some *.yml file?

@mtojek (Contributor) commented Oct 16, 2020

elastic-package can be used but no _dev directory must be added, it is optional.

Yes, the _dev directory is optional.

This dev just must define all his fields in some *.yml file?

I suppose so, but I'm wondering about the ecs.yml, whether it MUST always be generated (paired with the ECS field names in _dev). I'd like to hear more voices about it.

@webmat (Contributor) commented Oct 16, 2020

I agree with the idea of having an automated way to fetch ECS fields to reduce the manual work. By name or by glob would sound like a reasonable first step. Thanks for opening this, @andrewkroh!

Eventually we might also want to allow the removal of certain globs, as a second pass. E.g. include source.* but exclude source.as.*. We do have some field sets that have grown pretty big, and I understand that one of the design goals of packages is to have only the fields that are absolutely needed, and no more. 👍

I also agree with @ruflin that using this mechanism should be optional. Some packages may have nothing to do with what ECS covers.

Some packages might also want to override ECS fields. I think this is also acceptable. Legitimate reasons for doing so: clarifying a description in the context of the package, or adopting experimental ECS changes that aren't yet in the main ECS spec.

How this gets implemented in the package build process is up to you. But it's worth pointing out that the tools used by the ECS team can already help accomplish some of this (see USAGE.md). If this is helpful, we're happy to help you adopt our tool to deliver this, add features you need & so on. If you'd rather implement this from scratch in the package build, let us know if we can help :-)

@ruflin (Contributor) commented Oct 21, 2020

Some packages might also want to override ECS fields. I think this is also acceptable. Legit reasons of doing so: clarifying a description in the context of this package

This is spot on and one of the reasons we redefine it every time. It is not really about overwriting the field itself, but information like the title and description, to make it specific to the dataset.

@jalvz (Contributor) commented Nov 5, 2020

It would be great to extend the idea to avoid duplication of any fields (ECS or non-ECS).

@mtojek (Contributor) commented May 6, 2021

Let me propose a solution (a healthy one).

There are a few problems around field duplication that we need to address:

  1. Keep field definitions in sync with ECS (the definition is always the same across all packages).
  2. We don't want to install all fields in Elasticsearch - 99% of them are not required.
  3. ECS field updates (patch, minor?) shouldn't require rebuilding or redistributing a package.
  4. We don't want to create a special repository with the same/similar content as https://github.com/elastic/ecs

Solution:

The solution consists of the following items:

  • Auto-generate ECS package out of the http://github.com/elastic/ecs repository
  • Add field annotations: lazy (bool), external (bool)
  • Add property "dependencies" to the package manifest (NPM style)

Field annotations:

  1. Lazy field (bool) - the field definition won't be installed immediately, but it may be installed if another package asks for it.
  2. External field (bool) - the field is defined in one of the package's dependencies.

Manifest properties:

  1. Dependencies - a map that contains package dependencies in the NPM format (package name + version constraint); see the sketch below.
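For illustration, such a manifest property might look like this (property name and constraint syntax are hypothetical, not a settled format):

    dependencies:
      ecs:                  # hypothetical auto-generated ECS package
        version: "^1.9.0"   # NPM-style version constraint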

Comments:

Frankly speaking, I would be for delegating the responsibility of resolving the dependency graph to Kibana rather than including it in the build phase of a package, and I understand that it's not a simple task to enable dependency management there. I have a feeling that we could leverage it in the future.

Counter-argument to the build phase:

Sure, we can resolve dependencies during package build time, but it means that all packages in the Package Storage will end up with duplicated fields (larger repo size), and packages that are contributed directly to the Package Storage won't be able to use this mechanism.

@ruflin (Contributor) commented May 10, 2021

As you know, I strongly prefer to stay away from package dependencies as long as we can, since in general a lot of complexity comes along with these things.

@mtojek You mentioned in the beginning that fields.yml could contain references / pointers: #63 (comment) These would be resolved during build time. What is the issue with this solution?

@mtojek (Contributor) commented May 10, 2021

@mtojek You mentioned in the beginning that fields.yml could contain references / pointers: #63 (comment) These would be resolved during build time. What is the issue with this solution?

I'm concerned about the scaling/maintenance problem here - it's the same situation as with every library version bump.
To preserve idempotent package builds, we need to introduce a build-time dependency (versioned/commit hash) on the ECS package (this way we won't have any runtime dependencies). If the team behind ECS updates/bugfixes some fields in the original repo, these changes should be reflected in packages, which means that somebody (a maintainer) will have to update them manually (or automatically, it doesn't matter). Every ECS update may trigger an avalanche of pull requests with field updates (and package rebuilds).

I'm looking forward to seeing more comments around this puzzle.

@ruflin (Contributor) commented May 11, 2021

Solving the problem of pulling in a specific version of ECS seems much simpler and more self-contained than starting to create a dynamic dependency graph that needs live resolving in Kibana, and much less error prone. The graph also brings along the problem that a package installed today or tomorrow might be different because of a change in ECS, and I don't think this should be the case. The same package installed should always be the same.

For the ECS version resolution, it would be nice if by default a package creator did not have to think about it. The fields.yml supports references, and when elastic-package build is run, it copies in the correct definitions from the current state. As soon as the package is pushed to snapshot, no reference exists anymore. This makes sure that, even if ECS changes, no changes are made to the package anymore.

The next time a user works on the package and runs elastic-package build, ECS will be updated if there is a new version. There should also be an option for the user to pin a constant version of ECS if needed.

One odd thing that might happen is that the user built a package with a certain ECS version, and by the time CI runs, a different version has been released and a diff appears. For this, a local file could be used to specify the ECS version in use, like other build tools do.
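For illustration, such a local pin file might look like this (location and keys hypothetical):

    # _dev/build/build.yml (hypothetical)
    dependencies:
      ecs:
        reference: git@v1.9.0   # ECS git tag used when resolving field references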

@mtojek (Contributor) commented May 11, 2021

It also brings along the problem, that a package installed today or tomorrow might be different because of a change in ECS and I don't think this should be the case. The same package installed should always be the same.

The original package will remain the same, only its dependency (same concept as a library) will be bumped up.

Anyway, I see that you're not in favor of introducing package dependencies. I'm OK with this, as I don't have the capacity to craft something more sophisticated alone.

then starting a create a dynamic dependency graph that needs live resolving

In fact, this is an external dependency on the ECS repository, and the "local file" you mentioned would be exactly a metafile describing the simple dependency graph. It would mean that we'll have a build-time dependency, resolved by elastic-package while processing fields. So elastic-package would not only process Kibana dashboards, but also render fields.

Any more comments around this? @jsoriano @exekias @andrewkroh

@exekias (Contributor) commented May 13, 2021

I think we should differentiate two use cases where we may want to apply some magic to bring in external fields:

  1. Fields that are not generated by the integration itself, but appended by the Agent. These are mostly the ones coming from add_*_metadata, and affect most (all?) integrations based on Beats inputs.
  2. Fields that are actually generated by the integration, but have a common definition somewhere else, like ECS. For instance a http.request.method field read from a log line.

For the first case, I think these fields are unexpected for the integrations developer; they don't necessarily need to know about them, or be able to reproduce a scenario populating them. For instance, cloud fields will only appear when the Agent is running in the cloud, yet we still need to make sure they get the right mapping. I don't think we should put the pressure of adding these fields on the integrations developer; for instance, a change in the Agent would mean we need to update all integrations.

For the second case, developers know about these fields as they are populating them, so here it would be a matter of providing shortcuts to avoid duplicating the effort of defining the mapping.

What problem are we discussing here? 1) or 2)? Do we agree they are different problems with (potentially) different solutions?

@mtojek (Contributor) commented May 13, 2021

What problem are we discussing here? 1) or 2)? Do we agree they are different problems with (potentially) different solutions?

It would be nice if we could find a common/single solution. Actually, 1) is more complex - if the agent starts emitting a new field, we would have to update ALL packages which may use it, which IMHO is a "no-go" for build-time dependencies.

@andresrc commented May 14, 2021

Thanks @exekias for the clarification. I agree, these are two different problems, and even though we could think of a unifying solution, I would prefer to find and implement simpler partial solutions to each of them, learn from the use cases, and, based on the evolution of the solutions and future requirements (e.g., will integrations be able to add new processors at some point?), think about how to evolve/unify them in the future if needed.

I think the initial issue we are discussing here is (2). In this case, I think we want:

  • The integration author to be in control.
  • Not tied to the integration repository.

For this case a build-time solution seems more than reasonable: the author references (one or more) field libraries (by URL) and decides which fields to import, with or without modification. Achieving reproducible builds (which we should try for) implies just adding a hash of the imported content together with the URL.
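For illustration, such references might look like this (schema entirely hypothetical):

    # Hypothetical field-library imports for a package
    field_libraries:
      - url: https://github.com/elastic/ecs
        ref: v1.9.0
        sha256: "<hash of the imported content>"   # for build reproducibility
        import:
          - destination.ip
          - http.request.method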

(1), as @mtojek points out, smells like requiring a run-time approach, as processors change with stack/agent versions, independently of packages. This would require more time and thought, as there are more problems to solve (conflict resolution, updating mappings for existing packages at upgrade time, etc.), but besides thinking about it (because I think we need to go this route at some point) we would need a shorter-term solution. And that can be just using the same approach as for problem (2), with a caveat: this solution will be manageable while most packages are maintained by us and in the same repo, as we can upgrade the packages with a stack release when needed. It would not be perfect, but it might be good enough to make progress.

@mtojek (Contributor) commented May 19, 2021

TL;DR:

  • build-time dependency on ECS fields
  • build-time dependency on Agent fields
  • manual update procedure for refreshing (syncing) field definitions

Let's sketch an execution plan for this:

  • Fields: define a new field property "external" (e.g. source: ecs or source: agent); see Introduce "external" property for fields #179
  • Package manifest: define build dependencies on external repositories (name + commit hash); see Define build time dependencies #181
  • Elastic Package: implement logic that can load external fields from the ECS repository during the build phase. Built packages will contain all field definitions (as they are present now); see the sketch after this list
  • Agent Field Schema: create a dedicated repository with common Agent and Beats fields (it will also include the ones for processors)
  • Elastic Package: update the field-loading logic to cover agent fields
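For illustration, a data stream's fields file could then reference ECS definitions by name only (exact syntax to be settled in #179):

    # fields/ecs.yml in the data stream source
    - name: destination.ip
      external: ecs          # full definition resolved from the ECS repo at build time
    - name: http.request.method
      external: ecs

During elastic-package build, each reference would be replaced with the complete definition pinned by the build-time dependency from #181.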

@andrewkroh (Member, Author)

That sounds like a good approach to me.

How would you expect the "includes" for processors to be specified? In the ECS case we are specifying the individual field names and importing the rest of the field's information from ECS ({field: destination.ip, source: ecs}). The processor fields case is different in that I think we want to include all of a given processor's fields.

@mtojek (Contributor) commented May 27, 2021

@andrewkroh would you mind providing links to some existing use cases? It would be much easier for me to analyze them.

@andrewkroh (Member, Author)

Agent uses the default Filebeat config, which forces any log dataset to include the processors add_host_metadata, add_cloud_metadata, add_docker_metadata, and add_kubernetes_metadata. So, as a use case, any log dataset should be declaring/including the fields from these processors.

Each of those processors has a set of fields that it can populate. I don't think the fields.yml files in Beats for these processors represent the full set of fields that they populate, but we can fix this. (The reason they are incomplete is that a processor's fields.yml does not include ECS fields, since those are defined elsewhere.)

Similarly, if a dataset's agent config uses a processor, then it would be nice to be able to simply include/import all the fields that the processor could populate. This is a less common use case. The only example I can think of is the cef module (it's the only place remaining where we [the security team] still use edge processing). But I expect there will be some future use cases in both logs/metrics that require a Beats processor.

@mtojek (Contributor) commented May 27, 2021

Thank you for the explanation, Andrew.

Each of those processors has a set of fields that they can populate. I don't think the fields.yml files in Beats for these processors represents the full set of fields that they populate, but we can fix this.

I'm afraid we'll have to write down these dependencies (processors with their fields) in some mapping file. As a first thought, I would recommend embedding it in the proposed "Agent Field Schema".

How would you expect the "includes" for processors to be specified?

The processor-fields mappings defined in the "Agent Field Schema" and marked as default will always be included (which will result in fields files for processors). We can figure this out while fleshing out the repository's draft.

Speaking of custom processors, what do you think about introducing additional manifest property for a data stream - uses:, e.g.

uses:
  processors:
  - decode_cef

It's a bit too verbose, but maybe we can reuse it in the UI for smarter navigation to docs (quickly jumping to the processor doc page).

@jsoriano (Member)

The proposed plan (#63 (comment)) sounds good to me for ECS, and initially for common agent fields, though I think that longer term a different solution is needed for agent fields. In principle all datasets can be used with composable providers (autodiscover), which include their own metadata, so with this approach all packages are going to need to include all these fields. Package developers shouldn't need to know which composable providers the agent supports, so maybe elastic-package should just include all these fields for a given version of the agent.

For that, the "Agent Common Schema" should include all the fields defined now in processors (also the x-pack ones for Nomad and Cloud Foundry) and in autodiscover providers, which also include metadata. Most autodiscover providers happen to have no fields.yml now only because they reuse the ones from their add_*_metadata counterparts.

A more future-proof alternative for agent fields could be to rethink them in a way that can be included under some dynamic mapping, so a fields definition is not needed for them. For example, reserving meta.*.* for these agent fields, and making Kubernetes processors and composable providers put all their non-ECS metadata under meta.kubernetes.*, Nomad ones under meta.nomad.*, and so on.
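For illustration, a single Elasticsearch dynamic template could cover such a meta.* namespace without per-field definitions (template name illustrative):

    {
      "mappings": {
        "dynamic_templates": [
          {
            "agent_meta_as_keyword": {
              "path_match": "meta.*",
              "match_mapping_type": "string",
              "mapping": { "type": "keyword", "ignore_above": 1024 }
            }
          }
        ]
      }
    }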

@ruflin (Contributor) commented May 28, 2021

I'm good with the approach. Let's focus first on ECS and continue the discussion on processor, autodiscover, and agent fields in a follow-up issue/proposal, to go step by step.

@mtojek (Contributor) commented May 28, 2021

Thank you, Jaime, for your input.

Package developers shouldn't need to know what composable providers agent supports, so maybe elastic-package should just include all these fields for a given version of the agent.

There is a thin line between lazy field loading and the approach we have in Beats (load them all), but I totally see your point.

For example reserving meta.*.* for these agent fields, and making kubernetes processors and composable providers put all their non-ECS metadata under meta.kubernetes.*, nomad ones under meta.nomad.* and so on.

We may hit a wall with backward consistency with Beats, but this is also an option. In fact, we would lose all the detailed descriptions here.

Summary

Looks like we have agreement on the first part of the solution (ECS fields), so I will proceed with that part and ask you to think about other potential options for agent/autodiscovery/processors. I have a feeling that finding a reasonable (best) solution for "autodiscovery" might be a tough challenge.

@urso commented Jul 5, 2021

I think we have to distinguish between short-term and long-term goals.

Long term, there should be no need for an agent-based schema. Right now the Agent even adds an ecs.version field. How can the agent tell that the reported version is really in line with the Input/Integration?
Instead, the Agent should not add any fields at all. The Input/Integration will be responsible for adding fields. The add_*_metadata processors should be replaced with alternatives. We provide system metadata that is available to inputs today. And data providers (replacing autodiscovery in Beats) can also acquire system metadata.
The issue at hand exists because Beats internals are heavily leaking into the Agent. Instead, we might think about ways to create inputs that can leverage the metadata and allow the integration author to add them properly.

That's why we're looking for an option to either create a similar schema to ECS, the Agent Fields Schema

What exactly should the Agent Fields Schema include? Just the fields currently added by libbeat (like ecs.version and host.name)? Or also fields from processors? Metricbeat/Heartbeat add their own metadata-like fields too, also not maintained by the Agent team. Some Beats/APM Server even configure libbeat to emit a different set of default fields.
Different components in different Beats can add fields, and they are maintained by different teams, with the Agent team being the only team that has not added/removed any of these fields for a long time.

TBH, I'm quite reluctant to make the Agent Team responsible for tracking changes to internal fields, potentially made by multiple teams.

@mtojek (Contributor) commented Jul 5, 2021

Ok, let me bring up the original use case again:

As the Fleet user I'd like to know which fields will be populated when I enable a specific integration. I'd like to browse and review them in the integration's README.

We managed to introduce an option to standardize all ECS fields (same descriptions, types, consistency). I believe the same actions should (can?) be performed for processors.

What exactly should the Agent Fields Schema include? Just the fields currently added by libbeat (like ecs.version and host.name)? Or also fields from processors?

I was thinking mainly about this metadata:

    "agent": {
        "ephemeral_id": "4d35807f-c708-46e6-97f3-b3369fbc34e8",
        "hostname": "docker-fleet-agent",
        "id": "d8213996-c24f-495c-96cb-f564b71a2762",
        "name": "docker-fleet-agent",
        "type": "filebeat",
        "version": "7.14.0"
    },

Instead, the Agent should not add any fields at all. The Input/Integration will be responsible for adding fields.

Out of curiosity - what about the above JSON?

I think we have to distinguish between short term and long term goals.
TBH, I'm quite reluctant to make the Agent Team responsible for tracking changes to internal fields, potentially made by multiple teams.

Agreed, but I'm afraid that switching to a true input-based approach is long term. That's why we focused on processors and are exploring options to describe the schema. A possible solution was described above: #63 (comment) . As there is no way to customize/enable additional processors in Fleet, we may include it in the short-term plan.

I'm open to working out a solution/plan that will satisfy all our needs.

@jsoriano (Member) commented Jul 5, 2021

agent fields will be maintained by the Agent team, they have to ensure that fields are inline with existing integrations.

TBH, I'm quite reluctant to make the Agent Team responsible for tracking changes to internal fields, potentially made by multiple teams.

I think we can discuss ownership later; if we decide to have an "Agent Fields Schema" (or however we call it), I agree that it doesn't necessarily need to be owned by the Agent team. We can leave this out of the discussion for now; let's focus on the technical solution.

How can the agent tell the reported version is really in line with the Input/Integration?
Instead, the Agent should not add any fields at all. The Input/Integration will be responsible for adding fields.

And data providers (replacing auto-discovery in Beats) can also acquire system meta-data.

Inputs and data providers are distributed with the agent; this would be the reason to couple the "Agent Fields Schema" with the Agent. Even if the Agent doesn't include any field itself, components distributed with it do. This "Schema" would include all the fields that can be added by all the inputs and data providers included in an Agent release.

Integrations should be independent of this schema, as they don't have any control over these fields; we would be coupling integrations to agent releases, and this is something we want to avoid. Integrations should only include the mappings for the fields of the information they collect.

Fields that are currently included in integrations but provided by the agent (or the inputs or data providers distributed with it), such as the mentioned ecs.version, would be moved to the "Agent Fields Schema".

@andresrc commented Jul 7, 2021

As I see it, we have added a mechanism to add ECS fields (which is a "well-known" and "privileged" schema) and we are now looking for a similar way to add additional sets of fields. The fact that those fields currently come from processors or the agent itself is an "implementation detail", so we should avoid naming that implies it, as it may change/evolve in the future. We could start with something like "Common Data Providers Schema" or something like that.

Shouldn't we go directly for multiple-schema support? Each schema would be identified by a URL (plus some hash for build reproducibility), and an integration package could just import fields from as many as needed.

@mtojek (Contributor) commented Jul 7, 2021

Shouldn't we go directly for multiple-schema support? Each schema would be identified by a URL (plus some hash for build reproducibility), and an integration package could just import fields from as many as needed.

This seems an intuitive and flexible option; agreed. It isn't hard to extend the current implementation.

The fact that those fields currently come from processors or agent itself is an "implementation detail" so we should avoid naming that implies that, as it may change / evolve in the future

I'm afraid we can't import every single field with the exported field definition. Processors may potentially add dozens of them. That's why I proposed the "uses" directive in #63 (comment) . With "uses" we can mark that the package utilizes processors, so the appropriate fields can be imported.

Another option would be extending the build.yml file with other "Common Data Providers" and an additional flag (fields: ["add_host_metadata", "..."]?) signaling to import a subset of fields (e.g. processors like add_kubernetes_metadata or add_host_metadata).
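For illustration, such an extension might look like this (keys, values, and the repository name are hypothetical):

    # build.yml (hypothetical extension)
    dependencies:
      ecs:
        reference: git@v1.10.0
      data_providers:
        reference: git@v0.1.0
        fields:                      # import only these providers' field sets
          - add_host_metadata
          - add_kubernetes_metadata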

@jsoriano (Member) commented Jul 7, 2021

As I see it, we have added a mechanism to add ECS fields (which is a "well-known" and "privileged" schema) and we are now looking for a similar way to add additional sets of fields.

@andresrc but take into account that these are sets of fields needed at different moments. The mechanism added for ECS fields is for integrations to more easily declare fields that they provide; this is specific to each integration. The "Common Data Providers Schema" is not specific to integrations, it is specific to the "data providers"; integrations don't have any control over these fields, and in my opinion they shouldn't have any reference to them.

We could start with something like "Common Data Providers Schema" or something like that.

SGTM.

Shouldn't we go directly for multiple-schema support? Each schema would be identified by a URL (plus some hash for build reproducibility), and an integration package could just import fields from as many as needed.

+1 to supporting multiple schemas in the external fields mechanism introduced for ECS, but it wouldn't solve the other issue.

That's why I proposed "uses" directive in #63 (comment) . With "uses" we can mark that the package utilizes processors, so appropriate fields could be imported.

This uses directive can be nice for processors that can be used by integrations and generate a known set of fields, such as decode_cef. It would also be handy to support reused fields. This mechanism would be more in line with the one introduced for ECS, but it wouldn't solve the other issue: integrations don't know about other fields that could be added by data providers.

Another option would be extending the build.yml file with other "Common Data Providers" with additional flag (fields: ["add_host_metadata", "..."]?) signaling to import a subset of fields (e.g. processors like add_kubernetes_metadata or add_host_metadata).

This would force all integrations to maintain a list of all the possible data providers; in my opinion this is a no-go. Actually, I think that packages and the package-spec shouldn't include anything related to the fields provided by data providers.

Maybe we should move this conversation to more specific issues. We could extract something like these tasks:

  • Implement the uses directive to import groups of fields under given objects (affects package-spec and elastic-package).
  • Support multiple schemas apart from ECS for uses and external (affects package-spec and elastic-package).
  • Define the "Common Data Providers Schema": extract from Beats/Agent something like ECS, but for the fields provided by processors, data providers, and so on. This schema can include external ECS fields. Any kind of data collector compatible with the Integrations/Fleet experience can only add fields covered by this schema (ownership to be decided).
  • Somehow install the "Common Data Providers Schema" on every data stream (affects only Fleet?).
  • Ensure that all data providers use fields defined in the "Common Data Providers Schema" (continuous effort of the teams maintaining data providers).
  • Ensure that integrations don't have fields conflicting with the "Common Data Providers Schema" (affects elastic-package, and Fleet if it does install checks).

@andresrc commented Jul 8, 2021

So, there are two approaches here:

  • Opt-in: the integration decides which of the data providers (I'm using provider as a more generic name than processor; we could use that for uses as well) are allowed to participate in the events.
  • Opt-out: we can add providers dynamically to any integration.

The main disadvantage of opt-in is that we need to explicitly list providers in integrations and update the package versions as those providers evolve. The main advantage is that everything can be solved at build time and is deterministic (there are no surprises between what I built and transformations performed at runtime).

The main advantage of opt-out is the higher decoupling between integration and provider development. The disadvantage is that conflicts can appear at runtime and that we need more runtime support.

Because of this I think we should start with the opt-in approach:

  • It seems easier to implement.
  • Everything can be checked at build time.
  • We control all the packages (at least for now).
  • We always have the possibility to move to an opt-out approach in the future.

@mtojek (Contributor) commented Jul 8, 2021

Thank you for joining the discussion; it seems that we've reached an agreement - a package uses specific data providers.
I will prepare a sample PR with a data providers schema (a few processors). I suppose we would like to keep it in a separate repository - any preferences here?

My candidates:

  • elastic-package-data-providers
  • package-data-providers
  • package-common-schema
  • also package-spec?

@andresrc commented Jul 8, 2021

package-data-providers sounds OK to me. I think it needs to be a different repo so that we can version it independently more easily.

One question: Can some of those providers reference ECS fields? If so, how would the reference work?

Can we send an email with a summary of the final proposal?

@mtojek (Contributor) commented Jul 8, 2021

One question: Can some of those providers reference ECS fields? If so, how would the reference work?

They would have to refer to ECS fields, but no fields would be stored in resolved form. I wouldn't resolve them at this level, to prevent dependency hell (a package uses ECS 1.10, some processors use ECS 1.9, others use ECS 1.8 - it's impossible to keep consistency between field definitions).

I think we can adopt a format similar to packages - a fields directory. Every YAML file represents a data provider (e.g. a Beats processor) and contains field definitions and references to ECS fields. When an integration uses a data provider, elastic-package will load the appropriate fields file.
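For illustration, one provider file might look like this (path, keys, and field choices hypothetical), mixing a provider-specific definition with an ECS reference:

    # fields/add_host_metadata.yml in the data providers repository (hypothetical)
    - name: host.name          # ECS field, kept as a reference rather than resolved
      external: ecs
    - name: host.os.build      # provider-specific field, defined inline
      type: keyword
      description: OS build information.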

Can we send an email with a summary of the final proposal?

Sure, I can do this. I will also highlight that the format of package-data-providers will be decided during the implementation phase (first pull requests).

@jsoriano (Member) commented Jul 8, 2021

Because of this I think we should start with the opt-in approach:

  • It seems easier to implement.
  • Everything can be checked at build time.
  • We control all the packages (at least for now).
  • We always have the possibility to move to an opt-out approach in the future.

@andresrc, for the data provider fields, this implies copying all the field definitions of all the providers into all the packages; are we OK with this solution for now?

For clarity, I am referring to the fields added by composable data providers, such as these ones: https://github.com/elastic/beats/blob/master/x-pack/elastic-agent/pkg/composable/providers/kubernetes/kubernetes.go#L103-L106

I agree that this solution is easier to implement (or looks easier), but it couples packages to agent releases; we would have to release all packages every time a new field is added to a composable provider included in the agent. And users would have to update all their integrations to the new versions every time an agent is upgraded to a version that includes new fields.

@mtojek (Contributor) commented Jul 8, 2021

I had a chat about the solution with @jsoriano.

Jaime raised concerns around kubernetes.go, which means that there are some fields added directly by elastic-agent, so we can't just trust the fields defined in integrations. Even if we document all the common fields used by integrations, the values/types of some of them may differ depending on the elastic-agent version.

@andresrc Is this an acknowledged risk, or should we raise this issue to the Agent team?

EDIT:

Oh, it looks like we posted at the same time :)

@andresrc commented Jul 8, 2021

Let's send a summary of the current proposal to the whole team and continue the conversation there, thanks!

@epixa commented Jul 8, 2021

One point I'm not totally clear on (apologies if this was discussed and I missed it): why is it preferred that these fields be maintained in a separate repo rather than contributed to ECS itself?

@djptek commented Jul 8, 2021

If the issue is e.g. syncing different Agents to different versions of ECS, I would imagine that would be easier to manage by using the ECS toolset to generate and/or extend the desired field subsets, rather than creating these externally and then e.g. adding new tools to reconcile them against ECS.

@djptek commented Jul 8, 2021

So, for example, generating a new JSON/YAML file in the <data_stream>/_dev directory with listed ECS fields looks interesting.

@mtojek (Contributor) commented Jul 8, 2021

One point I'm not totally clear on (apologies if this point was discussed, I may have missed it): why is it preferred that these fields be maintained in a separate repo rather than contributed to ecs itself?

Currently all these fields are scattered across the Beats code and most likely won't be good candidates to import 1:1 into ECS. Whenever a field is a good fit for ECS, we'll go for it. Also, we need a place that we own, similar to Beats, but with a higher consistency rate.

If the issue is e.g. syncing different Agents to different versions of ECS I would imagine that would be easier to manage by using the ECS toolset to generate &/or extend the desired field subsets, rather than creating these externally and then e.g. adding new tools to reconcile these against ECS

I'm not quite sure we'd like to follow this path. We have fields that are product oriented, e.g. Kubernetes, Docker. I'm not sure these are good candidates.

So, for example, generating a new JSON/YAML file in <data_stream>/_dev directory with listed ECS fields looks interesting

This feature has already been delivered. Integrations can import fields directly from ECS (based on the ECS version). If you're interested, here is the HOWTO guide.

@jsoriano (Member) commented Jul 8, 2021

If the issue is e.g. syncing different Agents to different versions of ECS I would imagine that would be easier to manage by using the ECS toolset to generate &/or extend the desired field subsets, rather than creating these externally and then e.g. adding new tools to reconcile these against ECS

Please notice that this issue actually includes two issues (see this comment):

  • Fields that are generated by integrations but defined by ECS. For this issue we already have a solution: there is now a way to include external fields in the fields definition of each integration (see Define build time dependencies #181).
  • Fields that are not generated by integrations, but by some processor or data provider included in the Agent. For this part we are proposing having the list of fields somewhere else. Even if all these fields end up in ECS, we would still need the list of fields somewhere else, because we don't want to install the whole of ECS in all the index mappings. We can use something like the mechanism implemented for the previous point to reuse ECS fields.

@djptek commented Jul 8, 2021

Re Kubernetes, there is a completed RFC 0012: Orchestrator field set creation - might that be relevant?

@ebeahan (Member) commented Jul 8, 2021

Currently all these fields are scattered across Beats code and most likely won't be good candidates to import 1:1 into ECS. Whenever a field is a good fit into the ECS, we'll go for it. Also, we need a place that we own, similarly to Beats, but with higher consistency rate.

Yes, there will likely always be fields that are not candidates for ECS, and it also seems reasonable to have the list of fields in one place vs. scattered. However, I do worry that over time, as new fields are considered and added to the "agent common schema", those fields may become less likely to make their way into ECS.

But we also already have similar challenges today. For example, os.build and os.codename in add_host_metadata are defined locally to the processor but perhaps should be included in ECS like many of the other add_host_metadata fields. Having the Agent's schema defined in a single location may make identifying these types of discrepancies easier.

@ruflin (Contributor) commented Jul 9, 2021

In general I like the deterministic approach we are following here. One concern I have is about the number of fields this will add to each dataset, and with it the increased size of all templates and mappings. This multiplies quickly with many datasets, and if we just add k8s fields to all integrations even when someone does not use them, it is not great.

There might be a partial way out here: dynamic mappings. Instead of configuring all the k8s fields in the referenced fields, a dynamic mapping would make sure the fields are dynamically mapped correctly; most are keywords anyway, so they likely do not even need the mapping to be set, as this is the default.

One completely different alternative is to use a more recent Elasticsearch feature whereby the mapping can be sent as part of the request. This way, the creation of these mappings would be delegated to Beats as part of the indexing request. But it would have to be investigated whether this causes issues with the permissions.
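This presumably refers to the dynamic_templates parameter of the bulk API (available since Elasticsearch 7.13), which lets the client name a dynamic template per field at index time; a sketch, with the index, template, and field names illustrative:

    POST _bulk
    { "index": { "_index": "my-index", "dynamic_templates": { "kubernetes.labels.app": "keywords" } } }
    { "kubernetes": { "labels": { "app": "frontend" } } }

Here "keywords" would be a dynamic template already defined in the index template.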

PS: I don't like that we keep mixing two discussions in a single issue. It keeps creating confusion. We should close this issue and have a separate one for the "current" discussion.

@jsoriano (Member) commented Jul 9, 2021

There is a concern I have about the number of fields this will add to each dataset and with it increase size of all templates and mapping. This multiplies quickly with many dataset and if we just add k8s fields to all integrations even if someone does not use it, it is not great.

Yes, this also worries me. I didn't want to add more factors to the discussion while this is in progress, but having many fields (and many sources of fields) may also complicate the controlled annotation of dimensions (elastic/elasticsearch#74014).

We have also proposed different approaches based on dynamic templates and mappings during this thread (#63 (comment), #63 (comment), #63 (comment)). This would actually be the ideal situation, because each data stream would only contain the mappings needed for the data it stores; we should dedicate some effort to exploring this line.

PS: I don't like that we keep mixing two discussions in a single issue. It keeps creating confusion. We should close this issue and have a separate one for the "current" discussion.

+1, and probably this discussion doesn't belong in the package spec, because, as I have said before, I don't think packages or integrations should have any reference to fields they don't control.

@mtojek (Contributor) commented Jul 9, 2021

@ruflin good idea about splitting the thread, as the ECS deduplication is done. Let's continue in the follow-up issue.

mtojek closed this as completed Jul 9, 2021
@djptek commented Jul 10, 2021

Sorry to re-open; I've been reading through the thread in detail and there are a couple of things I'd like to understand better.

I understand:

TL;DR:
build-time dependency on ECS fields
build-time dependency on Agent fields

I'm not sure about

manual update procedure for refreshing (syncing) field definitions

My concern is that we will move towards two distinct technologies for maintaining schemas. Perhaps more importantly than the field definitions themselves, if one of these technologies adds/removes a feature, it may then be incumbent on the other technology to implement it as well, which adds, rather than removes, dependencies.

The ECS Toolset currently provides for both a common schema AND extensions to that schema.

However, while the ECS repository currently accommodates the Common Schema, I fully understand there is a gap: a repository to house Agent fields.

So, here's my question: to prevent divergence (without getting into the details of design and/or reuse), is it technically feasible that the ECS and Agent fields might both share tooling derived from a common parent?

djptek reopened this Jul 10, 2021
@mtojek (Contributor) commented Jul 12, 2021

manual update procedure for refreshing

If you look into elastic/integrations, you will notice that in many cases developers tend to copy the same ECS field definitions (name, type, description, other properties) into multiple places. This way you can introduce inconsistencies or errors. We needed an adjustment here, so we decided to improve our builder tool - elastic-package.

My concern is that we will move towards 2 distinct technologies for maintaining schemas. Perhaps more importantly than the field definitions themselves, if one of these technologies adds/removes a feature it may then be incumbent on the other technology to implement this, which adds, not removes more dependencies.
The ECS Toolset currently provides for both a common schema AND extensions to that schema

I'm happy to review the ECS toolset; I didn't know about it, thanks! Please keep in mind that package development is relatively dynamic and we wouldn't like to get blocked while some fields are waiting for an official ECS release. We're just trying to organize them on our end (as they're now scattered across the entire Beats repository - field validation is impossible).

If you have any tools to create and manage field schema extensions in external repositories, please share them.

To prevent divergence (without getting into the detail of design &/or reuse) is it technically feasible that the ECS and Agent fields might both share tooling derived from a common parent?

I can review the ECS toolset you're offering, but we have one principle/hard requirement on our end - elastic-package is the only tool a developer needs to build a package/integration. I'm not sure about a common parent. Our intention was to introduce a lightweight solution to deduplicate field definitions and keep consistency; we definitely don't want to over-engineer here :)


I hope you don't mind if I close the issue so we can continue the discussion in the linked follow-up: #199 .
