[Ingest Pipelines][discuss] Support custom pipelines #129

Closed

simitt opened this issue Feb 5, 2021 · 16 comments

Comments

simitt commented Feb 5, 2021

Feature Request

Allow users to customize Ingest Pipelines per data stream. This might be done in multiple iterations, e.g.
(1) Allow disabling the loading of the default pipelines, and document how custom pipelines can be set up manually. Removing pipelines could break data streams, so the package-spec could be extended to define which pipelines are required and which are optional; optional pipelines could then be disabled by users.
(2) Support customizing default pipelines. Building on top of (1), the spec could be extended to allow adding custom pipelines.

Why it is important

APM Server today allows users to define pipelines via a JSON file, and to configure whether ingest pipelines should be set up and overwritten. When migrating to managed APM Server, users should at least be able to set up their pipelines manually and ensure these are not overwritten.

Current State

@ruflin started a docs PR to document how users can set up their own pipelines.

rw-access pushed a commit to rw-access/package-spec that referenced this issue Mar 23, 2021:

* Removing outdated information
* Moving context to correct place
* Fixing links to anchors

axw commented Mar 25, 2021

One challenge around customising pipelines is that they are renamed by Fleet, based on the data stream and package version. When you upgrade the package, new pipelines will be installed and customisations to existing ones will be lost. Also, because pipelines may change between package versions, I'm not sure we can ever sensibly carry customisations across.

I'm inclined to find a solution that does not involve either manually modifying pipelines that are installed by Fleet, or changing which of those pipelines are run. Instead, can we limit the feature to enabling users to define a pipeline which runs either before or after the default pipeline (or both)? This could be implemented in multiple ways, e.g. post-default by using a final pipeline, or by having Fleet inject pipeline processor(s) into the package-defined pipeline, referencing the user-specified pipeline(s).
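
For the post-default variant, a minimal sketch using Elasticsearch's existing index.final_pipeline setting (all names here are hypothetical; this is not something Fleet does today):

PUT _ingest/pipeline/logs-foo-user-final
{
  "description": "User-owned pipeline, run after the default pipeline",
  "processors": [
    { "set": { "field": "labels.env", "value": "production" } }
  ]
}

PUT _index_template/logs-foo
{
  "index_patterns": ["logs-foo-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.final_pipeline": "logs-foo-user-final"
    }
  }
}

One caveat: index.final_pipeline is a single slot per index, so Fleet and the user would need to agree on who owns it.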

jalvz commented Mar 25, 2021

Users already have other ways to create ingest pipelines, right? (the ingest API, Kibana) What if we don't give them another way to do it and instead rely on existing mechanisms? In other words: how is the suggested way better or simpler?

axw commented Mar 25, 2021

The problem is not in modifying a pipeline, it's in carrying it across package upgrades. Users do not have an existing mechanism for that. They can create their own custom pipeline using the usual mechanisms.

Maybe we could just allow users to modify the pipeline installed by Fleet, and have Fleet perform a three-way merge with the new pipeline on upgrade. I'm not convinced that'll be simpler, but it's certainly an option.

ruflin commented Mar 25, 2021

I would prefer to stay away from a 3-way merge if possible. Instead, I think it would be nicer if Elasticsearch supported adding a list of ingest pipelines: elastic/elasticsearch#61185. That way we only need to merge the list of pipelines and not the pipelines themselves. The alternative is using full-fledged document routing, where Elasticsearch would support adding certain pipelines for certain data (streams): elastic/elasticsearch#63798

@felixbarny

I’ve been thinking about this in a different context and have created a proposal for a potential solution: elastic/elasticsearch#61185 (comment).

@axw @simitt could you have a look and see whether the proposal would work for the use case mentioned in this issue?

axw commented Apr 11, 2022

@felixbarny thanks for chiming in. I think that would work. One issue would be about ownership: who owns logs-foo once a user has customised it? If it was installed by a package, should uninstalling the package remove the customised pipeline? How do we know it has been customised?

I had in mind something similar, but reversed:

  • allow the pipeline processor to refer to a non-existent pipeline
  • have package-provided pipelines reference initially non-existent before/after pipelines with stable identifiers, using the pipeline processor

e.g. something like the following (sketched in console form after this list):

  • install package v1.0.0, it creates pipeline logs-foo-1.0.0. This pipeline will call pipeline processors logs-foo-before and logs-foo-after (before and after the builtin processors, respectively)
  • create logs-foo-before; it will now be called by logs-foo-1.0.0
  • upgrade package to v1.1.0, creates pipeline logs-foo-1.1.0 and removes logs-foo-1.0.0; it still calls logs-foo-before
  • uninstall package, removes pipeline logs-foo-1.1.0, but leaves logs-foo-before alone
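
In console form, the first two steps might look like this. All names are hypothetical, the set processor is just a stand-in for the package's builtin processors, and step 1 relies on the proposed ability to reference a pipeline that doesn't exist yet:

PUT _ingest/pipeline/logs-foo-1.0.0
{
  "description": "Installed by the package; calls optional user hook pipelines",
  "processors": [
    { "pipeline": { "name": "logs-foo-before" } },
    { "set": { "field": "event.module", "value": "foo" } },
    { "pipeline": { "name": "logs-foo-after" } }
  ]
}

PUT _ingest/pipeline/logs-foo-before
{
  "description": "User-owned; from now on called by logs-foo-1.0.0",
  "processors": [
    { "lowercase": { "field": "host.name", "ignore_missing": true } }
  ]
}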

@felixbarny

If it was installed by a package, should uninstalling the package remove the customised pipeline?

Good question. One approach would be to never remove the whole pipeline but to only remove the pipeline processor that's managed by the integration, which can be identified by a processor tag. One issue is that re-installing the integration could mess up the order. A potential solution would be to not remove the pipeline processor but just disable it with if: false.
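
A sketch of that idea (names hypothetical): the integration's entry carries a tag so it can be located on upgrade or uninstall, and "removal" flips its condition to if: false instead of deleting it, preserving the order of the user's own processors:

PUT _ingest/pipeline/logs-foo
{
  "processors": [
    {
      "pipeline": {
        "name": "logs-foo-integration",
        "tag": "managed-by-integration-foo",
        "if": "false"
      }
    },
    { "set": { "field": "labels.added_by_user", "value": "true" } }
  ]
}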

have package-provided pipelines reference initially non-existent before/after pipelines with stable identifiers, using the pipeline processor

I've been thinking about that, too. It would even be possible today by configuring the pipeline processors with ignore_failure=true. However, this still creates an exception in Elasticsearch, which adds overhead. We could try to avoid creating an exception when ignore_failure=true, but from looking at the code, I couldn't find an elegant way to do that. Not saying it's not possible, though.
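
Concretely, the workaround available today would be (names hypothetical):

PUT _ingest/pipeline/logs-foo-1.0.0
{
  "processors": [
    {
      "pipeline": {
        "name": "logs-foo-before",
        "ignore_failure": true
      }
    }
  ]
}

If logs-foo-before doesn't exist, documents still pass through, but Elasticsearch constructs and discards an exception for every document.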

felixbarny commented Apr 11, 2022

have package-provided pipelines reference initially non-existent before/after pipelines with stable identifiers, using the pipeline processor

Or we just create empty before/after pipelines where users can add their processors. (inspired by elastic/elasticsearch#61185 (comment))
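
That is, the install step itself would create the hook pipelines as empty but valid pipelines (name hypothetical):

PUT _ingest/pipeline/logs-foo-before
{
  "description": "Add custom processors here; intentionally installed empty",
  "processors": []
}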

axw commented Apr 11, 2022

Good idea. That sounds reasonable to me, since it's exactly what we do for component templates.
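
For comparison, a rough sketch of that component-template pattern (heavily simplified; real Fleet-managed index templates compose more pieces): an empty @custom component template is created at install time and referenced from the managed index template, so user edits to it survive package upgrades:

PUT _component_template/logs-foo@custom
{
  "template": {
    "settings": {}
  }
}

PUT _index_template/logs-foo
{
  "index_patterns": ["logs-foo-*"],
  "data_stream": {},
  "composed_of": ["logs-foo@custom"]
}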

ruflin commented Apr 11, 2022

Is there any cost attached to empty pipelines?

@joshdover

I've been thinking about that, too. It would even be possible today by configuring the pipeline processors with ignore_failure=true. However, this will still create an exception in Elasticsearch which adds overhead. We could try to avoid creating an Exception if ignore_failure=true but from looking at the code, I couldn't find an elegant solution to do that. Not saying it's not possible, though.

I've attempted to do this using an on_failure block with a fail processor, which gets very close to accomplishing this with existing functionality.

My idea was to use an on_failure block with a single fail processor whose condition skips the failure when it is a non-existent-pipeline error. However, I can't figure out how to access the _ingest.on_failure_* fields from within the condition, even though they are accessible from within the message option. I tried accessing them via _ingest. and ctx['_ingest'] as well, without any luck. I'm chatting with ES Data Management on Slack to determine how to work around this, but posting here in case you all have any ideas. Here's the pipeline I'm trying to create:

PUT _ingest/pipeline/test-pipeline
{
  "processors": [
    {
      "pipeline": {
        "name": "test-custom",
        "tag": "fleet-custom-caller",
        "on_failure": [
          {
            "fail": {
              "if": "_ingest.on_failure_processor_type != 'pipeline' || _ingest.on_failure_processor_tag != 'fleet-custom-caller' || !_ingest.on_failure_message.contains('non-existent pipeline [test-custom]')",
              "message": "test-custom pipeline failed with {{ _ingest.on_failure_message }}"
            }
          }
        ]
      }
    }
  ]
}

@felixbarny

I think it would be cleaner to create a new parameter, such as ignore_missing, and avoid an exception being created in the first place.
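
A sketch of what that could look like on the pipeline processor (the parameter name used here, ignore_missing_pipeline, is the spelling Elasticsearch ultimately adopted):

PUT _ingest/pipeline/logs-foo-1.0.0
{
  "processors": [
    {
      "pipeline": {
        "name": "logs-foo-before",
        "ignore_missing_pipeline": true
      }
    }
  ]
}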

@joshdover

Opened an issue to discuss this: elastic/elasticsearch#87323

@gbanasiak

Has elastic/kibana#133740 made this issue obsolete?

axw commented Oct 24, 2022

@gbanasiak thanks, yes, this is resolved by elastic/kibana#133740.
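
With that change, Fleet-managed ingest pipelines call an optional @custom pipeline per data stream via an ignore-missing pipeline processor, and users opt in simply by creating it, along these lines (dataset name hypothetical):

PUT _ingest/pipeline/logs-foo.access@custom
{
  "processors": [
    { "set": { "field": "labels.team", "value": "web" } }
  ]
}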

axw closed this as completed Oct 24, 2022