Remove Elasticsearch code as a dependency of Ingest Node Processor Execution #29233

Closed
3 tasks
talevy opened this issue Mar 23, 2018 · 2 comments
Labels
:Data Management/Ingest Node, high hanging fruit, Meta

Comments

@talevy
Contributor

talevy commented Mar 23, 2018

Summary

It is a long-requested feature to be able to easily execute Ingest Pipelines within Logstash, as installations/inputs/outputs grow beyond the limited capacity of Ingest Node within an Elasticsearch cluster.

To achieve this goal, we must begin the effort of extracting Ingest's interfaces into a standalone library whose processors can depend on other libraries (just not Elasticsearch).

Logstash could feasibly read in the Ingest Pipeline as-is within the Logstash Pipeline Config.

input {
  kafka { }
}

filter {
  ingest {
    pipeline => '
      {
        "description" : "Ingest pipeline",
        "processors" : [
          {"grok": {"field": "message", "patterns": ["%{IPORHOST:clientip} - %{QS:agent}"]}},
          {"date": {"field": "timestamp", "formats": [ "dd/MMM/YYYY:HH:mm:ss Z" ]}},
          {"geoip": {"field": "clientip"}},
          {"user_agent": { "field": "agent" }}
        ]
      }
    '
  }
}

output {
  elasticsearch {}
  kafka {}
  s3 {}
}

Background & Motivation

There has been a lot of work around what it would look like to keep Elasticsearch Ingest Processors compatible with Logstash Filter Plugins. One thing that made this difficult was that the two were developed in different codebases and were not able to share code, which makes keeping feature parity over time difficult. Even when we bring all the Processors into Logstash, they will live separately from the existing Filters. For example, the Grok Processor will keep its existing behavior and config options, while the Logstash Grok Filter will operate as it has. One thing that could change is the underlying Grok evaluation engine and pattern library; there is no reason this cannot be shared!

Design & Implementation

In general, the idea is that all the relevant classes used by the CompoundProcessor to execute a pipeline are to be moved to their own library in libs/ingest, alongside the existing libs/grok in the ES repository.

libs/ingest

This library will only include the necessary interfaces for executing pipelines/processors. The reason for this is that we do not want to package all the existing Processors together, since that would potentially bring dependencies like MaxMind into Core. We do not want that.
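
To make the shape of this concrete, here is a minimal sketch of the kind of dependency-free interfaces such a library could expose. The names echo the real server/ingest classes (Processor, IngestDocument), but the package and method signatures below are illustrative assumptions, not the actual Elasticsearch API.

package org.example.ingest;

import java.util.Map;

// Illustrative only: a processor contract with no imports from Elasticsearch core.
public interface Processor {

    // Processes the document; implementations may mutate it in place,
    // return a replacement, or return null to drop the document.
    IngestDocument execute(IngestDocument document) throws Exception;

    // Processor type name as it appears in pipeline definitions, e.g. "grok" or "geoip".
    String getType();

    // Minimal document abstraction so processors never need Elasticsearch classes.
    interface IngestDocument {
        Object getFieldValue(String path);
        void setFieldValue(String path, Object value);
        Map<String, Object> getSource();
    }
}

In that world, Pipeline and CompoundProcessor would be expressed purely against these interfaces.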

libs/<plugin/module-name>

Since we do not want to clutter the core interface library with implementation details of plugins, each module/plugin will have its own library sub-project within ES that contains its respective Processor implementations. For example, modules:ingest-common will split its Processor classes out into libs/ingest-common but keep the IngestCommonPlugin definition. The same will be true for plugins:ingest-attachment, plugins:ingest-geoip, and plugins:ingest-user-agent.
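
As an illustration only (built on the hypothetical interfaces sketched above, not the real ingest-common classes), a processor split out into such a library would depend on the core ingest interfaces plus whatever third-party libraries it needs, and nothing from Elasticsearch core:

package org.example.ingest.common;

import org.example.ingest.Processor;

// Hypothetical processor implementation living in a split-out library
// (e.g. a libs/ingest-common equivalent); it has no Elasticsearch dependency.
public final class LowercaseProcessor implements Processor {

    private final String field;

    public LowercaseProcessor(String field) {
        this.field = field;
    }

    @Override
    public Processor.IngestDocument execute(Processor.IngestDocument document) {
        Object value = document.getFieldValue(field);
        if (value instanceof String) {
            document.setFieldValue(field, ((String) value).toLowerCase());
        }
        return document;
    }

    @Override
    public String getType() {
        return "lowercase";
    }
}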

What about Scripts?

I am not sure what we plan to do about Script support for arbitrary languages, but Painless is planning to split out into its own library, so we could theoretically pick that up in Logstash. Logstash would potentially have to copy some of the ScriptContext and class-whitelisting setup that Elasticsearch provides out of the box.

How Users Use These Libraries (Don't, just Don't)

WARNING: WE DO NOT WANT ANY USERS ACTUALLY DOING THIS. THIS IS A STOPGAP UNTIL
A BETTER APPROACH IS REACHED SO THAT LOGSTASH HAS ACCESS TO THESE LIBRARIES SOONER

All the necessary info to construct and execute processors will exist across libs:ingest, libs:ingest-common, libs:ingest-geoip, libs:ingest-attachment, libs:ingest-user-agent (and libs:lang-painless?)

These libraries can be built locally and loaded into a Java project's classpath.
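
As a rough sketch of what "loaded into a Java project's classpath" could look like in practice, again using the hypothetical types sketched above rather than the real library API, executing a chain of processors from plain Java might look something like this:

package org.example.ingest.demo;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.example.ingest.Processor;
import org.example.ingest.common.LowercaseProcessor;

public final class PipelineDemo {

    // Trivial map-backed implementation of the sketched document interface.
    private static final class MapBackedDocument implements Processor.IngestDocument {
        private final Map<String, Object> source;

        MapBackedDocument(Map<String, Object> source) {
            this.source = source;
        }

        public Object getFieldValue(String path) { return source.get(path); }
        public void setFieldValue(String path, Object value) { source.put(path, value); }
        public Map<String, Object> getSource() { return source; }
    }

    public static void main(String[] args) throws Exception {
        // Compose processors in order, the way a pipeline definition lists them.
        List<Processor> processors = List.of(new LowercaseProcessor("agent"));

        Map<String, Object> source = new HashMap<>();
        source.put("agent", "Mozilla/5.0");
        Processor.IngestDocument document = new MapBackedDocument(source);

        for (Processor processor : processors) {
            document = processor.execute(document);
            if (document == null) {
                return; // a processor dropped the document
            }
        }
        System.out.println(document.getSource()); // {agent=mozilla/5.0}
    }
}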

How will Logstash Use These Libraries

This means that Logstash would be able to build these subprojects and vendor their jar artifacts, as well as any of their third-party dependencies. Once all these jars are loaded in Logstash's classpath... GAME ON!

How exactly Logstash chooses to execute these pipeline definitions is out of scope here, but there are a few options. One is to wrap each processor in a class that implements both the Processor and LogstashFilterPlugin interfaces; this option would enable Logstash to expose processor-level metrics.
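
Here is a rough sketch of that wrapping option. LogstashFilterPlugin is a stand-in with an assumed shape (this is not the real Logstash plugin API), and the metrics are just simple counters:

package org.example.ingest.logstash;

import java.util.concurrent.atomic.LongAdder;

import org.example.ingest.Processor;

// Assumed stand-in for a Logstash filter plugin contract; not the real Logstash API.
interface LogstashFilterPlugin {
    Processor.IngestDocument filter(Processor.IngestDocument event) throws Exception;
}

// Wraps an ingest Processor so Logstash can drive it and expose processor-level metrics.
public final class IngestProcessorFilter implements Processor, LogstashFilterPlugin {

    private final Processor delegate;
    private final LongAdder invocations = new LongAdder();
    private final LongAdder failures = new LongAdder();

    public IngestProcessorFilter(Processor delegate) {
        this.delegate = delegate;
    }

    @Override
    public Processor.IngestDocument execute(Processor.IngestDocument document) throws Exception {
        invocations.increment();
        try {
            return delegate.execute(document);
        } catch (Exception e) {
            failures.increment();
            throw e;
        }
    }

    @Override
    public Processor.IngestDocument filter(Processor.IngestDocument event) throws Exception {
        return execute(event);
    }

    @Override
    public String getType() {
        return delegate.getType();
    }

    // Processor-level metrics Logstash could surface.
    public long invocationCount() { return invocations.sum(); }
    public long failureCount() { return failures.sum(); }
}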

Decoupling Plan Overview

The first step is to recognize what is using what. Lots of these dependencies can be easily decoupled, but some others, like Ingest's use of ES's patched Mustache parsing for field-referencing, will take some re-thinking.
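
For context on that last point, the field-referencing templates essentially substitute {{field}} references with values from the document. A toy, standalone sketch of that substitution, far less capable than ES's patched Mustache engine and purely illustrative:

package org.example.ingest.template;

import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration of standalone {{field}} substitution; the real templates
// go through ES's patched Mustache engine and the ScriptService.
public final class FieldReferenceTemplate {

    private static final Pattern REFERENCE = Pattern.compile("\\{\\{\\s*([\\w.]+)\\s*\\}\\}");

    private final String template;

    public FieldReferenceTemplate(String template) {
        this.template = template;
    }

    // Replaces each {{field}} reference with the corresponding source value.
    public String render(Map<String, Object> source) {
        Matcher matcher = REFERENCE.matcher(template);
        StringBuilder result = new StringBuilder();
        while (matcher.find()) {
            Object value = source.get(matcher.group(1));
            matcher.appendReplacement(result, Matcher.quoteReplacement(String.valueOf(value)));
        }
        matcher.appendTail(result);
        return result.toString();
    }
}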

Import This!

Here is a list of the Elasticsearch classes used by Ingest in various places, just to give an idea. It is a big list that can easily be reduced down to only a few classes.

Note that this is just for the main code. Ingest also depends on a lot of things in Elasticsearch's Test Framework, but we do not need to worry about that since the tests will not be used externally.

import org.elasticsearch.ElasticsearchException;                                                                                                                        
import org.elasticsearch.common.Nullable;                                                                                                                               
import org.elasticsearch.common.ParseField;                                                                                                                             
import org.elasticsearch.common.Strings;                                                                                                                                
import org.elasticsearch.common.bytes.BytesArray;                                                                                                                       
import org.elasticsearch.common.bytes.BytesReference;                                                                                                                   
import org.elasticsearch.common.component.AbstractComponent;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.io.stream.Writeable;
import org.elasticsearch.common.metrics.CounterMetric;
import org.elasticsearch.common.metrics.MeanMetric;
import org.elasticsearch.common.regex.Regex;
import org.elasticsearch.common.settings.ClusterSettings;
import org.elasticsearch.common.settings.IndexScopedSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.settings.SettingsFilter;
import org.elasticsearch.common.util.LocaleUtils;
import org.elasticsearch.common.util.concurrent.AbstractRunnable;
import org.elasticsearch.common.util.concurrent.ThreadContext;
import org.elasticsearch.common.util.set.Sets;
import org.elasticsearch.common.xcontent.ContextParser;
import org.elasticsearch.common.xcontent.DeprecationHandler;
import org.elasticsearch.common.xcontent.LoggingDeprecationHandler;
import org.elasticsearch.common.xcontent.NamedXContentRegistry;
import org.elasticsearch.common.xcontent.ObjectParser;
import org.elasticsearch.common.xcontent.ToXContent.Params;
import org.elasticsearch.common.xcontent.ToXContent;
import org.elasticsearch.common.xcontent.ToXContentFragment;
import org.elasticsearch.common.xcontent.ToXContentObject;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.common.xcontent.XContentHelper;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.common.xcontent.XContentParserUtils;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.common.xcontent.json.JsonXContent;
import org.elasticsearch.common.xcontent.json.JsonXContentParser;
import org.elasticsearch.script.ExecutableScript;
import org.elasticsearch.script.Script;
import org.elasticsearch.script.ScriptException;
import org.elasticsearch.script.ScriptService;
import org.elasticsearch.script.ScriptType;
import org.elasticsearch.script.TemplateScript;

Progress Steps

  • Remove dependency on Elasticsearch scripting for field-referencing
  • Move Processor and other related interfaces from server/ingest to libs/ingest
  • TBD
@talevy talevy added the Meta and :Data Management/Ingest Node labels Mar 23, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@talevy talevy changed the title from "Remove ES core (:server) as a dependency of Ingest Node Processor Execution" to "Remove Elasticsearch code as a dependency of Ingest Node Processor Execution" Mar 23, 2018
@talevy talevy closed this as completed Jul 19, 2018
@talevy talevy reopened this Jul 19, 2018
@talevy
Contributor Author

talevy commented Jul 12, 2019

Closing since this is no longer a concern

@talevy talevy closed this as completed Jul 12, 2019