Remove Elasticsearch code as a dependency of Ingest Node Processor Execution #29233

Closed
3 tasks
talevy opened this issue Mar 23, 2018 · 2 comments
Labels
:Data Management/Ingest Node, high hanging fruit, Meta

Comments

@talevy
Contributor

talevy commented Mar 23, 2018

Summary

It is a long-requested feature to be able to easily execute Ingest Pipelines within Logstash, as installations/inputs/outputs grow beyond the limited capacity of Ingest Node within an Elasticsearch cluster.

To achieve this goal, we must begin the effort of extracting Ingest's interfaces into a standalone library whose processors can depend on other libraries (just not Elasticsearch).

Logstash could feasibly read in the Ingest Pipeline as-is within the Logstash Pipeline Config.

input {
  kafka { }
}

filter {
  ingest {
    pipeline => '
      {
        "description" : "Ingest pipeline",
        "processors" : [
          {"grok": {"field": "message", "patterns": ["%{IPORHOST:clientip} - %{QS:agent}"]}},
          {"date": {"field": "timestamp", "formats": [ "dd/MMM/YYYY:HH:mm:ss Z" ]}},
          {"geoip": {"field": "clientip"}},
          {"user_agent": { "field": "agent" }}
        ]
      }
    '
  }
}

output {
  elasticsearch {}
  kafka {}
  s3 {}
}

Background & Motivation

There has been a lot of work around what it would look like to keep Elasticsearch Ingest Processors compatible with Logstash Filter Plugins. One thing that made this difficult was that the two were developed in different codebases and were not able to share code, which makes keeping feature parity over time difficult. Even when we bring all the Processors into Logstash, they will live separately from the existing Filters. For example, the Grok Processor will keep its existing behavior and config options, while the Logstash Grok Filter will operate as it has. One thing that could change is the underlying Grok evaluation engine and pattern library; there is no reason this cannot be shared!

Design & Implementation

In general, the idea is that all the relevant classes used by the CompoundProcessor to execute a pipeline are to be moved to their own library in libs/ingest, alongside the existing libs/grok in the ES repository.

libs/ingest

This library will only include the necessary interfaces for executing pipelines/processors. The reason for this is that we do not want to package all the existing Processors together, since that would potentially bring dependencies like MaxMind into Core. We do not want that.
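
To make the shape of this concrete, here is a minimal sketch of the kind of dependency-free interfaces such a library could expose. The names echo the real server/ingest classes (Processor, IngestDocument), but the package and method signatures below are illustrative assumptions, not the actual Elasticsearch API.

package org.example.ingest;

import java.util.Map;

// Illustrative only: a processor contract with no imports from Elasticsearch core.
public interface Processor {

    // Processes the document; implementations may mutate it in place,
    // return a replacement, or return null to drop the document.
    IngestDocument execute(IngestDocument document) throws Exception;

    // Processor type name as it appears in pipeline definitions, e.g. "grok" or "geoip".
    String getType();

    // Minimal document abstraction so processors never need Elasticsearch classes.
    interface IngestDocument {
        Object getFieldValue(String path);
        void setFieldValue(String path, Object value);
        Map<String, Object> getSource();
    }
}

In that world, Pipeline and CompoundProcessor would be expressed purely against these interfaces.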

libs/<plugin/module-name>

Since we do not want to clutter the core interface library with implementation details of plugins, each module/plugin will have its own library sub-project within ES that contains its respective Processor implementations. For example, modules:ingest-common will split its Processor classes out into libs/ingest-common but keep the IngestCommonPlugin definition. The same will be true for plugins:ingest-attachment, plugins:ingest-geoip, and plugins:ingest-user-agent.
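
As an illustration only (built on the hypothetical interfaces sketched above, not the real ingest-common classes), a processor split out into such a library would depend on the core ingest interfaces plus whatever third-party libraries it needs, and nothing from Elasticsearch core:

package org.example.ingest.common;

import org.example.ingest.Processor;

// Hypothetical processor implementation living in a split-out library
// (e.g. a libs/ingest-common equivalent); it has no Elasticsearch dependency.
public final class LowercaseProcessor implements Processor {

    private final String field;

    public LowercaseProcessor(String field) {
        this.field = field;
    }

    @Override
    public Processor.IngestDocument execute(Processor.IngestDocument document) {
        Object value = document.getFieldValue(field);
        if (value instanceof String) {
            document.setFieldValue(field, ((String) value).toLowerCase());
        }
        return document;
    }

    @Override
    public String getType() {
        return "lowercase";
    }
}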

What about Scripts?

I am not sure what we plan to do about Script support for arbitrary languages, but Painless is planning to split out into its own library, so we could theoretically pick that up in Logstash. Logstash would potentially have to copy some of the ScriptContext and class-whitelisting setup that Elasticsearch provides out of the box.

How Users Use These Libraries (Don't, just Don't)

WARNING: WE DO NOT WANT ANY USERS ACTUALLY DOING THIS. THIS IS A STOPGAP UNTIL
A BETTER APPROACH IS REACHED SO THAT LOGSTASH HAS ACCESS TO THESE LIBRARIES SOONER

All the necessary info to construct and execute processors will exist across libs:ingest, libs:ingest-common, libs:ingest-geoip, libs:ingest-attachment, libs:ingest-user-agent (and libs:lang-painless?)

These libraries can be built locally and loaded into a Java project's classpath.
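
As a rough sketch of what "loaded into a Java project's classpath" could look like in practice, again using the hypothetical types sketched above rather than the real library API, executing a chain of processors from plain Java might look something like this:

package org.example.ingest.demo;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.example.ingest.Processor;
import org.example.ingest.common.LowercaseProcessor;

public final class PipelineDemo {

    // Trivial map-backed implementation of the sketched document interface.
    private static final class MapBackedDocument implements Processor.IngestDocument {
        private final Map<String, Object> source;

        MapBackedDocument(Map<String, Object> source) {
            this.source = source;
        }

        public Object getFieldValue(String path) { return source.get(path); }
        public void setFieldValue(String path, Object value) { source.put(path, value); }
        public Map<String, Object> getSource() { return source; }
    }

    public static void main(String[] args) throws Exception {
        // Compose processors in order, the way a pipeline definition lists them.
        List<Processor> processors = List.of(new LowercaseProcessor("agent"));

        Map<String, Object> source = new HashMap<>();
        source.put("agent", "Mozilla/5.0");
        Processor.IngestDocument document = new MapBackedDocument(source);

        for (Processor processor : processors) {
            document = processor.execute(document);
            if (document == null) {
                return; // a processor dropped the document
            }
        }
        System.out.println(document.getSource()); // {agent=mozilla/5.0}
    }
}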

How will Logstash Use These Libraries

This means that Logstash would be able to build these subprojects and vendor their jar artifacts, as well as any of their third-party dependencies. Once all these jars are loaded in Logstash's classpath... GAME ON!

How exactly Logstash chooses to execute these pipeline definitions is out of scope here, but there are a few options. One is to wrap each processor in a class that implements both the Processor and LogstashFilterPlugin interfaces; this option would enable Logstash to expose processor-level metrics.
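
Here is a rough sketch of that wrapping option. LogstashFilterPlugin is a stand-in with an assumed shape (this is not the real Logstash plugin API), and the metrics are just simple counters:

package org.example.ingest.logstash;

import java.util.concurrent.atomic.LongAdder;

import org.example.ingest.Processor;

// Assumed stand-in for a Logstash filter plugin contract; not the real Logstash API.
interface LogstashFilterPlugin {
    Processor.IngestDocument filter(Processor.IngestDocument event) throws Exception;
}

// Wraps an ingest Processor so Logstash can drive it and expose processor-level metrics.
public final class IngestProcessorFilter implements Processor, LogstashFilterPlugin {

    private final Processor delegate;
    private final LongAdder invocations = new LongAdder();
    private final LongAdder failures = new LongAdder();

    public IngestProcessorFilter(Processor delegate) {
        this.delegate = delegate;
    }

    @Override
    public Processor.IngestDocument execute(Processor.IngestDocument document) throws Exception {
        invocations.increment();
        try {
            return delegate.execute(document);
        } catch (Exception e) {
            failures.increment();
            throw e;
        }
    }

    @Override
    public Processor.IngestDocument filter(Processor.IngestDocument event) throws Exception {
        return execute(event);
    }

    @Override
    public String getType() {
        return delegate.getType();
    }

    // Processor-level metrics Logstash could surface.
    public long invocationCount() { return invocations.sum(); }
    public long failureCount() { return failures.sum(); }
}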

Decoupling Plan Overview

The first step is to recognize what is using what. Lots of these dependencies can be easily decoupled, but some others, like Ingest's use of ES's patched Mustache parsing for field-referencing, will take some re-thinking.
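
For context on that last point, the field-referencing templates essentially substitute {{field}} references with values from the document. A toy, standalone sketch of that substitution, far less capable than ES's patched Mustache engine and purely illustrative:

package org.example.ingest.template;

import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration of standalone {{field}} substitution; the real templates
// go through ES's patched Mustache engine and the ScriptService.
public final class FieldReferenceTemplate {

    private static final Pattern REFERENCE = Pattern.compile("\\{\\{\\s*([\\w.]+)\\s*\\}\\}");

    private final String template;

    public FieldReferenceTemplate(String template) {
        this.template = template;
    }

    // Replaces each {{field}} reference with the corresponding source value.
    public String render(Map<String, Object> source) {
        Matcher matcher = REFERENCE.matcher(template);
        StringBuilder result = new StringBuilder();
        while (matcher.find()) {
            Object value = source.get(matcher.group(1));
            matcher.appendReplacement(result, Matcher.quoteReplacement(String.valueOf(value)));
        }
        matcher.appendTail(result);
        return result.toString();
    }
}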

Import This!

Here is a list of the Elasticsearch classes used by Ingest in various places, just to give an idea. It is a big list that can easily be reduced down to only a few classes.

Note that this is just for the main code. Ingest also depends on a lot of things in Elasticsearch's Test Framework, but we do not need to worry about that since the tests will not be used externally.

import org.elasticsearch.ElasticsearchException;                                                                                                                        
import org.elasticsearch.common.Nullable;                                                                                                                               
import org.elasticsearch.common.ParseField;                                                                                                                             
import org.elasticsearch.common.Strings;                                                                                                                                
import org.elasticsearch.common.bytes.BytesArray;                                                                                                                       
import org.elasticsearch.common.bytes.BytesReference;                                                                                                                   
import org.elasticsearch.common.component.AbstractComponent;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.io.stream.Writeable;
import org.elasticsearch.common.metrics.CounterMetric;
import org.elasticsearch.common.metrics.MeanMetric;
import org.elasticsearch.common.regex.Regex;
import org.elasticsearch.common.settings.ClusterSettings;
import org.elasticsearch.common.settings.IndexScopedSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.settings.SettingsFilter;
import org.elasticsearch.common.util.LocaleUtils;
import org.elasticsearch.common.util.concurrent.AbstractRunnable;
import org.elasticsearch.common.util.concurrent.ThreadContext;
import org.elasticsearch.common.util.set.Sets;
import org.elasticsearch.common.xcontent.ContextParser;
import org.elasticsearch.common.xcontent.DeprecationHandler;
import org.elasticsearch.common.xcontent.LoggingDeprecationHandler;
import org.elasticsearch.common.xcontent.NamedXContentRegistry;
import org.elasticsearch.common.xcontent.ObjectParser;
import org.elasticsearch.common.xcontent.ToXContent.Params;
import org.elasticsearch.common.xcontent.ToXContent;
import org.elasticsearch.common.xcontent.ToXContentFragment;
import org.elasticsearch.common.xcontent.ToXContentObject;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.common.xcontent.XContentHelper;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.common.xcontent.XContentParserUtils;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.common.xcontent.json.JsonXContent;
import org.elasticsearch.common.xcontent.json.JsonXContentParser;
import org.elasticsearch.script.ExecutableScript;
import org.elasticsearch.script.Script;
import org.elasticsearch.script.ScriptException;
import org.elasticsearch.script.ScriptService;
import org.elasticsearch.script.ScriptType;
import org.elasticsearch.script.TemplateScript;

Progress Steps

  • Remove dependency on Elasticsearch scripting for field-referencing
  • Move Processor and other related interfaces from server/ingest to libs/ingest
  • TBD
@talevy talevy added the Meta and :Data Management/Ingest Node labels Mar 23, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@talevy talevy changed the title from "Remove ES core (:server) as a dependency of Ingest Node Processor Execution" to "Remove Elasticsearch code as a dependency of Ingest Node Processor Execution" Mar 23, 2018
@talevy talevy closed this as completed Jul 19, 2018
@talevy talevy reopened this Jul 19, 2018
@talevy
Contributor Author

talevy commented Jul 12, 2019

Closing since this is no longer a concern

@talevy talevy closed this as completed Jul 12, 2019