Remove Elasticsearch code as a dependency of Ingest Node Processor Execution #29233
Labels
:Data Management/Ingest Node
Execution or management of Ingest Pipelines including GeoIP
high hanging fruit
Meta
Summary
It is a long-requested feature to easily execute Ingest Pipelines within Logstash as installations/inputs/outputs grow beyond the limited ability of Ingest Node within an Elasticsearch cluster.
To achieve this goal, we must begin extracting Ingest's interfaces into a standalone
library whose processors can depend on other libraries (just not Elasticsearch).
Logstash could then feasibly read in the Ingest Pipeline as-is within the Logstash Pipeline Config.
Background & Motivation
There has been a lot of work around what it would look like to keep Elasticsearch Ingest Processors
compatible with Logstash Filter Plugins. One thing that made this difficult was that the two were
developed in different codebases and were not able to share code, which made it hard to keep feature parity over time. Even when we bring all the Processors into Logstash, they will live separately
from the existing Filters. For example, the Grok Processor will keep its existing behavior and config options, while the Logstash Grok Filter will operate as it has. One thing that could change is the underlying Grok evaluation engine and pattern library; there is no reason this cannot be shared!
Design & Implementation
In general, the idea is to move all the relevant classes that CompoundProcessor uses to execute a pipeline into their own library in libs/ingest, alongside the existing libs/grok in the ES repository.
libs/ingest
This library will only include the interfaces necessary for executing pipelines/processors. The reason for this is that we do not want to package all the existing Processors together, since that could bring dependencies like Maxmind into Core. We do not want that.
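To make the "interfaces only" idea concrete, here is a minimal sketch of what a dependency-free core could look like. Every name in it (SketchDocument, SketchProcessor, SketchPipeline, LowercaseSketchProcessor) is hypothetical and invented for illustration; none of these are the actual Ingest classes, and the real Processor interface has a different shape. The point is only that nothing in the core needs an Elasticsearch import.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the document that processors mutate.
final class SketchDocument {
    private final Map<String, Object> fields = new HashMap<>();

    Object get(String field) { return fields.get(field); }

    void set(String field, Object value) { fields.put(field, value); }
}

// Hypothetical core interface: the kind of thing libs/ingest would hold.
// Note that it only depends on the JDK.
interface SketchProcessor {
    String type();

    void execute(SketchDocument document);
}

// Runs its processors in order, loosely in the spirit of CompoundProcessor.
final class SketchPipeline implements SketchProcessor {
    private final List<SketchProcessor> processors;

    SketchPipeline(List<SketchProcessor> processors) {
        this.processors = new ArrayList<>(processors);
    }

    @Override
    public String type() { return "pipeline"; }

    @Override
    public void execute(SketchDocument document) {
        for (SketchProcessor processor : processors) {
            processor.execute(document);
        }
    }
}

// An implementation like this would live in a separate library
// (e.g. libs/ingest-common), not in the core interface library.
final class LowercaseSketchProcessor implements SketchProcessor {
    private final String field;

    LowercaseSketchProcessor(String field) { this.field = field; }

    @Override
    public String type() { return "lowercase"; }

    @Override
    public void execute(SketchDocument document) {
        Object value = document.get(field);
        if (value instanceof String) {
            document.set(field, ((String) value).toLowerCase(Locale.ROOT));
        }
    }
}
```

With this split, a heavyweight implementation (GeoIP with its Maxmind dependency, for instance) would sit in its own library and only implement the core interface.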
libs/<plugin/module-name>
Since we do not want to clutter the core interface library with implementation details of plugins, each module/plugin will have its own library sub-project within ES that will contain its respective Processor implementations. For example, modules:ingest-common will split out its Processor classes to libs/ingest-common but keep the IngestCommonPlugin definition. The same will be true for plugins:ingest-attachment, plugins:ingest-geoip, and plugins:ingest-user-agent.
What about Scripts?
I am not sure what we would plan to do with Script support for arbitrary languages, but Painless
is planning on splitting out into its own library, so we can theoretically pick that up into Logstash.
Logstash would potentially have to copy some of the ScriptContext and class-whitelisting that Elasticsearch does out of the box.
How Users Use These Libraries (Don't, just Don't)
WARNING: WE DO NOT WANT ANY USERS ACTUALLY DOING THIS. THIS IS A STOPGAP UNTIL
A BETTER APPROACH IS REACHED SO THAT LOGSTASH HAS ACCESS TO THESE LIBRARIES SOONER
All the necessary info to construct and execute processors will exist across libs:ingest, libs:ingest-common, libs:ingest-geoip, libs:ingest-attachment, libs:ingest-user-agent (and libs:lang-painless?).
These libraries can be built locally and loaded into a Java project's classpath.
How will Logstash Use These Libraries
This means that Logstash would be able to build these subprojects and vendor their jar artifacts, as well as any of their third-party dependencies. Once all these jars are loaded in Logstash's classpath... GAME ON!
How exactly Logstash chooses to execute these pipeline definitions is out of scope here, but there are a few options. One is to wrap these processors in classes that implement both the Processor and LogstashFilterPlugin interfaces. This option would enable Logstash to expose processor-level metrics.
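A rough sketch of the wrapping option, under heavy assumptions: IngestProcessorSketch and LogstashFilterSketch below are hypothetical single-method stand-ins for the Processor and LogstashFilterPlugin interfaces mentioned above, and a plain Map stands in for both the ingest document and the Logstash event. The real interfaces on either side look different; this only shows how one wrapper class could satisfy both while counting per-processor metrics.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the Ingest Processor interface.
interface IngestProcessorSketch {
    void execute(Map<String, Object> document);
}

// Hypothetical stand-in for a Logstash filter plugin interface.
interface LogstashFilterSketch {
    Map<String, Object> filter(Map<String, Object> event);
}

// Wraps one processor, exposes it through both interfaces,
// and records simple per-processor metrics.
final class MeteredProcessorFilter implements IngestProcessorSketch, LogstashFilterSketch {
    private final IngestProcessorSketch delegate;
    private final AtomicLong processed = new AtomicLong();
    private final AtomicLong failed = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();

    MeteredProcessorFilter(IngestProcessorSketch delegate) {
        this.delegate = delegate;
    }

    @Override
    public void execute(Map<String, Object> document) {
        long start = System.nanoTime();
        try {
            delegate.execute(document);
            processed.incrementAndGet();
        } catch (RuntimeException e) {
            // Count the failure, then let the caller decide how to handle it.
            failed.incrementAndGet();
            throw e;
        } finally {
            totalNanos.addAndGet(System.nanoTime() - start);
        }
    }

    @Override
    public Map<String, Object> filter(Map<String, Object> event) {
        // The Logstash-facing entry point delegates to the ingest-facing one.
        execute(event);
        return event;
    }

    long processedCount() { return processed.get(); }

    long failedCount() { return failed.get(); }

    long totalNanos() { return totalNanos.get(); }
}
```

Because the counters live on the wrapper rather than on the wrapped processor, the processor libraries themselves would not need to know anything about Logstash's metrics system.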
Decoupling Plan Overview
The first step is to recognize what is using what. Lots of these can be easily decoupled, but
some other things, like Ingest's use of ES's patched Mustache parsing for field-referencing, will
take some re-thinking.
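One possible way to think about the field-referencing problem is to put a small interface in the core library and let each host supply its own renderer. The sketch below is an assumption, not a plan from this issue: FieldTemplateSketch and SimpleFieldTemplate are hypothetical names, and the regex-based renderer is a deliberately simple stand-in that is not Mustache. It only handles flat `{{field}}` lookups against top-level keys.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical seam: the core library would depend on this interface only.
// Elasticsearch could back it with its Mustache support; another host
// could plug in something else.
interface FieldTemplateSketch {
    String render(Map<String, Object> document);
}

// Simple stand-in renderer (NOT Mustache): replaces {{name}} with the
// value stored under the top-level key "name", or with "" if absent.
final class SimpleFieldTemplate implements FieldTemplateSketch {
    private static final Pattern REFERENCE =
        Pattern.compile("\\{\\{\\s*([\\w.]+)\\s*\\}\\}");

    private final String template;

    SimpleFieldTemplate(String template) {
        this.template = template;
    }

    @Override
    public String render(Map<String, Object> document) {
        Matcher matcher = REFERENCE.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            Object value = document.get(matcher.group(1));
            String text = value == null ? "" : value.toString();
            // quoteReplacement keeps '$' and '\' in values literal.
            matcher.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

A processor written against FieldTemplateSketch would not care which renderer it is given, which is the kind of decoupling the Mustache dependency would need.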
Import This!
Here is a list of Elasticsearch classes used by Ingest in various places, just to give
an idea. It is a big list that can easily be reduced down to only a few classes.
Note, this is just for the main code. Ingest depends on a lot of things in Elasticsearch's Test Framework, but we do not need to worry about that since tests will not be used externally.
Progress Steps
Move Processor and other related interfaces from server/ingest to libs/ingest