ingest: processors configuration can too easily be misconfigured resulting in non-deterministic processor execution order #36134

Closed
turchanov opened this issue Dec 1, 2018 · 5 comments
Labels
>bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team

Comments

@turchanov

turchanov commented Dec 1, 2018

Elasticsearch version (bin/elasticsearch --version):

Version: 6.5.1, Build: default/rpm/8c58350/2018-11-16T02:22:42.182257Z, JVM: 1.8.0_192

Plugins installed: []
None

JVM version (java -version):

java version "1.8.0_192"
Java(TM) SE Runtime Environment (build 1.8.0_192-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.192-b12, mixed mode)

OS version (uname -a if on a Unix-like system):

4.14.35-1818.3.3.el7uek.x86_64 #2 SMP Mon Sep 24 14:45:01 PDT 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
The date ingest processor doesn't support chaining. That is, it cannot access fields that were created by preceding processors in the processor chain. The convert processor works fine in this situation.

Steps to reproduce:
This fails with java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [time] not present as part of path [time]

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "grok": {
          "field": "log",
          "patterns": [
            "\\[%{HTTPDATE:time}\\]"
          ]
        },
        "date": {
          "field": "time",
          "formats": [
              "dd/MMM/yyyy:HH:mm:ss Z"
          ]
        }
      }
    ]    
  }, 
  "docs":[
    {
      "_source": {
        "log": "[30/Nov/2018:04:00:17 +0000]"
      }
    }
  ]
}

although using the convert processor works fine:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "grok": {
          "field": "log",
          "patterns": [
            "\\[%{HTTPDATE:time}\\]"
          ]
        },
        "convert": {
          "field": "time",
          "type": "string"
        }
      }
    ]    
  }, 
  "docs":[
    {
      "_source": {
        "log": "[30/Nov/2018:04:00:17 +0000]"
      }
    }
  ]
}

Provide logs (if relevant):

@turchanov turchanov changed the title Date ingest processor cannot access fields created by other processors in a chain Date ingest processor cannot access fields created by other processors in a chain Dec 1, 2018
@jakelandis
Contributor

@turchanov - Thanks for reporting this. It appears that the root cause of your issue is a couple of missing braces (}, {) inside your pipeline definition: grok and date are declared in the same object instead of each getting its own.

The following works correctly:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "grok": {
          "field": "log",
          "patterns": [
            "\\[%{HTTPDATE:time}\\]"
          ]
        }
      },
      {
        "date": {
          "field": "time",
          "formats": [
            "dd/MMM/yyyy:HH:mm:ss Z"
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "log": "[30/Nov/2018:04:00:17 +0000]"
      }
    }
  ]
}

I am going to change the description on this issue and call this a bug that we don't fail with the configuration you provided. It is way too easy to get this wrong, and I had to stare at this for a while before I saw the issue.

The problem here is that the provided configuration results in a List of 1 element holding 2 HashMap entries, where the order in which the processors are executed is non-deterministic (it was just luck that it works with convert but not with date). The corrected configuration results in a List of 2 elements, each with 1 HashMap entry, so the processors are executed in a deterministic order.
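To make the difference in shape concrete, here is a minimal sketch in Python (not Elasticsearch code) that parses both processor lists and prints how many entries each list has and which processor names each entry holds. The variable names and the `shape` helper are hypothetical. One caveat: Python's dict keeps insertion order, so this only illustrates the structure, not the non-deterministic iteration described above.

```python
import json

# Misconfigured: ONE list element that holds BOTH processors.
broken = json.loads(r"""
{
  "processors": [
    {
      "grok": {"field": "log", "patterns": ["\\[%{HTTPDATE:time}\\]"]},
      "date": {"field": "time", "formats": ["dd/MMM/yyyy:HH:mm:ss Z"]}
    }
  ]
}
""")

# Corrected: TWO list elements, one processor each.
fixed = json.loads(r"""
{
  "processors": [
    {"grok": {"field": "log", "patterns": ["\\[%{HTTPDATE:time}\\]"]}},
    {"date": {"field": "time", "formats": ["dd/MMM/yyyy:HH:mm:ss Z"]}}
  ]
}
""")


def shape(pipeline):
    """Return the processor names found in each list element."""
    return [list(entry) for entry in pipeline["processors"]]


print(len(broken["processors"]), shape(broken))
print(len(fixed["processors"]), shape(fixed))
```

Only the list position guarantees ordering; processors that share a single object depend on how the map implementation happens to iterate its keys.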

@jakelandis jakelandis changed the title Date ingest processor cannot access fields created by other processors in a chain ingest: processors configuration can too easily be misconfigured resulting in non-deterministic processor execution order Dec 2, 2018
@jakelandis jakelandis added >bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP labels Dec 2, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-features

@jakelandis
Contributor

The fix here is to fail fast and not allow any configuration that would result in non-deterministic processor execution order.
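As a rough illustration of what such a fail-fast check could look like, the Python sketch below rejects any entry in the processors list that does not define exactly one processor. This is an assumption about the general approach, not the actual Elasticsearch implementation; `validate_processors` and its error message are hypothetical.

```python
def validate_processors(processors):
    """Raise ValueError unless every list entry defines exactly one processor."""
    for index, entry in enumerate(processors):
        if not isinstance(entry, dict) or len(entry) != 1:
            # Sort the names so the message does not depend on map ordering.
            found = sorted(entry) if isinstance(entry, dict) else entry
            raise ValueError(
                f"processors[{index}] must contain exactly one processor, "
                f"found: {found}"
            )
    return processors


# The corrected layout from this thread passes.
validate_processors([{"grok": {}}, {"date": {}}])

# The originally reported layout is rejected up front.
try:
    validate_processors([{"grok": {}, "date": {}}])
except ValueError as err:
    print(err)
```

Rejecting the request at parse time surfaces the misplaced braces directly, instead of a misleading "field [time] not present" error from whichever processor happens to run first.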

@turchanov
Author

Oh my... Indeed, that was my mistake.

@probakowski probakowski self-assigned this Dec 11, 2019
@rjernst rjernst added the Team:Data Management Meta label for data/management team label May 4, 2020
@joegallo
Contributor

joegallo commented Jan 4, 2023

Closing in favor of #41837 (that one's a duplicate of this one, but I slightly prefer the write-up there).
