ingest: bulk scripted_upsert runs the script after the pipeline #36745

Open
jakelandis opened this issue Dec 17, 2018 · 2 comments
Labels
>bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team

Comments

@jakelandis
Contributor

#36618 allows a default pipeline to be used with bulk upserts. However, the behavior of a bulk scripted_upsert with a default pipeline is surprising.

Given an index with a default pipeline:

DELETE test
PUT test
{
  "settings": {
    "index.default_pipeline": "bytes"
  }
}
PUT _ingest/pipeline/bytes
{
  "processors": [
    {
      "bytes": {
        "field": "bytes"
      }
    }
  ]
}

Performing a non-bulk upsert works as expected:

POST test/doc/1/_update
{
  "scripted_upsert": true, 
  "script":{
    "source": "ctx._source.bytes = '1kb'" 
  },
  "upsert" :{
    "foo" : "bar"
  }
}
GET test/doc/1

results in:

{
...
  "_source" : {
    "bytes" : 1024,
    "foo" : "bar"
  }
}

The script was evaluated, then the ingest pipeline ran normally. This matches the expectation that the script is always executed.
However, the same request issued through the _bulk API behaves surprisingly.

POST _bulk
{"update":{"_id":"2","_index":"test","_type":"_doc"}}
{"script": "ctx._source.bytes = '1kb'", "upsert":{"foo":"bar"}, "scripted_upsert" : true}

Results in:

{
  "took" : 0,
  "ingest_took" : 7,
  "errors" : true,
  "items" : [
    {
      "index" : {
        "_index" : null,
        "_type" : null,
        "_id" : null,
        "status" : 500,
        "error" : {
          "type" : "exception",
          "reason" : "java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [bytes] not present as part of path [bytes]",
          "caused_by" : {
            "type" : "illegal_argument_exception",
            "reason" : "java.lang.IllegalArgumentException: field [bytes] not present as part of path [bytes]",
            "caused_by" : {
              "type" : "illegal_argument_exception",
              "reason" : "field [bytes] not present as part of path [bytes]"
            }
          },
          "header" : {
            "processor_type" : "bytes"
          }
        }
      }
    }
  ]
}

This is because, in the bulk path, the script executes only AFTER the pipeline. Note that the script still executes, but since it runs after the pipeline, any data it computes is not available to the ingest pipeline's processors.
For example, if you move the data that the processor cares about out of the script and into the upsert document, it works as expected:

POST _bulk
{"update":{"_id":"2","_index":"test","_type":"_doc"}}
{"script": "ctx._source.foo = 'bar'", "upsert":{"bytes":"1kb"}, "scripted_upsert" : true}
@jakelandis jakelandis added >bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP labels Dec 17, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-features

@ywelsch
Contributor

ywelsch commented Mar 12, 2019

@jakelandis I've rediscovered this same issue in the context of treating the update API just as a bulk with a single element. I've noticed other divergences between the two APIs as well (not only ingest pipeline related), and wonder if we can move the behavior of the update API to be that of the bulk one, i.e., the script would only execute after the ingest pipeline? WDYT?
