ingest: bulk scripted_upsert runs the script after the pipeline #36745

Open
jakelandis opened this issue Dec 17, 2018 · 2 comments
Labels
>bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team

Comments

@jakelandis
Contributor

#36618 allows a default pipeline to be used with bulk upserts. However, the behavior of a bulk scripted_upsert with a default pipeline is surprising.

Given an index with a default pipeline:

DELETE test
PUT test
{
  "settings": {
    "index.default_pipeline": "bytes"
  }
}
PUT _ingest/pipeline/bytes
{
  "processors": [
    {
      "bytes": {
        "field": "bytes"
      }
    }
  ]
}

Performing a non-bulk upsert works as expected:

POST test/doc/1/_update
{
  "scripted_upsert": true, 
  "script":{
    "source": "ctx._source.bytes = '1kb'" 
  },
  "upsert" :{
    "foo" : "bar"
  }
}
GET test/doc/1

results in:

{
...
  "_source" : {
    "bytes" : 1024,
    "foo" : "bar"
  }
}

The script was evaluated, then the ingest pipeline ran normally. This matches the expectation that the script is always executed.
However, the same request issued through the _bulk API behaves surprisingly.

POST _bulk
{"update":{"_id":"2","_index":"test","_type":"_doc"}}
{"script": "ctx._source.bytes = '1kb'", "upsert":{"foo":"bar"}, "scripted_upsert" : true}

Results in:

{
  "took" : 0,
  "ingest_took" : 7,
  "errors" : true,
  "items" : [
    {
      "index" : {
        "_index" : null,
        "_type" : null,
        "_id" : null,
        "status" : 500,
        "error" : {
          "type" : "exception",
          "reason" : "java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [bytes] not present as part of path [bytes]",
          "caused_by" : {
            "type" : "illegal_argument_exception",
            "reason" : "java.lang.IllegalArgumentException: field [bytes] not present as part of path [bytes]",
            "caused_by" : {
              "type" : "illegal_argument_exception",
              "reason" : "field [bytes] not present as part of path [bytes]"
            }
          },
          "header" : {
            "processor_type" : "bytes"
          }
        }
      }
    }
  ]
}

This is because, in the bulk path, the script executes only AFTER the pipeline. Note that the script still executes, but since it runs after the pipeline, any data it computes is not available to the ingest pipeline's processors.
For example, if you move the data that the processor cares about out of the script and into the upsert document, it works as expected:

POST _bulk
{"update":{"_id":"2","_index":"test","_type":"_doc"}}
{"script": "ctx._source.foo = 'bar'", "upsert":{"bytes":"1kb"}, "scripted_upsert" : true}
@jakelandis jakelandis added >bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP labels Dec 17, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-features

@ywelsch
Contributor

ywelsch commented Mar 12, 2019

@jakelandis I've rediscovered this same issue in the context of treating the update API just as a bulk with a single element. I've noticed other divergences between the two APIs as well (not only ingest pipeline related), and wonder if we can move the behavior of the update API to be that of the bulk one, i.e., the script would only execute after the ingest pipeline? WDYT?
