Transform API update documents already inserted #80277

Closed
mjrlgue opened this issue Nov 3, 2021 · 4 comments · Fixed by #82225
Assignees: benwtrent
Labels: :ml/Transform, >non-issue, Team:ML

Comments

mjrlgue (Contributor) commented Nov 3, 2021

Elasticsearch version (bin/elasticsearch --version): 7.15

License: Basic
Plugins installed: none

JVM version (java -version):

OS version (uname -a if on a Unix-like system):

Description of the problem including expected versus actual behavior:

Steps to reproduce:
Hi, I am facing a big issue (I don't know whether it is designed to behave like this or not) as we are going to production using the Transform API. What we are seeing is that the aggregations calculated by a transform get updated as soon as a document with the same term already exists in the destination index.

Here is the transformation:

POST _transform/_preview
{
  "source": {
    "index": "sourceIndex",
    "query": {
      "bool": {
        "filter": [
          {
            "range": {
              "@timestamp": {
                "gte": "now/d-7d-7d",
                "lt": "now/d-7d"
              }
            }
          }
        ]
      }
    }
  },
  "dest": {
    "index": "destinationIndex",
   "pipeline": "add_ingestedAt"
  },
  "pivot": {
    "group_by": {
      "os": { "terms": { "field":"infos.os"}}
    },
    "aggs": {
      "mac_unique": {
        "cardinality": {
          "field": "mac_adress"
        }
      }
    }
  }
}

The preview response is:

"preview" : [
    {
      "os" : "os1",
      "mac_unique" : 9,
      "ingestedAt": "2021-10-19T12:34:25Z"
    },
    {
      "os" : "os2",
      "mac_unique" : 3,
      "ingestedAt": "2021-10-19T12:34:25Z"
    },
    {
      "os" : "os3",
      "mac_unique" : 3,
      "ingestedAt": "2021-10-19T12:34:25Z"
    }
  ]
....

The problem is that when I insert a new document, say one with os=os3, what I expect is a new document in the destination index, as follows:

"preview" : [
    {
      "os" : "os1",
      "mac_unique" : 9,
      "ingestedAt": "2021-10-19T12:34:25Z"
    },
    {
      "os" : "os2",
      "mac_unique" : 3,
      "ingestedAt": "2021-10-19T12:34:25Z"
    },
    {
      "os" : "os3",
      "mac_unique" : 3,
      "ingestedAt": "2021-10-19T12:34:25Z"
    },
    {
      "os" : "os3",
      "mac_unique" : 1,
      "ingestedAt": "2021-10-20T12:34:25Z"
    }
  ]
....

But it gets updated like this:

"preview" : [
    {
      "os" : "os1",
      "mac_unique" : 9,
      "ingestedAt": "2021-10-19T12:34:25Z"
    },
    {
      "os" : "os2",
      "mac_unique" : 3,
      "ingestedAt": "2021-10-19T12:34:25Z"
    },
    {
      "os" : "os3",
      "mac_unique" : 4,
      "ingestedAt": "2021-10-20T12:34:25Z"
    }
  ]
...

From the Transform API docs, my understanding is that it calculates aggregations and stores them in the destination index, but here it is not inserting new documents; it is updating documents already present in the destination index that have the same term.

I added an ingest pipeline to add a date field, ingestedAt, but the documents are still being updated. Is it supposed to behave like this?

Thanks.

mjrlgue added the >bug and needs:triage labels Nov 3, 2021
benwtrent (Member) commented Nov 3, 2021

Hey @mjrlgue, transforms by default create document IDs based on the values of the group_by keys.

So, in your case, document IDs map 1-to-1 to the os value.

This is by design. There are two ways forward if you want a new occurrence of os3 to be its own document:

    1. Add another group_by that is a date_histogram. That way the transform groups by time + the OS, so new documents with the same OS won't update old ones.
    2. Supply your own document ID. You can do this with your ingest pipeline (a sketch follows below).

I suggest option 1, as your destination data seems to be time based (since you have the concept of "new" vs. "old" os3 documents).
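
For illustration, here is a minimal sketch of option 2. It reuses the add_ingestedAt pipeline name from your transform; the _id template is purely illustrative (it concatenates the os group_by value with the ingest timestamp), so adjust it to whatever uniqueness scheme you actually need:

PUT _ingest/pipeline/add_ingestedAt
{
  "description": "Set ingestedAt and override _id so matching buckets are written as new documents",
  "processors": [
    { "set": { "field": "ingestedAt", "value": "{{{_ingest.timestamp}}}" } },
    { "set": { "field": "_id", "value": "{{{os}}}-{{{_ingest.timestamp}}}" } }
  ]
}

Keep in mind that once the _id no longer depends only on the group_by values, the transform can no longer update a previously written bucket; every checkpoint will add new documents instead.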

benwtrent added the :ml/Transform label and removed the >bug and needs:triage labels Nov 3, 2021
elasticmachine added the Team:ML label Nov 3, 2021
elasticmachine (Collaborator) commented:

Pinging @elastic/ml-core (Team:ML)

benwtrent self-assigned this Nov 3, 2021
benwtrent (Member) commented:

@mjrlgue I am going to close this issue as it seems things are working as designed.

Please feel free to reopen this issue, open a topic on http://discuss.elastic.co/, or file a separate issue.

mjrlgue (Contributor, Author) commented Nov 10, 2021

Hey @benwtrent, thank you for your reply and sorry for my late response ^^.

Well, we tried adding the date_histogram as you suggested, but the results we are getting are not really what we wanted. Let me explain: with the date_histogram we get fixed week buckets based on the calendar of the current month, whereas what we want is buckets relative to the execution date. For example, if we run the transformation on 10 Nov 2021, for an aggregation over the week 7 days ago I need to get a result for the week of 3 Nov 2021, and so on.

We added the date_histogram as follows:

POST _transform/_preview
{
  "source": {
    "index": "sourceIndex",
    "query": {
      "bool": {
        "filter": [
          {
            "range": {
              "@timestamp": {
                "gte": "now/d-7d-7d",
                "lt": "now/d-7d"
              }
            }
          }
        ]
      }
    }
  },
  "dest": {
    "index": "destinationIndex",
   "pipeline": "add_ingestedAt"
  },
  "pivot": {
    "group_by": {
"timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "fixed_interval": "7d"
        }
      },
      "os": { "terms": { "field":"infos.os"}}
    },
    "aggs": {
      "mac_unique": {
        "cardinality": {
          "field": "mac_adress"
        }
      }
    }
  }
}

I do like the idea of providing our own _id for each document, but I don't know how. Could you provide a link or an example of randomly generating an ID with an ingest pipeline?
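
For reference, here is a sketch only (not something I have verified on 7.15) of what I imagine this could look like: a Painless script processor in the existing add_ingestedAt pipeline that assigns a random UUID to _id:

PUT _ingest/pipeline/add_ingestedAt
{
  "description": "Add ingestedAt and assign a random _id",
  "processors": [
    { "set": { "field": "ingestedAt", "value": "{{{_ingest.timestamp}}}" } },
    {
      "script": {
        "lang": "painless",
        "source": "ctx._id = UUID.randomUUID().toString();"
      }
    }
  ]
}

With a fully random _id the transform would write a brand-new document on every checkpoint and never update an existing one, which may or may not be what we want here.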

I think this detail about the Transform API generating IDs based on the values of the group_by keys should be mentioned in the docs, because it will save time for future users. I think I'll open a PR for the docs; what do you think?

Thanks :)

costin added a commit to costin/elasticsearch that referenced this issue Jan 2, 2022
As part of the effort of making JDBC driver self sufficient, remove the
ES lib geo dependencies without any replacement.
Currently the JDBC driver takes the WKT text and instantiates a geo
object based on the ES lib geo.
Moving forward the driver will return the WKT string representation
without any conversion letting the user pick the geo library desired.
That can be ES lib geo, jts, spatial4j or others.

Note this is a breaking change.

Relates elastic#80277
costin added a commit to costin/elasticsearch that referenced this issue Jan 4, 2022
Remove JDBC driver dependency on XContent by cloning the necessary
classes into the driver.
This has the advantage of keeping the parsing/writing code style in sync
and hopefully making maintenance easier in the future at the cost
of bringing over a lot of code that is potentially unnecessary.
The imported classes were kept as close as possible to the original and
placed under a different package. A notable exception is the JDBC
XContentBuilder which exposes its internal generator to allow unwrapping
inside sql-action.

The bridging between XContent in ES and JDBC is done in sql-action
through ProtoShim which relies on delegation to allow ES XContent to be
used inside the JDBC classes.

Fix elastic#80277