Transform API update documents already inserted #80277

Closed
mjrlgue opened this issue Nov 3, 2021 · 4 comments · Fixed by #82225
Assignees: benwtrent
Labels: :ml/Transform, >non-issue, Team:ML

Comments

mjrlgue (Contributor) commented Nov 3, 2021

Elasticsearch version (bin/elasticsearch --version): 7.15

License: Basic
Plugins installed: none

JVM version (java -version):

OS version (uname -a if on a Unix-like system):

Description of the problem including expected versus actual behavior:

Steps to reproduce:
Hi, I am facing a big issue (I don't know whether it is designed to behave like this or not) as we are going to production using the Transform API. What we are seeing is that the aggregations calculated by a transform get updated as soon as a document with the same term already exists in the destination index.

Here is the transformation:

POST _transform/_preview
{
  "source": {
    "index": "sourceIndex",
    "query": {
      "bool": {
        "filter": [
          {
            "range": {
              "@timestamp": {
                "gte": "now/d-7d-7d",
                "lt": "now/d-7d"
              }
            }
          }
        ]
      }
    }
  },
  "dest": {
    "index": "destinationIndex",
   "pipeline": "add_ingestedAt"
  },
  "pivot": {
    "group_by": {
      "os": { "terms": { "field":"infos.os"}}
    },
    "aggs": {
      "mac_unique": {
        "cardinality": {
          "field": "mac_adress"
        }
      }
    }
  }
}

The preview response is:

"preview" : [
    {
      "os" : "os1",
      "mac_unique" : 9,
      "ingestedAt": "2021-10-19T12:34:25Z"
    },
    {
      "os" : "os2",
      "mac_unique" : 3,
      "ingestedAt": "2021-10-19T12:34:25Z"
    },
    {
      "os" : "os3",
      "mac_unique" : 3,
      "ingestedAt": "2021-10-19T12:34:25Z"
    }
  ]
....

The problem is that when I insert a new document, say one with os=os3, what I expect is a new document in the destination index, as follows:

"preview" : [
    {
      "os" : "os1",
      "mac_unique" : 9,
      "ingestedAt": "2021-10-19T12:34:25Z"
    },
    {
      "os" : "os2",
      "mac_unique" : 3,
      "ingestedAt": "2021-10-19T12:34:25Z"
    },
    {
      "os" : "os3",
      "mac_unique" : 3,
      "ingestedAt": "2021-10-19T12:34:25Z"
    },
    {
      "os" : "os3",
      "mac_unique" : 1,
      "ingestedAt": "2021-10-20T12:34:25Z"
    }
  ]
....

But it gets updated like this:

"preview" : [
    {
      "os" : "os1",
      "mac_unique" : 9,
      "ingestedAt": "2021-10-19T12:34:25Z"
    },
    {
      "os" : "os2",
      "mac_unique" : 3,
      "ingestedAt": "2021-10-19T12:34:25Z"
    },
    {
      "os" : "os3",
      "mac_unique" : 4,
      "ingestedAt": "2021-10-20T12:34:25Z"
    }
  ]
...

From the Transform API docs, my understanding is that it calculates aggregations and stores them in the destination index, but here it is not inserting new documents; it is updating documents already present in the destination index that have the same term.

I added an ingest pipeline to add a date field, ingestedAt, but the documents are still being updated. Is it supposed to behave like this?

Thanks.

mjrlgue added the >bug and needs:triage labels Nov 3, 2021
benwtrent (Member) commented Nov 3, 2021

Hey @mjrlgue, transforms by default create document IDs based on the values of the group_by keys.

So, in your case, document IDs map 1-to-1 to the os value.

This is by design. There are two ways forward if you want a new occurrence of os3 to be its own document:

    1. Add another group_by that is a date_histogram. That way the transform groups by time + the OS, so new documents with the same OS won't update old ones.
    2. Supply your own document ID. You can do this with your ingest pipeline (a sketch follows below).

I suggest option 1, as your destination data seems to be time based (since you have the concept of "new" vs. "old" os3 documents).
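
For illustration, here is a minimal sketch of option 2. It reuses the add_ingestedAt pipeline name from your transform; the _id template is purely illustrative (it concatenates the os group_by value with the ingest timestamp), so adjust it to whatever uniqueness scheme you actually need:

PUT _ingest/pipeline/add_ingestedAt
{
  "description": "Set ingestedAt and override _id so matching buckets are written as new documents",
  "processors": [
    { "set": { "field": "ingestedAt", "value": "{{{_ingest.timestamp}}}" } },
    { "set": { "field": "_id", "value": "{{{os}}}-{{{_ingest.timestamp}}}" } }
  ]
}

Keep in mind that once the _id no longer depends only on the group_by values, the transform can no longer update a previously written bucket; every checkpoint will add new documents instead.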

benwtrent added the :ml/Transform label and removed the >bug and needs:triage labels Nov 3, 2021
elasticmachine added the Team:ML label Nov 3, 2021
elasticmachine (Collaborator) commented:

Pinging @elastic/ml-core (Team:ML)

benwtrent self-assigned this Nov 3, 2021
benwtrent (Member) commented:

@mjrlgue I am going to close this issue as it seems things are working as designed.

Please feel free to reopen this issue, open a topic on http://discuss.elastic.co/, or file a separate issue.

mjrlgue (Contributor, Author) commented Nov 10, 2021

Hey @benwtrent, thank you for your reply and sorry for my late response ^^.

Well, we tried adding the date_histogram as you suggested, but the results we are getting are not really what we wanted. Let me explain: with the date_histogram we get fixed week buckets based on the calendar of the current month, whereas what we want is buckets relative to the execution date. For example, if we run the transformation on 10 Nov 2021, for an aggregation over the week 7 days ago I need to get a result for the week of 3 Nov 2021, and so on.

We added the date_histogram as follows:

POST _transform/_preview
{
  "source": {
    "index": "sourceIndex",
    "query": {
      "bool": {
        "filter": [
          {
            "range": {
              "@timestamp": {
                "gte": "now/d-7d-7d",
                "lt": "now/d-7d"
              }
            }
          }
        ]
      }
    }
  },
  "dest": {
    "index": "destinationIndex",
   "pipeline": "add_ingestedAt"
  },
  "pivot": {
    "group_by": {
"timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "fixed_interval": "7d"
        }
      },
      "os": { "terms": { "field":"infos.os"}}
    },
    "aggs": {
      "mac_unique": {
        "cardinality": {
          "field": "mac_adress"
        }
      }
    }
  }
}

I do like the idea of providing our own _id for each document, but I don't know how. Could you provide a link or an example of randomly generating an ID with an ingest pipeline?
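
For reference, here is a sketch only (not something I have verified on 7.15) of what I imagine this could look like: a Painless script processor in the existing add_ingestedAt pipeline that assigns a random UUID to _id:

PUT _ingest/pipeline/add_ingestedAt
{
  "description": "Add ingestedAt and assign a random _id",
  "processors": [
    { "set": { "field": "ingestedAt", "value": "{{{_ingest.timestamp}}}" } },
    {
      "script": {
        "lang": "painless",
        "source": "ctx._id = UUID.randomUUID().toString();"
      }
    }
  ]
}

With a fully random _id the transform would write a brand-new document on every checkpoint and never update an existing one, which may or may not be what we want here.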

I think this detail about the Transform API generating IDs based on the values of the group_by keys should be mentioned in the docs, because it will save time for future users. I think I'll open a PR for the docs; what do you think?

Thanks :)

costin added a commit to costin/elasticsearch that referenced this issue Jan 2, 2022
As part of the effort of making JDBC driver self sufficient, remove the
ES lib geo dependencies without any replacement.
Currently the JDBC driver takes the WKT text and instantiates a geo
object based on the ES lib geo.
Moving forward the driver will return the WKT string representation
without any conversion letting the user pick the geo library desired.
That can be ES lib geo, jts, spatial4j or others.

Note this is a breaking change.

Relates elastic#80277
costin added a commit to costin/elasticsearch that referenced this issue Jan 4, 2022
Remove JDBC driver dependency on XContent by cloning the necessary
classes into the driver.
This has the advantage of keeping the parsing/writing code style in sync
and hopefully making maintenance easier in the future at the cost
of bringing over a lot of code that is potentially unnecessary.
The imported classes were kept as close as possible to the original and
placed under a different package. A notable exception is the JDBC
XContentBuilder which exposes its internal generator to allow unwrapping
inside sql-action.

The bridging between XContent in ES and JDBC is done in sql-action
through ProtoShim which relies on delegation to allow ES XContent to be
used inside the JDBC classes.

Fix elastic#80277