
[Data Frame] Bad date_histogram format causes infinitely running indexer #43068

Closed
benwtrent opened this issue Jun 10, 2019 · 8 comments

@benwtrent (Member) commented Jun 10, 2019

Problem

Users have the ability to shoot themselves in the foot without much warning with Data Frames.

Example:

{
  "source": { "index": "my-index-*"},
  "dest"  : { "index": "my-data-frame"},
  "pivot": {
    "group_by": {
      "@timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "1m",
          "format": "yyyy-MM-dd HH:00"  <--- Uh OH!!!!
        }
      }
    },
    "aggregations": {
	    ....
    }
  }
}

This is a valid data frame definition, but the format of the key for the composite aggregation buckets has too few significant time digits: with "yyyy-MM-dd HH:00", every one-minute bucket within a given hour formats to the same string. Many buckets therefore end up with the exact same pivot key, which causes two issues:

  • The data frame will happily run forever. If enough documents fall within a single hour to fill an entire page of the composite aggregation, the data frame will keep requesting that same page infinitely.
  • Documents will overwrite each other. Since we generate the document _id values from the values of the composite aggregation bucket, all buckets generated within the same hour would share the same _id, and only the very last bucket seen would be retained.

Solutions???

Check if the interval and the format have the same time fidelity

The format field allows any of our valid time formats. We may be able to look at the base of the calendar_interval (e.g. m => minutes, h => hours, etc.) and compare it against a formatted timestamp. If we apply the provided format to an epoch timestamp where we know all the digits are non-zero, it should be possible to verify that the format has the same fidelity as (or a higher one than) the interval.

👍 Computationally efficient
👎 A tad complicated, logically
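A minimal sketch of this idea using plain java.time (the class and method names below are hypothetical, not Elasticsearch code, and a real implementation would use Elasticsearch's own DateFormatter machinery): format a reference instant whose time fields all carry distinctive digits, then check that the digits for the interval's base unit survive in the output.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class FormatFidelitySketch {

    // Reference instant chosen so every time field has a distinctive two-digit
    // value: day 15, hour 17, minute 48, second 29.
    private static final Instant REF = Instant.parse("2021-03-15T17:48:29Z");

    /**
     * Returns true if the formatted reference timestamp still contains the
     * digits of the interval's base unit ("48" for minutes, "17" for hours).
     * This is the "a tad complicated" part: substring matching is fragile if
     * another field happens to share the same digits.
     */
    static boolean formatKeepsUnit(String pattern, char intervalUnit) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern(pattern).withZone(ZoneOffset.UTC);
        String formatted = fmt.format(REF);
        String needle = switch (intervalUnit) {
            case 's' -> "29";
            case 'm' -> "48";
            case 'h' -> "17";
            default -> throw new IllegalArgumentException("unsupported unit: " + intervalUnit);
        };
        return formatted.contains(needle);
    }

    public static void main(String[] args) {
        // "yyyy-MM-dd HH:00" zeroes out the minutes, so a 1m interval fails:
        System.out.println(formatKeepsUnit("yyyy-MM-dd HH:00", 'm')); // false
        System.out.println(formatKeepsUnit("yyyy-MM-dd HH:mm", 'm')); // true
    }
}
```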

Run sample queries and see if there are repeated keys

Only the date_histogram group_by would have to be considered. If the date_histogram aggregation is run against a subset of the data, with the supplied format, each non-empty bucket key should be checked to see if there are any repeats.

👍 simple
👎 computationally inefficient
👎 not reliable. What if the sampled subset of the data just happens to fall into buckets whose keys are all different?
Example of distinct keys despite an invalid format:

GET kibana_sample_data_flights/_search
{
  "aggs": {
    "buckets": {
      "composite": {
        "sources": [
          {
            "time": {
              "date_histogram": {
                "field": "timestamp",
                "interval": "second",
                "format": "yyyy-MM-dd HH:mm"
              }
            }
          }
        ]
      }
    }
  }
}
"buckets" : [
        {
          "key" : {
            "time" : "2019-04-08 00:00"
          },
          "doc_count" : 1
        },
        {
          "key" : {
            "time" : "2019-04-08 00:02"
          },
          "doc_count" : 1
        },
        {
          "key" : {
            "time" : "2019-04-08 00:06"
          },
          "doc_count" : 1
        },...
@elasticmachine (Collaborator)

Pinging @elastic/ml-core

@droberts195 (Contributor)

An alternative approach for "Check if the interval and the format have the same time fidelity" might be to first format Instant.EPOCH to a string using the supplied date format. Then format Instant.EPOCH.plus(interval) (converting interval to a TemporalAmount to pass to plus) to a string using the supplied date format. If the two strings are the same then reject the config.

Doing it this way we would not need to understand the Java time format string.
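That comparison can be sketched with plain java.time (Elasticsearch wraps its own DateFormatter, so treat the names below as illustrative, not the actual fix):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAmount;

public class EpochFidelityCheck {

    /**
     * Reject the config when the epoch and epoch-plus-one-interval format to
     * the same string, i.e. the format cannot distinguish adjacent buckets.
     */
    static boolean shouldReject(String pattern, TemporalAmount interval) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern(pattern);
        // Calendar intervals like months are Period-based and cannot be added
        // to a bare Instant, so add them via a ZonedDateTime.
        ZonedDateTime start = Instant.EPOCH.atZone(ZoneOffset.UTC);
        return fmt.format(start).equals(fmt.format(start.plus(interval)));
    }

    public static void main(String[] args) {
        System.out.println(shouldReject("yyyy-MM-dd HH:00", java.time.Duration.ofMinutes(1))); // true  -> reject
        System.out.println(shouldReject("yyyy-MM-dd HH:mm", java.time.Duration.ofMinutes(1))); // false -> accept
        System.out.println(shouldReject("yyyy-MM-dd", java.time.Period.ofMonths(1)));          // false -> accept
    }
}
```

Because only two strings are compared, no knowledge of the pattern syntax itself is needed, which is the appeal of this approach.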

@jimczi (Contributor) commented Jun 12, 2019

We don't separate the raw key from the string representation in the response of the composite aggregation. In the "normal" date_histogram we have key, which contains the raw timestamp, and key_as_string, which contains the timestamp formatted with the provided format. We could have the same mechanism in composite, but I wonder why we expose the format option in the data frame? Is it just because it's available in the composite?

@benwtrent (Member, author)

@jimczi two reasons:

  • One is that it is supported in composite and in date_histogram.
  • Certain use cases do benefit from having a formatted time value, especially in the data exploration scenario where the pivot is completed and users want to see the resulting table. A formatted timestamp is much more human readable than a raw one.

@jimczi (Contributor) commented Jun 13, 2019

> Certain use cases do benefit from having a formatted time value, especially in the data exploration scenario where the pivot is completed and users want to see the resulting table. A formatted timestamp is much more human readable than a raw one.

Ok, so I guess this is for the preview page, since the resulting table should be an index where the format is compatible with the format defined on the target date field. Would it be possible to apply the formatting outside of the composite? We can use DateFormatter.forPattern to create the DateFormatter and then apply something like formatter.format(resolution.toInstant(value).atZone(timeZone)), where value is the raw timestamp. So internally the composite aggregation would always return the raw timestamp, but the preview page would apply some formatting to display the result?
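In other words, the composite aggregation would keep returning the raw epoch-millis key, and only the display layer would format it. A sketch with plain java.time rather than Elasticsearch's internal DateFormatter (names illustrative):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class PreviewKeyFormatter {

    /**
     * Format a raw epoch-millis bucket key for display only; the indexed
     * document and the composite `after` key would keep the raw value.
     */
    static String displayKey(long epochMillis, String pattern, ZoneId zone) {
        return DateTimeFormatter.ofPattern(pattern)
            .format(Instant.ofEpochMilli(epochMillis).atZone(zone));
    }

    public static void main(String[] args) {
        long rawKey = 1554681600000L; // 2019-04-08T00:00:00Z
        System.out.println(displayKey(rawKey, "yyyy-MM-dd HH:mm", ZoneOffset.UTC));
        // 2019-04-08 00:00
    }
}
```

With this split, a lossy format can never corrupt pagination or document identity, because the format is never used to build keys.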

@hendrikmuhs
I somewhat like the idea of the testing/querying approach, similar to what @droberts195 suggested, and I wonder if we can create the data ourselves instead of querying the source (which might be sparse, empty, or non-existent at the time the transform is created). So basically, generate

format(start)
format(start + interval)
format(start + interval * 2)
...

and throw if we create a duplicate.
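A sketch of that generate-and-check idea (again plain java.time; a real implementation would derive the start point and step count from the calendar_interval):

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.HashSet;
import java.util.Set;

public class SyntheticKeyCheck {

    /** Format `steps` synthetic bucket boundaries and throw on any duplicate. */
    static void validate(String pattern, Duration interval, int steps) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern(pattern).withZone(ZoneOffset.UTC);
        Set<String> seen = new HashSet<>();
        Instant t = Instant.EPOCH;
        for (int i = 0; i < steps; i++, t = t.plus(interval)) {
            if (!seen.add(fmt.format(t))) {
                throw new IllegalArgumentException(
                    "format [" + pattern + "] duplicates bucket keys for interval [" + interval + "]");
            }
        }
    }

    public static void main(String[] args) {
        validate("yyyy-MM-dd HH:mm", Duration.ofMinutes(1), 60); // passes
        try {
            validate("yyyy-MM-dd HH:00", Duration.ofMinutes(1), 60);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Note that how many steps are generated decides what counts as a duplicate: "HH:mm" only collides once the generated range crosses 24 hours, which is exactly the round-robin case below.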

Interestingly, I wonder whether this issue is always a bug:

 "calendar_interval": "1m",
 "format": "HH:mm"

is a nice way to implement a round-robin database with data for the last 24 hours.

@jimczi (Contributor) commented Jun 26, 2019

Another solution would be to apply the format only to the buckets and not to the after key. We need the raw values there since they have to be unique, and the format may break that assumption. The after map is always used as-is to paginate over the buckets, so it would be safer to always return the original values without any modification.

@benwtrent (Member, author)

We opted to simply remove support for format in 7.3.0. We will also add support for a human=true flag on the _preview response.

See more discussion here: elastic/kibana#39250

PR removing the format: #43659
