
[Data Frame] Bad date_histogram format causes infinitely running indexer #43068

Closed
benwtrent opened this issue Jun 10, 2019 · 8 comments

@benwtrent (Member) commented Jun 10, 2019

Problem

Users have the ability to shoot themselves in the foot without much warning with Data Frames.

Example:

{
  "source": { "index": "my-index-*"},
  "dest"  : { "index": "my-data-frame"},
  "pivot": {
    "group_by": {
      "@timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "1m",
          "format": "yyyy-MM-dd HH:00"  <--- Uh OH!!!!
        }
      }
    },
    "aggregations": {
	    ....
    }
  }
}

This is a valid data frame definition, but the format of the key for the composite aggregation buckets has too few significant time digits: with "yyyy-MM-dd HH:00", every one-minute bucket within a given hour formats to the same string. Many buckets therefore end up with the exact same pivot key, which causes two issues:

  • The data frame will happily run forever. If enough documents fall within a single hour to fill an entire page of the composite aggregation, the data frame will keep requesting that same page infinitely.
  • Documents will overwrite each other. Since we generate the document _id values from the values of the composite aggregation bucket, all buckets generated within the same hour would share the same _id, and only the very last bucket seen would be retained.

Solutions???

Check if the interval and the format have the same time fidelity

The format field allows any of our valid time formats. We may be able to look at the base of the calendar_interval (e.g. m => minutes, h => hours, etc.) and compare it against a formatted timestamp. If we apply the provided format to an epoch timestamp where we know all the digits are non-zero, it should be possible to verify that the format has the same fidelity as (or a higher one than) the interval.

👍 Computationally efficient
👎 A tad complicated, logically
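A minimal sketch of this idea using plain java.time (the class and method names below are hypothetical, not Elasticsearch code, and a real implementation would use Elasticsearch's own DateFormatter machinery): format a reference instant whose time fields all carry distinctive digits, then check that the digits for the interval's base unit survive in the output.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class FormatFidelitySketch {

    // Reference instant chosen so every time field has a distinctive two-digit
    // value: day 15, hour 17, minute 48, second 29.
    private static final Instant REF = Instant.parse("2021-03-15T17:48:29Z");

    /**
     * Returns true if the formatted reference timestamp still contains the
     * digits of the interval's base unit ("48" for minutes, "17" for hours).
     * This is the "a tad complicated" part: substring matching is fragile if
     * another field happens to share the same digits.
     */
    static boolean formatKeepsUnit(String pattern, char intervalUnit) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern(pattern).withZone(ZoneOffset.UTC);
        String formatted = fmt.format(REF);
        String needle = switch (intervalUnit) {
            case 's' -> "29";
            case 'm' -> "48";
            case 'h' -> "17";
            default -> throw new IllegalArgumentException("unsupported unit: " + intervalUnit);
        };
        return formatted.contains(needle);
    }

    public static void main(String[] args) {
        // "yyyy-MM-dd HH:00" zeroes out the minutes, so a 1m interval fails:
        System.out.println(formatKeepsUnit("yyyy-MM-dd HH:00", 'm')); // false
        System.out.println(formatKeepsUnit("yyyy-MM-dd HH:mm", 'm')); // true
    }
}
```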

Run sample queries and see if there are repeated keys

Only the date_histogram group_by would have to be considered. If the date_histogram aggregation is run against a subset of the data, with the supplied format, each non-empty bucket key should be checked to see if there are any repeats.

👍 simple
👎 computationally inefficient
👎 not reliable. What if the sampled subset of the data just happens to fall into buckets whose keys are all different?
Example of distinct keys despite an invalid format:

GET kibana_sample_data_flights/_search
{
  "aggs": {
    "buckets": {
      "composite": {
        "sources": [
          {
            "time": {
              "date_histogram": {
                "field": "timestamp",
                "interval": "second",
                "format": "yyyy-MM-dd HH:mm"
              }
            }
          }
        ]
      }
    }
  }
}
"buckets" : [
        {
          "key" : {
            "time" : "2019-04-08 00:00"
          },
          "doc_count" : 1
        },
        {
          "key" : {
            "time" : "2019-04-08 00:02"
          },
          "doc_count" : 1
        },
        {
          "key" : {
            "time" : "2019-04-08 00:06"
          },
          "doc_count" : 1
        },...
@elasticmachine (Collaborator)

Pinging @elastic/ml-core

@droberts195 (Contributor)

An alternative approach for "Check if the interval and the format have the same time fidelity" might be to first format Instant.EPOCH to a string using the supplied date format. Then format Instant.EPOCH.plus(interval) (converting interval to a TemporalAmount to pass to plus) to a string using the supplied date format. If the two strings are the same then reject the config.

Doing it this way we would not need to understand the Java time format string.
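That comparison can be sketched with plain java.time (Elasticsearch wraps its own DateFormatter, so treat the names below as illustrative, not the actual fix):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAmount;

public class EpochFidelityCheck {

    /**
     * Reject the config when the epoch and epoch-plus-one-interval format to
     * the same string, i.e. the format cannot distinguish adjacent buckets.
     */
    static boolean shouldReject(String pattern, TemporalAmount interval) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern(pattern);
        // Calendar intervals like months are Period-based and cannot be added
        // to a bare Instant, so add them via a ZonedDateTime.
        ZonedDateTime start = Instant.EPOCH.atZone(ZoneOffset.UTC);
        return fmt.format(start).equals(fmt.format(start.plus(interval)));
    }

    public static void main(String[] args) {
        System.out.println(shouldReject("yyyy-MM-dd HH:00", java.time.Duration.ofMinutes(1))); // true  -> reject
        System.out.println(shouldReject("yyyy-MM-dd HH:mm", java.time.Duration.ofMinutes(1))); // false -> accept
        System.out.println(shouldReject("yyyy-MM-dd", java.time.Period.ofMonths(1)));          // false -> accept
    }
}
```

Because only two strings are compared, no knowledge of the pattern syntax itself is needed, which is the appeal of this approach.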

@jimczi (Contributor) commented Jun 12, 2019

We don't separate the raw key from the string representation in the response of the composite aggregation. In the "normal" date_histogram we have key, which contains the raw timestamp, and key_as_string, which contains the timestamp formatted with the provided format. We could have the same mechanism in composite, but I wonder why we expose the format option in the data frame? Is it just because it's available in the composite?

@benwtrent (Member, author)

@jimczi two reasons:

  • One is that it is supported in composite and in date_histogram.
  • Certain use cases do benefit from having a formatted time value, especially in the data exploration scenario where the pivot is completed and users want to see the resulting table. A formatted timestamp is much more human readable than a raw one.

@jimczi (Contributor) commented Jun 13, 2019

> Certain use cases do benefit from having a formatted time value, especially in the data exploration scenario where the pivot is completed and users want to see the resulting table. A formatted timestamp is much more human readable than a raw one.

Ok, so I guess this is for the preview page, since the resulting table should be an index where the format is compatible with the format defined on the target date field. Would it be possible to apply the formatting outside of the composite? We can use DateFormatter.forPattern to create the DateFormatter and then apply something like formatter.format(resolution.toInstant(value).atZone(timeZone)), where value is the raw timestamp. So internally the composite aggregation would always return the raw timestamp, but the preview page would apply some formatting to display the result?
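In other words, the composite aggregation would keep returning the raw epoch-millis key, and only the display layer would format it. A sketch with plain java.time rather than Elasticsearch's internal DateFormatter (names illustrative):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class PreviewKeyFormatter {

    /**
     * Format a raw epoch-millis bucket key for display only; the indexed
     * document and the composite `after` key would keep the raw value.
     */
    static String displayKey(long epochMillis, String pattern, ZoneId zone) {
        return DateTimeFormatter.ofPattern(pattern)
            .format(Instant.ofEpochMilli(epochMillis).atZone(zone));
    }

    public static void main(String[] args) {
        long rawKey = 1554681600000L; // 2019-04-08T00:00:00Z
        System.out.println(displayKey(rawKey, "yyyy-MM-dd HH:mm", ZoneOffset.UTC));
        // 2019-04-08 00:00
    }
}
```

With this split, a lossy format can never corrupt pagination or document identity, because the format is never used to build keys.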

@hendrikmuhs
I somewhat like the idea of the testing/querying approach, similar to what @droberts195 suggested, and I wonder if we can create the data ourselves instead of querying the source (which might be sparse, empty, or non-existent at the time the transform is created). So basically, generate

format(start)
format(start + interval)
format(start + interval * 2)
...

and throw if we create a duplicate.
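A sketch of that generate-and-check idea (again plain java.time; a real implementation would derive the start point and step count from the calendar_interval):

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.HashSet;
import java.util.Set;

public class SyntheticKeyCheck {

    /** Format `steps` synthetic bucket boundaries and throw on any duplicate. */
    static void validate(String pattern, Duration interval, int steps) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern(pattern).withZone(ZoneOffset.UTC);
        Set<String> seen = new HashSet<>();
        Instant t = Instant.EPOCH;
        for (int i = 0; i < steps; i++, t = t.plus(interval)) {
            if (!seen.add(fmt.format(t))) {
                throw new IllegalArgumentException(
                    "format [" + pattern + "] duplicates bucket keys for interval [" + interval + "]");
            }
        }
    }

    public static void main(String[] args) {
        validate("yyyy-MM-dd HH:mm", Duration.ofMinutes(1), 60); // passes
        try {
            validate("yyyy-MM-dd HH:00", Duration.ofMinutes(1), 60);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Note that how many steps are generated decides what counts as a duplicate: "HH:mm" only collides once the generated range crosses 24 hours, which is exactly the round-robin case below.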

Interestingly, I wonder whether this issue is always a bug:

 "calendar_interval": "1m",
 "format": "HH:mm"

is a nice way to implement a round-robin database with data for the last 24 hours.

@jimczi (Contributor) commented Jun 26, 2019

Another solution would be to apply the format only to the buckets and not to the after key. We need the raw values there since they have to be unique, and the format may break that assumption. The after map is always used as-is to paginate over the buckets, so it would be safer to always return the original values without any modification.

@benwtrent (Member, author)

We opted to simply remove support for format in 7.3.0. We will also add support for a human=true flag on the _preview response.

See more discussion here: elastic/kibana#39250

PR removing the format: #43659
