[Data Frame] Bad date_histogram format causes infinitely running indexer #43068
Comments
Pinging @elastic/ml-core
An alternative approach for "Check if the interval and the format have the same time fidelity" might be to first format Doing it this way we would not need to understand the Java time format string.
We don't separate the raw key from the string representation in the response of the composite aggregation. In the "normal"
@jimczi two reasons:
Ok so I guess this is for the preview page since the resulting table should be an index where the format should be compatible with the format defined on the target
I somewhat like the idea of the testing/querying approach, similar to what @droberts195 suggested and I wonder if we can create the data ourselves instead of querying the source (which might be sparse, empty, non-existent at the time of creating the transform). So basically generating
and throw if we create a duplicate. Interestingly I wonder if this issue is always a bug:
is a nice way to implement a round-robin database with data for the last 24 hours.
Another solution would be to only apply the
We opted to simply remove support for See more discussion here: elastic/kibana#39250 PR removing the format: #43659
Problem
Users have the ability to shoot themselves in the foot without much warning with Data Frames.
Example:
This is a valid data frame definition, but the format of the key for the composite aggregation buckets has too few "time significant digits". This will result in many buckets that have the exact same pivot key. Two issues result from this:
_id values by the values of the composite aggregation bucket, all the buckets generated in the same hour would have the same _id and only the very last bucket seen would be retained.
Solutions???
Check if the interval and the format have the same time fidelity
The format field allows any of our valid time formats. We may be able to look at the base of the calendar_interval (e.g. m => minutes, h => hours, etc.) and compare it with a formatted timestamp. If we apply the provided format to an epoch timestamp in which every digit is known to be non-zero, it should be possible to verify that the format has the same fidelity as the interval (or higher).
👍 Computationally efficient
👎 A tad complicated, logically
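One way to test fidelity without interpreting the format string is to format a run of consecutive bucket start times and see whether any two render to the same key. The following is only a minimal sketch under stated assumptions: the interval is a fixed-length java.time.Duration (calendar units such as months are not handled), keys are rendered in UTC, the start timestamp is arbitrary, and the class and method names are hypothetical rather than anything from the Elasticsearch code base.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Elasticsearch.
final class FormatFidelityCheck {

    /**
     * Formats `buckets` consecutive bucket start times, spaced `interval`
     * apart, and reports whether any two of them produce the same key.
     */
    static boolean producesDuplicateKeys(String pattern, Duration interval, int buckets) {
        // Assumption: keys are rendered in UTC.
        DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern(pattern).withZone(ZoneOffset.UTC);
        // Arbitrary start time chosen for illustration only.
        Instant start = Instant.parse("2019-06-10T00:00:00Z");
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < buckets; i++) {
            String key = formatter.format(start.plus(interval.multipliedBy(i)));
            if (!seen.add(key)) {
                return true; // two distinct buckets share one key
            }
        }
        return false;
    }
}
```

With a one-minute interval, the pattern "yyyy-MM-dd HH" collides on the second bucket, while "yyyy-MM-dd HH:mm" does not collide. The answer depends on how many buckets are generated: "HH:mm" with a one-minute interval stays unique for 1440 buckets and only repeats on the 1441st, which is the round-robin case mentioned in the comments.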
Run sample queries and see if there are repeated keys
Only the date_histogram group_by would have to be considered. If the date_histogram aggregation is run against a subset of the data with the supplied format, each non-empty bucket key should be checked for repeats.
👍 simple
👎 computationally inefficient
👎 not reliable. What if the subset of the queried data just happens to fall into buckets whose keys are all different?
example of different keys but invalid format:
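As a hypothetical illustration of this failure mode (made-up timestamps, not the data from the original report), the sketch below formats a handful of sampled bucket start times and looks for repeated keys. It assumes a one-minute interval, UTC rendering, and illustrative class and method names that do not exist in Elasticsearch.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

// Hypothetical helper, not part of Elasticsearch.
final class SampledKeyCheck {

    /** Renders each sampled bucket start time with the user-supplied pattern (UTC assumed). */
    static List<String> formatKeys(String pattern, List<Instant> bucketStarts) {
        DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern(pattern).withZone(ZoneOffset.UTC);
        List<String> keys = new ArrayList<>();
        for (Instant bucketStart : bucketStarts) {
            keys.add(formatter.format(bucketStart));
        }
        return keys;
    }

    /** True if at least two sampled buckets were rendered to the same key. */
    static boolean hasRepeatedKey(List<String> keys) {
        return new HashSet<>(keys).size() < keys.size();
    }

    public static void main(String[] args) {
        String pattern = "yyyy-MM-dd HH"; // too coarse for one-minute buckets

        // Sparse sample: the two non-empty buckets fall in different hours.
        List<Instant> sparse = List.of(
            Instant.parse("2019-06-10T10:15:00Z"),
            Instant.parse("2019-06-10T11:42:00Z"));
        // Dense sample: two non-empty buckets fall in the same hour.
        List<Instant> dense = List.of(
            Instant.parse("2019-06-10T10:15:00Z"),
            Instant.parse("2019-06-10T10:16:00Z"));

        System.out.println(hasRepeatedKey(formatKeys(pattern, sparse)));
        System.out.println(hasRepeatedKey(formatKeys(pattern, dense)));
    }
}
```

The sparse sample yields distinct keys even though the format is too coarse for the interval, so a check based on sampled data would let this definition through.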