[ML] Improve file structure finder timestamp format determination #41948
Merged: droberts195 merged 17 commits into elastic:master from droberts195:timestamp_finder_improvements on May 23, 2019
Commits on May 8, 2019
[ML] Improve file structure finder timestamp format determination

This change contains a major refactoring of the timestamp format determination code used by the ML find file structure endpoint.

Previously timestamp format determination was done separately for each piece of text supplied to the timestamp format finder. This had the drawback that it was not possible to distinguish dd/MM and MM/dd in the case where both numbers were 12 or less. In order to do this sensibly it is best to look across all the available timestamps and see if one of the numbers is greater than 12 in any of them. This necessitates making the timestamp format finder an instantiable class that can accumulate evidence over time.

Another problem with the previous approach was that it was only possible to override the timestamp format to one of a limited set of timestamp formats. There was no way out if a file to be analysed had a timestamp that was sane yet not in the supported set. This is now changed to allow any timestamp format that can be parsed by a combination of these Java date/time formats:

yy, yyyy, M, MM, MMM, MMMM, d, dd, EEE, EEEE, H, HH, h, mm, ss, a, XX, XXX, zzz

Additionally S letter groups (fractional seconds) are supported providing they occur after ss and separated from the ss by a dot, comma or colon. Spacing and punctuation is also permitted with the exception of the question mark, newline and carriage return characters, together with literal text enclosed in single quotes.

The full list of changes/improvements in this refactor is:

- Make TimestampFormatFinder an instantiable class
- Overrides must be specified in Java date/time format - Joda format is no longer accepted
- Joda timestamp formats in outputs are now derived from the determined or overridden Java timestamp formats, not stored separately
- Functionality for determining the "best" timestamp format in a set of lines has been moved from TextLogFileStructureFinder to TimestampFormatFinder, taking advantage of the fact that TimestampFormatFinder is now an instantiable class with state
- The functionality to quickly rule out some possible Grok patterns when looking for timestamp formats has been changed from using simple regular expressions to the much faster approach of using the Shift-And method of sub-string search, but using an "alphabet" consisting of just 1 (representing any digit) and 0 (representing non-digits)
- Timestamp format overrides are now much more flexible
- Timestamp format overrides that do not correspond to a built-in Grok pattern are mapped to a %{CUSTOM_TIMESTAMP} Grok pattern whose definition is included within the date processor in the ingest pipeline
- Grok patterns that correspond to multiple Java date/time patterns are now handled better - the Grok pattern is accepted as matching broadly, and the required set of Java date/time patterns is built up considering all observed samples
- As a result of the more flexible acceptance of Grok patterns, when looking for the "best" timestamp in a set of lines timestamps are considered different if they are preceded by a different sequence of punctuation characters (to prevent timestamps far into some lines being considered similar to timestamps near the beginning of other lines)
- Out-of-the-box Grok patterns that are considered now include %{DATE} and %{DATESTAMP}, which have indeterminate day/month ordering
- The order of day/month in formats with indeterminate day/month order is determined by considering all observed samples (plus the server locale if the observed samples still do not suggest an ordering)

Relates elastic#38086
Closes elastic#35137
Closes elastic#35132
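
The idea of accumulating evidence across samples to separate dd/MM from MM/dd can be illustrated with a small sketch. This is not the code from this pull request: the class `DayMonthOrderGuesser`, its methods and its regular expression are hypothetical names chosen for illustration, and the sketch only looks at numeric dates whose first two fields are separated by `/`, `.` or `-`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the TimestampFormatFinder API.
final class DayMonthOrderGuesser {

    enum Order { DAY_FIRST, MONTH_FIRST, INDETERMINATE }

    // Assumed shape: one or two digits, a separator, one or two digits, a separator, a year.
    private static final Pattern FIRST_TWO_FIELDS =
        Pattern.compile("\\b(\\d{1,2})[/.-](\\d{1,2})[/.-]\\d{2,4}\\b");

    // Evidence accumulated over every sample seen so far.
    private int firstFieldOver12;
    private int secondFieldOver12;

    // Record evidence from one piece of text; may be called many times.
    void addSample(String text) {
        Matcher m = FIRST_TWO_FIELDS.matcher(text);
        while (m.find()) {
            if (Integer.parseInt(m.group(1)) > 12) {
                firstFieldOver12++;
            }
            if (Integer.parseInt(m.group(2)) > 12) {
                secondFieldOver12++;
            }
        }
    }

    // A field that ever exceeds 12 cannot be the month.
    Order guessOrder() {
        if (firstFieldOver12 > 0 && secondFieldOver12 == 0) {
            return Order.DAY_FIRST;
        }
        if (secondFieldOver12 > 0 && firstFieldOver12 == 0) {
            return Order.MONTH_FIRST;
        }
        // No evidence, or contradictory evidence. The commit message says the
        // server locale is consulted at this point; that step is omitted here.
        return Order.INDETERMINATE;
    }
}
```

Because the counters live in the instance, a single sample such as `03/04/2019` leaves the order undecided, while a later sample such as `25/04/2019` settles it for every timestamp seen so far. That is the reason the commit message gives for making the finder an instantiable class with state.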
Commit 2a19461
Commits on May 9, 2019
Improve timestamp format quick-rule-out functionality
Previously if a timestamp format was not quickly ruled out then we would search for it in the whole sample. Following this change the quick-rule-out patterns are used not only to completely rule out some formats but also to find the portion of the sample over which the format could possibly match. This helps a lot in the case of long lines that contain sections that nearly match one of our candidate timestamps but not quite (because regular expression matching is slowest in the case of patterns that nearly match).
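
A rough sketch of how a Shift-And search over a two-symbol alphabet (1 for a digit, 0 for anything else) could both rule a format out and narrow down the region worth handing to the slower regular expression. The class `DigitShapeMatcher`, the shape-string convention and the returned `{start, end}` pair are assumptions made for this example; the actual implementation in the commit may differ.

```java
// Hypothetical illustration only; names and return convention are assumptions.
final class DigitShapeMatcher {

    private final int length;
    private final long digitMask;    // bit i set if shape position i expects a digit
    private final long nonDigitMask; // bit i set if shape position i expects a non-digit

    // shape uses '1' for "any digit" and '0' for "any non-digit", e.g. "11011011" for HH:mm:ss
    DigitShapeMatcher(String shape) {
        if (shape.isEmpty() || shape.length() > 64) {
            throw new IllegalArgumentException("shape must have 1 to 64 positions");
        }
        long digits = 0;
        for (int i = 0; i < shape.length(); i++) {
            char c = shape.charAt(i);
            if (c == '1') {
                digits |= 1L << i;
            } else if (c != '0') {
                throw new IllegalArgumentException("shape may only contain 0 and 1");
            }
        }
        this.length = shape.length();
        long allPositions = (length == 64) ? -1L : (1L << length) - 1;
        this.digitMask = digits;
        this.nonDigitMask = ~digits & allPositions;
    }

    // Returns {start, endExclusive} covering the first through last shape match,
    // or null if the shape never occurs, in which case the format is ruled out.
    int[] findCandidateRegion(String text) {
        long state = 0;
        long acceptBit = 1L << (length - 1);
        int start = -1;
        int end = -1;
        for (int j = 0; j < text.length(); j++) {
            char c = text.charAt(j);
            long mask = (c >= '0' && c <= '9') ? digitMask : nonDigitMask;
            // Shift-And step: extend every partial match by one character.
            state = ((state << 1) | 1L) & mask;
            if ((state & acceptBit) != 0) {
                if (start < 0) {
                    start = j - length + 1;
                }
                end = j + 1;
            }
        }
        return start < 0 ? null : new int[] { start, end };
    }
}
```

Each character costs one shift, one OR and one AND, so the scan is linear in the line length regardless of how many near-misses the line contains. A null result rules the format out, and a non-null result bounds the substring on which the full pattern needs to be tried.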
Commit 2a862fd
Commit 7692d23
Commits on May 10, 2019
Commit d15182e
Commit 43fa94d
Commit 0db4167
Commits on May 11, 2019
Commit 6951d88
Commits on May 13, 2019
Commit d08c8be
Bring secondary timestamps in Grok pattern creator in line with primary timestamps in timestamp format finder
Commit baf107c
Commits on May 15, 2019
Commit b3fe912
Commits on May 22, 2019
1. Even though %{TIMESTAMP_ISO8601} cannot parse an ISO8601 date with no time, the ISO8601 date format can
2. The %{DATE} and %{DATESTAMP} Grok patterns accept a single digit month but not a single digit day
Commit 893d99b
Commit 2d1717e
Commit 1abe5fe
Commits on May 23, 2019
Commit 143cf69
Commit 9bca698
Commit e046918
Commit 6c88cbd