[ML] Improve file structure finder timestamp format determination #41948

droberts195 · 2019-05-08T13:40:04Z

This change contains a major refactoring of the timestamp
format determination code used by the ML find file structure
endpoint.

Previously timestamp format determination was done separately
for each piece of text supplied to the timestamp format finder.
This had the drawback that it was not possible to distinguish
dd/MM and MM/dd in the case where both numbers were 12 or less.
In order to do this sensibly it is best to look across all the
available timestamps and see if one of the numbers is greater
than 12 in any of them. This necessitates making the timestamp
format finder an instantiable class that can accumulate evidence
over time.
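
To illustrate the idea of accumulating evidence across samples, here is a minimal sketch. It is not the actual TimestampFormatFinder code: the class name `DayMonthOrderGuesser`, its methods and the simplified regular expression are assumptions made for this example only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the real TimestampFormatFinder.
class DayMonthOrderGuesser {

    // Simplified assumption: dates look like N/N/yyyy with 1 or 2 digit N.
    private static final Pattern TWO_NUMBERS = Pattern.compile("\\b(\\d{1,2})/(\\d{1,2})/\\d{4}\\b");

    private boolean firstNumberExceeds12;
    private boolean secondNumberExceeds12;

    /** Record evidence from one sample; samples without a matching date are ignored. */
    void addSample(String text) {
        Matcher matcher = TWO_NUMBERS.matcher(text);
        if (matcher.find()) {
            firstNumberExceeds12 |= Integer.parseInt(matcher.group(1)) > 12;
            secondNumberExceeds12 |= Integer.parseInt(matcher.group(2)) > 12;
        }
    }

    /** TRUE means dd/MM, FALSE means MM/dd, null means the evidence is absent or contradictory. */
    Boolean guessIsDayFirst() {
        if (firstNumberExceeds12 == secondNumberExceeds12) {
            return null;
        }
        return firstNumberExceeds12;
    }

    public static void main(String[] args) {
        DayMonthOrderGuesser guesser = new DayMonthOrderGuesser();
        guesser.addSample("05/06/2019 10:00:00"); // ambiguous on its own
        guesser.addSample("17/06/2019 10:00:00"); // 17 cannot be a month
        System.out.println(guesser.guessIsDayFirst());
    }
}
```

A single sample such as 05/06/2019 leaves the answer undecided, which is why a stateless, per-text approach cannot resolve the ordering. When the result is still null after all samples, the caller needs some other tie-breaker.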

Another problem with the previous approach was that it was only
possible to override the timestamp format to one of a limited
set of timestamp formats. There was no way out if a file to be
analysed had a timestamp that was sane yet not in the supported
set. This is now changed to allow any timestamp format that can
be parsed by a combination of these Java date/time formats:
yy, yyyy, M, MM, MMM, MMMM, d, dd, EEE, EEEE, H, HH, h, mm, ss,
a, XX, XXX, zzz
Additionally S letter groups (fractional seconds) are supported
provided they occur after ss and are separated from the ss by a dot,
comma or colon. Spacing and punctuation are also permitted, with
the exception of the question mark, newline and carriage return
characters, as is literal text enclosed in single quotes.
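
As a rough illustration of what such an override looks like, the sketch below parses a timestamp with a pattern built only from the letter groups listed above, using the standard `java.time` API. The class name `CustomTimestampFormatExample` and the sample timestamp are assumptions for this example; it does not show how the endpoint itself validates overrides.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical illustration only.
class CustomTimestampFormatExample {

    /** Parse text using a Java date/time pattern assembled from the supported letter groups. */
    static OffsetDateTime parse(String javaPattern, String text) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(javaPattern, Locale.ROOT);
        return OffsetDateTime.parse(text, formatter);
    }

    public static void main(String[] args) {
        // yyyy, MM, dd, HH, mm, ss, a comma followed by SSS, then XX, with 'T' as quoted literal text
        System.out.println(parse("yyyy-MM-dd'T'HH:mm:ss,SSSXX", "2019-05-08T13:40:04,123+0100"));
    }
}
```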

The full list of changes/improvements in this refactor is:

- Make TimestampFormatFinder an instantiable class
- Overrides must be specified in Java date/time format - Joda
  format is no longer accepted
- Joda timestamp formats in outputs are now derived from the
  determined or overridden Java timestamp formats, not stored
  separately
- Functionality for determining the "best" timestamp format in
  a set of lines has been moved from TextLogFileStructureFinder
  to TimestampFormatFinder, taking advantage of the fact that
  TimestampFormatFinder is now an instantiable class with state
- The functionality to quickly rule out some possible Grok
  patterns when looking for timestamp formats has been changed
  from using simple regular expressions to the much faster
  approach of using the Shift-And method of sub-string search,
  but using an "alphabet" consisting of just 1 (representing any
  digit) and 0 (representing non-digits)
- Timestamp format overrides are now much more flexible
- Timestamp format overrides that do not correspond to a built-in
  Grok pattern are mapped to a %{CUSTOM_TIMESTAMP} Grok pattern
  whose definition is included within the date processor in the
  ingest pipeline
- Grok patterns that correspond to multiple Java date/time
  patterns are now handled better - the Grok pattern is accepted
  as matching broadly, and the required set of Java date/time
  patterns is built up considering all observed samples
- As a result of the more flexible acceptance of Grok patterns,
  when looking for the "best" timestamp in a set of lines
  timestamps are considered different if they are preceded by
  a different sequence of punctuation characters (to prevent
  timestamps far into some lines being considered similar to
  timestamps near the beginning of other lines)
- Out-of-the-box Grok patterns that are considered now include
  %{DATE} and %{DATESTAMP}, which have indeterminate day/month
  ordering
- The order of day/month in formats with indeterminate day/month
  order is determined by considering all observed samples (plus
  the server locale if the observed samples still do not suggest
  an ordering)
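
The Shift-And quick rule-out mentioned in the list can be sketched as follows. This is a generic textbook Shift-And search restricted to a two-symbol alphabet, written for illustration; the class name `DigitShapeShiftAnd`, the method `indexOfShape` and the way the shape is encoded as a string of 1s and 0s are assumptions, not the code in this PR.

```java
// Hypothetical illustration only; not the code from this PR.
class DigitShapeShiftAnd {

    /**
     * Returns the start index of the first place in text whose digit/non-digit
     * layout equals shape ('1' = any digit, '0' = any non-digit), or -1 if none.
     */
    static int indexOfShape(String shape, String text) {
        int m = shape.length();
        if (m == 0 || m > 64) {
            throw new IllegalArgumentException("shape length must be between 1 and 64");
        }
        long digitMask = 0;     // bit i set if shape position i requires a digit
        long nonDigitMask = 0;  // bit i set if shape position i requires a non-digit
        for (int i = 0; i < m; i++) {
            if (shape.charAt(i) == '1') {
                digitMask |= 1L << i;
            } else {
                nonDigitMask |= 1L << i;
            }
        }
        long acceptBit = 1L << (m - 1);
        long state = 0; // bit i set means shape[0..i] matches the text ending at the current position
        for (int pos = 0; pos < text.length(); pos++) {
            char c = text.charAt(pos);
            long mask = (c >= '0' && c <= '9') ? digitMask : nonDigitMask;
            state = ((state << 1) | 1L) & mask;
            if ((state & acceptBit) != 0) {
                return pos - m + 1;
            }
        }
        return -1;
    }
}
```

For example, the shape `1111011011` corresponds to the layout of a yyyy-MM-dd date. If `indexOfShape` returns -1 for a piece of text, a candidate format with that layout cannot match it, so the more expensive Grok or regular expression match can be skipped.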

Relates #38086
Closes #35137
Closes #35132

elasticmachine · 2019-05-08T13:40:06Z

Pinging @elastic/ml-core

Previously if a timestamp format was not quickly ruled out then we would search for it in the whole sample. Following this change the quick-rule-out patterns are used not only to completely rule out some formats but also to find the portion of the sample over which the format could possibly match. This helps a lot in the case of long lines that contain sections that nearly match one of our candidate timestamps but not quite (because regular expression matching is slowest in the case of patterns that nearly match).
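
A minimal sketch of this idea, under simplifying assumptions: a cheap digit/non-digit scan locates windows where the layout fits, and the regular expression is then applied only to each window via `Matcher.region`. The class name, the hard-coded shape and the pattern are hypothetical, and a plain `indexOf` on a digit map stands in for the faster search used in the PR.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the code from this PR.
class RegionLimitedTimestampSearch {

    // Layout of yyyy-MM-dd HH:mm:ss with 1 = digit and 0 = non-digit
    private static final String SHAPE = "1111011011011011011";
    private static final Pattern TIMESTAMP = Pattern.compile("\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}");

    /** Returns the first timestamp in text, or null if there is none. */
    static String findTimestamp(String text) {
        // Cheap pass: reduce the text to 1 (digit) and 0 (non-digit).
        StringBuilder digitMap = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            digitMap.append(c >= '0' && c <= '9' ? '1' : '0');
        }
        Matcher matcher = TIMESTAMP.matcher(text);
        int start = digitMap.indexOf(SHAPE);
        while (start >= 0) {
            // Expensive pass: the regex only sees the window where the layout fits.
            matcher.region(start, start + SHAPE.length());
            if (matcher.matches()) {
                return matcher.group();
            }
            start = digitMap.indexOf(SHAPE, start + 1);
        }
        return null;
    }
}
```

In a long line, a near miss such as `2019/05/08 13-40-04` costs one short, bounded regex attempt instead of a scan over the whole sample.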

benwtrent

Some minor things on the first read through.

This is a ton to grok (pun intended, in fact, I think I used this same pun on the last time a huge PR was made for the fsf...).

...n/ml/src/main/java/org/elasticsearch/xpack/ml/filestructurefinder/TimestampFormatFinder.java

benwtrent

Will definitely need a second set of 👀 :)

dimitris-athanasiou

Left a few comments. There are some cool but complex ideas in here!

dimitris-athanasiou · 2019-05-23T08:24:45Z

docs/reference/ml/apis/find-file-structure.asciidoc

-* `yyyy-MM-dd'T'HH:mm:ss,SSSXX`
-* `yyyy-MM-dd'T'HH:mm:ss,SSSXXX`
-* `yyyyMMddHHmmss`
+full `grok_pattern`. Another is where the timestamp format is one that the


nit: change "where" to "when" to match the previous sentence?

dimitris-athanasiou · 2019-05-23T08:30:53Z

docs/reference/ml/apis/find-file-structure.asciidoc

@@ -263,8 +252,18 @@ If the request does not encounter errors, you receive the following result:
  "charset" : "UTF-8", <4>
  "has_byte_order_marker" : false, <5>
  "format" : "ndjson", <6>
-  "need_client_timezone" : false, <7>
+  "timestamp_field" : "release_date",


These are not tagged with explanations like other fields around them. Is that on purpose?

dimitris-athanasiou · 2019-05-23T09:37:27Z

docs/reference/ml/apis/find-file-structure.asciidoc

+structure finder does not consider by default.
+
+If this parameter is not specified, the structure finder chooses the best
+format from a built in set.


nit: should it be built-in?

dimitris-athanasiou · 2019-05-23T10:18:31Z

...ugin/ml/src/main/java/org/elasticsearch/xpack/ml/filestructurefinder/GrokPatternCreator.java

-                    TimestampMatch timestampMatch = TimestampFormatFinder.findFirstFullMatch(values.iterator().next(), timeoutChecker);
-                    if (timestampMatch != null) {
-                        fullMappingType = timestampMatch.getEsDateMappingTypeWithFormat();
+                    TimestampFormatFinder timestampFormatFinder = new TimestampFormatFinder(explanation, true, true, true, timeoutChecker);


This for loop, all the way to the call to timestampFormatFinder.getEsDateMappingTypeWithFormat(), is repeated in FileStructureUtils around line 277. I wonder if we could refactor that into a method.

dimitris-athanasiou · 2019-05-23T10:24:54Z

...n/ml/src/main/java/org/elasticsearch/xpack/ml/filestructurefinder/TimestampFormatFinder.java

-        return findFirstMatch(text, 0, timeoutChecker);
+    public TimestampFormatFinder(List<String> explanation, String overrideFormat, boolean requireFullMatch, boolean errorOnNoTimestamp,
+                                 boolean errorOnMultiplePatterns, TimeoutChecker timeoutChecker) {
+        this.explanation = explanation;


add Objects.requireNonNull or @Nullable where suitable
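
For readers unfamiliar with the suggestion, a small sketch of the `Objects.requireNonNull` part follows. The class `FinderConstructorSketch` and its two-argument constructor are hypothetical and much simpler than the real constructor, and the choice of which argument is optional is an assumption made for this example. The `@Nullable` annotation is only mentioned in a comment so that the example stays self-contained.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical illustration only; not the real TimestampFormatFinder constructor.
class FinderConstructorSketch {

    private final List<String> explanation;
    // Assumed optional here: null means no override was requested.
    // In the real code base this is where a @Nullable annotation would go.
    private final String overrideFormat;

    FinderConstructorSketch(List<String> explanation, String overrideFormat) {
        // Fail fast with a clear message instead of a later NullPointerException.
        this.explanation = Objects.requireNonNull(explanation, "explanation must not be null");
        this.overrideFormat = overrideFormat;
    }

    boolean hasOverride() {
        return overrideFormat != null;
    }
}
```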

dimitris-athanasiou · 2019-05-23T10:35:50Z

...n/ml/src/main/java/org/elasticsearch/xpack/ml/filestructurefinder/TimestampFormatFinder.java

-                for (Integer quickRuleOutIndex : candidate.quickRuleOutIndices) {
-                    if (quickRuleoutMatches[quickRuleOutIndex] == null) {
-                        quickRuleoutMatches[quickRuleOutIndex] = QUICK_RULE_OUT_PATTERNS.get(quickRuleOutIndex).matcher(text).find();
+    private static TimestampMatch checkCandidate(CandidateTimestampFormat candidate, String text, BitSet numberPosBitSet,


add @Nullable where suitable? Especially in methods with optional arguments it helps a lot when reading.

dimitris-athanasiou · 2019-05-23T10:37:36Z

...n/ml/src/main/java/org/elasticsearch/xpack/ml/filestructurefinder/TimestampFormatFinder.java

+            return;
+        }
+
+        int remainingMatches = matches.size();


This block seems a nice candidate to extract into a method that returns the weights.

dimitris-athanasiou · 2019-05-23T10:40:04Z

...n/ml/src/main/java/org/elasticsearch/xpack/ml/filestructurefinder/TimestampFormatFinder.java

+
+        timeoutChecker.check("timestamp format determination");
+
+        double highestWeight = 0.0;


This could be a Tuple<Integer, Double> findHighestWeight(double[] weights) method.

dimitris-athanasiou · 2019-05-23T10:43:24Z

...n/ml/src/main/java/org/elasticsearch/xpack/ml/filestructurefinder/TimestampFormatFinder.java

+        timeoutChecker.check("timestamp format determination");
+
+        // Is the selected format not already at the beginning of the list?
+        if (highestWeightFormatIndex > 0) {


And finally, this could be selectHighestWeightFormat.

The above would make the method roughly read like:

calcMatchWeights(); findHighestWeight(); selectHighestWeightFormat();

Take all of these as mere suggestions. I know in the end it's down to personal preference, but I find that breaking down methods like this increases readability a lot, as it prepares the reader to understand what the lower-level code is trying to achieve.
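
A rough sketch of what two of the suggested helpers might look like is shown below. Everything here is an assumption for illustration: `Map.Entry` stands in for the `Tuple<Integer, Double>` mentioned above so the example is self-contained, and moving the selected format to the front of the list is a guess based on the "not already at the beginning of the list" comment in the diff.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the code from this PR.
class HighestWeightSketch {

    /** Returns (index, weight) of the largest weight, with index -1 if no weight exceeds zero. */
    static Map.Entry<Integer, Double> findHighestWeight(double[] weights) {
        int highestWeightIndex = -1;
        double highestWeight = 0.0;
        for (int i = 0; i < weights.length; i++) {
            if (weights[i] > highestWeight) {
                highestWeight = weights[i];
                highestWeightIndex = i;
            }
        }
        return new SimpleImmutableEntry<>(highestWeightIndex, highestWeight);
    }

    /** Moves the format at the given index to the front of the list, if it is not already there. */
    static <T> void selectHighestWeightFormat(List<T> formats, int highestWeightFormatIndex) {
        if (highestWeightFormatIndex > 0) {
            formats.add(0, formats.remove(highestWeightFormatIndex));
        }
    }
}
```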

dimitris-athanasiou · 2019-05-23T10:48:52Z

...n/ml/src/main/java/org/elasticsearch/xpack/ml/filestructurefinder/TimestampFormatFinder.java

+
+            if (onlyConsiderFormat == null || onlyConsiderFormat.canMergeWith(match.timestampFormat)) {
+
+                if (match.firstIndeterminateDateNumber > 0) {


What does the zero check represent here?

dimitris-athanasiou

LGTM Just left a typo in a comment

dimitris-athanasiou · 2019-05-23T15:26:38Z

...n/ml/src/main/java/org/elasticsearch/xpack/ml/filestructurefinder/TimestampFormatFinder.java

@@ -718,7 +793,11 @@ Boolean guessIsDayFirstFromMatches(TimestampFormat onlyConsiderFormat) {

            if (onlyConsiderFormat == null || onlyConsiderFormat.canMergeWith(match.timestampFormat)) {

+                // Valid indetermine day/month numbers will be in the range 1 to 31.


typo: indeterminate; it is a confusing one to type :-)

Gah, I fixed the exact same typo in another place in another recent commit

droberts195 · 2019-05-23T19:01:01Z

Jenkins run elasticsearch-ci/1

droberts195 added >enhancement :ml Machine learning v8.0.0 v7.3.0 labels May 8, 2019

droberts195 mentioned this pull request May 8, 2019

[ML] find_file_structure endpoint override can change field type as an unexpected side effect #35132

Closed

droberts195 mentioned this pull request May 9, 2019

[ML] Slow performance of file structure finder with log message containing many dates #35137

Closed

benwtrent reviewed May 9, 2019

droberts195 added 2 commits May 9, 2019 18:27

Address some code review comments

7692d23

Change exceptions for programmer mistakes to assertions

d15182e

benwtrent approved these changes May 10, 2019

droberts195 added 9 commits May 10, 2019 14:38

Finish off the custom Grok pattern functionality for custom timestamp formats

43fa94d

Add missing Javadoc params

0db4167

Add support for ISO8601 date with no time

6951d88

Use empty map rather than null when no custom Grok patterns

d08c8be

Bring secondary timestamps in Grok pattern creator in line with primary timestamps in timestamp format finder

baf107c

Merge branch 'master' into timestamp_finder_improvements

b3fe912

Fix a couple of format quirks

893d99b

1. Even though %{TIMESTAMP_ISO8601} cannot parse an ISO8601 date with no time, the ISO8601 date format can
2. The %{DATE} and %{DATESTAMP} Grok patterns accept a single digit month but not a single digit day

Merge branch 'master' into timestamp_finder_improvements

2d1717e

Fixing docs test

1abe5fe

droberts195 mentioned this pull request May 22, 2019

[ML] Improved timestamp override functionality elastic/kibana#36880

Closed

dimitris-athanasiou reviewed May 23, 2019

droberts195 added 3 commits May 23, 2019 12:15

Memory optimisations

143cf69

Also fixing a couple of typos

Merge branch 'master' into timestamp_finder_improvements

9bca698

Address review comments

e046918

dimitris-athanasiou approved these changes May 23, 2019

Fix typo

6c88cbd

droberts195 merged commit a15f1ee into elastic:master May 23, 2019

droberts195 deleted the timestamp_finder_improvements branch May 23, 2019 20:06

droberts195 mentioned this pull request May 30, 2019

[ML] Improve file structure finder timestamp recognition #38086

Closed

5 tasks

droberts195 mentioned this pull request Jun 23, 2020

[ML] Need to be able to quickly/temporarily adjust timeout when uploading to File Data Visualizer elastic/kibana#69624

Open

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
