[ML] Improve file structure finder timestamp format determination (elastic#41948)

This change contains a major refactoring of the timestamp format
determination code used by the ML find file structure endpoint.

Previously timestamp format determination was done separately for each
piece of text supplied to the timestamp format finder. This had the
drawback that it was not possible to distinguish dd/MM and MM/dd in the
case where both numbers were 12 or less. In order to do this sensibly it
is best to look across all the available timestamps and see if one of the
numbers is greater than 12 in any of them. This necessitates making the
timestamp format finder an instantiable class that can accumulate
evidence over time.

Another problem with the previous approach was that it was only possible
to override the timestamp format to one of a limited set of timestamp
formats. There was no way out if a file to be analysed had a timestamp
that was sane yet not in the supported set. This is now changed to allow
any timestamp format that can be parsed by a combination of these Java
date/time formats:

yy, yyyy, M, MM, MMM, MMMM, d, dd, EEE, EEEE, H, HH, h, mm, ss, a, XX, XXX, zzz

Additionally S letter groups (fractional seconds) are supported providing
they occur after ss and separated from the ss by a dot, comma or colon.
Spacing and punctuation is also permitted with the exception of the
question mark, newline and carriage return characters, together with
literal text enclosed in single quotes.

The full list of changes/improvements in this refactor is:

- Make TimestampFormatFinder an instantiable class
- Overrides must be specified in Java date/time format - Joda format is
  no longer accepted
- Joda timestamp formats in outputs are now derived from the determined
  or overridden Java timestamp formats, not stored separately
- Functionality for determining the "best" timestamp format in a set of
  lines has been moved from TextLogFileStructureFinder to
  TimestampFormatFinder, taking advantage of the fact that
  TimestampFormatFinder is now an instantiable class with state
- The functionality to quickly rule out some possible Grok patterns when
  looking for timestamp formats has been changed from using simple
  regular expressions to the much faster approach of using the Shift-And
  method of sub-string search, but using an "alphabet" consisting of just
  1 (representing any digit) and 0 (representing non-digits)
- Timestamp format overrides are now much more flexible
- Timestamp format overrides that do not correspond to a built-in Grok
  pattern are mapped to a %{CUSTOM_TIMESTAMP} Grok pattern whose
  definition is included within the date processor in the ingest pipeline
- Grok patterns that correspond to multiple Java date/time patterns are
  now handled better - the Grok pattern is accepted as matching broadly,
  and the required set of Java date/time patterns is built up considering
  all observed samples
- As a result of the more flexible acceptance of Grok patterns, when
  looking for the "best" timestamp in a set of lines timestamps are
  considered different if they are preceded by a different sequence of
  punctuation characters (to prevent timestamps far into some lines being
  considered similar to timestamps near the beginning of other lines)
- Out-of-the-box Grok patterns that are considered now include %{DATE}
  and %{DATESTAMP}, which have indeterminate day/month ordering
- The order of day/month in formats with indeterminate day/month order is
  determined by considering all observed samples (plus the server locale
  if the observed samples still do not suggest an ordering)

Relates elastic#38086
Closes elastic#35137
Closes elastic#35132
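
The idea of accumulating evidence about day/month ordering across all samples can be illustrated with a minimal sketch. This is not the code from this commit: the class `DayMonthOrderGuesser`, its methods and the `Order` enum are hypothetical names chosen for illustration, and it assumes each sample starts with two one- or two-digit numbers separated by punctuation.

```java
import java.util.List;

/** Hypothetical illustration only; not the TimestampFormatFinder from this commit. */
final class DayMonthOrderGuesser {

    enum Order { DAY_FIRST, MONTH_FIRST, UNDETERMINED }

    // Evidence accumulated over every sample seen so far.
    private boolean firstExceeds12;
    private boolean secondExceeds12;

    /** Record evidence from one timestamp such as "25/06/2019 10:10:00". */
    void addSample(String timestamp) {
        String[] numbers = timestamp.trim().split("[^0-9]+");
        if (numbers.length < 2 || numbers[0].isEmpty()
            || numbers[0].length() > 2 || numbers[1].length() > 2) {
            return; // does not start with two short numbers, so it provides no evidence
        }
        if (Integer.parseInt(numbers[0]) > 12) {
            firstExceeds12 = true;
        }
        if (Integer.parseInt(numbers[1]) > 12) {
            secondExceeds12 = true;
        }
    }

    /** A number above 12 cannot be a month, so it must be the day. */
    Order guess() {
        if (firstExceeds12 && secondExceeds12 == false) {
            return Order.DAY_FIRST;
        }
        if (secondExceeds12 && firstExceeds12 == false) {
            return Order.MONTH_FIRST;
        }
        // No evidence, or contradictory evidence: a caller would need a fallback,
        // such as the server locale mentioned in the commit message.
        return Order.UNDETERMINED;
    }

    public static void main(String[] args) {
        DayMonthOrderGuesser guesser = new DayMonthOrderGuesser();
        for (String sample : List.of("05/06/2019 10:00:00", "11/06/2019 10:05:00", "25/06/2019 10:10:00")) {
            guesser.addSample(sample);
        }
        System.out.println(guesser.guess()); // DAY_FIRST
    }
}
```

The first two samples alone leave the ordering undetermined. Only the third, with 25 in the first position, settles it, which is why the decision cannot be made one timestamp at a time.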
gurkankaymak · May 27, 2019 · 24d2fe7
1 parent 4e2ce09
commit 24d2fe7

Showing 13 changed files with 2,907 additions and 1,179 deletions.
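
The commit message mentions ruling out candidate patterns with the Shift-And method over a two-symbol alphabet (1 for any digit, 0 for any non-digit). The sketch below shows the general technique under stated assumptions; it is not the implementation in this commit. The class name `DigitShapeMatcher`, the shape-string notation and the 64-character limit (one `long` of state) are all choices made for this illustration.

```java
/** Hypothetical illustration of Shift-And search over a digit/non-digit alphabet. */
final class DigitShapeMatcher {

    private final long digitMask;    // bit i set if position i of the shape must be a digit
    private final long nonDigitMask; // bit i set if position i of the shape must be a non-digit
    private final long acceptBit;    // bit for the last position of the shape

    /** @param shape a string of '1' (digit) and '0' (non-digit), e.g. "1111011011" for yyyy-MM-dd */
    DigitShapeMatcher(String shape) {
        if (shape.isEmpty() || shape.length() > 64) {
            throw new IllegalArgumentException("shape must have between 1 and 64 characters");
        }
        long digits = 0;
        long nonDigits = 0;
        for (int i = 0; i < shape.length(); i++) {
            char c = shape.charAt(i);
            if (c == '1') {
                digits |= 1L << i;
            } else if (c == '0') {
                nonDigits |= 1L << i;
            } else {
                throw new IllegalArgumentException("shape may only contain '0' and '1'");
            }
        }
        this.digitMask = digits;
        this.nonDigitMask = nonDigits;
        this.acceptBit = 1L << (shape.length() - 1);
    }

    /** Returns true if some substring of the text has the digit/non-digit shape. */
    boolean occursIn(String text) {
        long state = 0; // bit i set means the last i+1 characters match the first i+1 shape positions
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            long mask = (c >= '0' && c <= '9') ? digitMask : nonDigitMask;
            state = ((state << 1) | 1L) & mask;
            if ((state & acceptBit) != 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        DigitShapeMatcher isoDate = new DigitShapeMatcher("1111011011");
        System.out.println(isoDate.occursIn("[2019-05-27T10:15:30] started")); // true
        System.out.println(isoDate.occursIn("started without a date"));        // false
    }
}
```

Each character costs one shift, one OR and one AND, so a line can be screened cheaply before any more expensive pattern matching is attempted. A match only shows that the digit layout is compatible; it does not prove a valid timestamp is present.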
diff --git a/docs/reference/ml/apis/find-file-structure.asciidoc b/docs/reference/ml/apis/find-file-structure.asciidoc
@@ -147,57 +147,46 @@ is not compulsory to have a timestamp in the file.
 --
 
 `timestamp_format`::
-  (string) The time format of the timestamp field in the file. +
+  (string) The Java time format of the timestamp field in the file. +
 +
 --
-NOTE: Currently there is a limitation that this format must be one that the
-structure finder might choose by itself. The reason for this restriction is that
-to consistently set all the fields in the response the structure finder needs a
-corresponding Grok pattern name and simple regular expression for each timestamp
-format. Therefore, there is little value in specifying this parameter for
-structured file formats. If you know which field contains your primary timestamp,
-it is as good and less error-prone to just specify `timestamp_field`.
-
-The valuable use case for this parameter is when the format is semi-structured
+NOTE: Only a subset of Java time format letter groups are supported:
+
+* `a`
+* `d`
+* `dd`
+* `EEE`
+* `EEEE`
+* `H`
+* `HH`
+* `h`
+* `M`
+* `MM`
+* `MMM`
+* `MMMM`
+* `mm`
+* `ss`
+* `XX`
+* `XXX`
+* `yy`
+* `yyyy`
+* `zzz`
+
+Additionally `S` letter groups (fractional seconds) of length one to nine are
+supported providing they occur after `ss` and separated from the `ss` by a `.`,
+`,` or `:`. Spacing and punctuation is also permitted with the exception of `?`,
+newline and carriage return, together with literal text enclosed in single
+quotes. For example, `MM/dd HH.mm.ss,SSSSSS 'in' yyyy` is a valid override
+format.
+
+One valuable use case for this parameter is when the format is semi-structured
 text, there are multiple timestamp formats in the file, and you know which
 format corresponds to the primary timestamp, but you do not want to specify the
-full `grok_pattern`.
-
-If this parameter is not specified, the structure finder chooses the best format from
-the formats it knows, which are these Java time formats:
-
-* `dd/MMM/yyyy:HH:mm:ss XX`
-* `EEE MMM dd HH:mm zzz yyyy`
-* `EEE MMM dd HH:mm:ss yyyy`
-* `EEE MMM dd HH:mm:ss zzz yyyy`
-* `EEE MMM dd yyyy HH:mm zzz`
-* `EEE MMM dd yyyy HH:mm:ss zzz`
-* `EEE, dd MMM yyyy HH:mm XX`
-* `EEE, dd MMM yyyy HH:mm XXX`
-* `EEE, dd MMM yyyy HH:mm:ss XX`
-* `EEE, dd MMM yyyy HH:mm:ss XXX`
-* `ISO8601`
-* `MMM  d HH:mm:ss`
-* `MMM  d HH:mm:ss,SSS`
-* `MMM  d yyyy HH:mm:ss`
-* `MMM dd HH:mm:ss`
-* `MMM dd HH:mm:ss,SSS`
-* `MMM dd yyyy HH:mm:ss`
-* `MMM dd, yyyy h:mm:ss a`
-* `TAI64N`
-* `UNIX`
-* `UNIX_MS`
-* `yyyy-MM-dd HH:mm:ss`
-* `yyyy-MM-dd HH:mm:ss,SSS`
-* `yyyy-MM-dd HH:mm:ss,SSS XX`
-* `yyyy-MM-dd HH:mm:ss,SSSXX`
-* `yyyy-MM-dd HH:mm:ss,SSSXXX`
-* `yyyy-MM-dd HH:mm:ssXX`
-* `yyyy-MM-dd HH:mm:ssXXX`
-* `yyyy-MM-dd'T'HH:mm:ss,SSS`
-* `yyyy-MM-dd'T'HH:mm:ss,SSSXX`
-* `yyyy-MM-dd'T'HH:mm:ss,SSSXXX`
-* `yyyyMMddHHmmss`
+full `grok_pattern`. Another is when the timestamp format is one that the
+structure finder does not consider by default.
+
+If this parameter is not specified, the structure finder chooses the best
+format from a built-in set.
 
 --
 
@@ -263,8 +252,18 @@ If the request does not encounter errors, you receive the following result:
   "charset" : "UTF-8", <4>
   "has_byte_order_marker" : false, <5>
   "format" : "ndjson", <6>
-  "need_client_timezone" : false, <7>
-  "mappings" : { <8>
+  "timestamp_field" : "release_date", <7>
+  "joda_timestamp_formats" : [ <8>
+    "ISO8601"
+  ],
+  "java_timestamp_formats" : [ <9>
+    "ISO8601"
+  ],
+  "need_client_timezone" : true, <10>
+  "mappings" : { <11>
+    "@timestamp" : {
+      "type" : "date"
+    },
     "author" : {
       "type" : "keyword"
     },
@@ -275,10 +274,25 @@ If the request does not encounter errors, you receive the following result:
       "type" : "long"
     },
     "release_date" : {
-      "type" : "keyword"
+      "type" : "date",
+      "format" : "iso8601"
     }
   },
-  "field_stats" : { <9>
+  "ingest_pipeline" : {
+    "description" : "Ingest pipeline created by file structure finder",
+    "processors" : [
+      {
+        "date" : {
+          "field" : "release_date",
+          "timezone" : "{{ beat.timezone }}",
+          "formats" : [
+            "ISO8601"
+          ]
+        }
+      }
+    ]
+  },
+  "field_stats" : { <12>
     "author" : {
       "count" : 24,
       "cardinality" : 20,
@@ -484,17 +498,22 @@ If the request does not encounter errors, you receive the following result:
 <5> For UTF character encodings, `has_byte_order_marker` indicates whether the
     file begins with a byte order marker.
 <6> `format` is one of `ndjson`, `xml`, `delimited` or `semi_structured_text`.
-<7> If a timestamp format is detected that does not include a timezone,
-    `need_client_timezone` will be `true`. The server that parses the file must
-    therefore be told the correct timezone by the client.
-<8> `mappings` contains some suitable mappings for an index into which the data
-    could be ingested. In this case, the `release_date` field has been given a
-    `keyword` type as it is not considered specific enough to convert to the
-    `date` type.
-<9> `field_stats` contains the most common values of each field, plus basic
-    numeric statistics for the numeric `page_count` field. This information
-    may provide clues that the data needs to be cleaned or transformed prior
-    to use by other {ml} functionality.
+<7> The `timestamp_field` names the field considered most likely to be the
+    primary timestamp of each document.
+<8> `joda_timestamp_formats` are used to tell Logstash how to parse timestamps.
+<9> `java_timestamp_formats` are the Java time formats recognized in the time
+    fields. Elasticsearch mappings and Ingest pipeline use this format.
+<10> If a timestamp format is detected that does not include a timezone,
+     `need_client_timezone` will be `true`. The server that parses the file must
+     therefore be told the correct timezone by the client.
+<11> `mappings` contains some suitable mappings for an index into which the data
+     could be ingested. In this case, the `release_date` field has been given a
+     `keyword` type as it is not considered specific enough to convert to the
+     `date` type.
+<12> `field_stats` contains the most common values of each field, plus basic
+     numeric statistics for the numeric `page_count` field. This information
+     may provide clues that the data needs to be cleaned or transformed prior
+     to use by other {ml} functionality.
 
 The next example shows how it's possible to find the structure of some New York
 City yellow cab trip data. The first `curl` command downloads the data, the
@@ -526,7 +545,7 @@ If the request does not encounter errors, you receive the following result:
   "charset" : "UTF-8",
   "has_byte_order_marker" : false,
   "format" : "delimited", <2>
-  "multiline_start_pattern" : "^.*?,\"?\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}",
+  "multiline_start_pattern" : "^.*?,\"?\\d{4}-\\d{2}-\\d{2}[T ]\\d{2}:\\d{2}",
   "exclude_lines_pattern" : "^\"?VendorID\"?,\"?tpep_pickup_datetime\"?,\"?tpep_dropoff_datetime\"?,\"?passenger_count\"?,\"?trip_distance\"?,\"?RatecodeID\"?,\"?store_and_fwd_flag\"?,\"?PULocationID\"?,\"?DOLocationID\"?,\"?payment_type\"?,\"?fare_amount\"?,\"?extra\"?,\"?mta_tax\"?,\"?tip_amount\"?,\"?tolls_amount\"?,\"?improvement_surcharge\"?,\"?total_amount\"?",
   "column_names" : [ <3>
     "VendorID",
@@ -1361,14 +1380,14 @@ this:
   "charset" : "UTF-8",
   "has_byte_order_marker" : false,
   "format" : "semi_structured_text", <1>
-  "multiline_start_pattern" : "^\\[\\b\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2},\\d{3}", <2>
+  "multiline_start_pattern" : "^\\[\\b\\d{4}-\\d{2}-\\d{2}[T ]\\d{2}:\\d{2}", <2>
   "grok_pattern" : "\\[%{TIMESTAMP_ISO8601:timestamp}\\]\\[%{LOGLEVEL:loglevel}.*", <3>
   "timestamp_field" : "timestamp",
   "joda_timestamp_formats" : [
     "ISO8601"
   ],
   "java_timestamp_formats" : [
-    "yyyy-MM-dd'T'HH:mm:ss,SSS"
+    "ISO8601"
   ],
   "need_client_timezone" : true,
   "mappings" : {
@@ -1398,7 +1417,7 @@ this:
           "field" : "timestamp",
           "timezone" : "{{ beat.timezone }}",
           "formats" : [
-            "yyyy-MM-dd'T'HH:mm:ss,SSS"
+            "ISO8601"
           ]
         }
       },
@@ -1515,14 +1534,14 @@ this:
   "charset" : "UTF-8",
   "has_byte_order_marker" : false,
   "format" : "semi_structured_text",
-  "multiline_start_pattern" : "^\\[\\b\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2},\\d{3}",
+  "multiline_start_pattern" : "^\\[\\b\\d{4}-\\d{2}-\\d{2}[T ]\\d{2}:\\d{2}",
   "grok_pattern" : "\\[%{TIMESTAMP_ISO8601:timestamp}\\]\\[%{LOGLEVEL:loglevel} *\\]\\[%{JAVACLASS:class} *\\] \\[%{HOSTNAME:node}\\] %{JAVALOGMESSAGE:message}", <1>
   "timestamp_field" : "timestamp",
   "joda_timestamp_formats" : [
     "ISO8601"
   ],
   "java_timestamp_formats" : [
-    "yyyy-MM-dd'T'HH:mm:ss,SSS"
+    "ISO8601"
   ],
   "need_client_timezone" : true,
   "mappings" : {
@@ -1558,7 +1577,7 @@ this:
           "field" : "timestamp",
           "timezone" : "{{ beat.timezone }}",
           "formats" : [
-            "yyyy-MM-dd'T'HH:mm:ss,SSS"
+            "ISO8601"
           ]
         }
       },
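
The `timestamp_format` override rules documented in the find-file-structure.asciidoc changes above can be sketched as a simple check. This is an illustration of the documented rules only, not the validation code in this commit: the class `OverrideFormatCheck` and the method `isValid` are hypothetical, ASCII letters are assumed to be the only pattern letters, and `Set.of` needs Java 9 or later.

```java
import java.util.Set;

/** Hypothetical illustration of the documented override rules; not the commit's code. */
final class OverrideFormatCheck {

    // The letter groups listed in the documentation.
    private static final Set<String> SUPPORTED = Set.of(
        "a", "d", "dd", "EEE", "EEEE", "H", "HH", "h", "M", "MM", "MMM", "MMMM",
        "mm", "ss", "XX", "XXX", "yy", "yyyy", "zzz");

    static boolean isValid(String format) {
        String prevGroup = null; // most recent letter group
        int prevGroupEnd = -1;   // index just past that group
        int i = 0;
        while (i < format.length()) {
            char c = format.charAt(i);
            if (c == '\'') {
                // Literal text in single quotes: skip to the closing quote.
                int close = format.indexOf('\'', i + 1);
                if (close < 0) {
                    return false; // unterminated literal
                }
                i = close + 1;
            } else if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) {
                int end = i;
                while (end < format.length() && format.charAt(end) == c) {
                    end++;
                }
                String group = format.substring(i, end);
                if (c == 'S') {
                    // Fractional seconds: 1 to 9 letters, directly after "ss" plus one of . , :
                    boolean afterSeconds = "ss".equals(prevGroup) && prevGroupEnd == i - 1
                        && ".,:".indexOf(format.charAt(i - 1)) >= 0;
                    if (afterSeconds == false || group.length() > 9) {
                        return false;
                    }
                } else if (SUPPORTED.contains(group) == false) {
                    return false;
                }
                prevGroup = group;
                prevGroupEnd = end;
                i = end;
            } else if (c == '?' || c == '\n' || c == '\r') {
                return false; // explicitly excluded characters
            } else {
                i++; // spacing or other punctuation
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid("MM/dd HH.mm.ss,SSSSSS 'in' yyyy")); // true
        System.out.println(isValid("YYYY-MM-dd"));                      // false
    }
}
```

The first call uses the example format given in the documentation. The second fails because `YYYY` is not among the listed letter groups.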

diff --git a/...ain/java/org/elasticsearch/xpack/ml/filestructurefinder/DelimitedFileStructureFinder.java b/...ain/java/org/elasticsearch/xpack/ml/filestructurefinder/DelimitedFileStructureFinder.java
@@ -8,7 +8,6 @@
 import org.elasticsearch.common.collect.Tuple;
 import org.elasticsearch.xpack.core.ml.filestructurefinder.FieldStats;
 import org.elasticsearch.xpack.core.ml.filestructurefinder.FileStructure;
-import org.elasticsearch.xpack.ml.filestructurefinder.TimestampFormatFinder.TimestampMatch;
 import org.supercsv.exception.SuperCsvException;
 import org.supercsv.io.CsvListReader;
 import org.supercsv.prefs.CsvPreference;
@@ -27,7 +26,6 @@
 import java.util.Map;
 import java.util.Random;
 import java.util.SortedMap;
-import java.util.regex.Pattern;
 import java.util.stream.Collectors;
 import java.util.stream.IntStream;
 
@@ -62,7 +60,7 @@ static DelimitedFileStructureFinder makeDelimitedFileStructureFinder(List<String
                 throw new IllegalArgumentException("[" + overriddenColumnNames.size() + "] column names were specified [" +
                     String.join(",", overriddenColumnNames) + "] but there are [" + header.length + "] columns in the sample");
             }
-            columnNames = overriddenColumnNames.toArray(new String[overriddenColumnNames.size()]);
+            columnNames = overriddenColumnNames.toArray(new String[0]);
         } else {
             // The column names are the header names but with blanks named column1, column2, etc.
             columnNames = new String[header.length];
@@ -85,11 +83,14 @@ static DelimitedFileStructureFinder makeDelimitedFileStructureFinder(List<String
                 trimFields ? row.stream().map(field -> (field == null) ? null : field.trim()).collect(Collectors.toList()) : row);
             sampleRecords.add(sampleRecord);
             sampleMessages.add(
-                sampleLines.subList(prevMessageEndLineNumber + 1, lineNumbers.get(index)).stream().collect(Collectors.joining("\n")));
+                String.join("\n", sampleLines.subList(prevMessageEndLineNumber + 1, lineNumbers.get(index))));
             prevMessageEndLineNumber = lineNumber;
         }
 
-        String preamble = Pattern.compile("\n").splitAsStream(sample).limit(lineNumbers.get(1)).collect(Collectors.joining("\n", "", "\n"));
+        String preamble = String.join("\n", sampleLines.subList(0, lineNumbers.get(1))) + "\n";
+
+        // null to allow GC before timestamp search
+        sampleLines = null;
 
         char delimiter = (char) csvPreference.getDelimiterChar();
         FileStructure.Builder structureBuilder = new FileStructure.Builder(FileStructure.Format.DELIMITED)
@@ -107,7 +108,7 @@ static DelimitedFileStructureFinder makeDelimitedFileStructureFinder(List<String
             structureBuilder.setShouldTrimFields(true);
         }
 
-        Tuple<String, TimestampMatch> timeField = FileStructureUtils.guessTimestampField(explanation, sampleRecords, overrides,
+        Tuple<String, TimestampFormatFinder> timeField = FileStructureUtils.guessTimestampField(explanation, sampleRecords, overrides,
             timeoutChecker);
         if (timeField != null) {
             String timeLineRegex = null;
@@ -119,7 +120,7 @@ static DelimitedFileStructureFinder makeDelimitedFileStructureFinder(List<String
             for (String column : Arrays.asList(columnNames).subList(0, columnNames.length - 1)) {
                 if (timeField.v1().equals(column)) {
                     builder.append("\"?");
-                    String simpleTimePattern = timeField.v2().simplePattern.pattern();
+                    String simpleTimePattern = timeField.v2().getSimplePattern().pattern();
                     builder.append(simpleTimePattern.startsWith("\\b") ? simpleTimePattern.substring(2) : simpleTimePattern);
                     timeLineRegex = builder.toString();
                     break;
@@ -145,11 +146,11 @@ static DelimitedFileStructureFinder makeDelimitedFileStructureFinder(List<String
             boolean needClientTimeZone = timeField.v2().hasTimezoneDependentParsing();
 
             structureBuilder.setTimestampField(timeField.v1())
-                .setJodaTimestampFormats(timeField.v2().jodaTimestampFormats)
-                .setJavaTimestampFormats(timeField.v2().javaTimestampFormats)
+                .setJodaTimestampFormats(timeField.v2().getJodaTimestampFormats())
+                .setJavaTimestampFormats(timeField.v2().getJavaTimestampFormats())
                 .setNeedClientTimezone(needClientTimeZone)
-                .setIngestPipeline(FileStructureUtils.makeIngestPipelineDefinition(null, timeField.v1(),
-                    timeField.v2().javaTimestampFormats, needClientTimeZone))
+                .setIngestPipeline(FileStructureUtils.makeIngestPipelineDefinition(null, Collections.emptyMap(), timeField.v1(),
+                    timeField.v2().getJavaTimestampFormats(), needClientTimeZone))
                 .setMultilineStartPattern(timeLineRegex);
         }