[ML] Improve file structure finder timestamp format determination (elastic#41948)

This change contains a major refactoring of the timestamp
format determination code used by the ML find file structure
endpoint.

Previously timestamp format determination was done separately
for each piece of text supplied to the timestamp format finder.
This had the drawback that it was not possible to distinguish
dd/MM from MM/dd in the case where both numbers were 12 or less.
To do this sensibly it is best to look across all the
available timestamps and see whether one of the numbers is greater
than 12 in any of them.  This necessitates making the timestamp
format finder an instantiable class that can accumulate evidence
over time.
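The accumulation idea can be sketched as follows. This is a hypothetical illustration, not the actual TimestampFormatFinder API: the class and method names here are invented. The point is that the ambiguous number pairs from every observed sample are collected, and any value greater than 12 settles the day/month order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: accumulate evidence across samples to decide whether
// an ambiguous two-number date is dd/MM or MM/dd. A value greater than 12 in
// either position settles the question for the whole file.
public class DayMonthOrderGuesser {

    private final List<int[]> samples = new ArrayList<>();

    // first and second are the two numbers as they appear in the text, e.g. 05/11
    public void addSample(int first, int second) {
        samples.add(new int[] { first, second });
    }

    // Returns "dd/MM", "MM/dd", or null if every sample seen so far is ambiguous.
    public String guessFormat() {
        for (int[] sample : samples) {
            if (sample[0] > 12) {
                return "dd/MM";
            }
            if (sample[1] > 12) {
                return "MM/dd";
            }
        }
        return null; // still ambiguous - the caller could fall back to the server locale
    }
}
```

A single-sample API cannot do this, which is why the finder must hold state across calls.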

Another problem with the previous approach was that it was only
possible to override the timestamp format to one of a limited
set of timestamp formats.  There was no recourse if a file to be
analysed had a timestamp format that was sane yet not in the supported
set.  This is now changed to allow any timestamp format that can
be parsed by a combination of these Java date/time formats:
yy, yyyy, M, MM, MMM, MMMM, d, dd, EEE, EEEE, H, HH, h, mm, ss,
a, XX, XXX, zzz
Additionally S letter groups (fractional seconds) are supported
provided they occur after ss and are separated from the ss by a dot,
comma or colon.  Spacing and punctuation are also permitted, with
the exception of the question mark, newline and carriage return
characters, as is literal text enclosed in single quotes.
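As a rough sketch of what that acceptance rule could look like (hypothetical code, not the shipped validation logic), the supported letter groups, the fractional-seconds rule, quoted literals, and the punctuation restriction can all be expressed as one regular expression:

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the override-format rule described above.
// Accepts only the supported letter groups, optional fractional seconds
// directly after "ss" separated by '.', ',' or ':', literal text in single
// quotes, and any spacing/punctuation except '?', newline and carriage return.
public final class OverrideFormatChecker {

    private static final Pattern VALID_OVERRIDE = Pattern.compile(
        "(?:" +
            "yyyy|yy|MMMM|MMM|MM|M|dd|d|EEEE|EEE|HH|H|h|mm|ss(?:[.,:]S{1,9})?|a|XXX|XX|zzz" +
            "|'[^']*'" +            // literal text enclosed in single quotes
            "|[^A-Za-z?\\n\\r]" +   // punctuation/spacing except ?, newline, carriage return
        ")+");

    public static boolean isValid(String format) {
        return VALID_OVERRIDE.matcher(format).matches();
    }
}
```

Note that a stray `S` group without a preceding `ss` fails to match, as does any format containing a question mark.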

The full list of changes/improvements in this refactor is:

- Make TimestampFormatFinder an instantiable class
- Overrides must be specified in Java date/time format - Joda
  format is no longer accepted
- Joda timestamp formats in outputs are now derived from the
  determined or overridden Java timestamp formats, not stored
  separately
- Functionality for determining the "best" timestamp format in
  a set of lines has been moved from TextLogFileStructureFinder
  to TimestampFormatFinder, taking advantage of the fact that
  TimestampFormatFinder is now an instantiable class with state
- The functionality to quickly rule out some possible Grok
  patterns when looking for timestamp formats has been changed
  from using simple regular expressions to the much faster
  approach of using the Shift-And method of sub-string search,
  but using an "alphabet" consisting of just 1 (representing any
  digit) and 0 (representing non-digits)
- Timestamp format overrides are now much more flexible
- Timestamp format overrides that do not correspond to a built-in
  Grok pattern are mapped to a %{CUSTOM_TIMESTAMP} Grok pattern
  whose definition is included within the date processor in the
  ingest pipeline
- Grok patterns that correspond to multiple Java date/time
  patterns are now handled better - the Grok pattern is accepted
  as matching broadly, and the required set of Java date/time
  patterns is built up considering all observed samples
- As a result of the more flexible acceptance of Grok patterns,
  when looking for the "best" timestamp in a set of lines
  timestamps are considered different if they are preceded by
  a different sequence of punctuation characters (to prevent
  timestamps far into some lines being considered similar to
  timestamps near the beginning of other lines)
- Out-of-the-box Grok patterns that are considered now include
  %{DATE} and %{DATESTAMP}, which have indeterminate day/month
  ordering
- The order of day/month in formats with indeterminate day/month
  order is determined by considering all observed samples (plus
  the server locale if the observed samples still do not suggest
  an ordering)
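The Shift-And (bitap) technique mentioned in the list above can be sketched like this. The names are invented and the real code differs; the idea shown is that both pattern and text are reduced to a two-symbol alphabet (digit vs non-digit), so a single 64-bit state word tracks every partial match at once and each text character costs only a couple of bitwise operations:

```java
// Illustrative sketch of Shift-And search over a digit/non-digit alphabet,
// used to quickly check whether text contains a substring with the same
// "digit shape" as the pattern (e.g. "1111-11-11" for an ISO date).
public final class DigitPatternSearch {

    // Returns the first index in text where the digit/non-digit shape of
    // pattern occurs, or -1 if it does not. Pattern length is capped at 64
    // so the match state fits in one long.
    public static int indexOf(String pattern, String text) {
        int m = pattern.length();
        if (m == 0 || m > 64) {
            throw new IllegalArgumentException("pattern length must be 1..64");
        }
        // One bitmask per alphabet symbol: bit i is set if pattern position i
        // is that symbol (digit or non-digit).
        long digitMask = 0;
        long otherMask = 0;
        for (int i = 0; i < m; i++) {
            if (Character.isDigit(pattern.charAt(i))) {
                digitMask |= 1L << i;
            } else {
                otherMask |= 1L << i;
            }
        }
        long state = 0; // bit i set: pattern[0..i] matches the text ending here
        long acceptBit = 1L << (m - 1);
        for (int j = 0; j < text.length(); j++) {
            long mask = Character.isDigit(text.charAt(j)) ? digitMask : otherMask;
            state = ((state << 1) | 1L) & mask;
            if ((state & acceptBit) != 0) {
                return j - m + 1;
            }
        }
        return -1;
    }
}
```

With only two symbols the per-character work is branch-light and avoids the regex engine entirely, which is what makes it attractive for ruling Grok patterns out quickly.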

Relates elastic#38086
Closes elastic#35137
Closes elastic#35132
droberts195 authored and Gurkan Kaymak committed May 27, 2019
1 parent 4e2ce09 commit 24d2fe7
Showing 13 changed files with 2,907 additions and 1,179 deletions.
157 changes: 88 additions & 69 deletions docs/reference/ml/apis/find-file-structure.asciidoc
@@ -147,57 +147,46 @@ is not compulsory to have a timestamp in the file.
--

`timestamp_format`::
(string) The time format of the timestamp field in the file. +
(string) The Java time format of the timestamp field in the file. +
+
--
NOTE: Currently there is a limitation that this format must be one that the
structure finder might choose by itself. The reason for this restriction is that
to consistently set all the fields in the response the structure finder needs a
corresponding Grok pattern name and simple regular expression for each timestamp
format. Therefore, there is little value in specifying this parameter for
structured file formats. If you know which field contains your primary timestamp,
it is as good and less error-prone to just specify `timestamp_field`.

The valuable use case for this parameter is when the format is semi-structured
NOTE: Only a subset of Java time format letter groups are supported:

* `a`
* `d`
* `dd`
* `EEE`
* `EEEE`
* `H`
* `HH`
* `h`
* `M`
* `MM`
* `MMM`
* `MMMM`
* `mm`
* `ss`
* `XX`
* `XXX`
* `yy`
* `yyyy`
* `zzz`

Additionally `S` letter groups (fractional seconds) of length one to nine are
supported provided they occur after `ss` and are separated from the `ss` by a
`.`, `,` or `:`. Spacing and punctuation are also permitted, with the exception
of `?`, newline and carriage return, as is literal text enclosed in single
quotes. For example, `MM/dd HH.mm.ss,SSSSSS 'in' yyyy` is a valid override
format.

One valuable use case for this parameter is when the format is semi-structured
text, there are multiple timestamp formats in the file, and you know which
format corresponds to the primary timestamp, but you do not want to specify the
full `grok_pattern`.

If this parameter is not specified, the structure finder chooses the best format from
the formats it knows, which are these Java time formats:

* `dd/MMM/yyyy:HH:mm:ss XX`
* `EEE MMM dd HH:mm zzz yyyy`
* `EEE MMM dd HH:mm:ss yyyy`
* `EEE MMM dd HH:mm:ss zzz yyyy`
* `EEE MMM dd yyyy HH:mm zzz`
* `EEE MMM dd yyyy HH:mm:ss zzz`
* `EEE, dd MMM yyyy HH:mm XX`
* `EEE, dd MMM yyyy HH:mm XXX`
* `EEE, dd MMM yyyy HH:mm:ss XX`
* `EEE, dd MMM yyyy HH:mm:ss XXX`
* `ISO8601`
* `MMM d HH:mm:ss`
* `MMM d HH:mm:ss,SSS`
* `MMM d yyyy HH:mm:ss`
* `MMM dd HH:mm:ss`
* `MMM dd HH:mm:ss,SSS`
* `MMM dd yyyy HH:mm:ss`
* `MMM dd, yyyy h:mm:ss a`
* `TAI64N`
* `UNIX`
* `UNIX_MS`
* `yyyy-MM-dd HH:mm:ss`
* `yyyy-MM-dd HH:mm:ss,SSS`
* `yyyy-MM-dd HH:mm:ss,SSS XX`
* `yyyy-MM-dd HH:mm:ss,SSSXX`
* `yyyy-MM-dd HH:mm:ss,SSSXXX`
* `yyyy-MM-dd HH:mm:ssXX`
* `yyyy-MM-dd HH:mm:ssXXX`
* `yyyy-MM-dd'T'HH:mm:ss,SSS`
* `yyyy-MM-dd'T'HH:mm:ss,SSSXX`
* `yyyy-MM-dd'T'HH:mm:ss,SSSXXX`
* `yyyyMMddHHmmss`
full `grok_pattern`. Another is when the timestamp format is one that the
structure finder does not consider by default.

If this parameter is not specified, the structure finder chooses the best
format from a built-in set.

--

@@ -263,8 +252,18 @@ If the request does not encounter errors, you receive the following result:
"charset" : "UTF-8", <4>
"has_byte_order_marker" : false, <5>
"format" : "ndjson", <6>
"need_client_timezone" : false, <7>
"mappings" : { <8>
"timestamp_field" : "release_date", <7>
"joda_timestamp_formats" : [ <8>
"ISO8601"
],
"java_timestamp_formats" : [ <9>
"ISO8601"
],
"need_client_timezone" : true, <10>
"mappings" : { <11>
"@timestamp" : {
"type" : "date"
},
"author" : {
"type" : "keyword"
},
@@ -275,10 +274,25 @@ If the request does not encounter errors, you receive the following result:
"type" : "long"
},
"release_date" : {
"type" : "keyword"
"type" : "date",
"format" : "iso8601"
}
},
"field_stats" : { <9>
"ingest_pipeline" : {
"description" : "Ingest pipeline created by file structure finder",
"processors" : [
{
"date" : {
"field" : "release_date",
"timezone" : "{{ beat.timezone }}",
"formats" : [
"ISO8601"
]
}
}
]
},
"field_stats" : { <12>
"author" : {
"count" : 24,
"cardinality" : 20,
@@ -484,17 +498,22 @@ If the request does not encounter errors, you receive the following result:
<5> For UTF character encodings, `has_byte_order_marker` indicates whether the
file begins with a byte order marker.
<6> `format` is one of `ndjson`, `xml`, `delimited` or `semi_structured_text`.
<7> If a timestamp format is detected that does not include a timezone,
`need_client_timezone` will be `true`. The server that parses the file must
therefore be told the correct timezone by the client.
<8> `mappings` contains some suitable mappings for an index into which the data
could be ingested. In this case, the `release_date` field has been given a
`keyword` type as it is not considered specific enough to convert to the
`date` type.
<9> `field_stats` contains the most common values of each field, plus basic
numeric statistics for the numeric `page_count` field. This information
may provide clues that the data needs to be cleaned or transformed prior
to use by other {ml} functionality.
<7> The `timestamp_field` names the field considered most likely to be the
primary timestamp of each document.
<8> `joda_timestamp_formats` are used to tell Logstash how to parse timestamps.
<9> `java_timestamp_formats` are the Java time formats recognized in the time
fields. Elasticsearch mappings and Ingest pipeline use this format.
<10> If a timestamp format is detected that does not include a timezone,
`need_client_timezone` will be `true`. The server that parses the file must
therefore be told the correct timezone by the client.
<11> `mappings` contains some suitable mappings for an index into which the data
could be ingested. In this case, the `release_date` field has been given a
`keyword` type as it is not considered specific enough to convert to the
`date` type.
<12> `field_stats` contains the most common values of each field, plus basic
numeric statistics for the numeric `page_count` field. This information
may provide clues that the data needs to be cleaned or transformed prior
to use by other {ml} functionality.

The next example shows how it's possible to find the structure of some New York
City yellow cab trip data. The first `curl` command downloads the data, the
@@ -526,7 +545,7 @@ If the request does not encounter errors, you receive the following result:
"charset" : "UTF-8",
"has_byte_order_marker" : false,
"format" : "delimited", <2>
"multiline_start_pattern" : "^.*?,\"?\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}",
"multiline_start_pattern" : "^.*?,\"?\\d{4}-\\d{2}-\\d{2}[T ]\\d{2}:\\d{2}",
"exclude_lines_pattern" : "^\"?VendorID\"?,\"?tpep_pickup_datetime\"?,\"?tpep_dropoff_datetime\"?,\"?passenger_count\"?,\"?trip_distance\"?,\"?RatecodeID\"?,\"?store_and_fwd_flag\"?,\"?PULocationID\"?,\"?DOLocationID\"?,\"?payment_type\"?,\"?fare_amount\"?,\"?extra\"?,\"?mta_tax\"?,\"?tip_amount\"?,\"?tolls_amount\"?,\"?improvement_surcharge\"?,\"?total_amount\"?",
"column_names" : [ <3>
"VendorID",
@@ -1361,14 +1380,14 @@ this:
"charset" : "UTF-8",
"has_byte_order_marker" : false,
"format" : "semi_structured_text", <1>
"multiline_start_pattern" : "^\\[\\b\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2},\\d{3}", <2>
"multiline_start_pattern" : "^\\[\\b\\d{4}-\\d{2}-\\d{2}[T ]\\d{2}:\\d{2}", <2>
"grok_pattern" : "\\[%{TIMESTAMP_ISO8601:timestamp}\\]\\[%{LOGLEVEL:loglevel}.*", <3>
"timestamp_field" : "timestamp",
"joda_timestamp_formats" : [
"ISO8601"
],
"java_timestamp_formats" : [
"yyyy-MM-dd'T'HH:mm:ss,SSS"
"ISO8601"
],
"need_client_timezone" : true,
"mappings" : {
@@ -1398,7 +1417,7 @@ this:
"field" : "timestamp",
"timezone" : "{{ beat.timezone }}",
"formats" : [
"yyyy-MM-dd'T'HH:mm:ss,SSS"
"ISO8601"
]
}
},
@@ -1515,14 +1534,14 @@ this:
"charset" : "UTF-8",
"has_byte_order_marker" : false,
"format" : "semi_structured_text",
"multiline_start_pattern" : "^\\[\\b\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2},\\d{3}",
"multiline_start_pattern" : "^\\[\\b\\d{4}-\\d{2}-\\d{2}[T ]\\d{2}:\\d{2}",
"grok_pattern" : "\\[%{TIMESTAMP_ISO8601:timestamp}\\]\\[%{LOGLEVEL:loglevel} *\\]\\[%{JAVACLASS:class} *\\] \\[%{HOSTNAME:node}\\] %{JAVALOGMESSAGE:message}", <1>
"timestamp_field" : "timestamp",
"joda_timestamp_formats" : [
"ISO8601"
],
"java_timestamp_formats" : [
"yyyy-MM-dd'T'HH:mm:ss,SSS"
"ISO8601"
],
"need_client_timezone" : true,
"mappings" : {
@@ -1558,7 +1577,7 @@ this:
"field" : "timestamp",
"timezone" : "{{ beat.timezone }}",
"formats" : [
"yyyy-MM-dd'T'HH:mm:ss,SSS"
"ISO8601"
]
}
},
@@ -8,7 +8,6 @@
import org.elasticsearch.common.collect.Tuple;
import org.elasticsearch.xpack.core.ml.filestructurefinder.FieldStats;
import org.elasticsearch.xpack.core.ml.filestructurefinder.FileStructure;
import org.elasticsearch.xpack.ml.filestructurefinder.TimestampFormatFinder.TimestampMatch;
import org.supercsv.exception.SuperCsvException;
import org.supercsv.io.CsvListReader;
import org.supercsv.prefs.CsvPreference;
@@ -27,7 +26,6 @@
import java.util.Map;
import java.util.Random;
import java.util.SortedMap;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

@@ -62,7 +60,7 @@ static DelimitedFileStructureFinder makeDelimitedFileStructureFinder(List<String
throw new IllegalArgumentException("[" + overriddenColumnNames.size() + "] column names were specified [" +
String.join(",", overriddenColumnNames) + "] but there are [" + header.length + "] columns in the sample");
}
columnNames = overriddenColumnNames.toArray(new String[overriddenColumnNames.size()]);
columnNames = overriddenColumnNames.toArray(new String[0]);
} else {
// The column names are the header names but with blanks named column1, column2, etc.
columnNames = new String[header.length];
@@ -85,11 +83,14 @@ static DelimitedFileStructureFinder makeDelimitedFileStructureFinder(List<String
trimFields ? row.stream().map(field -> (field == null) ? null : field.trim()).collect(Collectors.toList()) : row);
sampleRecords.add(sampleRecord);
sampleMessages.add(
sampleLines.subList(prevMessageEndLineNumber + 1, lineNumbers.get(index)).stream().collect(Collectors.joining("\n")));
String.join("\n", sampleLines.subList(prevMessageEndLineNumber + 1, lineNumbers.get(index))));
prevMessageEndLineNumber = lineNumber;
}

String preamble = Pattern.compile("\n").splitAsStream(sample).limit(lineNumbers.get(1)).collect(Collectors.joining("\n", "", "\n"));
String preamble = String.join("\n", sampleLines.subList(0, lineNumbers.get(1))) + "\n";

// null to allow GC before timestamp search
sampleLines = null;

char delimiter = (char) csvPreference.getDelimiterChar();
FileStructure.Builder structureBuilder = new FileStructure.Builder(FileStructure.Format.DELIMITED)
@@ -107,7 +108,7 @@ static DelimitedFileStructureFinder makeDelimitedFileStructureFinder(List<String
structureBuilder.setShouldTrimFields(true);
}

Tuple<String, TimestampMatch> timeField = FileStructureUtils.guessTimestampField(explanation, sampleRecords, overrides,
Tuple<String, TimestampFormatFinder> timeField = FileStructureUtils.guessTimestampField(explanation, sampleRecords, overrides,
timeoutChecker);
if (timeField != null) {
String timeLineRegex = null;
@@ -119,7 +120,7 @@ static DelimitedFileStructureFinder makeDelimitedFileStructureFinder(List<String
for (String column : Arrays.asList(columnNames).subList(0, columnNames.length - 1)) {
if (timeField.v1().equals(column)) {
builder.append("\"?");
String simpleTimePattern = timeField.v2().simplePattern.pattern();
String simpleTimePattern = timeField.v2().getSimplePattern().pattern();
builder.append(simpleTimePattern.startsWith("\\b") ? simpleTimePattern.substring(2) : simpleTimePattern);
timeLineRegex = builder.toString();
break;
@@ -145,11 +146,11 @@ static DelimitedFileStructureFinder makeDelimitedFileStructureFinder(List<String
boolean needClientTimeZone = timeField.v2().hasTimezoneDependentParsing();

structureBuilder.setTimestampField(timeField.v1())
.setJodaTimestampFormats(timeField.v2().jodaTimestampFormats)
.setJavaTimestampFormats(timeField.v2().javaTimestampFormats)
.setJodaTimestampFormats(timeField.v2().getJodaTimestampFormats())
.setJavaTimestampFormats(timeField.v2().getJavaTimestampFormats())
.setNeedClientTimezone(needClientTimeZone)
.setIngestPipeline(FileStructureUtils.makeIngestPipelineDefinition(null, timeField.v1(),
timeField.v2().javaTimestampFormats, needClientTimeZone))
.setIngestPipeline(FileStructureUtils.makeIngestPipelineDefinition(null, Collections.emptyMap(), timeField.v1(),
timeField.v2().getJavaTimestampFormats(), needClientTimeZone))
.setMultilineStartPattern(timeLineRegex);
}
