[ML] Improve file structure finder timestamp format determination #41948

Merged (17 commits) on May 23, 2019

droberts195
Contributor

This change contains a major refactoring of the timestamp
format determination code used by the ML find file structure
endpoint.

Previously timestamp format determination was done separately
for each piece of text supplied to the timestamp format finder.
This had the drawback that it was not possible to distinguish
dd/MM and MM/dd in the case where both numbers were 12 or less.
In order to do this sensibly it is best to look across all the
available timestamps and see if one of the numbers is greater
than 12 in any of them. This necessitates making the timestamp
format finder an instantiable class that can accumulate evidence
over time.
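
To illustrate the idea of accumulating evidence, here is a minimal sketch. It is not the code in this PR: the class `DayMonthOrderGuesser`, its method names and the simple slash-separated regex are hypothetical and chosen only for the example. It records whether either number ever exceeds 12 across all samples and only then decides on an ordering.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: accumulates evidence about day/month ordering across samples. */
class DayMonthOrderGuesser {

    // Two 1-2 digit numbers separated by a slash, e.g. "05/13" (illustrative only).
    private static final Pattern TWO_NUMBERS = Pattern.compile("\\b(\\d{1,2})/(\\d{1,2})\\b");

    private boolean firstNumberOver12;
    private boolean secondNumberOver12;

    /** Record the evidence contained in one sample; samples without a match are ignored. */
    void addSample(String text) {
        Matcher m = TWO_NUMBERS.matcher(text);
        if (m.find()) {
            firstNumberOver12 |= Integer.parseInt(m.group(1)) > 12;
            secondNumberOver12 |= Integer.parseInt(m.group(2)) > 12;
        }
    }

    /** @return TRUE for dd/MM, FALSE for MM/dd, null if undetermined or contradictory. */
    Boolean guessIsDayFirst() {
        if (firstNumberOver12 == secondNumberOver12) {
            return null;
        }
        return firstNumberOver12;
    }
}
```

A single sample such as `05/06/2019` leaves the answer undetermined; adding `23/06/2019` settles it as day-first. A caller would need a fallback for the undetermined case.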

Another problem with the previous approach was that it was only
possible to override the timestamp format to one of a limited
set of timestamp formats. There was no way out if a file to be
analysed had a timestamp that was sane yet not in the supported
set. This is now changed to allow any timestamp format that can
be parsed by a combination of these Java date/time formats:
yy, yyyy, M, MM, MMM, MMMM, d, dd, EEE, EEEE, H, HH, h, mm, ss,
a, XX, XXX, zzz
Additionally, S letter groups (fractional seconds) are supported
provided they occur after ss and are separated from the ss by a
dot, comma or colon. Spacing and punctuation are also permitted
(except for the question mark, newline and carriage return
characters), as is literal text enclosed in single quotes.
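
As a rough illustration of what such overrides look like, the sketch below parses two sample timestamps with `java.time.format.DateTimeFormatter`, using patterns built only from the letter groups listed above. The class `OverrideFormatExample`, its helper methods and the sample patterns are hypothetical; this does not show how the structure finder itself validates an override.

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical sketch: parsing with patterns made up of the supported letter groups. */
class OverrideFormatExample {

    /** Parse text that includes a zone offset, e.g. pattern "dd/MMM/yyyy:HH:mm:ss XX". */
    static OffsetDateTime parseWithOffset(String pattern, String text) {
        return OffsetDateTime.parse(text, DateTimeFormatter.ofPattern(pattern, Locale.ENGLISH));
    }

    /** Parse text without zone information, e.g. pattern "yyyy-MM-dd HH:mm:ss,SSS". */
    static LocalDateTime parseLocal(String pattern, String text) {
        return LocalDateTime.parse(text, DateTimeFormatter.ofPattern(pattern, Locale.ENGLISH));
    }
}
```

For example, `parseWithOffset("dd/MMM/yyyy:HH:mm:ss XX", "23/May/2019:20:06:01 +0100")` and `parseLocal("yyyy-MM-dd HH:mm:ss,SSS", "2019-05-23 20:06:01,123")` both parse successfully; the second shows fractional seconds separated from ss by a comma.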

The full list of changes/improvements in this refactor is:

  • Make TimestampFormatFinder an instantiable class
  • Overrides must be specified in Java date/time format - Joda
    format is no longer accepted
  • Joda timestamp formats in outputs are now derived from the
    determined or overridden Java timestamp formats, not stored
    separately
  • Functionality for determining the "best" timestamp format in
    a set of lines has been moved from TextLogFileStructureFinder
    to TimestampFormatFinder, taking advantage of the fact that
    TimestampFormatFinder is now an instantiable class with state
  • The functionality to quickly rule out some possible Grok
    patterns when looking for timestamp formats has been changed
    from using simple regular expressions to the much faster
    approach of using the Shift-And method of sub-string search,
    but using an "alphabet" consisting of just 1 (representing any
    digit) and 0 (representing non-digits)
  • Timestamp format overrides are now much more flexible
  • Timestamp format overrides that do not correspond to a built-in
    Grok pattern are mapped to a %{CUSTOM_TIMESTAMP} Grok pattern
    whose definition is included within the date processor in the
    ingest pipeline
  • Grok patterns that correspond to multiple Java date/time
    patterns are now handled better - the Grok pattern is accepted
    as matching broadly, and the required set of Java date/time
    patterns is built up considering all observed samples
  • As a result of the more flexible acceptance of Grok patterns,
    when looking for the "best" timestamp in a set of lines
    timestamps are considered different if they are preceded by
    a different sequence of punctuation characters (to prevent
    timestamps far into some lines being considered similar to
    timestamps near the beginning of other lines)
  • Out-of-the-box Grok patterns that are considered now include
    %{DATE} and %{DATESTAMP}, which have indeterminate day/month
    ordering
  • The order of day/month in formats with indeterminate day/month
    order is determined by considering all observed samples (plus
    the server locale if the observed samples still do not suggest
    an ordering)
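
The Shift-And quick rule-out mentioned in the list above can be sketched as follows. This is a simplified illustration under my own assumptions, not the implementation in this PR: the class `DigitShapeSearch`, the method `indexOfShape` and the encoding of a shape as a string of '1' and '0' characters are hypothetical. Each text character is reduced to "digit" or "non-digit", and a single `long` tracks every partial match in parallel.

```java
/** Hypothetical sketch of Shift-And search over a two-symbol alphabet (digit / non-digit). */
class DigitShapeSearch {

    /**
     * @param text  the text to search
     * @param shape the shape to find, written with '1' for any digit and '0' for any non-digit
     * @return the index in text where the first match of shape starts, or -1 if there is none
     */
    static int indexOfShape(String text, String shape) {
        int m = shape.length();
        if (m == 0 || m > 64) {
            throw new IllegalArgumentException("shape length must be between 1 and 64");
        }
        // One mask per alphabet symbol: bit i is set if shape position i accepts that symbol.
        long digitMask = 0L;
        long nonDigitMask = 0L;
        for (int i = 0; i < m; ++i) {
            if (shape.charAt(i) == '1') {
                digitMask |= 1L << i;
            } else if (shape.charAt(i) == '0') {
                nonDigitMask |= 1L << i;
            } else {
                throw new IllegalArgumentException("shape may only contain '0' and '1'");
            }
        }
        long acceptBit = 1L << (m - 1);
        // Bit i of state is set when shape[0..i] matches the text ending at the current position.
        long state = 0L;
        for (int pos = 0; pos < text.length(); ++pos) {
            char c = text.charAt(pos);
            long mask = (c >= '0' && c <= '9') ? digitMask : nonDigitMask;
            state = ((state << 1) | 1L) & mask;
            if ((state & acceptBit) != 0) {
                return pos - m + 1;
            }
        }
        return -1;
    }
}
```

With the shape `1111011011` (the digit layout of a yyyy-MM-dd date), `indexOfShape("INFO 2019-05-23 20:06:01 started", "1111011011")` returns 5, while a text with no such run of digits returns -1, so the more expensive Grok pattern need not be tried on it.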

Relates #38086
Closes #35137
Closes #35132

@elasticmachine
Collaborator

Pinging @elastic/ml-core

Previously if a timestamp format was not quickly ruled
out then we would search for it in the whole sample.
Following this change the quick-rule-out patterns are
used not only to completely rule out some formats but
also to find the portion of the sample over which the
format could possibly match.  This helps a lot in the
case of long lines that contain sections that nearly
match one of our candidate timestamps but not quite
(because regular expression matching is slowest in the
case of patterns that nearly match).
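
A minimal sketch of the idea in this commit, under my own assumptions: a cheap digit check picks out the positions where a timestamp could start, and the regular expression is then applied only to that region via `Matcher.region`. The class `RegionLimitedMatcher`, the helper `couldStartAt` and the single ISO-like pattern are hypothetical stand-ins for the real quick-rule-out machinery.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: run the expensive regex only where a cheap check says it could match. */
class RegionLimitedMatcher {

    private static final Pattern TIMESTAMP =
        Pattern.compile("\\d{4}-\\d{2}-\\d{2}[T ]\\d{2}:\\d{2}:\\d{2}");
    private static final int TIMESTAMP_LENGTH = 19;

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Cheap pre-check: exactly four digits followed by a non-digit, starting at pos. */
    static boolean couldStartAt(String text, int pos) {
        return pos + 4 < text.length()
            && isDigit(text.charAt(pos))
            && isDigit(text.charAt(pos + 1))
            && isDigit(text.charAt(pos + 2))
            && isDigit(text.charAt(pos + 3))
            && isDigit(text.charAt(pos + 4)) == false;
    }

    /** @return the first timestamp found in text, or null if there is none. */
    static String findTimestamp(String text) {
        Matcher matcher = TIMESTAMP.matcher(text);
        for (int pos = 0; pos + TIMESTAMP_LENGTH <= text.length(); ++pos) {
            if (couldStartAt(text, pos)) {
                // Restrict the regex to the only region where it could possibly match.
                matcher.region(pos, pos + TIMESTAMP_LENGTH);
                if (matcher.lookingAt()) {
                    return matcher.group();
                }
            }
        }
        return null;
    }
}
```

On a line such as `id=12345678 at 2019-05-23T20:06:01 ok` the regex is attempted at only two positions rather than scanned across the whole line.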
Member

@benwtrent left a comment

Some minor things on the first read through.

This is a ton to grok (pun intended; in fact, I think I used this same pun the last time a huge PR was made for the fsf...).

Member

@benwtrent left a comment

Will definitely need a second set of 👀 :)

Contributor

@dimitris-athanasiou left a comment

Left a few comments. There are some cool but complex ideas in here!

* `yyyy-MM-dd'T'HH:mm:ss,SSSXX`
* `yyyy-MM-dd'T'HH:mm:ss,SSSXXX`
* `yyyyMMddHHmmss`
full `grok_pattern`. Another is where the timestamp format is one that the
Contributor

nit: change "where" to "when" to match the previous sentence?

@@ -263,8 +252,18 @@ If the request does not encounter errors, you receive the following result:
"charset" : "UTF-8", <4>
"has_byte_order_marker" : false, <5>
"format" : "ndjson", <6>
"need_client_timezone" : false, <7>
"timestamp_field" : "release_date",
Contributor

These are not tagged with explanations like other fields around them. Is that on purpose?

structure finder does not consider by default.

If this parameter is not specified, the structure finder chooses the best
format from a built in set.
Contributor

nit: should it be built-in?

TimestampMatch timestampMatch = TimestampFormatFinder.findFirstFullMatch(values.iterator().next(), timeoutChecker);
if (timestampMatch != null) {
fullMappingType = timestampMatch.getEsDateMappingTypeWithFormat();
TimestampFormatFinder timestampFormatFinder = new TimestampFormatFinder(explanation, true, true, true, timeoutChecker);
Contributor

This for loop, all the way to the call to timestampFormatFinder.getEsDateMappingTypeWithFormat(), is repeated in FileStructureUtils around line 277. I wonder if we could refactor that into a method.

return findFirstMatch(text, 0, timeoutChecker);
public TimestampFormatFinder(List<String> explanation, String overrideFormat, boolean requireFullMatch, boolean errorOnNoTimestamp,
boolean errorOnMultiplePatterns, TimeoutChecker timeoutChecker) {
this.explanation = explanation;
Contributor

add Objects.requireNonNull or @Nullable where suitable

for (Integer quickRuleOutIndex : candidate.quickRuleOutIndices) {
if (quickRuleoutMatches[quickRuleOutIndex] == null) {
quickRuleoutMatches[quickRuleOutIndex] = QUICK_RULE_OUT_PATTERNS.get(quickRuleOutIndex).matcher(text).find();
private static TimestampMatch checkCandidate(CandidateTimestampFormat candidate, String text, BitSet numberPosBitSet,
Contributor

add @Nullable where suitable? Especially in methods with optional arguments it helps a lot when reading.

return;
}

int remainingMatches = matches.size();
Contributor

This block seems a nice candidate to extract into a method that returns the weights.


timeoutChecker.check("timestamp format determination");

double highestWeight = 0.0;
Contributor

This could be a Tuple<Integer, Double> findHighestWeight(double[] weights) method.

timeoutChecker.check("timestamp format determination");

// Is the selected format not already at the beginning of the list?
if (highestWeightFormatIndex > 0) {
Contributor

And finally, this could be selectHighestWeightFormat.

The above would make the method roughly read like:

calcMatchWeights();
findHighestWeight();
selectHighestWeightFormat();

Take all of these as mere suggestions. I know in the end it's down to personal preference, but I find that breaking down methods like this increases readability a lot, as it prepares the reader to understand what the lower-level code is trying to achieve.
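
A rough sketch of what the last two steps of that breakdown might look like. It is purely illustrative: the class `HighestWeightSelection`, the method signatures and the use of a plain `double[]` and `List` are assumptions, and the weight calculation itself is left out because it depends on details of the real code.

```java
import java.util.List;

/** Hypothetical sketch of the suggested method breakdown (weight calculation omitted). */
class HighestWeightSelection {

    /** @return the index of the largest weight (first one wins ties), or -1 if there are none. */
    static int findHighestWeightIndex(double[] weights) {
        int best = -1;
        for (int i = 0; i < weights.length; ++i) {
            if (best == -1 || weights[i] > weights[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Move the selected format to the front of the list unless it is already there. */
    static <T> void selectHighestWeightFormat(List<T> formats, int highestWeightFormatIndex) {
        if (highestWeightFormatIndex > 0) {
            formats.add(0, formats.remove(highestWeightFormatIndex));
        }
    }
}
```

The calling method then reads as a short sequence of named steps, which is the readability gain being suggested.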


if (onlyConsiderFormat == null || onlyConsiderFormat.canMergeWith(match.timestampFormat)) {

if (match.firstIndeterminateDateNumber > 0) {
Contributor

What does the zero check represent here?

Contributor

@dimitris-athanasiou left a comment

LGTM. Just pointed out a typo in a code comment.

@@ -718,7 +793,11 @@ Boolean guessIsDayFirstFromMatches(TimestampFormat onlyConsiderFormat) {

if (onlyConsiderFormat == null || onlyConsiderFormat.canMergeWith(match.timestampFormat)) {

// Valid indetermine day/month numbers will be in the range 1 to 31.
Contributor

typo: indeterminate; it is a confusing one to type :-)

Contributor Author

Gah, I fixed the exact same typo in another place in another recent commit

@droberts195
Contributor Author

Jenkins run elasticsearch-ci/1

@droberts195 droberts195 merged commit a15f1ee into elastic:master May 23, 2019
@droberts195 droberts195 deleted the timestamp_finder_improvements branch May 23, 2019 20:06
droberts195 added a commit that referenced this pull request May 24, 2019
…1948)

gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
…astic#41948)