[ML] Improve file structure finder timestamp format determination #41948

Merged (17 commits) on May 23, 2019

Commits on May 8, 2019

  1. [ML] Improve file structure finder timestamp format determination

    This change contains a major refactoring of the timestamp
    format determination code used by the ML find file structure
    endpoint.
    
    Previously timestamp format determination was done separately
    for each piece of text supplied to the timestamp format finder.
    This had the drawback that it was not possible to distinguish
    dd/MM and MM/dd in the case where both numbers were 12 or less.
    In order to do this sensibly it is best to look across all the
    available timestamps and see if one of the numbers is greater
    than 12 in any of them.  This necessitates making the timestamp
    format finder an instantiable class that can accumulate evidence
    over time.
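
The idea of accumulating evidence across samples can be sketched as follows. This is only an illustration, not the code in this change: the class `DayMonthOrderEvidence`, its methods and the simple `n/n` regular expression are all hypothetical names chosen for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: remembers, across many samples, whether the first or
// second number of an ambiguous "n/n" date was ever greater than 12.
final class DayMonthOrderEvidence {

    enum Order { DAY_FIRST, MONTH_FIRST, UNDETERMINED }

    // Assumption for the example: the ambiguous part looks like 1-2 digits, '/', 1-2 digits.
    private static final Pattern TWO_NUMBERS = Pattern.compile("\\b(\\d{1,2})/(\\d{1,2})\\b");

    private boolean firstExceeds12;
    private boolean secondExceeds12;

    // Record the evidence contained in one piece of text.
    void addSample(String text) {
        Matcher matcher = TWO_NUMBERS.matcher(text);
        if (matcher.find()) {
            firstExceeds12 |= Integer.parseInt(matcher.group(1)) > 12;
            secondExceeds12 |= Integer.parseInt(matcher.group(2)) > 12;
        }
    }

    // A number greater than 12 cannot be a month, so it must be the day.
    Order guess() {
        if (firstExceeds12 && !secondExceeds12) {
            return Order.DAY_FIRST;
        }
        if (secondExceeds12 && !firstExceeds12) {
            return Order.MONTH_FIRST;
        }
        return Order.UNDETERMINED;
    }
}
```

With only `05/08/2019` seen, the sketch stays undetermined; once `23/08/2019` is added it settles on day first. An undetermined result is where some other tie-breaker would be needed.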
    
    Another problem with the previous approach was that it was only
    possible to override the timestamp format to one of a limited
    set of timestamp formats.  There was no way out if a file to be
    analysed had a timestamp that was sane yet not in the supported
    set.  This is now changed to allow any timestamp format that can
    be parsed by a combination of these Java date/time formats:
    yy, yyyy, M, MM, MMM, MMMM, d, dd, EEE, EEEE, H, HH, h, mm, ss,
    a, XX, XXX, zzz
    Additionally, S letter groups (fractional seconds) are supported
    provided they occur after ss and are separated from the ss by a dot,
    comma or colon.  Spacing and punctuation are also permitted, with
    the exception of the question mark, newline and carriage return
    characters, as is literal text enclosed in single quotes.
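
As an illustration of a format built only from the letter groups listed above, the following sketch parses a sample with the standard `java.time` API. The class name `OverrideFormatExample` and the sample values are assumptions made for the example; it does not show how the endpoint itself validates an override.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper: parses text using a Java date/time format string.
final class OverrideFormatExample {

    static OffsetDateTime parse(String format, String text) {
        // Locale.ROOT keeps the behaviour independent of the JVM default locale.
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(format, Locale.ROOT);
        return OffsetDateTime.parse(text, formatter);
    }
}
```

For instance, `OverrideFormatExample.parse("yyyy-MM-dd HH:mm:ss,SSS XX", "2019-05-08 13:45:10,123 +0100")` uses yyyy, MM, dd, HH, mm, ss, an S group separated from ss by a comma, and XX for the offset.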
    
    The full list of changes/improvements in this refactor is:
    
    - Make TimestampFormatFinder an instantiable class
    - Overrides must be specified in Java date/time format - Joda
      format is no longer accepted
    - Joda timestamp formats in outputs are now derived from the
      determined or overridden Java timestamp formats, not stored
      separately
    - Functionality for determining the "best" timestamp format in
      a set of lines has been moved from TextLogFileStructureFinder
      to TimestampFormatFinder, taking advantage of the fact that
      TimestampFormatFinder is now an instantiable class with state
    - The functionality to quickly rule out some possible Grok
      patterns when looking for timestamp formats has been changed
      from using simple regular expressions to the much faster
      approach of using the Shift-And method of sub-string search,
      but using an "alphabet" consisting of just 1 (representing any
      digit) and 0 (representing non-digits)
    - Timestamp format overrides are now much more flexible
    - Timestamp format overrides that do not correspond to a built-in
      Grok pattern are mapped to a %{CUSTOM_TIMESTAMP} Grok pattern
      whose definition is included within the date processor in the
      ingest pipeline
    - Grok patterns that correspond to multiple Java date/time
      patterns are now handled better - the Grok pattern is accepted
      as matching broadly, and the required set of Java date/time
      patterns is built up considering all observed samples
    - As a result of the more flexible acceptance of Grok patterns,
      when looking for the "best" timestamp in a set of lines
      timestamps are considered different if they are preceded by
      a different sequence of punctuation characters (to prevent
      timestamps far into some lines being considered similar to
      timestamps near the beginning of other lines)
    - Out-of-the-box Grok patterns that are considered now include
      %{DATE} and %{DATESTAMP}, which have indeterminate day/month
      ordering
    - The order of day/month in formats with indeterminate day/month
      order is determined by considering all observed samples (plus
      the server locale if the observed samples still do not suggest
      an ordering)
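
The Shift-And quick rule-out mentioned in the list above can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in this change: the class `DigitShiftAnd` and its method names are hypothetical, and the pattern is limited to 64 symbols so that the state fits in one `long`.

```java
// Hypothetical sketch of Shift-And sub-string search over a two-symbol
// alphabet: '1' stands for any digit and '0' for any non-digit.
final class DigitShiftAnd {

    private final long[] masks = new long[2]; // masks[symbol] has bit i set if pattern[i] == symbol
    private final long acceptBit;             // bit that is set when the whole pattern has matched
    private final int length;

    DigitShiftAnd(String pattern) {
        if (pattern.isEmpty() || pattern.length() > 64) {
            throw new IllegalArgumentException("pattern must have 1 to 64 symbols");
        }
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c != '0' && c != '1') {
                throw new IllegalArgumentException("pattern may only contain 0 and 1");
            }
            masks[c - '0'] |= 1L << i;
        }
        length = pattern.length();
        acceptBit = 1L << (length - 1);
    }

    // Returns the start index of the first place the digit shape occurs, or -1.
    int indexIn(CharSequence text) {
        long state = 0;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            int symbol = (ch >= '0' && ch <= '9') ? 1 : 0;
            // Extend every partial match by one symbol and start a new one.
            state = ((state << 1) | 1L) & masks[symbol];
            if ((state & acceptBit) != 0) {
                return i - length + 1;
            }
        }
        return -1;
    }
}
```

A shape such as `1111011011` (four digits, a non-digit, two digits, a non-digit, two digits) can only occur where something like `2019-05-08` could be, so a result of -1 rules the candidate format out without running a regular expression.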
    
    Relates elastic#38086
    Closes elastic#35137
    Closes elastic#35132
    droberts195 committed May 8, 2019
    Commit 2a19461

Commits on May 9, 2019

  1. Improve timestamp format quick-rule-out functionality

    Previously if a timestamp format was not quickly ruled
    out then we would search for it in the whole sample.
    Following this change the quick-rule-out patterns are
    used not only to completely rule out some formats but
    also to find the portion of the sample over which the
    format could possibly match.  This helps a lot in the
    case of long lines that contain sections that nearly
    match one of our candidate timestamps but not quite
    (because regular expression matching is slowest in the
    case of patterns that nearly match).
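
One way to picture the narrowing step is the sketch below. It is an assumption-laden illustration rather than the code in this change: the class `NarrowedSearch`, its methods and the naive scan used to find candidate positions are all hypothetical. It locates the region of the text whose digit/non-digit shape could match, then restricts the regular expression to that region with `Matcher.region`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: use a cheap digit-shape check to limit where the
// expensive regular expression is allowed to search.
final class NarrowedSearch {

    // Returns {start of first candidate, end of last candidate}, or null if none.
    // In the shape, '1' means "any digit" and '0' means "any non-digit".
    static int[] candidateSpan(String text, String shape) {
        int first = -1;
        int lastEnd = -1;
        for (int start = 0; start + shape.length() <= text.length(); start++) {
            if (shapeMatchesAt(text, start, shape)) {
                if (first < 0) {
                    first = start;
                }
                lastEnd = start + shape.length();
            }
        }
        return first < 0 ? null : new int[] { first, lastEnd };
    }

    private static boolean shapeMatchesAt(String text, int start, String shape) {
        for (int i = 0; i < shape.length(); i++) {
            boolean isDigit = Character.isDigit(text.charAt(start + i));
            if (isDigit != (shape.charAt(i) == '1')) {
                return false;
            }
        }
        return true;
    }

    // Runs the regex only over the candidate span; returns the match or null.
    static String findTimestamp(String text, String shape, Pattern regex) {
        int[] span = candidateSpan(text, shape);
        if (span == null) {
            return null; // completely ruled out, no regex work at all
        }
        Matcher matcher = regex.matcher(text).region(span[0], span[1]);
        return matcher.find() ? matcher.group() : null;
    }
}
```

On a long line, the regular expression then never sees the near-miss sections outside the span, which is where the saving described above would come from.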
    droberts195 committed May 9, 2019
    Commit 2a862fd
  2. Commit 7692d23

Commits on May 10, 2019

  1. Commit d15182e
  2. Commit 43fa94d
  3. Commit 0db4167

Commits on May 11, 2019

  1. Commit 6951d88

Commits on May 13, 2019

  1. Commit d08c8be
  2. Bring secondary timestamps in Grok pattern creator in line with primary timestamps in timestamp format finder

    droberts195 committed May 13, 2019
    Commit baf107c

Commits on May 15, 2019

  1. Commit b3fe912

Commits on May 22, 2019

  1. Fix a couple of format quirks

    1. Even though %{TIMESTAMP_ISO8601} cannot parse an ISO8601 date
       with no time, the ISO8601 date format can
    2. The %{DATE} and %{DATESTAMP} Grok patterns accept a single
       digit month but not a single digit day
    droberts195 committed May 22, 2019
    Commit 893d99b
  2. Commit 2d1717e
  3. Fixing docs test

    droberts195 committed May 22, 2019
    Commit 1abe5fe

Commits on May 23, 2019

  1. Memory optimisations

    Also fixing a couple of typos
    droberts195 committed May 23, 2019
    Commit 143cf69
  2. Commit 9bca698
  3. Address review comments

    droberts195 committed May 23, 2019
    Commit e046918
  4. Fix typo

    droberts195 committed May 23, 2019
    Commit 6c88cbd