Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Slow performance of file structure finder with log message containing many dates #35137

Closed
droberts195 opened this issue Oct 31, 2018 · 2 comments · Fixed by #41948
Closed
Assignees
Labels
>bug :ml Machine learning

Comments

@droberts195
Copy link
Contributor

@grabowskit noticed that the find_file_structure endpoint timed out on a log file that contained messages from a Cloud allocator.

I debugged this and found that the messages that cause a particular problem for the file structure finder are those that contain many dates, for example:

[2018-10-05 11:05:17,526][INFO ][no.found.runner.allocation.elasticsearch.ElasticsearchDockerContainer] Creating container from [ElasticsearchDockerContainer(ElasticsearchContainerId(11111111111111111111111111111111,instance-0000000000),-,List(),{"data":{"name":"admin-console-elasticsearch"},"kind":"elasticsearch","resources":{"cpu":{"hard_limit":false,"boost":true}},"shield":{"system_key":"-","found_users":{"found-internal-constructor":{"roles":["found-internal-admin","superuser"],"username":"found-internal-constructor","password":"-","password_hash":"-","valid_until":"2023-10-05T11:05:15.879Z"},"found-internal-system":{"roles":["found-internal-constructor","superuser"],"username":"found-internal-system","password":"-","password_hash":"-","valid_until":"2060-10-05T11:05:10.624Z"},"found-internal-admin-proxy":{"roles":["found-internal-admin"],"username":"found-internal-admin-proxy","password_hash":"-","valid_until":"2060-10-05T11:05:10.546Z"},"ece-admin-cluster-indexer":{"roles":["found-internal-admin"],"username":"ece-admin-cluster-indexer","password_hash":"-"},"found-internal-monitor":{"roles":["found-internal-monitor"],"username":"found-internal-monitor","password_hash":"-","valid_until":"2060-10-05T11:05:10.701Z"},"found-local-allocator":{"password_hash":"-","roles":["found-internal-admin"]},"ec-local-beats-monitor":{"password_hash":"-","roles":["found-internal-monitor"]}}},"snapshot":{"enabled":false,"suspended":{"constructor":true}},"user_id":"root","system_owned":true,"source":{"facilitator":"bootstrap","action":"create-admin-console-backend","date":"2018-10-05T11:05:10.535Z"}},{"instance_capacity":4096,"instance_configuration_id":"data.default","allocator_filter":{"match_all":{}},"node_types":["data","master","ingest"],"elasticsearch":{"node":{"data":true,"master":true,"ingest":true},"plugins":["repository-s3","found-elasticsearch","x-pack"],"gateway_type":"local","user_settings_overrides":{},"user_bundles":[],"version":"5.6.12","confirmed_election":false,"user_settings":{"script.engine.expression.file":true,"script.engine.expression.inline":true,"script.engine.expression.stored":true,"script.engine.mustache.file":true,"script.engine.mustache.stored":true,"script.engine.painless.file":true,"script.engine.painless.inline":true,"script.engine.painless.stored":true,"script.file":false,"script.inline":false,"script.stored":false},"user_plugins":[],"seed_hosts":[]},"resources":{"quota":{"fs_multiplier":32}},"ssd":false,"plan_version":"1.0","logical_zone_name":"zone-0","start":true,"quorum_size":1,"maintenance":true,"source":{"facilitator":"constructor","action":"elasticsearch.create-instances","date":"2018-10-05T11:05:15.963Z"},"allocator_name":"10.10.10.10","allocator_id":"10.10.10.10","allocator_zone_name":"ece-zone-1"},None,None,{"forward_auth_headers":true,"shield":{"allow_anonymous":false,"esusers":{"roles":{"anonymous":{}},"users":{},"users_roles":{}}},"source":{"facilitator":"bootstrap","action":"create-admin-console-backend","date":"2018-10-05T11:05:10.535Z"}},AllocatorShieldUser(ec-local-beats-monitor,-,List(found-internal-monitor)),AllocatorShieldUser(found-local-allocator,-,List(found-internal-admin)),no.found.template.ZooKeeperResourceLocator@1f41214d,DefaultAllocatorAllocationEnvironment(438182,10.10.10.10,ece-zone-1,10.10.10.10,Map(kibana -> found/{kind}:{version}, apm -> found/{kind}:{version}, elasticsearch -> found/{kind}:{version}),Map(kibana -> 4.0, apm -> 4.0, elasticsearch -> 32.0),Map(kibana -> 5 seconds, apm -> 30000 milliseconds, elasticsearch -> 5 seconds),/usr/elastic/data/elastic/10.10.10.10/services/allocator/containers,/app/state,/app/state,state/runner-initial-containers-content.json,no.found.zookeeper.VersionCache@2b367d62,no.found.http.ActorHttpClient@3d6b7a00,no.found.runner.managers.ContainerIdsManager@1b8f1b99,no.found.runner.managers.XFSQuotaSupport@2291e278,no.found.filewatch.FileWatchService@3ad44625,no.found.template.VersionedTemplateManager@5c098005,no.found.runner.managers.ZooKeeperManager@3f390bce,no.found.runner.managers.NetworkResourcesManager@1296ac73,no.found.runner.settings.AllocatorSettings@aeb256a,Map(kibana -> 0.0, apm -> 0.0, elasticsearch -> 0.0)),no.found.featureflags.CloudFeatureManager@1af2959a)]: [{"Image":"docker.hs.coop.ch/cloud-assets/elasticsearch:5.6.12-0","Labels":{"co.elastic.cloud.allocator.id":"10.10.10.10","co.elastic.cloud.allocator.zone":"ece-zone-1","co.elastic.cloud.allocator.instance_id":"instance-0000000000","co.elastic.cloud.allocator.cluster_id":"11111111111111111111111111111111","co.elastic.cloud.allocator.user_id":"root","co.elastic.cloud.allocator.kind":"elasticsearch","co.elastic.cloud.allocator.type":"elasticsearch","co.elastic.cloud.allocator.type_version":"5.6.12"},"ExposedPorts":{"11111/tcp":{},"22222/tcp":{}},"HostConfig":{"CpuShares":11,"Memory":4294967296,"MemorySwap":4294967296,"BlkioWeight":11,"CpuPeriod":100000,"CpuQuota":4800000,"Binds":["/usr/elastic/data/elastic/10.10.10.10/services/allocator/containers/elasticsearch/11111111111111111111111111111111/instance-0000000000:/app"],"RestartPolicy":{"Name":"always"},"ExtraHosts":["containerhost:${ALLOCATOR_HOST_IP}","dockerhost:${ALLOCATOR_HOST_IP}"],"PortBindings":{"11111/tcp":[{"HostIp":"0.0.0.0","HostPort":"11111"}],"22222/tcp":[{"HostIp":"0.0.0.0","HostPort":"22222"}]},"CapAdd":["SETUID","SETGID","CHOWN","DAC_OVERRIDE"],"CapDrop":["ALL"]},"Volumes":{"/app":{}},"Env":["ELASTIC_UID=33333","ELASTIC_GID=33333"]}] {"ec_container_kind":"elasticsearch","ec_container_group":"11111111111111111111111111111111","ec_container_name":"instance-0000000000"}

(Some of the details in that sample have been altered to avoid disclosing real values from Cloud.)

Some temporary debug showed the following for this message:

[2018-10-31T16:22:20,667][INFO ][o.e.x.m.f.TimestampFormatFinder] [node-0] quick rule out 0
[2018-10-31T16:22:20,667][INFO ][o.e.x.m.f.TimestampFormatFinder] [node-0] done
[2018-10-31T16:22:20,667][INFO ][o.e.x.m.f.TimestampFormatFinder] [node-0] quick rule out 1
[2018-10-31T16:22:20,667][INFO ][o.e.x.m.f.TimestampFormatFinder] [node-0] done
[2018-10-31T16:22:20,667][INFO ][o.e.x.m.f.TimestampFormatFinder] [node-0] trying [YYYY-MM-dd HH:mm:ss,SSS Z]
[2018-10-31T16:22:21,306][INFO ][o.e.x.m.f.TimestampFormatFinder] [node-0] done
[2018-10-31T16:22:21,306][INFO ][o.e.x.m.f.TimestampFormatFinder] [node-0] trying [YYYY-MM-dd HH:mm:ss,SSSZ]
[2018-10-31T16:22:22,021][INFO ][o.e.x.m.f.TimestampFormatFinder] [node-0] done
[2018-10-31T16:22:22,021][INFO ][o.e.x.m.f.TimestampFormatFinder] [node-0] trying [YYYY-MM-dd HH:mm:ss,SSSZZ]
[2018-10-31T16:22:22,727][INFO ][o.e.x.m.f.TimestampFormatFinder] [node-0] done
[2018-10-31T16:22:22,727][INFO ][o.e.x.m.f.TimestampFormatFinder] [node-0] trying [YYYY-MM-dd HH:mm:ss,SSS]
[2018-10-31T16:22:22,728][INFO ][o.e.x.m.f.TimestampFormatFinder] [node-0] done

The problems are:

  1. Because there are multiple timestamps in different formats the functionality to quickly rule out some timestamp patterns doesn't work.
  2. Then patterns that nearly match but not quite cause excessive backtracking and searching in the Grok pattern matcher due to the fact that a large prefix of the pattern matches in several places - around 2/3rds of a second per nearly matching pattern, just for this one message.
@droberts195 droberts195 added >bug :ml Machine learning labels Oct 31, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

@droberts195 droberts195 self-assigned this Apr 16, 2019
droberts195 added a commit to droberts195/elasticsearch that referenced this issue May 8, 2019
This change contains a major refactoring of the timestamp
format determination code used by the ML find file structure
endpoint.

Previously timestamp format determination was done separately
for each piece of text supplied to the timestamp format finder.
This had the drawback that it was not possible to distinguish
dd/MM and MM/dd in the case where both numbers were 12 or less.
In order to do this sensibly it is best to look across all the
available timestamps and see if one of the numbers is greater
than 12 in any of them.  This necessitates making the timestamp
format finder an instantiable class that can accumulate evidence
over time.

Another problem with the previous approach was that it was only
possible to override the timestamp format to one of a limited
set of timestamp formats.  There was no way out if a file to be
analysed had a timestamp that was sane yet not in the supported
set.  This is now changed to allow any timestamp format that can
be parsed by a combination of these Java date/time formats:
yy, yyyy, M, MM, MMM, MMMM, d, dd, EEE, EEEE, H, HH, h, mm, ss,
a, XX, XXX, zzz
Additionally S letter groups (fractional seconds) are supported
providing they occur after ss and separated from the ss by a dot,
comma or colon.  Spacing and punctuation is also permitted with
the exception of the question mark, newline and carriage return
characters, together with literal text enclosed in single quotes.

The full list of changes/improvements in this refactor is:

- Make TimestampFormatFinder an instantiable class
- Overrides must be specified in Java date/time format - Joda
  format is no longer accepted
- Joda timestamp formats in outputs are now derived from the
  determined or overridden Java timestamp formats, not stored
  separately
- Functionality for determining the "best" timestamp format in
  a set of lines has been moved from TextLogFileStructureFinder
  to TimestampFormatFinder, taking advantage of the fact that
  TimestampFormatFinder is now an instantiable class with state
- The functionality to quickly rule out some possible Grok
  patterns when looking for timestamp formats has been changed
  from using simple regular expressions to the much faster
  approach of using the Shift-And method of sub-string search,
  but using an "alphabet" consisting of just 1 (representing any
  digit) and 0 (representing non-digits)
- Timestamp format overrides are now much more flexible
- Timestamp format overrides that do not correspond to a built-in
  Grok pattern are mapped to a %{CUSTOM_TIMESTAMP} Grok pattern
  whose definition is included within the date processor in the
  ingest pipeline
- Grok patterns that correspond to multiple Java date/time
  patterns are now handled better - the Grok pattern is accepted
  as matching broadly, and the required set of Java date/time
  patterns is built up considering all observed samples
- As a result of the more flexible acceptance of Grok patterns,
  when looking for the "best" timestamp in a set of lines
  timestamps are considered different if they are preceded by
  a different sequence of punctuation characters (to prevent
  timestamps far into some lines being considered similar to
  timestamps near the beginning of other lines)
- Out-of-the-box Grok patterns that are considered now include
  %{DATE} and %{DATESTAMP}, which have indeterminate day/month
  ordering
- The order of day/month in formats with indeterminate day/month
  order is determined by considering all observed samples (plus
  the server locale if the observed samples still do not suggest
  an ordering)

Relates elastic#38086
Closes elastic#35137
Closes elastic#35132
@droberts195
Copy link
Contributor Author

The log file that exhibited this problem is still tricky to deal with, but following the changes of the second commit of #41948 it has at least moved on from a file that was unviable to analyse to one that's just slow to analyse.

At the time this issue was raised the problem report was:

Looks like it works for the first 300 lines, but not for 400. It hits the 25s timeout sometime with more than 300 lines of this log file.

Following the improvements it works for the first 3400 lines, but not for 3500 (i.e. hitting the default 25s timeout somewhere between 3400 and 3500 lines). This still isn't great but is a useful improvement, especially as the default lines_to_sample is 1000, so now it works with the default settings.

droberts195 added a commit that referenced this issue May 23, 2019
…1948)

This change contains a major refactoring of the timestamp
format determination code used by the ML find file structure
endpoint.

Previously timestamp format determination was done separately
for each piece of text supplied to the timestamp format finder.
This had the drawback that it was not possible to distinguish
dd/MM and MM/dd in the case where both numbers were 12 or less.
In order to do this sensibly it is best to look across all the
available timestamps and see if one of the numbers is greater
than 12 in any of them.  This necessitates making the timestamp
format finder an instantiable class that can accumulate evidence
over time.

Another problem with the previous approach was that it was only
possible to override the timestamp format to one of a limited
set of timestamp formats.  There was no way out if a file to be
analysed had a timestamp that was sane yet not in the supported
set.  This is now changed to allow any timestamp format that can
be parsed by a combination of these Java date/time formats:
yy, yyyy, M, MM, MMM, MMMM, d, dd, EEE, EEEE, H, HH, h, mm, ss,
a, XX, XXX, zzz
Additionally S letter groups (fractional seconds) are supported
providing they occur after ss and separated from the ss by a dot,
comma or colon.  Spacing and punctuation is also permitted with
the exception of the question mark, newline and carriage return
characters, together with literal text enclosed in single quotes.

The full list of changes/improvements in this refactor is:

- Make TimestampFormatFinder an instantiable class
- Overrides must be specified in Java date/time format - Joda
  format is no longer accepted
- Joda timestamp formats in outputs are now derived from the
  determined or overridden Java timestamp formats, not stored
  separately
- Functionality for determining the "best" timestamp format in
  a set of lines has been moved from TextLogFileStructureFinder
  to TimestampFormatFinder, taking advantage of the fact that
  TimestampFormatFinder is now an instantiable class with state
- The functionality to quickly rule out some possible Grok
  patterns when looking for timestamp formats has been changed
  from using simple regular expressions to the much faster
  approach of using the Shift-And method of sub-string search,
  but using an "alphabet" consisting of just 1 (representing any
  digit) and 0 (representing non-digits)
- Timestamp format overrides are now much more flexible
- Timestamp format overrides that do not correspond to a built-in
  Grok pattern are mapped to a %{CUSTOM_TIMESTAMP} Grok pattern
  whose definition is included within the date processor in the
  ingest pipeline
- Grok patterns that correspond to multiple Java date/time
  patterns are now handled better - the Grok pattern is accepted
  as matching broadly, and the required set of Java date/time
  patterns is built up considering all observed samples
- As a result of the more flexible acceptance of Grok patterns,
  when looking for the "best" timestamp in a set of lines
  timestamps are considered different if they are preceded by
  a different sequence of punctuation characters (to prevent
  timestamps far into some lines being considered similar to
  timestamps near the beginning of other lines)
- Out-of-the-box Grok patterns that are considered now include
  %{DATE} and %{DATESTAMP}, which have indeterminate day/month
  ordering
- The order of day/month in formats with indeterminate day/month
  order is determined by considering all observed samples (plus
  the server locale if the observed samples still do not suggest
  an ordering)

Relates #38086
Closes #35137
Closes #35132
droberts195 added a commit that referenced this issue May 24, 2019
…1948)

This change contains a major refactoring of the timestamp
format determination code used by the ML find file structure
endpoint.

Previously timestamp format determination was done separately
for each piece of text supplied to the timestamp format finder.
This had the drawback that it was not possible to distinguish
dd/MM and MM/dd in the case where both numbers were 12 or less.
In order to do this sensibly it is best to look across all the
available timestamps and see if one of the numbers is greater
than 12 in any of them.  This necessitates making the timestamp
format finder an instantiable class that can accumulate evidence
over time.

Another problem with the previous approach was that it was only
possible to override the timestamp format to one of a limited
set of timestamp formats.  There was no way out if a file to be
analysed had a timestamp that was sane yet not in the supported
set.  This is now changed to allow any timestamp format that can
be parsed by a combination of these Java date/time formats:
yy, yyyy, M, MM, MMM, MMMM, d, dd, EEE, EEEE, H, HH, h, mm, ss,
a, XX, XXX, zzz
Additionally S letter groups (fractional seconds) are supported
providing they occur after ss and separated from the ss by a dot,
comma or colon.  Spacing and punctuation is also permitted with
the exception of the question mark, newline and carriage return
characters, together with literal text enclosed in single quotes.

The full list of changes/improvements in this refactor is:

- Make TimestampFormatFinder an instantiable class
- Overrides must be specified in Java date/time format - Joda
  format is no longer accepted
- Joda timestamp formats in outputs are now derived from the
  determined or overridden Java timestamp formats, not stored
  separately
- Functionality for determining the "best" timestamp format in
  a set of lines has been moved from TextLogFileStructureFinder
  to TimestampFormatFinder, taking advantage of the fact that
  TimestampFormatFinder is now an instantiable class with state
- The functionality to quickly rule out some possible Grok
  patterns when looking for timestamp formats has been changed
  from using simple regular expressions to the much faster
  approach of using the Shift-And method of sub-string search,
  but using an "alphabet" consisting of just 1 (representing any
  digit) and 0 (representing non-digits)
- Timestamp format overrides are now much more flexible
- Timestamp format overrides that do not correspond to a built-in
  Grok pattern are mapped to a %{CUSTOM_TIMESTAMP} Grok pattern
  whose definition is included within the date processor in the
  ingest pipeline
- Grok patterns that correspond to multiple Java date/time
  patterns are now handled better - the Grok pattern is accepted
  as matching broadly, and the required set of Java date/time
  patterns is built up considering all observed samples
- As a result of the more flexible acceptance of Grok patterns,
  when looking for the "best" timestamp in a set of lines
  timestamps are considered different if they are preceded by
  a different sequence of punctuation characters (to prevent
  timestamps far into some lines being considered similar to
  timestamps near the beginning of other lines)
- Out-of-the-box Grok patterns that are considered now include
  %{DATE} and %{DATESTAMP}, which have indeterminate day/month
  ordering
- The order of day/month in formats with indeterminate day/month
  order is determined by considering all observed samples (plus
  the server locale if the observed samples still do not suggest
  an ordering)

Relates #38086
Closes #35137
Closes #35132
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this issue May 27, 2019
…astic#41948)

This change contains a major refactoring of the timestamp
format determination code used by the ML find file structure
endpoint.

Previously timestamp format determination was done separately
for each piece of text supplied to the timestamp format finder.
This had the drawback that it was not possible to distinguish
dd/MM and MM/dd in the case where both numbers were 12 or less.
In order to do this sensibly it is best to look across all the
available timestamps and see if one of the numbers is greater
than 12 in any of them.  This necessitates making the timestamp
format finder an instantiable class that can accumulate evidence
over time.

Another problem with the previous approach was that it was only
possible to override the timestamp format to one of a limited
set of timestamp formats.  There was no way out if a file to be
analysed had a timestamp that was sane yet not in the supported
set.  This is now changed to allow any timestamp format that can
be parsed by a combination of these Java date/time formats:
yy, yyyy, M, MM, MMM, MMMM, d, dd, EEE, EEEE, H, HH, h, mm, ss,
a, XX, XXX, zzz
Additionally S letter groups (fractional seconds) are supported
providing they occur after ss and separated from the ss by a dot,
comma or colon.  Spacing and punctuation is also permitted with
the exception of the question mark, newline and carriage return
characters, together with literal text enclosed in single quotes.

The full list of changes/improvements in this refactor is:

- Make TimestampFormatFinder an instantiable class
- Overrides must be specified in Java date/time format - Joda
  format is no longer accepted
- Joda timestamp formats in outputs are now derived from the
  determined or overridden Java timestamp formats, not stored
  separately
- Functionality for determining the "best" timestamp format in
  a set of lines has been moved from TextLogFileStructureFinder
  to TimestampFormatFinder, taking advantage of the fact that
  TimestampFormatFinder is now an instantiable class with state
- The functionality to quickly rule out some possible Grok
  patterns when looking for timestamp formats has been changed
  from using simple regular expressions to the much faster
  approach of using the Shift-And method of sub-string search,
  but using an "alphabet" consisting of just 1 (representing any
  digit) and 0 (representing non-digits)
- Timestamp format overrides are now much more flexible
- Timestamp format overrides that do not correspond to a built-in
  Grok pattern are mapped to a %{CUSTOM_TIMESTAMP} Grok pattern
  whose definition is included within the date processor in the
  ingest pipeline
- Grok patterns that correspond to multiple Java date/time
  patterns are now handled better - the Grok pattern is accepted
  as matching broadly, and the required set of Java date/time
  patterns is built up considering all observed samples
- As a result of the more flexible acceptance of Grok patterns,
  when looking for the "best" timestamp in a set of lines
  timestamps are considered different if they are preceded by
  a different sequence of punctuation characters (to prevent
  timestamps far into some lines being considered similar to
  timestamps near the beginning of other lines)
- Out-of-the-box Grok patterns that are considered now include
  %{DATE} and %{DATESTAMP}, which have indeterminate day/month
  ordering
- The order of day/month in formats with indeterminate day/month
  order is determined by considering all observed samples (plus
  the server locale if the observed samples still do not suggest
  an ordering)

Relates elastic#38086
Closes elastic#35137
Closes elastic#35132
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :ml Machine learning
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants