
Improved timeout implementation #9156

Closed
markharwood opened this issue Jan 6, 2015 · 6 comments
Labels: >enhancement, :Search/Search (Search-related issues that do not fall into other categories)

Comments

@markharwood (Contributor)

The problem

Currently, search timeout checks are only performed by Lucene's TimeLimitingCollector class as each matching document is collected. This means a search contains un-timed sections of code that have the potential to over-run, e.g. (a rough sketch of the existing per-document check pattern follows this list):

  1. The loop evaluating regular expressions in terms aggregation with include/exclude regex clauses
  2. "Rewrite" methods for certain expensive Lucene queries
  3. Any expensive queries (e.g. with scripted scoring) that don't produce any matches and therefore don't call TimeLimitingCollector
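
To make the gap concrete, the current mechanism only consults the clock from inside document collection, roughly along these lines (a simplified sketch of the pattern, not the actual Lucene TimeLimitingCollector source; class and method names are illustrative):

```java
// Simplified sketch of the per-document timeout pattern - illustrative only,
// not the real Lucene TimeLimitingCollector implementation.
public class TimeoutCheckingCollector {

    private final long timeoutMillis;
    private final long startMillis = System.currentTimeMillis();

    public TimeoutCheckingCollector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called once per matching document - the only place the clock is consulted. */
    public void collect(int doc) {
        if (System.currentTimeMillis() - startMillis > timeoutMillis) {
            throw new RuntimeException("Search request timed out");
        }
        // ... score and collect the hit ...
    }
}
```

Query rewrites, the regex loops in the terms aggregation, and queries that never produce a match never reach collect(), so none of that work is ever time-checked.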

The previous attempt at solving the problem

The implementation proposed in #4586 has these issues:
a) It introduces a new time-tracking class, ActivityTimeMonitor, and the need to pass this as per-thread context - but we already have a "Counter" object available in the existing SearchContext that can be re-used.
b) New timeout checks were applied liberally by wrapping all of Lucene's low-level file accesses - this was seen as introducing overhead.

This proposal

The approach proposed here is based on the following changes:

  1. Reuse of the existing "Counter" functionality for tracking time cheaply based on estimates
  2. The introduction of an "isTimedOut" check to SearchContext that is aware of the time already spent servicing the request - potentially across multiple phases. The start time will be recorded when the SearchContext is first established, and all over-runs are calculated as time elapsed from that point rather than from the point at which a particular phase (e.g. collection) is started. A minimal sketch of this check follows the list below.
  3. Selective addition of "isTimedOut" checks to sections of existing code with the potential to overrun e.g. IncludeExclude.java. This may also extend into a Lucene change to handle expensive internal operations like query rewrites.
  4. Rejection of search requests that include a timeout setting that is less than the granularity of the time intervals we track in the Counter class in 1). This helps set expectations about the level of accuracy we have in our timeout logic (see #9092, "Clarification on setTimeout and isTimeout Java API methods"). This introduces a breaking change to the API, which is why this change is targeted for 2.0.
  5. The ability to change the update interval of the Counter in 1) from its default of 200ms to overcome rejections introduced by 4).
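
A minimal sketch of how the isTimedOut check could look, assuming a Counter-like cached clock that a background thread advances every estimated_time_interval (the class and field names here are illustrative assumptions, not the final API):

```java
import java.util.function.LongSupplier;

// Illustrative sketch only - names and signatures are assumptions, not the final API.
public class TimedSearchContext {

    private final LongSupplier estimatedTimeInMillis; // backed by the shared, cheap Counter
    private final long startTimeInMillis;             // recorded when the SearchContext is created
    private final long timeoutInMillis;               // -1 means no timeout was requested

    public TimedSearchContext(LongSupplier estimatedTimeInMillis, long timeoutInMillis) {
        this.estimatedTimeInMillis = estimatedTimeInMillis;
        this.startTimeInMillis = estimatedTimeInMillis.getAsLong();
        this.timeoutInMillis = timeoutInMillis;
    }

    /** Cheap check that long-running sections (e.g. the IncludeExclude regex loop) can call periodically. */
    public boolean isTimedOut() {
        return timeoutInMillis != -1
                && estimatedTimeInMillis.getAsLong() - startTimeInMillis > timeoutInMillis;
    }
}
```

The point of reading a cached estimate rather than calling System.currentTimeMillis() is that the check is cheap enough to be sprinkled into hot loops without the overhead concerns raised against #4586.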

Timer accuracy

Technically, any timeout setting sent in a search request has to be at least double the Counter update interval to avoid false positives (so, using the defaults, this would be > 400ms). This is because a search taking 200.1 milliseconds could appear to span 3 Counter time intervals of 0, 200 and 400 if the timer checks were unlucky enough to be made at 199.95 (still in interval 0-200) and then 400.05 (just ticked into interval 400-600). So the estimated time of this 200.1ms query is 400 minus 0 = 400ms.
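
The worst case can be reproduced with a few lines of arithmetic (a toy illustration using the numbers above; the rounding mimics how a cached, interval-based clock would report time):

```java
// Toy illustration of the worst-case estimate with a 200ms counter interval.
public class EstimateError {
    public static void main(String[] args) {
        long interval = 200;                     // counter tick interval in ms
        double actualStart = 199.95;             // just before the tick at 200ms
        double actualEnd = actualStart + 200.1;  // the search really takes 200.1ms

        // The cached clock only ever reports the last tick it has passed.
        long cachedAtStart = (long) (actualStart / interval) * interval; // 0
        long cachedAtEnd   = (long) (actualEnd / interval) * interval;   // 400

        long estimated = cachedAtEnd - cachedAtStart;                    // 400ms
        System.out.println("estimated elapsed = " + estimated + "ms for a 200.1ms search");
        // Any timeout setting below 2 * interval could therefore trip spuriously.
    }
}
```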

Changing timer accuracy

For timeouts < 400ms the default interval used by the internal estimated-time Counter must be reconfigured. Unfortunately this cannot be done with the existing implementation: ThreadPool.java looks for a threadpool.estimated_time_interval setting in the configuration, but earlier sections of the code insist that ALL threadpool.* settings are of the 3-depth form (e.g. threadpool.search.size) and reject the 2-depth threadpool.estimated_time_interval if it is set. So I don't believe it is possible to set this interval with the current implementation, and we may want to take this opportunity to rename it - e.g. internal_clock.estimated_time_interval?
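
For illustration, the constraint described above amounts to something like the following (a hypothetical sketch of the validation rule, not the actual ThreadPool.java code):

```java
// Hypothetical sketch of the setting-depth constraint described above -
// not the real ThreadPool.java validation logic.
public class ThreadpoolSettingCheck {
    public static void main(String[] args) {
        validate("threadpool.search.size");             // accepted: threadpool.<pool>.<setting>
        validate("threadpool.estimated_time_interval"); // rejected: only two segments deep
    }

    static void validate(String key) {
        if (key.split("\\.").length == 3) {
            System.out.println("accepted: " + key);
        } else {
            System.out.println("rejected: " + key + " (expected threadpool.<pool>.<setting>)");
        }
    }
}
```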

@markharwood (Contributor, Author)

I've discovered another "hotspot" that can massively overrun a timeout setting: parent/child queries on an index that is receiving updates and has the default configuration of not eagerly loading global ordinals.

While we could weave some timeout checks into the fielddata-loading loop to abort earlier, I think the right solution in this scenario is for the administrator to reconfigure their system to load global ordinals eagerly, as per the recommendation. This problem scenario is not the fault of a badly formed query but of a badly configured platform, so I feel we should not introduce extra timeout-checking logic here for what is an administrator issue.

@martijnvg (Member)

I feel the same way about global ordinals loading: if a platform relies on search requests being fast via global ordinals, then global ordinals should be eagerly loaded. This applies not only to parent/child but also to the terms bucket aggregator.

@andrassy commented Feb 4, 2015

+1

@markharwood (Contributor, Author)

Update: I tried rebasing this on master recently and it got pretty messy.
Aside from basic code merging issues there were these concerns:

  1. The code that allowed configuration of custom timer intervals was redundant (I hope), as the system for loading config had recently been changed by Simon
  2. The main culprit that was missing a timer check (regexes in terms agg "include" clauses) had also undergone change, based on the use of automata in Lucene, which is hopefully faster.

I didn't manage to complete the rebase in the time I had set aside to look at this.

@amontalenti
+1

@clintongormley (Contributor)

Too much time has passed and, with recent versions, this seems to be less of an issue. Closing.
