
Improved timeout implementation #9156

Closed
markharwood opened this issue Jan 6, 2015 · 6 comments
Labels: >enhancement, :Search/Search (Search-related issues that do not fall into other categories)

Comments

@markharwood (Contributor)

The problem

Currently, search timeout checks are only performed by Lucene's TimeLimitingCollector class as each matching document is collected. This means a search contains un-timed sections of code that have the potential to over-run, e.g. (a rough sketch of the existing per-document check pattern follows this list):

  1. The loop evaluating regular expressions in terms aggregation with include/exclude regex clauses
  2. "Rewrite" methods for certain expensive Lucene queries
  3. Any expensive queries (e.g. with scripted scoring) that don't produce any matches and therefore don't call TimeLimitingCollector
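
To make the gap concrete, the current mechanism only consults the clock from inside document collection, roughly along these lines (a simplified sketch of the pattern, not the actual Lucene TimeLimitingCollector source; class and method names are illustrative):

```java
// Simplified sketch of the per-document timeout pattern - illustrative only,
// not the real Lucene TimeLimitingCollector implementation.
public class TimeoutCheckingCollector {

    private final long timeoutMillis;
    private final long startMillis = System.currentTimeMillis();

    public TimeoutCheckingCollector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called once per matching document - the only place the clock is consulted. */
    public void collect(int doc) {
        if (System.currentTimeMillis() - startMillis > timeoutMillis) {
            throw new RuntimeException("Search request timed out");
        }
        // ... score and collect the hit ...
    }
}
```

Query rewrites, the regex loops in the terms aggregation, and queries that never produce a match never reach collect(), so none of that work is ever time-checked.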

The previous attempt at solving the problem

The implementation proposed in #4586 has these issues:
a) It introduces a new time-tracking class, ActivityTimeMonitor, and the need to pass this as per-thread context - but we already have a "Counter" object available in the existing SearchContext that can be re-used.
b) New timeout checks were applied liberally by wrapping all of Lucene's low-level file accesses - this was seen as introducing overhead.

This proposal

The approach proposed here is based on the following changes:

  1. Reuse of the existing "Counter" functionality for tracking time cheaply based on estimates
  2. The introduction of an "isTimedOut" check to SearchContext that is aware of the time already spent servicing the request - potentially across multiple phases. The start time will be recorded when the SearchContext is first established, and all over-runs are calculated as time elapsed from that point rather than from the point at which a particular phase (e.g. collection) is started. A minimal sketch of this check follows the list below.
  3. Selective addition of "isTimedOut" checks to sections of existing code with the potential to overrun e.g. IncludeExclude.java. This may also extend into a Lucene change to handle expensive internal operations like query rewrites.
  4. Rejection of search requests that include a timeout setting that is less than the granularity of the time intervals we track in the Counter class in 1). This helps set expectations about the level of accuracy we have in our timeout logic (see #9092, "Clarification on setTimeout and isTimeout Java API methods"). This introduces a breaking change to the API, which is why this change is targeted for 2.0.
  5. The ability to change the update interval of the Counter in 1) from its default of 200ms to overcome rejections introduced by 4).
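
A minimal sketch of how the isTimedOut check could look, assuming a Counter-like cached clock that a background thread advances every estimated_time_interval (the class and field names here are illustrative assumptions, not the final API):

```java
import java.util.function.LongSupplier;

// Illustrative sketch only - names and signatures are assumptions, not the final API.
public class TimedSearchContext {

    private final LongSupplier estimatedTimeInMillis; // backed by the shared, cheap Counter
    private final long startTimeInMillis;             // recorded when the SearchContext is created
    private final long timeoutInMillis;               // -1 means no timeout was requested

    public TimedSearchContext(LongSupplier estimatedTimeInMillis, long timeoutInMillis) {
        this.estimatedTimeInMillis = estimatedTimeInMillis;
        this.startTimeInMillis = estimatedTimeInMillis.getAsLong();
        this.timeoutInMillis = timeoutInMillis;
    }

    /** Cheap check that long-running sections (e.g. the IncludeExclude regex loop) can call periodically. */
    public boolean isTimedOut() {
        return timeoutInMillis != -1
                && estimatedTimeInMillis.getAsLong() - startTimeInMillis > timeoutInMillis;
    }
}
```

The point of reading a cached estimate rather than calling System.currentTimeMillis() is that the check is cheap enough to be sprinkled into hot loops without the overhead concerns raised against #4586.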

Timer accuracy

Technically, any timeout setting sent in a search request has to be at least double the Counter update interval to avoid false positives (so, using the defaults, this would be > 400ms). This is because a search taking 200.1 milliseconds could appear to span 3 Counter time intervals of 0, 200 and 400 if the timer checks were unlucky enough to be made at 199.95 (still in interval 0-200) and then 400.05 (just ticked into interval 400-600). So the estimated time of this 200.1ms query is 400 minus 0 = 400ms.
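
The worst case can be reproduced with a few lines of arithmetic (a toy illustration using the numbers above; the rounding mimics how a cached, interval-based clock would report time):

```java
// Toy illustration of the worst-case estimate with a 200ms counter interval.
public class EstimateError {
    public static void main(String[] args) {
        long interval = 200;                     // counter tick interval in ms
        double actualStart = 199.95;             // just before the tick at 200ms
        double actualEnd = actualStart + 200.1;  // the search really takes 200.1ms

        // The cached clock only ever reports the last tick it has passed.
        long cachedAtStart = (long) (actualStart / interval) * interval; // 0
        long cachedAtEnd   = (long) (actualEnd / interval) * interval;   // 400

        long estimated = cachedAtEnd - cachedAtStart;                    // 400ms
        System.out.println("estimated elapsed = " + estimated + "ms for a 200.1ms search");
        // Any timeout setting below 2 * interval could therefore trip spuriously.
    }
}
```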

Changing timer accuracy

For timeouts < 400ms the default interval used by the internal estimated-time Counter must be reconfigured. Unfortunately this cannot be done with the existing implementation: ThreadPool.java looks for a threadpool.estimated_time_interval setting in the configuration, but earlier sections of the code insist that ALL threadpool.* settings are of the 3-depth form (e.g. threadpool.search.size) and reject the 2-depth threadpool.estimated_time_interval if it is set. So I don't believe it is possible to set this interval with the current implementation, and we may want to take this opportunity to rename it - e.g. internal_clock.estimated_time_interval?
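
For illustration, the constraint described above amounts to something like the following (a hypothetical sketch of the validation rule, not the actual ThreadPool.java code):

```java
// Hypothetical sketch of the setting-depth constraint described above -
// not the real ThreadPool.java validation logic.
public class ThreadpoolSettingCheck {
    public static void main(String[] args) {
        validate("threadpool.search.size");             // accepted: threadpool.<pool>.<setting>
        validate("threadpool.estimated_time_interval"); // rejected: only two segments deep
    }

    static void validate(String key) {
        if (key.split("\\.").length == 3) {
            System.out.println("accepted: " + key);
        } else {
            System.out.println("rejected: " + key + " (expected threadpool.<pool>.<setting>)");
        }
    }
}
```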

@markharwood (Contributor, Author)

I've discovered another "hotspot" that can massively overrun a timeout setting: parent/child queries on an index that is receiving updates and has the default configuration of not eagerly loading global ordinals.

While we could weave some timeout checks into the fielddata-loading loop to abort earlier, I think the right solution in this scenario is for the administrator to reconfigure their system to load global ordinals eagerly, as per the recommendation. This problem scenario is not the fault of a badly formed query but of a badly configured platform, so I feel we should not introduce extra timeout-checking logic here for what is an administrator issue.

@martijnvg (Member)

I feel the same way about global ordinals loading: if a platform relies on search requests being fast via global ordinals, then global ordinals should be eagerly loaded. This applies not only to parent/child but also to the terms bucket aggregator.

@andrassy commented Feb 4, 2015

+1

@markharwood (Contributor, Author)

Update: I tried rebasing this on master recently and it got pretty messy.
Aside from basic code merging issues there were these concerns:

  1. The code that allowed configuration of custom timer intervals was redundant (I hope), as the system for loading config had recently been changed by Simon
  2. The main culprit that was missing a timer check (regexes in terms agg "include" clauses) had also undergone change, based on the use of automata in Lucene, which is hopefully faster.

I didn't manage to complete the rebase in the time I had set aside to look at this.

@amontalenti
+1

@clintongormley (Contributor)

Too much time has passed and, with recent versions, this seems to be less of an issue. Closing.
