Improved timeout implementation #9156
Comments
I've discovered another "hotspot" that can massively overrun a timeout setting: parent/child queries on an index that is receiving updates and has the default configuration of not eagerly loading global ordinals. While we could weave some timeout checks into the fielddata-loading loop to abort earlier, I think the right solution in this scenario is for the administrator to reconfigure their system with eager loading of global ordinals, as per the recommendation. This problem scenario is not the fault of a badly formed query but of a badly configured platform, so I feel we should not introduce extra timeout-checking logic here for what is an administrator issue.
I feel the same way about global ordinals loading: if a platform relies on search requests being fast via global ordinals, then global ordinals should be eagerly loaded. This applies not only to parent/child queries but also to the terms bucket aggregator.
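For reference, eager loading of global ordinals is enabled per field in the index mapping. A minimal sketch, assuming a hypothetical index `my-index` with a `keyword` field `tags` (the setting name `eager_global_ordinals` is the one documented for recent Elasticsearch versions; older 1.x releases expressed this via fielddata loading options):

```
PUT my-index
{
  "mappings": {
    "properties": {
      "tags": {
        "type": "keyword",
        "eager_global_ordinals": true
      }
    }
  }
}
```

With this set, global ordinals are rebuilt at refresh time rather than lazily on the first search that needs them, trading refresh cost for predictable search latency.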
+1
Update: I tried rebasing this on master recently and it got pretty messy.
I didn't manage to complete the rebase in the time I had set aside to look at this.
+1
Too much time has passed and, with recent versions, this seems to be less of an issue. Closing.
The problem
Currently, search timeout checks are performed only by Lucene's TimeLimitingCollector class, as each matching document is collected. This means a search contains un-timed sections of code that have the potential to over-run, e.g.:
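For context, the existing per-document check follows roughly this pattern. This is a self-contained sketch of the idea, not Lucene's actual TimeLimitingCollector (which wraps a Collector and reads a shared Counter advanced by a background thread); the class and method names here are illustrative:

```java
import java.util.concurrent.atomic.AtomicLong;

public class CollectTimeoutSketch {
    static class TimedCollector {
        private final AtomicLong clock; // coarse clock, advanced by a background thread
        private final long timeoutAt;   // absolute deadline, in clock units

        TimedCollector(AtomicLong clock, long timeoutMillis) {
            this.clock = clock;
            this.timeoutAt = clock.get() + timeoutMillis;
        }

        // Called once per matching document. This is the ONLY place the
        // timeout is checked, so any work done outside collect() is un-timed.
        void collect(int doc) {
            if (clock.get() > timeoutAt) {
                throw new RuntimeException("timeout exceeded at doc " + doc);
            }
            // ... score / gather the hit ...
        }
    }

    public static void main(String[] args) {
        AtomicLong clock = new AtomicLong(0);
        TimedCollector c = new TimedCollector(clock, 200);
        c.collect(1);         // within budget: no exception
        clock.addAndGet(500); // coarse clock jumps past the deadline
        try {
            c.collect(2);     // now throws
        } catch (RuntimeException e) {
            System.out.println("timed out as expected");
        }
    }
}
```

Because the deadline is only consulted inside `collect()`, any expensive phase that runs before or between collection calls (query rewriting, fielddata loading, etc.) can overrun the budget unchecked.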
The previous attempt at solving the problem
The implementation proposed in #4586 has these issues:
a) It introduces a new time-tracking class, ActivityTimeMonitor, and requires passing it as per-thread context, but we already have a "Counter" object available in the existing SearchContext that can be re-used.
b) New timeout checks were applied liberally by wrapping all low-level Lucene file accesses, which was seen as introducing overhead.
This proposal
The approach proposed here is based on the following changes:
Timer accuracy
Technically, any timeout setting sent in a search request has to be at least double the Counter update interval to avoid false positives (so, using the defaults, this would be > 400ms). This is because a search taking 200.1 milliseconds can appear to span three Counter intervals (0, 200, 400) if the timer checks are unlucky enough to be made at 199.95ms (still in the 0–200 interval) and at 400.05ms (just ticked into the 400–600 interval). The estimated time of this 200.1ms query is then 400 − 0 = 400ms.
Changing timer accuracy
For timeouts < 400ms, the default interval used by the internal estimated-time Counter must be reconfigured. Unfortunately, this cannot be done with the existing implementation: ThreadPool.java looks for a `threadpool.estimated_time_interval` setting in the configuration, but earlier sections of the code insist that ALL `threadpool.*` settings are of the 3-depth form (e.g. `threadpool.search.size`) and error out if this 2-depth `threadpool.estimated_time_interval` is set. So I don't believe it is possible to set this interval given the current implementation, and we may want to take this opportunity to rename it, e.g. `internal_clock.estimated_time_interval`?
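If the rename were adopted, the configuration would look something like the following. This is a hypothetical `elasticsearch.yml` fragment: the setting name `internal_clock.estimated_time_interval` is the proposal from this issue and does not exist in released versions.

```yaml
# Hypothetical: proposed renamed setting for the estimated-time Counter
# update interval, here lowered to support sub-400ms search timeouts.
internal_clock.estimated_time_interval: 100ms
```

Note that lowering the interval trades timer accuracy against the overhead of waking the clock-updating thread more often, so it should only be reduced as far as the smallest timeout actually in use requires.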