A new generic timeout handler and changes to existing search classes to use it #4586
Conversation
synchronized (timeLimitedThreads) {
    // find out which is the next candidate for failure
    // TODO Not the fastest conceivable data structure to achieve this.
Why not one of the priority queue style data structures?
Happy to consider any concrete recommendations.
I need: 1) fast random lookups on an object key (Thread), 2) weak references on keys, and 3) sorted iteration over values (timestamps).
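For context, those three requirements could be sketched (hypothetical class and method names, not the PR's actual code) by pairing a weak-keyed map for lookups with a sorted view built at sweep time; a priority queue alone cannot give both weak keys and O(1) lookup by Thread without extra bookkeeping:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.WeakHashMap;

// Hypothetical sketch: WeakHashMap gives fast lookup by Thread with weak keys
// (requirements 1 and 2); sorted iteration over deadlines (requirement 3) is
// obtained by copying entries into a TreeMap at each sweep.
class TimeLimitedThreads {
    private final Map<Thread, Long> deadlines = new WeakHashMap<>();

    synchronized void register(Thread t, long deadlineMillis) {
        deadlines.put(t, deadlineMillis);
    }

    synchronized Long deadlineOf(Thread t) {  // requirement 1: fast key lookup
        return deadlines.get(t);
    }

    // Requirement 3: find the thread with the earliest deadline, i.e. the
    // next candidate for failure. Note duplicate deadlines would collide in
    // this simplified sketch.
    synchronized Thread nextCandidate() {
        TreeMap<Long, Thread> sorted = new TreeMap<>();
        for (Map.Entry<Thread, Long> e : deadlines.entrySet()) {
            sorted.put(e.getValue(), e.getKey());
        }
        return sorted.isEmpty() ? null : sorted.firstEntry().getValue();
    }
}
```

The per-sweep copy is what the TODO in the diff is pointing at: it is simple and correct, but not the fastest conceivable structure.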
I think if we use message passing here and make all writes private to the timeout thread, we can just use multiple data structures and priority queues. I really think there should be only a single thread modifying all data structures here, with that thread pulling commands from a queue, also in a single-threaded fashion, and sending results back via callbacks. This goes together with my suggestion above related to ThreadTimeoutState.
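One way to read that single-writer suggestion is sketched below (all names hypothetical, not from the PR): callers enqueue commands from any thread, but only the monitor thread applies them, so the priority queue itself needs no synchronization.

```java
import java.util.PriorityQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the message-passing design: all mutation of timeout
// state happens on one monitor thread draining a command queue.
class TimeoutMonitor {
    interface Command { void apply(PriorityQueue<long[]> deadlines); }

    private final BlockingQueue<Command> commands = new LinkedBlockingQueue<>();
    // Each entry: {deadlineMillis, threadId}; ordered soonest-deadline-first.
    private final PriorityQueue<long[]> deadlines =
            new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));

    void submit(Command c) { commands.add(c); }  // safe from any thread

    // Monitor thread only: take one command and apply it; this is the sole
    // code path that ever touches the priority queue.
    void drainOne() {
        try {
            commands.take().apply(deadlines);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Monitor thread only: earliest deadline, i.e. next candidate for failure.
    long nextDeadline() {
        long[] head = deadlines.peek();
        return head == null ? Long.MAX_VALUE : head[0];
    }
}
```

In a real implementation the monitor thread would loop on `drainOne()` and also time out waiting so it can check `nextDeadline()` against the clock.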
I just found a weird runaway highlighting process: a request that took 57 minutes. So +1 for this.
@nik9000 I don't think we can prevent this unless we fix the highlighter ;/
Even though we've got a fix for the highlighter, I still think it'd be worth checking the timeout during the highlight process in case other bugs like this come up. No reason it has to be part of this pull request though.
@markharwood what's the status on this one?
Need to gather confidence about impl performance via benchmarking - following that we should consider where to apply these timeout checks across the platform.
Big +1 for this!
Can the benchmark include doc values access as well? Because the change wraps, but doesn't override, randomAccessSlice, I'm afraid it would effectively undo many performance improvements from Lucene 4.9.
By wrapping all IndexInputs we also lose a good deal of NIOFS optimizations (e.g. reduced bounds checks for readLong/readVInt and so on).
@rmuir Thanks for taking a look. I wrapped the RandomAccessInputs and was re-running benchmarks with doc values. I got side-tracked, though, because the benchmarks revealed a stack overflow issue. It took a while to track down, but the misbehaving query is this one with multiple scripts: https://gist.github.com/markharwood/0747e741b6fed9bbb32b#file-stack-trace
I talked to @rmuir about the query killer and his concerns are as follows (at least as I understand them):
Who needs this feature? Really, it's for the user who is running Elasticsearch as a service and who doesn't have control over the types of queries run against the cluster. The query killer is intended as a safety net. However, enabling this feature by default means that the user who does have control over their queries (or who doesn't want to use timeouts) will pay the cost of poorer performance regardless. The query killer will be useful functionality for a subset of users, but it should be disabled by default and only enabled by those who need it.

This raises the question of how to implement it. It could be a config setting that is applied only at startup, or it could be a per-index setting (via an alternate directory implementation) that can be enabled only on specific indices. Changing the setting would probably require closing and reopening the index. The latter would be my preference. Thoughts?

I would be really happy with any implementation. I like the per-index idea too. If I can just add the setting in the mapping template, that's an easy win. As for who needs it: yesterday one of my users ran a simple query against the

@avleen agreed - for those who need it (like yourself) this functionality would be very, very helpful. Just as a side note: you know that you can disable the loading of field data on certain fields? That would at least help you to prevent problems like the one you experienced:
…ffectively time-limiting search requests. Special runtime exceptions can now short-cut the execution of calls to Lucene and are caught and reported back, not as a fatal error but the existing “timedOut” flag in results. Phases like the FetchPhase can now exit early and so also have a timed-out status. The SearchPhaseController does its best to assemble whatever hits, aggregations and facets have been produced within the provided time limits rather than returning nothing and throwing an error. ActivityTimeMonitor is the new central class for efficiently monitoring all forms of thread overrun in a JVM. The SearchContext setup is modified to register the start and end of query tasks with ActivityTimeMonitor. Store.java is modified to add timeout checks (via calls to ATM) in the low-level file access routines by using a delegating wrapper for Lucene's IndexInput. ContextIndexSearcher is modified to catch and unwrap ActivityTimedOutExceptions that can now come out of the Lucene or script calls and report them as timeouts along with any partial results. FetchPhase is similarly modified to deal with the possibility of timeout errors.
…nt in disk accesses
Coding something up with this new index setting (it has to be set on a closed index). Default is false.
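If it lands as a per-index setting that must be set on a closed index, usage might look something like the following. The setting name `index.search.timeout_checks` is purely illustrative; the PR does not state the final name. The close/settings/open APIs themselves are standard Elasticsearch endpoints.

```
POST /my_index/_close

PUT /my_index/_settings
{
  "index.search.timeout_checks": true
}

POST /my_index/_open
```

This matches the discussion above: the cost of the low-level timeout checks is only paid on indices where the setting is explicitly enabled.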
…ing requests “thorough” timeout checking. Avoids paying the performance penalty of timeout checks if not required
…o doesn’t trigger the Lucene disk accesses that contain timeout checks
Thanks Clinton! I'm turning that on :-)
Is there a target release for this? Or is there any way I could help? This is one of my all-time desired features. Has anyone given this a whirl in production?
Closing based on this comment: #9168 (comment)
A more effective approach to time-limiting activities such as search requests. Special runtime exceptions can now short-cut the execution of long-running calls to Lucene classes and are caught and reported back, not as a fatal error but using the existing “timedOut” flag in results.
Phases like the FetchPhase can now exit early and so also have a timed-out status. The SearchPhaseController does its best to assemble whatever hits, aggregations and facets have been produced within the provided time limits rather than returning nothing and throwing an error.
ActivityTimeMonitor is the new central class for efficiently monitoring all forms of thread overrun in a JVM.
The SearchContext setup is modified to register the start and end of query tasks with ActivityTimeMonitor.
Store.java is modified to add timeout checks (via calls to ATM) in the low-level file access routines by using a delegating wrapper for Lucene's IndexInput and IndexInputSlicer.
ContextIndexSearcher is modified to catch and unwrap ActivityTimedOutExceptions that can now come out of the Lucene calls and report them as timeouts along with any partial results.
FetchPhase is similarly modified to deal with the possibility of timeout errors.
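The delegating-wrapper idea in the description can be illustrated without Lucene on the classpath by standing in a minimal read interface. Everything below is a simplified sketch: `DataInput` stands in for Lucene's `IndexInput`, and this `ActivityTimeMonitor` is a toy stand-in for the PR's class, not its actual implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for Lucene's IndexInput: a single low-level read method.
interface DataInput {
    byte readByte();
}

class ActivityTimedOutException extends RuntimeException {}

// Toy stand-in for ActivityTimeMonitor: tracks a per-thread deadline and
// throws from the hot read path once the deadline has passed.
class ActivityTimeMonitor {
    private static final Map<Thread, Long> deadlines = new ConcurrentHashMap<>();

    static void start(long timeoutMillis) {
        deadlines.put(Thread.currentThread(), System.currentTimeMillis() + timeoutMillis);
    }

    static void stop() {
        deadlines.remove(Thread.currentThread());
    }

    // Called from low-level file access routines; cheap when no deadline is set.
    static void checkForTimeout() {
        Long deadline = deadlines.get(Thread.currentThread());
        if (deadline != null && System.currentTimeMillis() > deadline) {
            throw new ActivityTimedOutException();
        }
    }
}

// Delegating wrapper, analogous to wrapping IndexInput in Store.java:
// every read first checks the monitor, then forwards to the real input.
class TimeLimitedDataInput implements DataInput {
    private final DataInput delegate;

    TimeLimitedDataInput(DataInput delegate) {
        this.delegate = delegate;
    }

    @Override
    public byte readByte() {
        ActivityTimeMonitor.checkForTimeout();  // may throw ActivityTimedOutException
        return delegate.readByte();
    }
}
```

A caller such as ContextIndexSearcher would then catch `ActivityTimedOutException`, set the timed-out flag, and return whatever partial results were assembled, rather than propagating a fatal error.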