[o.e.b.ElasticsearchUncaughtExceptionHandler] [node01-es-dev] fatal error in thread [elasticsearch[node01-es-dev][search][T#6]], exiting java.lang.OutOfMemoryError: Java heap space #26525
I believe this is related to #26012.
This is the Eclipse MAT report on the heap usage when Elasticsearch crashed. Yes, it looks like a dupe of #26012.
I changed the "min_doc_count" to avoid empty bucks in date histogram. I still run into the OOM problem. So I am actually not sure if solution discussed in #26012 would solve this problem. @colings86 @jimczi, could you comment? Note I also tried out the "global_ordinals_hash" mentioned in #24359 and it didn't help, either GET nusights_metric_cfs_collector_memcpu_stats_2017_*/_search |
I noticed this cardinality aggregator memory usage model in the current release.
If this is the case, the memory estimate for the default 'precision_threshold' is 30000 * 8 ~= 240KB. This is way bigger than the 5KB default weight in AggregatorBase. Does it mean we need to tweak the circuit breaker adjustment in CardinalityAggregator as well?
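For reference, precision_threshold is set per cardinality aggregation and the per-bucket memory is roughly precision_threshold * 8 bytes. A sketch of setting it explicitly to the 3000 default mentioned later in the thread (the node field name is an assumption):

```
GET nusights_metric_cfs_collector_memcpu_stats_2017_*/_search
{
  "size": 0,
  "aggs": {
    "distinct_nodes": {
      "cardinality": {
        "field": "node",
        "precision_threshold": 3000
      }
    }
  }
}
```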
Also feel free to close this issue if you think it is a dup. I went through some of the initial issue reports on GitHub and the Elastic discussion forums before opening this ticket. I saw something similar but am not sure they are exactly the same.
I believe this particular issue has to do with top-level histogram aggregations (of course I could be mistaken).
@hexinw could you run the following request and paste the output on this issue?
I don't think this is related to #26012. The circuit breaker for requests (…)
The default value for the …
@colings86 Yes, there are more than 500 distinct nodes in the indices. The output of the node cardinality aggregation query: {
@jimczi Yes, I have a pretty small single-node Elasticsearch cluster with a small memory footprint (8G memory and 4G heap for ES), just for experimenting with the date histogram and cardinality aggregation stuff, not for production yet. I was just a bit unprepared to realize how much memory the aggregations can consume. Sorry for the typo; with 3000 precision, 24KB per bucket is not bad to me. I think reducing indices.breaker.request.limit should help the circuit breaker to kick in. Let me confirm.
I changed indices.breaker.request.limit to 30% and it triggered the circuit breaker. {
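For anyone following along, indices.breaker.request.limit is a dynamic cluster setting, so the change described above can be applied without a restart, e.g.:

```
PUT _cluster/settings
{
  "transient": {
    "indices.breaker.request.limit": "30%"
  }
}
```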
I enabled the trace log to understand how the request circuit breaker does its heap usage estimation. The OOM seems to happen well after the heap usage estimate has gone back down to 0gb. The heap estimate from the circuit breaker reaches its peak of 3.9gb around 10:49:11 and then goes back down. Does it mean some other circuit breaker logic is required to catch this case?
[2017-09-14T10:49:11,391][TRACE][o.e.i.b.request ] [request] Adding [568b][<reused_arrays>] to used bytes [new used: [3.9gb], limit: 5112122572 [4.7gb], estimate: 4188695184 [3.9gb]]
Aggregations are performed in two phases. The first phase runs the aggregation on all shards and the second phase reduces the shard responses to a single result. The circuit breaker checks the memory during the first phase but it is not used for the reduce phase on the coordinating node. In your example the node ran out of memory during the reduce phase, where memory is not checked.
What is the reason that we don't put a circuit breaker in the reduce phase? The batched_reduce_size parameter is pretty coarse, as different data sets and query types can allow different numbers of shards to be reduced at the same time.
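For reference, batched_reduce_size is a per-request parameter; a sketch of lowering it so the coordinating node reduces fewer shard results at once (64 is just an arbitrary example value):

```
GET nusights_metric_cfs_collector_memcpu_stats_2017_*/_search?batched_reduce_size=64
```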
Do you guys see any issue with the following code change to catch the OOM in the reduce phase?

git diff core/src/main/java/org/elasticsearch/action/search/FetchSearchPhase.java
Now OutOfMemoryError is caught without causing the ES node to go down.
Sorry I am late to this. Can you elaborate on the JVM statement? Do you mean that GC is slow in recycling the memory? Coming from the C/C++ world, I'd think it normal for an application to handle memory allocation failure. Is there any doc or discussion thread that talks about the Elasticsearch memory management model in general? Just to recap, I ran into two OOM problems in my test.
Both OOMs are bad, as they bring down the ES node and leave the ES cluster not quite usable. To me it would be better to fail the query rather than crash the ES node, whether by calculating the memory requirement upfront or by reacting to the OOM dynamically.
An out of memory exception can be thrown on any thread, whether executing:
We have zero guarantees that a thread suffering an out of memory exception can gracefully clean up any shared data structures it was in the middle of manipulating; there might not even be enough memory to gracefully carry out such cleanup. With no guarantees, we have to assume that the state of the JVM is suspect at this point. It gets worse. Imagine threads waiting on an object monitor or a lock held by a thread that suffers an out of memory exception. The offending thread might not be able to notify the monitor or unlock these threads, leaving them deadlocked. Trying to deal with this comes at the cost of significantly more complicated code. In fact, even if you tried to employ a strategy for dealing with this, because of the above remarks on shared data structures, there's no guarantee that this can be done safely; we cannot unblock a waiting thread and let it start messing with a shared data structure when it's not in a good state. Similar reasoning applies to other errors (stack overflow, etc.). In such a state, the JVM simply must die with dignity instead of limping along in an unknown, unreliable state.
@jasontedor I agree with you on the OOM decision. Is there any ongoing discussion about improving the pre-estimation of memory consumption (I understand it is still a best effort) and failing the query request rather than letting the OOM kick in and bring down the ES node? Also, with respect to the circuit breaker setting, do you see any value in making the static limit a dynamic setting, so the limit is decided by how much heap is available at the time?
We're definitely always interested in improving the pre-estimation of memory consumption, and adding more circuit breakers as needed.
I think this would be too hard to debug since it would vary widely whether the node could handle the request. It's also impossible to know whether a heap at 79% usage could easily be GC'd to 20% usage, or whether all the objects are live and cannot be GC'd. I'm in favor of a static (albeit configurable) limit on breakers.
Thanks @dakrone. What is the reason we don't have a circuit breaker in the reduce phase currently? Actually, why is the reduce phase taking more memory?
It simply hasn't been added, I don't think there's a reason anyone purposely didn't add one there.
The reduce phase usually does not take more memory, which is likely the reason why a circuit breaker hasn't been proposed for it before now. I haven't seen any other instances where a node was getting an OOME during the reduce phase.
The stack trace I posted 11 days back is an OOM in the reduce phase.
Elasticsearch version (bin/elasticsearch --version): Version: 5.4.2, Build: 929b078/2017-06-15T02:29:28.122Z, JVM: 1.8.0_131

Plugins installed: []

JVM version (java -version):
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-b12)
OpenJDK 64-Bit Server VM (build 25.131-b12, mixed mode)

OS version (uname -a if on a Unix-like system): Linux dev-v1.0-akshay-es-1 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
The circuit breaker exception fails to guard against some aggregation searches, which causes an Elasticsearch out-of-memory exception and brings down the Elasticsearch node.
Stack trace of the out of memory exception:
Steps to reproduce:
Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query etc. The easier you make it for
us to reproduce, the more likely that somebody will take the time to look at it.
Provide logs (if relevant):
Query DSL that brings down the Elasticsearch node.