After Upgrade to 6.2.4 Circuitbreaker Exception Data too large #31197
Pinging @elastic/es-core-infra
My cluster has the same issue.
I had this problem too. I work around it by increasing the breaker limit on the fly (see the sketch below). By default it's 70%.
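A minimal sketch of that workaround, assuming the setting being raised is the dynamic cluster setting indices.breaker.total.limit (the 80% value here is only an example, not the exact command from the comment):

  curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
  {
    "transient": {
      "indices.breaker.total.limit": "80%"
    }
  }'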
My cluster is also affected by this bug. The workaround helps only for a few days, then the circuit breaker hits the limit again and I bump it up a few percent more. A rolling restart doesn't help.
Elasticsearch 6.2.0 introduced the accounting circuit breaker to account for segment memory usage: #27116. This means that for installations with a large number of shards, it is expected that this breaker starts tripping where it previously did not. All that we have done here is a better job of accounting for memory, so we can now break in more situations. This is a good thing: it prevents us from going out of memory and completely blowing up. In 7.0.0 we are making further enhancements to the circuit breaker to account for the real memory usage: #33125. I do not see a bug here; I think that Elasticsearch and the circuit breaker are behaving as expected.
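For anyone who wants to see what the accounting breaker is charging for on their own cluster, one way (assuming the node stats API is available) is to look at the segments section of node stats; the memory_in_bytes figures there roughly correspond to the segment memory the accounting breaker reserves:

  curl -X GET "localhost:9200/_nodes/stats/indices/segments?pretty"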
Maybe I don't understand something, but how can a simple request allocate so much memory? curl -X GET 'http://localhost:9200/_cat/health?v' {"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [18213502664/16.9gb], which is larger than the limit of [18193693409/16.9gb]","bytes_wanted":18213502664,"bytes_limit":18193693409}],"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [18213502664/16.9gb], which is larger than the limit of [18193693409/16.9gb]","bytes_wanted":18213502664,"bytes_limit":18193693409},"status":503}
@hrr888 it is not that this particular request allocates so much memory. Circuit breakers are hierarchical, so the "parent" breaker accumulates the reserved memory of the other circuit breakers. It is likely that one of the other circuit breakers (see docs) is already reserving a lot of memory, and this request just pushes it over the limit, so the parent circuit breaker trips due to the total reserved memory usage of all other circuit breakers. You can use the node stats API to see how much memory each breaker has currently reserved (see the sketch below).
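A sketch of that check: the breaker section of the node stats API reports, per breaker, the currently reserved size, the configured limit, and how often it has tripped. Whichever breaker shows an estimated_size close to its limit_size is the one pushing the parent over.

  curl -X GET "localhost:9200/_nodes/stats/breaker?pretty"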
Ok, thanks for the explanation. But then, this error message is somewhat misleading.
@hrr888 We are open to suggestions on how we can improve any aspect of the system. Can you help us understand what would make it better? Starting with the response you posted: the error is a circuit_breaking_exception whose reason begins with [parent], which means it is the parent breaker that tripped, not any individual breaker. The status code on the HTTP response is 503, service unavailable (see #31986).
Since upgrading to 6.4.0 I have no longer seen this issue. 6.4.0 appears to be much more stable.
@jasontedor The explanation you gave makes it all clear. Maybe it's my poor understanding of English. The docs describe the parent breaker as a limit applied across the different breakers. "Across" reads to me as "if any breaker has its own limit greater than the parent's, then the parent has precedence", but it does not imply that the parent is applied to the sum of all allocated memory, as @danielmitterdorfer explained (I found the root cause of the problem. Thanks, @danielmitterdorfer :) ). Maybe the cited sentence should be rephrased or extended.
After upgrading to ES7, I get this error on just about any request (including health status) after a few minutes of monitoring by the Cerebro UI, with no queries or other requests at all. The heap is set to 16GB and it is almost full right after initialisation of the 15 indexes (each having 5 shards). When setting the heap size to the default 1GB, ES does not even finish initialisation. So I have to raise it even higher than 16GB?
Same in my case. I set a cluster up with 3 data, 3 master, and 3 ingest nodes.
Getting the same error with a "toy" example - this is so strange - I would expect Elastic to just "work".
@avloss I'm sorry that you're experiencing trouble, but without details of the issue that you're experiencing it is not actionable for us. We would love to understand the issue that you're facing, and supply a fix if appropriate. Help us help you and the rest of our userbase.
Still the same with 7.1, and I only have a single node. After raising the Java heap to 48GB, ES seems to manage common traffic well. On the other hand, there is one great improvement since ES6: no more crazy garbage collection causing heavy load on all 112 cores of the server every second, even when the server is completely idle...!?! :-)
I have the same issue. I think that the parent memory calculation is wrong in 7.x (it doesn't happen with version 6).
Same issue in 7.1.0; there is an internal server error in /_stats/fielddata.
Same issue in 7.1.1. After reading this thread, I understand that this is the expected behavior of ES. Is there any suggestion to avoid running into this issue again? Or is there any workaround? Thanks! @jasontedor
I got the same issue after upgrading from 6.6.0 to 7.1.0 on a single-node cluster with 3GB of heap. Every third request (these are really small requests with very little data to retrieve), I hit the heap limit and have to wait. But I also have another ES cluster with 3 nodes, a fresh install of 7.1.1, and I never get this problem.
@Theoooooo
Just increase the heap size: since the default heap is 1GB, if your data is big you should set it bigger (see the sketch below).
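For reference, a sketch of how the heap is usually raised, either in config/jvm.options or via the ES_JAVA_OPTS environment variable. The 4g value is only an example; pick a value that fits the machine, ideally no more than about half of its RAM:

  # config/jvm.options
  -Xms4g
  -Xmx4g

  # or, for a one-off start
  ES_JAVA_OPTS="-Xms4g -Xmx4g" ./bin/elasticsearch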
@e-orz What's the GC type? @wuxiangli91 And what if you have limited resources available? That makes no sense, but it's also right at the same time. After testing many different configurations (in terms of resources), the answer to the problem is just to allocate more RAM. But with very small queries that demand very few resources and almost no cross-index searches, it's still a problem right now.
@Theoooooo I meant the garbage collection type. The default is CMS (concurrent mark and sweep) and the newer one is G1GC, which should require fewer tweaks but might only be suited for larger heaps.
@e-orz How do you check which one is used by Java with Elasticsearch? (I assume it's Java ^^)
@Theoooooo you can check the JVM options that Elasticsearch was started with; see the sketch below.
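A sketch of that check, assuming the nodes info API is reachable; it lists the active collectors under jvm.gc_collectors, and the GC flags themselves live in config/jvm.options:

  curl -X GET "localhost:9200/_nodes/jvm?pretty"
  # or: grep -E 'UseConcMarkSweepGC|UseG1GC' config/jvm.options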
@e-orz
I have the same issue.
We just upgraded Elasticsearch and Kibana from 7.6.2 to 7.7 and now Kibana won't start, with the same message.
After working smoothly for more than 10 months, I suddenly started getting the same error in production while doing simple GET queries, i.e. even the simplest Elasticsearch request for cluster metadata. Here is my breaker info:
What would be the better approach to solve this: disabling the circuit breaker or increasing the breaker limit?
When I use this, it shows a "./bin/elasticsearch-env: line 81: /etc/sysconfig/elasticsearch: Permission denied" error.
Elasticsearch version (bin/elasticsearch --version): Version: 6.2.4, Build: ccec39f/2018-04-12T20:37:28.497551Z, JVM: 1.8.0_151
Plugins installed: discovery-file and xpack
JVM version (java -version): java version "1.8.0_151", Java(TM) SE Runtime Environment (build 1.8.0_151-b12), Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
OS version (uname -a if on a Unix-like system): Linux 4.4.0-124-generic #148-Ubuntu SMP Wed May 2 13:00:18 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior: After upgrade and restart of the cluster I get the following message:
[2018-06-08T06:51:56,825][WARN ][o.e.c.a.s.ShardStateAction] [a11-es0] [logstash-bro-2018.05.31][3] received shard failed for shard id [[logstash-bro-2018.05.31][3]], allocation id [dwSADk_rRhSqdRqqlG5_Qw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[logstash-bro-2018.05.31][3]: Recovery failed from {a16-es1}{YRp5l6bMSJKSjVD7_hC8aQ}{8tIAsMsKT_m5h5PP7aDjWg}{192.168.1.16}{192.168.1.16:9301}{box_type=hot} into {a25}{CNJlJBE0TQK2zYzu0Jcsxg}{Vs4C9bEWTtyF7CTOyvnbYA}{192.168.1.25}{192.168.1.25:9300}{box_type=warm}]; nested: RemoteTransportException[[a16-es1][192.168.1.16:9301][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [61] files with total size of [75.1gb]]; nested: RemoteTransportException[[a25][192.168.1.25:9300][internal:index/shard/recovery/file_chunk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [23219239233/21.6gb], which is larger than the limit of [23178077798/21.5gb]]; ]
Steps to reproduce:
Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query etc. The easier you make it for
us to reproduce, the more likely that somebody will take the time to look at it.
1. Upgraded from 6.1.1 to 6.2.4.
2. I have previously restarted the cluster without having this issue.
3. I attempted to increase the heap size to 40GB as a temporary solution to allow the reallocation to occur, but I am still seeing the issue.
Provide logs (if relevant):
[2018-06-08T06:51:56,825][WARN ][o.e.c.a.s.ShardStateAction] [a11-es0] [logstash-bro-2018.05.31][3] received shard failed for shard id [[logstash-bro-2018.05.31][3]], allocation id [dwSADk_rRhSqdRqqlG5_Qw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[logstash-bro-2018.05.31][3]: Recovery failed from {sa16-es1}{YRp5l6bMSJKSjVD7_hC8aQ}{8tIAsMsKT_m5h5PP7aDjWg}{192.168.1.16}{192.168.1.16:9301}{box_type=hot} into {a25}{CNJlJBE0TQK2zYzu0Jcsxg}{Vs4C9bEWTtyF7CTOyvnbYA}{192.168.1.25}{192.168.1.25:9300}{box_type=warm}]; nested: RemoteTransportException[[a16-es1][192.168.1.16:9301][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [61] files with total size of [75.1gb]]; nested: RemoteTransportException[[a25][192.168.1.25:9300][internal:index/shard/recovery/file_chunk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [23219239233/21.6gb], which is larger than the limit of [23178077798/21.5gb]]; ]