
After Upgrade to 6.2.4 Circuitbreaker Exception Data too large #31197

Closed

r32rtb opened this issue Jun 8, 2018 · 32 comments
Labels
>bug :Core/Infra/Circuit Breakers Track estimates of memory consumption to prevent overload v6.2.4

Comments

@r32rtb

r32rtb commented Jun 8, 2018

Elasticsearch version (bin/elasticsearch --version): Version: 6.2.4, Build: ccec39f/2018-04-12T20:37:28.497551Z, JVM: 1.8.0_151

Plugins installed: discovery-file and xpack

JVM version (java -version): java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)

OS version (uname -a if on a Unix-like system): Linux 4.4.0-124-generic #148-Ubuntu SMP Wed May 2 13:00:18 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior: After upgrade and restart of the cluster I get the following message:
[2018-06-08T06:51:56,825][WARN ][o.e.c.a.s.ShardStateAction] [a11-es0] [logstash-bro-2018.05.31][3] received shard failed for shard id [[logstash-bro-2018.05.31][3]], allocation id [dwSADk_rRhSqdRqqlG5_Qw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[logstash-bro-2018.05.31][3]: Recovery failed from {a16-es1}{YRp5l6bMSJKSjVD7_hC8aQ}{8tIAsMsKT_m5h5PP7aDjWg}{192.168.1.16}{192.168.1.16:9301}{box_type=hot} into {a25}{CNJlJBE0TQK2zYzu0Jcsxg}{Vs4C9bEWTtyF7CTOyvnbYA}{192.168.1.25}{192.168.1.25:9300}{box_type=warm}]; nested: RemoteTransportException[[a16-es1][192.168.1.16:9301][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [61] files with total size of [75.1gb]]; nested: RemoteTransportException[[a25][192.168.1.25:9300][internal:index/shard/recovery/file_chunk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [23219239233/21.6gb], which is larger than the limit of [23178077798/21.5gb]]; ]

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query etc. The easier you make it for
us to reproduce, the more likely it is that somebody will take the time to look at it.

1. Upgraded from 6.1.1 to 6.2.4.
2. I have previously restarted the cluster without having this issue.
3. I attempted to increase the heap size to 40GB as a temporary solution to allow the reallocation to occur, but I am still seeing the issue.

Provide logs (if relevant):
[2018-06-08T06:51:56,825][WARN ][o.e.c.a.s.ShardStateAction] [a11-es0] [logstash-bro-2018.05.31][3] received shard failed for shard id [[logstash-bro-2018.05.31][3]], allocation id [dwSADk_rRhSqdRqqlG5_Qw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[logstash-bro-2018.05.31][3]: Recovery failed from {sa16-es1}{YRp5l6bMSJKSjVD7_hC8aQ}{8tIAsMsKT_m5h5PP7aDjWg}{192.168.1.16}{192.168.1.16:9301}{box_type=hot} into {a25}{CNJlJBE0TQK2zYzu0Jcsxg}{Vs4C9bEWTtyF7CTOyvnbYA}{192.168.1.25}{192.168.1.25:9300}{box_type=warm}]; nested: RemoteTransportException[[a16-es1][192.168.1.16:9301][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [61] files with total size of [75.1gb]]; nested: RemoteTransportException[[a25][192.168.1.25:9300][internal:index/shard/recovery/file_chunk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [23219239233/21.6gb], which is larger than the limit of [23178077798/21.5gb]]; ]

@albertzaharovits albertzaharovits added >bug :Core/Infra/Circuit Breakers Track estimates of memory consumption to prevent overload v6.2.4 labels Jun 8, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@adol-ch

adol-ch commented Jun 11, 2018

My cluster has the same issue.

@furkalor

I had this problem too. I worked around it by increasing the breaker limit on the fly:

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
    "transient" : {
        "indices.breaker.total.limit" : "80%"
    }
}
'

By default it's 70%.
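To verify that the transient setting was applied (and to see the defaults it overrides), you can read the cluster settings back; a minimal check, assuming the same localhost:9200 endpoint:

curl -X GET "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty"

Look for indices.breaker.total.limit in the output. Note that this only raises the threshold: if reserved memory keeps growing, the breaker will trip again at the new limit.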

@hrr888

hrr888 commented Aug 29, 2018

My cluster is also affected by this bug. The workaround helps only for a few days; then the circuit breaker hits the limit again and I bump it up a few percent more. A rolling restart doesn't help.

@jasontedor
Member

Elasticsearch 6.2.0 introduced the accounting circuit breaker to account for segment memory usage: #27116. This means that for installations with a large number of shards, it is expected that this breaker starts tripping where it previously did not. All that we have done here is a better job of accounting for memory, so we can now break in more situations. This is a good thing: it prevents us from going out of memory and completely blowing up. In 7.0.0 we are making further enhancements to the circuit breaker to account for the real memory usage: #33125

I do not see a bug here, I think that Elasticsearch and the circuit breaker are behaving as expected.
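For anyone who wants to check how much heap their segments are holding on each node (which is what the accounting breaker tracks), one rough check is the cat nodes API; a sketch, assuming curl access to any node on localhost:9200:

curl -s "localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,segments.count,segments.memory"

If segments.memory is a large fraction of heap.max, the parent breaker has correspondingly less headroom left for requests.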

@hrr888

hrr888 commented Aug 30, 2018

Maybe I don't understand something, but how can such a simple request allocate so much memory?

curl -X GET 'http://localhost:9200/_cat/health?v'

{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [18213502664/16.9gb], which is larger than the limit of [18193693409/16.9gb]","bytes_wanted":18213502664,"bytes_limit":18193693409}],"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [18213502664/16.9gb], which is larger than the limit of [18193693409/16.9gb]","bytes_wanted":18213502664,"bytes_limit":18193693409},"status":503}

@danielmitterdorfer
Member

@hrr888 It is not that this particular request allocates so much memory. Circuit breakers are hierarchical, so the "parent" breaker accumulates the reserved memory of the other circuit breakers. It is likely that one of the other circuit breakers (see docs) is already reserving a lot of memory and this request just pushes it over the limit, so the parent circuit breaker trips due to the total reserved memory usage of all circuit breakers. You can use the node stats API (i.e. GET /_nodes/stats/breaker) to inspect the current circuit breaker status.
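For example, a minimal invocation of that API against a single node (assuming localhost:9200):

curl -s "localhost:9200/_nodes/stats/breaker?pretty"

In the response, the parent entry's estimated_size is roughly the sum of what the child breakers (request, fielddata, in_flight_requests, accounting) have reserved, so whichever child shows the largest estimated_size is usually the one to investigate.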

@hrr888

hrr888 commented Aug 30, 2018

OK, thanks for the explanation. But then this error message is somewhat misleading.

@jasontedor
Member

jasontedor commented Aug 30, 2018

@hrr888 We are open to suggestions on how we can improve any aspect of the system. Can you help us understand what would make it better?

Starting with:

{
  "error": {
    "root_cause": [
      {
        "type": "circuit_breaking_exception",
        "reason": "[parent] Data too large, data for [<http_request>] would be [18213502664/16.9gb], which is larger than the limit of [18193693409/16.9gb]",
        "bytes_wanted": 18213502664,
        "bytes_limit": 18193693409
      }
    ],
    "type": "circuit_breaking_exception",
    "reason": "[parent] Data too large, data for [<http_request>] would be [18213502664/16.9gb], which is larger than the limit of [18193693409/16.9gb]",
    "bytes_wanted": 18213502664,
    "bytes_limit": 18193693409
  },
  "status": 503
}

There was an error, its root_cause is a circuit_breaking_exception and the reason is because
[parent] Data too large, data for [<http_request>] would be [18213502664/16.9gb], which is larger than the limit of [18193693409/16.9gb].

This means it is the parent circuit breaker that tripped, and it is an HTTP request that tripped it; if we accepted the HTTP request, then the breaker would be at 18213502664 bytes (bytes_wanted), which is larger than the limit of 18193693409 bytes (bytes_limit). Therefore, the circuit breaker trips and the request is rejected.

The status code on the HTTP response is 503, service unavailable (see #31986).

Finally, root_cause and the top-level exception are the same (they would not be in the case of, say, a remote exception where on the remote side the cause was, for example, an illegal argument exception).

@r32rtb
Author

r32rtb commented Aug 31, 2018

Since upgrading to 6.4.0, I have not seen this issue again. 6.4.0 appears to be much more stable.

@hrr888

hrr888 commented Aug 31, 2018

@jasontedor The explanation you gave makes it all clear. Maybe it's my poor understanding of English. The docs say:

Each breaker specifies a limit for how much memory it can use. Additionally, there is a parent-level breaker that specifies the total amount of memory that can be used across all breakers.

I read "across" as "if any breaker has its own limit greater than the parent's, then the parent takes precedence", but it does not convey that the parent applies to the sum of all reserved memory, as @danielmitterdorfer explained (I found the root cause of my problem. Thanks, @danielmitterdorfer :) ).

Maybe the cited sentence should be rephrased or extended.

@wanthalf

After upgrading to ES7, I get this error for just about any request (including health status) after a few minutes of monitoring by the cerebro UI, with no queries or other requests at all. The heap is set to 16GB and it gets almost full right after initialisation of the 15 indexes (each having 5 shards). When I set the heap size to the default 1GB, ES does not even finish initialisation. So do I have to raise it even higher than 16GB?

@virtuman

Same in my case. I set up a cluster with 3 data, 3 master, and 3 ingest nodes.

@avloss

avloss commented May 29, 2019

Getting same error with a "toy" example - this is so strange - I would expect Elastic to just "work".

@jasontedor
Member

@avloss I'm sorry that you're experiencing trouble, but without details of the issue that you're experiencing it is not actionable for us. We would love to understand the issue that you're facing, and supply a fix if appropriate. Help us help you and the rest of our userbase.

@wanthalf

wanthalf commented May 29, 2019

Still the same with 7.1, and I only have a single node. After raising the Java heap to 48GB, ES seems to handle common traffic well.

On the other hand, there is one great improvement since ES6: no more crazy garbage collection causing heavy load on all 112 cores of the server every second, even when the server is completely idle...!?! :-)

@e-orz

e-orz commented May 30, 2019

I have the same issue. I think the parent memory calculation is wrong in 7.x (it doesn't happen with version 6); see this discussion:
https://discuss.elastic.co/t/parent-circuit-breaker-calculation-seems-to-wrong-with-version-7-x/183530
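For context: in 7.x the parent breaker is based on real heap usage by default (the static setting indices.breaker.total.use_real_memory), which is why it behaves differently from 6.x. If you want to compare against the 6.x behavior, that setting can be turned off; a sketch, not a recommendation:

# elasticsearch.yml (static setting, requires a node restart)
indices.breaker.total.use_real_memory: false

With real-memory accounting off, the parent limit falls back to the 6.x default of 70% of the heap instead of 95%.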

@hackerwin7

Same issue in 7.1.0; there is an internal server error in /_stats/fielddata:

"failures":[  
         {  
            "shard":6,
            "index":"ad_stat_sum_extend-2019.05.08",
            "status":"INTERNAL_SERVER_ERROR",
            "reason":{  
               "type":"failed_node_exception",
               "reason":"Failed node [PMeDcHfMS4ilMZxQbrSt1w]",
               "node_id":"PMeDcHfMS4ilMZxQbrSt1w",
               "caused_by":{  
                  "type":"circuit_breaking_exception",
                  "reason":"[parent] Data too large, data for [<transport_request>] would be [26445028500/24.6gb], which is larger than the limit of [25769803776/24gb], real usage: [26445014040/24.6gb], new bytes reserved: [14460/14.1kb]",
                  "bytes_wanted":26445028500,
                  "bytes_limit":25769803776,
                  "durability":"PERMANENT"
               }
            }
         },

@blueabysm

Same issue in 7.1.1.
I am new to Elasticsearch and use the ES cluster as a log center. Yesterday this issue made ES lose logs, and many related alerts were triggered. This morning I deleted indices from 5 days ago (my cluster has only been running for a week), and the ES cluster finally receives logs normally again.

After reading this thread, I understand that this is the expected behavior of ES. Is there any suggestion to avoid running into this issue again? Or is there any workaround? Thanks! @jasontedor

@Theoooooo

Theoooooo commented Jun 13, 2019

I got the same issue after upgrading from 6.6.0 to 7.1.0 on a single-node cluster with 3GB of heap. Every 3 requests (these are really small requests with very little data to retrieve), I hit the heap limit and have to wait.

But I also have another ES cluster with 3 nodes, a fresh install of 7.1.1, and I have never had this problem there.
It seems this issue appeared with the upgrade I did.

@e-orz

e-orz commented Jun 16, 2019

@Theoooooo In my case it happens with a fresh install.
Which GC type are you using, CMS or G1GC?

@wuxiangli91

Just use:
ES_JAVA_OPTS="-Xms10g -Xmx10g" ./bin/elasticsearch

Since the default heap is 1G, you should set it higher if your data is big.
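The same can be set persistently in config/jvm.options instead of on the command line; a sketch, with the usual caveat to keep Xms equal to Xmx and to pick a size that fits your machine:

# config/jvm.options
-Xms10g
-Xmx10g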

@Theoooooo

Theoooooo commented Jun 18, 2019

@e-orz What's a GC type?

@wuxiangli91 And what if you have limited resources available? That makes no sense, but it's also right at the same time.

After testing many different configurations (in terms of resources), the answer to the problem is just to allocate more RAM. But with very small queries that demand very few resources and almost no cross-index searches, it's still a problem right now.

@e-orz

e-orz commented Jun 19, 2019

@Theoooooo I meant the garbage collection type. The default is CMS (concurrent mark and sweep); the newer one is G1GC, which should require fewer tweaks but might only be suited for larger heaps.
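For reference, the relevant lines live in config/jvm.options; the stock file enables CMS, and switching to G1GC means swapping those flags. A sketch of just the GC flags, assuming the defaults shipped with 6.x/7.x (the exact lines differ between versions):

# config/jvm.options -- default collector (CMS)
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

# alternative collector (comment out the CMS lines above, then enable):
#-XX:+UseG1GC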

@Theoooooo

@e-orz How do you check which GC Java is using with Elasticsearch? (I assume it's Java ^^)

@e-orz

e-orz commented Jun 20, 2019

@Theoooooo you can check the jvm.options file.

@Theoooooo

Theoooooo commented Jun 21, 2019

@e-orz
[screenshot of jvm.options omitted]
It's the same content as in the default jvm.options generated by Elasticsearch at installation.

@32328254

I have the same issue.
I tested both the CMS GC and G1GC, but there are still the same problems.

@cawoodm
Contributor

cawoodm commented May 15, 2020

We just upgraded Elasticsearch and Kibana from 7.6.2 to 7.7 and now Kibana won't start, with the same message:

FATAL [circuit_breaking_exception] [parent] Data too large, data for [<http_request>] would be [987270888/941.5mb]...

@raghuchahar007

After working smoothly for more than 10 months, I suddenly started getting the same error in production while doing simple queries.

GET: http://localhost:8080/?pretty
Output:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "circuit_breaking_exception",
        "reason" : "[parent] Data too large, data for [<http_request>] would be [745522124/710.9mb], which is larger than the limit of [745517875/710.9mb]",
        "bytes_wanted" : 745522124,
        "bytes_limit" : 745517875
      }
    ],
    "type" : "circuit_breaking_exception",
    "reason" : "[parent] Data too large, data for [<http_request>] would be [745522124/710.9mb], which is larger than the limit of [745517875/710.9mb]",
    "bytes_wanted" : 745522124,
    "bytes_limit" : 745517875
  },
  "status" : 503
}

This happens with the simplest Elasticsearch query, i.e. just fetching the Elasticsearch metadata.

Here is my breaker info:

{
  "breakers" : {
    "request" : {
      "limit_size_in_bytes" : 639015321,
      "limit_size" : "609.4mb",
      "estimated_size_in_bytes" : 0,
      "estimated_size" : "0b",
      "overhead" : 1.0,
      "tripped" : 0
    },
    "fielddata" : {
      "limit_size_in_bytes" : 639015321,
      "limit_size" : "609.4mb",
      "estimated_size_in_bytes" : 406826332,
      "estimated_size" : "387.9mb",
      "overhead" : 1.03,
      "tripped" : 0
    },
    "in_flight_requests" : {
      "limit_size_in_bytes" : 1065025536,
      "limit_size" : "1015.6mb",
      "estimated_size_in_bytes" : 560,
      "estimated_size" : "560b",
      "overhead" : 1.0,
      "tripped" : 0
    },
    "accounting" : {
      "limit_size_in_bytes" : 1065025536,
      "limit_size" : "1015.6mb",
      "estimated_size_in_bytes" : 146387859,
      "estimated_size" : "139.6mb",
      "overhead" : 1.0,
      "tripped" : 0
    },
    "parent" : {
      "limit_size_in_bytes" : 745517875,
      "limit_size" : "710.9mb",
      "estimated_size_in_bytes" : 553214751,
      "estimated_size" : "527.5mb",
      "overhead" : 1.0,
      "tripped" : 0
    }
  }
}

What would be the better approach to solve this: disabling the circuit breaker or increasing the breaker limit?
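In the breaker dump above, fielddata (387.9mb) plus the accounting breaker (139.6mb) account for most of the parent's 527.5mb estimate against a 710.9mb limit, so even a small request can push it over. Rather than disabling the breaker, the usual options are a bigger heap, reducing fielddata usage, or clearing the fielddata cache as a stop-gap; a sketch, assuming the node is on localhost:9200:

curl -X POST "localhost:9200/_cache/clear?fielddata=true&pretty"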

@senvardarsemih

ES_JAVA_OPTS="-Xms10g -Xmx10g" ./bin/elasticsearch

When I use this, it shows a "./bin/elasticsearch-env: line 81: /etc/sysconfig/elasticsearch: Permission denied" error.
