ES java very high CPU usage #4288
Hey there, checking out your hot threads output, the last third of it is the important part: as you can clearly see, a lot of queries are eating up your CPU. Do you run some complex queries, including fuzzy queries? Does your system act differently if you do not query it at all? And a last question to get a better overview: what is your setup (number of nodes, etc.)? It looks like the cluster state might not be up to date on the node where the client is trying to rejoin (due to load), but this is just a wild assumption, not backed by facts.
I agree with @spinscale, this seems very much like you are firing some queries against Elasticsearch and they are causing the load. Can you identify whether that is abnormal?
I suspected the same, so I dropped all traffic to ES and the load came down right away; when I enabled the traffic again, the load jumped back up. My users query ES via the Kibana interface. I am not familiar with the fuzzy query concept; ES is totally new to me. I am running just one node with two shards, and I am about to add another node. Any suggestions as to how I should profile my ES setup in order to size new hardware?
Hey, there is a slow search log mechanism in Elasticsearch, see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-slowlog.html#search-slow-log. Furthermore, you should talk to the Kibana users and see what kind of queries they execute (almost every Kibana widget can reveal the query being executed). Also make sure that they are not reloading their data in Kibana too often; maybe that already helps a bit (without getting to the cause of the problem, it may be just one query, so you would have to try the queries out one by one). There is no way to give hardware recommendations without knowing data, configuration, mapping, documents, query load, index load, etc. Adding more nodes and replicating the same data to them will most likely at least spread the load.
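For reference, a minimal sketch of enabling the search slow log on a single index through the settings API; the index name and thresholds below are made up for illustration:

```bash
# Hypothetical index name and thresholds; adjust to your own setup.
curl -XPUT 'http://localhost:9200/logstash-2013.11.28/_settings' -d '{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}'
```

Queries that exceed a threshold are then written to the index search slowlog file, which makes it easier to trace which Kibana widget is responsible.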
I have the same problem with high CPU usage. Here are some tips: set index.number_of_shards: 1. With one index of 185k docs, my CPU load is 2.5-5% for the ES Java process. Also, plugins cause a HUGE performance reduction. I think the yml config is the solution. P.S. I also have a server (Ubuntu 12.04, OpenJDK 7, 2 Xeon cores at 2.4 GHz and 2 GB RAM) with the default config and with plugins (marvel and analysis-morphology) which works fine with 1,500,000 docs in an index.
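As an illustration of the shard-count tip above, the setting can also be applied per index at creation time; a sketch with a hypothetical index name:

```bash
# Hypothetical index "myindex": one primary shard, no replicas.
curl -XPUT 'http://localhost:9200/myindex' -d '{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}'
```

Note that the number of shards cannot be changed after the index is created; only the number of replicas can.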
Don't forget, if you installed Marvel, to delete all of its indices after you remove it, since it doesn't do that itself.
@gentunian did you have the errors while the Marvel UI was open in a browser, or also without it? It would be great to get the output of http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-hot-threads.html#cluster-nodes-hot-threads to see what the node is doing.
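For anyone following along, a sketch of pulling that output; the host, thread count, and interval are placeholders:

```bash
# Capture the hottest threads on every node; tweak threads/interval as needed.
curl 'http://localhost:9200/_nodes/hot_threads?threads=10&interval=500ms'
```

If search threads dominate the output, the slow query log mentioned earlier is the next place to look.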
@bleskes in both cases I ran out of heap space. I could see in
I had a lot of shards, maybe more than 500 (Logstash defaults to 5 shards for every index and I have more than 100 indices). In the Marvel stats, the Marvel indices were at the top of each column (index rate, search query, etc.; sorry if the names are wrong, I'm typing this from memory). All of this is in QA. Tomorrow, after figuring out the CPU usage, I'm planning a production deployment.
@gentunian I can think of a couple of things that may be going on. Maybe open a new issue with your details, as your primary concern is memory and not CPU?
Hi @ACV2 This issue is closed. I suggest you follow the advice given above (i.e. check the hot threads output and enable the slow query log), and see what happens at the time of day when your cluster goes wild. If the answer isn't immediately obvious, I'd suggest asking about it in the forum: discuss.elastic.co
@ACV2 Did you learn anything since your post about this? We're seeing more or less the same problem. Sometimes daily, sometimes only once a week (it's that random), one of our ES hosts decides to go crazy. We usually restart the instance when we see this because it affects our end users' response times, but it's starting to be a problem. In our situation, these hosts aren't idle; they're trucking along at 25-30% CPU and then BAM, 70-100%. Sometimes they correct themselves after 24-48 hours, other times they don't. I was just curious if you learned anything that could help point us in the right direction. I'll see about checking the hot threads output and slow logs.
Hey there, @travis. Actually it was a journey... Elasticsearch is everything but human friendly. After this post I checked every single configuration trying to optimize the performance and avoid this behavior, without success. I reduced the query time from 12 seconds to 2-3 seconds over a period of 10-15 minutes, which is still huge and unacceptable, but we are working on optimizing our back end to stabilize the performance. That being said, my findings:
- the merge process
Can you post your server configuration so I can help you a little bit more? :) Regards
Hi @ACV2 Thanks for the reply. We noticed this was usually tied to when we ran some big re-indexes, when the indices were doing a lot of merging. During the re-indexes we still have a full request load. At the end of the re-index I added a task to run an optimize on the index and, lo and behold, we haven't seen this problem crop up in the past ~12 days. Forcing an optimize to fix the runaway CPU problem doesn't feel like an actual fix, but it does alleviate the issue for us. ES seems to get stuck in some kind of a weird state after these re-indexes, and the optimize seems to kick it out of that state.
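For context, a sketch of the kind of post-re-index optimize call described above; the index name and segment count are illustrative:

```bash
# Force-merge the freshly re-indexed index down to a single segment.
# (_optimize was later renamed _forcemerge in Elasticsearch 2.1+.)
curl -XPOST 'http://localhost:9200/myindex/_optimize?max_num_segments=1'
```

The call is expensive in I/O and CPU while it runs, so it is best scheduled outside peak hours.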
I had the same issue, high CPU and high memory; for me it was Marvel creating a big new index every day.
Yes, the daily optimize reduced our CPU by a similar amount (~60% ➞ ~10%). We still run our daily optimize task and haven't seen this issue since enabling it at the beginning of June.
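If it helps anyone, a sketch of what such a daily task could look like as a cron entry; the host, index naming scheme, and schedule are assumptions:

```bash
# Hypothetical crontab entry: optimize yesterday's logstash index at 02:00.
# Note the escaped % signs, which cron would otherwise treat as line breaks.
0 2 * * * curl -s -XPOST "http://localhost:9200/logstash-$(date -d yesterday +\%Y.\%m.\%d)/_optimize?max_num_segments=1" >/dev/null
```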
Has anyone tried upgrading to ES 2.x? Does it fix this CPU issue?
Hi @Alino, for us the problem was actually the fact that we were using G1GC. As soon as we switched to CMS, the problem went away (our ES cluster has been up for 51 days now!). What I mentioned in my last post ^ was still true: it greatly reduced how often this happened, but the runaway CPU problems continued to occur every ~7 days or so. There's a long and interesting thread of some other users and me discussing this on the Elastic discussion board here: https://discuss.elastic.co/t/indexing-performance-degrading-over-time/40229
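For anyone wondering how to make the same switch, a sketch of the JVM flags involved; where you set them depends on your ES version (e.g. ES_JAVA_OPTS, bin/elasticsearch.in.sh, or jvm.options on 5.x+), so treat the exact placement as an assumption about your setup:

```bash
# Remove any -XX:+UseG1GC flag and start the node with CMS
# (CMS was the Elasticsearch default collector at the time).
export ES_JAVA_OPTS="-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly"
./bin/elasticsearch
```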
@travisbell Hi Travis, I've been seeing very similar patterns. Everything runs perfectly for a full week, then all of a sudden we get really high CPU usage and Elasticsearch is unresponsive. Most operations fail. Can you expand a bit on what you did to fix this issue? How do you switch from G1GC to CMS? I have a feeling that this is caused by a node running out of Java heap space, but I would imagine that would lead to a crash, not 100% CPU, and I don't see OutOfMemory errors in the logs, so I'm not sure.
I see the following in the logs, right before I get 100% CPU, which is why I suspect it's an out of memory issue.
(formatted for easier reading)
Not quite, but very close.
This indicates a completely ineffective garbage collection cycle lasting for over a minute. The JVM is consuming CPU trying to collect when collecting is ineffective. You'll likely hit the GC overhead limit soon, at which point you will see OutOfMemoryError: GC overhead limit exceeded. It's best to open a new post on the Elastic Discourse forum.
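One quick way to confirm this pattern before opening a forum post is to watch heap usage and collection counts from the nodes stats API (the host below is a placeholder):

```bash
# Per-node heap usage and GC counts; old-gen collections climbing while the
# heap stays near 100% matches the ineffective-GC pattern described above.
curl 'http://localhost:9200/_nodes/stats/jvm?pretty'
```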
@jasontedor I've started a discussion here. Any help on the matter would be appreciated.
@ACV2 what tool are you using for the visual monitoring data?
@ACV2 this issue also happens to me when searching.
Btw, I am using the Elasticsearch Service on AWS.
I also hit this problem, and the _nodes/hot_threads command returns the following: 62.4% (312ms out of 500ms) cpu usage by thread 'elasticsearch[dmsnode-192][transport_client_worker][T#15]{New I/O worker #15}'
I am having similar issues to the ones described in this thread: #1940
Although the leap second already happened in June, I tried resetting the time anyway. It didn't work.
This is causing a lot of frustration. Any help is much appreciated.
I have upgraded ES to the latest version (elasticsearch-0.90.7-1) and have restarted ES, the front end, and Logstash. No joy at all.
java -version
java version "1.7.0_45"
OpenJDK Runtime Environment (rhel-2.4.3.2.el6_4-x86_64 u45-b15)
OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)
After the upgrade I am seeing these in the logs, which seems like a totally different issue.
[2013-11-28 03:48:32,796][WARN ][discovery.zen ] [LA Logger] received a join request for an existing node [[Night Thrasher][ecqZvZhDSTGiVkrjj6G_hw][inet[/192.168.128.146:9300]]{client=true, data=false}]
[2013-11-28 03:48:36,006][WARN ][discovery.zen ] [LA Logger] received a join request for an existing node [[Night Thrasher][ecqZvZhDSTGiVkrjj6G_hw][inet[/192.168.128.146:9300]]{client=true, data=false}]
Hot threads output: https://gist.github.com/ydnitin/7687098