etcd 3.2.1 Troubling CPU Usage Pattern #8491
@iherbmatt, there is a report about an odd memory usage pattern in #8472. Do you see similar behavior on your setup?
From startup to a few days later, most RAM is used. Some is moved to cached memory, while the rest is taken by other processes - mostly Docker and etcd.
@iherbmatt what is the RPC rate over time for 3.2? What is the memory utilization over time? Are there any errors or warnings in the etcd server logs? Also, please use >=3.1.5 for 3.1; there's a memory leak on linearizable reads.
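For reference, the RPC and memory figures asked for above can be read from etcd's Prometheus `/metrics` endpoint. Below is a minimal sketch in Go (not from the original thread); the client URL and the metric names `grpc_server_handled_total` and `process_resident_memory_bytes` are assumed defaults for an etcd v3.2 member and may differ in your deployment:

```go
// metrics_probe.go - a minimal sketch of scraping an etcd member's /metrics
// endpoint and printing the counters relevant to RPC rate and memory usage.
// The endpoint URL and metric names are assumptions based on etcd defaults.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Assumed client URL of one etcd member; adjust for TLS and hostnames.
	resp, err := http.Get("http://127.0.0.1:2379/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// grpc_server_handled_total is a cumulative count of handled gRPC
		// calls (the rate is the delta between scrapes);
		// process_resident_memory_bytes is the etcd process's resident memory.
		if strings.HasPrefix(line, "grpc_server_handled_total") ||
			strings.HasPrefix(line, "process_resident_memory_bytes") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
```

Scraping this periodically (or pointing Prometheus at the same endpoint) gives the RPC-rate and memory-over-time views requested above.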
Hi Anthony,
I'm not very familiar with etcd - how can I get this information for you? Also, I'm wondering if the automatic snapshots are causing the issue. I'm testing another cluster with automatic snapshots and automatic recovery disabled. For about 2 hours I've seen the CPUs of each of the 3 etcd nodes hovering around 1%; previously they were about 5% (v3.2.1) and ~20% (v3.1.3).
Thanks,
Matt
It looks like kube-aws is taking snapshots every minute on every member, according to https://github.com/kubernetes-incubator/kube-aws/blob/master/core/controlplane/config/templates/cloud-config-etcd#L231. This is about 90x more frequent than the etcd-operator default policy and might account for the increased CPU load. It could be triggering #8009, where the etcd backend needs to be defragmented when there are frequent snapshots.
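As a rough illustration of the defragmentation mitigation referenced in #8009, here is a minimal Go sketch (not from the original thread) using the clientv3 Maintenance API; the endpoints and plaintext transport are placeholders and would need to match the actual kube-aws cluster:

```go
// defrag_members.go - a minimal sketch of defragmenting each etcd member so
// that backend space churned by frequent snapshots/compactions is reclaimed.
// Endpoints and transport settings below are assumptions, not cluster facts.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	endpoints := []string{
		"http://10.0.0.11:2379",
		"http://10.0.0.12:2379",
		"http://10.0.0.13:2379",
	}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Defragment one member at a time: defragmentation blocks reads and
	// writes on that member while it runs, so it should never be issued
	// to all members simultaneously.
	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		_, err := cli.Defragment(ctx, ep)
		cancel()
		if err != nil {
			fmt.Printf("defragment %s failed: %v\n", ep, err)
			continue
		}
		fmt.Printf("defragmented %s\n", ep)
	}
}
```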
When I disable automatic snapshots and disaster recovery, the CPU remains around 1-1.5%. It's obvious there's a bug in that logic somewhere.
Thank you!
Based on input from @heyitsanthony, I updated
Thanks for the update. Closing this issue.
Hello,
We recently upgraded to kube-aws 0.9.8 and are running etcd 3.2.1; we have also tested etcd 3.2.6. Both versions were installed as a 3-node etcd cluster, with each node having 2 cores and 8 GB of RAM.
What's troubling is that we are only running a single application on the cluster, yet CPU usage keeps growing over time. Here is a sample showing the last week, from cluster start-up to now:
[CPU usage graph omitted from this transcript]
As you can see, the CPU usage has not fluctuated - it has only increased steadily over the last few days. This is troubling because we have older clusters running etcd 3.1.3 whose CPU usage is increasing even faster. We figured we would test a cluster with etcd 3.2.1 to see whether that would fix the problem, but it doesn't - it only postpones the inevitable: an unstable cluster.
To work around the problem, we have to terminate the nodes and let them rebuild and resync with the other members, or reboot them.
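Not part of the original report, but for anyone trying to reproduce this: a minimal Go sketch that polls each member's status over time is one way to check whether the CPU climb tracks backend (dbSize) growth before recycling nodes. The endpoints below are placeholders for the 3-node cluster described next:

```go
// endpoint_status.go - a minimal sketch that periodically records each etcd
// member's backend size, raft index, and current leader via clientv3.Status.
// Endpoints are placeholders; adjust for the real cluster and TLS settings.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	endpoints := []string{
		"http://10.0.0.11:2379",
		"http://10.0.0.12:2379",
		"http://10.0.0.13:2379",
	}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	for {
		for _, ep := range endpoints {
			ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
			st, err := cli.Status(ctx, ep)
			cancel()
			if err != nil {
				fmt.Printf("%s status error: %v\n", ep, err)
				continue
			}
			// dbSize growing without bound alongside rising CPU would point
			// toward the snapshot/defragmentation issue discussed above.
			fmt.Printf("%s dbSize=%d raftIndex=%d leader=%x\n",
				ep, st.DbSize, st.RaftIndex, st.Leader)
		}
		time.Sleep(time.Minute)
	}
}
```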
We created the K8s cluster with the following etcd configs:
3 etcd nodes
m4.large (2 Cores, 8GB RAM)
50GB root volume [general ssd (gp2)]
200GB data volume [general ssd (gp2)]
auto-recovery: true
auto-snapshot: true
encrypted volumes: true
Could somebody please help us with this?
Thank you,
Matt