ETCD database space / quota exceeded, goes into maintenance mode #4005
Comments
So the apiserver issues a compaction every 5 minutes (IIRC). I don't understand the exact cause, but it looks like an etcd bug. Related: kubernetes/kubernetes#45037. @lavalamp asked for a backport and was told "no", but the fix will be in etcd 3.3.
Cheers, that's probably the cause of it then! Regarding the apiserver doing the compaction every 5 minutes, shouldn't the other 4 nodes with disk space remaining have stayed operational? Or maybe we still needed to do the defrag to reclaim the free space on the members / clear the alarms that had triggered?
If anyone runs into the above issue, you can attempt the very rough recovery steps I took (tested on CoreOS). Run these on each affected member that still has available space on the etcd volume:
For any members that have hit the above-mentioned bug, where the volume is at 100% (not entirely sure whether steps 2-5 are necessary in all cases):
The above steps were adapted slightly from this guide: https://github.com/kubernetes/kops/blob/master/docs/single-to-multi-master.md#4---add-the-third-master
v3.3.0 has officially been released. The following PR should correct issues with logging and pick up version changes for a rolling update: #4371. I'll be testing this out and will see how it goes!
Tempted to close this issue now. ETCD v3.3.0 appears to resolve it; I'm running a cluster on the newer version (including the PR referenced above) and haven't noticed any problems so far. Just a note: with kops you'll need to define the new version as follows in your kops spec:
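A minimal sketch of the relevant `etcdClusters` section (the member name and instance group below are placeholders; only the `version` field is the point here):

```yaml
etcdClusters:
- etcdMembers:
  - instanceGroup: master-us-east-1a   # placeholder instance group
    name: a
  name: main
  version: 3.3.0
- etcdMembers:
  - instanceGroup: master-us-east-1a   # placeholder instance group
    name: a
  name: events
  version: 3.3.0
```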
@justinsb anything more you think we need to do here, or happy to close this?
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Sorry to resurrect this old issue; I just fell into it. Setting the flag sorted it out for me. Hope that I could save someone some time.
Does kops support this?
You can specify which ENV vars to pass on to etcd: https://kops.sigs.k8s.io/cluster_spec/#etcdclusters |
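For example, something along these lines in the cluster spec (a sketch assuming a kops version with etcd-manager; the values are illustrative, not recommendations):

```yaml
etcdClusters:
- name: main
  manager:
    env:
    # Illustrative values only; tune for your cluster
    - name: ETCD_AUTO_COMPACTION_RETENTION
      value: "1"
    - name: ETCD_QUOTA_BACKEND_BYTES
      value: "4294967296"
```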
Thanks @olemarkus |
Kops Version: kops v1.8.0-beta.2
Kubernetes Version: kubernetes v1.8.2
ETCD Version: v3.0.17 (TLS enabled)
Cloud Provider: AWS
Steps to recreate (will take time):
After some time in operation, you may begin to see warnings such as the below in the logs:
Check ETCD Status:
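A sketch of the kind of check this refers to, assuming the etcd v3 API on a TLS-enabled member (the endpoint and certificate paths below are placeholders):

```sh
# Show per-member DB size, leader status, and raft index
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/path/to/ca.crt --cert=/path/to/client.crt --key=/path/to/client.key \
  endpoint status --write-out=table

# List active alarms; a NOSPACE alarm means the backend quota has been exceeded
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/path/to/ca.crt --cert=/path/to/client.crt --key=/path/to/client.key \
  alarm list
```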
According to the ETCD Maintenance Docs, the cluster has gone into a limited-operation maintenance mode, meaning it will only accept key reads and deletes.
Recovery: History compaction needs to occur (followed possibly by defragmentation to release the free storage space for reuse) for the cluster to be operational again; the steps for this are in the above docs link.
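A rough sketch of that sequence, following the etcd maintenance docs (a single local endpoint is shown, the TLS flags from the status check above would also be needed, and these are not necessarily the exact commands used here):

```sh
# 1. Get the current revision and compact the keyspace history up to it
REV=$(ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  endpoint status --write-out=json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 compact "$REV"

# 2. Defragment the member to hand the freed space back to the filesystem
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 defrag

# 3. Disarm the NOSPACE alarm so writes are accepted again
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 alarm disarm
```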
There are possible options we could supply to etcd via kops which will hopefully mitigate this issue and reduce the manual user maintenance required (although I don't know much about etcd to be sure; see the flag sketch after this list):
- `ETCD_QUOTA_BACKEND_BYTES` to be configurable, so a higher value can be set rather than the default of 0 (0 defaults to the low space quota).
- `ETCD_AUTO_COMPACTION_RETENTION` to be configurable, so compaction can trigger automatically without user intervention.

EDIT: 1 of the 5 nodes had its etcd volume maxed out at 100%, due to a dodgy deployment. The other 4 were only 3% utilised, as shown in the above log snippets.
Ping @gambol99 @justinsb @chrislovecnm