[Feature] Failed backups to not block incoming traffic and trigger high prio alert instead #147
Comments
@amshuman-kr You have mentioned internal references in the public. Please check. |
As we have seen lately with the off-by-one (32/33 chunks) on GCP, doesn't it make sense to give this one a higher prio @amshuman-kr? |
Just wanted to mention one point: in multi-node etcd we plan to use ephemeral volumes. If we choose to go with ephemeral volumes, then in the worst case we might lose data because it wasn't backed up. |
Wouldn't it make sense to treat the ephemeral volume use-case as an optimization for the future, in order to keep things simple for now? The multi-node project involves more urgent points, and requirements like these will only delay the feature roll-out. |
Fully agree. Ephemeral volumes need, I think, a certain confidence in overall etcd cluster stability and probably a substantial amount of perf testing and tuning (e.g. what does it mean to put 20 etcds' data on one single volume and thus share its IOPS?). |
agreed |
Yes, I would also vote to focus on multi-node/clustered etcd in the form that can be achieved "the best" (low complexity, low coupling). Ephemeral volumes, as @dguendisch pointed out, require having a solution first and gaining trust next, before going there last. Already including that in the challenging task we have at hand, letting it pull in the backup question and thereby raising complexity/coupling even more (a leader with failed backups losing leadership sounds like another level of complexity/coupling), sounds like a bit too much too early. |
@timuthy You have mentioned internal references in the public. Please check. |
We will follow up with #280 for the readiness probe. |
Feature (What you would like to be added):
Currently, the health check of the etcd pods is linked to the backup health (whether the last backup upload succeeded) in addition to the etcd health itself. But as long as etcd data is backed by persistent volumes (as it is now), we can afford to let etcd continue serving incoming requests even when the backup upload fails, provided that high-priority alerts are triggered on backup upload failures and a follow-up is done to resolve the issue.
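For illustration, here is a minimal Go sketch of what decoupling the readiness probe from backup health could look like. The names (`etcdHealthy`, `backupUploadsFailed`), endpoints, and metric name are assumptions made for this sketch, not the actual etcd-backup-restore implementation: readiness reflects only etcd health, while backup-upload failures are exposed as a counter that an external alerting system can page on.

```go
// Hypothetical sketch (not the actual etcd-backup-restore code): readiness stays
// green as long as etcd itself is healthy; failed backup uploads only increment
// a metric that a high-priority alert can fire on.
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

var (
	etcdHealthy         atomic.Bool  // assumed to be updated by an etcd health watcher
	backupUploadsFailed atomic.Int64 // assumed to be incremented by the snapshotter on upload failure
)

// readinessHandler keeps the pod ready (and traffic flowing) even if the latest
// backup upload failed, as long as etcd is healthy.
func readinessHandler(w http.ResponseWriter, r *http.Request) {
	if etcdHealthy.Load() {
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
		return
	}
	w.WriteHeader(http.StatusServiceUnavailable)
	fmt.Fprintln(w, "etcd unhealthy")
}

// metricsHandler exposes the backup failure counter so alerting can react to it
// instead of the readiness probe failing.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintf(w, "etcd_backup_upload_failures_total %d\n", backupUploadsFailed.Load())
}

func main() {
	http.HandleFunc("/healthz", readinessHandler)
	http.HandleFunc("/metrics", metricsHandler)
	http.ListenAndServe(":8080", nil) // error ignored for brevity in this sketch
}
```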
Motivation (Why is this needed?):
Avoid bringing down the whole shoot cluster control plane when a backup upload fails, as that basically brings the cluster to a grinding halt. This might be affordable if etcd data is backed by persistent volumes, because for data loss to occur, a further data corruption in the persistent volumes would be required while the backup upload is failing.
See also https://github.tools.sap/kubernetes-canary/issues-canary/issues/599
Approach/Hint to implement the solution (optional):
The following tasks might have to be checked/evaluated:
- Enhance the multi-node etcd proposal to address this new requirement.