This is a copy of vmware-tanzu/velero#7132, so please also check the conversation there for further input.
We saw different alerting behavior across our clusters even though Velero was affected in the same way in all three of them.
Checking the implemented PrometheusRule again, it matches what is described in vmware-tanzu/velero#2725, but is this the right choice? The metric velero_backup_attempt_total only ever grows as long as no restart happens. Using a ratio of failed attempts to an ever-growing total therefore makes you almost blind to failures after a long period of "all is good".
Assume you run 20 backups per day for a specific schedule, this works fine for a year, and your Pod doesn't restart in that time (which is entirely possible): you would have 7300 successful attempts. If you used the example query and alerted on a failure rate above 25%, you would need 2434 failed attempts, or ~122 days, before you hit that mark and even realize that your backups aren't working anymore.
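To make the numbers above concrete, here is a small sketch of the arithmetic (the 20 backups/day, 25% threshold, and no-restart-for-a-year figures come from the example above):

```python
import math

# Hypothetical scenario from the example above:
backups_per_day = 20
successes = backups_per_day * 365  # 7300 successful attempts after one year
threshold = 0.25                   # alert on > 25% failure rate

# Smallest f with f / (successes + f) > threshold, i.e. f > successes * t / (1 - t):
failures_needed = math.floor(successes * threshold / (1 - threshold)) + 1
days_until_alert = failures_needed / backups_per_day

print(failures_needed)            # 2434 failed attempts
print(round(days_until_alert))    # ~122 days until the alert fires
```

So with a counter that has been growing for a year, more than a third of a year of completely failing backups goes unnoticed before the ratio crosses the threshold.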
I am not sure what the best approach would be, but it might be either using increase() over a shorter window instead of "the whole time the pod is running", or using velero_backup_last_successful_timestamp, or something else entirely.
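As a rough sketch of what either alternative could look like (the window sizes, schedule label value, and thresholds are placeholders, not recommendations):

```promql
# Failure ratio over the last day only, instead of over the pod's lifetime:
increase(velero_backup_failure_total{schedule="my-schedule"}[1d])
  / increase(velero_backup_attempt_total{schedule="my-schedule"}[1d]) > 0.25

# Or: alert when no successful backup has finished recently
# (here: more than 1.5x the expected schedule interval of one day):
time() - velero_backup_last_successful_timestamp{schedule="my-schedule"} > 1.5 * 86400
```

The second form also catches the case where backups stop being attempted at all, which a failure ratio can never detect.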
My bug report here is mainly: should we change the example (even though it is just a comment) to keep people from simply copy-pasting it and using it as-is to monitor Velero backups?