This is a copy of vmware-tanzu/velero#7132, so please also check the conversation there for further input.
We saw different alerting behavior across our clusters even though Velero was affected in the same way in all three of them.
Checking the implemented PrometheusRule again, it matches what is described in vmware-tanzu/velero#2725, but is this the right choice? The metric velero_backup_attempt_total only ever grows as long as no restart happens. Using a ratio of failed attempts to an ever-growing total therefore makes you almost blind to failures after a long period of "all is good".
Assume you run 20 backups per day for a specific schedule, this works fine for a year, and your Pod doesn't restart in that time (which is entirely possible): you would have 7300 successful attempts. If you used the example query and alerted on a failure rate above 25%, you would need 2434 failed attempts, or ~122 days, before you hit that mark and even realize that your backups aren't working anymore.
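To make the numbers above concrete, here is a small sketch of the arithmetic (the 20 backups/day, 25% threshold, and no-restart-for-a-year figures come from the example above):

```python
import math

# Hypothetical scenario from the example above:
backups_per_day = 20
successes = backups_per_day * 365  # 7300 successful attempts after one year
threshold = 0.25                   # alert on > 25% failure rate

# Smallest f with f / (successes + f) > threshold, i.e. f > successes * t / (1 - t):
failures_needed = math.floor(successes * threshold / (1 - threshold)) + 1
days_until_alert = failures_needed / backups_per_day

print(failures_needed)            # 2434 failed attempts
print(round(days_until_alert))    # ~122 days until the alert fires
```

So with a counter that has been growing for a year, more than a third of a year of completely failing backups goes unnoticed before the ratio crosses the threshold.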
I am not sure what the best approach would be, but it might be either using increase() over a shorter window instead of "the whole time the pod is running", or using velero_backup_last_successful_timestamp, or something else entirely.
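As a rough sketch of what either alternative could look like (the window sizes, schedule label value, and thresholds are placeholders, not recommendations):

```promql
# Failure ratio over the last day only, instead of over the pod's lifetime:
increase(velero_backup_failure_total{schedule="my-schedule"}[1d])
  / increase(velero_backup_attempt_total{schedule="my-schedule"}[1d]) > 0.25

# Or: alert when no successful backup has finished recently
# (here: more than 1.5x the expected schedule interval of one day):
time() - velero_backup_last_successful_timestamp{schedule="my-schedule"} > 1.5 * 86400
```

The second form also catches the case where backups stop being attempted at all, which a failure ratio can never detect.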
My bug report here is mainly: should we change the example (even though it is just a comment) to keep people from simply copy-pasting it and using it as-is to monitor Velero backups?