restore after a node failover #273
Hi @dr-zeta, sorry for the delay. I don't know how I could have missed this issue 😞. The backup mechanism first makes a backup and, if it has been successful, then updates the Cheers |
The backup mechanism will be extended with a third command, which will return the latest backup number. |
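To illustrate what such a "latest backup number" command could do, here is a minimal sketch. It assumes that backups are stored as `<number>.tar.gz` files in a single directory; the file naming, the function name `latest_backup_number`, and the directory layout are all assumptions made for illustration, not the operator's actual implementation.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming scheme: "<backup number>.tar.gz" (hypothetical, for illustration).
BACKUP_NAME = re.compile(r"^(\d+)\.tar\.gz$")


def latest_backup_number(backup_dir: Path) -> Optional[int]:
    """Return the highest backup number found in backup_dir, or None if there is none."""
    numbers = []
    for entry in backup_dir.iterdir():
        match = BACKUP_NAME.match(entry.name)
        # Ignore directories and anything that does not look like a finished backup.
        if entry.is_file() and match:
            numbers.append(int(match.group(1)))
    return max(numbers, default=None)
```

With something like this, a restore could ask the backup container which backup really exists instead of trusting a counter that may already have been incremented.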
I can at least confirm we have seen this issue multiple times as well. Until now I didn't report it because I wasn't sure whether it is tied to our setup, which uses custom backup/restore scripts (just slightly adapted from the shipped versions). Trying to nail things down, we have added some debugging to our backups and are noticing a behaviour that really puzzles us: the same backup number is sometimes reused. Unfortunately there is no obvious pattern to this happening. Could you maybe check whether this is happening for you as well?
Sample output on our side
As we can see 6510 and 6515 have been used twice. Is this just happening on our side? |
Well, actually there is some kind of pattern involved. We have a backup schedule of 900 seconds. Whenever we see a second occurrence of the same backup number, it is triggered "off schedule", that is, right after our backup job has finished. See the timestamps in our debug.log.
My guess is this issue is related to the restore issue, sry if this is wrong. |
@pniederlag Thanks for the detailed logs. I'm surprised that the backup number is used twice 😵. We have to narrow it down and fix it in the operator, but as a quick workaround I can suggest skipping the backup if the backup file already exists. We will implement that in the PVC backup for sure. |
"skipping backup in case of duplicated backup number". Well, been there, done that, and it made things even worse for us. I had an "exit 1" in the backup script in case of hitting an existing backup number. For some weird reason, in this scenario the backup number never incremented again. So we ended up running a couple of days without backups until we spotted the problem. First of all, it would be important to verify that the issue of duplicated backup numbers is a general problem and not something specific to our side. @tomaszsek can you verify this oddity with the backup number? I just ran into the issue that the author of this ticket reported again. What I can see:
quite strange to me. How is backup handled in case it is ongoing/running at the time the pod is recreated? |
If you exit with 0 it will work, but the backup shouldn't be triggered twice with the same number.
If the
I will try to reproduce the issue. |
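The workaround discussed above (skip a duplicate backup number, but still report success) could be sketched roughly as follows. The names `run_backup` and `create_archive` and the `<number>.tar.gz` naming are hypothetical. Writing to a temporary file and renaming it afterwards is an extra assumption of this sketch, so that a backup interrupted by a pod restart is not mistaken for a finished one.

```python
import os
import sys
from pathlib import Path
from typing import Callable


def run_backup(backup_dir: Path, backup_number: int,
               create_archive: Callable[[Path], None]) -> int:
    """Create backup <backup_number>.tar.gz unless it already exists.

    Returns the exit code the backup script would use.
    """
    target = backup_dir / f"{backup_number}.tar.gz"
    if target.exists():
        # Duplicate trigger: skip, but return 0. Per the thread, a non-zero
        # exit here left the backup number stuck.
        print(f"backup {backup_number} already exists, skipping", file=sys.stderr)
        return 0

    # Assumption of this sketch: write to a temporary name first so a
    # half-written archive never appears under the final name.
    tmp = target.with_name(target.name + ".tmp")
    create_archive(tmp)
    os.replace(tmp, target)
    return 0
```

A script built on this would end with `sys.exit(run_backup(...))`. This only hides the symptom; the duplicate trigger itself still has to be fixed in the operator.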
I confirm that the backup is called twice with the same number on pod termination when
The PVC backup provider works because the backup gets overridden. |
Thx. As our problem occurs not only in relation to pod deletion but also during regular operations, my colleague has now put our issue on a separate ticket (#396). Thx for your efforts and work! |
Hi everybody.
I found a restore issue after a failover test on a worker node that hosts the operator pods.
The operator cannot restore the last backup because it doesn't exist.
This is the log error:
In this case you have to manually set the recoveryOnce value to 3633 (last minus 1) to recover the last good backup.
I think that you increment the last backup version before the backup has been stored correctly.
Is it possible?
Thanks
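The ordering I would expect can be sketched like this: the recorded backup number only advances after the backup has actually been stored. `BackupState`, its fields, and `backup_then_record` are hypothetical names used for illustration only, not the operator's real data structures.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BackupState:
    # Hypothetical bookkeeping, for illustration only.
    last_successful: Optional[int] = None  # number a restore should use
    pending: int = 1                       # number the next backup will get


def backup_then_record(state: BackupState, store: Callable[[int], None]) -> bool:
    """Store the pending backup first; advance the counters only on success."""
    number = state.pending
    try:
        store(number)
    except Exception:
        # Backup failed or was interrupted: leave the state untouched so a
        # restore still points at the last backup that really exists.
        return False
    state.last_successful = number
    state.pending = number + 1
    return True
```

With this ordering, a failed backup 3634 would leave `last_successful` at 3633, which is the value that had to be entered by hand in the case above.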