[BUG] Etcdbrctl restore throws FATAL - results in event revision 1 higher than expected: "snapshot revision verification failed for delta snapshot"
#583
Describe the bug:
When doing a restore from a full backup and incremental backups, we encountered a failure mid-restore:
This is after restoring ~15 of ~30 incremental backups.
It actually occurred at different places across multiple attempts to restore:
Full backup schedule: */30 * * * *
Incremental backup schedule: 60s
Storage retention policy: Exponential
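As a sanity check on the numbers above, here is a minimal Go sketch of how many incremental snapshots fall between two full snapshots under this schedule. The function name `deltasPerFullWindow` is made up for illustration and is not part of etcdbrctl. It assumes the cron expression fires every 30 minutes and the incremental period is a fixed 60s.

```go
package main

import (
	"fmt"
	"time"
)

// deltasPerFullWindow is a hypothetical helper: it returns how many
// incremental (delta) snapshots fit between two consecutive full snapshots.
func deltasPerFullWindow(fullInterval, deltaInterval time.Duration) int {
	if deltaInterval <= 0 {
		return 0 // avoid dividing by zero on a misconfigured period
	}
	return int(fullInterval / deltaInterval)
}

func main() {
	// */30 * * * * -> one full snapshot every 30 minutes (assumed).
	// 60s          -> one incremental snapshot every minute.
	fmt.Println(deltasPerFullWindow(30*time.Minute, 60*time.Second))
}
```

This lines up with the ~30 incremental backups we see per restore.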
Expected behavior:
I'd expect the restoration either to complete with the same event revision or, if it's okay that the restored revision is a smidge higher, to log at info or warn level and continue.
Ultimately, I'd expect the restoration to work from these incremental backups.
The main question is: when applying an incremental snapshot, is it okay for the revision in the etcd process we're using for the restore to be slightly higher than expected? Is that actually a fatal error, or is it something to be expected after restoring a 6.7 GB full backup and ~15 or so incremental snapshots of ~10 MB each?
If it's expected, can we info/warn and continue, or is this actually fatal?
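To make the ask concrete, here is a rough Go sketch of the two behaviours I'm comparing. This is not the etcdbrctl implementation; `verifyRevision` and `maxOvershoot` are names I made up for illustration. With `maxOvershoot = 0` any mismatch is fatal (what we see today), and with `maxOvershoot = 1` an overshoot of one revision would only produce a warning. Whether tolerating the overshoot is actually safe is exactly the open question.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRevisionMismatch carries the message we see in the FATAL log line.
var ErrRevisionMismatch = errors.New("snapshot revision verification failed for delta snapshot")

// verifyRevision is a hypothetical check, not the real etcdbrctl code.
// It compares the revision etcd reports after applying a delta snapshot
// with the revision the snapshot says it should end at.
// A non-empty warning means "log and continue"; a non-nil error means "abort".
func verifyRevision(expected, actual, maxOvershoot int64) (warning string, err error) {
	switch {
	case actual == expected:
		return "", nil
	case actual > expected && actual-expected <= maxOvershoot:
		return fmt.Sprintf("revision %d is %d higher than expected %d, continuing",
			actual, actual-expected, expected), nil
	default:
		return "", fmt.Errorf("%w: expected %d, got %d", ErrRevisionMismatch, expected, actual)
	}
}

func main() {
	// Strict mode: an overshoot of 1 is fatal.
	if _, err := verifyRevision(1000, 1001, 0); err != nil {
		fmt.Println("FATAL:", err)
	}
	// Tolerant mode: the same overshoot becomes a warning.
	if warn, err := verifyRevision(1000, 1001, 1); err == nil && warn != "" {
		fmt.Println("WARN:", warn)
	}
}
```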
How To Reproduce (as minimally and precisely as possible):
This is the first time I've seen this. It happened on an etcd cluster with 3 members, ~750 namespaces, a typical full snapshot size of 6 GB, and incremental snapshots between 10 MB and 20 MB.
After taking etcd offline, deleting the member and WAL directories, and restoring, somewhere in the middle of those incremental backups the restore ends up with an event revision that is 1 higher than expected.
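One thing that could help narrow this down is checking whether the stored snapshots themselves form a contiguous revision chain. Below is a small Go sketch of that check. The `Snapshot` struct and `firstBreak` function are hypothetical, and the sketch assumes each incremental snapshot starts exactly one revision after the previous snapshot's last revision, and that the list is already sorted by start revision. The revision numbers in `main` are made-up sample values, not taken from our bucket.

```go
package main

import "fmt"

// Snapshot is a hypothetical, simplified record of one stored backup object.
type Snapshot struct {
	Name     string
	StartRev int64 // first revision covered by this snapshot
	LastRev  int64 // last revision covered by this snapshot
}

// firstBreak returns the index of the first delta whose StartRev does not
// directly follow the previous snapshot's LastRev, or -1 if the chain
// is contiguous. Deltas are assumed to be sorted by StartRev.
func firstBreak(full Snapshot, deltas []Snapshot) int {
	prev := full.LastRev
	for i, d := range deltas {
		if d.StartRev != prev+1 {
			return i
		}
		prev = d.LastRev
	}
	return -1
}

func main() {
	full := Snapshot{Name: "full", StartRev: 0, LastRev: 1000}
	deltas := []Snapshot{
		{Name: "delta-1", StartRev: 1001, LastRev: 1100},
		{Name: "delta-2", StartRev: 1101, LastRev: 1200},
		{Name: "delta-3", StartRev: 1200, LastRev: 1300}, // overlaps delta-2 by one revision
	}
	if i := firstBreak(full, deltas); i >= 0 {
		fmt.Printf("chain breaks at %s\n", deltas[i].Name)
	} else {
		fmt.Println("chain is contiguous")
	}
}
```

If the chain is contiguous in storage, the off-by-one would have to be introduced while the events are applied during the restore, rather than at backup time.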
Logs:
Screenshots (if applicable):
Environment (please complete the following information):
Azure
Anything else we need to know?:
We mostly use etcdbrctl just as a backup agent alongside a sort of hand-rolled etcd systemd quorum. We've been running that setup successfully for several years, but wanted to improve our mean time to recovery with a snapshotting and restore tool.
I really wanted to use it in Server mode, but that would have required completely refactoring or even rewriting our etcd bootstrap to harmonize with the Server. So instead it runs as a systemd unit that takes backups, and when a restore is needed, we run the restore ourselves after our Prometheus alerts tell us the quorum is bogus.