backup-operator: Support periodically backup (#1841) #2028
Conversation
This commit added BackupIntervalInSecond to BackupPolicy, which enables periodic backup as described in issue coreos#1841. This commit is part of coreos#1841. By specifying BackupIntervalInSecond, the user can have etcd-backup-operator perform periodic backups. The specification of BackupIntervalInSecond is as follows:
- unit is seconds
- >0 means the backup interval
- =0 explicitly disables periodic backup; only a one-time backup is performed
This commit implements validation of BackupIntervalInSecond. After this commit, backup-operator ensures that BackupIntervalInSecond follows these restrictions:
- <0 is not allowed and fails validation
- 0 is valid and disables periodic backup
- >0 is valid and means the backup interval
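A minimal sketch of that validation, assuming a trimmed-down BackupPolicy type (the names here are illustrative, not the PR's exact code):

```go
package v1beta2

import "errors"

// BackupPolicy is a stand-in for the real CRD type; only the field
// relevant to this change is shown.
type BackupPolicy struct {
	// 0 disables periodic backup (one-shot only); >0 is the interval
	// between backups, in seconds; <0 fails validation.
	BackupIntervalInSecond int64
}

// validate enforces the restrictions listed above.
func (bp *BackupPolicy) validate() error {
	if bp.BackupIntervalInSecond < 0 {
		return errors.New("spec: backupIntervalInSecond must not be negative")
	}
	return nil
}
```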
The current backup status is designed only for one-shot snapshots. It always shows the latest result, but it would be nice if we could record the last time a snapshot was successfully taken.
This commit added the MaxBackups attribute, which lets backup-operator delete older snapshots when the number of snapshots exceeds MaxBackups. The specification of MaxBackups is as follows:
- <0 is not allowed, and causes a validation failure
- =0 indicates MaxBackups is infinite; the operator does not delete any existing snapshots
- >0 indicates the maximum number of snapshots to keep
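The retention rule can be sketched as below; the helper is hypothetical and assumes the snapshot names are already sorted oldest-first:

```go
// backupsToPrune returns the snapshots to delete so that at most
// maxBackups remain. maxBackups == 0 means "infinite": the operator
// never deletes existing snapshots.
func backupsToPrune(sortedOldestFirst []string, maxBackups int) []string {
	if maxBackups <= 0 || len(sortedOldestFirst) <= maxBackups {
		return nil
	}
	return sortedOldestFirst[:len(sortedOldestFirst)-maxBackups]
}
```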
After periodic backup support was added, backup-operator appended a revision number and date to the S3 path, as in <bucket name>/<object name>_v<rev>_<date>. This behaviour was applied even to one-shot backups, so the change broke existing behaviour. This commit brings back the original behaviour: if the backup is not periodic, the S3 path is used without any additions, as <bucket name>/<object name>.
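Only the path split itself comes from the commit message; the function and date layout below are assumptions, used just to illustrate the two cases:

```go
import (
	"fmt"
	"time"
)

// backupObjectPath keeps one-shot backups at the user-supplied path and
// appends the etcd revision and a timestamp only for periodic backups.
func backupObjectPath(base string, rev int64, now time.Time, periodic bool) string {
	if !periodic {
		return base // original behaviour: <bucket name>/<object name>
	}
	return fmt.Sprintf("%s_v%d_%s", base, rev, now.UTC().Format("2006-01-02-15-04-05"))
}
```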
Can one of the admins verify this patch?
@etcd-bot ok to test
@ukinau thanks for the PR. A few nits; also, can you update the codegen files? That should fix your failed unit test:
./hack/k8s/codegen/update-generated.sh
@etcd-bot retest this please
A few more nits, and we are having multiple failures on CI; if you could look into that I would appreciate it.
@ukinau good progress, thanks! We need to get to the bottom of this CI failure; can you take a look please?
Force-pushed from 06fc1a9 to f84a3d5
After I generated the code based on the k8s object (zz_generated.deepcopy.go), the build started failing. This is because all fields of a k8s custom resource must implement DeepCopyInto, but the time.Time field we added does not implement it. We should have used meta/v1.Time instead, which implements all the functions necessary for k8s objects while providing the same functionality as time.Time. This commit also includes some refactoring that was pointed out in code review.
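For context, this is the shape of the change being described; the field name is illustrative, while metav1.Time is the real apimachinery type that implements DeepCopyInto:

```go
import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type BackupStatus struct {
	// A plain time.Time field breaks the generated zz_generated.deepcopy.go,
	// because time.Time does not implement DeepCopyInto:
	//   LastSuccessDate time.Time `json:"lastSuccessDate,omitempty"`

	// metav1.Time wraps time.Time and provides DeepCopyInto plus the JSON
	// marshalling expected of Kubernetes API fields:
	LastSuccessDate metav1.Time `json:"lastSuccessDate,omitempty"`
}
```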
Force-pushed from f84a3d5 to d1ae2c4
@hexfusion
@hasbro17 PTAL
@hexfusion Is there anything I can do to get this PR merged faster?
@ukinau thank you for your patience. I will get a final review in, hopefully by the end of the week.
Force-pushed from f0ecf00 to ab9b78e
@hexfusion @hasbro17
The CI e2eslow test seems to be having some other trouble that is not related to this change and looks like an environment issue. Could you check it out? There seem to be some untracked files preventing the e2e test from running.
@ukinau thanks for the update and your patience. I will take a hard look at this soon with the hopes of getting it merged before the end of the week.
@etcd-bot retest all
@etcd-bot retest this please
@ukinau This seems good to me. Apologies for the delay, but we've been having some issues with our Jenkins instances. Once we resolve that we should be good to run the CI and merge. @hexfusion We can sync up offline, but it's an older issue that we've seen before: our instance is out of memory.
@hasbro17 I should be able to resolve Jenkins today.
@etcd-bot retest this please
@@ -78,6 +83,7 @@ func TestBackupAndRestore(t *testing.T) {
	s3Path := path.Join(os.Getenv("TEST_S3_BUCKET"), "jenkins", suffix, time.Now().Format(time.RFC3339), "etcd.backup")

	testEtcdBackupOperatorForS3Backup(t, clusterName, operatorClientTLSSecret, s3Path)
+	testEtcdBackupOperatorForPeriodicS3Backup(t, clusterName, operatorClientTLSSecret, s3Path)
@ukinau I think I know why the test is failing. Previously we had the two subtests:
testEtcdBackupOperatorForS3Backup(t, clusterName, operatorClientTLSSecret, s3Path)
testEtcdRestoreOperatorForS3Source(t, clusterName, s3Path)
The first test creates the backup file at s3Path, and the second expects the backup file to be present in order to restore a cluster.
With the periodic backup subtest now in between the two, we are deleting all the backup files in a defer statement, so testEtcdRestoreOperatorForS3Source will fail because the previous test deleted the backup file.
func testEtcdBackupOperatorForPeriodicS3Backup(t *testing.T, clusterName, operatorClientTLSSecret, s3Path string) {
	...
	// create etcdbackup resource
	eb, err := f.CRClient.EtcdV1beta2().EtcdBackups(f.Namespace).Create(backupCR)
	if err != nil {
		t.Fatalf("failed to create etcd backup cr: %v", err)
	}
	defer func() {
		if err := f.CRClient.EtcdV1beta2().EtcdBackups(f.Namespace).Delete(eb.Name, nil); err != nil {
			t.Fatalf("failed to delete etcd backup cr: %v", err)
		}
		// cleanup backup files created by the periodic backup
		allBackups, err := wr.List(context.Background(), backupS3Source.Path)
		if err != nil {
			t.Fatalf("failed to list backup files: %v", err)
		}
		if err := e2eutil.DeleteBackupFiles(wr, allBackups); err != nil {
			t.Fatalf("failed to cleanup backup files: %v", err)
		}
	}()
	...
}
Can you please move testEtcdBackupOperatorForPeriodicS3Backup to the end so it doesn't interfere with the restore test:
// Backup then restore tests
testEtcdBackupOperatorForS3Backup(t, clusterName, operatorClientTLSSecret, s3Path)
testEtcdRestoreOperatorForS3Source(t, clusterName, s3Path)
// Periodic backup test
testEtcdBackupOperatorForPeriodicS3Backup(t, clusterName, operatorClientTLSSecret, s3Path)
@ukinau One more thing I noticed when trying out your PR is that for some reason the periodic runner isn't deleted when we delete the EtcdBackup CR. Can you please try this out and confirm?
@ukinau Just double checking if you still have time to work on the changes. If not I can probably take over this PR. |
@hasbro17 Sorry for the late response. I will spend more time on this over the weekend.
The restore test expected the backup file to be present, but the periodic backup test cleaned up the backup files it created, so the restore test failed because of that. That's why we moved the periodic test after the restore test.
@etcd-bot retest this please
It seems everything is fine now.
var err error
for {
	for i := 1; i < 6; i++ {
		latestEb, err = b.backupCRCli.EtcdV1beta2().EtcdBackups(b.namespace).Get(eb.Name, metav1.GetOptions{})
		if err != nil {
@ukinau Sorry for the delay. Yes, I think this should fix the infinite loop.
One more thing I would improve: we should not retry 5 times if the EtcdBackup CR is deleted or not found, because that's when we want to stop the periodic backup. We only want to retry on transient errors.
for i := 1; i < 6; i++ {
	latestEb, err = b.backupCRCli.EtcdV1beta2().EtcdBackups(b.namespace).Get(eb.Name, metav1.GetOptions{})
	if err != nil {
		// Stop backup if CR not found
		if apierrors.IsNotFound(err) {
			b.logger.Infof("Could not find EtcdBackup. Stopping periodic backup for EtcdBackup CR %v",
				eb.Name)
			break
		}
		b.logger.Warningf("[Attempt: %d/5] Failed to get latest EtcdBackup %v : (%v)",
			i, eb.Name, err)
		time.Sleep(time.Second) // note: time.Sleep(1) would sleep only 1ns
		continue
	}
	break
}
if err == nil {
	// Perform backup
	bs, err = b.handleBackup(&ctx, &latestEb.Spec, true)
}
But that's just a minor change, and since this PR has been out for a while, I can make the above change as a follow-up if you won't get time for it anytime soon.
Thanks for your review. I'm going to fix it.
LGTM other than the retry improvement.
LGTM. We'll need to follow this up by adding a doc that walks through an example of a periodic backup. Maybe an additional section in https://github.com/coreos/etcd-operator/blob/master/doc/user/walkthrough/backup-operator.md
I am finding this does not always clean up. I have maxBackups set to 60; it got to 60 and has now stopped backing up. Even though the status of the CR says succeeded, there are no new files. No errors or anything either.
I found my issue: #2099
This PR implements periodic backup, as discussed in #1841.
The specification is meant to be almost the same as what #1841 concluded.
After this PR is merged, the user can enable periodic backup as follows.
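The example manifest from the original description is not preserved here. As a rough Go sketch of the same thing, built with the v1beta2 types used elsewhere in this thread: everything apart from backupIntervalInSecond and maxBackups (endpoints, storage type, nesting of the policy) is an assumption.

```go
import (
	api "github.com/coreos/etcd-operator/pkg/apis/etcd/v1beta2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newPeriodicBackupCR builds an EtcdBackup CR that snapshots the cluster
// every 5 minutes and keeps only the 5 newest snapshots.
func newPeriodicBackupCR() *api.EtcdBackup {
	return &api.EtcdBackup{
		ObjectMeta: metav1.ObjectMeta{Name: "example-etcd-cluster-periodic-backup"},
		Spec: api.BackupSpec{
			EtcdEndpoints: []string{"https://example-etcd-cluster-client:2379"},
			StorageType:   "S3",
			BackupPolicy: &api.BackupPolicy{
				BackupIntervalInSecond: 300, // periodic: snapshot every 300s
				MaxBackups:             5,   // prune down to the 5 newest
			},
		},
	}
}
```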
As you can see above, two new fields are added:
- backupIntervalInSecond: the interval between periodic backups, in seconds
- maxBackups: backup-operator deletes the oldest backup files that share the same prefix once this limit is exceeded

close #1841