Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix error handling for initial delta snapshot #165

Merged

Conversation

shreyas-s-rao
Copy link
Collaborator

Signed-off-by: Shreyas Rao [email protected]

What this PR does / why we need it:
This PR fixes the error handling while taking initial delta snapshot. Currently, if the initial delta snapshot fails, we retry the etcd probe and try taking the initial delta snapshot again, repeatedly until it succeeds. In the case where backup sidecar had already taken some snapshots and then stopped for some time while etcd is still running and is also compacted in that interval, when sidecar starts again, there's a chance the watch for initial delta snapshot will fail because the latest revision from the snapstore is not available anymore on etcd (as it was compacted). This throws backup sidecar into an infinite failure loop.
This PR fixes this behavior by taking a full snapshot on an initial delta snapshot error, instead of retrying delta snapshot again. Error handling for the full snapshot is as expected, ie, it will be retried after subsequent etcd probe.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Release note:

In the case that initial delta snapshot fails, a full snapshot is tried instead.

@shreyas-s-rao shreyas-s-rao added needs/review Needs review component/etcd-backup-restore ETCD Backup & Restore platform/all needs/lgtm Needs approval for merging area/backup Backup related needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jun 7, 2019
@shreyas-s-rao shreyas-s-rao added this to the 0.7.0 milestone Jun 7, 2019
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jun 7, 2019
Copy link
Collaborator

@amshuman-kr amshuman-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason we call TakeFullSnapshot followed by a TakeFullSnapshotAndResetTimer if the first call returns error instead of a single call to TakeFullSnapshotAndResetTimer?

cmd/server.go Outdated Show resolved Hide resolved
cmd/server.go Outdated Show resolved Hide resolved
cmd/server.go Outdated Show resolved Hide resolved
cmd/server.go Outdated Show resolved Hide resolved
cmd/server.go Outdated Show resolved Hide resolved
@PadmaB PadmaB modified the milestones: 0.7.0, 1906a Jun 10, 2019
@shreyas-s-rao shreyas-s-rao force-pushed the fix/initial-delta-error-handling branch from bbed3e5 to 02e74a7 Compare June 10, 2019 10:49
@shreyas-s-rao
Copy link
Collaborator Author

@swapnilgm @amshuman-kr I have addressed the review comments. PTAL.

Copy link
Collaborator

@amshuman-kr amshuman-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes. LGTM

Copy link
Contributor

@swapnilgm swapnilgm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank for addressing the changes. I have added one more small suggestion. Please address it.

cmd/server.go Outdated Show resolved Hide resolved
@shreyas-s-rao shreyas-s-rao force-pushed the fix/initial-delta-error-handling branch from 02e74a7 to 36bd479 Compare June 11, 2019 05:34
@shreyas-s-rao
Copy link
Collaborator Author

@swapnilgm I've addressed the review comment. PTAL.

Copy link
Contributor

@swapnilgm swapnilgm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@swapnilgm swapnilgm added reviewed/lgtm Has approval for merging reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed needs/lgtm Needs approval for merging needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jun 11, 2019
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jun 11, 2019
@swapnilgm swapnilgm merged commit 7d15184 into gardener:master Jun 11, 2019
@shreyas-s-rao shreyas-s-rao deleted the fix/initial-delta-error-handling branch June 11, 2019 06:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/backup Backup related component/etcd-backup-restore ETCD Backup & Restore needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/review Needs review platform/all reviewed/lgtm Has approval for merging
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants