Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize etcd readiness probe #151

Merged

Conversation

shreyas-s-rao
Copy link
Collaborator

What this PR does / why we need it:
This PR optimizes readiness probe on etcd by marking the health endpoint as ready before taking the first full snapshot rather than after, as was being done previously. This reduces etcd downtime during cluster updates.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Release note:

Reduced etcd downtime by optimizing readiness probe.

@shreyas-s-rao shreyas-s-rao added kind/enhancement Enhancement, improvement, extension needs/review Needs review component/etcd-backup-restore ETCD Backup & Restore platform/all needs/lgtm Needs approval for merging needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) area/high-availability High availability related labels Apr 26, 2019
@shreyas-s-rao shreyas-s-rao added this to the 0.7.0 milestone Apr 26, 2019
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 26, 2019
@shreyas-s-rao shreyas-s-rao force-pushed the optimize-readiness-probe branch from 39d120c to 1fe1e87 Compare April 29, 2019 09:45
@shreyas-s-rao shreyas-s-rao added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 29, 2019
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 29, 2019
@shreyas-s-rao shreyas-s-rao added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 29, 2019
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 29, 2019
Copy link
Contributor

@georgekuruvillak georgekuruvillak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@swapnilgm
Copy link
Contributor

As i was on leave, i wasn't part of this thread. So, I might be missing discussion behind this. Here is my opinion on this.

Are we trading the correctness here with performance? The initial idea behind the making etcd unready (or internally backup-restore sidecar health false) until first snapshot was, basically if we face some issue with snapshot upload the sidecar will fail to upload. It will go in multiple restarts without any snapshot uploads. At the same time with current PR since we are allowing etcd to be ready to serve, it will continue accepting new user data. So, till the time somebody actually resolves (may be manual intervention or gardener reconciliation required cause, most probable reason for failing to upload snapshot is wrong credentials) the issue with backup-restore-sidecar, etcd revision might progressed a lot ahead of last successful snapshot. So, directly making etcd ready doesn't sound's good to me.

Said that, I agree we have to and we could optimise on this at some level by playing it little more intelligently. Like

  • if restoration happened then no need wait until first full snapshot and mark backup-restore healthy and make etcd ready.
  • if there isn't any restoration but it restarted because of failed to upload last snapshot, pod reschedule or version update, we can check the difference between current etcd revision and last snapshot upload revision and if this is beyond some threshold only then take full snapshot before marking sidecar healthy or making etcd ready. The threshold here can be configured based on the delta snapshot revision change observation.

P.S.: Sorry for commenting on merged PR. Don't want to loose context here.

@amshuman-kr
Copy link
Collaborator

@swapnilgm Here the readiness is set only after the sanity check is over and etcd is up. Sanity check should validate connectivity to the blobstore. So, the likelihood of the following full backup upload failing is low.

Besides, if the full back upload actually fails then the next readinessProbe should fail, cutting off any further inbound traffic. I see the risk of data loss here as no more than a full snapshot upload failing at any other point in time (other than during a restart). Do you see otherwise?

Also, in #150 @shreyas-s-rao has proposed an approach to try a delta snapshot if possible instead of a full snapshot. That should, in theory, minimize the time for a feedback about if the backup could be uploaded. WDYT?

@swapnilgm
Copy link
Contributor

@swapnilgm Here the readiness is set only after the sanity check is over and etcd is up. Sanity check should validate connectivity to the blobstore. So, the likelihood of the following full backup upload failing is low.

Agreed. But there could be network issue or context timeouts happening because of large full snapshot uploads which won't be grabbed in sanity check.

Besides, if the full back upload actually fails then the next readinessProbe should fail, cutting off any further inbound traffic. This will be very rare but backup-restore is supposed to handle rare cases.

Next readiness probe will fail after readiness probe timeout. for single failure it looks minimal window, but over consecutive failure of initial full snapshot upload readiness probe will pass and fail into multiple time window. Cumulatively readiness probe passing time window, will result in etcd availability for long time.

I see the risk of data loss here as no more than a full snapshot upload failing at any other point in time (other than during a restart). Do you see otherwise?

No, data loss because of failure of first full snapshot and later full snapshot is different. Consecutive failure of initial full snapshots will result in etcd revision progressing a lot ahead of last successful snapshot because etcd being ready for long time (not necessarily continuous because of readiness probe but because of cumulative time window explained above). On other hand, failure full snapshot/delta snapshot at other point in time will be not result in data loss at all cause that case is taken care by or supposed to be taken care by initial full snapshot approach.

Also, in #150 @shreyas-s-rao has proposed an approach to try a delta snapshot if possible instead of a full snapshot. That should, in theory, minimise the time for a feedback about if the backup could be uploaded. WDYT?

I agree with proposal of starting watch and delta snapshots. What i didn't get is, why aren't we implementing it? I don't think implementing it involves large efforts/timeline.

@amshuman-kr
Copy link
Collaborator

Next readiness probe will fail after readiness probe timeout. for single failure it looks minimal window, but over consecutive failure of initial full snapshot upload readiness probe will pass and fail into multiple time window. Cumulatively readiness probe passing time window, will result in etcd availability for long time.

Won't the same thing happen if the full snapshot failed during any other time too? Am I missing something here?

I agree with proposal of starting watch and delta snapshots. What i didn't get is, why aren't we implementing it? I don't think implementing it involves large efforts/timeline.

We want to implement it on priority. @shreyas-s-rao wanted to do it already. It is just that we wanted to see if we can do the storage class migration already this week in the dev landscape. It has been pending for 3 weeks now. To me it looked like the delta snapshot on restart looked like large enough effort to postpone it to next week after the migration. Testing effort, not development effort. We do not have automated test cases to cover this scenario yet.

Do you see a risk in going ahead with migration without this change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/high-availability High availability related component/etcd-backup-restore ETCD Backup & Restore kind/enhancement Enhancement, improvement, extension needs/lgtm Needs approval for merging needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/review Needs review platform/all
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants