Redis/Sentinel Update Ordering #89

Closed
rhefner1 opened this issue Sep 6, 2018 · 2 comments

@rhefner1
Contributor

rhefner1 commented Sep 6, 2018

Expected behaviour

Update of redis-operator that is close to zero-downtime.

Actual behaviour

When I updated my redis-operator to 0.5.1, it did the following:

  1. Made changes to both the sentinel deployment and Redis statefulset.
  2. Both began a rolling update (the +60 second changes from [DEVOPS-823] Improve update process #86 worked perfectly).
  3. In my case, redis-0 was the current master. When it was taken down, a failover wasn't possible since the sentinels had also just been restarted and were still discovering the current master. This caused about 2-3 minutes of downtime for the check/heal process to get things back in order.

If it were just my own changes (like changing resources), I would work around that and deploy them separately. But in this case, it was a new version of the operator, which I don't have any control over. Anyone who did that update would experience 2-3 minutes of downtime in every Redis HA cluster they have (and all at the same time).

I'm not sure what the right answer is here. Maybe do the sentinel update first and then wait for all of them to discover the current master and then proceed? It seems like it might be easier though to do the redis upgrade first and when that is completely rolled out, then do the sentinel upgrade. That way you don't have to figure out when all of the sentinels are ready. ¯\_(ツ)_/¯
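
To make the second option concrete, here is a minimal sketch of a "Redis first, then sentinels" ordering. It is only an illustration of the idea, not how the operator works today: `RolloutStatus`, `rollout_complete`, and `next_step` are hypothetical names, and the replica counters are assumed to mirror what a Kubernetes workload status reports.

```python
from dataclasses import dataclass


@dataclass
class RolloutStatus:
    """Hypothetical snapshot of a workload rollout (not a real operator type)."""
    replicas: int          # desired pod count
    updated_replicas: int  # pods already running the new spec
    ready_replicas: int    # pods currently passing readiness checks
    needs_update: bool     # True while the new spec has not been applied yet


def rollout_complete(status: RolloutStatus) -> bool:
    """A rollout is done once every pod is both updated and ready."""
    return (
        not status.needs_update
        and status.updated_replicas == status.replicas
        and status.ready_replicas == status.replicas
    )


def next_step(redis: RolloutStatus, sentinel: RolloutStatus) -> str:
    """Pick the next action so that the sentinels are never restarted
    while the Redis statefulset is still rolling."""
    if redis.needs_update:
        return "update-redis"
    if not rollout_complete(redis):
        # Sentinels are untouched here, so they can still fail over
        # when the current master pod is taken down.
        return "wait-for-redis"
    if sentinel.needs_update:
        return "update-sentinel"
    if not rollout_complete(sentinel):
        return "wait-for-sentinel"
    return "done"
```

The point of the ordering is that the sentinel deployment is only touched after the Redis rollout has finished, so there is no need to detect when every sentinel has rediscovered the master.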

cc: @jchanam

@jchanam jchanam self-assigned this Sep 11, 2018
@jchanam
Collaborator

jchanam commented Sep 11, 2018

Hi @rhefner1,

You're right. I upgraded the operator recently too and hit the same downtime. Since then, I've been thinking about how to improve the whole process and get as close as possible to zero downtime.

I'll try to have an approach soon. Until then, if you want to propose something, I'll be happy to discuss any PR and improve this 👍

@ese
Member

ese commented Dec 11, 2019

The update policy has been updated in the latest release. Check it out and feel free to reopen if the problem persists.
