Insufficient readiness probe leads to data loss in some particular cases #205

Closed

lazovskiy opened this issue Dec 1, 2019 · 4 comments

@lazovskiy

I've encountered a severe issue that led to data loss while performing some maintenance tasks using the Redis operator.

Initial state

  1. A redisfailover resource deployed to the cluster.
  2. Three replicas of Redis Sentinel.
  3. One replica of Redis with a large data set (~15 GB RSS).

Desired state

  1. Two replicas of Redis to achieve some redundancy.
  2. A decreased memory request on the Redis StatefulSet for the sake of resource optimization.

Expected behaviour

  1. New replica spawned
  2. Data replicated and left intact
  3. New resources applied to the old Redis replica

Actual behaviour

After I edited both the Redis resources section and the replica count, the following happened:

  1. A new Redis StatefulSet replica (rfr-redis-1) was spawned.
  2. The new replica attached to the master and replication started.
  3. The master initiated a BGSAVE to provide the bulk data for the connected slave.
  4. The readiness probe completed successfully on the new replica.
  5. Right after that, a rolling update was initiated in order to apply the new resources, and rfr-redis-0 was terminated.
  6. rfr-redis-1 was immediately promoted to master. It had an empty data set because the bulk transfer from the old master had not had time to start.
  7. After the rfr-redis-0 replica was recreated, it became a slave and instantaneously replicated the empty data set.

So I ended up with an empty master-slave replicated Redis and had to restore my data from a backup.

Steps to reproduce the behaviour

Expand a single-replica Redis to two or more replicas and change something else to induce a StatefulSet rolling update.
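
For illustration, an edit along these lines triggers it; the apiVersion, field names and values below are assumed from the v1 RedisFailover schema and may differ between operator versions:

    apiVersion: databases.spotahome.com/v1
    kind: RedisFailover
    metadata:
      name: redisfailover
    spec:
      redis:
        replicas: 2        # scaled up from 1
        resources:
          requests:
            memory: 12Gi   # decreased from the previous request (placeholder value)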

Environment

  • Redis Operator version v1.0.0-rc.1
  • Kubernetes v1.15.0

More details

As far as I can see, the current readiness probe only ensures that Redis can accept connections and reply to commands. It does not check replication status, and initialDelaySeconds is too small to guarantee that replication of large instances has completed.

    readinessProbe:
      exec:
        command:
        - sh
        - -c
        - redis-cli -h $(hostname) ping
      failureThreshold: 3
      initialDelaySeconds: 30
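
A replication-aware probe could also parse INFO replication and refuse to report ready while the initial sync is still in progress. Below is only a rough sketch of that idea (the shell details are mine, not a proposal for the operator's exact implementation):

    readinessProbe:
      exec:
        command:
        - sh
        - -c
        - |
          # A master is always considered ready; a replica is ready only once
          # the replication link is up and the initial sync has finished.
          info="$(redis-cli -h "$(hostname)" info replication)"
          case "$info" in
            *role:master*) exit 0 ;;
          esac
          echo "$info" | grep -q 'master_link_status:up' || exit 1
          echo "$info" | grep -q 'master_sync_in_progress:0' || exit 1
      failureThreshold: 3
      initialDelaySeconds: 30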

Another problem may arise for the same reason: some client libraries (e.g. predis) distribute read-only requests among the replicas. With a replica that is ready but not yet replicated, reads could be served from an empty or stale data set.

@ese
Member

ese commented Dec 5, 2019

Thanks @lazovskiy for the detailed report. We have recently merged two PRs to tackle this issue:

  1. Let the operator manage the rolling update process so that it is aware of the cluster topology: it updates the slave nodes first, waits for a successful sync, and then updates the master (change update strategy #203).
  2. Improve the readiness probe to avoid jeopardizing the disruption budget during evictions when nodes are not yet integrated into the cluster (change readiness probe #206).
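
As a rough illustration of what "wait for successful sync" involves, these are the kinds of replication fields to check (manual checks against the pod names from this report, not the operator's actual code):

    # On the new replica: initial sync is complete when the link is up
    # and no sync is in progress.
    redis-cli -h rfr-redis-1 info replication | grep -E 'master_link_status|master_sync_in_progress'

    # On the current master: the replica should appear as online with an
    # offset close to master_repl_offset.
    redis-cli -h rfr-redis-0 info replication | grep -E 'slave0:|master_repl_offset'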

We are going to do a release with these changes in the next few days.

@ese
Member

ese commented Dec 11, 2019

We just released https://github.com/spotahome/redis-operator/releases/tag/v1.0.0-rc.3

If you can test the issue and confirm it is fixed, that would be great, @lazovskiy. Thanks!

@ese ese closed this as completed Dec 18, 2019
@vmrm

vmrm commented Dec 27, 2019

We've just tested it with v1.0.0-rc.4 and everything works well now, @ese.

@chusAlvarez
Contributor

We've just tested it with v1.0.0-rc.4 and everything works well now, @ese.

Thanks for sharing it, @vmrm!!
