Insufficient readiness probe leads to data loss in some particular cases #205

Closed

lazovskiy opened this issue Dec 1, 2019 · 4 comments

@lazovskiy

I've encountered a severe issue that led to data loss while performing some maintenance tasks using the Redis operator.

Initial state

  1. A redisfailover resource deployed to the cluster.
  2. Three replicas of Redis Sentinel.
  3. One replica of Redis with a large data set (~15 GB RSS).

Desired state

  1. Two replicas of Redis to achieve some redundancy.
  2. A decreased memory request on the Redis StatefulSet for the sake of resource optimization.

Expected behaviour

  1. New replica spawned
  2. Data replicated and left intact
  3. New resources applied to the old Redis replica

Actual behaviour

After I edited both the Redis resources section and the replica count, the following happened:

  1. A new Redis StatefulSet replica (rfr-redis-1) was spawned.
  2. The new replica attached to the master and replication started.
  3. The master initiated a BGSAVE to provide the bulk data for the connected slave.
  4. The readiness probe completed successfully on the new replica.
  5. Right after that, a rolling update was initiated in order to apply the new resources, and rfr-redis-0 was terminated.
  6. rfr-redis-1 was immediately promoted to master. It had an empty data set because the bulk transfer from the old master had not had time to start.
  7. After the rfr-redis-0 replica was recreated, it became a slave and instantaneously replicated the empty data set.

So I ended up with an empty master-slave replicated Redis and had to restore my data from a backup.

Steps to reproduce the behaviour

Expand a single-replica Redis to two or more replicas and change something else to induce a StatefulSet rolling update.
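
For illustration, an edit along these lines triggers it; the apiVersion, field names and values below are assumed from the v1 RedisFailover schema and may differ between operator versions:

    apiVersion: databases.spotahome.com/v1
    kind: RedisFailover
    metadata:
      name: redisfailover
    spec:
      redis:
        replicas: 2        # scaled up from 1
        resources:
          requests:
            memory: 12Gi   # decreased from the previous request (placeholder value)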

Environment

  • Redis Operator version v1.0.0-rc.1
  • Kubernetes v1.15.0

More details

As far as I can see, the current readiness probe only ensures that Redis can accept connections and reply to commands. It does not check replication status, and initialDelaySeconds is too small to guarantee that replication of large instances has completed.

    readinessProbe:
      exec:
        command:
        - sh
        - -c
        - redis-cli -h $(hostname) ping
      failureThreshold: 3
      initialDelaySeconds: 30
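
A replication-aware probe could also parse INFO replication and refuse to report ready while the initial sync is still in progress. Below is only a rough sketch of that idea (the shell details are mine, not a proposal for the operator's exact implementation):

    readinessProbe:
      exec:
        command:
        - sh
        - -c
        - |
          # A master is always considered ready; a replica is ready only once
          # the replication link is up and the initial sync has finished.
          info="$(redis-cli -h "$(hostname)" info replication)"
          case "$info" in
            *role:master*) exit 0 ;;
          esac
          echo "$info" | grep -q 'master_link_status:up' || exit 1
          echo "$info" | grep -q 'master_sync_in_progress:0' || exit 1
      failureThreshold: 3
      initialDelaySeconds: 30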

Another problem may arise for the same reason: some client libraries (e.g. predis) distribute read-only requests among the replicas. With a replica that is ready but not yet replicated, reads could be served from an empty or stale data set.

@ese
Member

ese commented Dec 5, 2019

Thanks @lazovskiy for the detailed report. We have recently merged two PRs to tackle this issue:

  1. Let the operator manage the rolling update process so that it is aware of the cluster topology: it updates the slave nodes first, waits for a successful sync, and then updates the master (change update strategy #203).
  2. Improve the readiness probe to avoid jeopardizing the disruption budget during evictions when nodes are not yet integrated into the cluster (change readiness probe #206).
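
As a rough illustration of what "wait for successful sync" involves, these are the kinds of replication fields to check (manual checks against the pod names from this report, not the operator's actual code):

    # On the new replica: initial sync is complete when the link is up
    # and no sync is in progress.
    redis-cli -h rfr-redis-1 info replication | grep -E 'master_link_status|master_sync_in_progress'

    # On the current master: the replica should appear as online with an
    # offset close to master_repl_offset.
    redis-cli -h rfr-redis-0 info replication | grep -E 'slave0:|master_repl_offset'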

We are going to do a release with these changes in the next few days.

@ese
Member

ese commented Dec 11, 2019

We just released https://github.com/spotahome/redis-operator/releases/tag/v1.0.0-rc.3

If you can test the issue and confirm it is fixed, that would be great, @lazovskiy. Thanks!

@ese ese closed this as completed Dec 18, 2019
@vmrm

vmrm commented Dec 27, 2019

We've just tested it with v1.0.0-rc.4 and everything works well now, @ese.

@chusAlvarez
Contributor

We've just tested it with v1.0.0-rc.4 and everything works well now, @ese.

Thanks for sharing it, @vmrm!!
