
[DPE-5218] Implement restore flow #162

Merged: 28 commits into main, Oct 7, 2024

Conversation

Contributor

@Batalex Batalex commented Sep 25, 2024

This PR implements the flow for restoring a ZooKeeper snapshot.

Other changes

  • Add retry configuration to the S3 resource to handle flaky networks (see the sketch after this list).
  • Recursively update ACLs if the requested zNode already exists.
  • Write a client-jaas.cfg file to streamline access to the ZK cluster from within the machine.
  • Update ACLs so that only the superuser can access the paths of zNodes that are not currently requested by a related client application.
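As an illustration of the first bullet, here is a minimal sketch of what the retry configuration could look like, assuming the S3 resource is built with boto3/botocore (the endpoint, bucket name, and retry values are illustrative, not the charm's actual settings):

```python
# Sketch: attach a retry policy to the S3 resource so transient network
# failures are retried instead of immediately failing a backup/restore call.
import boto3
from botocore.config import Config

s3_config = Config(
    retries={"max_attempts": 5, "mode": "standard"},  # assumed values
    connect_timeout=10,
    read_timeout=30,
)
s3 = boto3.resource(
    "s3",
    endpoint_url="https://s3.example.com",  # placeholder endpoint
    config=s3_config,
)
bucket = s3.Bucket("zookeeper-backups")  # hypothetical bucket name
```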

Use cases

This flow can be used to bootstrap a new cluster using seeded data. Upon relating with a new client application, say Kafka, here is what happens:

  • the ACLs for /kafka and its subnodes, if they already exist, are updated for the new relation
  • Kafka's relation flow recreates its admin and sync users, but we keep other pre-existing users (and other sub-zNodes) around
  • the other zNodes already present in the restored snapshot have their ACLs reset to the ZK superuser only, instead of being deleted (a sketch of the recursive ACL reset is shown after this list)
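For illustration only, a minimal sketch of what the recursive ACL reset could look like, assuming kazoo as the ZooKeeper client (the helper name and credentials are hypothetical, not the charm's actual code):

```python
# Sketch: walk a zNode subtree and reset its ACLs so that only the given
# (e.g. superuser) digest credentials retain access.
from kazoo.client import KazooClient
from kazoo.security import make_digest_acl


def reset_acls_recursively(client: KazooClient, path: str, username: str, password: str) -> None:
    """Grant full access on `path` and all of its children to `username` only."""
    acl = make_digest_acl(username, password, all=True)
    client.set_acls(path, [acl])
    for child in client.get_children(path):
        reset_acls_recursively(client, f"{path.rstrip('/')}/{child}", username, password)
```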

With a minimal change to the Kafka charm, we can also restore a snapshot on an already related ZK application. The chain of events is as follows:

  • Clients are disconnected by removing the endpoints field from the databag. On Kafka, this triggers the relation-changed event and sets the application's status to ZK_NO_DATA (see the sketch below).
  • After restoring the snapshot, we follow the same procedure as above. Re-writing the databag with the required fields triggers the relation-changed event, and Kafka's relation flow then rotates the admin and sync user passwords.
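A rough sketch of the disconnection step, assuming the ops framework (the relation endpoint name and method name are illustrative):

```python
# Sketch: disconnect related clients by clearing the `endpoints` field in the
# application databag; Juju treats a key set to "" as removed, which makes
# Kafka's relation-changed handler see incomplete data and go to ZK_NO_DATA.
def _disconnect_clients(self) -> None:
    if not self.unit.is_leader():
        return
    for relation in self.model.relations["zookeeper"]:  # assumed endpoint name
        relation.data[self.app].update({"endpoints": ""})
```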

Here is the patch for Kafka:

diff --git a/src/events/zookeeper.py b/src/events/zookeeper.py
index 4b4f5a5..5a1adc0 100644
--- a/src/events/zookeeper.py
+++ b/src/events/zookeeper.py
@@ -71,7 +71,7 @@ class ZooKeeperHandler(Object):
             event.defer()
             return
 
-        if not self.charm.state.cluster.internal_user_credentials and self.model.unit.is_leader():
+        if self.model.unit.is_leader():
             # loading the minimum config needed to authenticate to zookeeper
             self.dependent.config_manager.set_zk_jaas_config()
             self.dependent.config_manager.set_server_properties()
@@ -87,6 +87,12 @@ class ZooKeeperHandler(Object):
             for username, password in internal_user_credentials:
                 self.charm.state.cluster.update({f"{username}-password": password})
 
+        # Kafka keeps a meta.properties in every log.dir with a unique ClusterID
+        # this ID is provided by ZK, and removing it on relation-changed allows
+        # re-joining to another ZK cluster while restoring.
+        for storage in self.charm.model.storages["data"]:
+            self.charm.workload.exec(["rm", f"{storage.location}/meta.properties"])
+
         # attempt re-start of Kafka for all units on zookeeper-changed
         # avoids relying on deferred events elsewhere that may not exist after cluster init
         if not self.dependent.healthy:

About the restore flow

  • Since we need to operate the workload on all units, we use a chain of events instead of staying within the context of the action execution.

  • We want to synchronize the units while they follow these steps:

    • stop the workload
    • move the existing data away, and download the snapshot there
    • restart the workload
    • cleanup leftover files

    Therefore, we do not use the rolling ops lib; instead, we use a different peer relation to manage this flow, making sure all units are done with a step before moving on to the next (a minimal sketch of this synchronization is shown after this list).
    - Instead of using the peer cluster relation, we now have a new restore one. There are two main reasons why:
      - it makes following the chain of events much easier, because the dozens of events fired during the flow do not trigger the catch-all cluster-relation-changed handler
      - this separation of concerns guarantees that if anything goes wrong, we can ssh into the machine to solve the issue and then continue the flow using juju resolve. This is also why we do not have defers or try/except around the exec calls.
    NEW: we now use the peer cluster relation instead of the dedicated restore relation from earlier commits.
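A minimal sketch of the synchronization idea (not the actual charm code; the relation property, key names, and step names are illustrative): each unit records the step it has completed in its own databag, and the leader advances the shared step only once every unit has caught up.

```python
# Sketch: rolling per-step synchronization over the peer relation databag.
RESTORE_STEPS = ["stop-workload", "swap-data", "restart-workload", "cleanup"]


def mark_step_done(self, step: str) -> None:
    """Record in this unit's databag that it finished `step`."""
    self.peer_relation.data[self.unit].update({"restore-step": step})


def maybe_advance_step(self) -> None:
    """Leader only: move to the next step once all units report the current one."""
    if not self.unit.is_leader():
        return
    current = self.peer_relation.data[self.app].get("restore-step", RESTORE_STEPS[0])
    all_done = all(
        self.peer_relation.data[unit].get("restore-step") == current
        for unit in self.peer_relation.units | {self.unit}
    )
    if all_done and current != RESTORE_STEPS[-1]:
        next_step = RESTORE_STEPS[RESTORE_STEPS.index(current) + 1]
        self.peer_relation.data[self.app].update({"restore-step": next_step})
```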

TODO

  • Add action message after initiating the restore flow
  • Notify clients upon completion. How can we restart the relation-changed chain of events?
  • Block / defer other operations during restore
  • Add integration tests
  • Move things around? Should the new peer relation interface go somewhere else, like core/cluster or core/models?
  • Reset ACLs instead of removing the zNodes upon relation departure, following the team discussion on MM
  • Clean up old files

Demo

demo_restore.mp4

@Batalex Batalex self-assigned this Sep 25, 2024
@Batalex Batalex force-pushed the feat/dpe-5218-restore-flow branch from 8bc5ffc to 5e498ac on September 27, 2024 15:34
@Batalex Batalex marked this pull request as ready for review September 27, 2024 15:39
@Batalex Batalex force-pushed the feat/dpe-5218-restore-flow branch from 1d939aa to c78ac8f on September 30, 2024 07:36
Contributor

@deusebio deusebio left a comment


I had a look and the code looks very well structured and clear to me. I don't have any major concerns with this PR. Actually kudos for the work here, and for the very detailed PR description (even with a demo!!!)

I have only one small request (which is outside the scope of the ticket), and we could maybe do this in the next pulse: add some integration tests for Kafka where we actually test restoring a backup both with and without an existing relation, and make sure that the credentials of a given client do not need to be changed (meaning Kafka still retrieves the correct credentials/ACLs/lags from ZooKeeper). I have already seen that there is a PR open with the required changes; adding the tests there is probably very appropriate.

Contributor

@marcoppenheimer marcoppenheimer left a comment


This was an excellent PR 👍🏾. I mentioned it in the comments, but I really appreciate how readable the 'relation-changed-restore' flow was. Overall, I'm fine to approve now, but I'll hold off until the comments have been addressed. They're mostly non-blocking, but I expect a few conversations.

Contributor

@zmraul zmraul left a comment


LGTM! The RestoreStep is a neat abstraction for rolling the databag keys for units/app.

@Batalex Batalex merged commit b0b9bc0 into main Oct 7, 2024
17 checks passed
@Batalex Batalex deleted the feat/dpe-5218-restore-flow branch October 7, 2024 07:25