
[DPE-5218] Implement restore flow #162

Merged: 28 commits into main, Oct 7, 2024

Conversation

Contributor

@Batalex Batalex commented Sep 25, 2024

This PR implements the flow for restoring a ZooKeeper snapshot.

Other changes

  • Add retry configuration to the S3 resource to handle flaky networks (see the sketch after this list).
  • Recursively update ACLs if the requested zNode already exists.
  • Write a client-jaas.cfg file to streamline access to the ZK cluster from within the machine.
  • Update ACLs so that only the superuser can access the paths of zNodes that are not currently requested by a related client application.
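As an illustration of the first bullet, here is a minimal sketch of what the retry configuration could look like, assuming the S3 resource is built with boto3/botocore (the endpoint, bucket name, and retry values are illustrative, not the charm's actual settings):

```python
# Sketch: attach a retry policy to the S3 resource so transient network
# failures are retried instead of immediately failing a backup/restore call.
import boto3
from botocore.config import Config

s3_config = Config(
    retries={"max_attempts": 5, "mode": "standard"},  # assumed values
    connect_timeout=10,
    read_timeout=30,
)
s3 = boto3.resource(
    "s3",
    endpoint_url="https://s3.example.com",  # placeholder endpoint
    config=s3_config,
)
bucket = s3.Bucket("zookeeper-backups")  # hypothetical bucket name
```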

Use cases

This flow can be used to bootstrap a new cluster using seeded data. Upon relating with a new client application, say Kafka, here is what happens:

  • the ACLs for /kafka and its subnodes, if they already exist, are updated for the new relation
  • Kafka's relation flow recreates its admin and sync users, but we keep other pre-existing users (and other sub-zNodes) around
  • the other zNodes already present in the restored snapshot have their ACLs reset to the ZK superuser only, instead of being deleted (a sketch of the recursive ACL reset is shown after this list)
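For illustration only, a minimal sketch of what the recursive ACL reset could look like, assuming kazoo as the ZooKeeper client (the helper name and credentials are hypothetical, not the charm's actual code):

```python
# Sketch: walk a zNode subtree and reset its ACLs so that only the given
# (e.g. superuser) digest credentials retain access.
from kazoo.client import KazooClient
from kazoo.security import make_digest_acl


def reset_acls_recursively(client: KazooClient, path: str, username: str, password: str) -> None:
    """Grant full access on `path` and all of its children to `username` only."""
    acl = make_digest_acl(username, password, all=True)
    client.set_acls(path, [acl])
    for child in client.get_children(path):
        reset_acls_recursively(client, f"{path.rstrip('/')}/{child}", username, password)
```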

With a minimal change to the Kafka charm, we can also restore a snapshot on an already related ZK application. The chain of events is as follows:

  • Clients are disconnected by removing the endpoints field from the databag. On Kafka, this triggers the relation-changed event and sets the application's status to ZK_NO_DATA (see the sketch below).
  • After restoring the snapshot, we follow the same procedure as above. Re-writing the databag with the required fields triggers the relation-changed event, and Kafka's relation flow then rotates the admin and sync user passwords.
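A rough sketch of the disconnection step, assuming the ops framework (the relation endpoint name and method name are illustrative):

```python
# Sketch: disconnect related clients by clearing the `endpoints` field in the
# application databag; Juju treats a key set to "" as removed, which makes
# Kafka's relation-changed handler see incomplete data and go to ZK_NO_DATA.
def _disconnect_clients(self) -> None:
    if not self.unit.is_leader():
        return
    for relation in self.model.relations["zookeeper"]:  # assumed endpoint name
        relation.data[self.app].update({"endpoints": ""})
```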

Here is the patch for Kafka:

diff --git a/src/events/zookeeper.py b/src/events/zookeeper.py
index 4b4f5a5..5a1adc0 100644
--- a/src/events/zookeeper.py
+++ b/src/events/zookeeper.py
@@ -71,7 +71,7 @@ class ZooKeeperHandler(Object):
             event.defer()
             return
 
-        if not self.charm.state.cluster.internal_user_credentials and self.model.unit.is_leader():
+        if self.model.unit.is_leader():
             # loading the minimum config needed to authenticate to zookeeper
             self.dependent.config_manager.set_zk_jaas_config()
             self.dependent.config_manager.set_server_properties()
@@ -87,6 +87,12 @@ class ZooKeeperHandler(Object):
             for username, password in internal_user_credentials:
                 self.charm.state.cluster.update({f"{username}-password": password})
 
+        # Kafka keeps a meta.properties in every log.dir with a unique ClusterID
+        # this ID is provided by ZK, and removing it on relation-changed allows
+        # re-joining to another ZK cluster while restoring.
+        for storage in self.charm.model.storages["data"]:
+            self.charm.workload.exec(["rm", f"{storage.location}/meta.properties"])
+
         # attempt re-start of Kafka for all units on zookeeper-changed
         # avoids relying on deferred events elsewhere that may not exist after cluster init
         if not self.dependent.healthy:

About the restore flow

  • Since we need to operate the workload on all units, we use a chain of events instead of staying within the context of the action execution.

  • We want to synchronize the units while they follow these steps:

    • stop the workload
    • move the existing data away, and download the snapshot there
    • restart the workload
    • cleanup leftover files

    Therefore, we do not use the rolling ops lib; instead, we use a different peer relation to manage this flow, making sure all units are done with a step before moving on to the next (a minimal sketch of this synchronization is shown after this list).
    - Instead of using the peer cluster relation, we now have a new restore one. There are two main reasons why:
      - it makes following the chain of events much easier, because the dozens of events fired during the flow do not trigger the catch-all cluster-relation-changed handler
      - this separation of concerns guarantees that if anything goes wrong, we can ssh into the machine to solve the issue and then continue the flow using juju resolve. This is also why we do not have defers or try/except around the exec calls.
    NEW: we now use the peer cluster relation instead of the dedicated restore relation from earlier commits.
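A minimal sketch of the synchronization idea (not the actual charm code; the relation property, key names, and step names are illustrative): each unit records the step it has completed in its own databag, and the leader advances the shared step only once every unit has caught up.

```python
# Sketch: rolling per-step synchronization over the peer relation databag.
RESTORE_STEPS = ["stop-workload", "swap-data", "restart-workload", "cleanup"]


def mark_step_done(self, step: str) -> None:
    """Record in this unit's databag that it finished `step`."""
    self.peer_relation.data[self.unit].update({"restore-step": step})


def maybe_advance_step(self) -> None:
    """Leader only: move to the next step once all units report the current one."""
    if not self.unit.is_leader():
        return
    current = self.peer_relation.data[self.app].get("restore-step", RESTORE_STEPS[0])
    all_done = all(
        self.peer_relation.data[unit].get("restore-step") == current
        for unit in self.peer_relation.units | {self.unit}
    )
    if all_done and current != RESTORE_STEPS[-1]:
        next_step = RESTORE_STEPS[RESTORE_STEPS.index(current) + 1]
        self.peer_relation.data[self.app].update({"restore-step": next_step})
```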

TODO

  • Add action message after initiating the restore flow
  • Notify clients upon completion. How can we restart the relation-changed chain of events?
  • Block / defer other operations during restore
  • Add integration tests
  • Move things around? Should the new peer relation interface go somewhere else, like core/cluster or core/models?
  • Reset ACLs instead of removing the zNodes upon relation departure, following the team discussion on MM
  • Clean up old files

Demo

demo_restore.mp4

@Batalex Batalex self-assigned this Sep 25, 2024
@Batalex Batalex force-pushed the feat/dpe-5218-restore-flow branch from 8bc5ffc to 5e498ac on September 27, 2024 15:34
@Batalex Batalex marked this pull request as ready for review September 27, 2024 15:39
@Batalex Batalex force-pushed the feat/dpe-5218-restore-flow branch from 1d939aa to c78ac8f on September 30, 2024 07:36
Contributor

@deusebio deusebio left a comment


I had a look and the code looks very well structured and clear to me. I don't have any major concerns with this PR. Actually kudos for the work here, and for the very detailed PR description (even with a demo!!!)

I have only one small request (which is outside the scope of the ticket), and we could maybe do this in the next pulse: add some integration tests for Kafka where we actually test restoring a backup both with and without an existing relation, and make sure that the credentials of a given client do not need to be changed (meaning Kafka still retrieves the correct credentials/ACLs/lags from ZooKeeper). I have already seen that there is a PR open with the required changes; adding the tests there is probably very appropriate.

Contributor

@marcoppenheimer marcoppenheimer left a comment


This was an excellent PR 👍🏾. I mentioned it in the comments, but I really appreciate how readable the 'relation-changed-restore' flow was. Overall, I'm fine to approve now, but I'll hold off until the comments have been addressed. They're mostly non-blocking, but I expect a few conversations.

Contributor

@zmraul zmraul left a comment


LGTM! The RestoreStep is a neat abstraction for rolling the databag keys for units/app.

@Batalex Batalex merged commit b0b9bc0 into main Oct 7, 2024
17 checks passed
@Batalex Batalex deleted the feat/dpe-5218-restore-flow branch October 7, 2024 07:25