
Support a checkpointing mechanism for kafkasql storage #3194

Closed
jsenko opened this issue Mar 14, 2023 · 4 comments · Fixed by #4665

Comments

@jsenko (Member) commented Mar 14, 2023

Feature or Problem Description

Support a checkpointing mechanism for kafkasql storage. This will help with startup times.

Proposed Solution

TBD

Additional Context

Related Apicurio/apicurio-registry-operator#197

@jsenko added the type/enhancement (New feature or request) label Mar 14, 2023
@jsenko changed the title from "Support a checkpointing mechanism in kafkasql storage" to "Support a checkpointing mechanism for kafkasql storage" Mar 14, 2023
@EricWittmann (Member) commented:
Possibly use H2 online backup/restore to implement checkpointing:

http://www.h2database.com/html/tutorial.html#upgrade_backup_restore

We would need to back up the DB to a persistent volume and also store the location (offset) in the Kafka topic (journal) that corresponds to the checkpoint. Then, on pod restart, we can restore the DB from the saved backup, skip ahead in the topic to the correct message, and pick up consuming messages from there.
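A minimal sketch of how H2's online `BACKUP` statement and the `org.h2.tools.Restore` tool could be used for this. The paths, database name, and helper structure are assumptions for illustration, not Registry code:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class H2CheckpointSketch {

    // Create an online backup of the embedded H2 database into a ZIP file
    // on a persistent volume; the database stays open while this runs.
    static void backup(Connection conn, String zipPath) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            stmt.execute("BACKUP TO '" + zipPath + "'");
        }
    }

    // On pod restart, restore the database files from the ZIP before the
    // H2 connection is opened, then resume consuming kafkasql-journal from
    // the offset recorded alongside the backup. (Restore tool usage is an
    // assumption; see the H2 tools documentation for the exact invocation.)
    static void restore(String zipPath, String dataDir, String dbName) throws Exception {
        org.h2.tools.Restore.execute(zipPath, dataDir, dbName);
    }

    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:h2:/data/registry");
        backup(conn, "/checkpoints/registry-checkpoint.zip");
        // ... later, on a fresh pod, before opening the connection:
        // restore("/checkpoints/registry-checkpoint.zip", "/data", "registry");
    }
}
```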

@EricWittmann (Member) commented:
Note: this should hopefully be easier to do since we sequence writes to the H2 DB. We should probably use the existing consumer thread to perform this checkpointing task. Perhaps by simply adding a message to the topic when the checkpoint should occur.
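A rough sketch of what performing the checkpoint on the existing consumer thread could look like. The journal message format, the `CHECKPOINT` key, and the helper methods are hypothetical, not the actual kafkasql consumer:

```java
import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class JournalConsumerSketch {

    public static void run(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(List.of("kafkasql-journal"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                if ("CHECKPOINT".equals(record.key())) {
                    // Because this same thread applies every journal message to H2,
                    // no writes can race with the snapshot taken here.
                    createCheckpoint(record.offset());
                } else {
                    applyToDatabase(record.value());
                }
            }
        }
    }

    static void createCheckpoint(long journalOffset) { /* back up H2, record the offset */ }
    static void applyToDatabase(String message) { /* apply the journal message to H2 */ }
}
```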

@jsenko (Member, Author) commented Mar 14, 2023

Suggested solution using Kafka to store the checkpoint data

Two new topics kafkasql-checkpoint and kafkasql-checkpoint-coordination will be added.

The first topic will contain the checkpoint data itself (analogous to the export functionality we have currently, but dumping the data into a topic).

The second topic will ensure that only a single Registry instance performs checkpointing at any given time, and will provide metadata for loading checkpoints. This works by having an instance announce its intention to create the checkpoint, so that the other pods do not start one as well. It requires some additional recovery logic in case the instance fails or is restarted during the process.

Creating a checkpoint

  1. Only a single checkpoint may be created at a time. When an instance decides to create a checkpoint, it sends a "lock request" message to kafkasql-checkpoint-coordination. If multiple instances send this message at the same time, the first instance wins and sends the message again. After a short period without another instance sending a request, it initiates the checkpointing process by sending a "lock" message (a coordination sketch follows this list).

  2. The instance performing a checkpoint must make sure its local database is frozen. It marks itself as not ready (via its readiness probe), so it stops receiving requests, and it disables writes to the database. To keep the Registry available during checkpointing, the deployment must therefore consist of at least 2 pods.

  3. It determines the last offset it has read in the kafkasql-journal topic, so older messages can be skipped in the future. It also determines the offset of the last message in the kafkasql-checkpoint topic, to tell future readers where the latest checkpoint starts. These two offsets are written to kafkasql-checkpoint-coordination.

  4. Using logic similar to the export feature, the data is written to kafkasql-checkpoint.

  5. When the checkpointing process succeeds, a commit message is written to the kafkasql-checkpoint-coordination topic. This message contains the offset of the last message written to kafkasql-checkpoint, so future readers do not accidentally read newer checkpoint data. It is also possible to include the metadata from step 3 here and have a single commit message.

  6. The instance unfreezes itself, and continues reading from the kafkasql-journal topic.

Each coordination message contains a unique instance ID that is regenerated after restart.
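A minimal sketch of the coordination messages from steps 1, 3, and 5 above. The message encoding (message type in the key, a simple payload in the value) and the helper class are assumptions used only for illustration:

```java
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CheckpointCoordinationSketch {

    static final String COORDINATION_TOPIC = "kafkasql-checkpoint-coordination";

    final String instanceId = UUID.randomUUID().toString(); // regenerated on restart
    final KafkaProducer<String, String> producer;

    CheckpointCoordinationSketch(Properties props) {
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
    }

    // Step 1: announce the intention to checkpoint; resend if we win the race.
    void sendLockRequest() throws Exception {
        producer.send(new ProducerRecord<>(COORDINATION_TOPIC, "LOCK_REQUEST", instanceId)).get();
    }

    // Step 1 (continued): take the lock once no competing request has been seen.
    void sendLock() throws Exception {
        producer.send(new ProducerRecord<>(COORDINATION_TOPIC, "LOCK", instanceId)).get();
    }

    // Steps 3 + 5 combined into a single commit message: the journal offset the
    // checkpoint covers plus the offset range of the data in kafkasql-checkpoint.
    void sendCommit(long journalOffset, long checkpointStart, long checkpointEnd) throws Exception {
        String payload = instanceId + ":" + journalOffset + ":" + checkpointStart + ":" + checkpointEnd;
        producer.send(new ProducerRecord<>(COORDINATION_TOPIC, "COMMIT", payload)).get();
    }
}
```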

Reading a checkpoint

  1. An instance reads the kafkasql-checkpoint-coordination topic until it sees the latest commit message. It then imports the data based on the provided offsets and continues by reading the rest of the kafkasql-journal topic.
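A sketch of how a starting instance might skip ahead in the journal after importing the checkpoint. Single-partition topics and the importCheckpoint() helper are assumptions here:

```java
import java.util.List;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class CheckpointReaderSketch {

    static void resumeFromCheckpoint(KafkaConsumer<String, String> journalConsumer,
                                     long checkpointStart, long checkpointEnd,
                                     long journalOffsetAfterCheckpoint) {
        // Import the checkpoint data from kafkasql-checkpoint between the
        // start and end offsets taken from the commit message (not shown).
        importCheckpoint(checkpointStart, checkpointEnd);

        // Skip the journal messages already covered by the checkpoint.
        TopicPartition journal = new TopicPartition("kafkasql-journal", 0);
        journalConsumer.assign(List.of(journal));
        journalConsumer.seek(journal, journalOffsetAfterCheckpoint + 1);
        // ... continue the normal consume/apply loop from here.
    }

    static void importCheckpoint(long start, long end) { /* read kafkasql-checkpoint */ }
}
```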

Failures and recovery

  • If the instance fails after writing the commit message, it is not a problem, since the process is done.

  • If the instance fails after sending the "lock request" message: since no "lock" message follows, other instances are free to assume, after a timeout, that the initiating instance has failed, and may attempt to start the process themselves.

  • If the instance fails after sending the "lock" message, there needs to be a mechanism to safely determine that the process has failed. I suggest a timeout here as well, but since we cannot reliably determine how long the process will last, the checkpointing instance must periodically send a "lock refresh" message before the previous lock times out. It must also abort the process if it sees a "lock request" message from another instance. In this case the data at the end of the kafkasql-checkpoint topic may be incomplete, but since we specify both the start and end offsets in the commit message, the incomplete data will simply be skipped.
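A small sketch of the lock-refresh/timeout idea. The timeout value, refresh cadence, and helper names are illustrative assumptions only:

```java
import java.time.Duration;
import java.time.Instant;

public class LockTimeoutSketch {

    static final Duration LOCK_TIMEOUT = Duration.ofSeconds(30);

    // Used by instances that are not checkpointing, based on the timestamp of
    // the most recent LOCK or LOCK_REFRESH message they have read.
    static boolean lockExpired(Instant lastLockRefresh) {
        return Instant.now().isAfter(lastLockRefresh.plus(LOCK_TIMEOUT));
    }

    // Used by the checkpointing instance: refresh well before the timeout.
    static boolean shouldRefresh(Instant lastRefreshSent) {
        return Instant.now().isAfter(lastRefreshSent.plus(LOCK_TIMEOUT.dividedBy(2)));
    }
}
```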

@EricWittmann (Member) commented:

Very interesting proposal. I have some general comments that might inspire you to simplify/modify your ideas. Or maybe not. :)

  • Consider that we might not need to lock the instance performing the checkpoint. Because we sequence all writes to the H2 database, locking for writes will happen naturally if the instance performing the checkpoint does that work on the kafka consumer thread. And in that case, read operations will continue to work. The risk is that while that's happening, the instance doing the checkpointing will start to fall behind on messages in kafkasql-journal.

  • I think the checkpoint data could just be an actual export ZIP file that we store somewhere shared, like an S3 bucket. We might be able to get away with just a single additional topic kafkasql-checkpoints, where each message is the checkpoint metadata and a URL to the exported ZIP file. I'd also be happy with an H2 dump file, if it turns out that's faster to create, but an actual export ZIP is attractive because it would open up the possibility of manually adding a checkpoint to the topic. It would also make the checkpoints more portable.

  • If we stagger checkpoint creation, we might not need as much coordination, with locking and whatnot. Meaning that every instance can attempt to create a checkpoint every X minutes, but if a checkpoint was created recently enough, it can just skip it (see the sketch after this list). This might result in the same instance creating checkpoints every time, but that would be fine. Doing this might mean we don't need the coordination topic - just the checkpoints topic, since all instances can have a checkpoints consumer thread, which does nothing except read all messages on the checkpoints topic so that the instance knows when the most recent checkpoint was created.

  • A big question we need answered is how long it takes to create a checkpoint. This time will of course grow as the DB grows, so we would need some real metrics on it. That will inform the decision about whether the instance needs to mark itself as "not ready" while the checkpoint is being created.
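A sketch of the staggered approach suggested above: every instance may attempt a checkpoint periodically but skips it if a recent enough checkpoint is already visible on the checkpoints topic. The interval, topic name kafkasql-checkpoints, and helper names are assumptions:

```java
import java.time.Duration;
import java.time.Instant;

public class StaggeredCheckpointSketch {

    static final Duration CHECKPOINT_INTERVAL = Duration.ofMinutes(30);

    // lastCheckpointCreated comes from a checkpoints consumer thread that reads
    // every message on the kafkasql-checkpoints topic; null means none seen yet.
    static boolean shouldCreateCheckpoint(Instant lastCheckpointCreated) {
        return lastCheckpointCreated == null
                || Instant.now().isAfter(lastCheckpointCreated.plus(CHECKPOINT_INTERVAL));
    }
}
```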

@carlesarnal carlesarnal linked a pull request May 15, 2024 that will close this issue
@carlesarnal carlesarnal moved this to In Progress in Registry 3.0 May 15, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Registry 3.0 May 29, 2024