
Support a checkpointing mechanism for kafkasql storage #3194

Closed
jsenko opened this issue Mar 14, 2023 · 4 comments · Fixed by #4665

Comments

@jsenko (Member) commented Mar 14, 2023

Feature or Problem Description

Support a checkpointing mechanism for kafkasql storage. This will help with startup times.

Proposed Solution

TBD

Additional Context

Related Apicurio/apicurio-registry-operator#197

@jsenko added the type/enhancement (New feature or request) label Mar 14, 2023
@jsenko changed the title from "Support a checkpointing mechanism in kafkasql storage" to "Support a checkpointing mechanism for kafkasql storage" Mar 14, 2023
@EricWittmann (Member) commented:
Possibly use H2 online backup/restore to implement checkpointing:

http://www.h2database.com/html/tutorial.html#upgrade_backup_restore

We would need to back up the DB to a persistent volume and also store the location (offset) in the Kafka topic (journal) that corresponds to the checkpoint. Then, on pod restart, we can restore the DB from the saved backup, skip ahead in the topic to the correct message, and pick up consuming messages from there.
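A minimal sketch of how H2's online `BACKUP` statement and the `org.h2.tools.Restore` tool could be used for this. The paths, database name, and helper structure are assumptions for illustration, not Registry code:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class H2CheckpointSketch {

    // Create an online backup of the embedded H2 database into a ZIP file
    // on a persistent volume; the database stays open while this runs.
    static void backup(Connection conn, String zipPath) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            stmt.execute("BACKUP TO '" + zipPath + "'");
        }
    }

    // On pod restart, restore the database files from the ZIP before the
    // H2 connection is opened, then resume consuming kafkasql-journal from
    // the offset recorded alongside the backup. (Restore tool usage is an
    // assumption; see the H2 tools documentation for the exact invocation.)
    static void restore(String zipPath, String dataDir, String dbName) throws Exception {
        org.h2.tools.Restore.execute(zipPath, dataDir, dbName);
    }

    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:h2:/data/registry");
        backup(conn, "/checkpoints/registry-checkpoint.zip");
        // ... later, on a fresh pod, before opening the connection:
        // restore("/checkpoints/registry-checkpoint.zip", "/data", "registry");
    }
}
```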

@EricWittmann (Member) commented:
Note: this should hopefully be easier to do since we sequence writes to the H2 DB. We should probably use the existing consumer thread to perform this checkpointing task. Perhaps by simply adding a message to the topic when the checkpoint should occur.
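A rough sketch of what performing the checkpoint on the existing consumer thread could look like. The journal message format, the `CHECKPOINT` key, and the helper methods are hypothetical, not the actual kafkasql consumer:

```java
import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class JournalConsumerSketch {

    public static void run(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(List.of("kafkasql-journal"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                if ("CHECKPOINT".equals(record.key())) {
                    // Because this same thread applies every journal message to H2,
                    // no writes can race with the snapshot taken here.
                    createCheckpoint(record.offset());
                } else {
                    applyToDatabase(record.value());
                }
            }
        }
    }

    static void createCheckpoint(long journalOffset) { /* back up H2, record the offset */ }
    static void applyToDatabase(String message) { /* apply the journal message to H2 */ }
}
```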

@jsenko (Member, Author) commented Mar 14, 2023

Suggested solution using Kafka to store the checkpoint data

Two new topics kafkasql-checkpoint and kafkasql-checkpoint-coordination will be added.

The first topic will contain the checkpoint data itself (analogous to the export functionality we have currently, but dumping the data into a topic).

The second topic will ensure that only a single Registry instance performs checkpointing at any given time, and will provide metadata for loading checkpoints. This works by having an instance announce its intention to create the checkpoint, so that the other pods do not start one as well. It requires some additional recovery logic in case the instance fails or is restarted during the process.

Creating a checkpoint

  1. Only a single checkpoint may be created at a time. When an instance decides to create a checkpoint, it sends a "lock request" message to kafkasql-checkpoint-coordination. If multiple instances send this message at the same time, the first instance wins and sends the message again. After a short period without another instance sending a request, it initiates the checkpointing process by sending a "lock" message (a coordination sketch follows this list).

  2. The instance performing a checkpoint must make sure its local database is frozen. It marks itself as not ready (via its readiness probe), so it stops receiving requests, and it disables writes to the database. To keep the Registry available during checkpointing, the deployment must therefore consist of at least 2 pods.

  3. It determines the last offset it has read in the kafkasql-journal topic, so older messages can be skipped in the future. It also determines the offset of the last message in the kafkasql-checkpoint topic, to tell future readers where the latest checkpoint starts. These two offsets are written to kafkasql-checkpoint-coordination.

  4. Using logic similar to the export feature, the data is written to kafkasql-checkpoint.

  5. When the checkpointing process succeeds, a commit message is written to the kafkasql-checkpoint-coordination topic. This message contains the offset of the last message written to kafkasql-checkpoint, so future readers do not accidentally read newer checkpoint data. It is also possible to include the metadata from step 3 here and have a single commit message.

  6. The instance unfreezes itself, and continues reading from the kafkasql-journal topic.

Each coordination message contains a unique instance ID that is regenerated after restart.
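A minimal sketch of the coordination messages from steps 1, 3, and 5 above. The message encoding (message type in the key, a simple payload in the value) and the helper class are assumptions used only for illustration:

```java
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CheckpointCoordinationSketch {

    static final String COORDINATION_TOPIC = "kafkasql-checkpoint-coordination";

    final String instanceId = UUID.randomUUID().toString(); // regenerated on restart
    final KafkaProducer<String, String> producer;

    CheckpointCoordinationSketch(Properties props) {
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
    }

    // Step 1: announce the intention to checkpoint; resend if we win the race.
    void sendLockRequest() throws Exception {
        producer.send(new ProducerRecord<>(COORDINATION_TOPIC, "LOCK_REQUEST", instanceId)).get();
    }

    // Step 1 (continued): take the lock once no competing request has been seen.
    void sendLock() throws Exception {
        producer.send(new ProducerRecord<>(COORDINATION_TOPIC, "LOCK", instanceId)).get();
    }

    // Steps 3 + 5 combined into a single commit message: the journal offset the
    // checkpoint covers plus the offset range of the data in kafkasql-checkpoint.
    void sendCommit(long journalOffset, long checkpointStart, long checkpointEnd) throws Exception {
        String payload = instanceId + ":" + journalOffset + ":" + checkpointStart + ":" + checkpointEnd;
        producer.send(new ProducerRecord<>(COORDINATION_TOPIC, "COMMIT", payload)).get();
    }
}
```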

Reading a checkpoint

  1. An instance reads the kafkasql-checkpoint-coordination topic until it sees the latest commit message. It then imports the data based on the provided offsets and continues by reading the rest of the kafkasql-journal topic.
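A sketch of how a starting instance might skip ahead in the journal after importing the checkpoint. Single-partition topics and the importCheckpoint() helper are assumptions here:

```java
import java.util.List;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class CheckpointReaderSketch {

    static void resumeFromCheckpoint(KafkaConsumer<String, String> journalConsumer,
                                     long checkpointStart, long checkpointEnd,
                                     long journalOffsetAfterCheckpoint) {
        // Import the checkpoint data from kafkasql-checkpoint between the
        // start and end offsets taken from the commit message (not shown).
        importCheckpoint(checkpointStart, checkpointEnd);

        // Skip the journal messages already covered by the checkpoint.
        TopicPartition journal = new TopicPartition("kafkasql-journal", 0);
        journalConsumer.assign(List.of(journal));
        journalConsumer.seek(journal, journalOffsetAfterCheckpoint + 1);
        // ... continue the normal consume/apply loop from here.
    }

    static void importCheckpoint(long start, long end) { /* read kafkasql-checkpoint */ }
}
```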

Failures and recovery

  • If the instance fails after writing the commit message, it is not a problem, since the process is done.

  • If the instance fails after sending the "lock request" message: since no "lock" message follows, other instances are free to assume, after a timeout, that the initiating instance has failed, and may attempt to start the process themselves.

  • If the instance fails after sending the "lock" message, there needs to be a mechanism to safely determine that the process has failed. I suggest a timeout here as well, but since we cannot reliably determine how long the process will last, the checkpointing instance must periodically send a "lock refresh" message before the previous lock times out. It must also abort the process if it sees a "lock request" message from another instance. In this case the data at the end of the kafkasql-checkpoint topic may be incomplete, but since we specify both the start and end offsets in the commit message, the incomplete data will simply be skipped.
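A small sketch of the lock-refresh/timeout idea. The timeout value, refresh cadence, and helper names are illustrative assumptions only:

```java
import java.time.Duration;
import java.time.Instant;

public class LockTimeoutSketch {

    static final Duration LOCK_TIMEOUT = Duration.ofSeconds(30);

    // Used by instances that are not checkpointing, based on the timestamp of
    // the most recent LOCK or LOCK_REFRESH message they have read.
    static boolean lockExpired(Instant lastLockRefresh) {
        return Instant.now().isAfter(lastLockRefresh.plus(LOCK_TIMEOUT));
    }

    // Used by the checkpointing instance: refresh well before the timeout.
    static boolean shouldRefresh(Instant lastRefreshSent) {
        return Instant.now().isAfter(lastRefreshSent.plus(LOCK_TIMEOUT.dividedBy(2)));
    }
}
```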

@EricWittmann (Member) commented:

Very interesting proposal. I have some general comments that might inspire you to simplify/modify your ideas. Or maybe not. :)

  • Consider that we might not need to lock the instance performing the checkpoint. Because we sequence all writes to the H2 database, locking for writes will happen naturally if the instance performing the checkpoint does that work on the kafka consumer thread. And in that case, read operations will continue to work. The risk is that while that's happening, the instance doing the checkpointing will start to fall behind on messages in kafkasql-journal.

  • I think the checkpoint data could just be an actual export ZIP file that we store somewhere shared, like an S3 bucket. We might be able to get away with just a single additional topic kafkasql-checkpoints, where each message is the checkpoint metadata and a URL to the exported ZIP file. I'd also be happy with an H2 dump file, if it turns out that's faster to create, but an actual export ZIP is attractive because it would open up the possibility of manually adding a checkpoint to the topic. It would also make the checkpoints more portable.

  • If we stagger checkpoint creation, we might not need as much coordination, with locking and whatnot. Meaning that every instance can attempt to create a checkpoint every X minutes, but if a checkpoint was created recently enough, it can just skip it (see the sketch after this list). This might result in the same instance creating checkpoints every time, but that would be fine. Doing this might mean we don't need the coordination topic - just the checkpoints topic, since all instances can have a checkpoints consumer thread, which does nothing except read all messages on the checkpoints topic so that the instance knows when the most recent checkpoint was created.

  • A big question we need answered is how long it takes to create a checkpoint. This time will of course grow as the DB grows, so we would need some real metrics on it. That will inform the decision about whether the instance needs to mark itself as "not ready" while the checkpoint is being created.
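A sketch of the staggered approach suggested above: every instance may attempt a checkpoint periodically but skips it if a recent enough checkpoint is already visible on the checkpoints topic. The interval, topic name kafkasql-checkpoints, and helper names are assumptions:

```java
import java.time.Duration;
import java.time.Instant;

public class StaggeredCheckpointSketch {

    static final Duration CHECKPOINT_INTERVAL = Duration.ofMinutes(30);

    // lastCheckpointCreated comes from a checkpoints consumer thread that reads
    // every message on the kafkasql-checkpoints topic; null means none seen yet.
    static boolean shouldCreateCheckpoint(Instant lastCheckpointCreated) {
        return lastCheckpointCreated == null
                || Instant.now().isAfter(lastCheckpointCreated.plus(CHECKPOINT_INTERVAL));
    }
}
```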

@carlesarnal carlesarnal linked a pull request May 15, 2024 that will close this issue
@carlesarnal carlesarnal moved this to In Progress in Registry 3.0 May 15, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Registry 3.0 May 29, 2024