
Improve snapshot building #823

Closed
drmingdrmer opened this issue May 3, 2023 · 16 comments

@drmingdrmer
Member

drmingdrmer commented May 3, 2023

If it's helpful, I can contribute some long running snapshots to your test suite if you give me some pointers (where to put it, etc). That's probably the best way to make our usage quirks benefit the protocol.

I appreciate your offer to contribute some long-running snapshots to our test suite. :D

The conditions for testing the building of a snapshot are defined using BlockOperation, which simulates different delays.

  • Line 245 simulates a non-exclusive block, similar to your situation, where snapshot creation does not complete but also does not hold an exclusive lock.
  • On the other hand, line 258 acquires an exclusive lock to simulate snapshot building that takes a long time to complete.

You can add more entries to BlockOperation to emulate other conditions in your case.

https://github.com/datafuselabs/openraft/blob/fb5bb36b025aa0bd27014ea79315788acbd5bd80/memstore/src/lib.rs#L238-L262
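The delay-injection idea can be sketched without the real memstore types. Below is a hypothetical, dependency-free rendition of the pattern: an enum names the code path to block, and the store sleeps for the configured duration when that path runs. All type and method names here are illustrative stand-ins for the real code in memstore/src/lib.rs.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};
use std::time::Duration;

/// Stand-in for memstore's `BlockOperation`: names a point in the storage
/// code path where a test wants to inject a delay.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum BlockOperation {
    /// Delay inside snapshot building without holding the state-machine lock.
    DelayBuildingSnapshot,
    /// Delay inside snapshot building while holding the state-machine lock.
    BuildSnapshot,
}

#[derive(Default)]
struct MemStore {
    /// Map from operation to how long to block when that operation runs.
    block: Arc<Mutex<BTreeMap<BlockOperation, Duration>>>,
}

impl MemStore {
    fn set_blocking(&self, op: BlockOperation, d: Duration) {
        self.block.lock().unwrap().insert(op, d);
    }

    fn get_blocking(&self, op: &BlockOperation) -> Option<Duration> {
        self.block.lock().unwrap().get(op).cloned()
    }

    /// Sketch of a snapshot build that honors the injected delays.
    fn build_snapshot(&self) {
        if let Some(d) = self.get_blocking(&BlockOperation::DelayBuildingSnapshot) {
            std::thread::sleep(d); // no lock held: other operations proceed
        }
        // ... the real store would serialize the state machine here,
        // sleeping again under the lock if `BuildSnapshot` is set ...
    }
}
```

A test can then configure any combination of delays to emulate its own snapshot-timing quirks.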

In this case, the state machine conditions that the snapshot is waiting for depend on future normal entries, which is why it was deadlocking.

Right.

I think your TODO about aborting the snapshot task is important as well. It may be worth considering letting the implementer raise a non-fatal error to either skip the snapshot interval, or retry the snapshot at the last applied log.

Yes, letting the application abort snapshot building can help. However, it introduces more interaction between Openraft and the application, and I believe what you actually need is a more adaptable snapshot-policy configuration.

At present, Openraft offers a very basic policy that merely checks whether the last snapshot lags behind the last applied log index. What criteria does your application use to decide when to build a snapshot? It could be advantageous to accept a user-defined Fn() -> bool to inform Openraft when to create a snapshot.

https://github.com/datafuselabs/openraft/blob/fb5bb36b025aa0bd27014ea79315788acbd5bd80/openraft/src/config/config.rs#L154

https://github.com/datafuselabs/openraft/blob/fb5bb36b025aa0bd27014ea79315788acbd5bd80/openraft/src/config/config.rs#L25-L40

Originally posted by @fredfortier in #596 (comment)
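The Fn() -> bool idea above could look roughly like this. This is a hypothetical sketch, not openraft's real API: the Custom variant, the SnapshotState struct, and should_snapshot are invented names for illustration only.

```rust
/// Hypothetical extension of a snapshot policy: in addition to the existing
/// logs-since-last style, let the application decide via a callback.
enum SnapshotPolicy {
    /// Existing style: build a snapshot every N applied logs.
    LogsSinceLast(u64),
    /// Proposed style: ask the application after each batch of applied entries.
    Custom(Box<dyn Fn(&SnapshotState) -> bool + Send>),
}

/// Minimal view of the state such a callback might inspect (assumed shape).
struct SnapshotState {
    last_applied: u64,
    last_snapshot_log: u64,
}

/// Evaluate the policy against the current state.
fn should_snapshot(policy: &SnapshotPolicy, st: &SnapshotState) -> bool {
    match policy {
        SnapshotPolicy::LogsSinceLast(n) => {
            st.last_applied.saturating_sub(st.last_snapshot_log) >= *n
        }
        SnapshotPolicy::Custom(f) => f(st),
    }
}
```

The callback form would let an application encode time-based or dirty-data-based triggers without Openraft knowing about them.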


@fredfortier

fredfortier commented May 4, 2023

At present, Openraft offers a very basic policy that merely verifies if the last snapshot is lagging behind last applied log index. What criteria does your application use to build a snapshot?

I agree with you. It seems the snapshot issues we had stem from worrying about the timing of snapshots instead of aligning them with our internal state.

In a nutshell, our state uses a snapshot persistence pattern, meaning it's ephemeral but a snapshot is taken at regular intervals. To load the state, one must restore the snapshot and then replay the inputs (which are the raft normal entries) that follow, up to the desired point.

The problem is that our state snapshots are time-based (for example, the state may be backed up every 10 minutes) while openraft's policy is based on entries. Our internal state snapshots are more frequent than the Raft snapshots need to be. Also, there's asynchronous execution between the openraft state machine and the final state, which complicates the alignment. To work around these differences, when openraft's build_snapshot begins, we wait for the next internal state snapshot to become available, and package it along with the normal entries that follow it, up to the last applied index. This gives a relatively small list of inputs trailing the backup, and it's a straightforward rule, but the build_snapshot task could be waiting for a backup for a few minutes. Also, there's no way to handle exceptions other than raising a fatal error.
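The packaging rule described above can be sketched with made-up types (Backup, PackagedSnapshot, and package_snapshot are illustrative, not real APIs): a snapshot is one internal backup plus the entries that follow it, up to the last applied index.

```rust
/// An internal state backup produced by the executor.
#[derive(Clone, Debug, PartialEq)]
struct Backup {
    /// Last raft log index whose effect is contained in the backup.
    upto_index: u64,
    bytes: Vec<u8>,
}

/// What gets handed to openraft as the snapshot payload.
#[derive(Debug, PartialEq)]
struct PackagedSnapshot {
    backup: Backup,
    /// Indexes of the trailing entries to replay on restore
    /// (entry payloads omitted for brevity).
    trailing: Vec<u64>,
}

/// Package a backup with the entries `(backup.upto_index, last_applied]`,
/// which replay on top of the backup during restore.
fn package_snapshot(backup: Backup, last_applied: u64) -> PackagedSnapshot {
    let trailing = (backup.upto_index + 1..=last_applied).collect();
    PackagedSnapshot { backup, trailing }
}
```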

Perhaps the easiest way to align openraft snapshots with the state is simply to add a snapshot policy that takes a closure like Fn(RaftStorage) -> bool to decide whether to trigger the snapshot. This closure would be called after every N entries. This would allow us to implement our specific snapshot trigger.

@schreter
Collaborator

schreter commented May 4, 2023

Perhaps the easiest way to align openraft snapshots with the state is simply to add a snapshot policy that takes a closure like Fn(RaftStorage) -> bool to decide whether to trigger the snapshot. This closure would be called after every N entries. This would allow us to implement our specific snapshot trigger.

+1, we have a very similar snapshot pattern: snapshots are triggered not by the amount of log, but by time plus the amount of dirty data.

@fredfortier

fredfortier commented May 5, 2023

I notice the Raft struct has this public method now (at least I believe it's relatively new).

    /// Trigger to build a snapshot at once and return at once.
    ///
    /// Returns error when RaftCore has Fatal error, e.g. shut down or having storage error.
    pub async fn trigger_snapshot(&self) -> Result<(), Fatal<C::NodeId>> {
        self.send_external_command(ExternalCommand::Snapshot, "trigger_snapshot").await
    }

This seems good enough for me. Basically, turn off scheduled snapshots and trigger them at the opportune time. @drmingdrmer Does this make sense or is it a bad idea?
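The "turn off scheduled snapshots and trigger them on a timer" approach might be structured like the sketch below. It uses a plain thread and channel instead of a tokio task, and Cmd::TriggerSnapshot stands in for an actual Raft::trigger_snapshot() call; all names here are illustrative.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Commands the application's control loop would act on.
/// `TriggerSnapshot` stands in for calling `Raft::trigger_snapshot()`.
enum Cmd {
    TriggerSnapshot,
    Shutdown,
}

/// Spawn a timer that requests a snapshot every `every`, `ticks` times,
/// then asks the receiver to shut down. In a real app this would be a
/// tokio task awaiting `trigger_snapshot()` directly.
fn spawn_timer(tx: mpsc::Sender<Cmd>, every: Duration, ticks: u32) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for _ in 0..ticks {
            thread::sleep(every);
            if tx.send(Cmd::TriggerSnapshot).is_err() {
                return; // receiver gone; stop the timer
            }
        }
        let _ = tx.send(Cmd::Shutdown);
    })
}
```

The scheduled policy would be set to a very large threshold (or a future SnapshotPolicy::Never), so only these timed triggers build snapshots.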

@drmingdrmer
Member Author

drmingdrmer commented May 6, 2023

I notice the Raft struct has this public method now (at least I believe it's relatively new).

    /// Trigger to build a snapshot at once and return at once.
    ///
    /// Returns error when RaftCore has Fatal error, e.g. shut down or having storage error.
    pub async fn trigger_snapshot(&self) -> Result<(), Fatal<C::NodeId>> {
        self.send_external_command(ExternalCommand::Snapshot, "trigger_snapshot").await
    }

This seems good enough for me. Basically, turn off scheduled snapshots and trigger them at the opportune time. @drmingdrmer Does this make sense or is it a bad idea?

Seems good. I would like to give more control to the application, to decouple its logic from Openraft.

The only limitation is that, from outside, you cannot precisely control when snapshot building actually starts. The command ExternalCommand::Snapshot is added to a queue and executed later, and by that time the conditions for building the snapshot may have changed.

If this limitation doesn't impact your situation, then this solution would work well.

@fredfortier

I believe that's ok. The openraft state machine holds the inputs to my program's executor, which asynchronously writes the final state, including its internal clock. Basically, there's inherent complexity to the snapshot logic.

Each snapshot contains a state backup created in advance by the executor as a precondition. This approach would allow me to trigger build_snapshot from the executor directly. Each backup includes openraft's last entry index (last input that transitioned the state). Suppose snapshots are triggered every few hours, and the build_snapshot may be delayed by a few seconds, that's enough safety to assume a legal state or panic.

Can I reasonably do this with the current code? Or should there be a SnapshotPolicy::Manual kind of option to support this pattern?

@drmingdrmer
Member Author

To be considered legal, a snapshot must meet the following criteria:

  • It should represent the state of the state machine at a particular time in the past. This means it cannot contain any log entry that was not applied to the state machine.
  • It must adhere to monotonicity rules. Specifically, if a snapshot contains the state transitioned by log-i, it must also contain all the state transitions that occurred before log-i (log-0, log-1, ..., log-(i-1)).

It is permissible to build a snapshot from a former state of the state machine. The build-snapshot command does not have to be executed against the current state.

In my opinion, this seems suitable for your scenario. There is no need for SnapshotPolicy::Manual, but there should be a way to turn off automatic snapshot building, e.g., an option SnapshotPolicy::Never.
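Once a snapshot is identified by a single cut-off log index, the two legality rules above collapse into a one-line check. This is a simplification for illustration (it ignores terms and membership changes):

```rust
/// A snapshot at `snapshot_last` is legal iff it reflects a prefix of the
/// applied log.
fn snapshot_is_legal(snapshot_last: u64, last_applied: u64) -> bool {
    // Rule 1: it may not contain any entry that was not yet applied.
    // Rule 2 (monotonic prefix) is implied by identifying the snapshot with
    // a single cut-off index: everything <= snapshot_last is included.
    snapshot_last <= last_applied
}
```

This is why building from a former state machine is fine: any past last-applied index yields a legal (if older) snapshot.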

@fredfortier

fredfortier commented May 8, 2023

there should be a way to let you turn off automatic snapshot building, e.g., add an option SnapshotPolicy::Never

Yes, that's what I meant. Otherwise, I could maybe configure the existing policy with something like u64::MAX, but I'm afraid of introducing unintended side effects. This would simplify things for me.

@drmingdrmer
Member Author

there should be a way to let you turn off automatic snapshot building, e.g., add an option SnapshotPolicy::Never

Yes, that's what I meant. Otherwise, I could maybe give something like u64::MAX to the default policy, but I'm afraid of introducing unintended side effects. This would simplify things for me.

Adding another variant would be fine. Changing the default value might break user applications :(

For now, you can just set the policy to a very large value.

@kevlu93

kevlu93 commented May 22, 2023

there should be a way to let you turn off automatic snapshot building, e.g., add an option SnapshotPolicy::Never

Yes, that's what I meant. Otherwise, I could maybe give something like u64::MAX to the default policy, but I'm afraid of introducing unintended side effects. This would simplify things for me.

Adding another variant would be fine. Changing the default value might break user applications :(

For now, you can just set the policy to a very large value.

In the above scenario, where we want to trigger snapshots manually because we prefer a time-based approach instead of snapshotting based on the number of entries, if we set the policy to a very large value, what are the implications for log purging? My understanding is that when snapshots are built or installed, purge_log is called at the very end to remove entries up to a specified index purge_upto, which is in turn calculated by subtracting a specified max_in_snapshot_log_to_keep:

pub(crate) fn calc_purge_upto(&self) -> Option<LogId<C::NodeId>> {
        let st = &self.state;
        let max_keep = self.config.max_in_snapshot_log_to_keep;
        let batch_size = self.config.purge_batch_size;

        let purge_end = self.state.snapshot_meta.last_log_id.next_index().saturating_sub(max_keep);

        tracing::debug!(
            snapshot_last_log_id = debug(self.state.snapshot_meta.last_log_id),
            max_keep,
            "try purge: (-oo, {})",
            purge_end
        );

        if st.last_purged_log_id().next_index() + batch_size > purge_end {
            tracing::debug!(
                snapshot_last_log_id = debug(self.state.snapshot_meta.last_log_id),
                max_keep,
                last_purged_log_id = display(st.last_purged_log_id().summary()),
                batch_size,
                purge_end,
                "no need to purge",
            );
            return None;
        }
        let log_id = self.state.log_ids.get(purge_end - 1);

So I think if we also set max_in_snapshot_log_to_keep to a very large value, purge_log effectively does nothing. At that point the app would be in charge of log pruning through the purge function of the RaftStorage trait. Would this approach to log purging be correct? It feels a little too hacky, which is perhaps a reason to have more options in SnapshotPolicy. Having a SnapshotPolicy::Never or SnapshotPolicy::Manual could make it safer to control how snapshot building/installing/pruning works for a given app.
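The arithmetic in calc_purge_upto can be re-derived over plain indexes. This simplified, hypothetical version (no LogId or engine state) shows why a very large max_in_snapshot_log_to_keep disables purging: purge_end saturates to 0 and the batch-size check always fails.

```rust
/// Simplified sketch of the purge-bound computation quoted above.
/// Returns the exclusive upper bound of the purge range `(-oo, purge_end)`,
/// or None when purging is not worthwhile yet.
fn calc_purge_end(
    snapshot_last_index: Option<u64>, // last log index covered by the snapshot
    last_purged_next: u64,            // index after the last purged log
    max_in_snapshot_log_to_keep: u64,
    purge_batch_size: u64,
) -> Option<u64> {
    // next_index() of the snapshot's last log id, or 0 with no snapshot.
    let next = snapshot_last_index.map_or(0, |i| i + 1);
    let purge_end = next.saturating_sub(max_in_snapshot_log_to_keep);

    // "no need to purge": fewer than one batch of purgeable logs.
    if last_purged_next + purge_batch_size > purge_end {
        return None;
    }
    Some(purge_end)
}
```

With a snapshot at index 1000, max_keep 100, and nothing purged yet, this yields Some(901); with max_keep of u64::MAX it always yields None, mirroring the "set it very large to disable purging" workaround.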

@drmingdrmer
Member Author

@kevlu93
When a snapshot is built successfully, whether scheduled or triggered by the application with Raft::trigger_snapshot(), Openraft will purge the logs according to the config.

I'll add SnapshotPolicy::Never and Manual.
Most of the information you provided is correct.

if we set the policy to a very large value, what are the implications for log purging? My understanding is that when snapshots are built or installed, purge_log is called at the very end to remove entries up to a specified index purge_upto, which is in turn calculated by subtracting a specified max_in_snapshot_log_to_keep:

Logs won't be purged if they are not included in a snapshot.

At which point the app would be in charge of log pruning through the purge function as part of the RaftStorage trait. Would this approach to log purging be correct?

No. If your application purges logs without RaftCore knowing about it, RaftCore may try to replicate those logs to a follower, resulting in a StorageError that shuts down RaftCore at once.

Isn't max_in_snapshot_log_to_keep enough for your needs? 🤔

@kevlu93

kevlu93 commented May 23, 2023

Well, currently snapshotting and purging with max_in_snapshot_log_to_keep are both based on the number of entries, right? If we change snapshotting so that it is triggered manually and time-based, then the number of entries in a snapshot might not be consistent. Thus, isn't it better in this scenario not to have a rigid max_in_snapshot_log_to_keep? For example, we might prefer purging only up to the previous snapshot, so that we always keep an extra snapshot as a buffer.
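The "keep an extra snapshot as a buffer" rule could be expressed as small bookkeeping on the application side. This is purely illustrative, not an openraft API: track the last two snapshot end-indexes and purge only up to the older one.

```rust
/// Remembers the end-indexes of the last two successful snapshots.
#[derive(Default)]
struct SnapshotHistory {
    previous: Option<u64>,
    current: Option<u64>,
}

impl SnapshotHistory {
    /// Call after each successful snapshot build.
    fn record(&mut self, snapshot_last_index: u64) {
        self.previous = self.current.take();
        self.current = Some(snapshot_last_index);
    }

    /// Logs in `(-oo, purge_upto]` are covered by both retained snapshots,
    /// so purging up to here always leaves one full snapshot of slack.
    fn purge_upto(&self) -> Option<u64> {
        self.previous
    }
}
```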

@drmingdrmer
Member Author

@kevlu93
Do you simply wish to specify the log index that needs to be purged manually?

And I do not know what a snapshot buffer is. Did you mean to keep the last two snapshots?

@kevlu93

kevlu93 commented May 23, 2023

@kevlu93 Do you simply wish to specify the log index that needs to be purged manually?

And I do not know what a snapshot buffer is. Did you mean to keep the last two snapshots?

Yup, exactly: we want to keep the last two snapshots, so being able to specify the purge log index manually would be helpful if we switch to time-based snapshot triggers.

@drmingdrmer
Member Author

In my opinion, providing the ability to manually initiate snapshot building and log purging would suffice.
You can already manually trigger snapshot building with Raft::trigger_snapshot(), and:

@kevlu93

kevlu93 commented May 23, 2023

In my opinion, providing the ability to manually initiate snapshot building and log purging would suffice. You can already manually trigger snapshot building with Raft::trigger_snapshot(), and:

Yup I think this should work well!
