Change: Let user define Snapshot and how to Send/Receive the Snapshot #600

Closed

Conversation

zach-schoenberger
Contributor

@zach-schoenberger commented Nov 6, 2022

I still need to do the PR cleanup below.

The goal of this PR is to make the snapshotting process more customizable. The first thought was to break the snapshot into a stream and a sink, which is what this PR currently shows. But looking at it more, I am not sure even this is quite right. I know the paper covers the InstallSnapshot RPC and says that the Raft engine should process the chunks of the snapshot, but is this really necessary? Couldn't the RPC be simplified to something like:

pub struct InstallSnapshotRequest<C: RaftTypeConfig> {
    pub vote: Vote<C::NodeId>,

    /// Metadata of a snapshot: snapshot_id, last_log_id, membership, etc.
    pub meta: SnapshotMeta<C::NodeId, C::Node>,

    /// The byte offset where this chunk of data is positioned in the snapshot file.
    pub offset: u64,

    /// The snapshot data.
    pub data: C::SD,
}

Here C::SD is the snapshot data, which can be any struct, and the user of the API handles how it is sent and received. This would be much more flexible and would remove all the logic in raft_core around building the snapshot parts, since the user will already have done that in the way that best fits their use case.
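For illustration, a minimal sketch of how such an SD associated type could hang off the type config (a stripped-down stand-in, not the actual openraft RaftTypeConfig definition):

// Stripped-down stand-in for the type config, for illustration only.
pub trait RaftTypeConfig: Sized {
    type NodeId;
    type Node;
    /// Proposed: application-defined snapshot data.
    type SD;
}

// An application then picks whatever representation fits it best.
pub struct MemConfig;
impl RaftTypeConfig for MemConfig {
    type NodeId = u64;
    type Node = ();
    type SD = Vec<u8>; // could just as well be a file path, a file list, etc.
}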

Checklist

  • Updated guide with pertinent info (may not always apply).
  • Squash down commits to one or two logical commits which clearly describe the work you've done.
  • Unittest is a friend:)


@drmingdrmer
Member

    /// The snapshot data.
    pub data: C::SD,

This looks good

@zach-schoenberger
Contributor Author

@drmingdrmer I just updated this with what I think would be a more useful way of handling snapshots. The main idea is that the Raft engine itself should let the user of the library send the snapshot however they want to. The main downside I see is that, currently, the chunks of the snapshot can fail early if some aspect of the vote changes, but this could be put on the user to handle. I believe this branch's changes simplify the snapshot concept in the engine and give the user more freedom to do as they wish with the snapshot process.

@drmingdrmer
Member

The main downside I see is that, currently, the chunks of the snapshot can fail early if some aspect of the vote changes, but this could be put on the user to handle. I believe this branch's changes simplify the snapshot concept in the engine and give the user more freedom to do as they wish with the snapshot process.

I do not quite get this: if a vote change causes the snapshotting to shut down, it has to be dealt with by openraft. Such a task cannot be left to the application.

If the major change is to define the snapshot chunk with C::SD, why not make it a runnable PR instead of commenting out large blocks of code?

@zach-schoenberger changed the title from "switching to have snapshot use Stream/Sink traits" to "Change: Let user define Snapshot and how to Send/Receive the Snapshot" on Nov 7, 2022
@zach-schoenberger
Contributor Author

@drmingdrmer I've gone through and updated the tests where appropriate now.

@drmingdrmer
Member

@drmingdrmer I've gone through and updated the tests where appropriate now.

It looks like you removed the snapshot streaming entirely.

How does an application implement streaming if the snapshot is very large?

@zach-schoenberger
Contributor Author

zach-schoenberger commented Nov 8, 2022

It really becomes a question for the user of the API. C::SD can still be a stream if that makes sense for the user, or it could be the directory location where the files reside. Then, in the network trait that sends the snapshot, the user can do whatever makes the most sense for them: they could send the metadata and start streaming the data if it's a stream, send multiple files in parallel if they have a list of files to send over, or even send each state-machine entry one by one.

On the receiving side, the client needs to aggregate the full snapshot before applying it to Raft. But that, again, can be optimized by the user for their use case.
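For illustration only, an application-chosen C::SD could look something like this (a hypothetical type, not part of this PR or of openraft):

use std::path::PathBuf;

/// Hypothetical snapshot-data type an application might plug in as C::SD.
pub enum MySnapshotData {
    /// Small snapshot held entirely in memory.
    InMemory(Vec<u8>),
    /// Large snapshot stored as a single file on disk.
    SingleFile(PathBuf),
    /// Large snapshot split across several files.
    FileSet(Vec<PathBuf>),
}

The network implementation would then match on the variant and transfer the data in whatever way suits it.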

@drmingdrmer
Member

If C::SD is a stream, it cannot simply be sent as a struct field.

RaftNetwork has to provide an API whose argument is a Stream, and the application has to connect this Stream to the remote peer, e.g. by backing it with a gRPC stream.
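Roughly, something along these lines would be needed (a hypothetical sketch; the trait and names below are illustrative, not the actual RaftNetwork definition):

use futures::stream::BoxStream;

/// Hypothetical streaming-oriented send API, for illustration only.
#[async_trait::async_trait]
pub trait SnapshotStreamSender {
    /// The application wires `chunks` to the remote peer,
    /// e.g. by feeding them into a gRPC client-side stream.
    async fn send_snapshot_stream(
        &mut self,
        chunks: BoxStream<'static, Vec<u8>>,
    ) -> std::io::Result<()>;
}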

@zach-schoenberger
Contributor Author

Yep, that's right. You would not want to just serialize the InstallSnapshotRequest as-is in that case. The user can have their client break it into separate requests, or do whatever they like RPC-wise to send the data over; all of this can fit nicely under the hood of the RaftNetwork::send_install_snapshot call. I had thought it might be more intuitive to break the snapshot out of the InstallSnapshotRequest because of what you said, but I didn't see much benefit.

@zach-schoenberger
Contributor Author

I also think this PR needs some work in relation to the API change; I'm just not sure what the project wants in terms of that. From my experience, and from what I've read in other issues, it looks like the engine should not be what defines how the snapshot is sent between nodes, since what constitutes a user's snapshot can vary so greatly between implementations. Please let me know any suggestions or feedback!

@drmingdrmer
Member

Yep, that's right. You would not want to just serialize the InstallSnapshotRequest as-is in that case. The user can have their client break it into separate requests, or do whatever they like RPC-wise to send the data over; all of this can fit nicely under the hood of the RaftNetwork::send_install_snapshot call. I had thought it might be more intuitive to break the snapshot out of the InstallSnapshotRequest because of what you said, but I didn't see much benefit.

One of my concerns is that an application has to understand the Raft protocol very well to define its own RPC APIs.
A snapshot may be very large, and an application might have to split it into chunks and send them with several application-defined RPCs.
The application may not deal with xxx_request.vote correctly in the RPC handlers, and there are no tests that can discover such issues.

This is why Raft-protocol RPCs have to be defined by RaftNetwork, and every request has to be handled by RaftCore.

@zach-schoenberger
Contributor Author

zach-schoenberger commented Nov 9, 2022

Could you expand on the xxx_request.vote handling more? Vote changes while a snapshot streams were one thing I was worried about and am not 100% sure I understand. Also, I completely agree on the RPC point; those haven't changed with this PR. In my mind, I see a few different scenarios when it comes to snapshots (all inside RaftNetwork::send_install_snapshot):

  • They are small and in memory: super easy. C::SD is a Vec<u8> like in the samples, and the user defines how they want to send it over. Or C::SD is just a reference to the state machine and the user can stream over the values. Dealer's choice. InstallSnapshotRequest<C> is created on the other side and install_snapshot(&self, rpc: InstallSnapshotRequest<C>) is called with it.
  • They are large and in a single file: still pretty easy. C::SD could be a filename; the user sends the file over, followed by an install that triggers install_snapshot(&self, rpc: InstallSnapshotRequest<C>).
  • They are large and in multiple files: more complicated, but easier with the simplified code. C::SD could be a list of file names or a directory. RaftNetwork::send_install_snapshot can send each file either in sequence or in parallel, then send a finalize that calls install_snapshot(&self, rpc: InstallSnapshotRequest<C>).

Technically, all of these can be done on the current branch without these changes: InstallSnapshotRequest<C> just contains a serialized file list, and RaftStorage::install_snapshot is what actually does the data transfer (I have done this for my use case). But it makes setting the snapshot timeout pretty odd, since the timeout has to account for the entire snapshot download instead of a single chunk. Going through this, I just didn't see a really good reason to keep the Raft engine's snapshot sending so tightly bound. But maybe there's something I missed and I'm wrong about this.
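For the third scenario, here is a rough sketch of what the sending side could look like inside an application's network implementation (upload_file and finalize_install are hypothetical stand-ins for application-defined RPCs):

use std::path::{Path, PathBuf};

// Hypothetical stand-ins for application-defined RPCs.
async fn upload_file(_target: &str, _path: &Path, _bytes: Vec<u8>) -> std::io::Result<()> { Ok(()) }
async fn finalize_install(_target: &str, _files: &[PathBuf]) -> std::io::Result<()> { Ok(()) }

/// Send every snapshot file to the target node, then tell it to install.
async fn send_snapshot_files(target: &str, files: &[PathBuf]) -> std::io::Result<()> {
    // Ship each file over; this could just as well run in parallel.
    for path in files {
        let bytes = tokio::fs::read(path).await?;
        upload_file(target, path, bytes).await?;
    }
    // A final call tells the follower every part has arrived, so it can rebuild
    // the snapshot locally and hand it to its Raft instance.
    finalize_install(target, files).await?;
    Ok(())
}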

@drmingdrmer
Member

Could you expand on the xxx_request.vote handling more? Vote changes while a snapshot streams were one thing I was worried about and am not 100% sure I understand.

Every time RaftCore is entered, e.g., when calling a method of RaftCore such as install_snapshot() or append_entries(), the vote must be checked.
When streaming a snapshot to a remote peer, a piece of data is transferred along this path:
local-RaftStorage-impl -(1)-> local-RaftCore -(2)-> local-RaftNetwork-impl -(3)-> remote-RPC-service-impl -(4)-> remote-Raft -(5)-> remote-RaftCore

No matter what C::SD is, (5) is called only once, and the vote of the remote RaftCore is checked only once. This means (5) cannot return until all the data is transferred, which will block RaftCore for a long time.

Also, I completely agree on the RPC point; those haven't changed with this PR. In my mind, I see a few different scenarios when it comes to snapshots (all inside RaftNetwork::send_install_snapshot):

  • They are small and in memory: super easy. C::SD is a Vec<u8> like in the samples, and the user defines how they want to send it over. Or C::SD is just a reference to the state machine and the user can stream over the values. Dealer's choice. InstallSnapshotRequest<C> is created on the other side and install_snapshot(&self, rpc: InstallSnapshotRequest<C>) is called with it.

Yes.

  • They are large and in a single file: still pretty easy. C::SD could be a filename; the user sends the file over, followed by an install that triggers install_snapshot(&self, rpc: InstallSnapshotRequest<C>).

This will block RaftCore.

  • They are large and in multiple files: more complicated, but easier with the simplified code. C::SD could be a list of file names or a directory. RaftNetwork::send_install_snapshot can send each file either in sequence or in parallel, then send a finalize that calls install_snapshot(&self, rpc: InstallSnapshotRequest<C>).

This blocks RaftCore too.

@zach-schoenberger
Contributor Author

Every time RaftCore is entered, e.g., when calling a method of RaftCore such as install_snapshot() or append_entries(), the vote must be checked. When streaming a snapshot to a remote peer, a piece of data is transferred along this path: local-RaftStorage-impl -(1)-> local-RaftCore -(2)-> local-RaftNetwork-impl -(3)-> remote-RPC-service-impl -(4)-> remote-Raft -(5)-> remote-RaftCore

No matter what C::SD is, (5) is called only once, and the vote of the remote RaftCore is checked only once. This means (5) cannot return until all the data is transferred, which will block RaftCore for a long time.

Sorry, do you mean the local RaftCore or remote RaftCore?

The blocking would occur at (3). Since (3) is in the local replication task, my understanding is that the local RaftCore should not be blocked; the blocking would really occur inside the replication task. I see that could definitely cause an issue with its state. Would it be reasonable to move call (3) into its own task? That way it would not block the replication task and could be cancelled when necessary.

@drmingdrmer
Member

Every time RaftCore is entered, e.g., when calling a method of RaftCore such as install_snapshot() or append_entries(), the vote must be checked. When streaming a snapshot to a remote peer, a piece of data is transferred along this path: local-RaftStorage-impl -(1)-> local-RaftCore -(2)-> local-RaftNetwork-impl -(3)-> remote-RPC-service-impl -(4)-> remote-Raft -(5)-> remote-RaftCore
No matter what C::SD is, (5) is called only once, and the vote of the remote RaftCore is checked only once. This means (5) cannot return until all the data is transferred, which will block RaftCore for a long time.

Sorry, do you mean the local RaftCore or remote RaftCore?

The remote RaftCore, at step (5).

The blocking would occur at (3). Since (3) is in the local replication task, my understanding is that the local RaftCore should not be blocked; the blocking would really occur inside the replication task. I see that could definitely cause an issue with its state. Would it be reasonable to move call (3) into its own task? That way it would not block the replication task and could be cancelled when necessary.

Sending data from the leader to followers is already done in other tasks, so step (3) won't block.

@zach-schoenberger
Contributor Author

zach-schoenberger commented Nov 10, 2022

Step (4) isn't callable until step (3) has completed, so the full snapshot would already be on the remote client when step (4) is called.

I know it was mentioned above that C::SD could be a stream. As you've pointed out, that scenario doesn't work if the snapshot system is simplified, so sorry about that confusion. But in the three scenarios above it should work, no?

@zach-schoenberger
Contributor Author

Let me update the rocks and mem examples to show what I mean.

@drmingdrmer
Member

Step (4) isn't callable until step (3) has completed, so the full snapshot would already be on the remote client when step (4) is called.

If C::SD is a simple Vec<T>, yes.

For a stream C::SD, I would say no :).

I know it was mentioned above that C::SD could be a stream. As you've pointed out, that scenario doesn't work if the snapshot system is simplified, so sorry about that confusion. But in the three scenarios above it should work, no?

I'm not quite sure what "it should work" means: AFAIK, it only works when C::SD is a single-chunk snapshot or a list of chunk names (the follower then has to download the chunks). But if C::SD is a Stream, it won't work.

The 3rd scenario:

They are large and in multiple files: more complicated, but easier with the simplified code. C::SD could be a list of file names or a directory. RaftNetwork::send_install_snapshot can send each file either in sequence or in parallel, then send a finalize that calls install_snapshot(&self, rpc: InstallSnapshotRequest<C>).

Did you mean to let the leader send multiple chunks, have the follower buffer all the chunks, and then have the follower rebuild a C::SD and pass it to Raft?

I think it works, but it introduces some complexity; for example, the receiving peer has to watch Raft vote changes so that it can be canceled.
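For example, the receiving side would have to keep something like the following buffer (a hypothetical sketch, not part of this PR) and drop it whenever the local vote changes:

use std::collections::BTreeMap;

/// Hypothetical follower-side chunk buffer, for illustration only.
#[derive(Default)]
struct SnapshotBuffer {
    /// chunk index -> chunk bytes
    chunks: BTreeMap<u64, Vec<u8>>,
}

impl SnapshotBuffer {
    fn add_chunk(&mut self, index: u64, data: Vec<u8>) {
        self.chunks.insert(index, data);
    }

    /// Once every chunk has arrived, concatenate them into the snapshot data
    /// that is handed to the local Raft instance.
    fn into_snapshot_data(self) -> Vec<u8> {
        self.chunks.into_values().flatten().collect()
    }
}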

@zach-schoenberger
Contributor Author

zach-schoenberger commented Nov 10, 2022

I'm not quite sure what "it should work" means: AFAIK, it only works when C::SD is a single-chunk snapshot or a list of chunk names (the follower then has to download the chunks). But if C::SD is a Stream, it won't work.

Right, C::SD as just a stream won't work with this setup. But the contents of the snapshot that C::SD represents could still be transferred however the user wants.

Did you mean to let the leader send multiple chunks, have the follower buffer all the chunks, and then have the follower rebuild a C::SD and pass it to Raft?

Yep

I think it works, but it introduces some complexity; for example, the receiving peer has to watch Raft vote changes so that it can be canceled.

That's a great point about the vote change. I would expect the snapshot to be rejected if its vote is not correct, but I haven't verified that this is the case. (This is the case.) The case of the local Raft going down and leaving the remote snapshot unfinished would also have to be handled by the remote client. But I would argue that these complexities should be put on the user.

@zach-schoenberger
Contributor Author

Any more feedback on this idea?

@drmingdrmer
Member

Any more feedback on this idea?

Such an abstraction leaves too many things for application developers to do.

Application developers should spend as little time as possible understanding a framework.
Some application developers need Raft, but they do not really want to understand how it works.

As I recall, one of the openraft application developers believed the vote did not have to be persisted when RaftStorage::save_vote() is called. :)

So I'd try not to introduce complexity for application developers if possible.

@zach-schoenberger
Contributor Author

Sounds like this change doesn't make much sense as-is, then. I'll close the PR.
