Change: Let user define Snapshot and how to Send/Receive the Snapshot #600

Closed

Conversation

zach-schoenberger
Contributor

@zach-schoenberger commented Nov 6, 2022

I still need to do the PR cleanup below.

The goal of this PR is to make the snapshotting process more customizable. The first thought was to break the snapshot into a stream and a sink, which is what this PR currently shows. But looking at it more, I am not sure even this is quite right. I know the paper covers the InstallSnapshot RPC and says that the Raft engine should process the chunks of the snapshot, but is this really necessary? Couldn't the RPC be simplified to something like:

pub struct InstallSnapshotRequest<C: RaftTypeConfig> {
    pub vote: Vote<C::NodeId>,

    /// Metadata of a snapshot: snapshot_id, last_log_id, membership, etc.
    pub meta: SnapshotMeta<C::NodeId, C::Node>,

    /// The byte offset where this chunk of data is positioned in the snapshot file.
    pub offset: u64,

    /// The snapshot data.
    pub data: C::SD,
}

Here C::SD is the snapshot data, which can be any struct, and the user of the API handles how it is sent and received. This would be much more flexible and would remove all the logic in raft_core around building the snapshot parts, since the user will already have done that in the way that best fits their use case.
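For illustration, a minimal sketch of how such an SD associated type could hang off the type config (a stripped-down stand-in, not the actual openraft RaftTypeConfig definition):

// Stripped-down stand-in for the type config, for illustration only.
pub trait RaftTypeConfig: Sized {
    type NodeId;
    type Node;
    /// Proposed: application-defined snapshot data.
    type SD;
}

// An application then picks whatever representation fits it best.
pub struct MemConfig;
impl RaftTypeConfig for MemConfig {
    type NodeId = u64;
    type Node = ();
    type SD = Vec<u8>; // could just as well be a file path, a file list, etc.
}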

Checklist

  • Updated guide with pertinent info (may not always apply).
  • Squash down commits to one or two logical commits which clearly describe the work you've done.
  • Unittest is a friend:)


@drmingdrmer
Member

    /// The snapshot data.
    pub data: C::SD,

This looks good

@zach-schoenberger
Contributor Author

@drmingdrmer I just updated this with what I think would be a more useful way of handling snapshots. The main idea is that the Raft engine itself should let the user of the library send the snapshot however they want to. The main downside I see is that, currently, the chunks of the snapshot can fail early if some aspect of the vote changes, but this could be put on the user to handle. I believe this branch's changes simplify the snapshot concept in the engine and give the user more freedom to do as they wish with the snapshot process.

@drmingdrmer
Member

The main downside I see is that, currently, the chunks of the snapshot can fail early if some aspect of the vote changes, but this could be put on the user to handle. I believe this branch's changes simplify the snapshot concept in the engine and give the user more freedom to do as they wish with the snapshot process.

I do not quite get this: if a vote change causes the snapshotting to shut down, it has to be dealt with by openraft. Such a task cannot be left to the application.

If the major change is to define the snapshot chunk with C::SD, why not make it a runnable PR instead of commenting out large blocks of code?

@zach-schoenberger changed the title from "switching to have snapshot use Stream/Sink traits" to "Change: Let user define Snapshot and how to Send/Receive the Snapshot" on Nov 7, 2022
@zach-schoenberger
Contributor Author

@drmingdrmer I've gone through and updated the tests where appropriate now.

@drmingdrmer
Member

@drmingdrmer I've gone through and updated the tests where appropriate now.

It looks like you removed the snapshot streaming entirely.

How does an application implement streaming if the snapshot is very large?

@zach-schoenberger
Contributor Author

zach-schoenberger commented Nov 8, 2022

It really becomes a question for the user of the API. C::SD can still be a stream if that makes sense for the user, or it could be the directory location where the files reside. Then, in the network trait that sends the snapshot, the user can do whatever makes the most sense for them: they could send the metadata and start streaming the data if it's a stream, send multiple files in parallel if they have a list of files to send over, or even send each state-machine entry one by one.

On the receiving side, the client needs to aggregate the full snapshot before applying it to Raft. But that, again, can be optimized by the user for their use case.
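For illustration only, an application-chosen C::SD could look something like this (a hypothetical type, not part of this PR or of openraft):

use std::path::PathBuf;

/// Hypothetical snapshot-data type an application might plug in as C::SD.
pub enum MySnapshotData {
    /// Small snapshot held entirely in memory.
    InMemory(Vec<u8>),
    /// Large snapshot stored as a single file on disk.
    SingleFile(PathBuf),
    /// Large snapshot split across several files.
    FileSet(Vec<PathBuf>),
}

The network implementation would then match on the variant and transfer the data in whatever way suits it.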

@drmingdrmer
Member

If C::SD is a stream, it cannot simply be sent as a struct field.

RaftNetwork has to provide an API whose argument is a Stream, and the application has to connect this Stream to the remote peer, e.g. by backing it with a gRPC stream.
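Roughly, something along these lines would be needed (a hypothetical sketch; the trait and names below are illustrative, not the actual RaftNetwork definition):

use futures::stream::BoxStream;

/// Hypothetical streaming-oriented send API, for illustration only.
#[async_trait::async_trait]
pub trait SnapshotStreamSender {
    /// The application wires `chunks` to the remote peer,
    /// e.g. by feeding them into a gRPC client-side stream.
    async fn send_snapshot_stream(
        &mut self,
        chunks: BoxStream<'static, Vec<u8>>,
    ) -> std::io::Result<()>;
}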

@zach-schoenberger
Contributor Author

Yep, that's right. You would not want to just serialize the InstallSnapshotRequest as-is in that case. The user can have their client break it into separate requests, or do whatever they like RPC-wise to send the data over; all of this can fit nicely under the hood of the RaftNetwork::send_install_snapshot call. I had thought it might be more intuitive to break the snapshot out of the InstallSnapshotRequest because of what you said, but I didn't see much benefit.

@zach-schoenberger
Contributor Author

I also think this PR needs some work in relation to the API change; I'm just not sure what the project wants in terms of that. From my experience, and from what I've read in other issues, it looks like the engine should not be what defines how the snapshot is sent between nodes, since what constitutes a user's snapshot can vary so greatly between implementations. Please let me know any suggestions or feedback!

@drmingdrmer
Member

Yep, that's right. You would not want to just serialize the InstallSnapshotRequest as-is in that case. The user can have their client break it into separate requests, or do whatever they like RPC-wise to send the data over; all of this can fit nicely under the hood of the RaftNetwork::send_install_snapshot call. I had thought it might be more intuitive to break the snapshot out of the InstallSnapshotRequest because of what you said, but I didn't see much benefit.

One of my concerns is that an application has to understand the Raft protocol very well to define its own RPC APIs.
A snapshot may be very large, and an application might have to split it into chunks and send them with several application-defined RPCs.
The application may not deal with xxx_request.vote correctly in the RPC handlers, and there are no tests that can discover such issues.

This is why Raft-protocol RPCs have to be defined by RaftNetwork, and every request has to be handled by RaftCore.

@zach-schoenberger
Contributor Author

zach-schoenberger commented Nov 9, 2022

Could you expand on the xxx_request.vote handling more? Vote changes while a snapshot streams were one thing I was worried about and am not 100% sure I understand. Also, I completely agree on the RPC point; those haven't changed with this PR. In my mind, I see a few different scenarios when it comes to snapshots (all inside RaftNetwork::send_install_snapshot):

  • They are small and in memory: super easy. C::SD is a Vec<u8> like in the samples, and the user defines how they want to send it over. Or C::SD is just a reference to the state machine and the user can stream over the values. Dealer's choice. InstallSnapshotRequest<C> is created on the other side and install_snapshot(&self, rpc: InstallSnapshotRequest<C>) is called with it.
  • They are large and in a single file: still pretty easy. C::SD could be a filename; the user sends the file over, followed by an install that triggers install_snapshot(&self, rpc: InstallSnapshotRequest<C>).
  • They are large and in multiple files: more complicated, but easier with the simplified code. C::SD could be a list of file names or a directory. RaftNetwork::send_install_snapshot can send each file either in sequence or in parallel, then send a finalize that calls install_snapshot(&self, rpc: InstallSnapshotRequest<C>).

Technically, all of these can be done on the current branch without these changes: InstallSnapshotRequest<C> just contains a serialized file list, and RaftStorage::install_snapshot is what actually does the data transfer (I have done this for my use case). But it makes setting the snapshot timeout pretty odd, since the timeout has to account for the entire snapshot download instead of a single chunk. Going through this, I just didn't see a really good reason to keep the Raft engine's snapshot sending so tightly bound. But maybe there's something I missed and I'm wrong about this.
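For the third scenario, here is a rough sketch of what the sending side could look like inside an application's network implementation (upload_file and finalize_install are hypothetical stand-ins for application-defined RPCs):

use std::path::{Path, PathBuf};

// Hypothetical stand-ins for application-defined RPCs.
async fn upload_file(_target: &str, _path: &Path, _bytes: Vec<u8>) -> std::io::Result<()> { Ok(()) }
async fn finalize_install(_target: &str, _files: &[PathBuf]) -> std::io::Result<()> { Ok(()) }

/// Send every snapshot file to the target node, then tell it to install.
async fn send_snapshot_files(target: &str, files: &[PathBuf]) -> std::io::Result<()> {
    // Ship each file over; this could just as well run in parallel.
    for path in files {
        let bytes = tokio::fs::read(path).await?;
        upload_file(target, path, bytes).await?;
    }
    // A final call tells the follower every part has arrived, so it can rebuild
    // the snapshot locally and hand it to its Raft instance.
    finalize_install(target, files).await?;
    Ok(())
}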

@drmingdrmer
Member

Could you expand on the xxx_request.vote handling more? Vote changes while a snapshot streams were one thing I was worried about and am not 100% sure I understand.

Every time RaftCore is entered, e.g., when calling a method of RaftCore such as install_snapshot() or append_entries(), the vote must be checked.
When streaming a snapshot to a remote peer, a piece of data is transferred along this path:
local-RaftStorage-impl -(1)-> local-RaftCore -(2)-> local-RaftNetwork-impl -(3)-> remote-RPC-service-impl -(4)-> remote-Raft -(5)-> remote-RaftCore

No matter what C::SD is, (5) is called only once, and the vote of the remote RaftCore is checked only once. This means (5) cannot return until all the data is transferred, which will block RaftCore for a long time.

Also, I completely agree on the RPC point; those haven't changed with this PR. In my mind, I see a few different scenarios when it comes to snapshots (all inside RaftNetwork::send_install_snapshot):

  • They are small and in memory: super easy. C::SD is a Vec<u8> like in the samples, and the user defines how they want to send it over. Or C::SD is just a reference to the state machine and the user can stream over the values. Dealer's choice. InstallSnapshotRequest<C> is created on the other side and install_snapshot(&self, rpc: InstallSnapshotRequest<C>) is called with it.

Yes.

  • They are large and in a single file: still pretty easy. C::SD could be a filename; the user sends the file over, followed by an install that triggers install_snapshot(&self, rpc: InstallSnapshotRequest<C>).

This will block RaftCore.

  • They are large and in multiple files: more complicated, but easier with the simplified code. C::SD could be a list of file names or a directory. RaftNetwork::send_install_snapshot can send each file either in sequence or in parallel, then send a finalize that calls install_snapshot(&self, rpc: InstallSnapshotRequest<C>).

This blocks RaftCore too.

@zach-schoenberger
Contributor Author

Every time RaftCore is entered, e.g., when calling a method of RaftCore such as install_snapshot() or append_entries(), the vote must be checked. When streaming a snapshot to a remote peer, a piece of data is transferred along this path: local-RaftStorage-impl -(1)-> local-RaftCore -(2)-> local-RaftNetwork-impl -(3)-> remote-RPC-service-impl -(4)-> remote-Raft -(5)-> remote-RaftCore

No matter what C::SD is, (5) is called only once, and the vote of the remote RaftCore is checked only once. This means (5) cannot return until all the data is transferred, which will block RaftCore for a long time.

Sorry, do you mean the local RaftCore or remote RaftCore?

The blocking would occur at (3). Since (3) is in the local replication task, my understanding is that the local RaftCore should not be blocked; the blocking would really occur inside the replication task. I see that could definitely cause an issue with its state. Would it be reasonable to move call (3) into its own task? That way it would not block the replication task and could be cancelled when necessary.

@drmingdrmer
Member

Every time RaftCore is entered, e.g., when calling a method of RaftCore such as install_snapshot() or append_entries(), the vote must be checked. When streaming a snapshot to a remote peer, a piece of data is transferred along this path: local-RaftStorage-impl -(1)-> local-RaftCore -(2)-> local-RaftNetwork-impl -(3)-> remote-RPC-service-impl -(4)-> remote-Raft -(5)-> remote-RaftCore
No matter what C::SD is, (5) is called only once, and the vote of the remote RaftCore is checked only once. This means (5) cannot return until all the data is transferred, which will block RaftCore for a long time.

Sorry, do you mean the local RaftCore or remote RaftCore?

The remote RaftCore, at step (5).

The blocking would occur at (3). Since (3) is in the local replication task, my understanding is that the local RaftCore should not be blocked; the blocking would really occur inside the replication task. I see that could definitely cause an issue with its state. Would it be reasonable to move call (3) into its own task? That way it would not block the replication task and could be cancelled when necessary.

Sending data from the leader to followers is already done in other tasks, so step (3) won't block.

@zach-schoenberger
Contributor Author

zach-schoenberger commented Nov 10, 2022

Step (4) isn't callable until step (3) has completed, so the full snapshot would already be on the remote client when step (4) is called.

I know it was mentioned above that C::SD could be a stream. As you've pointed out, that scenario doesn't work if the snapshot system is simplified, so sorry about that confusion. But in the three scenarios above it should work, no?

@zach-schoenberger
Contributor Author

Let me update the rocks and mem examples to show what I mean.

@drmingdrmer
Member

Step (4) isn't callable until step (3) has completed, so the full snapshot would already be on the remote client when step (4) is called.

If C::SD is a simple Vec<T>, yes.

For a stream C::SD, I would say no :).

I know it was mentioned above that C::SD could be a stream. As you've pointed out, that scenario doesn't work if the snapshot system is simplified, so sorry about that confusion. But in the three scenarios above it should work, no?

I'm not quite sure what "it should work" means: AFAIK, it only works when C::SD is a single-chunk snapshot or a list of chunk names (the follower then has to download the chunks). But if C::SD is a Stream, it won't work.

The 3rd scenario:

They are large and in multiple files: more complicated, but easier with the simplified code. C::SD could be a list of file names or a directory. RaftNetwork::send_install_snapshot can send each file either in sequence or in parallel, then send a finalize that calls install_snapshot(&self, rpc: InstallSnapshotRequest<C>).

Did you mean to let the leader send multiple chunks, have the follower buffer all the chunks, and then have the follower rebuild a C::SD and pass it to Raft?

I think it works, but it introduces some complexity; for example, the receiving peer has to watch Raft vote changes so that it can be canceled.
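For example, the receiving side would have to keep something like the following buffer (a hypothetical sketch, not part of this PR) and drop it whenever the local vote changes:

use std::collections::BTreeMap;

/// Hypothetical follower-side chunk buffer, for illustration only.
#[derive(Default)]
struct SnapshotBuffer {
    /// chunk index -> chunk bytes
    chunks: BTreeMap<u64, Vec<u8>>,
}

impl SnapshotBuffer {
    fn add_chunk(&mut self, index: u64, data: Vec<u8>) {
        self.chunks.insert(index, data);
    }

    /// Once every chunk has arrived, concatenate them into the snapshot data
    /// that is handed to the local Raft instance.
    fn into_snapshot_data(self) -> Vec<u8> {
        self.chunks.into_values().flatten().collect()
    }
}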

@zach-schoenberger
Contributor Author

zach-schoenberger commented Nov 10, 2022

I'm not quite sure what "it should work" means: AFAIK, it only works when C::SD is a single-chunk snapshot or a list of chunk names (the follower then has to download the chunks). But if C::SD is a Stream, it won't work.

Right, C::SD as just a stream won't work with this setup. But the contents of the snapshot that C::SD represents could still be transferred however the user wants.

Did you mean to let the leader send multiple chunks, have the follower buffer all the chunks, and then have the follower rebuild a C::SD and pass it to Raft?

Yep

I think it works, but it introduces some complexity; for example, the receiving peer has to watch Raft vote changes so that it can be canceled.

That's a great point about the vote change. I would expect the snapshot to be rejected if its vote is not correct, but I haven't verified that this is the case. (This is the case.) The case of the local Raft going down and leaving the remote snapshot unfinished would also have to be handled by the remote client. But I would argue that these complexities should be put on the user.

@zach-schoenberger
Contributor Author

Any more feedback on this idea?

@drmingdrmer
Member

Any more feedback on this idea?

Such an abstraction leaves too many things for application developers to do.

Application developers should spend as little time as possible understanding a framework.
Some application developers need Raft, but they do not really want to understand how it works.

As I recall, one of the openraft application developers believed the vote did not have to be persisted when RaftStorage::save_vote() is called. :)

So I'd try not to introduce complexity for application developers if possible.

@zach-schoenberger
Contributor Author

Sounds like this change doesn't make much sense as-is, then. I'll close the PR.
