-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spawn VID with spawn_blocking
#2369
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just a minor comment.
let vid_disperse = vid.disperse(encoded_transactions.clone()).unwrap(); | ||
|
||
let vid_disperse = spawn_blocking(move || { | ||
let vid = VidScheme::new(chunk_size, num_quorum_committee, &srs).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add the note below (copied from the PR description) as a comment here?
Note that the unwrap for the tokio version just passes on any panics that happen during the thread, so it is "safe", i.e it's not adding a new place we can panic.
spawn_blocking
WIPspawn_blocking
Closes #2349
This PR:
Spawn the VID calculation on a different thread from the async executor to not block other tasks. From this blog https://ryhl.io/blog/async-what-is-blocking/ from the tokio creator we should not be going more than ~100 micro seconds. VID disperse takes an order of magnitude or longer than this so we should run it on it a dedicated thread for blocking. The blog recommends rayon for cpu tasks but also says if we don't have a lot of CPU work than
spawn_blocking
is okI also increased the async threads to two in the just file because I believe the CI infra has 2 threads (except for maybe self hosted). Personally I think we should remove this restriction entirely.
This PR does not:
Look at other places where we block async tasks for cpu heavy work. We should probably audit the code for other places where we might go a long time between awaits in our code. Potentially signature aggregation, qc validation, and serialization.
It also doesn't make the VID disperse run on multiple threads. As mentioned in the task we should not be dividing up the block on the consensus side. The further parallelization will be done in jellyfish.
Key places to review:
vid.rs where we now spawn a task. Note that the
unwrap
for the tokio version just passes on any panics that happen during the thread, so it is "safe", i.e it's not adding a new place we can panic.This should improve the consistency of our CI since we should be able to run faster. I think a key issue with our tests was that all nodes would block each other from running while the leader calculated the vid disperse and blocked the main executor. In a single node setting it's probably not so bad since a leader would probably not have other work to do while it was blocked on vid calculation.