-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Separate preparation timeouts for PVF prechecking and execution #6139
Changes from all commits
10d1460
ee0abe2
ea82029
218d640
8514193
b07e37d
d9a1fac
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -38,6 +38,16 @@ use std::{ | |
time::{Duration, SystemTime}, | ||
}; | ||
|
||
/// The time period after which the precheck preparation worker is considered unresponsive and will | ||
/// be killed. | ||
// NOTE: If you change this make sure to fix the buckets of `pvf_preparation_time` metric. | ||
pub const PRECHECK_COMPILATION_TIMEOUT: Duration = Duration::from_secs(60); | ||
|
||
/// The time period after which the execute preparation worker is considered unresponsive and will | ||
/// be killed. | ||
// NOTE: If you change this make sure to fix the buckets of `pvf_preparation_time` metric. | ||
pub const EXECUTE_COMPILATION_TIMEOUT: Duration = Duration::from_secs(180); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: It would be great if there were a doc line explaining the relationship between the two. Perhaps, moving them into a module (inline or separate file) and in the module doc explaining the stuff we discussed in DMs? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: I wonder if this would be better named as |
||
|
||
/// An alias to not spell the type for the oneshot sender for the PVF execution result. | ||
pub(crate) type ResultSender = oneshot::Sender<Result<ValidationResult, ValidationError>>; | ||
|
||
|
@@ -51,10 +61,11 @@ pub struct ValidationHost { | |
} | ||
|
||
impl ValidationHost { | ||
/// Precheck PVF with the given code, i.e. verify that it compiles within a reasonable time limit. | ||
/// The result of execution will be sent to the provided result sender. | ||
/// Precheck PVF with the given code, i.e. verify that it compiles within a reasonable time | ||
/// limit. This will prepare the PVF. The result of preparation will be sent to the provided | ||
/// result sender. | ||
/// | ||
/// This is async to accommodate the fact a possibility of back-pressure. In the vast majority of | ||
/// This is async to accommodate the possibility of back-pressure. In the vast majority of | ||
/// situations this function should return immediately. | ||
/// | ||
/// Returns an error if the request cannot be sent to the validation host, i.e. if it shut down. | ||
|
@@ -72,7 +83,7 @@ impl ValidationHost { | |
/// Execute PVF with the given code, execution timeout, parameters and priority. | ||
/// The result of execution will be sent to the provided result sender. | ||
/// | ||
/// This is async to accommodate the fact a possibility of back-pressure. In the vast majority of | ||
/// This is async to accommodate the possibility of back-pressure. In the vast majority of | ||
/// situations this function should return immediately. | ||
/// | ||
/// Returns an error if the request cannot be sent to the validation host, i.e. if it shut down. | ||
|
@@ -92,7 +103,7 @@ impl ValidationHost { | |
|
||
/// Sends a signal to the validation host requesting to prepare a list of the given PVFs. | ||
/// | ||
/// This is async to accommodate the fact a possibility of back-pressure. In the vast majority of | ||
/// This is async to accommodate the possibility of back-pressure. In the vast majority of | ||
/// situations this function should return immediately. | ||
/// | ||
/// Returns an error if the request cannot be sent to the validation host, i.e. if it shut down. | ||
|
@@ -418,6 +429,9 @@ async fn handle_to_host( | |
Ok(()) | ||
} | ||
|
||
/// Handles PVF prechecking. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: ... prechecking requests |
||
/// | ||
/// This tries to prepare the PVF by compiling the WASM blob within a given timeout ([`PRECHECK_COMPILATION_TIMEOUT`]). | ||
async fn handle_precheck_pvf( | ||
artifacts: &mut Artifacts, | ||
prepare_queue: &mut mpsc::Sender<prepare::ToQueue>, | ||
|
@@ -440,12 +454,24 @@ async fn handle_precheck_pvf( | |
} | ||
} else { | ||
artifacts.insert_preparing(artifact_id, vec![result_sender]); | ||
send_prepare(prepare_queue, prepare::ToQueue::Enqueue { priority: Priority::Normal, pvf }) | ||
.await?; | ||
send_prepare( | ||
prepare_queue, | ||
prepare::ToQueue::Enqueue { | ||
priority: Priority::Normal, | ||
pvf, | ||
compilation_timeout: PRECHECK_COMPILATION_TIMEOUT, | ||
}, | ||
) | ||
.await?; | ||
} | ||
Ok(()) | ||
} | ||
|
||
/// Handles PVF execution. | ||
/// | ||
/// This will first try to prepare the PVF, if a prepared artifact does not already exist. If there is already a | ||
/// preparation job, we coalesce the two preparation jobs. When preparing for execution, we use a more lenient timeout | ||
/// ([`EXECUTE_COMPILATION_TIMEOUT`]) than when prechecking. | ||
async fn handle_execute_pvf( | ||
cache_path: &Path, | ||
artifacts: &mut Artifacts, | ||
|
@@ -462,7 +488,7 @@ async fn handle_execute_pvf( | |
|
||
if let Some(state) = artifacts.artifact_state_mut(&artifact_id) { | ||
match state { | ||
ArtifactState::Prepared { ref mut last_time_needed } => { | ||
ArtifactState::Prepared { last_time_needed } => { | ||
*last_time_needed = SystemTime::now(); | ||
|
||
send_execute( | ||
|
@@ -485,9 +511,17 @@ async fn handle_execute_pvf( | |
} | ||
} else { | ||
// Artifact is unknown: register it and enqueue a job with the corresponding priority and | ||
// | ||
// PVF. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. lol, finally this is fixed 🎉 |
||
artifacts.insert_preparing(artifact_id.clone(), Vec::new()); | ||
send_prepare(prepare_queue, prepare::ToQueue::Enqueue { priority, pvf }).await?; | ||
send_prepare( | ||
prepare_queue, | ||
prepare::ToQueue::Enqueue { | ||
priority, | ||
pvf, | ||
compilation_timeout: EXECUTE_COMPILATION_TIMEOUT, | ||
}, | ||
) | ||
.await?; | ||
|
||
awaiting_prepare.add(artifact_id, execution_timeout, params, result_tx); | ||
} | ||
|
@@ -520,7 +554,11 @@ async fn handle_heads_up( | |
|
||
send_prepare( | ||
prepare_queue, | ||
prepare::ToQueue::Enqueue { priority: Priority::Normal, pvf: active_pvf }, | ||
prepare::ToQueue::Enqueue { | ||
priority: Priority::Normal, | ||
pvf: active_pvf, | ||
compilation_timeout: EXECUTE_COMPILATION_TIMEOUT, | ||
}, | ||
) | ||
.await?; | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -155,7 +155,8 @@ impl metrics::Metrics for Metrics { | |
"Time spent in preparing PVF artifacts in seconds", | ||
) | ||
.buckets(vec![ | ||
// This is synchronized with COMPILATION_TIMEOUT=60s constant found in | ||
// This is synchronized with the PRECHECK_COMPILATION_TIMEOUT=60s | ||
// and EXECUTE_COMPILATION_TIMEOUT=180s constants found in | ||
// src/prepare/worker.rs | ||
0.1, | ||
0.5, | ||
|
@@ -166,6 +167,7 @@ impl metrics::Metrics for Metrics { | |
20.0, | ||
30.0, | ||
60.0, | ||
180.0, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: Do you think that's a good resolution for this metric? IOW, ask the question, looking at a metrics dashboard do you think it's possible that you would think for yourself "I wish there were more buckets available that are higher/lesser than 180"? If you are inclined to say yes just plop more bands. It's very cold code. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What do you mean by cold code? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. generally, cold code is the code that is not frequently called. Here, I meant that there are no performance reasons to save on the bands. Assuming more bands, less performance. I did not even think about this too much since it's how many preparations per second can we reasonably do in the worst case? 100? So with the performance of the node argument being irrelevant, and by extension memory as well. I don't see other arguments against it. |
||
]), | ||
)?, | ||
registry, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: those docs distinguish between pre-check preparation worker and execute preparation worker. I am not sure if that's the right way of thinking about that. After all it's the very same worker that does the same the job. It does not even have any parametrization.