PVF: Don't dispute on missing artifact #7011
A dispute should never be raised if the local cache doesn't provide a certain artifact. You cannot dispute for this reason, as it is a local hardware issue and not related to the candidate being checked.

Design: Currently we assume that if we prepared an artifact, it remains on disk until we prune it, i.e. we never check again whether it's still there. We can change it so that instead of artifact-not-found triggering a dispute, we retry once (like we do for `AmbiguousWorkerDeath`, except we don't dispute if it still doesn't work). And when enqueuing an execute job, we check for the artifact on disk, and start preparation if not found.

Changes:
- [x] Integration test (should fail without the following changes)
- [x] Check if artifact exists when executing, prepare if not
- [x] Return an internal error when file is missing
- [x] Retry once on internal errors
- [x] Document design (update impl guide)
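The enqueue-time check from the design can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `EnqueueAction` and `action_for_artifact` are hypothetical names, and the real host tracks artifact state in more detail than a single on-disk check.

```rust
use std::path::Path;

/// Hypothetical decision taken when an execute job is enqueued.
#[derive(Debug, PartialEq, Eq)]
enum EnqueueAction {
    /// The artifact file is on disk, so execution can start right away.
    Execute,
    /// The artifact file is gone, so it has to be prepared again first.
    PrepareThenExecute,
}

/// Decide what to do for an execute request, based only on whether the
/// artifact file is still present on disk (assumption: one file per artifact).
fn action_for_artifact(artifact_path: &Path) -> EnqueueAction {
    if artifact_path.is_file() {
        EnqueueAction::Execute
    } else {
        EnqueueAction::PrepareThenExecute
    }
}
```

The check is inherently racy (the file can disappear right after it), which is why the design also retries once on an internal error instead of relying on this check alone.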
d043a32 to 6fa0046
// Wait a brief delay before retrying.
futures_timer::Delay::new(PVF_EXECUTION_RETRY_DELAY).await;
// Allow one retry for each kind of error.
let mut num_internal_retries_left = 1;
Could make this higher, since this kind of error is probably the most likely to be transient.
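To make the retry budget discussed here concrete, below is a simplified synchronous sketch of "one retry for each kind of error". It is an assumption-laden illustration: `Outcome`, `validate_with_retries`, and `num_awd_retries_left` are hypothetical names, and `thread::sleep` stands in for the async `futures_timer::Delay` used in the diff.

```rust
use std::{thread, time::Duration};

/// Hypothetical, simplified result of one validation attempt.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Outcome {
    Valid,
    Invalid(String),
    InternalError(String),
    AmbiguousWorkerDeath,
}

/// Run `attempt`, retrying at most once per retryable error kind.
fn validate_with_retries(mut attempt: impl FnMut() -> Outcome, retry_delay: Duration) -> Outcome {
    // Allow one retry for each kind of error.
    let mut num_internal_retries_left = 1;
    let mut num_awd_retries_left = 1;
    loop {
        let outcome = attempt();
        let should_retry = match &outcome {
            Outcome::InternalError(_) if num_internal_retries_left > 0 => {
                num_internal_retries_left -= 1;
                true
            }
            Outcome::AmbiguousWorkerDeath if num_awd_retries_left > 0 => {
                num_awd_retries_left -= 1;
                true
            }
            // Definitive results, or a retry budget that is used up.
            _ => false,
        };
        if !should_retry {
            return outcome;
        }
        // Wait a brief delay before retrying.
        thread::sleep(retry_delay);
    }
}
```

Raising the internal budget, as suggested in the comment, would only mean changing the initial value of `num_internal_retries_left`.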
node/core/pvf/src/execute/worker.rs
Outdated
@@ -359,7 +366,13 @@ fn validate_using_artifact(
// [`executor_intf::prepare`].
executor.execute(artifact_path.as_ref(), params)
} {
Err(err) => return Response::format_invalid("execute", &err),
Err(err) =>
return if err.contains("failed to open file: No such file or directory") {
This really requires a refactor changing the error type to something sensible, like an `enum`. Matching on a string is way too error-prone: something changes the message, localization, ...
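As a rough illustration of the typed-error approach, the sketch below classifies a failed read by `std::io::ErrorKind` rather than by message text. `ArtifactReadError` and `read_artifact` are hypothetical names; the Substrate executor in this PR does not expose such a type, which is exactly the complaint.

```rust
use std::{fs, io, path::Path};

/// Hypothetical typed error for loading an artifact from disk.
#[derive(Debug, PartialEq, Eq)]
enum ArtifactReadError {
    /// The artifact file does not exist (a local problem, not the candidate's fault).
    Missing,
    /// Any other I/O failure, kept as a kind rather than a message string.
    Io(io::ErrorKind),
}

/// Read the artifact bytes, mapping the OS error to a typed variant.
fn read_artifact(path: &Path) -> Result<Vec<u8>, ArtifactReadError> {
    fs::read(path).map_err(|err| match err.kind() {
        io::ErrorKind::NotFound => ArtifactReadError::Missing,
        kind => ArtifactReadError::Io(kind),
    })
}
```

Matching on `ErrorKind::NotFound` does not depend on the wording or locale of the OS error message, which is the failure mode raised in the comment.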
Very likely to come from Substrate; it's full of string errors. I agree that matching against strings is a no-go, but otherwise we'd have to halt the PR.
> This really requires a refactor changing the error type to something sensible, like an `enum`

Vote up, I've already raised that concern somewhere... Many errors coming from Substrate are not sensible at all. I also agree that string matching is no good.

@mrcnski a (probably stupid) idea: until we have an `enum` error from Substrate, would it be better not to rely on its string errors but to check for the file's existence ourselves? Should be simple enough. Of course, it introduces a race condition, but that's still better than parsing strings. Also, nothing guarantees that the file persists between the moment it is opened and the moment it is read, so that kind of race condition already exists anyway.
Yeah, it comes from Substrate. Didn't think about localization. 😬 I considered just treating `RuntimeConstruction` itself as an internal error, but it seems it's also used for some case where wasm runs out of memory, which would be a problem with the PVF itself. (link)
Checking for the file existence seems sensible to me...
Problems with the PVF itself we agreed are also no reason to raise a dispute, since we have pre-checking enabled. Basically any error that is independent of the candidate at hand should not be cause for a dispute.
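The rule stated here (only candidate-dependent failures may lead to a dispute) can be sketched as a small classification. All names are hypothetical, and `Abstain` is only a placeholder: this thread does not specify what the node does on an internal error beyond not disputing.

```rust
/// Hypothetical outcome of validating a candidate.
#[derive(Debug, PartialEq, Eq)]
enum ValidationOutcome {
    Valid,
    /// Deterministic failure caused by the candidate itself.
    InvalidCandidate(String),
    /// Failure independent of the candidate (missing artifact, local I/O, ...).
    InternalError(String),
}

/// Hypothetical vote in a dispute.
#[derive(Debug, PartialEq, Eq)]
enum DisputeVote {
    Valid,
    Invalid,
    /// Placeholder for "do not vote against the candidate".
    Abstain,
}

/// Only a failure that depends on the candidate results in an `Invalid` vote.
fn dispute_vote(outcome: &ValidationOutcome) -> DisputeVote {
    match outcome {
        ValidationOutcome::Valid => DisputeVote::Valid,
        ValidationOutcome::InvalidCandidate(_) => DisputeVote::Invalid,
        ValidationOutcome::InternalError(_) => DisputeVote::Abstain,
    }
}
```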
Let's in any case create a ticket for fixing those string errors - or at least the one in question right now.
Looks like `"output exceeds bounds of wasm memory"` should not be an issue with runtime construction, but is actually indicative of a malicious PVF. And I think this would get past pre-checking, since that just compiles and doesn't execute.
// Do a length check before allocating. The returned output should not be bigger than the
// available WASM memory. Otherwise, a malicious parachain can trigger a large allocation,
// potentially causing memory exhaustion.
//
// Get the size of the WASM memory in bytes.
let memory_size = ctx.as_context().data().memory().data_size(ctx);
if checked_range(output_ptr as usize, output_len as usize, memory_size).is_none() {
    Err(WasmError::Other("output exceeds bounds of wasm memory".into()))?
}
(I used `WasmError::Other` here to match the other errors in the file, without realizing it gets converted to `RuntimeConstruction`. 🤷♂️)
Anyway, basically, this one `"output exceeds bounds of wasm memory"` case is deterministic and we should definitely vote against. If we gave it a new separate enum in Substrate, then we could treat the existing `RuntimeConstruction` as an internal error.
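For readers unfamiliar with the bounds check quoted above, here is a sketch of what a helper with the shape of `checked_range` plausibly does. This is an assumption about its behavior based on how it is called in the snippet, not a copy of the Substrate implementation.

```rust
use std::ops::Range;

/// Return `offset..offset + len` if the range fits within `max` bytes,
/// or `None` on arithmetic overflow or when the range runs out of bounds.
fn checked_range(offset: usize, len: usize, max: usize) -> Option<Range<usize>> {
    // `checked_add` guards against a pointer/length pair that wraps around.
    let end = offset.checked_add(len)?;
    if end <= max {
        Some(offset..end)
    } else {
        None
    }
}
```

Because the result depends only on the values the PVF returns and on the memory size, the same candidate fails the same way on every validator, which is what makes this case deterministic.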
Raised a small fix for the output-bounds case here, but I'm still not confident that `RuntimeConstruction` is always a transient error and don't think we should use it. Unless we treat it as a new "possibly transient" case, meaning we retry and dispute if it happens again. 🤷♂️
For the file-not-found case, there is not a clear way to fix the error story on the Substrate side. Just having another check here for file existence should be enough.
I am fine with the check, please reference the substrate issue though in a comment. This way, readers understand why we did it this way and we can reevaluate once the issue is fixed.
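A minimal sketch of the agreed approach, checking for the file instead of parsing the error string, might look like the following. `ExecuteFailure` and `classify_execute_failure` are hypothetical names, and the comment only marks where the Substrate issue reference would go; no issue number is assumed here.

```rust
use std::path::Path;

/// Hypothetical classification of a failed execution.
#[derive(Debug, PartialEq, Eq)]
enum ExecuteFailure {
    /// Local problem; must not lead to a dispute.
    InternalError(String),
    /// Reported as an invalid candidate.
    Invalid(String),
}

/// Classify an executor error without inspecting the error string.
fn classify_execute_failure(artifact_path: &Path, err: &str) -> ExecuteFailure {
    // The executor only returns string errors, so the artifact is checked
    // on disk directly. (A real version would reference the relevant
    // Substrate issue here so the check can be revisited once it is fixed.)
    if !artifact_path.exists() {
        ExecuteFailure::InternalError(format!("artifact missing: {}", artifact_path.display()))
    } else {
        ExecuteFailure::Invalid(format!("execute: {}", err))
    }
}
```

As noted earlier in the thread, this check races with concurrent file removal, so it narrows the window rather than closing it.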
Ready for another review. :) (Don't think we can do much about the Substrate error for now.)
@@ -691,38 +691,54 @@ trait ValidationBackend {
Updating candidate-validation tests would not harm
bot merge
Error: Statuses failed for b450803
bot rebase
Rebased
bot merge
Related issues
Closes #6959
Pre-requisite for paritytech/polkadot-sdk#685