PVF: Vote invalid on panics in execution thread (after a retry) #7155
Conversation
1. We were using `rayon` to spawn a superfluous thread to do execution, so it was removed.
2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible [per-runtime](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_stack_size) but not per-thread). Since we want to remove `tokio` from the workers [anyway](https://github.com/paritytech/polkadot/issues/7117), I changed it to spawn threads with the `std::thread` API instead of `tokio`.[^1]
3. Since the `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was changed to a naive loop.
4. The order of thread selection was flipped to make (3) sound (see note in code).

I left some TODOs related to panics which I'm going to address soon as part of #7045.

[^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet.
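A minimal sketch of point 2 above: with `std::thread`, the stack size can be set per thread via `Builder::stack_size`, whereas `tokio` only supports it per runtime. The thread name and stack-size value here are made up for illustration, not the worker's actual configuration.

```rust
use std::thread;

fn main() {
    // Hypothetical value; the real worker would use its own configured size.
    const EXECUTE_THREAD_STACK_SIZE: usize = 2 * 1024 * 1024;

    let handle = thread::Builder::new()
        .name("execute-thread".into())
        .stack_size(EXECUTE_THREAD_STACK_SIZE)
        .spawn(|| {
            // Stand-in for PVF execution work.
            "done"
        })
        .expect("failed to spawn execute thread");

    println!("{}", handle.join().unwrap());
}
```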
Also make sure we kill the worker process on panic errors and internal errors to potentially clear any error states independent of the candidate.
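A hedged sketch of that policy, not the actual candidate-validation code: the supervising side kills the worker process on panic or internal errors, but keeps it alive for candidate-specific failures. The `WorkerError` enum and the `sleep` child process are hypothetical stand-ins.

```rust
use std::process::{Child, Command};

// Hypothetical error classification; the real code has richer variants.
enum WorkerError {
    InvalidCandidate,
    Panic(String),
    Internal(String),
}

fn handle_error(worker: &mut Child, err: &WorkerError) -> std::io::Result<()> {
    match err {
        // Candidate-specific failure: the worker can stay alive.
        WorkerError::InvalidCandidate => Ok(()),
        // Panics and internal errors: kill the worker to clear any error
        // state that is independent of the candidate.
        WorkerError::Panic(_) | WorkerError::Internal(_) => worker.kill(),
    }
}

fn main() -> std::io::Result<()> {
    let mut worker = Command::new("sleep").arg("10").spawn()?;
    handle_error(&mut worker, &WorkerError::Panic("worker panicked".into()))?;
    let status = worker.wait()?;
    println!("worker exited: {status}");
    Ok(())
}
```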
…panics-in-execution
Addresses a couple of follow-up TODOs from #7153.
// TODO: Should we stop retrying after some time has passed?
// We would need to know if we came from backing or approval.
@s0me0ne-unkn0wn IIRC you've looked into this, WDYT?
Why would a panic resolve itself after a retry? Maybe in case of memory pressure -> OOM? But still, if we retry and wait too much time, malicious validators might have enough time to vote valid and the including block will be finalized. We currently have a two-relay-chain-block delay on finality, which should help, but we'd have to be careful here.
You mean in the case of backing you want the strict timeout to be applied to both retries cumulatively, not to a single one? Looks like a valid idea. There's no dedicated field to tell which subsystem the request is coming from yet, but you can check the execution timeout kind in the incoming `CandidateValidationMessage::ValidateFromExhaustive` or `CandidateValidationMessage::ValidateFromChainState`; it will be either `PvfExecTimeoutKind::Backing` or `PvfExecTimeoutKind::Approval`, so you can tell backing from the other stuff. Sounds a bit hacky, probably, but it will work.
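A minimal sketch of how such a retry policy could branch on the timeout kind. This is not the actual candidate-validation code: the enum below is a local stand-in for the real `PvfExecTimeoutKind`, and the deadline value is made up.

```rust
// Local stand-in for `PvfExecTimeoutKind`, for illustration only.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum PvfExecTimeoutKind {
    Backing,
    Approval,
}

/// Decide whether another retry is allowed: backing has a strict deadline,
/// so stop retrying once the cumulative time spent exceeds it; approval work
/// can afford another attempt.
fn retry_allowed(
    kind: PvfExecTimeoutKind,
    total_time_spent: std::time::Duration,
    backing_deadline: std::time::Duration,
) -> bool {
    match kind {
        PvfExecTimeoutKind::Backing => total_time_spent < backing_deadline,
        PvfExecTimeoutKind::Approval => true,
    }
}

fn main() {
    use std::time::Duration;
    // Example: 1.5s already spent against a hypothetical 2s backing deadline.
    assert!(retry_allowed(PvfExecTimeoutKind::Backing, Duration::from_millis(1500), Duration::from_secs(2)));
    assert!(!retry_allowed(PvfExecTimeoutKind::Backing, Duration::from_secs(3), Duration::from_secs(2)));
}
```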
> Why would a panic resolve itself after a retry? Maybe in case of memory pressure -> OOM?
Maybe, or also just a hardware glitch, spurious OS glitch, etc. We can't really tell why a panic happened, or whether it was an issue with the candidate or not.[^1] Retrying lets us be more sure about our vote.
[^1]: Well, we do get a stringified error, but we can't match on it or anything.
Thanks @s0me0ne-unkn0wn! Pushed some changes.
> Why would a panic resolve itself after a retry? Maybe in case of memory pressure -> OOM? But still, if we retry and wait too much time, malicious validators might have enough time to vote valid and the including block will be finalized. We currently have a two-relay-chain-block delay on finality, which should help, but we'd have to be careful here.
No, this should not be possible. If it takes too long, we no-show and more validators will cover for us.
But still, if those new guys also retry, they will also no-show and malicious votes come in?
They will then be replaced by even more checkers. One no-show will be replaced by more than one new checker, which will again be 2/3 honest. Anyhow, this perfect DoS of all honest nodes is certainly the absolute worst case, so yes, we have to be careful about when not to raise a dispute. In general, if in doubt, raise one.

As for hardware faults being the cause, I think we should look into error checking. E.g. we already suggest/require ECC memory; we should have something similar for disk/db. The node should be able to detect a corrupted db and not just read garbage data and then dispute stuff.
CPU hardware faults are extremely rare (I'm closely familiar with a single case, though a really interesting one), and memory faults are not so rare (I came across two cases in my life) but still not something that should be seriously taken into account (that's something that could be handled by governance reverting the slashes on a case-by-case basis). So do I understand correctly that we are left with a single really concerning case, OOM? Can we tell OOM from the other cases? AFAIR Linux's `oom_kill_task()` sends `SIGKILL` to the process instead of `SIGSEGV`; could we use that to detect the condition?
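A hedged sketch of that detection idea (Unix-only, not the actual worker-host code): inspect the child's exit status and check whether it was terminated by `SIGKILL`, as the OOM killer would do, rather than exiting normally or crashing with `SIGSEGV`. The `sleep` child is a stand-in for a PVF worker.

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Stand-in for spawning and waiting on a PVF worker process.
    let status = Command::new("sleep").arg("1").status()?;

    match status.signal() {
        // 9 == SIGKILL: the likely signature of the kernel OOM killer.
        Some(9) => println!("worker was SIGKILLed (possible OOM)"),
        // 11 == SIGSEGV: a crash inside the worker itself.
        Some(11) => println!("worker segfaulted"),
        Some(sig) => println!("worker terminated by signal {sig}"),
        None => println!("worker exited with code {:?}", status.code()),
    }
    Ok(())
}
```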
> Can we tell OOM from the other cases?
Someone (I believe @burdges?) had the idea of setting a flag before each allocation and unsetting it afterwards; this would be easy to do if we had the allocator wrapper that's been discussed. And I agree that we should focus on OOM, since we've seen it happen in the wild.
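A minimal sketch of that allocator-wrapper idea, assuming nothing about the discussed design beyond "set a flag before each allocation and clear it after": a `GlobalAlloc` wrapper around the system allocator toggles an atomic flag that a crash handler or the supervising process could inspect after an abnormal exit.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicBool, Ordering};

static ALLOCATING: AtomicBool = AtomicBool::new(false);

struct FlaggingAllocator;

unsafe impl GlobalAlloc for FlaggingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Mark that an allocation is in progress; if we die here, the flag
        // stays set and hints at OOM.
        ALLOCATING.store(true, Ordering::SeqCst);
        let ptr = System.alloc(layout);
        ALLOCATING.store(false, Ordering::SeqCst);
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static GLOBAL: FlaggingAllocator = FlaggingAllocator;

fn main() {
    let v: Vec<u8> = Vec::with_capacity(1024);
    drop(v);
    // In a real worker, a signal handler or the supervising process would
    // read this flag (e.g. via shared memory) after an abnormal exit.
    println!("allocation-in-progress flag: {}", ALLOCATING.load(Ordering::SeqCst));
}
```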
roadmap/implementers-guide/src/node/utility/candidate-validation.md
…on.md Co-authored-by: Andrei Sandu <[email protected]>
- Measure the CPU time in the prepare thread, so the observed time is not affected by any delays in joining on the thread.
- Measure the full CPU time in the execute thread.
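A hedged sketch of measuring CPU time from inside the worker thread itself, so delays in joining on the thread don't inflate the observed value. This uses `clock_gettime(CLOCK_THREAD_CPUTIME_ID)` via the `libc` crate (an assumption, not plain std), and the busy loop is a stand-in for PVF preparation/execution work.

```rust
use std::time::Duration;

fn thread_cpu_time() -> Duration {
    // SAFETY: zeroed timespec is a valid value; clock_gettime fills it in.
    let mut ts: libc::timespec = unsafe { std::mem::zeroed() };
    unsafe { libc::clock_gettime(libc::CLOCK_THREAD_CPUTIME_ID, &mut ts) };
    Duration::new(ts.tv_sec as u64, ts.tv_nsec as u32)
}

fn main() {
    let handle = std::thread::spawn(|| {
        let start = thread_cpu_time();
        // Stand-in for the actual preparation/execution work.
        let mut acc = 0u64;
        for i in 0..5_000_000u64 {
            acc = acc.wrapping_add(i);
        }
        let cpu_used = thread_cpu_time() - start;
        (acc, cpu_used)
    });
    let (_result, cpu_used) = handle.join().unwrap();
    println!("worker thread CPU time: {cpu_used:?}");
}
```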
Use condvars, i.e. `Arc::new((Mutex::new(true), Condvar::new()))`, as per the std docs. I also considered using a condvar to signal the CPU thread to end, in place of an mpsc channel. This was not done because `Condvar::wait_timeout_while` is documented as being imprecise, while `mpsc::Receiver::recv_timeout` is not documented as such. Also, we would need a separate condvar to avoid this case: the worker thread finishes its job, notifies the condvar, the CPU thread returns first, and we join on it and not the worker thread. So it was simpler to leave this part as is.
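For reference, a minimal sketch of the condvar pattern from the std docs mentioned above: the worker thread flips a "pending" flag under the mutex and notifies, while the other thread blocks in `wait_while` until the flag clears. This illustrates the pattern only, not the actual worker code.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

fn main() {
    let pair = Arc::new((Mutex::new(true), Condvar::new()));
    let pair2 = Arc::clone(&pair);

    // Stand-in for the execution worker thread finishing its job.
    thread::spawn(move || {
        let (lock, cvar) = &*pair2;
        let mut pending = lock.lock().unwrap();
        *pending = false;
        cvar.notify_one();
    });

    // Main thread waits until the job is no longer pending.
    let (lock, cvar) = &*pair;
    let _guard = cvar
        .wait_while(lock.lock().unwrap(), |pending| *pending)
        .unwrap();
    println!("worker thread signalled completion");
}
```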
…panics-in-execution
…ution' into mrcnski/pvf-catch-panics-in-execution
What other internal errors are left and can we be more specific about them?
@eskimor That is a great question. I enumerated the internal errors here (should jump to …)
bot merge
* master: (60 commits)
  Ensure all `StorageVersion`s on Rococo/Westend are correct and migration hooks pass (#7251)
  Try-runtime proper return types (#7146)
  Have OCW mined election once a week on Westend (#7248)
  Bump enumn from 0.1.5 to 0.1.8 (#7226)
  Companion to #14183: FRAME: Allow message ID to be mutated in `ProcessMessage` (#7262)
  Remove TODO comment (#7260)
  Fix build (#7261)
  Update syn (#7258)
  Use Message Queue pallet for UMP dispatch (#6271)
  Freeze chain if there are byzantine threshold + 1 invalid votes against a local candidate (#7225)
  Revert chain if at least f+1 validators voted against a candidate (#7151)
  Ensure all `StorageVersion`s on Polkadot/Kusama are correct (#7199)
  Forgotten pub reexport for `GlobalConsensusParachainConvertsFor` (#7238)
  PVF: Vote invalid on panics in execution thread (after a retry) (#7155)
  PVF: Remove `rayon` and some uses of `tokio` (#7153)
  [xcm] Foreign global consensus parachain LocationToAccountId converter (#7016)
  Update docs (#7230)
  Bump parity-db to 0.4.8 (#7231)
  Merge branch 'master' of https://github.com/paritytech/polkadot (#7224)
  Relax the watermark rule in the runtime (#7188)
  ...
…slashing-client
* ao-past-session-slashing-runtime: (61 commits)
  Ensure all `StorageVersion`s on Rococo/Westend are correct and migration hooks pass (#7251)
  Try-runtime proper return types (#7146)
  Have OCW mined election once a week on Westend (#7248)
  Bump enumn from 0.1.5 to 0.1.8 (#7226)
  Companion to #14183: FRAME: Allow message ID to be mutated in `ProcessMessage` (#7262)
  Remove TODO comment (#7260)
  Fix build (#7261)
  Update syn (#7258)
  Use Message Queue pallet for UMP dispatch (#6271)
  Freeze chain if there are byzantine threshold + 1 invalid votes against a local candidate (#7225)
  Revert chain if at least f+1 validators voted against a candidate (#7151)
  Ensure all `StorageVersion`s on Polkadot/Kusama are correct (#7199)
  Forgotten pub reexport for `GlobalConsensusParachainConvertsFor` (#7238)
  PVF: Vote invalid on panics in execution thread (after a retry) (#7155)
  PVF: Remove `rayon` and some uses of `tokio` (#7153)
  [xcm] Foreign global consensus parachain LocationToAccountId converter (#7016)
  Update docs (#7230)
  Bump parity-db to 0.4.8 (#7231)
  Merge branch 'master' of https://github.com/paritytech/polkadot (#7224)
  Relax the watermark rule in the runtime (#7188)
  ...
PULL REQUEST
Overview
See title. Also make sure we kill the worker process on panic errors and internal errors to potentially clear any error states independent of the candidate.
Related
Closes #7045
Addresses some TODOs from #7153, which is set as the target branch.