PVF: Vote invalid on panics in execution thread (after a retry) #7155
Conversation
1. We were using `rayon` to spawn a superfluous thread to do execution, so it was removed.
2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible [per-runtime](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_stack_size) but not per-thread). Since we want to remove `tokio` from the workers [anyway](https://github.com/paritytech/polkadot/issues/7117), I changed it to spawn threads with the `std::thread` API instead of `tokio`.[^1]
3. Since the `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was changed to a naive loop.
4. The order of thread selection was flipped to make (3) sound (see note in code).

I left some TODOs related to panics which I'm going to address soon as part of #7045.

[^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet.
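A minimal sketch of point 2 above: with `std::thread`, the stack size can be set per thread via `Builder::stack_size`, whereas `tokio` only supports it per runtime. The thread name and stack-size value here are made up for illustration, not the worker's actual configuration.

```rust
use std::thread;

fn main() {
    // Hypothetical value; the real worker would use its own configured size.
    const EXECUTE_THREAD_STACK_SIZE: usize = 2 * 1024 * 1024;

    let handle = thread::Builder::new()
        .name("execute-thread".into())
        .stack_size(EXECUTE_THREAD_STACK_SIZE)
        .spawn(|| {
            // Stand-in for PVF execution work.
            "done"
        })
        .expect("failed to spawn execute thread");

    println!("{}", handle.join().unwrap());
}
```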
Also make sure we kill the worker process on panic errors and internal errors to potentially clear any error states independent of the candidate.
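A hedged sketch of that policy, not the actual candidate-validation code: the supervising side kills the worker process on panic or internal errors, but keeps it alive for candidate-specific failures. The `WorkerError` enum and the `sleep` child process are hypothetical stand-ins.

```rust
use std::process::{Child, Command};

// Hypothetical error classification; the real code has richer variants.
enum WorkerError {
    InvalidCandidate,
    Panic(String),
    Internal(String),
}

fn handle_error(worker: &mut Child, err: &WorkerError) -> std::io::Result<()> {
    match err {
        // Candidate-specific failure: the worker can stay alive.
        WorkerError::InvalidCandidate => Ok(()),
        // Panics and internal errors: kill the worker to clear any error
        // state that is independent of the candidate.
        WorkerError::Panic(_) | WorkerError::Internal(_) => worker.kill(),
    }
}

fn main() -> std::io::Result<()> {
    let mut worker = Command::new("sleep").arg("10").spawn()?;
    handle_error(&mut worker, &WorkerError::Panic("worker panicked".into()))?;
    let status = worker.wait()?;
    println!("worker exited: {status}");
    Ok(())
}
```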
…panics-in-execution
Addresses a couple of follow-up TODOs from #7153.
// TODO: Should we stop retrying after some time has passed?
// We would need to know if we came from backing or approval.
@s0me0ne-unkn0wn IIRC you've looked into this, WDYT?
Why would a panic resolve itself after a retry? Maybe in case of memory pressure -> OOM? But still, if we retry and wait too much time, malicious validators might have enough time to vote valid and the including block will be finalized. We currently have a two-relay-chain-block delay on finality, which should help, but we'd have to be careful here.
You mean in the case of backing you want the strict timeout to be applied to both retries cumulatively, not to a single one? Looks like a valid idea. There's no dedicated field to tell which subsystem the request is coming from yet, but you can check the execution timeout kind in the incoming `CandidateValidationMessage::ValidateFromExhaustive` or `CandidateValidationMessage::ValidateFromChainState`; it will be either `PvfExecTimeoutKind::Backing` or `PvfExecTimeoutKind::Approval`, so you can tell backing from the other stuff. Sounds a bit hacky, probably, but it will work.
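A minimal sketch of how such a retry policy could branch on the timeout kind. This is not the actual candidate-validation code: the enum below is a local stand-in for the real `PvfExecTimeoutKind`, and the deadline value is made up.

```rust
// Local stand-in for `PvfExecTimeoutKind`, for illustration only.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum PvfExecTimeoutKind {
    Backing,
    Approval,
}

/// Decide whether another retry is allowed: backing has a strict deadline,
/// so stop retrying once the cumulative time spent exceeds it; approval work
/// can afford another attempt.
fn retry_allowed(
    kind: PvfExecTimeoutKind,
    total_time_spent: std::time::Duration,
    backing_deadline: std::time::Duration,
) -> bool {
    match kind {
        PvfExecTimeoutKind::Backing => total_time_spent < backing_deadline,
        PvfExecTimeoutKind::Approval => true,
    }
}

fn main() {
    use std::time::Duration;
    // Example: 1.5s already spent against a hypothetical 2s backing deadline.
    assert!(retry_allowed(PvfExecTimeoutKind::Backing, Duration::from_millis(1500), Duration::from_secs(2)));
    assert!(!retry_allowed(PvfExecTimeoutKind::Backing, Duration::from_secs(3), Duration::from_secs(2)));
}
```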
> Why would a panic resolve itself after a retry? Maybe in case of memory pressure -> OOM?
Maybe, or also just a hardware glitch, spurious OS glitch, etc. We can't really tell why a panic happened, or whether it was an issue with the candidate or not.[^1] Retrying lets us be more sure about our vote.
[^1]: Well, we do get a stringified error, but we can't match on it or anything.
Thanks @s0me0ne-unkn0wn! Pushed some changes.
> Why would a panic resolve itself after a retry? Maybe in case of memory pressure -> OOM? But still, if we retry and wait too much time, malicious validators might have enough time to vote valid and the including block will be finalized. We currently have a two-relay-chain-block delay on finality, which should help, but we'd have to be careful here.
No, this should not be possible. If it takes too long, we no-show and more validators will cover for us.
But still, if those new guys also retry, they will also no-show and malicious votes come in?
They will then be replaced by even more checkers. One no-show will be replaced by more than one new checker, which will again be 2/3 honest. Anyhow, this perfect DoS of all honest nodes is certainly the absolute worst case, so yes, we have to be careful about when not to raise a dispute. In general, if in doubt, raise one.

As for hardware faults being the cause, I think we should look into error checking. E.g. we already suggest/require ECC memory; we should have something similar for disk/db. The node should be able to detect a corrupted db and not just read garbage data and then dispute stuff.
CPU hardware faults are extremely rare (I'm closely familiar with a single case, though a really interesting one), and memory faults are not so rare (I came across two cases in my life) but still not something that should be seriously taken into account (that's something that could be handled by governance reverting the slashes on a case-by-case basis). So do I understand correctly that we are left with a single really concerning case, OOM? Can we tell OOM from the other cases? AFAIR Linux's `oom_kill_task()` sends `SIGKILL` to the process instead of `SIGSEGV`; could we use that to detect the condition?
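A hedged sketch of that detection idea (Unix-only, not the actual worker-host code): inspect the child's exit status and check whether it was terminated by `SIGKILL`, as the OOM killer would do, rather than exiting normally or crashing with `SIGSEGV`. The `sleep` child is a stand-in for a PVF worker.

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Stand-in for spawning and waiting on a PVF worker process.
    let status = Command::new("sleep").arg("1").status()?;

    match status.signal() {
        // 9 == SIGKILL: the likely signature of the kernel OOM killer.
        Some(9) => println!("worker was SIGKILLed (possible OOM)"),
        // 11 == SIGSEGV: a crash inside the worker itself.
        Some(11) => println!("worker segfaulted"),
        Some(sig) => println!("worker terminated by signal {sig}"),
        None => println!("worker exited with code {:?}", status.code()),
    }
    Ok(())
}
```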
> Can we tell OOM from the other cases?
Someone (I believe @burdges?) had the idea of setting a flag before each allocation and unsetting it afterwards; this would be easy to do if we had the allocator wrapper that's been discussed. And I agree that we should focus on OOM, since we've seen it happen in the wild.
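A minimal sketch of that allocator-wrapper idea, assuming nothing about the discussed design beyond "set a flag before each allocation and clear it after": a `GlobalAlloc` wrapper around the system allocator toggles an atomic flag that a crash handler or the supervising process could inspect after an abnormal exit.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicBool, Ordering};

static ALLOCATING: AtomicBool = AtomicBool::new(false);

struct FlaggingAllocator;

unsafe impl GlobalAlloc for FlaggingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Mark that an allocation is in progress; if we die here, the flag
        // stays set and hints at OOM.
        ALLOCATING.store(true, Ordering::SeqCst);
        let ptr = System.alloc(layout);
        ALLOCATING.store(false, Ordering::SeqCst);
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static GLOBAL: FlaggingAllocator = FlaggingAllocator;

fn main() {
    let v: Vec<u8> = Vec::with_capacity(1024);
    drop(v);
    // In a real worker, a signal handler or the supervising process would
    // read this flag (e.g. via shared memory) after an abnormal exit.
    println!("allocation-in-progress flag: {}", ALLOCATING.load(Ordering::SeqCst));
}
```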
roadmap/implementers-guide/src/node/utility/candidate-validation.md
…on.md Co-authored-by: Andrei Sandu <[email protected]>
- Measure the CPU time in the prepare thread, so the observed time is not affected by any delays in joining on the thread.
- Measure the full CPU time in the execute thread.
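A hedged sketch of measuring CPU time from inside the worker thread itself, so delays in joining on the thread don't inflate the observed value. This uses `clock_gettime(CLOCK_THREAD_CPUTIME_ID)` via the `libc` crate (an assumption, not plain std), and the busy loop is a stand-in for PVF preparation/execution work.

```rust
use std::time::Duration;

fn thread_cpu_time() -> Duration {
    // SAFETY: zeroed timespec is a valid value; clock_gettime fills it in.
    let mut ts: libc::timespec = unsafe { std::mem::zeroed() };
    unsafe { libc::clock_gettime(libc::CLOCK_THREAD_CPUTIME_ID, &mut ts) };
    Duration::new(ts.tv_sec as u64, ts.tv_nsec as u32)
}

fn main() {
    let handle = std::thread::spawn(|| {
        let start = thread_cpu_time();
        // Stand-in for the actual preparation/execution work.
        let mut acc = 0u64;
        for i in 0..5_000_000u64 {
            acc = acc.wrapping_add(i);
        }
        let cpu_used = thread_cpu_time() - start;
        (acc, cpu_used)
    });
    let (_result, cpu_used) = handle.join().unwrap();
    println!("worker thread CPU time: {cpu_used:?}");
}
```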
Use condvars, i.e. `Arc::new((Mutex::new(true), Condvar::new()))`, as per the std docs. I also considered using a condvar to signal the CPU thread to end, in place of an mpsc channel. This was not done because `Condvar::wait_timeout_while` is documented as being imprecise, while `mpsc::Receiver::recv_timeout` is not documented as such. Also, we would need a separate condvar to avoid this case: the worker thread finishes its job, notifies the condvar, the CPU thread returns first, and we join on it and not the worker thread. So it was simpler to leave this part as is.
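For reference, a minimal sketch of the condvar pattern from the std docs mentioned above: the worker thread flips a "pending" flag under the mutex and notifies, while the other thread blocks in `wait_while` until the flag clears. This illustrates the pattern only, not the actual worker code.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

fn main() {
    let pair = Arc::new((Mutex::new(true), Condvar::new()));
    let pair2 = Arc::clone(&pair);

    // Stand-in for the execution worker thread finishing its job.
    thread::spawn(move || {
        let (lock, cvar) = &*pair2;
        let mut pending = lock.lock().unwrap();
        *pending = false;
        cvar.notify_one();
    });

    // Main thread waits until the job is no longer pending.
    let (lock, cvar) = &*pair;
    let _guard = cvar
        .wait_while(lock.lock().unwrap(), |pending| *pending)
        .unwrap();
    println!("worker thread signalled completion");
}
```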
…panics-in-execution
…ution' into mrcnski/pvf-catch-panics-in-execution
What other internal errors are left and can we be more specific about them?
@eskimor That is a great question. I enumerated the internal errors here (should jump to …)
bot merge
* master: (60 commits)
  Ensure all `StorageVersion`s on Rococo/Westend are correct and migration hooks pass (#7251)
  Try-runtime proper return types (#7146)
  Have OCW mined election once a week on Westend (#7248)
  Bump enumn from 0.1.5 to 0.1.8 (#7226)
  Companion to #14183: FRAME: Allow message ID to be mutated in `ProcessMessage` (#7262)
  Remove TODO comment (#7260)
  Fix build (#7261)
  Update syn (#7258)
  Use Message Queue pallet for UMP dispatch (#6271)
  Freeze chain if there are byzantine threshold + 1 invalid votes against a local candidate (#7225)
  Revert chain if at least f+1 validators voted against a candidate (#7151)
  Ensure all `StorageVersion`s on Polkadot/Kusama are correct (#7199)
  Forgotten pub reexport for `GlobalConsensusParachainConvertsFor` (#7238)
  PVF: Vote invalid on panics in execution thread (after a retry) (#7155)
  PVF: Remove `rayon` and some uses of `tokio` (#7153)
  [xcm] Foreign global consensus parachain LocationToAccountId converter (#7016)
  Update docs (#7230)
  Bump parity-db to 0.4.8 (#7231)
  Merge branch 'master' of https://github.com/paritytech/polkadot (#7224)
  Relax the watermark rule in the runtime (#7188)
  ...
…slashing-client
* ao-past-session-slashing-runtime: (61 commits)
  Ensure all `StorageVersion`s on Rococo/Westend are correct and migration hooks pass (#7251)
  Try-runtime proper return types (#7146)
  Have OCW mined election once a week on Westend (#7248)
  Bump enumn from 0.1.5 to 0.1.8 (#7226)
  Companion to #14183: FRAME: Allow message ID to be mutated in `ProcessMessage` (#7262)
  Remove TODO comment (#7260)
  Fix build (#7261)
  Update syn (#7258)
  Use Message Queue pallet for UMP dispatch (#6271)
  Freeze chain if there are byzantine threshold + 1 invalid votes against a local candidate (#7225)
  Revert chain if at least f+1 validators voted against a candidate (#7151)
  Ensure all `StorageVersion`s on Polkadot/Kusama are correct (#7199)
  Forgotten pub reexport for `GlobalConsensusParachainConvertsFor` (#7238)
  PVF: Vote invalid on panics in execution thread (after a retry) (#7155)
  PVF: Remove `rayon` and some uses of `tokio` (#7153)
  [xcm] Foreign global consensus parachain LocationToAccountId converter (#7016)
  Update docs (#7230)
  Bump parity-db to 0.4.8 (#7231)
  Merge branch 'master' of https://github.com/paritytech/polkadot (#7224)
  Relax the watermark rule in the runtime (#7188)
  ...
PULL REQUEST
Overview
See title. Also make sure we kill the worker process on panic errors and internal errors to potentially clear any error states independent of the candidate.
Related
Closes #7045
Addresses some TODOs from #7153, which is set as the target branch.