-
Notifications
You must be signed in to change notification settings - Fork 709
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PVF host: Investigate how to better launch worker processes #880
Comments
pepyakin
changed the title
Investigate how to better launch worker processes
PVF host: Investigate how to better launch worker processes
Jan 14, 2022
mrcnski
referenced
this issue
in paritytech/polkadot
Aug 6, 2023
Clearing env vars with the `std::process::Command` API didn't get everything on Mac, namely `__CF_USER_TEXT_ENCODING` was still present. While we don't support Mac itself as a secure system, the same issue could exist on some Linux systems either now or in the future. So it is better to just clear it on the child-side and not worry about it. We may not use the `Command` API in the future, anyway: https://github.com/paritytech/polkadot/issues/4721
claravanstaden
added a commit
to Snowfork/polkadot-sdk
that referenced
this issue
Dec 8, 2023
* Upgrade ssz_rs crate. * Upgrade ssz_rs crate. --------- Co-authored-by: claravanstaden <Cats 4 life!>
claravanstaden
added a commit
to Snowfork/polkadot-sdk
that referenced
this issue
Dec 8, 2023
* Improve contracts * More improvements to contracts * Fix parachain build in proxy-contracts branch (paritytech#889) * Upgrade ssz_rs crate. (paritytech#880) * Upgrade ssz_rs crate. * Upgrade ssz_rs crate. --------- Co-authored-by: claravanstaden <Cats 4 life!> * Bump nixpkgs to use its geth package (paritytech#885) * Bump nixpkgs to use its geth package * Use source config in PolkadotListener * Match field order in struct def * Move info log to relay creation * Remove unused channel * Whitespace * Add context to errors * Add logs * Bump node & pnpm in workflow * Rename locals * Remove unused variable * Add Troubleshooting README section * Fix up .envrc-example files * Add note about pure shells * Update cumulus submodule (paritytech#886) * Inbound queue benchmarks (paritytech#876) * Start with inbound channel benchmarks. * Add method to set execution header storage for benchmark test. * Working on benchmarks * Basic working version * Cleanup * Removes cleanup. * Adds some comments for Alistair. * Adds branch name. * Makes note * Test transactions * Cleaning up beacon client deps. * Clean up comments. * Tests cleanup. * Fixes non-benchmark test runs. * Cleanup. * Update fixtures and generates benchmarks. * Revert relayer logs. * Cleanup BenchmarkHelper impl and inbound queue dependencies. * fmt * Cleanup imports. * Cleanup imports. * Touch * Adds weights in inbound queue pallet. * Fix tests. * Update cumulus. --------- Co-authored-by: claravanstaden <Cats 4 life!> * Fix parachain build * Move BalanceOf outside of pallet * remove benchmark for non existing method * downgrade cargo.lock to match cumulus * fix benchmarks --------- Co-authored-by: Clara van Staden <[email protected]> Co-authored-by: Alistair Singh <[email protected]> * forge install: openzeppelin-contracts v4.9.2 * sno-472 rebased (paritytech#890) * Halting & resuming bridge pallets * Ignore .env * Remove .env * Some polish * Set owner of bridge pallets * Upgrade ssz_rs crate. (paritytech#880) * Upgrade ssz_rs crate. * Upgrade ssz_rs crate. --------- Co-authored-by: claravanstaden <Cats 4 life!> * Update cumulus * Bump nixpkgs to use its geth package (paritytech#885) * Bump nixpkgs to use its geth package * Use source config in PolkadotListener * Match field order in struct def * Move info log to relay creation * Remove unused channel * Whitespace * Add context to errors * Add logs * Bump node & pnpm in workflow * Rename locals * Remove unused variable * Add Troubleshooting README section * Fix up .envrc-example files * Add note about pure shells * Relax RANDAO_COMMIT_DELAY for local setup * Update cumulus * Update cumulus submodule (paritytech#886) * Update cumulus * Inbound queue benchmarks (paritytech#876) * Start with inbound channel benchmarks. * Add method to set execution header storage for benchmark test. * Working on benchmarks * Basic working version * Cleanup * Removes cleanup. * Adds some comments for Alistair. * Adds branch name. * Makes note * Test transactions * Cleaning up beacon client deps. * Clean up comments. * Tests cleanup. * Fixes non-benchmark test runs. * Cleanup. * Update fixtures and generates benchmarks. * Revert relayer logs. * Cleanup BenchmarkHelper impl and inbound queue dependencies. * fmt * Cleanup imports. * Cleanup imports. * Touch * Adds weights in inbound queue pallet. * Fix tests. * Update cumulus. --------- Co-authored-by: claravanstaden <Cats 4 life!> * Fix test * Fix parachain build * Move BalanceOf outside of pallet * remove benchmark for non existing method * downgrade cargo.lock to match cumulus * fix benchmarks * Clara/sno 552 (paritytech#887) * Spacing * Spacing * Undo typo. * Minor updates. * Adds comment about IrrelevantUpdate. * One more comment. * Update error name. --------- Co-authored-by: claravanstaden <Cats 4 life!> * Halting & resuming bridge pallets (paritytech#883) * Halting & resuming bridge pallets * Ignore .env * Remove .env * Some polish * Set owner of bridge pallets * Update cumulus * Relax RANDAO_COMMIT_DELAY for local setup * Update cumulus * Update cumulus * Fix test * Fix Warnings * Fix test * Fix build & format * Fix benchmark test * Check for duplicate versions of substrate and polkadot (paritytech#891) * modified pre-commit * fixes * testing * testing * testing * fixed tests * Format * Some fix * Update cumulus * Update cumulus --------- Co-authored-by: Clara van Staden <[email protected]> Co-authored-by: David Dunn <[email protected]> Co-authored-by: Alistair Singh <[email protected]> * Proxy contracts Tests fixes (paritytech#892) * fixed tests * warnings and imports * rustfmt * updated cumulus * Revert rustfmt (paritytech#896) * Revert "rustfmt" This reverts commit b83cec7929cdcc6ac972a2b1cad3a0e1fde81870. * reverted xcm-builder * Create agent (paritytech#895) * base * removed location conversion * completed implementation * remove xcm-builder * update cumulus * update cumulus * use contains_key * Fix openzeppelin-contracts submodule --------- Co-authored-by: David Dunn <[email protected]> * improve API for sending tokens * Messy working version. * Cleanup. * More cleanup. * Rollback unnecessary changes. * Cleanup whitespace. * Fix tests. * Fuzz the submit method. * Revert rebase oopsies. * More fuzzing. * More fuzzing. * Last extrinsic. * Remove unnecessary dependency. * Revert readme. * Fix tests and feature issues. * Cleanup types. * More cleanup. * Call extrinsics directly. Adds readme. * Adds CI. * cd to correct dir * Update CI. * Correct nightly param. * Remove runs. * Own impl for SyncCommittee. * Remove rng. Cleans up workflow. * Last cleanup. * Update parachain/pallets/ethereum-beacon-client/fuzz/Cargo.toml Co-authored-by: David Dunn <[email protected]> * Adds cargo-fuzz to init script. * PR comment changes. * Update parachain/pallets/ethereum-beacon-client/fuzz/src/impls.rs Co-authored-by: David Dunn <[email protected]> * Update rust-toolchain.toml Co-authored-by: David Dunn <[email protected]> * Update .github/workflows/parachain.yml Co-authored-by: David Dunn <[email protected]> * Less runs for shorter Github actions. --------- Co-authored-by: Vincent Geddes <[email protected]> Co-authored-by: David Dunn <[email protected]> Co-authored-by: Alistair Singh <[email protected]> Co-authored-by: Ron <[email protected]> Co-authored-by: claravanstaden <Cats 4 life!>
helin6
pushed a commit
to boolnetwork/polkadot-sdk
that referenced
this issue
Feb 5, 2024
* Update substrate to polkadot-v0.9.30 - Update substrate to commit 1b1a5e1 - Upgrade rust-toolchain to rustc 1.64.0-nightly (7fe022f5a 2022-07-24) - Make benchmark command as an optional feature - Fix benchmark template and script - Upgrade web3 to v1.8 and update ts-tests/package-lock.json Signed-off-by: koushiro <[email protected]> * Remove useless dep Signed-off-by: koushiro <[email protected]> * Fix ident Signed-off-by: koushiro <[email protected]> * Install protoc * Update some deps * Fix test * update prometheus to v0.13.3 Signed-off-by: koushiro <[email protected]> * Use evm 0.36.0 Signed-off-by: koushiro <[email protected]> * update lru 0.8.0 ==> 0.8.1 Signed-off-by: koushiro <[email protected]> * update scale-codec and scale-info Signed-off-by: koushiro <[email protected]> * Fix fmt Signed-off-by: koushiro <[email protected]> Signed-off-by: koushiro <[email protected]>
serban300
pushed a commit
to serban300/polkadot-sdk
that referenced
this issue
Mar 26, 2024
* update CA certificates in relay images * clippy * clippy * clippy
serban300
pushed a commit
to serban300/polkadot-sdk
that referenced
this issue
Mar 27, 2024
* update CA certificates in relay images * clippy * clippy * clippy
serban300
pushed a commit
to serban300/polkadot-sdk
that referenced
this issue
Apr 8, 2024
* update CA certificates in relay images * clippy * clippy * clippy
serban300
pushed a commit
to serban300/polkadot-sdk
that referenced
this issue
Apr 8, 2024
* update CA certificates in relay images * clippy * clippy * clippy
serban300
pushed a commit
to serban300/polkadot-sdk
that referenced
this issue
Apr 8, 2024
* update CA certificates in relay images * clippy * clippy * clippy
serban300
pushed a commit
to serban300/polkadot-sdk
that referenced
this issue
Apr 8, 2024
* update CA certificates in relay images * clippy * clippy * clippy
serban300
pushed a commit
to serban300/polkadot-sdk
that referenced
this issue
Apr 8, 2024
* update CA certificates in relay images * clippy * clippy * clippy
serban300
pushed a commit
to serban300/polkadot-sdk
that referenced
this issue
Apr 9, 2024
* update CA certificates in relay images * clippy * clippy * clippy
serban300
pushed a commit
to serban300/polkadot-sdk
that referenced
this issue
Apr 9, 2024
* update CA certificates in relay images * clippy * clippy * clippy
serban300
pushed a commit
to serban300/polkadot-sdk
that referenced
this issue
Apr 9, 2024
* update CA certificates in relay images * clippy * clippy * clippy
serban300
pushed a commit
to serban300/polkadot-sdk
that referenced
this issue
Apr 9, 2024
* update CA certificates in relay images * clippy * clippy * clippy
serban300
pushed a commit
to serban300/polkadot-sdk
that referenced
this issue
Apr 9, 2024
* update CA certificates in relay images * clippy * clippy * clippy
serban300
pushed a commit
to serban300/polkadot-sdk
that referenced
this issue
Apr 9, 2024
* update CA certificates in relay images * clippy * clippy * clippy
serban300
pushed a commit
to serban300/polkadot-sdk
that referenced
this issue
Apr 10, 2024
* update CA certificates in relay images * clippy * clippy * clippy
serban300
pushed a commit
to serban300/polkadot-sdk
that referenced
this issue
Apr 10, 2024
* update CA certificates in relay images * clippy * clippy * clippy
bkchr
pushed a commit
that referenced
this issue
Apr 10, 2024
* update CA certificates in relay images * clippy * clippy * clippy
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Right now the PVF validation host spawns worker processes for doing compilation of wasm blobs and executing wasm code during validation of candidates.
The processes are spawned by taking the polkadot executable and invoking it via
Command
with additional parameters that specify the type of the worker and the path to the IPC socket that the worker will receive commands with. At the same time, the polkadot binary provides command-line arguments that would pass the control to the workers if a special worker-type command was passed.This proved to be an awkward way of doing things. It is one of the issues what prevents us from testing the validation host thoroughly (#2342) and already shot us in the foot (e.g. this).
I did a preliminary research on how to do the things better.
First what comes to mind is
fork(2)
. We fork the process and then pass the control to the worker entrypoint. However, this seems to be rather fragile. The reason is that fork preserves the memory content of the process, it's opened fds and at the same time only the thread that calledfork
will be running. The implication of that, is that the state of the program in the child preserves all the mutexes and the control flow can be anywhere. One of obvious examples of what can go wrong, is that if another thread calledmalloc
and got the lock then if the thread that called thefork
callsmalloc
as well, then it will get deadlocked. So therefore, onlyasync-signal-safe
are callable. Theoretically this may be solved by resuming threads but that's also obviously a bad idea: an example problem is that the same thread will write into the same descriptor in both parent and child process. That can corrupt data and send gibberish into sockets.vfork
is a special case offork
which is typically followed byexecve
. The latter basically the primitive that is used byCommand
and requires the filename to execute, which basically brings us to the same problem as the existing solution.If we step back and think what we actually need here we would notice that we just need to reuse the same executable image but specify a different entrypoint. Ideally, this entrypoint should be represented by a function address and would not require the executable to handle special arguments for invoking the required worker.
On Linux, there is a
clone(2)
syscall which is basically a pumped upfork
. It allows for specifying the entrypoint as a function pointer and it also takes flags to specify what parts of the parent to be carried on to the child process. The parent can either share or not share the address space, can put the child into new namespace, and so on. (In fact, using this syscall is how pthread on Linux is implemented).If we were to use it it would also allow us to introduce sandboxing of the worker processes, a feature envisioned in the initial design of the overhauled PVF validation host.
So in our case the parent will invoke
clone
, sharing almost nothing with the child process, passing an entrypoint to the required worker.However, there is a problem with that. Even though Rust has a minimal runtime, it still has to be initialized. For that, we must make sure that the pre-main stuff is executed as if we were invoking a program but with a different
main
entrypoint. I am not sure ifclone(2)
handles that by default or if that should be taken care of manually. In the end, C also has pre-main and from what I have seen in C there is no special fluffing needed. In any case, this should be investigated.One implication of going this way, is that
clone(2)
is Linux only syscall with no obvious (to me) alternatives in macOS, and we still want to support that. There is this discussion about that #881The text was updated successfully, but these errors were encountered: