[DNM] Wrapper allocator PoC #7206
base: master
Conversation
node/wrapper-allocator/src/lib.rs
Outdated
```rust
		let allocated = ALLOCATOR_DATA.allocated.load(SeqCst);
		let old_cp = ALLOCATOR_DATA.checkpoint.swap(allocated, SeqCst);
		ALLOCATOR_DATA.peak.swap(allocated, SeqCst).saturating_sub(old_cp)
```
While access to each field of `WrapperAllocatorData` is atomic, this method (as others too) loads them one by one, and you can expect anything to happen between these loads. For example:

- Read `let allocated = ALLOCATOR_DATA.allocated.load(SeqCst);`. The scheduler puts our thread to sleep.
- Another thread calls `alloc` and then `checkpoint`.
- We wake up, and the checkpoint goes back in time due to `allocated` being outdated.

Either these values shouldn't depend on each other and work as plain counters or something, or the whole struct should go under a mutex.

I mean, I understand this is our best effort, and thus it depends on what it is we're trying to achieve.
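To make the "whole struct under a mutex" option concrete, here is a rough sketch (the struct and field names are assumed from the snippet above, not taken from the PR): all three fields are read and reset under one lock, so the checkpoint can never observe a stale `allocated`.

```rust
use std::sync::Mutex;

// Hypothetical consolidated state; field names mirror the snippet above.
struct TrackerState {
    allocated: usize,
    checkpoint: usize,
    peak: usize,
}

static STATE: Mutex<TrackerState> =
    Mutex::new(TrackerState { allocated: 0, checkpoint: 0, peak: 0 });

// Returns the peak observed since the previous checkpoint and then resets it,
// all under a single lock, so no interleaving allocation can be half-observed.
fn checkpoint() -> usize {
    let mut s = STATE.lock().unwrap();
    let delta = s.peak.saturating_sub(s.checkpoint);
    s.checkpoint = s.allocated;
    s.peak = s.allocated;
    delta
}
```

The obvious downside, as discussed below, is that `alloc()` would then have to take the same lock.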
> While access to each field of `WrapperAllocatorData` is atomic, this method (as others too) loads them one by one, and you can expect anything to happen between these loads.
Yes, that's exactly the concern I'm worried about! And I'm trying to convince myself it is okay. Not sure at all.
It's okay to have a global lock in `checkpoint()`. It's definitely not okay to have a global lock in `alloc()`. I've omitted a mutex in `checkpoint()` for now as it is supposed to be called from a single thread only, but we can add it there; a checkpoint is a rare event. `alloc()` is much more concerning.
Say we have two threads, as in your example, and they're allocating at nearly the same point in time. I'm indexing the values below for clarity.
- Thread1 executes `old_alloc1 = allocated.fetch_add(layout_size1)`. At this point `allocated` is updated and represents the new size.
- Then the context switches to another thread.
- Thread2 executes `old_alloc2 = allocated.fetch_add(layout_size2)`. At this point, `allocated` is updated again and represents the true allocated value. Both `old_alloc1` and `old_alloc2` are set to proper values.
- Thread2 executes `peak.fetch_max(old_alloc2 + layout_size2)`. The peak value is properly updated.
- The context switches back and now thread1 executes `peak.fetch_max(old_alloc1 + layout_size1)`. Here's the tricky part. Although thread1 is not aware that the `allocated` value has already been updated by another thread, it already has the proper `old_alloc1`, and `peak` already stores the proper peak value, including thread2's allocation. What is improper here is the assumption that for thread1 `old_alloc1 + layout_size1` is the new allocated value that should be compared with the peak. But do we care? I believe we don't, because thread2 has already updated the peak. I mean, by definition, `old_alloc1 + layout_size1 < old_alloc2 + layout_size2` because `old_alloc2 == old_alloc1 + layout_size1`. So we have no chance of missing the new peak value.
Thus it seems we can avoid a global lock inside `alloc()` and its siblings, but I'm just afraid of missing something. It's a case where ten eyes are much better than two.
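For what it's worth, here is a self-contained sketch of that argument (the names are made up for illustration and this is not the PR's code): with only allocations in flight, the interleaving of `fetch_add` and `fetch_max` can delay the peak update but never lose it.

```rust
use std::sync::atomic::{AtomicUsize, Ordering::SeqCst};
use std::thread;

static ALLOCATED: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

fn track_alloc(size: usize) {
    // `fetch_add` returns the value *before* the addition, like `old_alloc` above.
    let old = ALLOCATED.fetch_add(size, SeqCst);
    // Even if another allocation lands between these two calls, that thread's own
    // `fetch_max` already covers the larger total, so the true peak is never missed.
    PEAK.fetch_max(old + size, SeqCst);
}

fn main() {
    let handles: Vec<_> = (0..4)
        .map(|_| thread::spawn(|| (0..1_000).for_each(|_| track_alloc(16))))
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // With no deallocations, the recorded peak must equal the final total.
    assert_eq!(PEAK.load(SeqCst), ALLOCATED.load(SeqCst));
    assert_eq!(ALLOCATED.load(SeqCst), 4 * 1_000 * 16);
}
```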
Using a spinlock should probably be fine here as the critical section is going to be very short anyway.
Another more complicated alternative that would help with thread contention would be to make this per thread: put the state in thread local storage and have one spinlock per TLS, and only when grabbing a checkpoint lock them all and collate the data. I don't think it's worth it though.
node/wrapper-allocator/src/lib.rs
Outdated
```rust
	#[inline]
	unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
		let old_alloc = ALLOCATOR_DATA.allocated.fetch_add(layout.size(), SeqCst);
```
These two lines of code could be refactored out as a method on `WrapperAllocatorData`.
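Something like the following, perhaps (a hedged sketch; the struct definition and helper name are assumed from the diff snippets above, not copied from the PR):

```rust
use std::sync::atomic::{AtomicUsize, Ordering::SeqCst};

// Field names assumed from the diff snippets above.
struct WrapperAllocatorData {
    allocated: AtomicUsize,
    peak: AtomicUsize,
}

impl WrapperAllocatorData {
    /// Hypothetical helper folding the repeated two lines from `alloc()` and its siblings.
    #[inline]
    fn track_alloc(&self, size: usize) {
        let old_alloc = self.allocated.fetch_add(size, SeqCst);
        self.peak.fetch_max(old_alloc + size, SeqCst);
    }
}
```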
node/core/pvf/worker/src/prepare.rs
Outdated
```
@@ -109,8 +111,22 @@ pub fn worker_entrypoint(socket_path: &str, node_version: Option<&str>) {
	// Spawn another thread for preparation.
	let prepare_fut = rt_handle
		.spawn_blocking(move || {
			#[cfg(feature = "wrapper-allocator")]
			ALLOCATOR_DATA.checkpoint();
```
A few questions:

- This will measure only the dynamically allocated memory; it will fail to monitor any unexpected increase in the stack. Is that what we want?
- How is this different from the below `get_max_rss_thread`, and why isn't that enough?
> How is this different from the below `get_max_rss_thread` and why isn't that enough?
The issue with that is it is not deterministic enough. (It is mainly being used for gathering metrics right now.) Different kernel configurations/versions may manage memory differently (a simple example: some validators may have swap enabled and some not), so they may get different values for the resident memory (how much is actually held in RAM). So some validators may reach the limit and others not.
> This will measure only the dynamically allocated memory, it will fail to monitor any unexpected increase in the stack, is that what we want?
I hope so; in the preparation phase we're not executing any untrusted code, so we just presume that the Wasmtime guys know what they're doing and won't abuse the stack usage. We only worry about malicious Wasm code that could force the compiler to allocate a lot of memory.
> How is this different from the below `get_max_rss_thread` and why isn't that enough?
A lot of concerns here:

- Not supported on every platform
- Allocated memory is not necessarily resident
- Wasmtime may spawn its own threads and we'd lose their allocations in the per-thread accounting
> So some validators may reach the limit and others not.
We don't have disputes for preparation, but what can happen is the attacker gets lucky and gets the PVF through pre-checking without hitting the limits, but then the limits are actually hit when preparing for execution causing no-shows (since we don't dispute on preparation errors). I guess we can have a lower, stricter limit for pre-checking, which we should have anyway.
> Wasmtime may spawn its own threads and we'd lose their allocations in the per-thread accounting
Good point. We can't use `RUSAGE_SELF` instead of `RUSAGE_THREAD` because there's no way to "reset" the max from a previous job.
node/wrapper-allocator/src/lib.rs
Outdated
```rust
		let old_alloc = ALLOCATOR_DATA.allocated.fetch_add(layout.size(), SeqCst);
		ALLOCATOR_DATA.peak.fetch_max(old_alloc + layout.size(), SeqCst);
		self.0.alloc(layout)
```
I think the memory ordering can just be `Relaxed` right now, since the atomics are being used as counters and not to synchronize other memory operations. `Relaxed` still gives you a total ordering between accesses to the same atomic.
Locks would make this slightly more deterministic but it doesn't seem worth the cost. Say you have a situation like thread1 calling `alloc` and getting `old_alloc`, thread2 stepping in and calling `dealloc`, and then thread1 setting the max `peak` to something higher than was actually allocated. You can prevent this with a lock but not totally solve it. There is still some indeterminacy where a worker may run the `dealloc` thread first and not hit the max at all, while other workers would hit it. It's an edge case and I'm not sure how feasible it would be to craft an attack around that.
But I think we decided on using `cgroups` to set the limit, which seems the way to go given the above. We can instead use this wrapper for detecting OOM by setting a flag before allocs and clearing it after.
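A minimal sketch of that flag idea (all names here are made up; note also that a plain process-local static is only illustrative, since in practice the flag would have to live somewhere the supervising process can still read after the worker is OOM-killed, e.g. a shared-memory mapping):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicBool, Ordering};

// Set while an allocation is in flight; if the worker dies with it set,
// the death can be attributed to an allocation attempt.
static IN_ALLOC: AtomicBool = AtomicBool::new(false);

struct OomFlagAllocator;

unsafe impl GlobalAlloc for OomFlagAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        IN_ALLOC.store(true, Ordering::Relaxed);
        let ptr = System.alloc(layout);
        IN_ALLOC.store(false, Ordering::Relaxed);
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}
```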
node/wrapper-allocator/src/lib.rs
Outdated
```rust
		if new_size > layout.size() {
			let old_alloc = ALLOCATOR_DATA.allocated.fetch_add(new_size - layout.size(), SeqCst);
			ALLOCATOR_DATA.peak.fetch_max(old_alloc + new_size - layout.size(), SeqCst);
		} else {
			ALLOCATOR_DATA.allocated.fetch_sub(layout.size() - new_size, SeqCst);
		}
```
The `if` is unnecessary if you make the counters `isize`.
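That is, something along these lines (a sketch with assumed counter names, not the PR's code): with signed counters the grow/shrink branch collapses into a single signed delta.

```rust
use std::sync::atomic::{AtomicIsize, Ordering::SeqCst};

static ALLOCATED: AtomicIsize = AtomicIsize::new(0);
static PEAK: AtomicIsize = AtomicIsize::new(0);

// A shrinking realloc simply contributes a negative delta; no branching needed.
fn track_realloc(old_size: usize, new_size: usize) {
    let delta = new_size as isize - old_size as isize;
    let old = ALLOCATED.fetch_add(delta, SeqCst);
    PEAK.fetch_max(old + delta, SeqCst);
}
```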
Yes, you're right; I'm already thinking about making them `isize` anyway so I can worry less about negative overflows. But I also don't want to even try to update `peak` when it's definitely not necessary.
This pull request has been mentioned on Polkadot Forum. There might be relevant details there: https://forum.polkadot.network/t/ux-of-distributing-multiple-binaries-take-2/2854/1
Take 2 is here. I'm trying a completely different approach. I like the overall idea of what I'm trying to implement here, but I doubt it will make it into production, not in this form at least. Probably it can be improved to make it safer and less ugly (any ideas are appreciated). If not, maybe it's worth giving the spinlock approach proposed by @koute a try.
All in all, having a `Vec` into which we gather the allocation deltas seems kinda unnecessary to me, considering the downsides (an extra allocation is necessary every time we start tracking, the extra latency spike when it'll have to tally the deltas at the end, and the possibility of a panic since the buffer needs to be big enough, or the extra complexity to make it resizable), and as far as I can see the only thing we're gaining is the lack of an exclusive lock during allocations, which otherwise would only have to protect two very cheap operations. Seems like the tradeoff is not really worth it?
Here's the equivalent code using a spinlock:
```rust
use std::sync::atomic::{AtomicBool, AtomicIsize, Ordering};

struct AllocationTracker {
    lock: AtomicBool,
    current: AtomicIsize,
    peak: AtomicIsize,
}

impl AllocationTracker {
    fn add(&self, delta: isize) {
        loop {
            // Try to acquire the lock.
            if self
                .lock
                .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
            {
                break;
            }
            // We failed to acquire the lock; wait until it's unlocked.
            //
            // In theory this should result in less coherency traffic as unlike
            // compare_exchange it is a read-only operation, so multiple cores can
            // execute it simultaneously without taking an exclusive lock over the cache line.
            while self.lock.load(Ordering::Relaxed) {}
        }
        // `fetch_add` returns the previous value, so add the delta to get the new total.
        let current = self.current.fetch_add(delta, Ordering::Relaxed) + delta;
        self.peak.fetch_max(current, Ordering::Relaxed);
        self.lock.store(false, Ordering::Release);
    }
}
```
As you can see it's quite a bit simpler. (:
(With spin-locks there are also some other tricks/variations that one can attempt to speed it up, but it does depend on the specific situation and generally should be benchmarked to see whether it makes a difference. But we don't make that many allocations so they're probably unnecessary.)
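For context, here is one way such a tracker might be wired into a wrapping allocator (a hedged sketch building on the `AllocationTracker` above; the `TrackingAllocator` type and the use of `System` as the inner allocator are assumptions, not the PR's code):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicBool, AtomicIsize};

// Global instance of the tracker sketched above.
static TRACKER: AllocationTracker = AllocationTracker {
    lock: AtomicBool::new(false),
    current: AtomicIsize::new(0),
    peak: AtomicIsize::new(0),
};

struct TrackingAllocator;

unsafe impl GlobalAlloc for TrackingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        TRACKER.add(layout.size() as isize);
        System.alloc(layout)
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        TRACKER.add(-(layout.size() as isize));
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static ALLOCATOR: TrackingAllocator = TrackingAllocator;
```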
node/wrapper-allocator/src/lib.rs
Outdated
```rust
		// Lock allocations, move the allocated vector to our place and start tracking.
		let mut tracking = self.tracking.write().unwrap();
		assert!(!*tracking); // Shouldn't start tracking if already tracking
		self.backlog = backlog;
```
Won't this deadlock if `start_tracking` is called twice? (Assigning a new `backlog` will trigger a deallocation of the old `backlog`.)
Seems like you're right 😕

Nah, the whole thing now seems FUBAR to me. I was thinking about using the nightly `allocator_api` feature and `Vec::with_capacity_in()` to allow side-allocations from the main allocator, but everything I try to implement looks too hacky. I'll give your spinlock approach a try; probably the overhead will be even less than the overhead of allocating and initializing the 800 MB array and then replaying it.
Btw, why didn't you use `std::hint::spin_loop()` inside the `while` in your spinlock example? Was it on purpose?
> Btw, why didn't you use `std::hint::spin_loop()` inside the `while` in your spinlock example? Was it on purpose?

No particular reason; I wrote that code from memory and simply forgot. :P Yeah, you should probably add that. (On AMD64 that will generate a `pause` instruction, which is generally a good idea to use in a spinlock.)
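Concretely, the acquisition loop from the earlier example would then look roughly like this (a sketch; the lock field is assumed from that example):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

fn acquire(lock: &AtomicBool) {
    loop {
        if lock
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
        {
            break;
        }
        // Read-only wait; `spin_loop()` emits a `pause` instruction on x86-64.
        while lock.load(Ordering::Relaxed) {
            std::hint::spin_loop();
        }
    }
}
```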
Looking at your code a bit more... The critical section is under an exclusive lock now; do `current` and `peak` still have to be `Atomic`s? Would we cut off a bit of the overhead by making them simple `isize`s?
> The critical section is under an exclusive lock now; do `current` and `peak` still have to be `Atomic`s? Would we cut off a bit of the overhead by making them simple `isize`s?

AFAIK they don't. I just used them as I didn't feel like using an `UnsafeCell` for the example.
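For completeness, a sketch of that variant (assumptions: the same `AllocationTracker` shape as above, with the counters moved into an `UnsafeCell` whose exclusivity is guaranteed only by the spinlock):

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

struct Counters {
    current: isize,
    peak: isize,
}

struct AllocationTracker {
    lock: AtomicBool,
    inner: UnsafeCell<Counters>,
}

// SAFETY: `inner` is only ever accessed while `lock` is held.
unsafe impl Sync for AllocationTracker {}

impl AllocationTracker {
    fn add(&self, delta: isize) {
        while self
            .lock
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
        // SAFETY: the spinlock above gives us exclusive access.
        let counters = unsafe { &mut *self.inner.get() };
        counters.current += delta;
        counters.peak = counters.peak.max(counters.current);
        self.lock.store(false, Ordering::Release);
    }
}
```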
Repeating some things as my review yesterday got lost. The memory overhead per prepare job right now is too high to be viable, if we really do that many allocations. It is an interesting idea, but it does have the extra complexity of determining what is a safe amount to pre-allocate. Not sure if the latency spike at the end is a problem, we do it at the end of an already blocking operation, but ideally we benchmark the overall start-to-end time and get numbers for the different approaches. The panic can be worked around (just set some bit that we overflowed and return an error in `end_tracking`). Overall the spinlock strategy does seem like the best approach so far.
Take 3: use a spinlock. Looks simple enough not to screw everything up. I wanted to keep …
Looking good, now just need performance numbers. :)
Preparation should do the vast majority of allocations in a worker, and tracking will be turned on for that, so this seems fine?
Can use …
Silly me 🤦 We're counting the current allocation and the peak, not the total of all the allocations, so it cannot overflow (unless we indeed try to allocate more than …
It's a good point about overflows though. It's not safe for a panic to happen in this code. Only possible in debug mode, but still, I would opt for extra caution here.
Compiling the Moonbeam runtime in single-threaded mode, 100 iterations, times are in wallclock milliseconds:
Compiling the Moonbeam runtime in multi-threaded mode, 100 iterations, times are in wallclock milliseconds:
NB: I'm not sure that enabling the …

Predictably, multi-threaded performance has degraded. However, single-threaded doesn't show any significant deviation in the average value but diverges more between min and max. I'll do more experiments to ensure I'm comparing with Jemalloc and not with the …

Benchmarking code:

```rust
for _ in 0..NITER {
    let cpu_time_start = ProcessTime::now();
    let start = Instant::now();
    let _result = prepare_artifact(pvf.clone(), cpu_time_start);
    let elapsed = start.elapsed().as_millis() as u64;
    if elapsed < min {
        min = elapsed;
    }
    if elapsed > max {
        max = elapsed;
    }
    total += elapsed;
}
println!("min {} avg {} max {}", min, total / NITER as u64, max);
```
@s0me0ne-unkn0wn Can you also try the following variations and see how it affects performance in the multithreaded case?
Maybe we could also try coalescing small allocations in TLS and only update the global counters when they reach a certain threshold. As an extreme example, let's say a thread allocates a single byte and then deallocates it, and does this 100 times. Do we actually need to update the counters each time the thread does this, or could we wait, say, until the total delta is at least ~1KB and only then update the counters? (So in this example it'd not touch the global counter at all, since the total delta is at most equal to 1.)
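A rough sketch of that batching idea (all names and the 1 KiB threshold are made up; note also that using `thread_local!` from inside a real global allocator has its own pitfalls, since TLS initialization can itself allocate, so this only illustrates the accounting):

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicIsize, Ordering};

static GLOBAL_CURRENT: AtomicIsize = AtomicIsize::new(0);
const FLUSH_THRESHOLD: isize = 1024; // ~1 KiB, as in the example above

thread_local! {
    // Per-thread pending delta; the shared counter is only touched once the
    // accumulated delta grows past the threshold, so tiny alloc/dealloc churn
    // never causes cross-core traffic.
    static PENDING: Cell<isize> = Cell::new(0);
}

fn track(delta: isize) {
    PENDING.with(|pending| {
        let new = pending.get() + delta;
        if new.abs() >= FLUSH_THRESHOLD {
            GLOBAL_CURRENT.fetch_add(new, Ordering::Relaxed);
            pending.set(0);
        } else {
            pending.set(new);
        }
    });
}
```

The trade-off is that the global counter (and any peak derived from it) can lag behind the true value by up to the threshold times the number of threads.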
The CI pipeline was cancelled due to the failure of one of the required jobs.
It's a PoC. Do not merge.
This is a first, minimalistic attempt to implement a bookkeeping global allocator (#6687) without any inter-thread locks. It works, but I'm unsure whether its behavior is sound, so any comments from anyone who has more experience with atomics and memory ordering than me are highly appreciated.