
Handle OOM gracefully in PVF execution #767

Open
eskimor opened this issue Oct 4, 2022 · 27 comments

@eskimor
Member

eskimor commented Oct 4, 2022

The memory a wasm binary is allowed to use is already limited to my knowledge (1 GiB?), so we should be able to have a PVF execution worker that pre-allocates enough memory on startup so that it will never allocate memory during execution. I am talking about real memory here, not virtual memory. This should be enforceable by filling the memory with random data. This might sound wasteful, but PVF execution is consensus critical, so reserving memory for deterministic execution seems totally acceptable - except, of course, for parallel execution.
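The pre-allocation-and-touch idea can be sketched roughly as follows. This is a toy illustration, not the actual worker code: the `ExecutionArena` name is made up, and the size is kept tiny for the demo (a real worker would reserve on the order of the 1 GiB execution limit). Writing to every page is what forces the kernel to actually back the range with physical frames, defeating lazy allocation / overcommit.

```rust
// Demo size only; a real worker might reserve ~1 GiB (the assumed PVF limit).
const ARENA_SIZE: usize = 64 * 1024;

struct ExecutionArena {
    buf: Vec<u8>,
}

impl ExecutionArena {
    fn new(size: usize) -> Self {
        let mut buf = vec![0u8; size];
        // Touch every page so the kernel commits real physical memory now,
        // rather than lazily on first use during PVF execution.
        let page = 4096;
        for i in (0..buf.len()).step_by(page) {
            buf[i] = 0xAA;
        }
        ExecutionArena { buf }
    }
}

fn main() {
    let arena = ExecutionArena::new(ARENA_SIZE);
    println!("reserved {} bytes", arena.buf.len());
}
```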

Optimistic flex mem execution

To minimize memory usage for parallel workers, I suggest reserving memory for only one worker, which is therefore special. All other parallel workers will allocate memory as needed. When a PVF gets executed, it gets scheduled on any worker just as now, but if it fails due to AmbiguousWorkerDeath we try again, this time guaranteed on the one special pre-allocated worker. Only if it also fails there on the second try do we report Invalid.

Reasoning

  • By allocating max memory on startup, we guarantee that the worker will never allocate due to execution, hence it can not be the trigger of an OOM event.
  • This worker will be long running - started with Polkadot itself and only shutdown when Polkadot is shutdown. Hence it is also less likely to be the target of the OOM killer in case of memory pressure.
  • Because this worker is never allocating memory after startup, re-executing on this worker does not increase memory pressure.
  • If there is too little memory available the node will fail already at startup - so operators are forced to fix it immediately.

With this setup, unjustified disputes due to memory pressure should become extremely unlikely, and trying a second time should also effectively fix other instances of unjustified disputes.

Swap

Related to this, we should ensure memory won't get swapped out. In particular operators should be instructed to not enable swap. If a PVF worker gets swapped out to disk, we might run into timeout issues.

@burdges

burdges commented Oct 4, 2022

Ask around if MADV_WILLNEED helps: https://man7.org/linux/man-pages/man2/madvise.2.html

@sandreim
Contributor

sandreim commented Oct 5, 2022

https://man7.org/linux/man-pages/man2/mlock.2.html is better, but it depends on the RLIMIT_MEMLOCK setting, which might not be set at all, defaulting to the kernel default of 64 KiB. We can dynamically use it if the limit is high enough and advise operators to configure this limit properly to improve performance. MADV_WILLNEED is just a hint for the kernel and might do nothing when memory pressure is higher, but it is worth experimenting with.
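For reference, a minimal sketch of what calling `mlockall` from a worker could look like. It declares the libc symbol directly to stay self-contained (a real implementation would likely use the `libc` crate), and it only reports failure rather than aborting, since the typical failure mode is exactly the low RLIMIT_MEMLOCK default mentioned above:

```rust
// Linux values for the mlockall flags.
const MCL_CURRENT: i32 = 1; // lock all pages currently mapped
const MCL_FUTURE: i32 = 2;  // lock all pages mapped in the future

extern "C" {
    fn mlockall(flags: i32) -> i32;
}

fn lock_all_memory() -> Result<(), std::io::Error> {
    // SAFETY: mlockall takes no pointers and has no memory-safety preconditions.
    if unsafe { mlockall(MCL_CURRENT | MCL_FUTURE) } == 0 {
        Ok(())
    } else {
        Err(std::io::Error::last_os_error())
    }
}

fn main() {
    match lock_all_memory() {
        Ok(()) => println!("memory locked, will not be swapped"),
        // Typical failure: RLIMIT_MEMLOCK too low (often 64 KiB by default)
        // and no CAP_IPC_LOCK.
        Err(e) => println!("mlockall failed: {e}"),
    }
}
```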

@burdges

burdges commented Oct 6, 2022

I'd think mlock would bring far more complexity, like more OOM crashes, etc.

@sandreim
Contributor

sandreim commented Oct 6, 2022

Yeah you are right, I was thinking more about performance rather than robustness.

@mrcnski
Contributor

mrcnski commented Oct 6, 2022

When a PVF gets executed, it gets scheduled on any worker just as of now, but if it fails due to AmbiguousWorkerDeath we try again, this time guaranteed on the one special pre-scheduled worker.

Something to keep in mind is whether this will increase latency for the one "special" worker in charge of retries. I suppose this happens rarely in practice, so it's probably not an issue.

@pepyakin
Contributor

  • By allocating max memory on startup, we guarantee that the worker will never allocate due to execution, hence it can not be the trigger of an OOM event.

Because this worker is never allocating memory after startup, re-executing on this worker does not increase memory pressure.

Well, except that there could be other allocations besides the linear memory over the course of execution. Hence, "guarantee" is too strong a word here.

  • This worker will be long running - started with Polkadot itself and only shutdown when Polkadot is shutdown. Hence it is also less likely to be the target of the OOM killer in case of memory pressure.

TBH, I am not sure about that. My mental model of the OOM killer is that it mostly operates based on memory usage, and it strives to sacrifice children. TBH, I've never bothered to dive into the sources to check how it actually works, though. What are your sources?

Related to this, we should ensure memory won't get swapped out. In particular operators should be instructed to not enable swap. If a PVF worker gets swapped out to disk, we might run into timeout issues.

To my knowledge, this can be dangerous and, counterintuitively, sometimes worsen the overall performance. Can you elaborate on this more? However, maybe we should move this discussion elsewhere since it seems tangential.

@eskimor
Member Author

eskimor commented Oct 31, 2022

Source is this, in particular:

This has been chosen to select a process that is using a large amount of memory but is not that long lived. Processes which have been running a long time are unlikely to be the cause of memory shortage so this calculation is likely to select a process that uses a lot of memory but has not been running long. If the process is a root process or has CAP_SYS_ADMIN capabilities, the points are divided by four as it is assumed that root privilege processes are well behaved. Similarly, if it has CAP_SYS_RAWIO capabilities (access to raw devices) privileges, the points are further divided by 4 as it is undesirable to kill a process that has direct access to hardware.

About swap: interested in hearing arguments here. For PVF execution, especially if we time out based on wall clock time, swapping is to be avoided at all cost - otherwise the required time can easily skyrocket, and it is better to just crash in that case. Even if we use CPU time, we should not swap: then we would not necessarily time out, but we would run into a situation where wall clock time is way longer than CPU time, which is also to be avoided if we want to time out based on CPU time only.

@eskimor
Member Author

eskimor commented Oct 31, 2022

The problem with disabling swap is that it does not fully solve the problem, as far as I know, because there is a second mechanism in place: the kernel simply dropping things from memory that it already has on disk, like the code segment of a process. So yes, in some load scenarios, swapping out clearly unused memory will lead to better performance than forcing the kernel to evict code sections of running processes. Buuuut, we will also OOM faster without swap - the time the system runs with really bad performance is significantly shortened. With swap, the system can keep running forever (if you have lots of swap) but will be barely usable - and by "barely usable" I mean "not usable": it works at a speed that for all practical purposes means it is not working. For consensus it is better to just crash before running into that situation.

@pepyakin
Contributor

Source is this, in particular:

I think this one is outdated. I was able to find this saying it all was greatly simplified. It is old, but seems to be in line with the current source code (link).

Yes, I agree with your thinking regarding the swap. Thanks for the elaboration.

I am worried about the second-order effects. Specifically, I am unsure how well our software behaves in the "let it crash" approach. One thing that crossed my mind is that we are clearing the PVF artifact compilation cache after restarting. But it is also a gut feeling since we don't do much testing to verify that nodes behave well.

I don't have a good suggestion here though.

@mrcnski
Contributor

mrcnski commented Nov 1, 2022

One thing that's not clear to me is, would we want to disable swap on just the special worker or all workers? If the latter, should it perhaps be a separate issue independent of other changes?

@eskimor
Member Author

eskimor commented Nov 1, 2022

Can we actually disable swap per process? I was thinking that the node operator would not enable it at all. If we can specifically mark PVF execution workers as: "Don't swap (in any way)" that would actually be the best solution.

@eskimor
Member Author

eskimor commented Nov 1, 2022

Wow, from 2010 and it is more current than "the docs" ... and then people complain that our docs are outdated. Anyhow, thanks @pepyakin for checking it out.

@vstakhov
Contributor

vstakhov commented Nov 1, 2022

Can we actually disable swap per process?

mlock and in particular mlockall are designed for that purpose, to the best of my knowledge.

@sandreim
Contributor

sandreim commented Nov 1, 2022

That’s exactly my initial suggestion but it might be a bad idea to do this when there is memory pressure.

@vstakhov
Contributor

vstakhov commented Nov 1, 2022

We can limit data size for a process I suppose.

@pepyakin
Contributor

pepyakin commented Nov 7, 2022

Here are several pointers.

First of all, rlimit.

First, there is RLIMIT_AS, which limits the number of VM pages a process can reserve. If the process tries to get more pages, it will most likely abort (e.g. here, inside of the Rust allocator, or in the long tail of other places, although it might not abort). OTOH, that is a normal side effect of limiting the amount of memory. Also, if you limit swap and have overcommit, then each memory access can lead to a signal; keep that in mind.
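A minimal sketch of what applying `RLIMIT_AS` to a worker could look like, with the libc symbol declared directly to stay self-contained. The constants are the Linux x86_64 values, and the 8 GiB cap is an illustrative assumption, not a recommendation:

```rust
// Mirror of Linux's `struct rlimit` on 64-bit targets.
#[repr(C)]
struct Rlimit {
    rlim_cur: u64, // soft limit
    rlim_max: u64, // hard limit
}

const RLIMIT_AS: i32 = 9; // Linux: limit on total virtual address space

extern "C" {
    fn setrlimit(resource: i32, rlim: *const Rlimit) -> i32;
}

fn cap_address_space(bytes: u64) -> Result<(), std::io::Error> {
    let lim = Rlimit { rlim_cur: bytes, rlim_max: bytes };
    // SAFETY: we pass a valid pointer to a properly initialized rlimit struct.
    if unsafe { setrlimit(RLIMIT_AS, &lim) } == 0 {
        Ok(())
    } else {
        Err(std::io::Error::last_os_error())
    }
}

fn main() {
    // Illustrative 8 GiB cap: further allocations beyond it fail with ENOMEM
    // inside the worker instead of building up system-wide memory pressure.
    match cap_address_space(8 * 1024 * 1024 * 1024) {
        Ok(()) => println!("address space capped"),
        Err(e) => println!("setrlimit failed: {e}"),
    }
}
```

Note that once the cap is hit, the abort happens inside the worker (e.g. in the allocator), which is exactly the kind of death the supervising node process can then classify.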

Then, there is RLIMIT_RSS which is handy, but unfortunately, it does not work.

Hence, perhaps, it makes sense to look into cgroups. It allows a more fine-grained approach, but it is also harder to work with. Several caveats from the top of my head:

  1. There are two versions of cgroups, and it's not impossible that we have to support both.
  2. AFAIU, cgroups is an admin tool, not an API. IIRC, it is really not that great to create cgroups from the app; it's more idiomatic if the app receives the cgroups it has to use.
  3. IIRC, even if the kernel supports it, cgroups can be turned off. The mount points and other similar things could be changed by the operator.
  4. The OOM reaper can kill entire cgroups, which might come in handy, but only on fresh-ish kernels¹.
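Assuming cgroup v2 and an operator-provided cgroup (per caveat 2 above, the node would only write into a path it was handed, not create the hierarchy itself), limiting a worker's memory is a plain file write to `memory.max`. The path below is hypothetical:

```rust
use std::fs;
use std::path::Path;

// Hedged sketch: apply a cgroup v2 memory limit by writing to `memory.max`
// inside an operator-created cgroup directory.
fn set_memory_max(cgroup: &Path, bytes: u64) -> std::io::Result<()> {
    fs::write(cgroup.join("memory.max"), bytes.to_string())
}

fn main() {
    // Assumed, operator-provided cgroup path; will not exist on most systems.
    let cgroup = Path::new("/sys/fs/cgroup/pvf-worker");
    match set_memory_max(cgroup, 2 * 1024 * 1024 * 1024) {
        Ok(()) => println!("limit applied"),
        // e.g. cgroup absent, cgroup v1 hierarchy, or insufficient permissions
        Err(e) => println!("could not set limit: {e}"),
    }
}
```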

Footnotes

  1. We do not have a policy on this (@eskimor, wanna spin up a discussion in a separate thread/issue?), but I was aligning with Debian. They are mainstream and conservative. See here

@burdges

burdges commented Nov 7, 2022

You're worried the OOM error cannot be distinguished from another error type? If so, your allocator could flag that it's doing an allocation somewhere, and then remove the flag afterwards. I'm kinda surprised if OOM errors are not distinguished somehow though.

@mrcnski
Contributor

mrcnski commented May 3, 2023

This should be prioritized after the sandboxing work since we have seen OOMs in the wild. See paritytech/polkadot#7155 (comment).

Since we are restricted to Linux now for security (see #881), we can go ahead and explore using cgroups.

If so, your allocator could flag that it's doing an allocation somewhere, and then remove the flag afterwards.

We can explore this in conjunction with a global allocator wrapper.
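A rough sketch of such a wrapper around the system allocator, following burdges' flag idea. In a real design the flag would have to live somewhere the supervising process can inspect after the worker dies (e.g. a shared memory mapping), which this toy version omits:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicBool, Ordering};

// Set while an allocation is in flight. If the worker is SIGKILLed by the
// OOM killer mid-allocation, a supervisor that can read this flag (via
// shared memory in a real design) could attribute the death to OOM.
static IN_ALLOCATION: AtomicBool = AtomicBool::new(false);

struct FlaggingAlloc;

unsafe impl GlobalAlloc for FlaggingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        IN_ALLOCATION.store(true, Ordering::SeqCst);
        let ptr = System.alloc(layout);
        IN_ALLOCATION.store(false, Ordering::SeqCst);
        ptr
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout);
    }
}

#[global_allocator]
static ALLOC: FlaggingAlloc = FlaggingAlloc;

fn main() {
    let v: Vec<u8> = Vec::with_capacity(1024); // routed through FlaggingAlloc
    println!(
        "allocated {} bytes, flag now {}",
        v.capacity(),
        IN_ALLOCATION.load(Ordering::SeqCst)
    );
}
```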

@mrcnski
Contributor

mrcnski commented May 15, 2023

Buuuut, we will also oom faster without swap - like the time the system is running with really bad performance is significnatly shortened. With swap the system can keep running forever (if you have lots of swap), but will be barely usable ... and with barely usable I mean, "not usable" it just works at a speed that for all practical purposes you can say it is not working - for consensus it is better to just crash, before running into that situation.

I am worried about the second-order effects. Specifically, I am unsure how well our software behaves in the "let it crash" approach. One thing that crossed my mind is that we are clearing the PVF artifact compilation cache after restarting. But it is also a gut feeling since we don't do much testing to verify that nodes behave well.

Would the system really crash, or would the OOM killer start sacrificing individual processes? And it could really be any process; we don't know what the behavior will be unless we test it. One scenario is that polkadot may keep going but be nonfunctional - e.g. worker processes may get repeatedly killed, leading to timeouts and no-shows.

But as a thought exercise, one would think the kernel would first go after processes that use significant memory, but which haven't touched it in a while or are otherwise deemed unimportant. So we could instantiate a canary process with the lowest priority, that reserves some non-trivial amount of memory, locks it, and then goes to sleep. If polkadot detects that this process died, then we can safely initiate a shutdown including informing operators what happened.

(I'm more comfortable with this kind of solution now that we are officially restricting to Linux so we don't have to implement/test for other platforms.)

Edit: after reading the source linked above, the kernel mainly tries to find the process using the most memory which is not OOM-killer-disabled. Its exact strategy is described here. We can use the utility at that link to disable the OOM killer for polkadot and all child processes, and then re-enable it for the child canary process.
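A sketch of adjusting `/proc/<pid>/oom_score_adj` along those lines. The -1000/1000 endpoints are the documented "never select" / "prefer as victim" values; note that lowering a process's score below its current value requires CAP_SYS_RESOURCE, while raising it is unprivileged:

```rust
use std::fs;

// Hedged sketch: -1000 disables OOM-killer selection for the process,
// 1000 makes it the preferred victim (the canary).
fn set_oom_score_adj(pid: &str, score: i32) -> std::io::Result<()> {
    fs::write(format!("/proc/{pid}/oom_score_adj"), score.to_string())
}

fn main() {
    // Protect the main node process (needs CAP_SYS_RESOURCE to lower)...
    if let Err(e) = set_oom_score_adj("self", -1000) {
        println!("could not adjust own score: {e}");
    }
    // ...while a canary child would set its own score to 1000 after forking,
    // so the kernel kills it first under memory pressure.
}
```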

@eskimor
Member Author

eskimor commented May 16, 2023

I would not disable the OOM killer for Polkadot itself - that would make things worse. But if we can disable it for workers (which have an upper bound on memory) that would be perfect!

@sandreim
Contributor

This all sounds like a lot of complexity. I would go for a simpler approach like monitoring memory pressure on the system and handling that gracefully. Unless there is a mem leak in our code it should be the operator who has to ensure that the system has enough memory to run a validator node.

@mrcnski
Contributor

mrcnski commented May 16, 2023

I would not disable the OOM killer for Polkadot itself - that would make things worse. But if we can disable it for workers (which have an upper bound on memory) that would be perfect!

In this case, assuming there are no other major processes running, the OOM killer would nuke polkadot, which wouldn't allow us to do any graceful exit (it sends a SIGKILL) - which may be fine?

I would go for a simpler approach like monitoring memory pressure on the system and handling that gracefully.

That sounds good, can you elaborate on how you would do that? Do you mean from within polkadot?

Thinking more about what we are trying to achieve, there are two separate problems both relating to determinism:

  1. Under memory pressure we may have lots of swap and thrashing going on, leading to workers timing out and spuriously voting against candidates.
  2. If we run out of swap memory we may also have workers repeatedly getting killed and the node spuriously voting against candidates.

Disabling swap on the workers can help with 1.¹ An OOM flag can help with 2. "Safely" crashing when memory is low² seems to address both 1 and 2. It might actually simplify things in a sense, because we wouldn't have to think about memory anymore in discussions about determinism, i.e. there would be one less factor to consider. But maybe it's too extreme?

Unless there is a mem leak in our code it should be the operator who has to ensure that the system has enough memory to run a validator node.

Wondering out loud, would aborting polkadot when memory went way up have helped in any recent incidents?

Footnotes

  1. Though as mentioned above, this may have unintended effects, e.g. I'm worried it may actually make thrashing and system performance worse.

  2. We can maybe do this by monitoring system memory like you said.

@sandreim
Contributor

I would go for a simpler approach like monitoring memory pressure on the system and handling that gracefully.

That sounds good, can you elaborate on how you would do that? Do you mean from within polkadot?

The Polkadot process can make use of cgroups to detect memory pressure.
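One concrete mechanism for this is the kernel's PSI (pressure stall information) interface: cgroup v2 exposes a per-cgroup `memory.pressure` file, and `/proc/pressure/memory` gives the system-wide view (Linux 4.20+ with CONFIG_PSI), both in the same text format. A small sketch; the 10.0 threshold is an illustrative assumption:

```rust
use std::fs;

// Parse the "some avg10=..." figure: the share of the last 10 seconds in
// which at least one task stalled waiting on memory.
fn parse_avg10(psi: &str) -> Option<f64> {
    // First line looks like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
    let line = psi.lines().next()?;
    let field = line.split_whitespace().find(|f| f.starts_with("avg10="))?;
    field["avg10=".len()..].parse().ok()
}

fn memory_pressure_avg10() -> Option<f64> {
    parse_avg10(&fs::read_to_string("/proc/pressure/memory").ok()?)
}

fn main() {
    match memory_pressure_avg10() {
        // Illustrative threshold for "bail out gracefully".
        Some(p) if p > 10.0 => println!("high memory pressure: {p}"),
        Some(p) => println!("memory pressure ok: {p}"),
        None => println!("PSI not available on this system"),
    }
}
```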

Thinking more about what we are trying to achieve, there are two separate problems both relating to determinism:

  1. Under memory pressure we may have lots of swap and thrashing going on, leading to workers timing out and spuriously voting against candidates.
  2. If we run out of swap memory we may also have workers repeatedly getting killed and the node spuriously voting against candidates.

Disabling swap on the workers can help with 1.¹ An OOM flag can help with 2. "Safely" crashing when memory is low² seems to address both 1 and 2. It might actually simplify things in a sense, because we wouldn't have to think about memory anymore in discussions about determinism, i.e. there would be one less factor to consider. But maybe it's too extreme?

Unless there is a mem leak in our code it should be the operator who has to ensure that the system has enough memory to run a validator node.

Wondering out loud, would aborting polkadot when memory went way up have helped in any recent incidents?

I like this question, which kind of returns back to the initial issue comments. AFAIK validator nodes are less pressured on memory than on CPU, network or disk. I have seen some instances of OOM killing happening, due maybe to leaks in our code. Before the OOM killer goes in and does its thing, it is clear that swap usage (if not disabled) will degrade performance. If there are additional processes running on the system which also eat a lot of memory, we can only protect ourselves by bailing out gracefully, which probably is better than getting slashed for against-valid (not implemented now) and loading the system with disputes.

@alexggh
Contributor

alexggh commented May 17, 2023

Just an idea: instead of trying to prevent OOMs, which is a hard problem without controlling the full environment (kernel configuration, OS, other processes running on the system), have we explored the idea of a slim supervisor process that monitors our beefy executables and, when an OOM happens, just gracefully exits the network?

@eskimor
Member Author

eskimor commented May 19, 2023

My impression would be that trying to prevent OOM by doing something before it happens is racy. If memory growth happens too quickly, we might be too slow to react and the OOM happens anyway. Also, this seems hard to configure in a way that will not backfire.

Having Polkadot killed by SIGKILL (or similar) is not ideal, and we should make sure things like database corruption cannot happen due to this, or if they can, we should at least be able to detect them. Other than that, it is better to kill it than to leave it in a state where it no longer functions correctly.

@sandreim
Contributor

Indeed it will be racy, but unfortunately it isn't possible to make it perfect. I don't think we should worry that much about the OOM killer, since it will usually target the largest consumer of memory, which is not the PVF workers but the Polkadot binary itself. So if the growth happens too quickly, we'll quickly be killed, which solves the problem of not functioning optimally.

If we gracefully exit the PVF workers on memory pressure we should also notify the Polkadot process and it can decide based on the frequency of these events to exit, such that intermittent spikes in mem usage would not backfire.

Anyway, we should experiment with this and collect some data before we enable the functionality.

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@mrcnski
Contributor

mrcnski commented Aug 31, 2023

https://lwn.net/Articles/941614/

Out-of-memory victim selection with BPF

It seems in the future we may be able to control the OOM killer with BPF.

claravanstaden added a commit to Snowfork/polkadot-sdk that referenced this issue Dec 8, 2023