Handle OOM gracefully in PVF execution #767

Open

eskimor opened this issue Oct 4, 2022 · 27 comments

Comments

@eskimor
Member

eskimor commented Oct 4, 2022

The memory a wasm binary is allowed to use is already limited to my knowledge (1 GiB?), so we should be able to have a PVF execution worker that pre-allocates enough memory on startup and therefore never allocates memory during execution. I am talking about real memory here, not virtual memory. This should be enforceable by filling the memory with random data. This might sound wasteful, but PVF execution is consensus critical, so reserving memory for deterministic execution seems totally acceptable. Except, of course, for parallel execution.
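
For illustration, a minimal sketch of that pre-allocation step, assuming a hypothetical 1 GiB budget (this is not the actual worker code; touching every page forces physical backing, whether the fill pattern is random or fixed):

```rust
// Hypothetical sketch: reserve the worker's memory budget up front and touch
// every page so the kernel actually backs it with physical frames (merely
// allocating would only reserve virtual address space).
const EXECUTION_MEMORY_LIMIT: usize = 1 << 30; // assumed 1 GiB budget
const PAGE_SIZE: usize = 4096;

fn preallocate_execution_memory() -> Vec<u8> {
    let mut buf = vec![0u8; EXECUTION_MEMORY_LIMIT];
    // Touch every page so it is actually faulted in and backed by a physical
    // frame; a non-zero pattern also avoids same-filled-page merging.
    for page in buf.chunks_mut(PAGE_SIZE) {
        page[0] = 0xAA;
    }
    buf
}

fn main() {
    let _reserved = preallocate_execution_memory();
    // ... the worker would now serve execution requests out of this budget ...
}
```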

Optimistic flex mem execution

To minimize memory usage for parallel workers, I am suggesting to only reserve memory for one worker, which is therefore special. All other parallel workers will allocate memory as needed. When a PVF gets executed, it gets scheduled on any worker just as it is now, but if it fails due to AmbiguousWorkerDeath we try again, this time guaranteed on the one special pre-scheduled worker. Only if it also fails there on the second try do we report Invalid.

Reasoning

  • By allocating max memory on startup, we guarantee that the worker will never allocate due to execution, hence it can not be the trigger of an OOM event.
  • This worker will be long running - started with Polkadot itself and only shut down when Polkadot is shut down. Hence it is also less likely to be the target of the OOM killer in case of memory pressure.
  • Because this worker is never allocating memory after startup, re-executing on this worker does not increase memory pressure.
  • If there is too little memory available, the node will already fail at startup - so operators are forced to fix it immediately.

With this setup unjustified disputes due to memory pressure should become extremely unlikely and trying a second time should also effectively fix other instances of unjustified disputes.
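
As a rough illustration of the proposed two-attempt flow, a sketch follows; the function and error names are hypothetical, not the real PVF host API:

```rust
// Hypothetical sketch of the proposed retry flow; names are illustrative and
// do not correspond to the actual PVF host implementation.
#[allow(dead_code)]
enum ExecError {
    // The worker process died mid-execution (possibly OOM-killed).
    AmbiguousWorkerDeath,
    // The candidate was deemed invalid by the PVF itself.
    Invalid,
}

/// Run on any regular worker first; if that worker dies ambiguously, retry
/// exactly once on the reserved worker whose memory is pre-allocated.
fn execute_with_retry(
    run_on_any_worker: impl Fn() -> Result<(), ExecError>,
    run_on_reserved_worker: impl Fn() -> Result<(), ExecError>,
) -> Result<(), ExecError> {
    match run_on_any_worker() {
        Err(ExecError::AmbiguousWorkerDeath) => run_on_reserved_worker(),
        other => other,
    }
}

fn main() {
    // Toy usage: the first attempt "dies", the reserved worker succeeds.
    let result = execute_with_retry(
        || Err(ExecError::AmbiguousWorkerDeath),
        || Ok(()),
    );
    assert!(result.is_ok());
}
```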

Swap

Related to this, we should ensure memory won't get swapped out. In particular operators should be instructed to not enable swap. If a PVF worker gets swapped out to disk, we might run into timeout issues.

@burdges

burdges commented Oct 4, 2022

Ask around if MADV_WILLNEED helps: https://man7.org/linux/man-pages/man2/madvise.2.html
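
For reference, a minimal sketch of issuing that hint from Rust via the libc crate (the wrapper is illustrative, not existing code; the hint is advisory only):

```rust
// Sketch (requires the `libc` crate): hint that a region will be needed soon
// so the kernel pre-faults it. Advisory only; under memory pressure it may do
// nothing. `addr` must be page-aligned, which the caller has to guarantee
// (e.g. by mmap-ing the region).
unsafe fn advise_willneed(addr: *mut libc::c_void, len: usize) -> std::io::Result<()> {
    if libc::madvise(addr, len, libc::MADV_WILLNEED) != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}
```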

@sandreim
Contributor

sandreim commented Oct 5, 2022

https://man7.org/linux/man-pages/man2/mlock.2.html is better, but it depends on the RLIMIT_MEMLOCK setting, which might not be set at all and falls back to the kernel default (64 KiB?). We can dynamically use it if the limit is high enough and advise operators to configure this limit properly to improve performance. MADV_WILLNEED is just a hint for the kernel and might do nothing when memory pressure is high, but it is worth experimenting with.
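
A sketch of that dynamic approach, assuming the libc crate (illustrative only): check RLIMIT_MEMLOCK first and only mlock the pre-allocated region if the limit allows it.

```rust
// Sketch: only attempt mlock() if RLIMIT_MEMLOCK is large enough for the
// region; otherwise fall back to not locking and warn the operator.
fn try_mlock(buf: &[u8]) -> std::io::Result<bool> {
    let mut lim = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
    if unsafe { libc::getrlimit(libc::RLIMIT_MEMLOCK, &mut lim) } != 0 {
        return Err(std::io::Error::last_os_error());
    }
    if (lim.rlim_cur as u64) < buf.len() as u64 {
        // Limit too small: do not lock; advise the operator to raise it.
        return Ok(false);
    }
    if unsafe { libc::mlock(buf.as_ptr() as *const libc::c_void, buf.len()) } != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(true)
}
```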

@burdges

burdges commented Oct 6, 2022

I'd think mlock would bring far more complexity, like more OOM crashes, etc.

@sandreim
Contributor

sandreim commented Oct 6, 2022

Yeah you are right, I was thinking more about performance rather than robustness.

@mrcnski
Contributor

mrcnski commented Oct 6, 2022

When a PVF gets executed, it gets scheduled on any worker just as of now, but if it fails due to AmbiguousWorkerDeath we try again, this time guaranteed on the one special pre-scheduled worker.

Something to keep in mind is whether this will increase latency for the one "special" worker in charge of retries. I suppose this happens rarely in practice, so it's probably not an issue.

@pepyakin
Contributor

  • By allocating max memory on startup, we guarantee that the worker will never allocate due to execution, hence it can not be the trigger of an OOM event.

Because this worker is never allocating memory after startup, re-executing on this worker does not increase memory pressure.

Well, except that there could be other allocations besides the linear memory over the course of execution. Hence, "guarantee" is too strong a word here.

  • This worker will be long running - started with Polkadot itself and only shut down when Polkadot is shut down. Hence it is also less likely to be the target of the OOM killer in case of memory pressure.

TBH, I am not sure about that. My mental model of the OOM killer is that it mostly operates based on memory usage, and it strives to sacrifice children. I've never bothered to dive into the sources to check how it actually works though. What are your sources?

Related to this, we should ensure memory won't get swapped out. In particular operators should be instructed to not enable swap. If a PVF worker gets swapped out to disk, we might run into timeout issues.

To my knowledge, this can be dangerous and, counterintuitively, sometimes worsen the overall performance. Can you elaborate on this more? However, maybe we should move this discussion elsewhere since it seems tangential.

@eskimor
Member Author

eskimor commented Oct 31, 2022

Source is this, in particular:

This has been chosen to select a process that is using a large amount of memory but is not that long lived. Processes which have been running a long time are unlikely to be the cause of memory shortage so this calculation is likely to select a process that uses a lot of memory but has not been running long. If the process is a root process or has CAP_SYS_ADMIN capabilities, the points are divided by four as it is assumed that root privilege processes are well behaved. Similarly, if it has CAP_SYS_RAWIO capabilities (access to raw devices) privileges, the points are further divided by 4 as it is undesirable to kill a process that has direct access to hardware.

About swap: I am interested in hearing arguments here. For PVF execution, especially if we time out based on wall clock time, swapping is to be avoided at all cost - otherwise the required time can easily skyrocket, and it is better to just crash in that case. Even if we use CPU time, we should not swap, because then it would not necessarily time out, but we would run into a situation where wall clock time is way longer than CPU time, which is also to be avoided if we only want to time out based on CPU time.

@eskimor
Member Author

eskimor commented Oct 31, 2022

The problem with disabling swap is that it does not fully solve the problem, as far as I know, because there is a second mechanism in place: the kernel just dropping things from memory that it already has on disk as well, like the code segment of a process. So yes, in some load scenarios swapping out clearly unused memory will lead to better performance than forcing the kernel to remove code sections from running processes. Buuuut, we will also OOM faster without swap - the time the system is running with really bad performance is significantly shortened. With swap the system can keep running forever (if you have lots of swap), but it will be barely usable... and by barely usable I mean "not usable": it works at a speed that for all practical purposes you can say it is not working. For consensus it is better to just crash before running into that situation.

@pepyakin
Contributor

Source is this, in particular:

I think this one is outdated. I was able to find this saying it all was greatly simplified. It is old, but seems to be in line with the current source code (link).

Yes, I agree with your thinking regarding the swap. Thanks for the elaboration.

I am worried about the second-order effects. Specifically, I am unsure how well our software behaves in the "let it crash" approach. One thing that crossed my mind is that we are clearing the PVF artifact compilation cache after restarting. But it is also a gut feeling since we don't do much testing to verify that nodes behave well.

I don't have a good suggestion here though.

@mrcnski
Contributor

mrcnski commented Nov 1, 2022

One thing that's not clear to me is, would we want to disable swap on just the special worker or all workers? If the latter, should it perhaps be a separate issue independent of other changes?

@eskimor
Member Author

eskimor commented Nov 1, 2022

Can we actually disable swap per process? I was thinking that the node operator would not enable it at all. If we can specifically mark PVF execution workers as: "Don't swap (in any way)" that would actually be the best solution.

@eskimor
Member Author

eskimor commented Nov 1, 2022

Wow, from 2010 and it is more current than "the docs"... and then people complain that our docs are outdated. Anyhow, thanks @pepyakin for checking it out.

@vstakhov
Contributor

vstakhov commented Nov 1, 2022

Can we actually disable swap per process?

mlock and in particular mlockall are designed for that purpose, to the best of my knowledge.
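
For illustration, a minimal mlockall sketch via the libc crate (not existing node code), with the caveat raised earlier that locking everything can make behaviour under memory pressure worse:

```rust
// Sketch: pin all current and future pages of this process in RAM so they
// cannot be swapped out. Subject to RLIMIT_MEMLOCK / CAP_IPC_LOCK, and it may
// worsen OOM behaviour under memory pressure, as discussed in this thread.
fn lock_all_memory() -> std::io::Result<()> {
    if unsafe { libc::mlockall(libc::MCL_CURRENT | libc::MCL_FUTURE) } != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}
```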

@sandreim
Contributor

sandreim commented Nov 1, 2022

That’s exactly my initial suggestion but it might be a bad idea to do this when there is memory pressure.

@vstakhov
Contributor

vstakhov commented Nov 1, 2022

We can limit data size for a process I suppose.

@pepyakin
Contributor

pepyakin commented Nov 7, 2022

Here are several pointers.

First of all, rlimit.

First, there is RLIMIT_AS, which limits the amount of virtual memory a process can reserve. If the process tries to get more pages it will most likely abort (e.g. here, inside the Rust allocator, or in the long tail of other places, although it might not abort). OTOH, that is a normal side effect of limiting the amount of memory. Also keep in mind that if you limit swap and have overcommit, then each memory access can lead to a signal.
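
A minimal sketch of what applying such a limit in a worker could look like (libc crate; the value passed in is a placeholder, nothing here is the actual implementation):

```rust
// Sketch: cap the worker's virtual address space with RLIMIT_AS before
// executing the PVF. Once the limit is hit, further allocations fail and the
// Rust allocator typically aborts the process.
fn limit_address_space(bytes: u64) -> std::io::Result<()> {
    let lim = libc::rlimit { rlim_cur: bytes, rlim_max: bytes };
    if unsafe { libc::setrlimit(libc::RLIMIT_AS, &lim) } != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}

// e.g. limit_address_space(2 * 1024 * 1024 * 1024)?; // assumed 2 GiB cap
```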

Then there is RLIMIT_RSS, which would be handy, but unfortunately it does not work (it has no effect on modern Linux kernels).

Hence, perhaps, it makes sense to look into cgroups. It allows a more fine-grained approach, but it is also harder to work with. Several caveats from the top of my head:

  1. There are two versions of cgroups, and it's not impossible that we have to support both.
  2. AFAIU, cgroups is an admin tool, not an API. IIRC, it is really not that great to create cgroups from the app; it's more idiomatic if the app receives the cgroups it has to use.
  3. IIRC, even if the kernel supports it, cgroups can be turned off. The mount points and other similar things could be changed by the operator.
  4. The OOM reaper can kill entire cgroups, which might come in handy, but only on fresh-ish kernels¹.

Footnotes

  1. We do not have a policy on this (@eskimor wanna spin up a discussion in a separate thread/issue?), but I was aligning with Debian. They are mainstream and conservative. See here
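
To make the cgroups option concrete, a rough sketch of a cgroup v2 setup for a worker, assuming the node has been delegated a writable cgroup subtree (all paths and names here are illustrative, not an existing interface):

```rust
use std::fs;
use std::path::Path;

// Rough sketch (cgroup v2 only): create a child cgroup, set a hard memory
// limit, disable swap for it, and move a worker pid into it. In practice the
// cgroup would more idiomatically be handed to us by the operator/service
// manager rather than created by the node itself (see caveat 2 above).
fn put_worker_in_cgroup(base: &Path, worker_pid: u32, limit_bytes: u64) -> std::io::Result<()> {
    let cg = base.join("pvf-worker");
    fs::create_dir_all(&cg)?;
    // Hard limit: the kernel OOM-kills within this cgroup once it is exceeded.
    fs::write(cg.join("memory.max"), limit_bytes.to_string())?;
    // Disable swap for this cgroup, per the discussion above.
    fs::write(cg.join("memory.swap.max"), "0")?;
    // Move the worker into the cgroup.
    fs::write(cg.join("cgroup.procs"), worker_pid.to_string())?;
    Ok(())
}
```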

@burdges

burdges commented Nov 7, 2022

You're worried the OOM error cannot be distinguished from another error type? If so, your allocator could flag that it's doing an allocation somewhere, and then remove the flag afterwards. I'm kinda surprised if OOM errors are not distinguished somehow though.

@mrcnski
Contributor

mrcnski commented May 3, 2023

This should be prioritized after the sandboxing work since we have seen OOMs in the wild. See paritytech/polkadot#7155 (comment).

Since we are restricted to Linux now for security (see #881), we can go ahead and explore using cgroups.

If so, your allocator could flag that it's doing an allocation somewhere, and then remove the flag afterwards.

We can explore this in conjunction with a global allocator wrapper.
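
As a sketch of that idea (illustrative only, not the actual polkadot allocator): a wrapper around the system allocator that raises a flag while an allocation is in flight, so a death during allocation can be attributed to OOM rather than to some other fault.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicBool, Ordering};

// A single global flag is a simplification: a real version would need to
// handle concurrent allocations (e.g. a counter) and expose the flag somewhere
// a supervising process can observe it, such as shared memory.
static ALLOCATING: AtomicBool = AtomicBool::new(false);

struct FlaggingAllocator;

unsafe impl GlobalAlloc for FlaggingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATING.store(true, Ordering::SeqCst);
        let ptr = unsafe { System.alloc(layout) };
        ALLOCATING.store(false, Ordering::SeqCst);
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: FlaggingAllocator = FlaggingAllocator;
```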

@mrcnski
Contributor

mrcnski commented May 15, 2023

Buuuut, we will also OOM faster without swap - the time the system is running with really bad performance is significantly shortened. With swap the system can keep running forever (if you have lots of swap), but it will be barely usable... and by barely usable I mean "not usable": it works at a speed that for all practical purposes you can say it is not working. For consensus it is better to just crash before running into that situation.

I am worried about the second-order effects. Specifically, I am unsure how well our software behaves in the "let it crash" approach. One thing that crossed my mind is that we are clearing the PVF artifact compilation cache after restarting. But it is also a gut feeling since we don't do much testing to verify that nodes behave well.

Would the system really crash, or would the OOM killer start sacrificing individual processes? And it could really be any process; we don't know what the behavior will be unless we test it. One scenario is that polkadot may keep going but be nonfunctional - e.g. worker processes may get repeatedly killed, leading to timeouts and no-shows.

But as a thought exercise, one would think the kernel would first go after processes that use significant memory, but which haven't touched it in a while or are otherwise deemed unimportant. So we could instantiate a canary process with the lowest priority, that reserves some non-trivial amount of memory, locks it, and then goes to sleep. If polkadot detects that this process died, then we can safely initiate a shutdown including informing operators what happened.

(I'm more comfortable with this kind of solution now that we are officially restricting to Linux so we don't have to implement/test for other platforms.)

Edit: after reading the source linked above, the kernel mainly tries to find the process using the most memory which is not OOM-killer-disabled. Its exact strategy is described here. We can use the utility at that link to disable the OOM killer for polkadot and all child processes, and then re-enable it for the child canary process.
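
For reference, a minimal sketch of tweaking OOM-killer preference through the /proc interface (illustrative only; this just shows the kernel knob, not what the linked utility does internally):

```rust
use std::fs;

// Sketch: adjust OOM-killer preference via /proc/<pid>/oom_score_adj.
// -1000 effectively disables OOM killing for the process, +1000 makes it the
// preferred victim (the canary). Lowering the score normally requires
// CAP_SYS_RESOURCE; raising it does not. The value is inherited across fork,
// so child workers pick up the parent's setting unless changed.
fn set_oom_score_adj(pid: u32, score: i32) -> std::io::Result<()> {
    fs::write(format!("/proc/{pid}/oom_score_adj"), score.to_string())
}

// e.g. in the canary child: set_oom_score_adj(std::process::id(), 1000)?;
```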

@eskimor
Member Author

eskimor commented May 16, 2023

I would not disable the OOM killer for Polkadot itself - that would make things worse. But if we can disable it for workers (which have an upper bound on memory) that would be perfect!

@sandreim
Contributor

This all sounds like a lot of complexity. I would go for a simpler approach like monitoring memory pressure on the system and handling that gracefully. Unless there is a mem leak in our code it should be the operator who has to ensure that the system has enough memory to run a validator node.

@mrcnski
Contributor

mrcnski commented May 16, 2023

I would not disable the OOM killer for Polkadot itself - that would make things worse. But if we can disable it for workers (which have an upper bound on memory) that would be perfect!

In this case, assuming there are no other major processes running, the OOM killer would nuke polkadot, which wouldn't allow us to do any graceful exit (it sends a SIGKILL) - which may be fine?

I would go for a simpler approach like monitoring memory pressure on the system and handling that gracefully.

That sounds good, can you elaborate on how you would do that? Do you mean from within polkadot?

Thinking more about what we are trying to achieve, there are two separate problems both relating to determinism:

  1. Under memory pressure we may have lots of swap and thrashing going on, leading to workers timing out and spuriously voting against candidates.
  2. If we run out of swap memory we may also have workers repeatedly getting killed and the node spuriously voting against candidates.

Disabling swap on the workers can help with 1¹. An OOM flag can help with 2. "Safely" crashing when memory is low² seems to address both 1 and 2. It might actually simplify things in a sense, because we wouldn't have to think about memory anymore in discussions about determinism, i.e. there would be one less factor to consider. But maybe it's too extreme?

Unless there is a mem leak in our code it should be the operator who has to ensure that the system has enough memory to run a validator node.

Wondering out loud, would aborting polkadot when memory went way up have helped in any recent incidents?

Footnotes

  1. Though as mentioned above, this may have unintended effects, e.g. I'm worried it may actually make thrashing and system performance worse.

  2. We can maybe do this by monitoring system memory like you said.

@sandreim
Contributor

I would go for a simpler approach like monitoring memory pressure on the system and handling that gracefully.

That sounds good, can you elaborate on how you would do that? Do you mean from within polkadot?

The Polkadot process can make use of cgroups to detect memory pressure.
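
One concrete (and purely illustrative) way to do that is to poll the kernel's pressure-stall information; the path, threshold and helper below are assumptions, not existing node code:

```rust
use std::fs;

// Sketch: read pressure-stall information (PSI) for memory.
// `/proc/pressure/memory` is system-wide; a cgroup v2 exposes the same format
// in `memory.pressure`. Requires a PSI-enabled kernel (>= 4.20). Returns the
// `some avg10` value: the share of recent time at least one task was stalled
// on memory.
fn memory_pressure_avg10(path: &str) -> std::io::Result<f64> {
    let contents = fs::read_to_string(path)?;
    for line in contents.lines() {
        if let Some(rest) = line.strip_prefix("some ") {
            for field in rest.split_whitespace() {
                if let Some(v) = field.strip_prefix("avg10=") {
                    return v.parse().map_err(|_| {
                        std::io::Error::new(std::io::ErrorKind::InvalidData, "bad avg10")
                    });
                }
            }
        }
    }
    Err(std::io::Error::new(std::io::ErrorKind::InvalidData, "no `some` line"))
}

// e.g. if memory_pressure_avg10("/proc/pressure/memory")? > 10.0 { /* react */ }
```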

Thinking more about what we are trying to achieve, there are two separate problems both relating to determinism:

  1. Under memory pressure we may have lots of swap and thrashing going on, leading to workers timing out and spuriously voting against candidates.
  2. If we run out of swap memory we may also have workers repeatedly getting killed and the node spuriously voting against candidates.

Disabling swap on the workers can help with 1¹. An OOM flag can help with 2. "Safely" crashing when memory is low² seems to address both 1 and 2. It might actually simplify things in a sense, because we wouldn't have to think about memory anymore in discussions about determinism, i.e. there would be one less factor to consider. But maybe it's too extreme?

Unless there is a mem leak in our code it should be the operator who has to ensure that the system has enough memory to run a validator node.

Wondering out loud, would aborting polkadot when memory went way up have helped in any recent incidents?

I like this question, which kind of goes back to the initial issue comments. AFAIK validator nodes are less pressured on memory than on CPU, network or disk. I have seen some instances of OOM killing happening, possibly due to leaks in our code. Before the OOM killer goes in and does its thing, it is clear that swap usage (if not disabled) will degrade performance. If there are additional processes running on the system which also eat a lot of memory, we can only protect ourselves by bailing out gracefully, which is probably better than getting slashed for voting against valid candidates (not implemented yet) and loading the system with disputes.

@alexggh
Contributor

alexggh commented May 17, 2023

Just an idea: instead of trying to prevent OOMs, which is a hard problem without controlling the full environment (kernel configuration, OS, other processes running on the system), have we explored having a slim supervisor process that monitors our beefy executables and, when an OOM happens, just gracefully exits the network?

@eskimor
Member Author

eskimor commented May 19, 2023

My impression would be that trying to prevent OOM by doing something before it happens is racy. If memory growth is happening too quickly, we might be too slow to react and the OOM happens anyway. Also, this seems hard to configure in a way that will not backfire.

Having Polkadot killed by SIGKILL (or similar) is not ideal, and we should make sure things like database corruption cannot happen because of this, or if they can, we should at least be able to detect them. Other than that, it is better to kill it than to leave it in a state where it no longer functions correctly.

@sandreim
Contributor

Indeed it will be racy, but unfortunately it isn't possible to make it perfect. I don't think we should worry that much about the OOM killer, since it will usually target the largest consumer of memory, which is not the PVF workers but the Polkadot binary itself. So if the growth happens too quickly we'll quickly be killed, which solves the problem of not functioning optimally.

If we gracefully exit the PVF workers on memory pressure we should also notify the Polkadot process, and it can decide, based on the frequency of these events, whether to exit, so that intermittent spikes in memory usage would not backfire.

Anyway, we should experiment with this and collect some data before we enable the functionality.

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@mrcnski
Contributor

mrcnski commented Aug 31, 2023

https://lwn.net/Articles/941614/

Out-of-memory victim selection with BPF

It seems in the future we may be able to control the OOM killer with BPF.

@eskimor eskimor moved this from To do to In progress in Parachains-core Oct 8, 2023
@eskimor eskimor moved this from In progress to To do in Parachains-core Oct 8, 2023
@s0me0ne-unkn0wn s0me0ne-unkn0wn moved this from Backlog to In Progress in parachains team board Nov 2, 2023
claravanstaden added a commit to Snowfork/polkadot-sdk that referenced this issue Dec 8, 2023