
1.3 -> 1.4 regression bug: Memory usage increases over time #1978

Closed
webmaster128 opened this issue Dec 23, 2023 · 6 comments · Fixed by #1982

Comments

webmaster128 (Member) commented Dec 23, 2023

We got reports from multiple network operators that after an upgrade to CosmWasm 1.4 or 1.5, memory usage increases a lot over time. This is clearly a bug in CosmWasm for which, at the time of writing, there is no fix. However, there are good mitigation strategies, which I'll elaborate on here.

What's happening

When you run a node with wasmvm 1.4 or 1.5, the memory usage of the process increases over time. The memory usage profile looks like this:

[Memory usage graphs from affected nodes: mem_usage, mem_usage2, mem_usage3]

You might also experience consequences such as:

  1. The node being unable to stay in sync with the network because swap is used and operations become too slow
  2. The node crashing because it cannot allocate memory. This might e.g. lead to crashes in the Go space or aborts in the Rust code like here:
    SIGABRT: abort
    PC=0x2b998f1 m=9 sigcode=18446744073709551610
    signal arrived during cgo execution
    
    goroutine 10416 [syscall]:
    runtime.cgocall(0x2121300, 0xc00a78ef58)
    	runtime/cgocall.go:157 +0x4b fp=0xc00a78ef30 sp=0xc00a78eef8 pc=0x456f0b
    github.com/CosmWasm/wasmvm/internal/api._C2func_save_wasm(0x7f2abb664810, {0x0, 0xc00a8d0000, 0x69ab6}, 0x0, 0xc005400ca0)
    	_cgo_gotypes.go:662 +0x65 fp=0xc00a78ef58 sp=0xc00a78ef30 pc=0x135e865
    github.com/CosmWasm/wasmvm/internal/api.StoreCode.func1({0x54c9da0?}, {0xd8?, 0xc00a8d0000?, 0x0?}, 0x0?)
    	github.com/CosmWasm/[email protected]/internal/api/lib.go:65 +0x97 fp=0xc00a78eff0 sp=0xc00a78ef58 pc=0x13618f7
    github.com/CosmWasm/wasmvm/internal/api.StoreCode({0x1?}, {0xc00a8d0000?, 0x0?, 0x14?})
    

Why it is happening

Every time you load a contract from the file system cache, the memory usage increases (this is the bug). If contracts kick each other out of the in-memory cache, this happens often. If the cache is large enough to hold the majority of actively used contracts, it happens very rarely.
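To make that interaction concrete, here is a minimal, hypothetical sketch in Go (not the actual wasmvm or cosmwasm-vm code, and the sizes are made up): a size-limited in-memory module cache that falls back to a file-system load on every miss. loadFromDisk stands in for the FileSystemCache::load path where the memory growth happens, so the fewer evictions there are, the less often the buggy path runs.

// Hypothetical sketch, not the wasmvm code: a size-limited in-memory module
// cache with a file-system fallback. Every miss goes through loadFromDisk,
// which stands in for FileSystemCache::load, the path where memory grows.
package main

import "fmt"

type memoryCache struct {
    limitMiB uint64
    usedMiB  uint64
    order    []string          // insertion order, oldest first (simple eviction)
    sizes    map[string]uint64 // checksum -> module size in MiB
}

// loadFromDisk models loading a compiled module from the file system cache.
func loadFromDisk(checksum string) uint64 {
    fmt.Println("file system cache load:", checksum) // memory grows here (the bug)
    return 5 // pretend every module is 5 MiB
}

func (c *memoryCache) get(checksum string) {
    if _, ok := c.sizes[checksum]; ok {
        return // hit: no disk load, no extra memory growth
    }
    size := loadFromDisk(checksum)
    // Evict oldest modules until the new one fits.
    for c.usedMiB+size > c.limitMiB && len(c.order) > 0 {
        oldest := c.order[0]
        c.order = c.order[1:]
        c.usedMiB -= c.sizes[oldest]
        delete(c.sizes, oldest)
    }
    c.sizes[checksum] = size
    c.order = append(c.order, checksum)
    c.usedMiB += size
}

func main() {
    // With limitMiB=10 and 5 MiB modules, three active contracts keep evicting
    // each other, so every access hits loadFromDisk. A larger limit avoids that.
    c := &memoryCache{limitMiB: 10, sizes: map[string]uint64{}}
    for i := 0; i < 3; i++ {
        c.get("contract-A")
        c.get("contract-B")
        c.get("contract-C")
    }
}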

Workaround

To mitigate the problem, increase the config wasm.memory_cache_size in app.toml from 100 MiB to a much larger value depending on the network, e.g. 2000 MiB:

[wasm]
# other wasm config entries
memory_cache_size = 2000 # MiB

This is a per-node configuration and needs to be done on every node.

How large should the cache be?

This depends on the usage patterns of the network and the size of the compiled modules. Being able to store all contracts in memory would be one extreme that might make sense for permissioned CosmWasm chains. Permissionless chains are likely to have contracts that are almost never used.

To get a rough idea of the order of magnitude, you can check the size of the modules using something like this (see also the sketch after the list):

  • CosmWasm 1.3: du -hs ~/.myd/wasm/wasm/cache/modules/v6-*
  • CosmWasm 1.4: du -hs ~/.myd/wasm/wasm/cache/modules/v7-*
  • CosmWasm 1.5: du -hs ~/.myd/wasm/wasm/cache/modules/v8-*
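If you prefer doing the same check programmatically, here is a small Go sketch that sums the compiled module files on disk. The directory is an assumption following the examples above; adjust the home directory and the cache version (v6/v7/v8) to your setup.

// Sketch: sum the sizes of the compiled modules on disk to estimate how large
// wasm.memory_cache_size would need to be to hold most (or all) of them.
package main

import (
    "fmt"
    "io/fs"
    "os"
    "path/filepath"
)

func main() {
    home, err := os.UserHomeDir()
    if err != nil {
        fmt.Println("cannot determine home directory:", err)
        return
    }
    // Assumed location, matching the du examples above.
    modulesDir := filepath.Join(home, ".myd", "wasm", "wasm", "cache", "modules")

    var total int64
    err = filepath.WalkDir(modulesDir, func(path string, d fs.DirEntry, err error) error {
        if err != nil || d.IsDir() {
            return err
        }
        info, err := d.Info()
        if err != nil {
            return err
        }
        total += info.Size() // add up every compiled module file
        return nil
    })
    if err != nil {
        fmt.Println("walk failed:", err)
        return
    }
    fmt.Printf("compiled modules on disk: %.0f MiB\n", float64(total)/(1024*1024))
}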

Complementary strategies

The above setting is the most important thing, but there is more you can do:

  • Increase memory
  • Observe memory usage (see the sketch after this list). The symptoms are different for every blockchain and every node.
  • Consider memory usage alerting
  • Enable swap to avoid immediate hard crashes in case of overusage
  • Schedule clean node restarts from time to time
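For the observation and alerting points above, keep in mind that Go-side statistics such as runtime.MemStats do not include allocations made by the Rust side of wasmvm, so you need to watch the process's resident set size instead. A minimal, Linux-only sketch (the PID and threshold are placeholders; wire this into your own monitoring rather than printing) could look like this:

// Sketch for memory-usage observation: poll a process's resident set size
// from /proc/<pid>/status (Linux only).
package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
    "time"
)

// rssKiB reads the VmRSS line from /proc/<pid>/status.
func rssKiB(pid int) (int64, error) {
    f, err := os.Open(fmt.Sprintf("/proc/%d/status", pid))
    if err != nil {
        return 0, err
    }
    defer f.Close()
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        fields := strings.Fields(scanner.Text()) // e.g. ["VmRSS:", "123456", "kB"]
        if len(fields) >= 2 && fields[0] == "VmRSS:" {
            return strconv.ParseInt(fields[1], 10, 64)
        }
    }
    return 0, fmt.Errorf("VmRSS not found")
}

func main() {
    pid := 12345               // placeholder: PID of your node process
    limitKiB := int64(8 << 20) // placeholder: warn above 8 GiB
    for {
        if kib, err := rssKiB(pid); err == nil && kib > limitKiB {
            fmt.Printf("warning: RSS %d KiB above threshold\n", kib)
        }
        time.Sleep(time.Minute)
    }
}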

Overall, bear in mind that I am not a node operator and don't know the specifics of your blockchain or system, so I cannot make complete and final recommendations.

The bug

The bug can be reproduced locally in a pure-Rust example using heap profiling, as shown in #1955. The tool shows that memory usage increases over time but is almost zero when the process ends cleanly. This means it is not a memory leak but rather an undesired memory usage pattern.

[Heap profiling screenshot, 2023-12-23]

This is where the allocations are made: at the time of maximum memory usage (t-gmax), 96% of the allocations come through cosmwasm_vm::modules::file_system_cache::FileSystemCache::load.

At this point it is not clear to me if this is a bug in Wasmer, rkyv or cosmwasm-vm.

webmaster128 (Member, Author) commented:

Wasmer issue and minimal reproducible example: wasmerio/wasmer#4377

webmaster128 changed the title from "1.2 -> 1.3 regression bug: Memory usage increases over time" to "1.3 -> 1.4 regression bug: Memory usage increases over time" on Dec 31, 2023
yzang2019 commented:

Thanks for the report and for adding all the details; we will start taking a look.

yzang2019 commented:

@webmaster128 It seems the workaround would lead to higher memory usage, and it is more of a band-aid fix. Our module cache size is larger than the RAM, so it won't really work. Is there a real fix for the root cause of this issue?

yzang2019 commented:

Just tried the workaround of setting the cache size to 2 GB, but it doesn't seem to help.

yzang2019 commented:

Sorry, to clarify: we are seeing a different memory leak issue than this reported bug, so it might not be a successful repro.

mergify bot and chipshort pushed commits referencing this issue on Jan 12, 2024 (cherry picked from commit 1b110c6)
webmaster128 (Member, Author) commented:

Fixes released as part of wasmvm 1.4.3 and 1.5.2
