1.3 -> 1.4 regression bug: Memory usage increases over time #1978

webmaster128 · 2023-12-23T09:36:29Z

We got reports from multiple network operators that after an upgrade to CosmWasm 1.4 or 1.5 the memory usage increases a lot over time. This is clearly a bug in CosmWasm for which, at the time of writing, there is no fix. However, there are good mitigation strategies, which I'll elaborate on here.

What's happening

When you run a node with wasmvm 1.4 or 1.5, the memory usage of the process increases over time. The memory usage profile looks like this:

You might also experience consequences such as:

Node unable to stay in sync with the network because swap is used and operation becomes too slow

Node crashing because it cannot allocate memory. This might e.g. lead to crashes in the Go space or aborts in the Rust code like here:

SIGABRT: abort
PC=0x2b998f1 m=9 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 10416 [syscall]:
runtime.cgocall(0x2121300, 0xc00a78ef58)
	runtime/cgocall.go:157 +0x4b fp=0xc00a78ef30 sp=0xc00a78eef8 pc=0x456f0b
github.com/CosmWasm/wasmvm/internal/api._C2func_save_wasm(0x7f2abb664810, {0x0, 0xc00a8d0000, 0x69ab6}, 0x0, 0xc005400ca0)
	_cgo_gotypes.go:662 +0x65 fp=0xc00a78ef58 sp=0xc00a78ef30 pc=0x135e865
github.com/CosmWasm/wasmvm/internal/api.StoreCode.func1({0x54c9da0?}, {0xd8?, 0xc00a8d0000?, 0x0?}, 0x0?)
	github.com/CosmWasm/[email protected]/internal/api/lib.go:65 +0x97 fp=0xc00a78eff0 sp=0xc00a78ef58 pc=0x13618f7
github.com/CosmWasm/wasmvm/internal/api.StoreCode({0x1?}, {0xc00a8d0000?, 0x0?, 0x14?})

Why it is happening

Every time you load a contract from the file system cache, the memory usage increases (this is the bug). If contracts kick each other out of the in-memory cache, this happens often. If the cache is large enough to hold the majority of actively used contracts, this happens very rarely.
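To illustrate why the cache size matters so much, here is a minimal toy model. It assumes a size-bounded in-memory cache with least-recently-used eviction; the function name, the contract names and the sizes are made up for illustration, and this is not the actual cosmwasm-vm code. It only counts how often a module would have to be loaded from the file system cache, which is the operation that triggers the bug.

```python
from collections import OrderedDict


def count_disk_loads(accesses, sizes_mib, cache_size_mib):
    """Count loads from the file system cache in a toy LRU model.

    accesses: sequence of contract identifiers in call order
    sizes_mib: dict mapping identifier -> module size in MiB (hypothetical values)
    cache_size_mib: capacity of the in-memory cache in MiB
    """
    cache = OrderedDict()  # identifier -> size, least recently used first
    used = 0
    disk_loads = 0
    for contract in accesses:
        if contract in cache:
            cache.move_to_end(contract)  # cache hit, mark as recently used
            continue
        disk_loads += 1  # cache miss: module comes from the file system cache
        size = sizes_mib[contract]
        # Evict least recently used modules until the new one fits.
        while cache and used + size > cache_size_mib:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        if size <= cache_size_mib:
            cache[contract] = size
            used += size
    return disk_loads


# Three equally popular contracts of 40 MiB each, called round-robin.
sizes = {"a": 40, "b": 40, "c": 40}
accesses = ["a", "b", "c"] * 100

small = count_disk_loads(accesses, sizes, cache_size_mib=100)
large = count_disk_loads(accesses, sizes, cache_size_mib=2000)
print(small, large)
```

In this made-up access pattern the 100 MiB cache can hold only two of the three modules, so every single call is a miss. With the larger cache each module is loaded from disk once and then stays in memory.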

Workaround

To mitigate the problem, increase the config wasm.memory_cache_size in app.toml from 100 MiB to a much larger value, depending on the network, e.g. 2000 MiB:

[wasm]
# other wasm config entries
memory_cache_size = 2000 # MiB

This is a per-node configuration and needs to be done on every node.

How large should the cache be?

This depends on the usage patterns of the network and the size of the compiled modules. Being able to store all contracts in memory would be one extreme that might make sense for permissioned CosmWasm chains. Permissionless chains are likely to have contracts that are almost never used.

To get a rough idea of the order of magnitude, you can check the size of the modules using something like this:

CosmWasm 1.3: du -hs ~/.myd/wasm/wasm/cache/modules/v6-*
CosmWasm 1.4: du -hs ~/.myd/wasm/wasm/cache/modules/v7-*
CosmWasm 1.5: du -hs ~/.myd/wasm/wasm/cache/modules/v8-*
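If you prefer a number you can paste into app.toml, the same measurement can be sketched in Python. This is a rough helper under stated assumptions: the function names, the `fraction` parameter and the 100 MiB floor are hypothetical, the path is the example path from the commands above, and the on-disk size of the modules is only an approximation of what they occupy in memory.

```python
import math
import os


def dir_size_mib(path):
    """Sum the sizes of all regular files below path, in MiB."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total_bytes += os.path.getsize(file_path)
    return total_bytes / (1024 * 1024)


def suggest_cache_size_mib(modules_mib, fraction=1.0, floor_mib=100):
    """Suggest a memory_cache_size value in MiB.

    fraction is the share of modules you expect to be actively used
    (an assumption you have to make for your network); floor_mib keeps
    the result from dropping below the 100 MiB default.
    """
    return max(floor_mib, math.ceil(modules_mib * fraction))


if __name__ == "__main__":
    # Example path from above; adjust the home directory to your chain.
    modules_dir = os.path.expanduser("~/.myd/wasm/wasm/cache/modules")
    size = dir_size_mib(modules_dir)
    print(f"modules on disk: {size:.1f} MiB")
    print(f"suggested memory_cache_size: {suggest_cache_size_mib(size)}")
```

Note that this walks the whole modules directory, so it includes module versions left over from older CosmWasm releases if they are still on disk.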

Complementary strategies

The above setting is the most important thing. But there is more you can do:

Increase memory
Observe memory usage. The symptoms are different for every blockchain and every node.
Consider memory usage alerting
Enable swap to avoid immediate hard crashes in case of overusage
Schedule clean node restarts from time to time
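For observing memory usage and alerting, a minimal Linux-only sketch could read the resident set size of the node process from /proc. The function names and the threshold are hypothetical; in practice you would likely feed this into whatever monitoring system you already use.

```python
def parse_rss_mib(status_text):
    """Extract VmRSS (resident set size) in MiB from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line format: "VmRSS:    123456 kB"
            return int(line.split()[1]) / 1024
    return None  # field is missing, e.g. for kernel threads


def read_rss_mib(pid):
    """Read the current RSS of a process in MiB (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_rss_mib(f.read())


def should_alert(rss_mib, threshold_mib):
    """Return True if the memory usage reached the alert threshold."""
    return rss_mib is not None and rss_mib >= threshold_mib


# Example with a made-up status excerpt instead of a live process.
sample = "Name:\tmyd\nVmRSS:\t 2097152 kB\n"
rss = parse_rss_mib(sample)
print(rss, should_alert(rss, threshold_mib=2000))
```

A sensible threshold depends on the RAM of the machine and the normal usage of your node, so treat the value above as a placeholder.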

Overall bear in mind I am not a node operator and I don't know the specifics of your blockchain or system. So I cannot make complete and final recommendations.

The bug

The bug can be reproduced locally in a pure-Rust example using the heap profiling shown in #1955. The tool shows us that the memory usage increases over time but is almost zero when the process ends cleanly. This means this is not a memory leak but rather an undesired memory usage pattern.

This is where the allocations are made. At max memory usage time (t-gmax), 96% are coming through cosmwasm_vm::modules::file_system_cache::FileSystemCache::load.

At this point it is not clear to me if this is a bug in Wasmer, rkyv or cosmwasm-vm.


webmaster128 · 2023-12-28T09:13:38Z

Wasmer issue and minimal reproducible example: wasmerio/wasmer#4377

yzang2019 · 2024-01-02T16:11:55Z

Thanks for the report and adding all the details, we will start taking a look

yzang2019 · 2024-01-03T02:27:41Z

@webmaster128 It seems the workaround would lead to higher memory usage, so it is more of a band-aid fix. Our module cache size is larger than the RAM, so it won't really work. Is there a real fix for the root cause of this issue?

yzang2019 · 2024-01-03T06:20:37Z

Just tried the workaround of setting the cache size to 2 GB, but it doesn't seem to help

yzang2019 · 2024-01-03T20:27:21Z

Sorry, to clarify: we are seeing a different memory leak issue than the bug reported here, so it might not be a successful repro.


webmaster128 · 2024-01-18T11:48:30Z

Fixes released as part of wasmvm 1.4.3 and 1.5.2

webmaster128 mentioned this issue Dec 28, 2023

Long living engine consumes endless memory when modules are loaded from file wasmerio/wasmer#4377

Open

webmaster128 changed the title ~~1.2 -> 1.3 regression bug: Memory usage increases over time~~ 1.3 -> 1.4 regression bug: Memory usage increases over time Dec 31, 2023

webmaster128 mentioned this issue Jan 6, 2024

Store engine together with Module to mitigate memory increase issue #1982

Merged

webmaster128 added a commit that referenced this issue Jan 11, 2024

Add CHANGELOG entry for fixing #1978

1b110c6

webmaster128 closed this as completed in #1982 Jan 11, 2024

mergify bot pushed a commit that referenced this issue Jan 12, 2024

Add CHANGELOG entry for fixing #1978

6438da8

(cherry picked from commit 1b110c6)

mergify bot pushed a commit that referenced this issue Jan 12, 2024

Add CHANGELOG entry for fixing #1978

40eb5ac

(cherry picked from commit 1b110c6)

chipshort pushed a commit that referenced this issue Jan 12, 2024

Add CHANGELOG entry for fixing #1978

cc835b5

(cherry picked from commit 1b110c6)

chipshort pushed a commit that referenced this issue Jan 12, 2024

Add CHANGELOG entry for fixing #1978

9087c9f

(cherry picked from commit 1b110c6)

chipshort pushed a commit that referenced this issue Jan 12, 2024

Add CHANGELOG entry for fixing #1978

61c5e53

(cherry picked from commit 1b110c6)

This was referenced Jan 18, 2024

feat(v2.9): upgrade wasmvm 1.5.2 terra-money/core#257

Merged

feat(v2.6): upgrade wasmvm 1.4.3 terra-money/core#258

Merged

spoo-bar mentioned this issue Jan 19, 2024

chore: bump wasmvm to 1.5.2 archway-network/archway#534

Merged

jim380 mentioned this issue Jan 20, 2024

Upgrade WASMVM to Include the Latest Memory Leak Fix sedaprotocol/seda-chain#176

Merged

philipsu522 mentioned this issue Mar 20, 2024

Bump wasmvm to v1.5.2 sei-protocol/sei-wasmd#44

Merged

achilleas-kal mentioned this issue Jun 17, 2024

Regularly experiencing OOM running node even with 128gb RAM InjectiveFoundation/injective-core#5

Closed

xinzhongyoumeng mentioned this issue Jun 21, 2024

[bug]Regularly experiencing OOM running node even with 24GB RAM NibiruChain/nibiru#1934

Open
