Memory leak #1037
I have already run a node for over 24 hours and did not observe anything out of the ordinary. The node had 16 GB RAM, 16 GB swap, 2 vCPUs, and a 300 GB disk. All of these are below the recommended settings and were chosen to test under rigorous conditions. The node was collecting Prometheus metrics, which I visualized with a Grafana dashboard. During epoch block processing, the virtual memory would spike while the resident memory did not change much, and the resident memory eventually returned to normal. The virtual memory, however, is not released back to the OS. The pprof samples did not indicate any leaked or blocked goroutines either. Unfortunately, I didn't save those metrics. As a sanity check, I'm currently rerunning a v7.0.3 node over 24 hours to collect the samples again.
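For reference, the goroutine and heap samples mentioned above can be pulled from the standard Go pprof endpoint. The sketch below shows the generic way of exposing it in a Go program; the port and the wiring into the node process are assumptions, not the node's actual configuration.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose pprof on a local port (the port is an assumed example).
	// Profiles can then be inspected with, e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/goroutine
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the node's normal run loop
}
```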
@p0mvn It is almost certainly CosmWasm. Why do I say this?
@faddat Thanks for the info. Interestingly, I'm not seeing any RAM issues on my side. I've been running a node for over 24 hours, and it seems relatively stable in terms of resident memory used.
Here, we can see two spikes in memory usage. They happened during epoch block processing. We can also observe that the resident memory goes back to normal after the epoch block. However, the virtual memory is never reclaimed by the OS. I think this is why many people may get the impression that RAM usage keeps increasing.
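The resident-vs-virtual distinction can also be observed from inside a Go process via `runtime.MemStats`: the runtime keeps the reserved address space (`HeapSys`) around after a spike even though physical pages are returned (`HeapReleased`). This is only a minimal sketch of that idea and assumes nothing about the node's internals:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// HeapSys (virtual space obtained from the OS) typically stays high after
	// a spike, while HeapInuse drops and HeapReleased grows as physical pages
	// are handed back -- matching resident memory falling while virtual does not.
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		fmt.Printf("HeapSys=%d MiB HeapInuse=%d MiB HeapReleased=%d MiB\n",
			m.HeapSys>>20, m.HeapInuse>>20, m.HeapReleased>>20)
		time.Sleep(30 * time.Second)
	}
}
```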
I will keep the node running for 24 more hours to monitor one more epoch. If everything is stable, I think we can mark this issue as resolved.
I wonder if there's anything related to queries that causes memory increases. Should we post one of the nodes for people to query against?
Why do you think so? Is it from the correlation between "Query Count Increase" and RAM usage in the dashboard above? I think that correlation is actually due to the epoch processing. "Query Count Increase" really means the increase in IAVL gets over the last minute; my choice of name for that panel wasn't very accurate. I'm guessing that we do more IAVL gets during epoch processing than usual. In addition, I'm seeing in the logs that the execution flow slows down during epochs (there are periods of more than a minute where no logs are emitted), and I suspect that might be affecting the graph as well. If there is another reason, please let me know. I'm also happy to write a script that sends a bunch of queries to my node to test.
Oh, no reason to suspect it in particular. I was just wondering what other people's nodes could potentially be doing that we're not seeing in the test environment.
I wrote a simple perf-test tool to see how the node behaves under heavy query load: https://github.com/p0mvn/perf-osmo Right now, only 2 queries are supported. The tool continuously spams these 2 queries at random heights. I tested a validator node on testnet with 16 GB RAM and 16 GB swap. The configuration is:
I also tried with 100 connections and 10,000 RPC calls per connection. In both cases, the graphs look similar: memory usage does spike but eventually comes back to normal once the query load ends. The number of goroutines stabilizes as well and returns to the value it had before the load started. Resident memory was 4.44 GiB before the load started and 4.23 GiB 40 minutes after it started. According to pprof, there are no blocked goroutines after the load ends. Based on these results, I don't think there are IAVL-specific memory leaks in queries. If there is one, it must be somewhere else in the system.
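For context, the load pattern described above (many concurrent connections, each firing RPC queries at random heights) can be approximated with a few lines of Go. This is only a rough sketch of that idea, not the perf-osmo implementation; the endpoint, query path, connection count, and height bound are all assumed values.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"sync"
)

const (
	numConnections = 50                       // parallel workers (assumed value)
	callsPerConn   = 1000                     // RPC calls per worker (assumed value)
	maxHeight      = 4_000_000                // upper bound for random heights (assumed)
	nodeRPC        = "http://localhost:26657" // Tendermint RPC endpoint (assumed)
)

// Each worker repeatedly hits the ABCI query endpoint at a random height.
// The query path here is illustrative, not the exact one used by perf-osmo.
func worker(wg *sync.WaitGroup) {
	defer wg.Done()
	client := &http.Client{}
	for i := 0; i < callsPerConn; i++ {
		height := rand.Int63n(maxHeight) + 1
		url := fmt.Sprintf("%s/abci_query?path=%q&height=%d",
			nodeRPC, "/cosmos.bank.v1beta1.Query/TotalSupply", height)
		resp, err := client.Get(url)
		if err != nil {
			continue // a real tool would count and report errors
		}
		resp.Body.Close()
	}
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < numConnections; i++ {
		wg.Add(1)
		go worker(&wg)
	}
	wg.Wait()
}
```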
My mainnet node has been running with no observed RAM issues for over 4 days. I'm going to close this issue for now.
Background
Some validators have reported high RAM usage and, in some cases, being OOM-killed.
We should investigate this issue and document all the findings. If the issue stems from a functional error, a follow-up task needs to be created to address this.
Acceptance Criteria