kv: many large, concurrent INSERTs can overload intent resolution and lead to OOM #76282
Comments
Hello, I am Blathers. I am here to help you get the issue triaged. Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here. I have CC'd a few people who may be able to assist you:
If we have not gotten back to your issue within a few business days, you can try the following:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
This doc says that the memory usage of Cockroach can be divided into the following three parts.
The following is the memory usage dashboard of a crashed node. Right before the crash, the Go allocated part took 10.85GB (this varies by run; the range is 6~11GB), which includes both live objects and uncollected garbage. The CGo part took 6.77GB. That still leaves a gap of 24.55 - 10.85 - 6.77 = 6.93 GB taken by overhead for this particular run. After reproducing the crash many times, the following clues were found in the logs of oom-killer:
According to the Linux kernel's cgroups documentation, when a page of the page cache is loaded into physical memory, it is accounted for in the RSS of the user process. To verify the hypothesis that the page cache occupied the rest of the memory, the following experiment was done: run a task on the cockroach pod that periodically drops the page cache from the system. It turned out that OOM-killing was mitigated.
Page cache
To figure out which part of the cockroach logic loads so much page cache, a perf profile was taken. The results showed that pebble always loads sstable files into the page cache. In our particular use case, there was heavy page cache loading when pebble was doing compactions. Cockroach read a lot of small sstables, merged them, then wrote them into large sstables. The whole process does not use direct I/O. Also, cockroach has no way to enable
Go allocated memory
Meanwhile, the memory usage of the Go allocated part is much larger than expected. The following CockroachDB data structures occupy the majority of the heap, according to heap dumps:
See this image for detailed information. Triage: sysbench-tpcc inserts large bulks of data into the stock tables. A single insertion can include up to 100K entries, or around 36MB. Loading this data requires around 4.6GB of RAM if all 128 threads are active. Cockroach consumes these bulks by allocating a large amount of heap space rather than processing them one by one, the way a stream consumer would.
Cgo allocated memory
Although Pebble added a cache implementation to limit the size of Cgo allocated memory, we still noticed cases where the Cgo allocated part exceeded the limit. Below are the Cgo memory stats generated by CockroachDB.
The jemalloc profile only gives us the leaf function that allocated the memory:
Perf profiling shows that the majority of native memory allocations come from sstable reading:
Summary
|
The thrust of this issue is absolutely correct: the expectation that cockroach should not OOM and should manage its heap usage well is completely legitimate. We've been fighting that fight for a long time (#10320) and will continue to fight it and improve our memory accounting. The heap profile you're providing here is very valuable. The Go heap usage you see there related to intent resolution is a good find and the reproduction steps are much appreciated. Clearly there's an issue here related to memory usage underneath the asynchronous intent resolver. Not to excuse the OOM, which should never happen, but I suspect that sysbench-tpcc may be running into some pathological behavior related to contention and concurrency control. Analyzing that workload on its own is likely to yield some interesting discoveries. All that being said, I think there are definitely some things to do here, both to bound the memory usage in the intent resolver and to account for that in our memory monitoring framework. Regarding the performance implications of thrashing the page cache, that's out of my area of expertise. It seems that there was some exploration at some point (#13620). |
The huge amount of memory allocated due to Here we configure the IntentResolver: cockroach/pkg/kv/kvserver/intentresolver/intent_resolver.go Lines 228 to 236 in 7b7da49
However, we neglect to set cockroach/pkg/internal/client/requestbatcher/batcher.go Lines 107 to 109 in 7b7da49
cockroach/pkg/internal/client/requestbatcher/batcher.go Lines 148 to 151 in 7b7da49
Ideally we'd do some proper memory monitoring which could dynamically tune the concurrency and track the usage. We're in only the earliest days of that sort of heap management. Most of our higher level heap management revolves around failing queries which exceed the available budget and bounding the working set size for everything else. Here we fail to do either. I suspect we can come up with something for backport to avoid the more pathological cases we're seeing here. I want to more deeply understand what it means for the allocated block sizes in that heap profile to be so big. |
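The bound discussed in this comment, a cap on how many batches may be in flight with the sender stalled once the cap is reached, can be sketched roughly as follows. This is a minimal illustration and not the requestbatcher code: `inFlightLimiter`, its methods, and the limit of 4 are hypothetical names and values chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// inFlightLimiter is a hypothetical helper that caps how many batches may be
// in flight at once. It is not part of CockroachDB.
type inFlightLimiter struct {
	slots chan struct{}
}

func newInFlightLimiter(limit int) *inFlightLimiter {
	return &inFlightLimiter{slots: make(chan struct{}, limit)}
}

// acquire blocks while the limit is reached; this is the backpressure.
func (l *inFlightLimiter) acquire() { l.slots <- struct{}{} }

// release frees one slot after a batch completes.
func (l *inFlightLimiter) release() { <-l.slots }

// inFlight reports how many slots are currently held.
func (l *inFlightLimiter) inFlight() int { return len(l.slots) }

func main() {
	const limit = 4 // assumed value, for illustration only
	lim := newInFlightLimiter(limit)

	var wg sync.WaitGroup
	var mu sync.Mutex
	peak := 0
	for i := 0; i < 100; i++ {
		lim.acquire() // the producer stalls here once `limit` batches are in flight
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer lim.release()
			mu.Lock()
			if n := lim.inFlight(); n > peak {
				peak = n
			}
			mu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println("peak stayed within limit:", peak <= limit)
}
```

Because `acquire` blocks, a burst of work slows its producer down instead of growing an unbounded queue on the heap.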
Thank you for opening this great issue, @coderplay! |
Okay, we've taken a slightly deeper look. We see this is the bulk loading phase of the workload where there's 100000 inserts in a batch. Can I ask you to run a quick experiment if it's not too much hassle: Try setting Currently we're thinking there's a confluence of things here whereby optimizations to go fully async in replication for certain operations remained in effect when they shouldn't. Given that decision, the system commits to doing a lot of asynchronous work and not providing any backpressure to the client. It's certainly not the intention of the code to be hitting these asynchronous code paths for bulk writing operations. The detail in this issue is great. Thanks especially for the heap profiles and workload details. They made the process of unpacking this issue much easier. |
+1 to what Andrew said. It would be useful to check whether I've taken a look at the workload in an attempt to reproduce myself. It's worth pointing out what makes this workload unusual. |
Specifically, we suspect that the ingestion is running into issues because of: cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go Lines 472 to 474 in 3ac48db
Here's some evidence that this is the case. When running the test on my local machine, I see a constant 129 open SQL connections: However, we see a gradual increase in goroutines as the ingestion runs, indicating a buildup of async work: A goroutine dump shows the following: We see about 1000 ( Clearly, something needs to give here. The first would be to address the TODO and limit the amount of async work we can run concurrently from |
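One way to limit how much async work runs concurrently is sketched below, under stated assumptions. `asyncRunner`, `run`, and the limit of 2 are hypothetical, and falling back to running the task on the caller's goroutine when no slot is free is only one possible policy; it is not taken from the CockroachDB source and is not what the TODO prescribes.

```go
package main

import (
	"fmt"
	"sync"
)

// asyncRunner is a hypothetical helper that allows at most cap(sem) tasks to
// run in background goroutines at any time.
type asyncRunner struct {
	sem chan struct{}
	wg  sync.WaitGroup
}

func newAsyncRunner(maxAsync int) *asyncRunner {
	return &asyncRunner{sem: make(chan struct{}, maxAsync)}
}

// run starts task in a new goroutine if an async slot is free and returns
// true. Otherwise it runs task on the caller's goroutine and returns false,
// so the caller pays for the work and cannot pile up more of it.
func (r *asyncRunner) run(task func()) bool {
	select {
	case r.sem <- struct{}{}:
		r.wg.Add(1)
		go func() {
			defer r.wg.Done()
			defer func() { <-r.sem }()
			task()
		}()
		return true
	default:
		task()
		return false
	}
}

// wait blocks until all background tasks have finished.
func (r *asyncRunner) wait() { r.wg.Wait() }

func main() {
	r := newAsyncRunner(2)
	gate := make(chan struct{})

	// The first two tasks park on gate and hold both async slots.
	fmt.Println("task 0 async:", r.run(func() { <-gate }))
	fmt.Println("task 1 async:", r.run(func() { <-gate }))
	// No slot is free, so the third task runs synchronously.
	fmt.Println("task 2 async:", r.run(func() {}))

	close(gate)
	r.wait()
}
```

With this policy the goroutine count stays bounded by the slot count, and a client that keeps submitting work is slowed down once the slots are full.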
@nvanbenschoten @ajwerner sysbench client crashed after setting I guess you didn't see the oom because client side was down |
Why did the client crash? |
Still figuring it out, sysbench faced a fatal error from cockroachdb
|
Looks like cockroach closed the connection. |
This is all surprising and smells of some sort of client misconfiguration. You can set that cluster setting independently of the test; cluster settings are global and shared. Consider setting it once on your cluster and then running the unmodified workload. |
Or, perhaps you're saying you did that. In which case, is there anything interesting going on in the cockroach logs? |
Yep, I set the cluster setting ahead of time in another session. There are so many warns and errors in the logs, I was lost :) |
I was able to reproduce the issue on similar hardware with 128 client threads. Then dropped the client concurrency down to 32 threads and still saw the overload + OOMs. During the tests, I saw a few interesting results:
I then switched from testing on On
SET CLUSTER SETTING admission.kv.enabled = true;
SET CLUSTER SETTING admission.sql_kv_response.enabled = true;
SET CLUSTER SETTING admission.sql_sql_response.enabled = true;
However, when I jumped back up to 128 client threads, I did eventually still see an OOM. So while admission control helps, it's not a full replacement for individual components protecting themselves by enforcing memory limits. Here are some takeaways for how we can improve the situation:
I'll pull these into separate issues. @coderplay what are your intentions for this issue? You've demonstrated a data ingestion workload that currently overloads CRDB which we can work to resolve, but I want to make sure you're not stuck waiting for that. If this is blocking you, you can drop the client concurrency during sysbench's |
That leads to another issue we noticed. We actually created a LoadBalancer based service for our CockroachDB cluster, and benchmarked with
Did you reproduce the sysbench segfault after setting
We might be able to work around this particular case. What worries me more is that Cockroach lacks a systematic |
Any updates on this? |
Hey @coderplay, thanks again for filing the issue and sorry for the delay in response here. We're busy trying to wrap up work for our upcoming 22.1 release. Improving CockroachDB's memory management is an important goal for us, and one that we're continually working on. We're in agreement with you that the behavior that you pointed out is problematic. Just to respond to your last point here:
We've put a lot of effort into CockroachDB's memory monitoring system over the years, and it's not true that we lack a systematic mechanism to bound memory usage. If you're interested to learn about how it works, have a look at this file and the places that it's used throughout the codebase: https://github.com/cockroachdb/cockroach/blob/master/pkg/util/mon/bytes_usage.go
The system allows components of the database to register the memory that they use with a hierarchy of monitors that begin at the SQL operator level and ladder up through queries, sessions, and ultimately to the limit that's governed by
While simple, this system is powerful enough to prevent the database from running out of memory in most workloads that we test with. The main downside to the system is that it's cooperative: systems within the database must know to register memory, especially memory that's allocated proportionally to any user-input value.
As you've correctly pointed out, the database fails to manage memory properly under the workload that you've been testing with: a very highly-concurrent implementation of a bulk ingest of TPCC data. We've filed issues that will resolve this problem when closed.
However, I do want to point out that the case you're testing is somewhat extreme. Of course, we aspire to make CockroachDB work in every scenario, and the fact that it doesn't in this case is certainly a problem. But it's important to note that this somewhat extreme use case is pretty far from workloads that we see in the wild. |
One last note: I want to re-express gratitude for your hard work in filing this detailed, helpful issue. We really appreciate it! |
We have marked this issue as stale because it has been inactive for |
Describe the problem
Please describe the issue you observed, and any steps we can take to reproduce it:
There are many issues found when we were doing the tpcc data preparation with CockroachDB. This post only focuses on memory issues.
One of the CockroachDB pods was evicted by Kubernetes for exceeding the pod memory limit during the data preparation, and the pod was therefore restarted. We assumed it was caused by the Pod QoS class, because we set the pod's memory request (16GB) far lower than the memory limit (26GB). When a node runs out of memory, Kubernetes is more likely to evict a Burstable pod. So we tried setting request=limit (26GB), but the container was OOMKilled after running the data preparation for a while.
The Go heap dump showed that CockroachDB only allocated 4~6GB (varies by run), plus the 26 GB * 25% = 6.5 GB we set aside for RocksDB/pebble cache. There is still around 13 GB headroom before witnessing OOM. We suspected that many unreachable garbage objects were undetected by heap profiling. So we tried lowering the GOGC parameter (which is set to 100 by default) to make CockroachDB more aggressive in cleaning up the garbage, but it didn’t work.
One of the other runs showed that the Go heap usage can reach 11GB, while the RocksDB cache (precisely CGo native) part can reach 15GB despite the 26 GB * 25% = 6.5 GB setting.
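The budget arithmetic in the two paragraphs above can be written out as a small sketch. The function name `headroomGB` is made up for this example; the figures are the ones quoted in this report, using the upper end of the 4~6GB heap range.

```go
package main

import "fmt"

// headroomGB returns how much of the container memory limit is left after the
// Go heap and the CGo (storage cache) allocations are subtracted.
func headroomGB(limitGB, goHeapGB, cgoGB float64) float64 {
	return limitGB - goHeapGB - cgoGB
}

func main() {
	const limitGB = 26.0
	cacheGB := limitGB * 0.25 // the 25% cache setting: 6.5 GB

	// Heap-dump view: 6 GB of Go heap plus the configured cache size.
	fmt.Printf("expected headroom: %.1f GB\n", headroomGB(limitGB, 6, cacheGB)) // 13.5 GB
	// Worst observed run: 11 GB of Go heap and 15 GB of CGo memory.
	fmt.Printf("observed headroom: %.1f GB\n", headroomGB(limitGB, 11, 15)) // 0.0 GB
}
```

In the worst observed run the two parts alone add up to the 26GB limit, which is consistent with the container being OOMKilled.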
A similar issue was reported (*: node crash with OOM despite memory limits set) in 2017, but it was left unresolved.
To Reproduce
What did you do? Describe in your own words.
If possible, provide steps to reproduce the behavior:
We chose to use sysbench-tpcc because we believe third-party testing tools will be more impartial.
Environment:
Hardware
Software
Expected behavior
Jira issue: CRDB-13059