One node eats all CPU and an increasing amount of memory #38788
Comments
Huh, this is pretty weird. There are a bunch of stuck goroutines in C++ MVCCScan:
@petermattis or @ajkr any clues about what this could mean?
I'm leaving the process running if you need it. BTW, still spinning.
This looks well qualified, so I'm moving it over to core. If there are any objections or basics to cover, let me know.
Thanks for reporting. The "debug/nodes/*/enginestats" files from the debug.zip could be useful to look at. Also, since the infinite loop appears to be inside a cgo call, the goroutine dump isn't enough. If you have time to run quickstack (https://github.com/yoshinorim/quickstack) on your process and report its output, that could be helpful as well.
Also, it's worth noting that from the goroutine dump the stuck scans are all
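To make the "goroutine dump isn't enough" point concrete: a goroutine parked inside native code via cgo shows only the Go side of the call in a stack dump, so a native stack tool (quickstack, pstack, gdb) is needed to see where the C++ code is looping. Below is a minimal, self-contained sketch of my own (not CockroachDB code; `spin()` is a made-up stand-in for the stuck C++ `MVCCScan`):

```go
package main

/*
#include <unistd.h>

// spin stands in for a native call that never returns, like the stuck
// C++ MVCCScan loop reported in this issue.
static void spin(void) {
	for (;;) {
		sleep(1);
	}
}
*/
import "C"

import (
	"fmt"
	"os"
	"runtime"
	"time"
)

func main() {
	// Park one goroutine inside a cgo call, similar to the stuck
	// MVCCScan goroutines in the dump.
	go func() {
		C.spin()
	}()
	time.Sleep(200 * time.Millisecond)

	// Dump all goroutine stacks. The stuck goroutine shows only the Go
	// side of the call (runtime.cgocall / main._Cfunc_spin); the C frames
	// inside spin() are invisible to the Go runtime, which is why a
	// native stack tool is needed to see what the C++ code is doing.
	buf := make([]byte, 1<<20)
	n := runtime.Stack(buf, true)
	fmt.Fprintln(os.Stderr, string(buf[:n]))
}
```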
Attached enginestats outputs |
@ajkr, sorry, I couldn't get quickstack working. Running pstack several times gave this:
Saved a core for future reference. Attached to it with gdb:
BTW, according to top, there are many threads which spin like hell, not just one:
Thread IDs for them:
Picking one with the most burned CPU credits:
Printing backtrace for that thread:
Taking a look at another:
And another:
bt full for this thread:
Took a core some minutes later and inspected the last thread again:
Please tell me if you need anything else. Hope this helps.
Thanks a lot for all the info you provided. Haven't found a cause yet, but this certainly helps us know where to look.
OK, restarting and trying to reproduce #38778. |
So far I could only reproduce this one (now with v19.1.3). It's strange that it happened on the same machine again (the machines are identical in every aspect, both hardware- and software-wise) and, AFAIR, with the same node ID (1), the first one I started the cluster with. I don't know whether that has any relevance or not...
New info:
Second run:
So after around 7.2 million operations (these cover two kinds of transactions, one for multiple inserts and another for deletes; details in #38778) the phenomenon kicks in reliably. Any news on this?
It happened when multiple versions of a key existed as the smallest key in the reverse-scanned range, and a certain number of those versions were newer than the scan timestamp. See test case for full root-cause details. Fixes cockroachdb#38788. Release note (bug fix): Fixed a potential infinite loop in queries involving reverse scans.
I did eventually find an infinite loop bug that produces the same stack traces as yours. It is fixed in #39084. Thanks again for the bug report and all the info; without it this bug could've gone unfixed for a long time. Are you building from source? If so, it'd be very interesting to hear whether that patch fixes the issue for your workload. If not, well, it should be available in an upcoming 19.1.x release.
39084: libroach: fix infinite loop in reverse `MVCCScan()` r=ajkr a=ajkr It happened when multiple versions of a key existed as the smallest key in the reverse-scanned range, and a certain number of those versions were newer than the scan timestamp. See test case for full root-cause details. Fixes #38788. Release note (bug fix): Fixed a potential infinite loop in queries involving reverse scans. Co-authored-by: Andrew Kryczka <[email protected]>
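To make the root cause in that commit message easier to picture, here is a minimal Go sketch of my own (illustrative only; `versionedKV` and `reverseScan` are made up for this example and are not the libroach implementation). It models the triggering key shape: the smallest key in the reverse-scanned range has multiple versions, some of them newer than the scan timestamp, and a correct reverse scan must still step past that whole group of versions:

```go
package main

import "fmt"

// versionedKV is a toy stand-in for an MVCC key/value pair. Versions of
// the same user key are ordered newest-first, mirroring how CockroachDB
// orders MVCC versions; everything else here is made up for illustration.
type versionedKV struct {
	key   string
	ts    int // version timestamp
	value string
}

// reverseScan returns, for each user key from high to low, the newest
// version at or below readTS. data is sorted ascending by key and
// descending by ts within each key.
func reverseScan(data []versionedKV, readTS int) []versionedKV {
	var out []versionedKV
	i := len(data) - 1
	for i >= 0 {
		curKey := data[i].key
		var visible *versionedKV
		// Walk backwards (from the oldest toward the newest version) within
		// the current user key, remembering the newest version <= readTS.
		for j := i; j >= 0 && data[j].key == curKey; j-- {
			if data[j].ts <= readTS {
				visible = &data[j]
			}
			// Termination depends on the cursor always ending up strictly
			// before every version of curKey, even when some versions are
			// newer than readTS; per the commit message, the fixed bug was
			// an infinite loop hit in that situation at the smallest key.
			i = j - 1
		}
		if visible != nil {
			out = append(out, *visible)
		}
	}
	return out
}

func main() {
	// Triggering shape from the commit message: the smallest key in the
	// reverse-scanned range ("a") has multiple versions, some newer than
	// the scan timestamp (readTS = 4).
	data := []versionedKV{
		{"a", 9, "a@9"}, {"a", 8, "a@8"}, {"a", 3, "a@3"},
		{"b", 5, "b@5"},
	}
	fmt.Println(reverseScan(data, 4)) // prints [{a 3 a@3}]
}
```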
Thanks @ajkr, I restarted the data load and it's still running; it never got this far before the fix.
@ajkr: I checked out v19.2.0-alpha.20190805 and I can't see this change in it. May I use that version for further testing (if it contains the fix in a different form), or should I apply the fix by hand?
Hi @bra-fsn, thanks for pointing that out. I will try to make sure it's included in the next 19.2 alpha release. In the meantime you can apply the patch manually. Sorry for the inconvenience!
My issue with 20.1.6 is similar: #56362. Usually when this happens I see KV store latency instability and growth in the goroutine count. I've experienced this at least 2 times.
Describe the problem
Doing the same insert as in #38778 on the same setup, but now with a different result.
Queries don't time out (yet?) as in the other issue and there are no logs about slow latch acquisitions, but one node eats up all the CPU it can, causing the cluster to perform poorly. So far it seems to be a different issue, hence this separate ticket.
It's different in that the affected node consumes more and more memory and uses not just one CPU core, but all of them. I observed this once, a day ago, when memory usage went to 30+ GiB and brought the machine to its knees.
To Reproduce
The same as in the above linked ticket: filling up some tables.
Expected behavior
I would expect constant performance without any steep drops.
Additional data / screenshots
The ingest job logs its progress. Columns:
As you can see, when the phenomenon started, the throughput declined drastically.
Memory usage increasing:
The only anomaly I could find is this:
From the CPU graph you can see that something happening inside the process (with much less work being done in the kernel) is eating up all available resources (CRDB times are UTC, system graphs are UTC+2):
Saved debug zip and have a full goroutine dump at:
https://gist.github.com/bra-fsn/27848c0b2af2c6c90b9ba07fb1166746
According to a trace of the process (and to the graphs as well), it makes very few syscalls.
The top of the list (the trace ran for 10 seconds):
(number of occurrences, and the actual calls)
It basically does some chatting with its cluster buddies and very few writes to the log files (both in logs/ and
/data/crdb/store2/034254.log
-like files). So judging from this, it doesn't do any I/O-intensive tasks like compaction, merging, or anything similar. Also, I've stopped the SQL ingest and the process still eats all the CPUs. I'm leaving it in this state for a while to see what happens.
Environment: