gossip: abnormal CPU usage growth with increasing node count #51838
Hello, I am Blathers. I am here to help you get the issue triaged. I have CC'd a few people who may be able to assist you:
If we have not gotten back to your issue within a few business days, you can try the following:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Thanks @Drahflow for filing this. Could you provide a bit more data from your experiments? What analysis did you carry out and how did you compute your conclusions? Looking at your experiments would inform (and accelerate) our own investigation. Thanks
CRDB does not maintain n^2 gossip connections; the per-node count should be logarithmic. The gossip information disseminated cluster-wide should be linear in the number of nodes in the system, so it's a bit surprising to observe what you're observing. I'd be curious to see this data/profile as well.
Cockroach version was 20.1.3. My experiments were as follows, always starting a fresh cluster, only …

- Setup 1: 16 instances of CockroachDB per …
- Setup 2: 8 instances of CockroachDB per …

Free memory was plenty during all runs (as reported per …). I clicked the "Profile (with labels)" link in the built-in UI on both occasions when the cluster became unstable (which, btw., is to say there were continuously some ranges unavailable and many underreplicated). In both cases the profile was >60% full of functions which had "Gossip" in their name, and the rest looked like it might be related to I/O or network scheduling, but I didn't look very closely. In both cases the jump from 128 to 256 nodes increased CPU consumption about 4-fold, which strongly suggests something is O(nodes^2). In any case, it's easy enough to replicate: try to bring up a 512-node cluster of 20.1.3 and marvel at the CPU consumption. If you only use one node per physical machine it could potentially even work, but even then the CPU profile should be perfectly clear.
Re-reading the above, one correction:

> …

is not correct. I set up a new cluster (incl. new machines) between experiment setup 1 and 2. However, within each setup I added more nodes to an existing cluster.
In preparation for a larger-scale load test run, today I confirmed the same problem on 256 t3.medium instances hosting one cockroach instance each (so it's not some artifact of shared IPs). I also checked the CPU profile more closely; the non-gossip-related functions are golang garbage collection.
I have looked at the code and I have found something "interesting":

```go
// maxPeers returns the maximum number of peers each gossip node
// may connect to. This is based on maxHops, which is a preset
// maximum for number of hops allowed before the gossip network
// will seek to "tighten" by creating new connections to distant
// nodes.
func maxPeers(nodeCount int) int {
	// ...
	maxPeers := int(math.Ceil(math.Exp(math.Log(float64(nodeCount)) / float64(maxHops-2))))
	if maxPeers < minPeers {
		return minPeers
	}
	return maxPeers
}
```

So on each node we will have F(n) = ceil(exp(log(n)/3)) connections to other nodes. So if we look at n nodes together, we have O(n * n) inter-node connections. So the behavior is indeed quadratic (in the worst case).
Sorry, I was wrong: the "/3" does matter; it turns the complexity into a cube root. So we have O(n^(1/3)) conns per node, and thus O(n^(4/3)) connections for the cluster as a whole. That's much better.
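For illustration, here is a minimal, runnable sketch of how the per-node peer count scales, using the maxPeers function quoted above. The maxHops and minPeers constants are assumptions for this example (chosen so that maxHops-2 matches the "/3" divisor seen in the quoted code), not values taken from this thread:

```go
package main

import (
	"fmt"
	"math"
)

// Assumed constants for this sketch only; the real values live in pkg/gossip.
const (
	maxHops  = 5 // assumption: matches the "/3" divisor (maxHops - 2) seen above
	minPeers = 3 // assumption: a small lower bound on peers per node
)

// maxPeers mirrors the function quoted above: roughly nodeCount^(1/(maxHops-2)).
func maxPeers(nodeCount int) int {
	peers := int(math.Ceil(math.Exp(math.Log(float64(nodeCount)) / float64(maxHops-2))))
	if peers < minPeers {
		return minPeers
	}
	return peers
}

func main() {
	// Per-node peer count grows as n^(1/3), so total connections grow as n^(4/3).
	for _, n := range []int{64, 128, 256, 512, 1024} {
		perNode := maxPeers(n)
		fmt.Printf("n=%4d  peers/node=%2d  total connections approx %5d\n", n, perNode, n*perNode)
	}
}
```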
Hi @Drahflow, thanks again for your investigation. Our current standpoint is that while the behavior you found is concerning (it seems to violate our design objectives), the CPU overhead you're measuring on 200 nodes is not yet high enough for us to consider this issue "urgent". We feel that users will be OK with 1-5% CPU overhead per CockroachDB node just for gossip, up to 500 nodes. Of course that does not mean we want to ignore the problem, and that is why we queued an investigation. We plan to look at this in the v21.1 timeframe. Again, thank you for your analysis.
For the record: I saw a (superficially similar) instability today on a 365-node cluster (mix of c5.9xlarge / m5.8xlarge) which was not CPU (nor RAM) starved. Is the gossip handling happening in only one or two goroutines (and thus limited to a single core or two)? The cluster initially appeared healthy; I started actual application load, whereupon most nodes became "suspect". I then stopped the application load, but >350 nodes remained in "suspect" status over the next 15 minutes (until I had to stop the experiment). Nodes occasionally moved to "healthy" for a short time, until again becoming "suspect". CPU utilization as reported in …
Which version were you running? How are you deploying the database (e.g. via k8s, systemd, etc.)? Can you look at the …
Version:
(with the understanding that further invocations will immediately terminate upon finding the running instance had locked the storage). The exact same setup (but on "only" 265 instances) worked 100% fine under more than 50 times the load.
I looked into this further today (via
(But maybe this is as expected.)
I just started up a fresh 256-node cluster on … Graphviz version: …
Thanks for checking. Is that more or less than in the previous release? (We're expecting about the same)
I didn't compare to the previous release; I've been using this release for the past month or two since I need some of the changes in it.
@dankinder, do you mind grabbing a profile for this running cluster, if still running? You could do it using the following (using the real address of course):
This includes a 5-second CPU profile, heap, and heap delta: Archive.zip
When I tried testing a large cluster (95 nodes) with a bit of load on it I already started seeing some bad behavior from gossip. Specifically, I'd see a small number of nodes occasionally reporting a large fraction of other nodes in the cluster as unlive (going by their …). After seeing this I decided to see what was going on.

The first thing that stood out is how ridiculously often tightenNetwork gets called. This is almost certainly worth fixing. We currently call it hundreds of times per second on a cluster of this size, even when nothing needs tightening. While we could probably go farther, a simple time-based check that ensured we only called it once a second (with an exception allowing it to run again sooner if the previous call actually did some work to tighten the network rather than deciding it wasn't needed) would go a long way. This change would be trivial and I'm not aware of any real downside.

The other more frustrating thing I found when poking around was just how much of the gossip network's capacity is eaten up by "gossip-clients:" infos. They get gossiped more than twice as frequently as liveness information:
The above logspy-gossip-client.txt came from collecting 2000 log lines from …

There's an old comment in gossip.go that mentions the gossip clients key has to be gossiped very frequently because "we want to detect gossip partitions faster than the node liveness timeout (9s)", but that looks like it was added by @petermattis in 18ce86b (#28963), and the main goal of that PR was to make it easier to debug gossip. It appears he also threw in the side goal of attempting to get rid of the sentinel key in the future, but it looks like that never actually happened -- the sentinel key still appears to be what we use to detect partitions/stalls. And I think it makes sense to keep it that way given how much cheaper it is to frequently propagate a single sentinel key than one "gossip-clients" key per node.

The "gossip-clients" keys look like they're only used for logging the gossip graph, which is definitely not worth sending one info per node every two seconds. It seems to me that we should be sending the gossip-clients keys much less frequently given their very limited use.

There are also some other minor inefficiencies around the key names for each gossip info being longer and more verbose than is really needed -- the infos themselves are protocol buffers, but the keys aren't, so we pay for each key name on every send. And even pickier than that, we re-gossip all the fields in the store descriptor every minute (or less) even though only some of them can actually change. But these are significantly more minor concerns -- it looks like …
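To make the time-based idea described above concrete, here is a minimal sketch of such a throttle. The package, type, and method names are hypothetical, and the one-second interval is the figure from the comment above; this is not the actual CockroachDB patch:

```go
package gossipthrottle // hypothetical package name for this sketch

import (
	"sync"
	"time"
)

// tightenThrottle rate-limits how often the (expensive, mutex-holding)
// network-tightening check may run. It allows at most one check per second,
// unless the previous check actually did some tightening work, in which case
// an immediate follow-up check is permitted.
type tightenThrottle struct {
	mu          sync.Mutex
	lastAttempt time.Time
	lastDidWork bool
}

// shouldCheck reports whether a tightening check should run now.
func (t *tightenThrottle) shouldCheck(now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.lastDidWork || now.Sub(t.lastAttempt) >= time.Second {
		t.lastAttempt = now
		return true
	}
	return false
}

// record notes whether the last check actually tightened the network.
func (t *tightenThrottle) record(didWork bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastDidWork = didWork
}
```

A caller would consult shouldCheck before taking the gossip mutex and call record afterward with whether any new connections were actually opened, so a productive tightening pass can be followed up immediately while no-op passes are limited to one per second.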
Touches cockroachdb#51838. See the thread there for more detail, but the gist is that tightenNetwork gets called hundreds of times per second (taking a full mutex lock each time) on large clusters, even when nothing needs tightening. Release note (performance improvement): Avoid wasteful contention on the gossip mutex caused by checking if the network needs tightening hundreds of times per second.
Dumb example fix for the …
Looking at it a bit more,
Splitting up the infos map would probably be a medium-sized change but I wouldn't expect it to have any big issues. I don't know, however, whether there's some reason those old keys are needed for decommissioned node IDs for the lifetime of the cluster. It does seem relevant that while the liveness keys for decommissioned nodes are actively re-gossiped regularly, it looks like the distsql keys haven't been gossiped since the corresponding node was shut down, and they're only still around because they were given a ttlStamp of MaxInt64.
I suppose that given that they're never re-gossiped once the nodes are removed, something like the new `crdb_internal.unsafe_clear_gossip_info` function (#85013) could be used to clean them up.
88472: gossip: Don't check if the network should be tightened so often r=kvoli,nvanbenschoten a=a-robinson

Touches #51838. See the thread there for more detail, but the gist is that tightenNetwork gets called hundreds of times per second (taking a full mutex lock each time) on large clusters, even when nothing needs tightening.

Release note (performance improvement): Avoid wasteful contention on the gossip mutex caused by checking if the network needs tightening hundreds of times per second.

88482: kv: remove ConditionalPutRequest.DeprecatedExpValue r=AlexTalks a=nvanbenschoten

This commit removes the ConditionalPutRequest.DeprecatedExpValue field, which was replaced by the ExpBytes introduced in 9e14f24. This has been possible since v21.1.

Release justification: None, wait. Release note: None.

88486: github: route admission prs to targeted github team r=irfansharif a=irfansharif

Release note: None

Co-authored-by: Alex Robinson <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
@kvoli @nvanbenschoten I'm also up for helping with a change that meaningfully changes the frequency of gossiping "gossip-clients" keys given that all they're used for is the periodic logging of the connectivity graph (and manual debugging of gossip issues I guess, although I'd be surprised if that comes up often). But I don't want to write up a change that you guys aren't interested in accepting/maintaining since that may be a bit more contentious than #88472. For example I personally think a cluster setting to control whether we gossip the gossip-clients keys at all would be reasonable (instead of just gossiping them less frequently), but I don't know what you guys think. Or if one of the other ideas is more enticing as something you'd want, let me know. But the gossip-clients keys appear to be the most egregious waste of cpu / mutex time.
Changing the frequency of
Which sounds good - a cluster setting to change the frequency or to stop gossiping it entirely? To be clear, if we stop gossiping it entirely then the only real consequences are:
I'm not sure whether I'm misunderstanding your comment or you missed the link to #85013, but that function already exists as of #85505. I tried it out and it worked well for clearing out old decommissioned nodes' "distsql-version" and "distsql-draining" keys.
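As an illustration of that cleanup, here is a rough sketch of clearing those leftover keys for a decommissioned node from a client program. The key format follows the "distsql-version:<node-id>" / "distsql-draining:<node-id>" pattern mentioned in this thread, but the builtin's exact signature, the connection string, and the node ID are assumptions for this sketch:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // any Postgres-compatible driver works against CockroachDB
)

// clearDecommissionedNodeKeys removes the leftover distsql gossip keys for one
// decommissioned node by calling the crdb_internal.unsafe_clear_gossip_info
// builtin. The single string-key argument is an assumption for this sketch.
func clearDecommissionedNodeKeys(db *sql.DB, nodeID int) error {
	for _, prefix := range []string{"distsql-version", "distsql-draining"} {
		key := fmt.Sprintf("%s:%d", prefix, nodeID)
		if _, err := db.Exec("SELECT crdb_internal.unsafe_clear_gossip_info($1)", key); err != nil {
			return fmt.Errorf("clearing %s: %w", key, err)
		}
	}
	return nil
}

func main() {
	// Connection string and node ID are placeholders.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := clearDecommissionedNodeKeys(db, 12); err != nil {
		log.Fatal(err)
	}
}
```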
It's fine if you aren't interested in a patch at the moment, but I thought I'd share some data from having completely ripped out the "gossip-clients" keys just to give a bit more evidence of how wasteful they are, because it's even bigger than I expected. I took v22.1.6, then applied two patches: #88472 and a-robinson@952c501.

Testing on that 95-node cluster that had been fairly bogged down by gossip, deploying the change made a huge difference in the …

Before (aggregated across all nodes in the cluster): …

After (aggregated across all nodes in the cluster): …

CPU usage per node dropped by nearly a full core (by about 90 CPU seconds/second on the 95-node cluster). The savings are also pretty dramatic at somewhat smaller sizes too, with a 75-node cluster saving 40-50 cores and having a before/after of:

Before: …

After: …

The numbers were a bit less dramatic on a completely fresh 75-node cluster that had no data in it, load on it, or history of decommissioned nodes, but it still cut out about two-thirds of the gossiped bytes/infos and saved about 10 cores.
That's a great improvement. I think it would be great to patch in. Thanks for taking a look into this further @a-robinson.
Sorry for not being totally sure what you mean, but by "it" do you mean a complete rip out like in a-robinson@952c501 and accepting the minor changes called out in #51838 (comment)? Or something else like adding a cluster setting that disables it? Personally I'm still in the camp of just ripping it out since I don't think I ever benefited from it even back when I worked on cockroach full time, but I don't have any visibility into how often it comes in handy for you guys these days.
yo @a-robinson, a-robinson@952c501 LGTM. These internal vtables are not that useful, definitely not worth the cost you're seeing. Some of them are being used in roachtests that have either already aged or could look at alternative sources instead. The connectivity graph has come in handy in some recent incidents, but again, I don't think it's worth the cost. If you want to polish up a-robinson@952c501 for a PR that SGTM. If there's roachtest fallout from it, we're happy to squash them ourselves. I don't think these changes are going to be backportable -- even if we tack on release notes, it seems unwise to rip out vtables in already published releases. So if you care about backporting it, I'd do so by introducing a cluster setting to stop gossiping the gossip-clients key, but left disabled by default. On master, and perhaps 22.2 (which will be out in a few weeks), we can enable the cluster setting. Does that sound all right to you? Also happy to take over backport duty to 22.2 since it's not out yet.
These aged roachtests could also just explicitly enable this cluster setting if we want to keep them (== roachtests) around, but that's all just unnecessary hedging -- this feels fine to rip out altogether on master.
Hey @irfansharif , thanks for clarifying. I'm not particularly interested in doing the work to add a cluster setting and backport it, but given that it sounds like you're cool with it I can submit a slightly cleaned up version of a-robinson@952c501 to master sometime within the next week. Waiting until v23.1 for it to be in a mainline release is fine -- I can carry it as a patch in the meantime. Although if large cluster performance/overhead/stability becomes a priority for you at all before then, I would think the cluster setting is something you may want sooner.
FWIW if you really are interested in keeping it, it still seems totally plausible to construct it when needed, either by:
Ack, can cluster-settings-ify, etc. as needed. Thanks for all the digging, this area continues to be as unloved as it was way back when, so this is all much appreciated.
89613: gossip: remove frequent gossiping of gossip client connections r=irfansharif a=a-robinson

These gossip-clients keys make up two thirds or more of the gossip traffic in various large clusters I've inspected, consuming almost an entire CPU core in the worst case I've seen. They don't provide enough value to justify that sort of ongoing cost, so this commit removes them entirely as well as the periodic logging of the gossip network and the crdb_internal.gossip_network table, both of which relied on them.

Release note (backward-incompatible change): We've stopped supporting/populating the crdb_internal.gossip_network table. It was an internal table with no API guarantees (so perhaps not meriting a release note?).

Release note (performance improvement): Significantly reduced CPU usage of the underlying gossip network in large clusters.

---

Informs #51838 (largely fixes it for practical purposes, although there's likely still more that could be done).

This is clearly going to break [the gossip roachtest](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/tests/gossip.go#L50-L58), but between `@irfansharif` kindly volunteering to fix that up separately and his existing TODO in that file I've left that out of this change.

I don't know if completely removing the gossip_network table is really the best idea or if it should just be left in and only populated with the clients from the local node. For example, when run in a mixed version cluster does `debug zip` run all of its sql commands against the local node or does it run some against remote nodes? If an old node ever tries to query the `gossip_network` table on a different node it could have a bad time. `@irfansharif` `@kvoli`

93985: ui: degrade gracefully when regions aren't known r=matthewtodd a=matthewtodd

Part of #89949

Previously, when a tenant SQL instance had spun down (leaving us no way to remember which region it had been in), the SQL Activity pages would claim that statements and transactions had occurred in an "undefined" region. This change moves from saying "undefined" to saying nothing at all, a slightly nicer user experience. This broader problem of losing the region mapping has been described in #93268; we'll begin addressing it shortly.

Release note: None

93989: leaktest: exclude long running logging goroutines r=srosenberg a=nicktrav

The `leaktest` package detects potential goroutine leaks by snapshotting the set of goroutines running when `leaktest.AfterTest(t)` is called, returning a closure, and comparing the set of goroutines when the closure is called (typically `defer`'d). A race condition was uncovered in #93849 whereby logging-related goroutines that are scheduled by an `init` function in `pkg/util/logging` can sometimes be spawned _after_ the `AfterTest` function is run. When the test completes and the closure is run, the test fails due to a difference in the before / after goroutine snapshots. This mode of failure is deemed to be a false positive. The intention of the logging goroutines is that they live for the duration of the process.
However, exactly _when_ the goroutines scheduled in the `init` functions actually start running, and hence show up in the goroutine snapshots, is non-deterministic. Exclude the logging goroutines from the `leaktest` checks to reduce the flakiness of tests.

Closes #93849.

Release note: None. Epic: CRDB-20293

Co-authored-by: Alex Robinson <[email protected]>
Co-authored-by: Matthew Todd <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
One minor point about the gossip network and growth relative to cluster size is the "amount of gossiped data".
However, note that from a node's perspective, item 2 above means that as the cluster grows the gossip burden on each individual node also grows linearly, and because of item 1 above it is actually slightly faster than linear growth.
117011: gossip: Include highWaterStamps map in gossip debug info r=shralex a=a-robinson

This can be useful information when debugging gossip-related problems (e.g. to determine how much of the data being sent around the gossip network is directly accounted for by the highWaterStamps) and it's very easy to include.

Epic: none

Release note (ops change): The gossip status advanced debug page now includes information about the server's high water stamps for every other node it knows about in the gossip cluster.

---

In my particular case, it was valuable when looking into why the amount of data being gossiped was so large (as measured via the `gossip_(bytes|infos)_(sent|received)` prometheus metrics). Given a large cluster where many nodes are decommissioned over time, you can end up with an ever-accumulating amount of gossip from old nodes -- "distsql-draining:<node-id>" and "distsql-version:<node-id>". These keys can stick around forever (as I've previously called out on #51838), and if you don't manually clear them (using `crdb_internal.unsafe_clear_gossip_info()`) then they can have a larger effect on gossip than I'd have expected, because they cause the old decommissioned node IDs to be kept around in every node's highWaterStamps map, and that highWaterStamps map gets copied into [*every* gossip Request and Response](https://github.com/cockroachdb/cockroach/blob/master/pkg/gossip/gossip.proto#L29-L66), which can add up to a ton of extra data getting shipped around if you have a lot of decommissioned nodes.

I have a bunch more thoughts on inefficiencies in the gossip network and how it scales at larger cluster sizes and/or when there are a lot of decommissioned nodes, but don't know how much interest there is in them. If y'all are interested, let me know and I'm happy to share some notes and ideas.

Co-authored-by: Alex Robinson <[email protected]>
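As a back-of-the-envelope illustration of that last point, the sketch below estimates the extra traffic contributed by carrying a highWaterStamps entry for every node ID ever seen in every gossip message. The bytes-per-entry and messages-per-second figures are assumptions for this example, not measurements from the thread:

```go
package main

import "fmt"

// highWaterStampOverhead estimates the extra gossip bytes per node per second
// caused by including a highWaterStamps entry for every known node ID (live or
// decommissioned) in every gossip Request/Response.
// bytesPerEntry and msgsPerSec are assumed, illustrative values.
func highWaterStampOverhead(liveNodes, decommissionedNodes, msgsPerSec, bytesPerEntry int) int {
	return msgsPerSec * (liveNodes + decommissionedNodes) * bytesPerEntry
}

func main() {
	const bytesPerEntry = 16 // assumption: node ID + timestamp + proto framing
	const msgsPerSec = 50    // assumption: gossip requests+responses per node per second
	for _, dead := range []int{0, 100, 500, 1000} {
		fmt.Printf("live=100 decommissioned=%4d  approx %7d bytes/sec/node just for stamps\n",
			dead, highWaterStampOverhead(100, dead, msgsPerSec, bytesPerEntry))
	}
}
```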
Is your feature request related to a problem? Please describe.
I am trying to estimate maximum reachable OLTP performance for a client of mine. To my frustration I was not able to scale a CockroachDB cluster to significantly more than 256 nodes, due to high CPU load when adding more nodes (most of which is taken up by gossip-protocol related functions according to profiling). My measurements suggest that the work done for gossiping on each node scales quadratically in the number of nodes, which puts an upper limit on the maximum cluster size at about 1000 nodes.
Describe the solution you'd like
The gossip protocol should only perform linear work in the number of nodes.
Describe alternatives you've considered
The gossip protocol intervals could be made configurable, so that larger clusters could be achieved by trading away DDL and node-failure detection speed. However, this would only increase the maximum cluster size by a small factor before the quadratic growth pushed failure detection times too high.
Jira issue: CRDB-4006