Add limit on number of client connections per node #35428
Comments
I had some network issues and a rogue app flooded our nodes with connections, so this would be very useful in protecting the nodes for the common good.
cc: @knz since you are investigating rate limiting.
The experiment described in #25404 would be useful for determining how we should tune and document this setting.
Can this be closed in favor of #54653?
This would have helped with a recent performance incident on a 20.2.9 cluster. During the incident, a normally fast type of query took much longer than usual to execute (possibly due to some bad statistics). In spite of the increase in query execution time, the application continued to open new connections to every node at the same rate, leading to a large spike in open connections and queued distsql flows. After several hours two nodes OOM'd.

With a connection limit, the OOMs probably would not have happened. The original performance degradation would still have happened, but it would not have been exacerbated by a glut of open connections trying to execute more queries at the same time. The performance slowdown would have propagated up to the application in the form of "connection errors" rather than in the form of "hung queries", which I believe are easier to respond to.

Furthermore, many other databases offer this knob, for example:

I'm not sure what happened with #51505 and #54745, but I hope we're still planning to add this.
@vy-ton I think these are two unrelated issues.
Also, from a process perspective, we usually keep the older issue open, so I'd be in favor of closing #54653 as a duplicate of this one.
One thing to consider: we've done quite a bit of testing that revealed that open idle connections do not really take up many resources. What really can take up resources is connections that are running queries, a.k.a. active connections. @michae2 if I understand your summary correctly, the incident was caused by too many queries running at the same time, not just by having the connections open; is that correct?

So IMO, the bigger win here would be to add a limit on the number of active connections. #54652 is an issue for this. #54785 (based on active queries) and #55173 (based on open transactions) were two approaches to implementing it. So I advocate for bringing those ideas back. Focusing on this issue alone wouldn't get at the more fundamental problem that caused that incident.

It would also be a bit clunky to use correctly IMHO, since the types of clusters that run into these issues are commonly used by dozens or hundreds of microservices, each of which has a connection pool for connecting to the cluster. Most of the connections in all those pools will be idle, and we don't really need to limit them. If you set this "max open connections" setting too low, it will prevent new microservices from spinning up in a normal scale-out situation. If you set it higher to allow a scale-out, then it might be a problem if all the idle connections suddenly become active.
A valid point. Thanks for the additional context. @yuzefovich also pointed out that in 21.1 there is the

Judging by the comments on those PRs, I guess this is a well-trodden discussion with many differing opinions 🙂. I suppose an RFC would be the next step, to achieve some kind of consensus before proceeding.
One thing to consider if there is a limit on "active connections": it's going to be confusing for client connection pool managers to be able to open a connection but then be unable to send queries on it because some active connection limit has been reached. How would you suggest arranging the client protocol to avoid that?
Have y'all seen Radu's distributed token pool RFC? Can't we use a token pool for active SQL conns as well?
SQL Experience will pick this up in the coming months. We'll implement the initial, simple proposal of adding a per-node limit on the number of open connections (i.e., not trying to account for "active" connections). The open connection limit is a very crude guardrail, but it has less of a chance of causing unintended consequences if someone hits the limit. See the parent Epic for information about customers who are looking for this.
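For illustration, a per-node open connection limit could look roughly like the Go sketch below. This is not the code from the eventual fix; `ConnLimiter`, `ErrTooManyConnections`, and the "negative means unlimited" convention are all assumptions made for the example. The limit can be changed at runtime, standing in for a cluster setting.

```go
package main

import (
	"errors"
	"sync"
)

// ErrTooManyConnections is a hypothetical error returned to a client
// whose connection attempt would exceed the node's limit.
var ErrTooManyConnections = errors.New("too many client connections")

// ConnLimiter counts open client connections on a single node.
type ConnLimiter struct {
	mu    sync.Mutex
	limit int // assumed convention: a negative value means unlimited
	open  int
}

// NewConnLimiter returns a limiter with the given starting limit.
func NewConnLimiter(limit int) *ConnLimiter {
	return &ConnLimiter{limit: limit}
}

// SetLimit changes the limit at runtime. Connections that are already
// open are left alone; only new connection attempts are affected.
func (l *ConnLimiter) SetLimit(limit int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.limit = limit
}

// Acquire reserves a slot for a new connection, or rejects it if the
// node is already at its limit.
func (l *ConnLimiter) Acquire() error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.limit >= 0 && l.open >= l.limit {
		return ErrTooManyConnections
	}
	l.open++
	return nil
}

// Release frees the slot when a connection closes.
func (l *ConnLimiter) Release() {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.open > 0 {
		l.open--
	}
}

// Open reports the current number of open connections.
func (l *ConnLimiter) Open() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.open
}
```

A server's accept loop would call `Acquire` before completing the handshake and `Release` when the connection closes. Idle and active connections count the same here, which is what makes this a crude guardrail.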
Fixed by #76401
No server can handle an infinite number of connections. While it is not a particularly sophisticated form of admission control, limiting the total number of client connections can help mitigate excess resource usage in the face of a storm of connections.
A limit controlled by a cluster setting with a conservative value would be a good starting point.
Epic: CRDB-7643