Add limit on number of client connections per node #35428

Closed
ajwerner opened this issue Mar 5, 2019 · 12 comments
Labels
- A-admission-control
- A-cc-enablement: Pertains to current CC production issues or short-term projects
- A-server-architecture: Relates to the internal APIs and src org for server code
- C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
- T-server-and-security: DB Server & Security
- T-sql-foundations: SQL Foundations Team (formerly SQL Schema + SQL Sessions)
- X-server-triaged-202105

Comments

@ajwerner
Contributor

ajwerner commented Mar 5, 2019

No server can handle an infinite number of connections. While not a particularly sophisticated form of admission control, limiting the total number of client connections can help mitigate excess resource usage in the face of a storm of connections.

A limit controlled by a cluster setting with a conservative value would be a good starting point.

Epic: CRDB-7643
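For illustration only, here is a minimal sketch (not the actual CockroachDB implementation) of what such a per-node guardrail could look like at the accept path; `connGate` and `maxConns` are hypothetical names standing in for whatever cluster setting would back the limit:

```go
package main

import (
	"fmt"
	"net"
	"sync/atomic"
)

// connGate refuses new connections once the configured limit is reached.
// maxConns stands in for a cluster-setting-backed value; a negative value
// means "no limit".
type connGate struct {
	maxConns int64
	open     int64
}

func (g *connGate) tryAcquire() bool {
	for {
		cur := atomic.LoadInt64(&g.open)
		if g.maxConns >= 0 && cur >= g.maxConns {
			return false
		}
		if atomic.CompareAndSwapInt64(&g.open, cur, cur+1) {
			return true
		}
	}
}

func (g *connGate) release() { atomic.AddInt64(&g.open, -1) }

func serve(l net.Listener, g *connGate) error {
	for {
		conn, err := l.Accept()
		if err != nil {
			return err
		}
		if !g.tryAcquire() {
			// A real server would send a proper "too many connections" error
			// over the wire protocol before closing, so clients see a clear
			// refusal rather than an abrupt reset.
			conn.Close()
			continue
		}
		go func(c net.Conn) {
			defer g.release()
			defer c.Close()
			fmt.Fprintln(c, "hello") // placeholder for the real session loop
		}(conn)
	}
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	fmt.Println("listening on", ln.Addr())
	_ = serve(ln, &connGate{maxConns: 100})
}
```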

@ajwerner ajwerner changed the title Add limit on number of connections per node Add limit on number of clientconnections per node Mar 5, 2019
@ajwerner ajwerner changed the title Add limit on number of clientconnections per node Add limit on number of client connections per node Mar 5, 2019
@bladefist

I had some network issues and a rogue app flooded our nodes with connections, so this would be very useful in protecting the nodes for the common good.

@awoods187 awoods187 added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Mar 6, 2019
@lunevalex
Collaborator

cc: @knz since you are investigating rate limiting.

@rafiss
Collaborator

rafiss commented Aug 21, 2020

The experiment described in #25404 would be useful for determining how we should tune and document this setting.

@vy-ton
Contributor

vy-ton commented Jun 15, 2021

Can this be closed in favor of #54653?

@michae2
Collaborator

michae2 commented Jun 15, 2021

This would have helped with a recent performance incident on a 20.2.9 cluster. During the incident, a normally fast type of query took much longer than usual to execute (possibly due to some bad statistics). In spite of the increase in query execution time, the application continued to open new connections to every node at the same rate, leading to a large spike in open connections and queued distsql flows. After several hours two nodes OOM'd.

With a connection limit, the OOMs probably would not have happened. The original performance degradation would still have happened, but it would not have been exacerbated by a glut of open connections trying to execute more queries at the same time. The performance slowdown would have propagated up to the application in the form of "connection errors" rather than in the form of "hung queries", which I believe are easier to respond to.

Furthermore, many other databases offer this knob (PostgreSQL's max_connections, for example).

I'm not sure what happened with #51505 and #54745 but I hope we're still planning to add this.

@knz
Contributor

knz commented Jun 16, 2021

@vy-ton I think these are two unrelated issues.

  1. there should be a server-configurable maximum set by the SRE regardless of particular values configured by the customer in SQL
  2. we need a limit that's shared across HTTP and SQL connections
  3. we also need a mechanism that's clever and applies different limits for e.g. root connections.

Also from a process perspective, we usually keep the older issue open, and so I'd be in favor of closing #54653 as a duplicate of this one.

@jlinder jlinder added the T-server-and-security DB Server & Security label Jun 16, 2021
@rafiss
Collaborator

rafiss commented Jun 21, 2021

One thing to consider: we've done quite a bit of testing that revealed that open idle connections do not really take up many resources. The thing that really can take up resources is when the connections are running queries; a.k.a. active connections. @michae2 if I understand your summary correctly, the incident was caused by too many queries running at the same time, not just having the connections open; is that correct?

So IMO, the bigger win here would be to add a limit on the number of active connections. #54652 is an issue for this. #54785 (based on active queries) and #55173 (based on open transactions) were two approaches to implement it. So I advocate for bringing those ideas back.

Focusing on this issue alone wouldn't get at the more fundamental problem that caused that incident. It would also be a bit clunky to use correctly, IMHO, since the types of clusters that run into these issues are commonly used by dozens or hundreds of microservices, each of which has its own connection pool for connecting to the cluster. Most of the connections across all those pools will be idle, and we don't really need to limit them. If you set this "max open connections" setting too low, it will prevent new microservices from spinning up in a normal scale-out situation. If you set it higher, to allow a scale-out, then it might be a problem if all the idle connections suddenly become active.
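To illustrate the distinction being drawn here, a rough sketch (made-up names, not anything from the CockroachDB codebase) of a limit on active connections: idle connections are never counted, and a query that arrives while the cap is full fails fast rather than the connection being refused:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrTooManyActive is what a client could see when the active-query cap is
// hit; idle connections are never counted against this limit.
var ErrTooManyActive = errors.New("too many concurrently executing queries")

// activeLimiter caps how many connections may run a query at the same time.
// All names here are illustrative, not CockroachDB APIs.
type activeLimiter struct {
	slots chan struct{}
}

func newActiveLimiter(max int) *activeLimiter {
	return &activeLimiter{slots: make(chan struct{}, max)}
}

// runQuery executes fn only if an execution slot is free; otherwise it fails
// fast, so the connection stays open but this particular query is rejected.
func (l *activeLimiter) runQuery(fn func()) error {
	select {
	case l.slots <- struct{}{}:
		defer func() { <-l.slots }()
		fn()
		return nil
	default:
		return ErrTooManyActive
	}
}

func main() {
	lim := newActiveLimiter(2) // allow at most 2 concurrently executing queries
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			err := lim.runQuery(func() {
				time.Sleep(50 * time.Millisecond) // stand-in for query execution
			})
			fmt.Printf("conn %d: %v\n", id, err)
		}(i)
	}
	wg.Wait()
}
```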

@michae2
Collaborator

michae2 commented Jun 21, 2021

> One thing to consider: we've done quite a bit of testing that revealed that open idle connections do not really take up many resources. The thing that really can take up resources is when the connections are running queries; a.k.a. active connections. @michae2 if I understand your summary correctly, the incident was caused by too many queries running at the same time, not just having the connections open; is that correct?
>
> So IMO, the bigger win here would be to add a limit on the number of active connections. #54652 is an issue for this. #54785 (based on active queries) and #55173 (based on open transactions) were two approaches to implement it. So I advocate for bringing those ideas back.

A valid point. Thanks for the additional context. @yuzefovich also pointed out that in 21.1 there is the sql.distsql.max_running_flows limit, which would probably have helped here, too.

Judging by the comments on those PRs, I guess this is a well-trodden discussion with many differing opinions 🙂. I suppose an RFC would be the next step, to achieve some kind of consensus before proceeding.
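For context, sql.distsql.max_running_flows is an ordinary cluster setting, so tuning it looks like any other SET CLUSTER SETTING statement. A minimal Go client sketch, where the connection string and the value 250 are placeholders and the driver choice is incidental:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

func main() {
	// Placeholder DSN; point it at any node in the cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Lower the per-node cap on concurrently running distsql flows.
	// The default was 500 around 21.1; the right value is workload-dependent.
	if _, err := db.Exec(
		"SET CLUSTER SETTING sql.distsql.max_running_flows = 250",
	); err != nil {
		log.Fatal(err)
	}
}
```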

@knz
Contributor

knz commented Jun 22, 2021

One thing to consider if there is a limit on "active connections": it's going to be confusing for client connection pool managers to be able to open a connection but then be unable to send queries on it because some active-connection limit has been reached.

How do you suggest arranging the client protocol to avoid that?

@knz
Contributor

knz commented Jun 22, 2021

Have y'all seen Radu's distributed token pool RFC?

Can't we use a token pool for active SQL conns as well?
(A separate pool from the one used for KV ops.)
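Not the RFC's design, but as a rough sketch of the general shape: a weighted semaphore can serve as a token pool for active SQL connections, kept separate from any pool used for KV work; with a blocking Acquire, excess work queues (up to a deadline) rather than erroring out:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/semaphore"
)

func main() {
	// A token pool for active SQL connections, kept separate from any pool
	// used for KV operations. Three tokens here is an arbitrary choice.
	sqlTokens := semaphore.NewWeighted(3)

	var wg sync.WaitGroup
	for i := 0; i < 6; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// Acquire blocks, so excess work queues instead of failing
			// outright; a context deadline bounds how long a caller waits.
			ctx, cancel := context.WithTimeout(context.Background(), time.Second)
			defer cancel()
			if err := sqlTokens.Acquire(ctx, 1); err != nil {
				fmt.Printf("conn %d: gave up waiting: %v\n", id, err)
				return
			}
			defer sqlTokens.Release(1)
			time.Sleep(100 * time.Millisecond) // stand-in for query execution
			fmt.Printf("conn %d: done\n", id)
		}(i)
	}
	wg.Wait()
}
```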

@knz knz added A-server-architecture Relates to the internal APIs and src org for server code A-cc-enablement Pertains to current CC production issues or short-term projects labels Jul 29, 2021
@blathers-crl blathers-crl bot added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Jan 18, 2022
@rafiss
Collaborator

rafiss commented Jan 18, 2022

SQL Experience will pick this up in the coming months. We'll implement the initial, simple proposal of adding a per-node limit on the number of open connections (i.e., not trying to account for "active" connections). The open connection limit is a very crude guardrail, but it has less of a chance of causing unintended consequences if someone hits the limit. See the parent Epic for information about customers who are looking for this.

@rafiss rafiss assigned rafiss and ecwall and unassigned rafiss Jan 31, 2022
@ecwall
Contributor

ecwall commented Feb 22, 2022

Fixed by #76401

@ecwall ecwall closed this as completed Feb 22, 2022