cloudsql-proxy keeps active connections to DB instance even after client disconnects #147
We've lost a node in GKE and all the connections from the lost pods are still alive from the cloud-sql-proxy's point of view, so it keeps the server-side connections too. This quickly eats into the PostgreSQL instance's connection limit. |
Sounds like a bug - when the client-side cloud SQL proxy notices the local
disconnect it /should/ disconnect from the server, which should notice that
disconnect and tear down that full connection.
|
I am pretty sure that cloud SQL proxy never notices the disconnects. If the client is abnormally terminated, there is no TCP RST sent to the proxy, so it never notices the client connection went down. |
@Carrotman42 is there any traction on this issue? The higher the load, the more drastic the effect. Take a look at this screenshot: When this happens in the middle of the night, it may use up the entire active-connections quota and the SQL instance might go into failover mode, which will degrade the app's performance. Also, throwing CPUs at the SQL instance just to increase the active-connection count is not cost-effective :'( |
+1 to this as we have been experiencing the same issues with open Postgres proxy connections. Since Google does not expose a method for increasing connection pool sizes within CloudSQL, which is an annoying deficiency, it would be nice if this issue were fixed sooner rather than later. |
Hi ! Thanks, Yann |
@hfwang Could you take a look at this? |
We've got the same issue, to the tune of 130 connections on an instance with a 200-connection limit. Is a new release planned? |
Sorry for the long delay. The change looks good, I'm going to pull it into our internal repo, and let it soak in our test environment for the remainder of this week. Assuming all goes well, I'll merge it in. |
@hfwang is this in the active version? |
@hfwang We have been experiencing the same issue. We built the cloud_sql_proxy binary from source (latest from master with your fix), deployed this to a new Docker image (based on 1.11), then spun up new pods which use it. We are still experiencing the same issue. For more background and context, we have a Django app deployed with the CONN_MAX_AGE setting set to 30. The default behavior of closing connections immediately after use works without issue, but obviously there's some overhead associated with doing this (hence the reason why we'd like to reuse connections). Interestingly, this only seems to be an issue when using cloud_sql_proxy in GCP. When we set this up in a separate AWS deployment (using the 1.11 release) to point to a CloudSQL instance with the CONN_MAX_AGE setup described above, we see no issue whatsoever, and actually see the connections getting closed accordingly after 30 seconds. This seems to imply your fix is not sufficient or the issue lies elsewhere. Thoughts? |
Thanks for running a comparison! I'm curious for a bit more information: was the operating system exactly the same for GCP and AWS? What OS(es) were used?
|
We're running Debian Stretch across the board, with only some slight differences, listed below:
- On AWS, our Django app is not containerized and runs on a regular EC2 instance.
- On AWS, the cloud_sql_proxy binary is running on its own instance as well.
- On GCP, we're using a Kubernetes pod with cloud_sql_proxy running alongside as recommended.
|
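For readers comparing the two setups: "running alongside as recommended" refers to the sidecar pattern, roughly as sketched below. This is only an illustrative sketch; the image names, port, and instance connection string are placeholders rather than the reporter's actual manifests.

```yaml
# Illustrative sidecar layout: the app and the proxy share a pod and talk over localhost.
spec:
  containers:
  - name: app
    image: my-django-app:latest            # placeholder application image
    env:
    - name: DB_HOST
      value: "127.0.0.1"                   # the app connects to the local proxy, not to Cloud SQL directly
    - name: DB_PORT
      value: "5432"
  - name: cloudsql-proxy
    image: gcr.io/cloudsql-docker/gce-proxy:1.11
    command:
    - /cloud_sql_proxy
    - -instances=my-project:my-region:my-instance=tcp:5432   # placeholder connection name
```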
@Carrotman42 @hfwang Any new hunches or findings? |
We have the same problem, has anybody found a solution? |
Reopening issue since the fix doesn't appear to have resolved the issue. |
Hello, I'm trying to reproduce this problem. Would you mind giving me some detailed instructions on how to do it? |
No. cloudsqlproxy keeps connections open because it never knows that the client disappeared (e.g. the Kubernetes node that hosted the client crashed). |
We have 5 identical Kubernetes pods of a Spring Boot app connecting through cloudsql-proxy to a Postgres database in GCP. Each pod contains one container with the app and one container with cloudsql-proxy. We're using Hikari for connection pooling. Once every 6-8 days, our HikariCP logs report connection leaks on all the instances, approximately 10 ms apart:
Maybe this is an indication of something happening to all the proxy instances at the same time (perhaps a closed and reopened connection to Cloud SQL?), after which the connections are not released by cloudsql-proxy? |
@mariusstaicu - If the leak is being detected in Hikari, the leak is being caused by your application and not by cloud-sql-proxy. You can check out our page on Managing Database Connections and take care that connections are always closed, even when an error occurs. |
Usually, yes; however, how come it occurs on multiple instances at the same time? |
That would depend on the application, so I'm not able to answer. It could be some scheduled or recurring event that isn't freeing the connections. If Hikari is able to detect the leaks, it means the leak is client-side and the proxy is working as intended (it should hold connections to the database open as long as the client does). |
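For context on the leak detection being discussed, HikariCP's behaviour here is governed by pool settings along these lines. This is a minimal Spring Boot application.yml sketch with illustrative values, not the reporter's actual configuration; the property names follow Spring Boot's relaxed binding for HikariCP.

```yaml
# Illustrative HikariCP settings relevant to the leak warnings described above.
spring:
  datasource:
    url: jdbc:postgresql://127.0.0.1:5432/mydb   # placeholder; the app reaches Postgres via the local proxy
    hikari:
      maximum-pool-size: 10
      leak-detection-threshold: 30000   # ms a connection may be borrowed before Hikari logs a leak warning
      max-lifetime: 600000              # ms before a pooled connection is retired, even if idle
```

A leak warning fires on the client side when the application holds a borrowed connection past the threshold, which is why it points at application code rather than at the proxy.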
Just following up on this one. I had to use pgbouncer in front of cloudsql proxy because of this issue. Any ETA for this fix? |
@gustavovalverde We've been unable to replicate this at all since #161 was released. I've tried to replicate it on more than one occasion, but have been unable to do so. If you have steps that can replicate it somewhat consistently, we'd like to hear them. |
We found a way to reproduce this behaviour locally:
This will make the proxy keep its server-side connection open even though the client is gone. We encountered the problem described in this thread only with PREEMPTIBLE nodes in our k8s clusters. This makes perfect sense, as they might even get an ACPI G3 Mechanical Off (as described here) and thus might not be able to send a signal to the proxy in order to close the connection anymore. As shown in the example, we use |
Can we maybe have an ETA of when someone will be able to look at this issue? It wouldn't have been a major issue for us if there was an alternative version based on an image with a shell (#330, #371), which would allow us to configure a livenessProbe:

```yaml
livenessProbe:
  exec:
    command: ["nc", "-nz", "127.0.0.1", "3306"]
  initialDelaySeconds: 5
  timeoutSeconds: 3
```

But as it stands, this causes a system outage whenever our CloudSQL instance performs any maintenance that closes client connections. We're running v1.16. |
Hi @devillexio - this is currently on my plate, so it's my fault the ball was dropped here.
I actually did spend 2-3 days looking into this again a month or so ago, but neglected to follow up here. As it currently stands, I'm still unable to replicate this issue, even following the above instructions. As soon as I close the client connection, the proxy closes the corresponding server-side connection. I've tried using clients in a couple of different languages/frameworks, I have tried interrupting connections mid long-running statements, and every weird edge case I was able to come up with. On 2 separate occasions I've even sat down with customers reporting similar issues to try to debug, and both times it amounted to a misunderstanding of how their connection pool library was handling connections.

I currently suspect that this issue is indeed fixed, and that continued reports have been a misunderstanding. However, I intended to write a new e2e test to verify, but ended up falling down the rabbit hole on a couple of other efforts and haven't made time to circle back yet. That's my bad, and I'm sorry I haven't followed up sooner.
As a suitable workaround, you can pretty easily create your own Dockerfile where you can copy the proxy binary into a base image that includes a shell.
Tangentially, I wouldn't really recommend relying too heavily on a livenessProbe like this. This check verifies that the proxy is listening, but not necessarily that it's connecting successfully. It's usually more thorough to have a pod check that verifies your application is able to successfully reach your database, and restart the whole pod if necessary.
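As a rough illustration of that suggestion, the probe below targets the application container and assumes a hypothetical /healthz/db endpoint that actually opens a database connection through the proxy; the path, port, and thresholds are made up for the example, and a livenessProbe on its own only restarts that one container.

```yaml
# Hypothetical liveness probe on the application container (not on the proxy sidecar).
livenessProbe:
  httpGet:
    path: /healthz/db        # assumed app endpoint that checks DB connectivity end to end
    port: 8080               # assumed application port
  initialDelaySeconds: 30
  periodSeconds: 15
  failureThreshold: 4        # restart the container after roughly a minute of consecutive failures
```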
Could you explain why this is? This might be a different issue unrelated to this one - the current problem has been described as a cache of client-side connections. If the server terminates the connection, the Cloud SQL proxy would have to actively attempt to reconnect in order to tie up more server connection slots (which I don't think is the case). |
Hi @kurtisvg. Thank you for the prompt, honest and detailed response! 🙌 I fully appreciate that this is a difficult issue to reproduce consistently, and would happily share any information that would help in your investigation. I apologize for the length of this reply, but there's a lot of information to cover. What we have to go on is basically the symptoms of SQL connectivity issues we've experienced a number of times that always seem to coincide with our CloudSQL instance performing maintenance that results in active connections being closed, and new connection attempts timing out. During the latest incident, two days ago, we noticed the following:
During our investigation, we also discovered our CloudSQL reported 39
Naturally, we were immediately drawn to the highest number. Unsuccessful connection attempts can occur for the following reasons:
We also noticed that after the DB restart, the number of MySQL connections did not rise back up to the same number as before the restart. As mentioned, we've noticed a similar series of events for at least the last 3 times our CloudSQL instance performed maintenance. Given what was observed, we thought that a livenessProbe on the cloudsql-proxy sidecar container would be a suitable workaround. A number of failed attempts led us to this ticket and the discovery that our workaround needs a shell. Thank you for the suggestion of rolling our own Docker image, but I would honestly prefer using a more official one maintained by the community, if possible.
I agree completely that relying on this workaround isn't ideal. That being said, I'm not sure if the end result of the proposed suggestion would be significantly different from what we currently have, i.e. a database timeout error encountered by our application is considered fatal, which then restarts the application pod. I'm happy to provide any additional information that would help. For reference, our deployment details are as follows:
I'd also like to thank my colleagues @theclive and @AndreMocke91 for their help in the investigation. |
Hey @devillexio, Thanks for following up in such detail - it's super helpful when digging into issues like these. There is quite a bit to comment on, but I'll try to break it up into 3 main thoughts:
First, what you have described definitely sounds like two independent issues. The first one is that when your database undergoes maintenance, your application fails to reach your database until the proxy is restarted, while the second (the one this issue is describing) is when a client drops a connection, the proxy somehow keeps the corresponding connection to the Cloud SQL instance live. I would appreciate it if you would open a new issue, so that we can track the investigation into that separately.

Second, a drop in connections after a cluster-wide restart doesn't necessarily indicate that the Cloud SQL proxy is leaving connections dangling. Checking the documentation for node-mysql2, there is the following detail:
This is a pretty common pattern across connection pool libraries - while the API varies by implementation, there is almost always a "min" and a "max" that the connection pool count sits between. Especially when the pool is oversized, new connections will only be created if there isn't already a connection immediately available. Additionally, this driver doesn't appear to support a "max lifetime" argument, so there isn't an easy way (from the client's perspective) to limit how long a connection can sit in the pool, even if idle. As such, I'd totally expect there to be a slow creep upwards of connections.

This is why I think this problem has actually been fixed - #161 introduced a 1-minute TCP keepalive. The MySQL errors you are describing (

The best way we can verify this problem does still exist is a reproducible example of a client connecting to the proxy, dropping the connection, and the proxy continuing to keep the server-side connection alive for more than a minute. (This is why the example in this comment was interesting - it implied that perhaps the

Third, I don't think that a

Additionally, the pattern you've described (process exit on connection timeout) could potentially be causing a high number of

Before describing my recommendation in more detail, let me clarify some terms (apologies if you are already familiar, but I noticed what I thought might be inconsistencies, so let me clarify to be sure we are both on the same page):

From what you specified above, I get the impression that when your app exits, only the main container restarts. My suggestion would instead be to introduce a health check at the pod level, but based on the health of your application. There are likely many different ways to implement this, but you'll want to include some check that verifies the app is able to connect and authenticate to the database successfully. If this check fails 3-4 times in a row (or maybe X% over a specific timeframe), you should restart the entire pod (not just the container). If the issue is related to the proxy, this will give it a fresh start (and hopefully correct the problem without a manual redeploy).

Apologies for the novel, but hopefully it's all helpful for you. |
Thanks, @kurtisvg! This is really useful! Your suggested solution to have the check at the pod level instead of the container level should resolve our issue. I didn't even know this was possible. 😄 Also, I've logged a separate issue (#400) as you suggested for the instances where the application cannot reconnect to the database instance after maintenance. |
@kurtisvg I've spent some time scouring the k8s documentation to find a way to add a livenessProbe to the Pod, but it doesn't currently seem possible. It seems to be a feature request on a long-standing ticket: kubernetes/kubernetes#40908. Do you perhaps have an example config where you've configured a probe that results in the pod being restarted, instead of individual containers in the pod? |
@devillexio Yeah, I alluded above there might be a few different ways to do it. I'm not really an expert on k8s, so there might already be a better or more established pattern. My approach would probably be to have a sentinel file in a shared volume to tell the proxy when to restart. Assuming you are already sharing a volume between the two containers (mounted at /cloudsql here), you could run the proxy under a small wrapper script along these lines:

```sh
rm -f /cloudsql/restart_proxy        # remove any existing sentinel
./cloud_sql_proxy ... &              # start the proxy (flags omitted)
PID=$!
# Wait until either the sentinel file appears or the proxy exits on its own.
while [ ! -f /cloudsql/restart_proxy ] && kill -0 "$PID" 2>/dev/null; do
  sleep 1
done
kill "$PID" 2>/dev/null
wait "$PID"
exit $?
```

Then on the main container, you could add a preStop hook that touches the sentinel:

```yaml
spec:
  containers:
  - name: your-container-name
    image: your-image-name
    lifecycle:
      preStop:
        exec:
          command: ["touch", "/cloudsql/restart_proxy"]
```

Alternatively to |
Brilliant. Thanks, @kurtisvg. My bad for missing it the first time around. I'll definitely be giving this a try. |
Hey folks, I spent some time trying to add this as a permanent test run in our CI, but ran into some issues getting an accurate connection count from the instance side without any guarantee that different instances of the test aren't running in parallel. Instead, I have had this bash script running on an isolated instance for a couple of weeks straight now, but haven't been able to replicate this with v1.18.0 of the proxy:

As you can see by the image, the connections all terminate and don't linger. It's worth pointing out that at higher durations, the graph does some kind of averaging over the window that is somewhat deceiving, as the connections do in fact reach zero but it isn't reflected properly in the graph.

I'm going to go ahead and close this, and lock it since it was a popular issue (and popular issues have a habit of attracting similar but unrelated question seekers). If anyone feels that they've encountered this issue again in some way, shape, or form, please open a new issue and make sure to include detailed reproduction steps. Thanks! |
Hello, we've noticed that the proxy sometimes keeps active connections to the DB after the K8S workload requesting these connections died. Observed on PostgreSQL instances.
For example, we have a deployment which spawns 1 pod, and the application in this pod requests 10 connections. If we go to the CloudSQL dashboard for this instance, we'll see that there are 10 new active connections.
If we have an autoscaling event, the deployment spawns one more pod with 10 more connections. We can see that reflected on the dashboard. But after the application load decreases and the deployment scales back down from 2 pods to 1, we can still see 20 active connections to the DB. If we delete the cloudsql-proxy pod and its respective deployment spawns a new one, we will see that the application reconnects and there are now 10 normal active connections.
This does not happen consistently 100% of the time, but it is usually related to scaling events. This can create a problem when these zombie connections add up to a significant fraction of the maximum active connections allowed on that DB.