Constantly Reestablishing Connections in Cluster Mode #1912
Thanks for the detailed issue 👍 We recently received #1910, but it looks like a different problem.
This does not look like a go-redis log. We don't use a redis_writer prefix anywhere, so it would be a good idea to figure out what is happening there.
What is the difference between restarting a service and restarting a client?
I would add a log here to figure out why connections are constantly closed:

func (p *ConnPool) Remove(ctx context.Context, cn *Conn, reason error) {
    // Suggested temporary debug log: record why the pool is discarding this connection.
    log.Printf("closing bad conn: %s", reason)
    p.removeConnWithLock(cn)
    p.freeTurn()
    _ = p.closeConn(cn)
}
No, but the problem is not the creation of new connections; it is the closing of existing connections.
I would check Read/Write timeouts - anything less than
@vmihailenco Thanks for the quick response!
They're essentially the same thing. However, something about having all clients shut down at the same time sometimes gets us back into a good state, whereas doing a rolling restart did not.
@vmihailenco Another question - are there any other places we can add logs to see when connections are being closed for reasons that aren't
You can also duplicate the log message at https://github.com/go-redis/redis/blob/master/redis.go#L265.
PubSub bypasses the pool completely - see https://github.com/go-redis/redis/blob/master/pubsub.go#L145 and https://github.com/go-redis/redis/blob/master/redis.go#L651-L654
Is it worth also logging in the following block?

if isBadConn(err, false) {
    c.connPool.Remove(ctx, cn, err)
} else {
    c.connPool.Put(ctx, cn)
}

I'm not positive what that case is responsible for.
We aren't using any PubSub features, so that should eliminate a variable
@vmihailenco This behavior fortunately started again on one of our cluster nodes this weekend! The added log I see is giving the following reason for constantly closing connections:
Knowing this, what other information would it be helpful for me to gather, given that the client is in some kind of context deadline loop? The only spot I can trace up from this connection pool code to the cluster client is in pipelined commands (both regular pipelines and

Is it possible that somehow, if the Redis cluster briefly gets sad for some reason, the 100ms context we pass in for some of our operations is also applied to the creation of a new connection when one is dropped, which causes the connection creation to fail while the cluster is sad, rinse and repeat?
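As a purely illustrative aside (this sketch is not from the thread; the endpoint, key, and command are made up), the hypothesis above would look roughly like this in go-redis v8, where the same short per-operation context that bounds the pipeline is also passed down when the pool has to dial a replacement connection for that call:

package main

import (
    "context"
    "time"

    "github.com/go-redis/redis/v8"
)

func main() {
    rdb := redis.NewClusterClient(&redis.ClusterOptions{
        // Hypothetical ElastiCache configuration endpoint.
        Addrs: []string{"clustercfg.example.cache.amazonaws.com:6379"},
        // DialTimeout bounds connection establishment on its own, but the
        // per-call ctx below is also handed to the dialer while the pool
        // creates a new connection for this operation.
        DialTimeout: 5 * time.Second,
    })
    defer rdb.Close()

    // 100ms budget for the whole operation, matching the issue description.
    ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
    defer cancel()

    // If the pooled connection backing this pipeline was just removed, go-redis
    // must dial a replacement before sending the commands. On a struggling node
    // that can exceed the 100ms deadline, the fresh connection is then discarded
    // as bad, and the next call repeats the cycle.
    pipe := rdb.Pipeline()
    pipe.SRandMemberN(ctx, "pending", 10)
    if _, err := pipe.Exec(ctx); err != nil {
        // context.DeadlineExceeded here is what would feed the "closing bad conn" log.
        _ = err
    }
}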
I'm open to the possibility that increasing our timeout is the move here, but if my hypothesis is true it still seems bad that this can happen in the first place. Also, it seems like this could still happen for any timeout value, albeit perhaps 10x less frequently if the timeout is increased from 100ms to 1 second?
Around the time the latest event started, we see a smaller number of I/O timeout logs starting as well. Hard to know which is the chicken and which is the egg, though. Maybe this is helpful (i/o timeouts vs. context deadline exceeded, logarithmic scale)? Those specifically look like:
and are present both for
Try upgrading to v8.11.4. Maybe it will help. But overall go-redis does the right thing by closing connections on context deadlines and/or timeouts, because such connections can/will receive unsolicited responses. Nothing to improve here.
That makes a lot of sense 😄 Are there any other parameters on the Redis client we can tune, other than our per-transaction context deadline, that could help mitigate reconnect storms like this? We're currently using a fork of 8.11.4
No, go-redis should work with default settings. See https://redis.uptrace.dev/guide/performance.html for some general information.
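For anyone tuning anyway, here is a hedged sketch of where the knobs asked about above live in go-redis v8. The field names exist on redis.ClusterOptions, but the values are placeholders for illustration (not recommendations from this thread), and the endpoint is hypothetical:

package main

import (
    "time"

    "github.com/go-redis/redis/v8"
)

func main() {
    rdb := redis.NewClusterClient(&redis.ClusterOptions{
        Addrs:        []string{"clustercfg.example.cache.amazonaws.com:6379"}, // hypothetical endpoint
        DialTimeout:  5 * time.Second, // bound on establishing a new connection
        ReadTimeout:  3 * time.Second, // per-command socket read deadline
        WriteTimeout: 3 * time.Second, // per-command socket write deadline
        PoolSize:     20,              // per-node pool size (the issue reports a computed default of 20 on 4 cores)
        MinIdleConns: 2,               // keep a few warm connections per node
        PoolTimeout:  4 * time.Second, // how long to wait for a free connection from the pool
    })
    defer rdb.Close()
}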
Hi @enjmusic, I am wondering if you managed to get to the bottom of this issue. There are multiple things we have in common here: running a Redis cluster on ElastiCache, Redis CPU stuck at 100% due to tens of thousands of connections despite a very small pool, and only being able to recover by stopping all workers. We are also using context timeouts and pipeline commands. @vmihailenco what is a reasonably high context timeout? We have it at 1s and ReadTimeout at 750ms, but we do not want to go any lower than this. We are running

Any advice would be greatly appreciated
Hi @pedrocunha! Thanks for asking, it's good to know we're not the only ones. We were not able to resolve this issue despite a variety of mitigation measures, including (not all at once):
We are seeing the issue a lot less often after
but I don't think it's a permanent fix, just a consequence of there being fewer potential resources to overload an individual ElastiCache Redis cluster node. We still have nothing to blame for when this

TL;DR: we're still very much stuck investigating this issue before moving more traffic onto this system, because we still don't know why it is happening.
Thanks for the reply! You listed a few things that we wanted to try too, but now I feel less encouraged that we will get to the bottom of it. A couple of questions:
We are seeing this issue on a
Similar deal here:
We encountered the same problem. Many new connections are suddenly created at an unpredictable point in time. According to the error log, I/O timeouts are occurring, so go-redis creates many new connections, which pushes the Redis server load too high. We still haven't been able to find out why this happens, because the timing of these problems is completely random and the system is nowhere near its load bottleneck.
Glad we aren't alone on this @klakekent. This has been an extremely frustrating issue for us. We implemented a rate limiter, and it doesn't fully mitigate the issue. We just upgraded to Redis 6.2 this morning and are hoping to see if it helps.
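The thread doesn't show the rate limiter that was tried, but a client-side limiter of that general shape might look like the following sketch, which uses golang.org/x/time/rate to cap how many Redis calls a worker issues per second. The helper, key name, endpoint, and limits are all hypothetical:

package main

import (
    "context"
    "time"

    "github.com/go-redis/redis/v8"
    "golang.org/x/time/rate"
)

// limitedSRandMember is a hypothetical helper: it blocks until the limiter
// grants a token, then issues the command under its own short deadline.
func limitedSRandMember(ctx context.Context, rdb *redis.ClusterClient, lim *rate.Limiter) ([]string, error) {
    if err := lim.Wait(ctx); err != nil { // respects ctx cancellation
        return nil, err
    }
    opCtx, cancel := context.WithTimeout(ctx, time.Second)
    defer cancel()
    return rdb.SRandMemberN(opCtx, "pending", 10).Result()
}

func main() {
    rdb := redis.NewClusterClient(&redis.ClusterOptions{
        Addrs: []string{"clustercfg.example.cache.amazonaws.com:6379"}, // hypothetical endpoint
    })
    defer rdb.Close()

    // Allow at most 200 operations/second with small bursts; values are illustrative.
    lim := rate.NewLimiter(rate.Limit(200), 20)
    _, _ = limitedSRandMember(context.Background(), rdb, lim)
}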
@pedrocunha I also contacted AWS support about this issue, and they told me they found nothing wrong with ElastiCache or EC2, so I think the Redis client side is where we should check what happened. @vmihailenco
It sounds like there are three of us all experiencing the same issue now. @vmihailenco would it be possible to reopen this issue for further investigation? When we contacted AWS support, similar to @klakekent, we were also told that they found absolutely nothing wrong on their end. The issue also seems to trigger so quickly (within a sub-30-second interval) that most attempts to measure things like relative command durations/failures haven't been fruitful in chasing down what's going on.
@enjmusic It is best to open a new issue with a summary of what is happening and what has been tried so far.
@vmihailenco @klakekent @pedrocunha I've opened a new issue for this problem #2046 |
Expected Behavior
Creating a cluster client using pretty much default settings should not overwhelm Redis with a constant barrage of new connections.
Current Behavior
Occasionally, at times completely unrelated to system load/traffic, we are seeing connections being constantly re-established to one of the cluster nodes in our Redis cluster. We are using ElastiCache Redis in cluster mode with TLS enabled, and there seems to be no trigger we can find for this behavior. We also do not see any relevant logs in our service's systemd output in journalctl, other than a log message that seems more like a symptom of an overloaded Redis cluster node than a cause.
When this issue happens, running CLIENT LIST on the affected Redis node shows age=0 or age=1 for all connections every time, which reinforces that connections are being dropped constantly for some reason. New connections plummet on the other shards in the Redis cluster and are all concentrated on one.

New Connections (CloudWatch)
Current Connections (CloudWatch)
In the example Cloudwatch graphs above we can also see that the issue can move between Redis cluster shards. As you can see, we're currently running with a 4-shard cluster, where each shard has 1 replica.
Restarting our service does not fix this problem; to recover we basically need to do a hard reset (completely stop the clients for a while, then start them up again).
We've reached out to AWS support and they have found no issues with our ElastiCache Redis cluster on their end. Additionally, there are no ElastiCache events happening at the time this issue is triggered.
Possible Solution
In this issue I'm mainly hoping to get insight into how I could better troubleshoot this problem and/or whether there are additional client options we can use to mitigate this worst-case scenario (e.g. rate limiting the creation of new connections in the cluster client) in the absence of a root-cause fix.

My main questions are:

- How could we troubleshoot this more effectively, in the view of the go-redis experts here?
- Are there options we can set on ClusterClient to keep things from getting too out of control if this does continue to occur?

Steps to Reproduce
The description of our environment/service implementation below, as well as the snippet of our NewClusterClient call at the beginning of this issue, provides a fairly complete summary of how we're using both go-redis and ElastiCache Redis. We've not been able to consistently trigger this issue, since it often happens when we're not load testing, and we are mainly looking for answers to some of our questions above.

Context (Environment)
We're running a service that has a simple algorithm for claiming work from a Redis set, doing something with it, and then cleaning it up from Redis. In a nutshell, the algorithm is as follows (a code sketch of this loop appears at the end of this section):

- SRANDMEMBER pending 10 - grab up to 10 random items from the pool of available work
- ZADD in_progress <current_timestamp> <grabbed_item> for each of the items we got in the previous step
- If any of the items we tried to ZADD have been claimed by some other instance of the service, skip them
- SREM pending <grabbed_item> for each item we successfully claimed
- ZREMRANGEBYSCORE in_progress -inf <5_seconds_ago> so that claimed items aren't claimed forever

Currently we run this algorithm on 6 EC2 instances, each running one service. Since each instance has 4 CPU cores, go-redis is calculating a max connection pool size of 20 for our ClusterClient. Each service has 20 goroutines performing this algorithm, and each goroutine sleeps 10ms between invocations of the algorithm.

At a steady state with no load on the system (just a handful of heartbeat jobs being added to pending every minute) we see a maximum of ~8% EngineCPUUtilization on each Redis shard, and 1-5 new connections/minute. Overall, pretty relaxed. When this issue has triggered recently, it's happened from this steady state, not during load tests.

Our service is running on EC2 instances running Ubuntu 18.04 (Bionic), and we have tried both github.com/go-redis/redis/v8 v8.0.0 and github.com/go-redis/redis/v8 v8.11.2 - both have run into this issue.

As mentioned earlier, we're currently running with a 4-shard ElastiCache Redis cluster with TLS enabled, where each shard has 1 replica.
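For concreteness, here is a minimal sketch of the claim loop described above, assuming go-redis v8. The key names come from the description; the use of ZAddNX for the claim/skip step, the helper structure, the endpoint, and the error handling are assumptions for illustration only:

package main

import (
    "context"
    "strconv"
    "time"

    "github.com/go-redis/redis/v8"
)

// claimOnce is a sketch of one pass of the algorithm described above.
func claimOnce(ctx context.Context, rdb *redis.ClusterClient) error {
    // 1. Grab up to 10 random items from the pool of available work.
    items, err := rdb.SRandMemberN(ctx, "pending", 10).Result()
    if err != nil {
        return err
    }

    now := float64(time.Now().Unix())
    for _, item := range items {
        // 2. Try to claim the item; ZAddNX reports 0 if some other instance
        //    already put it into in_progress.
        added, err := rdb.ZAddNX(ctx, "in_progress", &redis.Z{Score: now, Member: item}).Result()
        if err != nil {
            return err
        }
        if added == 0 {
            continue // 3. Claimed by someone else: skip it.
        }

        // ... do the actual work for item here ...

        // 4. Remove the claimed item from the pending pool.
        if err := rdb.SRem(ctx, "pending", item).Err(); err != nil {
            return err
        }
    }

    // 5. Expire stale claims so items aren't stuck in in_progress forever.
    cutoff := strconv.FormatInt(time.Now().Add(-5*time.Second).Unix(), 10)
    return rdb.ZRemRangeByScore(ctx, "in_progress", "-inf", cutoff).Err()
}

func main() {
    rdb := redis.NewClusterClient(&redis.ClusterOptions{
        Addrs: []string{"clustercfg.example.cache.amazonaws.com:6379"}, // hypothetical endpoint
    })
    defer rdb.Close()

    for {
        ctx, cancel := context.WithTimeout(context.Background(), time.Second)
        _ = claimOnce(ctx, rdb) // errors would be logged in a real worker
        cancel()
        time.Sleep(10 * time.Millisecond) // matches the 10ms sleep in the description
    }
}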
Detailed Description
N/A
Possible Implementation
N/A