-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Constantly Reestablishing Connections to AWS ElastiCache Redis in Cluster Mode (Continued) #2046
Comments
@enjmusic thanks for creating the isssue.
Do you see any logs? I guess not which may indicate that the idle conns reaper is indeed the problem. You could try to disable it completely to confirm that. Or add some logging here and here if you really need |
@vmihailenco if I remember correctly, we were seeing a mixture of I/O timeout and context deadline exceeded logs when this issue was triggered. We did completely disable idle connection reaping for a while (both the reaper goroutine and the
but again, it's difficult to believe that we've completely addressed the issue or just made it exceedingly unlikely at our current scale. I'd be interested to know if we can get any more insights from @klakekent and @pedrocunha, assuming they're still seeing the issue periodically. |
Well, it may be an indication that you need a larger server / better network. There is no evidence it is a bug in go-redis. |
Our web servers running
With these points in mind, I think it's very reasonable to consider them enough evidence to continue taking a closer look at |
According to your log report But according to your reply:
You did not encounter this problem in non-redis cluster mode, so what is your context timeout in non-redis cluster mode? Is an ACL operation required( skip initConn: if c.opt.Password == "" && c.opt.DB == 0 && !c.opt.readOnly && c.opt.OnConnect == nil {
// skip initConn
} |
In addition, you may need to consider network issues (aws vpc), or the time spent processing commands by redis-cluster(> context.Timeout). |
So, we have a theory about what might be going on. AWS ElastiCache Redis best practices dictate that one should use the "Cluster Configuration Endpoint" for cluster discovery, which we took to mean that it was ok to use that as our one seed address for the
which we took to mean that it would balance between all nodes. However, there's likely some DNS caching going on. When I We may be able to get around this issue by performing a We're probably going to implement this safety measure no matter what, but my main question after looking into the Configuration Endpoint (and its use as the only seed address for a |
There seems to be no evidence of a DNS problem, in the logs you see "context deadline exceeded", we should determine where more time is being consumed causing the context to time out. We were unable to reproduce this issue in a test environment, you can add more logs to go-redis to locate where time is consuming: |
@enjmusic Have you seen this post https://medium.com/tinder-engineering/taming-elasticache-with-auto-discovery-at-scale-dc5e7c4c9ad0 ? Tinder found a bug in their Java client failing to dynamically reconfigure which IP it was connecting to following AWS Elasticache cluster maintenance events. |
We finally saw this issue again during the night yesterday, but it affected eight different nodes, which we've never seen before. Previously it was always one. This may or may not disprove the cluster configuration endpoint issue. Again, temporarily stopping new ingress of jobs into Redis resolved the issue. |
@dzqv just read the blog post! It's an interesting study but doesn't look like our issue where connection floods occur to some nodes independent of any cluster maintenance events |
This time we did see the issue happen within a couple minutes of a
... curious about whether or not that could put clients into a bad state |
We still don't know what caused this but a couple of things:
|
Just wanna add that we are also seeing this issue. We were able to greatly mitigate the issue by passing empty context to go-redis, instead of the context with deadline/timeout. A few other params we also tuned to mitigate this issue:
The changes we made are helpful to reduce the symptom, but it's still happening from time to time. @pedrocunha |
Today we saw this issue recur despite our Redis cluster being on
Not sure if that means the nodes are actually running 6.2 or just something to do with which commands work. That said, we now have a circuit breaker implemented in our writers that stops writes when errors are detected for given keys, which has helped with autoremediation. |
We had this new logging in place on our |
Actually, our nodes are |
This article should explain why that helps. |
Are there other good reasons to allow a context to be passed to Redis commands at all in |
But, perhaps, we should add an option to ignore |
I tried to lower the pool size from 1300 to 800, and go-redis seems to work stable for about 2 weeks. |
We had one occurrence of this issue with r6g.xlarge and 6.2. Seems that tweaking the limiter to be more aggressive and reducing the pool size has helped us. We haven't had issues for nearly 3 weeks now. We haven't got rid of context deadline yet but it looks like a sensible thing to do, however in our case we would want these to be extremely small (< 200ms) as we rather still return something over waiting and clogging all the backend requests. @enjmusic
I understand wanting more than 1 vCPU but large has 2 vCPUs. Is there anything in particular of 2 vs many more vCPU in this case?
👍 on the latter |
After experiencing issues like this we have a theory that when Go Redis is trying to get a connection from its pool, and the context expires before it gets one, it doesn't handle this situation quite right. Possibly half creating a new connection and immediately giving up. |
I got this issue again yesterday, but this time i found the error message was "redis: connection pool timeout",and tried to check the metrics in ElastiCache console and find at that the new connection suddenly went to 1.6k, so seems only to decrease the pool size not fix this issue. |
Anything new ? we started to experience this too on non managed redis cluster, random times. |
Hi, guys! I was able to reproduce the issue. EnvRedis 6.2.5 Steps to reproduce:
package main
import (
"context"
"fmt"
"sync"
"time"
"github.com/go-redis/redis/v8"
)
func main() {
rc := redis.NewRing(&redis.RingOptions{
Addrs: map[string]string{"0": "localhost:6379"},
DB: 0,
MaxRetries: -1,
DialTimeout: 1 * time.Second,
ReadTimeout: 50 * time.Millisecond, // ReadTimeout should be lower then context timeout
PoolSize: 30,
MinIdleConns: 30,
IdleTimeout: -1,
IdleCheckFrequency: -1,
})
wg := sync.WaitGroup{}
for i := 0; i < 1000; i++ {
wg.Add(1)
go func(no int) {
defer wg.Done()
ctx := context.Background()
// comment out next 2 lines of code to resolve timeout problem
ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
defer cancel()
err := rc.Get(ctx, "foo").Err()
fmt.Printf("%d: %s\n", no, err)
}(i)
}
wg.Wait()
} I also reproduced "use of closed network connection" problem with Redis pipeline: package main
import (
"context"
"fmt"
"sync"
"time"
"github.com/go-redis/redis/v8"
)
func main() {
rc := redis.NewRing(&redis.RingOptions{
Addrs: map[string]string{"0": "localhost:6379"},
DB: 0,
MaxRetries: -1,
DialTimeout: 1 * time.Second,
ReadTimeout: 35 * time.Millisecond, // ReadTimeout should be lower then context timeout
PoolSize: 50,
MinIdleConns: 50,
IdleTimeout: -1,
IdleCheckFrequency: -1,
})
wg := sync.WaitGroup{}
for i := 0; i < 1000; i++ {
wg.Add(1)
go func(no int) {
defer wg.Done()
ctx := context.Background()
// comment out next 2 lines of code to resolve timeout problem
ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
defer cancel()
pipe := rc.Pipeline()
pipe.Get(ctx, "foo1")
pipe.Get(ctx, "foo2")
pipe.Get(ctx, "foo3")
_, err := pipe.Exec(ctx)
fmt.Printf("%d: %s\n", no, err)
}(i)
}
wg.Wait()
} In the second example error with message "use of closed network connection" happens only with pipeline. It seams that passing the context without timeout (as proposed here) works As you can see ReadTimeout is lower than context timeout. When I do not set context timeout and use only ReadTimeout it works fine, but as far as I set context timeout I face timeouts and reuse of closed connection (in case of pipeline). @yoni-veksler-px
I also think that connections should not be closed on timeouts.
Yes, it may be true. @vmihailenco can you recheck my examples? |
Are there any updates here? We're experiencing exactly the same issues others have discussed here. We get spiking new connections that cause the Redis CPU to spike on at least one node. From there, everything goes to hell for a few minutes until the cluster recovers on its own. We have tried tuning a lot of our config options and are currently running with a pool size of 64, but when we start up our service, we see a spike in connections that makes no sense given our settings. Immediately on start up, we see new connections spike to 10x what we'd expect when doing the math of pods X pool size. This, in turn, causes issues as kubernetes rolls out more pods throughout the day as our traffic increases. I updated our calls to |
For those who face the issue and perform primarily pipelined operations I would recommend giving github.com/joomcode/redispipe a shot which gives you a lot of freedom with no connection pool to worry about:
|
we are also experiencing a similar issue. We use Cluster client with no context timeouts. Apparently the cause is the increasing amount of new connections which takes elasticache CPU super high |
We were having this issue with non-managed Redis Cluster. We had no timeouts/deadlines on Context, but we had lower read/write timeouts set on It mostly all made sense after reading this article: https://uptrace.dev/blog/golang-context-timeout.html |
How did that blog post clear up anything? A large portion of us — me included — aren't using context timeouts, yet we still see context deadline exceeded errors because the redis client is faulty. |
I'm 100% sure this is the redis client. We're using memorydb, and whenever we see this exact issue we see new connections spin on our cluster: We're on V8. We'll upgrade to V9, but people's hypothesis about spinning and recreating connections seems correct. There's an insidious bug that has runaway problems, and is exacerbated by Redis going to 100% CPU from this client churn. |
We solved it by switching to https://github.com/rueian/rueidis |
The go-redis library has some concerning issues with connection churn, as described in redis/go-redis#2046. Hinge and other companies have moved to rueidis as a fix. Instead of debugging the go-redis library, we move natively to rueidis for the redis state package.
* Change client library for Redis The go-redis library has some concerning issues with connection churn, as described in redis/go-redis#2046. Hinge and other companies have moved to rueidis as a fix. Instead of debugging the go-redis library, we move natively to rueidis for the redis state package. * Update queue client * Update queue tests * Ensure we don't set count on hscans * Update vendor * Ignore existing leases on concurrency svc removal * Logging updates
We've also done the same thing. Thank you. |
We did the same thing, migrating to rueidis didn't have any more problems. |
@vmihailenco I can see this boolean added in the V9 in the redis Options struct : Is this added to resolve the above issue? |
We ran into similar issues with Memorystore (Google's version of Redis). This was happening even with I think that perhaps retrieving a connection from the connection pool ignores the However, we also just switched to redis/rueidis and it seems to have completely alleviated the issue, as well as reducing CPU usage on both our clients and Memorystore instances. |
(See previous issue for additional context and discussion)
Expected Behavior
A
ClusterClient
with bare-minimum settings for connecting to an AWS ElastiCache Redis cluster (from here on referred to as ECRedis) will, at no point, contribute to a scenario in which it is constantly reestablishing connections to an ECRedis cluster node.Example
ClusterClient
config:Current Behavior
(The Current Behavior section in the original issue is still accurate to this issue)
Occasionally, we see connections being constantly re-established to one of our ECRedis cluster nodes at the limit of how many new connections are possible (~15k/minute is the reported rate). Redis nodes are essentially single-threaded and negotiating TLS for new connections takes up 100% of this node's CPU, preventing it from doing any other work. The time at which this issue occurs seems random, and we cannot correlate it to:
When this issue happens, running
CLIENT LIST
on the affected Redis node showsage=0
orage=1
for all connections every time, which reinforces that connections are being dropped constantly for some reason. New connections plummet on other shards in the Redis cluster, and are all concentrated on one.After discussion on the previous issue we are no longer the only ones experiencing this unknown problem, and it is blocking us from further relying on this service we've been building (and delaying full use of) for well over 6 months now.
Possible Solution
Redis
ClusterClient
should react more gracefully to that quickly devolves into constantly reestablishing connections to some node in the AWS ElastiCache Redis cluster.After the previous issue, we have tried a variety of approaches to mitigate this problem, none of which have solved it entirely:
context.Context
s passed to Redis commands to 1 second or greaterMULTI
/EXEC
pipelinesWe are seeing the issue a lot less often after
but are not confident that these changes constitute a permanent fix and are instead a consequence of there being fewer potential resources to overload an individual ECRedis cluster node.
Details About Fuzzing Idle Connection Cleanup
We opted to set
IdleTimeout
andIdleCheckFrequency
on theredis.ClusterOptions
passed to our client to-1
and added a customNewClient
implementation:Resetting the System to a Good State
When this issue happens, we are generally able to reset it to a good state by telling producers to stop producing new work to the system, which in turn means our consumer goroutines running the algorithm detailed later in this issue stop seeing new work items and their algorithm essentially just becomes
SRANDMEMBER pending 10
-> sleep 10ms -> repeat. We can then resume producer writes and the system goes back into a good state without connections being constantly reestablished. This is an improvement from our solution in the previous issue: completely stopping all services and then turning them back on after some period of time. However, it's not a tenable solution to this issue if we are to continue onboarding more work to the system in production.Steps to Reproduce
The description of our environment/service implementation below, as well as the snippet of our NewClusterClient call at the beginning of this issue, provide a fairly complete summary of how we're using both
go-redis
and ECRedis. We've not been able to consistently trigger this issue since it often happens when we're not load testing. I'd be interested to know if the other commenters on the original issue, @klakekent and @pedrocunha, have had any luck reproducing the issue. It sounds like it's been difficult to reproduce for them as well.Context (Environment)
We're running a service that has a simple algorithm for claiming work from a Redis set, doing something (unrelated to Redis) with it, and then cleaning it up from Redis. In a nutshell, the algorithm is as follows:
SRANDMEMBER pending 10
- grab up to 10 random items from the pool of available workZADD in_progress <current_timestamp> <grabbed_item>
for each of our items we got in the previous stepZADD
have been claimed by some other instance of the service, skip themSREM pending <grabbed_item>
ZREMRANGEBYSCORE in_progress -inf <5_seconds_ago>
so that claimed items aren't claimed forever, but items can only be claimed exactly once during a 5 second windowWork item producers simply
SADD pending <work_item>
but the load they produce is miniscule compared to the consumers running the algorithm outlined above.Currently we run this algorithm on 22 EC2 instances, each running one service. We've configured our
ClusterClient
to use aPoolSize
of10
(reduced from the default of20
on these instances) for each childClient
it creates. Each service has 25 goroutines performing this algorithm, and each goroutine sleeps 10ms between each invocation of the algorithm.At a steady state with no load on the system (just a handful of heartbeat jobs being added to pending every minute) we see a maximum of ~8% EngineCPUUtilization on each Redis shard, and 1-5 new connections/minute. Overall, pretty relaxed. When this issue has triggered recently, it's happened from this steady state, not during load tests.
Our service is running on EC2 instances running Ubuntu 18.04 (Bionic), and we have tried using
github.com/go-redis/redis/v8 v8.0.0
,github.com/go-redis/redis/v8 v8.11.2
, andgithub.com/go-redis/redis/v8 v8.11.4
- all 3 have run into this issue. We are now using a fork ofv8.11.4
(see added commits here) with the following changes (some suggested in the previous issue, some thought of by us):ClusterClient.process
method for logging/metrics around various anomalies during processing commands (MOVED
/ASK
responses, errors getting nodes, node failures). We do not see these while the issue is happening.We're currently running with a 24-shard ElastiCache Redis cluster with TLS enabled, where each shard is a primary/replica pair of
cache.r6g.large
instances.Detailed Description
N/A
Possible Implementation
N/A
The text was updated successfully, but these errors were encountered: