
Constantly Reestablishing Connections in Cluster Mode #1912

Closed
ianjhoffman opened this issue Sep 28, 2021 · 22 comments

Comments

@ianjhoffman

Expected Behavior

Creating a cluster client using pretty much default settings should not overwhelm Redis with a constant barrage of new connections.

redis.NewClusterClient(&redis.ClusterOptions{
    Addrs: []string{redisAddr},
    TLSConfig: &tls.Config{},
})

Current Behavior

Occasionally, at times completely unrelated to system load/traffic, we are seeing connections being constantly re-established to one of the cluster nodes in our Redis cluster. We are using ElastiCache Redis in cluster mode with TLS enabled, and there seems to be no trigger we can find for this behavior. We also do not see any relevant logs in our service's systemd output in journalctl, other than

redis_writer:85 {}        Error with write attempt: context deadline exceeded

which seems more like a symptom of an overloaded Redis cluster node rather than a cause.

When this issue happens, running CLIENT LIST on the affected Redis node shows age=0 or age=1 for all connections every time, which reinforces that connections are being dropped constantly for some reason. New connections plummet on other shards in the Redis cluster, and are all concentrated on one.

New Connections (CloudWatch): [NewConnections graph]

Current Connections (CloudWatch): [CurrConnections graph]

In the example Cloudwatch graphs above we can also see that the issue can move between Redis cluster shards. As you can see, we're currently running with a 4-shard cluster, where each shard has 1 replica.

Restarting our service does not resolve this problem; to recover we basically need to do a hard reset (completely stop the clients for a while, then start them up again).

We've reached out to AWS support and they have found no issues with our ElastiCache Redis cluster on their end. Additionally, there are no ElastiCache events happening at the time this issue is triggered.

Possible Solution

In this issue I'm mainly hoping to get insight into how I could better troubleshoot this behavior, and/or whether there are additional client options we can use to try to mitigate this worst-case scenario (i.e. rate limiting the creation of new connections in the cluster client) in the absence of a root-cause fix.

My main questions are:

  1. Is there a way for me to gather more data that would be helpful for the Redis/go-redis experts here?
  2. Is there a way for us to rate-limit the creation of new connections in the ClusterClient to keep things from getting too out of control if this does continue to occur?
  3. Has anyone else encountered a similar issue with Cluster mode, whether or not it was with ElastiCache Redis?

Steps to Reproduce

The description of our environment/service implementation below, as well as the snippet of our NewClusterClient call at the beginning of this issue, provides a fairly complete summary of how we're using both go-redis and ElastiCache Redis. We've not been able to consistently trigger this issue, since it often happens when we're not load testing, and we are mainly looking for answers to the questions above.

Context (Environment)

We're running a service that has a simple algorithm for claiming work from a Redis set, doing something with it, and then cleaning it up from Redis. In a nutshell, the algorithm is as follows (a rough Go sketch appears after the list):

  • SRANDMEMBER pending 10 - grab up to 10 random items from the pool of available work
  • ZADD in_progress <current_timestamp> <grabbed_item> for each of our items we got in the previous step
  • Any work items we weren't able to ZADD have been claimed by some other instance of the service, skip them
  • Once we're done with a work item, SREM pending <grabbed_item>
  • Periodically ZREMRANGEBYSCORE in_progress -inf <5_seconds_ago> so that claimed items aren't claimed forever
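
For illustration, the claim loop looks roughly like this (a simplified sketch, not our exact code; claimLoop and handle are placeholder names, and client is the ClusterClient from above):

func claimLoop(ctx context.Context, client *redis.ClusterClient, handle func(string)) {
	for {
		// SRANDMEMBER pending 10: grab up to 10 random items of available work.
		items, err := client.SRandMemberN(ctx, "pending", 10).Result()
		if err != nil {
			log.Printf("claim error: %v", err)
		}
		for _, item := range items {
			// ZADD NX: if the member was already added, another instance claimed it.
			added, err := client.ZAddNX(ctx, "in_progress", &redis.Z{
				Score:  float64(time.Now().Unix()),
				Member: item,
			}).Result()
			if err != nil || added == 0 {
				continue // claimed by some other instance (or transient error), skip
			}
			handle(item)
			// Done with the work item, clean it out of the pending set.
			client.SRem(ctx, "pending", item)
		}
		// Each goroutine sleeps 10ms between invocations of the algorithm.
		time.Sleep(10 * time.Millisecond)
	}
}

// Periodically, so claimed items aren't claimed forever:
// client.ZRemRangeByScore(ctx, "in_progress", "-inf", <unix timestamp of 5 seconds ago, as a string>)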

Currently we run this algorithm on 6 EC2 instances, each running one service. Since each instance has 4 CPU cores, go-redis is calculating a max connection pool size of 20 for our ClusterClient. Each service has 20 goroutines performing this algorithm, and each goroutine sleeps 10ms between each invocation of the algorithm.

At a steady state with no load on the system (just a handful of heartbeat jobs being added to pending every minute) we see a maximum of ~8% EngineCPUUtilization on each Redis shard, and 1-5 new connections/minute. Overall, pretty relaxed. When this issue has triggered recently, it's happened from this steady state, not during load tests.

Our service is running on EC2 instances running Ubuntu 18.04 (Bionic), and we have tried using github.com/go-redis/redis/v8 v8.0.0 and github.com/go-redis/redis/v8 v8.11.2 - both have run into this issue.

As mentioned earlier, we're currently running with a 4-shard ElastiCache Redis cluster with TLS enabled, where each shard has 1 replica.

Detailed Description

N/A

Possible Implementation

N/A

@vmihailenco
Collaborator

Thanks for the detailed issue 👍 We recently received #1910, but it looks like a different problem.

redis_writer:85 {} Error with write attempt: context deadline exceeded

This does not look like a go-redis log. We don't use redis_writer prefix anywhere, so it could be a good idea to figure out what happens here. context deadline exceeded suggests a timeout which can cause the behavior you mentioned.

Restarting our service does not address this problem, and to address it we basically need to do a hard reset (completely stop the clients for a while, then start them up again).

What is the difference between restarting a service and restarting a client?

Is there a way for me to gather more data that would be helpful for the Redis/go-redis experts here?

I would add a log here to figure out why connections are constantly closed:

func (p *ConnPool) Remove(ctx context.Context, cn *Conn, reason error) {
	+log.Printf("closing bad conn: %s", reason)
	p.removeConnWithLock(cn)
	p.freeTurn()
	_ = p.closeConn(cn)
}

Is there a way for us to rate-limit the creation of new connections in the ClusterClient to keep things from getting too out of control if this does continue to occur?

No, but the problem is not the creation of new connections; it is the closing of existing connections.

Has anyone else encountered a similar issue with Cluster mode, whether or not it was with ElastiCache Redis?

I would check Read/Write timeouts - anything less than time.Second is not very practical. Also context.Context deadline should be reasonably high.
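
For example, something along these lines (an illustrative sketch only; the exact values depend on the workload):

rdb := redis.NewClusterClient(&redis.ClusterOptions{
	Addrs:     []string{redisAddr},
	TLSConfig: &tls.Config{},

	DialTimeout:  5 * time.Second,
	ReadTimeout:  time.Second,
	WriteTimeout: time.Second,
})

// and give each operation a context deadline comfortably above those timeouts:
ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
defer cancel()
err := rdb.Set(ctx, "key", "value", 0).Err()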

@ianjhoffman
Author

ianjhoffman commented Sep 29, 2021

@vmihailenco Thanks for the quick response!

I would add a log here to figure out why connections are constantly closed

Is there a viable hook or other way for us to inject the log you've recommended without having to duplicate the go-redis library and modify it? It's not entirely out of the question for us, but generally the builds we promote happen on a remote build server and start from a clean state, so we would have to copy-paste the whole go-redis library and no longer reference it directly via Go modules to do this. Edit: I totally forgot we can just use a replace statement pointing at a fork of this library. Doh!
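
For anyone following along, the go.mod change is just something like this (the path on the right is a hypothetical local checkout of the fork; import paths stay github.com/go-redis/redis/v8):

// go.mod
replace github.com/go-redis/redis/v8 => ../go-redis-fork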

This does not look like a go-redis log.

I would check Read/Write timeouts - anything less than time.Second is not very practical. Also context.Context deadline should be reasonably high.

redis_writer is one of our files, and that log is essentially just what we print if the context.Context we pass into the SADD to pending on the writer side times out. Each individual attempt has a timeout of 100ms, which sounds like it may be too low from what you're saying. We wanted to set an aggressively low timeout on this system, but perhaps we were too brazen in our choice. Any other timeouts on the Redis client itself are the default values, since we aren't overriding anything other than Addrs and TLSConfig.
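
Concretely, each write attempt is roughly this (a simplified sketch; everything except the log message is a placeholder):

ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
defer cancel()
if err := client.SAdd(ctx, "pending", item).Err(); err != nil {
	// this is the redis_writer log from the issue description
	log.Printf("Error with write attempt: %v", err)
}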

What is the difference between restarting a service and restarting a client?

They're essentially the same thing. However, something about having all clients shut down at the same time sometimes gets us back into a good state, whereas doing a rolling restart did not.

@ianjhoffman
Author

@vmihailenco Another question - are there any other places we can add logs to see when connections are being closed for reasons that aren't bad conns? I deployed versions of our services with a fork that adds your recommended log line. Although we haven't been able to reproduce the issue with a constant stream of new connections yet, we also don't see the closing bad conn log at all, despite seeing a small number of new connections without a corresponding increase in total connections.

[Graph: a small number of new connections without a corresponding increase in total connections]

@vmihailenco
Collaborator

You can also duplicate the log message at https://github.com/go-redis/redis/blob/master/redis.go#L265.

despite seeing small amount of new connections without a corresponding increase in total connections

PubSub bypasses the pool completely - see https://github.com/go-redis/redis/blob/master/pubsub.go#L145 and https://github.com/go-redis/redis/blob/master/redis.go#L651-L654

@ianjhoffman
Author

ianjhoffman commented Oct 1, 2021

Is it worth also logging in the else clause there?

	if isBadConn(err, false) {
		c.connPool.Remove(ctx, cn, err)
	} else {
		c.connPool.Put(ctx, cn)
	}

I'm not positive what that case is responsible for.

PubSub bypasses the pool completely

We aren't using any PubSub features, so that should eliminate a variable.

@vmihailenco
Collaborator

Put returns healthy connections to the pool.

@ianjhoffman
Author

ianjhoffman commented Oct 4, 2021

@vmihailenco This behavior fortunately started again on one of our cluster nodes this weekend! The added log gives the following reason for constantly closing connections:

Oct 04 11:49:10 <hostname_redacted> <service_redacted>[25105]: 2021/10/04 11:49:10 closing bad conn: context deadline exceeded
Oct 04 11:49:10 <hostname_redacted> <service_redacted>[25105]: 2021/10/04 11:49:10 closing bad conn: context deadline exceeded
Oct 04 11:49:10 <hostname_redacted> <service_redacted>[25105]: 2021/10/04 11:49:10 closing bad conn: context deadline exceeded
Oct 04 11:49:10 <hostname_redacted> <service_redacted>[25105]: 2021/10/04 11:49:10 closing bad conn: context deadline exceeded
Oct 04 11:49:10 <hostname_redacted> <service_redacted>[25105]: 2021/10/04 11:49:10 closing bad conn: context deadline exceeded
Oct 04 11:49:10 <hostname_redacted> <service_redacted>[25105]: 2021/10/04 11:49:10 closing bad conn: context deadline exceeded
Oct 04 11:49:10 <hostname_redacted> <service_redacted>[25105]: 2021/10/04 11:49:10 closing bad conn: context deadline exceeded
Oct 04 11:49:10 <hostname_redacted> <service_redacted>[25105]: 2021/10/04 11:49:10 closing bad conn: context deadline exceeded
Oct 04 11:49:10 <hostname_redacted> <service_redacted>[25105]: 2021/10/04 11:49:10 closing bad conn: context deadline exceeded
Oct 04 11:49:10 <hostname_redacted> <service_redacted>[25105]: 2021/10/04 11:49:10 closing bad conn: context deadline exceeded
Oct 04 11:49:10 <hostname_redacted> <service_redacted>[25105]: 2021/10/04 11:49:10 closing bad conn: context deadline exceeded
Oct 04 11:49:10 <hostname_redacted> <service_redacted>[25105]: 2021/10/04 11:49:10 closing bad conn: context deadline exceeded
Oct 04 11:49:10 <hostname_redacted> <service_redacted>[25105]: 2021/10/04 11:49:10 closing bad conn: context deadline exceeded
Oct 04 11:49:10 <hostname_redacted> <service_redacted>[25105]: 2021/10/04 11:49:10 closing bad conn: context deadline exceeded
Oct 04 11:49:10 <hostname_redacted> <service_redacted>[25105]: 2021/10/04 11:49:10 closing bad conn: context deadline exceeded
Oct 04 11:49:10 <hostname_redacted> <service_redacted>[25105]: 2021/10/04 11:49:10 closing bad conn: context deadline exceeded
Oct 04 11:49:10 <hostname_redacted> <service_redacted>[25105]: 2021/10/04 11:49:10 closing bad conn: context deadline exceeded
Oct 04 11:49:10 <hostname_redacted> <service_redacted>[25105]: 2021/10/04 11:49:10 closing bad conn: context deadline exceeded
Oct 04 11:49:10 <hostname_redacted> <service_redacted>[25105]: 2021/10/04 11:49:10 closing bad conn: context deadline exceeded
Oct 04 11:49:11 <hostname_redacted> <service_redacted>[25105]: 2021/10/04 11:49:11 closing bad conn: context deadline exceeded

Knowing this, what other information would it be helpful for me to gather given that the client is in some kind of context deadline loop?

The only spot where I can trace up from this connection pool code to the cluster client is in pipelined commands (both regular pipelines and MULTIs); I'm still trying to figure out where this ctx is coming from, though.

Is it possible that, if the Redis cluster briefly gets sad for some reason, the 100ms context we pass in for some of our operations is also applied to the creation of a new connection when one is dropped, which causes the connection creation to fail while the cluster is sad, rinse and repeat?

@ianjhoffman
Author

ianjhoffman commented Oct 4, 2021

I'm open to the possibility that increasing our timeout is the move here, but if my hypothesis is true it still seems bad that this can happen in the first place. Also, it seems like this could still happen for any timeout value, albeit maybe 10x less frequently when the timeout is increased from 100ms to 1 second?

@ianjhoffman
Author

Around the time the latest event started, we also see a smaller number of i/o timeout logs beginning. Hard to know which is the chicken and which is the egg, though. Maybe this is helpful (i/o timeouts vs. context deadline exceeded, logarithmic scale)?

[Graph: i/o timeout vs. context deadline exceeded log counts, logarithmic scale]

Those specifically look like:

2021/10/02 15:21:46 closing bad conn: read tcp <ip_redacted>:42792-><ip_redacted>:6379: i/o timeout

and appear for both read tcp and write tcp.

@vmihailenco
Collaborator

Try upgrading to v8.11.4. Maybe it will help.

But overall go-redis does the right thing by closing connections on context deadlines and/or timeouts, because such connections can/will receive unsolicited responses. Nothing to improve here.

@ianjhoffman
Author

That makes a lot of sense 😄

Are there any other parameters on the Redis client we can tune, other than our per-transaction context deadline, that could help mitigate reconnect storms like this? We're currently using a fork of v8.11.4.

@vmihailenco
Collaborator

No, go-redis should work with default settings. See https://redis.uptrace.dev/guide/performance.html for some general information.

@pedrocunha

Hi @enjmusic, I am wondering if you managed to get to the bottom of this issue. There are multiple things we have in common here: running Redis cluster on ElastiCache, Redis CPU stuck at 100% due to tens of thousands of connections despite a very small pool, and only being able to recover by stopping all workers. We are also using context timeouts and pipeline commands.

@vmihailenco what is a reasonably high context deadline? We have this at 1s and ReadTimeout at 750ms, but we do not want to go any lower than this.

We are running v8.11.3

Any advice would be greatly appreciated
Thank you

@ianjhoffman
Author

ianjhoffman commented Feb 21, 2022

Hi @pedrocunha ! Thanks for asking, it's good to know we're not the only ones. We were not able to resolve this issue through a variety of mitigation measures, including (not all at once):

  • Increasing all timeouts to 1 second or greater
  • Lowering pool size
  • Reducing # of goroutines exercising our Redis claim logic
  • Moving all operations out of MULTI/EXEC pipelines
  • Completely disabling idle connection reaping
  • Fuzzing the interval at which each individual cluster node's Client cleans idle connections
  • Changing the instance type of the ElastiCache Redis nodes
  • Rebalancing/increasing shard count for the ElastiCache Redis cluster

We are seeing the issue a lot less often after

  • Reducing # of goroutines exercising our Redis claim logic
  • Lowering pool size

but I don't think that's a permanent fix, just a consequence of there being fewer potential resources to overload an individual ElastiCache Redis cluster node. We still have nothing to blame for when this "CPU stuck at 100% due to tens of thousands of connections despite a very small pool" situation happens. It seems to trigger randomly, often at times when the system is not even at its peak load. We also started ingesting pool stats from the ClusterClient and don't see any pool timeouts while this issue is happening.
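
For reference, the knobs from the mitigation list above map onto ClusterOptions roughly like this (illustrative values only, not a recommendation), and the pool stats we ingest come from the client's PoolStats() method:

rdb := redis.NewClusterClient(&redis.ClusterOptions{
	Addrs:     []string{redisAddr},
	TLSConfig: &tls.Config{},

	// Timeouts raised to 1 second or greater.
	DialTimeout:  5 * time.Second,
	ReadTimeout:  time.Second,
	WriteTimeout: time.Second,

	// Lowered pool size per cluster node.
	PoolSize: 10,

	// -1 disables the idle connection reaper entirely;
	// a positive duration controls how often it runs.
	IdleCheckFrequency: -1,
})

// Hits, Misses, Timeouts, TotalConns, IdleConns, StaleConns
stats := rdb.PoolStats()
log.Printf("pool timeouts: %d", stats.Timeouts)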

TL;DR: we're still very much stuck investigating this issue before moving more traffic on to this system, because we still don't know why this is happening.

@pedrocunha

Thanks for the reply! You listed a few things that we wanted to try too, but now I feel less encouraged that we will get to the bottom of it.

Couple of questions:

  • What is your dial timeout?
  • Do you also see a spike on CurrentConnections?
  • What does your traffic look like? We do about ~2 million GET requests per node per min (pipeline with several GETs) and this issue only happens at peak randomly during the week around 5-6pm BST.
  • Have you considered not using redis cluster at all?
  • Have you tried using SLOWLOG?

We are seeing this issue on a cache.r6g.xlarge at less than 20% CPU and 67% memory usage. I am also suspicious of us storing big objects as our NetworkBytesOut can be ~500MB/min.

@ianjhoffman
Author

ianjhoffman commented Feb 21, 2022

Similar deal here:

  • 10-30% CPU usage, but very low (only .15%) memory usage
  • Not sure if I remember/still have data for CurrentConnections, but NewConnections always ends up at around 15k for the shard in question
  • We're using the default DialTimeout
  • We tried using SLOWLOG from the AWS Console and didn't see anything interesting
  • Our alternative to using Redis Cluster would probably be similar to something we did in a legacy system with much more traffic where each node is a regular ElastiCache primary/replica pair. We haven't tried that solution here as we didn't want to commit to handling our own sharding and sharding configuration again
  • As far as maximums go for different byte rates, I see ReplicationBytes at ~200MB/min and NetworkBytesOut at ~900MB/min. Usually our payloads we're storing/fetching/deleting are around 1-2KB in size but that can vary
  • The busiest shards (that happen to hold more keys than others because randomness ain't perfect) see ~650k SET-type commands/min and ~180k GET-type commands/min at maximum. All commands are one-offs and don't live inside pipelines, not that this change helped us

@klakekent

We encountered the same problem. Many new connections are suddenly created at an unpredictable point in time. The error log shows i/o timeouts, so go-redis creates many new connections, which causes the Redis server load to become too high. We still haven't been able to find out why this is happening, because the timing of these problems is completely random and the system's load bottleneck is not reached at all.

@pedrocunha

Glad we aren't alone on this @klakekent. This has been an extremely frustrating issue for us. We implemented a rate limiter and it doesn't fully mitigate the issue. We just upgraded to Redis 6.2 this morning and are hoping to see if it helps.

@klakekent

@pedrocunha I also contacted AWS support about this issue, and they told me they found nothing wrong with ElastiCache or EC2, so I think the Redis client side should be checked for what is happening. @vmihailenco

@ianjhoffman
Author

ianjhoffman commented Mar 11, 2022

It sounds like there are three of us all experiencing the same issue now. @vmihailenco would it be possible to reopen this issue for further investigation? When we contacted AWS support, similar to @klakekent, we also were told that they found absolutely nothing wrong on their end. The issue also seems to trigger so quickly (sub 30-second interval) that most attempts to measure things like relative command durations/failures etc. haven't been fruitful in chasing down what's going on.

@vmihailenco
Collaborator

@enjmusic It is best to open a new issue with a summary of what is happening and what has been tried so far.

@ianjhoffman
Author

@vmihailenco @klakekent @pedrocunha I've opened a new issue for this problem #2046
