
Various timeouts performing HMGET #2069

Closed
alessandro-gandolfi-ors opened this issue Apr 6, 2022 · 26 comments

Comments

@alessandro-gandolfi-ors

alessandro-gandolfi-ors commented Apr 6, 2022

Hello everyone, the company I work for is having problems similar to #2066 with Microsoft's wrapper package (v5.0.1) and .NET 5, but I can't figure out the cause. The size of a single key is a bit less than 0.33 MB. Sometimes, rather rarely, the exceptions below are thrown. After people reported them, I pinned the package explicitly to the latest version, hoping that the performance improvements and fixes would help. Still, if they happen again, it is better to understand what is going on now than to face them in the future. For context, Redis is deployed on a K8s cluster.

Moreover, I noticed that the set operation used in the project was async without an await, and changed it to a sync set, so it is possible these are already resolved; but I don't have much time to test the changes and I also don't know how to reproduce the errors. The errors below occurred with an async (un-awaited) set and a sync get.
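For illustration, a minimal sketch of the difference, assuming the wrapper's IDistributedCache API (the CacheWriter class and method names are hypothetical):

```csharp
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;

public class CacheWriter
{
    private readonly IDistributedCache _cache;
    public CacheWriter(IDistributedCache cache) => _cache = cache;

    // The problematic pattern: the async write is started but never awaited, so
    // failures/timeouts surface later (or never) and writes can overlap unpredictably.
    public void SaveFireAndForget(string key, byte[] payload)
    {
        _ = _cache.SetAsync(key, payload); // not awaited
    }

    // Awaited async write: exceptions propagate to the caller and ordering is preserved.
    public async Task SaveAsync(string key, byte[] payload)
        => await _cache.SetAsync(key, payload);

    // Synchronous write (what I switched to): blocks the calling thread instead.
    public void Save(string key, byte[] payload)
        => _cache.Set(key, payload);
}
```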

Here are 3 examples and what I understood from them (some info is replaced with 'xxx'):

1 - In this one it seems that the client is waiting for a response but there are no bytes left to read. I don't understand what is happening with the threads: as far as I can tell, no threads are active in the thread pool dedicated to Redis (mgr), but at the same time one is active in the main thread pool (busy worker). Can someone ELI5? I also know that the wrapper package issues additional set operations under the hood to manage expirations, but why would the read state be ReadAsync?
Timeout performing HMGET xxx (5000ms), inst: 0, qu: 0, qs: 1, aw: False, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, serverEndpoint: Unspecified/xxx, mgr: 10 of 10 available, clientName: xxx, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=1,Free=32766,Min=8,Max=32767), v: 2.0.593.37019

2 + 3 - In these the read state is CompletePendingMessage. I thought it could be thread theft, but I also read in many issues that it shouldn't happen on .NET Core and later, so they remain an open question.
Number 2 has no requests pending, but there are still bytes to be read in the pipe, one thread active for async operations (mgr), and 2 worker threads active.
Timeout performing HMGET xxx (5000ms), inst: 0, qu: 0, qs: 0, aw: False, rs: CompletePendingMessage, ws: Idle, in: 0, in-pipe: 7150, out-pipe: 0, serverEndpoint: Unspecified/xxx, mgr: 9 of 10 available, clientName: xxx, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=2,Free=32765,Min=4,Max=32767), v: 2.0.593.37019
Number 3 is similar to number 2, but there is a read operation pending and the number of worker threads has surpassed the configured minimum.
Timeout performing HMGET xxx (5000ms), inst: 0, qu: 0, qs: 1, aw: False, rs: CompletePendingMessage, ws: Idle, in: 0, in-pipe: 8404, out-pipe: 0, serverEndpoint: Unspecified/xxx, mgr: 9 of 10 available, clientName: xxx, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=5,Free=32762,Min=4,Max=32767), v: 2.0.593.37019

Any help or suggestion would be greatly appreciated.

@NickCraver
Collaborator

First step is definitely upgrading the client to a much newer version (preferably latest) - we've added debugging info which helps diagnose issues that aren't apparent in this old version's stats. If you can repro on latest, can you please share an example error message?

@AlessandroGandolfi

AlessandroGandolfi commented Apr 7, 2022

Thanks for the reply, I'm writing from my personal account now.

First step is definitely upgrading the client to a much newer version (preferably latest)

After people have reported them, I included the package explicitly to the latest version hoping that all the performance improvements and fixes can help

As I wrote, after the reports I already pinned the package explicitly to the latest version in the project. I also enabled the configuration to get better details in exceptions.
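For reference, a minimal sketch of what I mean by that configuration, assuming the detail toggles on the StackExchange.Redis multiplexer (the endpoint is a placeholder, and the property names should be verified against the installed version; with the Microsoft wrapper you need access to the underlying multiplexer):

```csharp
using StackExchange.Redis;

var muxer = ConnectionMultiplexer.Connect("localhost:6379"); // placeholder endpoint
muxer.IncludeDetailInExceptions = true;              // richer timeout/exception messages
muxer.IncludePerformanceCountersInExceptions = true; // adds local CPU etc., if available in this version
```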

If you can repro on latest, can you please share an example error message?

I don't have much time to test the changes and I don't also know how to reproduce the errors.

As I already wrote, I don't actually know how to reproduce the errors. The only thing I know is that they were thrown while using a wireless connection (working from home) instead of the usual wired office connection, so the connection could easily have been slower than the one in the office.

The operations used, for now, involve only one key per connection, which is read, modified, and rewritten repeatedly. The size is usually 0.33 MB, but it could get bigger in the future. I already told some of my coworkers that parts of it could be split across more than one key or compressed (a rough sketch is at the end of this comment), but for now that is not in the near-future plans.

I wrote the issue mainly to get a better understanding of what could have happened; our usage doesn't seem that extreme in data size, connection speed, or other possible influencing factors. The version in use when these problems happened was older, but the timeout error fields are still relevant to the latest version. If someone could help me understand what could have caused them, that would be great.
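For what it's worth, the compression would look roughly like this - a sketch assuming gzip over the serialized payload (the helper names are hypothetical):

```csharp
using System.IO;
using System.IO.Compression;
using Microsoft.Extensions.Caching.Distributed;

public static class CompressedCache
{
    // Gzip the payload before it goes to Redis, to shrink the ~0.33 MB value.
    public static void SetCompressed(IDistributedCache cache, string key, byte[] payload)
    {
        using var output = new MemoryStream();
        using (var gzip = new GZipStream(output, CompressionLevel.Fastest))
        {
            gzip.Write(payload, 0, payload.Length);
        }
        cache.Set(key, output.ToArray());
    }

    public static byte[] GetDecompressed(IDistributedCache cache, string key)
    {
        var compressed = cache.Get(key);
        if (compressed == null) return null;
        using var input = new GZipStream(new MemoryStream(compressed), CompressionMode.Decompress);
        using var output = new MemoryStream();
        input.CopyTo(output);
        return output.ToArray();
    }
}
```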

@alessandro-gandolfi-ors
Author

alessandro-gandolfi-ors commented Apr 14, 2022

@NickCraver
An error was reported again with the latest version of the package, this time a bit different.

Timeout performing EVAL (5000ms), next: EVAL, inst: 0, qu: 0, qs: 1, aw: False, bw: Inactive, rs: ReadAsync, ws: Idle, in: 0, 
in-pipe: 0, out-pipe: 0, serverEndpoint: xxx, mc: 1/1/0, clientName: xxx,
IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=1,Free=32766,Min=8,Max=32767), 
POOL: (Threads=9,QueuedItems=0,CompletedItems=552), Local-CPU: 100%, v: 2.5.61.22961

CPU usage was at 100%; could this be related to #1992?
I enabled use of the main thread pool for socket I/O since I saw you recommending it in multiple issues.
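For clarity, this is what I enabled - a sketch assuming the 2.x SocketManager.ThreadPool option (endpoint is a placeholder):

```csharp
using StackExchange.Redis;

// Route the library's socket I/O through the regular .NET thread pool
// instead of its dedicated scheduler.
var options = ConfigurationOptions.Parse("localhost:6379"); // placeholder endpoint
options.SocketManager = SocketManager.ThreadPool;
var muxer = ConnectionMultiplexer.Connect(options);
```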

@DarthSonic

DarthSonic commented Apr 21, 2022

We have been encountering the same problem for months now, and since the last update the exception occurs multiple times every day:
Timeout performing EVAL (5000ms), next: EVAL, inst: 4, qu: 0, qs: 185, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 65536, in-pipe: 0, out-pipe: 0, serverEndpoint: xyz.redis.cache.windows.net:6379, mc: 1/1/0, mgr: 9 of 10 available, clientName: RD0003FF1C09E7(SE.Redis-v2.5.61.22961), IOCP: (Busy=5,Free=995,Min=4,Max=1000), WORKER: (Busy=192,Free=32575,Min=4,Max=32767), v: 2.5.61.22961 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

@DarthSonic

We also upgraded Azure Redis from v3 to v6. Maybe this is a problem for StackExchange.Redis?

@DarthSonic

P.S.: Server CPU load is at most 3%, and memory usage is around 250 MB of 2.5 GB, so the instance should handle this very well.

@NickCraver
Collaborator

@DarthSonic I'd guess you're hitting thread pool exhaustion there - are you using sync commands or async? The 192 threads, a full inbound socket buffer, and 185 commands in the queue mean the client is overloaded, and I'd guess that's from thread pool exhaustion. I'd aim at getting a stack dump and seeing where all those threads are stuck... and unavailable because of it.

@DarthSonic

@NickCraver we are using async methods with await (ASP.NET MVC). Any problem with that?

@NickCraver
Collaborator

@DarthSonic on the Redis client bits too? e.g. .StringGetAsync instead of .StringGet? In either case: a memory dump and looking at your threads would be very enlightening and show you what's up in a hurry.
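To illustrate the distinction (just an example sketch, not your code):

```csharp
using System.Threading.Tasks;
using StackExchange.Redis;

public class Lookup
{
    private readonly IDatabase _db;
    public Lookup(IDatabase db) => _db = db;

    // Fully async: frees the thread while waiting on Redis.
    public async Task<RedisValue> GetAsync(string key) => await _db.StringGetAsync(key);

    // Plain sync: blocks the calling thread, but doesn't pretend to be async.
    public RedisValue Get(string key) => _db.StringGet(key);

    // Sync-over-async: blocks a thread pool thread on an async call - a classic
    // way to exhaust the pool under load. Avoid this.
    public RedisValue GetBlocking(string key) => _db.StringGetAsync(key).Result;
}
```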

@DarthSonic

@NickCraver yes, using only the async methods of your Redis client. I will take a memory dump of the Azure App Service when the problem happens again and I am online at that time.

@DarthSonic

@NickCraver Here are stats from our thread counts. It seems that there is indeed a problem. The spikes do fit the times when the exceptions occur:
[image: thread count graph with spikes coinciding with the exception times]
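For anyone wanting to capture the same counters from inside the app, a minimal sketch (assumes .NET Core 3.0+ for ThreadPool.ThreadCount / PendingWorkItemCount):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Periodically log thread pool stats so spikes can be correlated with the timeouts.
static async Task MonitorThreadPoolAsync(CancellationToken token)
{
    while (!token.IsCancellationRequested)
    {
        ThreadPool.GetMinThreads(out var minWorker, out var minIocp);
        Console.WriteLine(
            $"{DateTime.UtcNow:O} threads={ThreadPool.ThreadCount} " +
            $"queued={ThreadPool.PendingWorkItemCount} min={minWorker}/{minIocp}");
        await Task.Delay(TimeSpan.FromSeconds(10), token);
    }
}
```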

@alessandro-gandolfi-ors
Author

It happened again, more than once. It now usually happens for people who work from home, with (not much) slower connections than the one the company has in its offices; still, it shouldn't happen. I asked the coworkers who are reporting this to execute some commands and get more info about their connections if possible; I will post more info once I get it.

Local CPU is at 100% again in this one. I had previously disabled the prevent-theft flag (see the sketch at the end of this comment) since, from what I understood in other issues, it is only useful for .NET versions prior to Core, and we are using .NET 5:
Timeout performing HMGET (5000ms), active: HMGET xxx, next: HMGET xxx ** possible thread-theft indicated; see https://stackexchange.github.io/StackExchange.Redis/ThreadTheft ** , inst: 0, qu: 0, qs: 1, aw: False, bw: Inactive, rs: CompletePendingMessageSync, ws: Idle, in: 0, in-pipe: 1459, out-pipe: 0, serverEndpoint: xxx, mc: 1/1/0, mgr: 9 of 10 available, clientName: xxx, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=3,Free=32764,Min=8,Max=32767), POOL: (Threads=12,QueuedItems=0,CompletedItems=610), Local-CPU: 100%, v: 2.5.61.22961

Another one is this:
Timeout performing HMGET (5000ms), active: HMGET xxx, next: HMGET xxx, inst: 0, qu: 0, qs: 1, aw: False, bw: Inactive, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, serverEndpoint: xxx, mc: 1/1/0, mgr: 10 of 10 available, clientName: xxx, IOCP: (Busy=0,Free=1000,Min=16,Max=1000), WORKER: (Busy=1,Free=32766,Min=16,Max=32767), POOL: (Threads=13,QueuedItems=0,CompletedItems=1377), Local-CPU: 3,99%, v: 2.5.61.22961
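For clarity, this is the flag I mean - a sketch based on the ThreadTheft page linked in the error message (the flag name and the SetFeatureFlag call should be verified against the installed version):

```csharp
using StackExchange.Redis;

// Enable the mitigation described on the ThreadTheft page...
ConnectionMultiplexer.SetFeatureFlag("preventthreadtheft", true);
// ...or disable it, which is what I had done here.
ConnectionMultiplexer.SetFeatureFlag("preventthreadtheft", false);
```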

@NickCraver
Collaborator

@alessandro-gandolfi-ors The first error: CPU. The second: it looks like we are legitimately waiting on data to come back. I recommend using SLOWLOG to see any pauses on the server end!
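For example, SLOWLOG GET 10 from redis-cli, or roughly the following from the client (endpoint is a placeholder; the CommandTrace property names may vary slightly by version):

```csharp
using System;
using System.Linq;
using StackExchange.Redis;

// Read the server's slow log (equivalent to SLOWLOG GET) and print the slowest entries.
var muxer = ConnectionMultiplexer.Connect("localhost:6379,allowAdmin=true"); // placeholder endpoint
var server = muxer.GetServer(muxer.GetEndPoints().First());
foreach (var entry in server.SlowlogGet(10))
{
    Console.WriteLine($"{entry.Duration.TotalMilliseconds} ms: {string.Join(" ", entry.Arguments)}");
}
```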

@sinanopcommerce

https://stackoverflow.com/questions/74011008/redis-timeout-aws-net6

Can anybody please take a look? What is wrong there?

@NickCraver
Collaborator

@sinanopcommerce I recommend opening an issue here with more details - how many keys? How large? Distance from the server? Does SLOWLOG show anything on the Redis server side? Have you tried a newer version? We have a lot of enhancements since the 2.2.x release.

@sinanopcommerce

@NickCraver, there are lots of keys, around 50k-60k. Data size is around 150 KB-200 KB, or even 1 MB-2 MB. It is in the same AWS region. I use the latest version of Microsoft.Extensions.Caching.StackExchangeRedis.

@NickCraver
Collaborator

Do you mean keys in a single fetch, or in the database total? If total - how many are being fetched at a time? I'm trying to get an idea of the per-fetch payload size here.

@sinanopcommerce

@NickCraver total keys, not a single fetch. Per fetch it can be 2-4k keys for 1k HTTP hits.

@NickCraver
Collaborator

2-4k keys? How large is each?

@sinanopcommerce

Yes, for 1k HTTP hits it can be a 2-4k key fetch, and the size can be 10 KB-150 KB per key.

@NickCraver
Collaborator

@sinanopcommerce If I'm understanding you right, that's 2,000 keys * 10 KB at the low end and 4,000 keys * 150 KB at the high end, or roughly 20 MB to 600 MB per fetch. That's a hell of an upper end and would exhaust bandwidth for most cloud services. I'd recommend keys not being that large or lowering batch sizes, else you're not going to get bytes across the wire before the timeout under load.

@sinanopcommerce

sinanopcommerce commented Oct 16, 2022

@NickCraver you understood me perfectly. Yes, the CMS I am using has huge architectural issues for the distributed cache. You can close this issue.

But I assume a fetch of 1k keys will also cause the timeout issue, even if the size of each key is 10 KB.

@NickCraver
Collaborator

@sinanopcommerce I wouldn't expect 1k keys at 10 KB to time out - that's about 10 MB - but under load it might. It's really "do I have enough bandwidth from app server to Redis and back?" - if not, it can't keep up, and the timeout is what tells you that. Larger timeouts can survive spikes in a pinch, but that's a bandaid for sure. I hope you can help tweak some of the cache layout to make it friendlier!
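As a back-of-envelope sketch (the 1 Gbps app-server-to-Redis link is just an assumed number):

```csharp
using System;

const double bytesPerSecond = 1e9 / 8; // assumed 1 Gbps link, roughly 125 MB/s

double lowMb  = 1_000 * 10  / 1024.0;  // 1,000 keys * 10 KB  -> roughly 10 MB
double highMb = 4_000 * 150 / 1024.0;  // 4,000 keys * 150 KB -> roughly 586 MB

Console.WriteLine($"low:  ~{lowMb:F0} MB -> {lowMb * 1024 * 1024 / bytesPerSecond:F2} s on the wire");
Console.WriteLine($"high: ~{highMb:F0} MB -> {highMb * 1024 * 1024 / bytesPerSecond:F1} s on the wire");
```

At the high end that is several seconds of pure transfer time, which already blows through a 5000ms timeout before any server-side or thread pool delays are added.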

@sinanopcommerce

@NickCraver I will try. There is no option except a hybrid cache. I also tried "System.Threading.ThreadPool.MinThreads": 10000, but no luck. By the way, thank you for your response.
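For reference, the programmatic equivalent of that runtimeconfig knob looks roughly like this (10,000 just mirrors the value tried above; raising minimums only helps if the pool is actually starved, and a much smaller number is usually saner):

```csharp
using System.Threading;

// Keep the current IOCP minimum and raise only the worker minimum.
ThreadPool.GetMinThreads(out _, out var minIocp);
ThreadPool.SetMinThreads(workerThreads: 10_000, completionPortThreads: minIocp);
```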

@NickCraver
Collaborator

If you have ideas and want more specific advice, feel free to open an issue and we can discuss a bit more - I'm not sure how to help with the current setup, but if you have architecture ideas we're happy to give any advice we can. Thanks for the quick responses here!

Closing this just to tidy up.

@Alik2015

Exactly the same issue here. Seeing way too many of these with AWS ElastiCache Redis. Exact same error. Keys are around 1 MB each.
