
Various timeouts performing HMGET #2069

Closed
alessandro-gandolfi-ors opened this issue Apr 6, 2022 · 26 comments

Comments

@alessandro-gandolfi-ors

alessandro-gandolfi-ors commented Apr 6, 2022

Hello everyone, the company I work for is having problems similar to #2066 with Microsoft's wrapper package (v5.0.1) and .NET 5, but I can't figure out the cause. The size of a single key is a bit less than 0.33 MB. Sometimes, rather rarely, the exceptions below are thrown. After people reported them, I pinned the package explicitly to the latest version, hoping that the performance improvements and fixes would help. Still, if they happen again, it is better to understand what is going on now than to face them in the future. For context, Redis is deployed on a K8s cluster.

Moreover, I noticed that the set operation used in the project was async without an await, and changed it to a sync set, so it is possible these are already resolved; but I don't have much time to test the changes and I also don't know how to reproduce the errors. The errors below occurred with an async (un-awaited) set and a sync get.
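For illustration, a minimal sketch of the difference, assuming the wrapper's IDistributedCache API (the CacheWriter class and method names are hypothetical):

```csharp
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;

public class CacheWriter
{
    private readonly IDistributedCache _cache;
    public CacheWriter(IDistributedCache cache) => _cache = cache;

    // The problematic pattern: the async write is started but never awaited, so
    // failures/timeouts surface later (or never) and writes can overlap unpredictably.
    public void SaveFireAndForget(string key, byte[] payload)
    {
        _ = _cache.SetAsync(key, payload); // not awaited
    }

    // Awaited async write: exceptions propagate to the caller and ordering is preserved.
    public async Task SaveAsync(string key, byte[] payload)
        => await _cache.SetAsync(key, payload);

    // Synchronous write (what I switched to): blocks the calling thread instead.
    public void Save(string key, byte[] payload)
        => _cache.Set(key, payload);
}
```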

Here are 3 examples and what I understood from them (some info is replaced with 'xxx'):

1 - In this one it seems that the client is waiting for a response but there are no bytes left to read. I don't understand what is happening with the threads: as far as I can tell, no threads are active in the thread pool dedicated to Redis (mgr), but at the same time one is active in the main thread pool (busy worker). Can someone ELI5? I also know that the wrapper package issues additional set operations under the hood to manage expirations, but why would the read state be ReadAsync?
Timeout performing HMGET xxx (5000ms), inst: 0, qu: 0, qs: 1, aw: False, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, serverEndpoint: Unspecified/xxx, mgr: 10 of 10 available, clientName: xxx, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=1,Free=32766,Min=8,Max=32767), v: 2.0.593.37019

2 + 3 - In these the read state is CompletePendingMessage. I thought it could be thread theft, but I also read in many issues that it shouldn't happen on .NET Core and later, so they remain an open question.
Number 2 has no requests pending, but there are still bytes to be read in the pipe, one thread active for async operations (mgr), and 2 worker threads active.
Timeout performing HMGET xxx (5000ms), inst: 0, qu: 0, qs: 0, aw: False, rs: CompletePendingMessage, ws: Idle, in: 0, in-pipe: 7150, out-pipe: 0, serverEndpoint: Unspecified/xxx, mgr: 9 of 10 available, clientName: xxx, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=2,Free=32765,Min=4,Max=32767), v: 2.0.593.37019
Number 3 is similar to number 2, but there is a read operation pending and the number of worker threads has surpassed the configured minimum.
Timeout performing HMGET xxx (5000ms), inst: 0, qu: 0, qs: 1, aw: False, rs: CompletePendingMessage, ws: Idle, in: 0, in-pipe: 8404, out-pipe: 0, serverEndpoint: Unspecified/xxx, mgr: 9 of 10 available, clientName: xxx, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=5,Free=32762,Min=4,Max=32767), v: 2.0.593.37019

Any help or suggestion would be greatly appreciated.

@NickCraver
Collaborator

First step is definitely upgrading the client to a much newer version (preferably latest) - we've added debugging info which helps diagnose issues that aren't apparent in this old version's stats. If you can repro on latest, can you please share an example error message?

@AlessandroGandolfi

AlessandroGandolfi commented Apr 7, 2022

Thanks for the reply, I'm writing from my personal account now.

First step is definitely upgrading the client to a much newer version (preferably latest)

After people have reported them, I included the package explicitly to the latest version hoping that all the performance improvements and fixes can help

As I wrote, after the reports I already pinned the package explicitly to the latest version in the project. I also enabled the configuration to get better details in exceptions.
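For reference, a minimal sketch of what I mean by that configuration, assuming the detail toggles on the StackExchange.Redis multiplexer (the endpoint is a placeholder, and the property names should be verified against the installed version; with the Microsoft wrapper you need access to the underlying multiplexer):

```csharp
using StackExchange.Redis;

var muxer = ConnectionMultiplexer.Connect("localhost:6379"); // placeholder endpoint
muxer.IncludeDetailInExceptions = true;              // richer timeout/exception messages
muxer.IncludePerformanceCountersInExceptions = true; // adds local CPU etc., if available in this version
```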

If you can repro on latest, can you please share an example error message?

I don't have much time to test the changes and I don't also know how to reproduce the errors.

As I already wrote, I don't actually know how to reproduce the errors. The only thing I know is that they were thrown while using a wireless connection (working from home) instead of the usual wired office connection, so the connection could easily have been slower than the one in the office.

The operations used, for now, involve only one key per connection, which is read, modified, and rewritten repeatedly. The size is usually 0.33 MB, but it could get bigger in the future. I already told some of my coworkers that parts of it could be split across more than one key or compressed (a rough sketch is at the end of this comment), but for now that is not in the near-future plans.

I wrote the issue mainly to get a better understanding of what could have happened; our usage doesn't seem that extreme in data size, connection speed, or other possible influencing factors. The version in use when these problems happened was older, but the timeout error fields are still relevant to the latest version. If someone could help me understand what could have caused them, that would be great.
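For what it's worth, the compression would look roughly like this - a sketch assuming gzip over the serialized payload (the helper names are hypothetical):

```csharp
using System.IO;
using System.IO.Compression;
using Microsoft.Extensions.Caching.Distributed;

public static class CompressedCache
{
    // Gzip the payload before it goes to Redis, to shrink the ~0.33 MB value.
    public static void SetCompressed(IDistributedCache cache, string key, byte[] payload)
    {
        using var output = new MemoryStream();
        using (var gzip = new GZipStream(output, CompressionLevel.Fastest))
        {
            gzip.Write(payload, 0, payload.Length);
        }
        cache.Set(key, output.ToArray());
    }

    public static byte[] GetDecompressed(IDistributedCache cache, string key)
    {
        var compressed = cache.Get(key);
        if (compressed == null) return null;
        using var input = new GZipStream(new MemoryStream(compressed), CompressionMode.Decompress);
        using var output = new MemoryStream();
        input.CopyTo(output);
        return output.ToArray();
    }
}
```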

@alessandro-gandolfi-ors
Author

alessandro-gandolfi-ors commented Apr 14, 2022

@NickCraver
An error was reported again with the latest version of the package, this time a bit different.

Timeout performing EVAL (5000ms), next: EVAL, inst: 0, qu: 0, qs: 1, aw: False, bw: Inactive, rs: ReadAsync, ws: Idle, in: 0, 
in-pipe: 0, out-pipe: 0, serverEndpoint: xxx, mc: 1/1/0, clientName: xxx,
IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=1,Free=32766,Min=8,Max=32767), 
POOL: (Threads=9,QueuedItems=0,CompletedItems=552), Local-CPU: 100%, v: 2.5.61.22961

CPU usage was at 100%; could this be related to #1992?
I enabled use of the main thread pool for socket I/O since I saw you recommending it in multiple issues.
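For clarity, this is what I enabled - a sketch assuming the 2.x SocketManager.ThreadPool option (endpoint is a placeholder):

```csharp
using StackExchange.Redis;

// Route the library's socket I/O through the regular .NET thread pool
// instead of its dedicated scheduler.
var options = ConfigurationOptions.Parse("localhost:6379"); // placeholder endpoint
options.SocketManager = SocketManager.ThreadPool;
var muxer = ConnectionMultiplexer.Connect(options);
```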

@DarthSonic

DarthSonic commented Apr 21, 2022

We have been encountering the same problem for months now, and since the last update the exception occurs multiple times every day:
Timeout performing EVAL (5000ms), next: EVAL, inst: 4, qu: 0, qs: 185, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 65536, in-pipe: 0, out-pipe: 0, serverEndpoint: xyz.redis.cache.windows.net:6379, mc: 1/1/0, mgr: 9 of 10 available, clientName: RD0003FF1C09E7(SE.Redis-v2.5.61.22961), IOCP: (Busy=5,Free=995,Min=4,Max=1000), WORKER: (Busy=192,Free=32575,Min=4,Max=32767), v: 2.5.61.22961 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

@DarthSonic

We also upgraded Azure Redis from v3 to v6. Maybe this is a problem for StackExchange.Redis?

@DarthSonic

P.S.: Server CPU load is at most 3%, and memory usage is around 250 MB of 2.5 GB, so the instance should handle this very well.

@NickCraver
Collaborator

@DarthSonic I'd guess you're hitting thread pool exhaustion there - are you using sync commands or async? The 192 threads, a full inbound socket buffer, and 185 commands in the queue mean the client is overloaded, and I'd guess that's from thread pool exhaustion. I'd aim at getting a stack dump and seeing where all those threads are stuck... and unavailable because of it.

@DarthSonic

@NickCraver we are using async methods with await (ASP.NET MVC). Any problem with that?

@NickCraver
Collaborator

@DarthSonic on the Redis client bits too? e.g. .StringGetAsync instead of .StringGet? In either case: a memory dump and looking at your threads would be very enlightening and show you what's up in a hurry.
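To illustrate the distinction (just an example sketch, not your code):

```csharp
using System.Threading.Tasks;
using StackExchange.Redis;

public class Lookup
{
    private readonly IDatabase _db;
    public Lookup(IDatabase db) => _db = db;

    // Fully async: frees the thread while waiting on Redis.
    public async Task<RedisValue> GetAsync(string key) => await _db.StringGetAsync(key);

    // Plain sync: blocks the calling thread, but doesn't pretend to be async.
    public RedisValue Get(string key) => _db.StringGet(key);

    // Sync-over-async: blocks a thread pool thread on an async call - a classic
    // way to exhaust the pool under load. Avoid this.
    public RedisValue GetBlocking(string key) => _db.StringGetAsync(key).Result;
}
```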

@DarthSonic

@NickCraver yes, using only the async methods of your Redis client. I will take a memory dump of the Azure App Service when the problem happens again and I am online at that time.

@DarthSonic

@NickCraver Here are stats from our thread counts. It seems that there is indeed a problem. The spikes do fit the times when the exceptions occur:
[image: thread count graph with spikes coinciding with the exception times]
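For anyone wanting to capture the same counters from inside the app, a minimal sketch (assumes .NET Core 3.0+ for ThreadPool.ThreadCount / PendingWorkItemCount):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Periodically log thread pool stats so spikes can be correlated with the timeouts.
static async Task MonitorThreadPoolAsync(CancellationToken token)
{
    while (!token.IsCancellationRequested)
    {
        ThreadPool.GetMinThreads(out var minWorker, out var minIocp);
        Console.WriteLine(
            $"{DateTime.UtcNow:O} threads={ThreadPool.ThreadCount} " +
            $"queued={ThreadPool.PendingWorkItemCount} min={minWorker}/{minIocp}");
        await Task.Delay(TimeSpan.FromSeconds(10), token);
    }
}
```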

@alessandro-gandolfi-ors
Author

It happened again, more than once. It now usually happens for people who work from home, with (not much) slower connections than the one the company has in its offices; still, it shouldn't happen. I asked the coworkers who are reporting this to execute some commands and get more info about their connections if possible; I will post more info once I get it.

Local CPU is at 100% again in this one. I had previously disabled the prevent-theft flag (see the sketch at the end of this comment) since, from what I understood in other issues, it is only useful for .NET versions prior to Core, and we are using .NET 5:
Timeout performing HMGET (5000ms), active: HMGET xxx, next: HMGET xxx ** possible thread-theft indicated; see https://stackexchange.github.io/StackExchange.Redis/ThreadTheft ** , inst: 0, qu: 0, qs: 1, aw: False, bw: Inactive, rs: CompletePendingMessageSync, ws: Idle, in: 0, in-pipe: 1459, out-pipe: 0, serverEndpoint: xxx, mc: 1/1/0, mgr: 9 of 10 available, clientName: xxx, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=3,Free=32764,Min=8,Max=32767), POOL: (Threads=12,QueuedItems=0,CompletedItems=610), Local-CPU: 100%, v: 2.5.61.22961

Another one is this:
Timeout performing HMGET (5000ms), active: HMGET xxx, next: HMGET xxx, inst: 0, qu: 0, qs: 1, aw: False, bw: Inactive, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, serverEndpoint: xxx, mc: 1/1/0, mgr: 10 of 10 available, clientName: xxx, IOCP: (Busy=0,Free=1000,Min=16,Max=1000), WORKER: (Busy=1,Free=32766,Min=16,Max=32767), POOL: (Threads=13,QueuedItems=0,CompletedItems=1377), Local-CPU: 3,99%, v: 2.5.61.22961
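For clarity, this is the flag I mean - a sketch based on the ThreadTheft page linked in the error message (the flag name and the SetFeatureFlag call should be verified against the installed version):

```csharp
using StackExchange.Redis;

// Enable the mitigation described on the ThreadTheft page...
ConnectionMultiplexer.SetFeatureFlag("preventthreadtheft", true);
// ...or disable it, which is what I had done here.
ConnectionMultiplexer.SetFeatureFlag("preventthreadtheft", false);
```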

@NickCraver
Collaborator

@alessandro-gandolfi-ors The first error: CPU. The second: it looks like we are legitimately waiting on data to come back. I recommend using SLOWLOG to see any pauses on the server end!
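For example, SLOWLOG GET 10 from redis-cli, or roughly the following from the client (endpoint is a placeholder; the CommandTrace property names may vary slightly by version):

```csharp
using System;
using System.Linq;
using StackExchange.Redis;

// Read the server's slow log (equivalent to SLOWLOG GET) and print the slowest entries.
var muxer = ConnectionMultiplexer.Connect("localhost:6379,allowAdmin=true"); // placeholder endpoint
var server = muxer.GetServer(muxer.GetEndPoints().First());
foreach (var entry in server.SlowlogGet(10))
{
    Console.WriteLine($"{entry.Duration.TotalMilliseconds} ms: {string.Join(" ", entry.Arguments)}");
}
```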

@sinanopcommerce

https://stackoverflow.com/questions/74011008/redis-timeout-aws-net6

Can anybody please take a look? What is wrong there?

@NickCraver
Collaborator

@sinanopcommerce I recommend opening an issue here with more details - how many keys? How large? Distance from the server? Does SLOWLOG show anything on the Redis server side? Have you tried a newer version? We have a lot of enhancements since the 2.2.x release.

@sinanopcommerce

@NickCraver, there are lots of keys, around 50k-60k. Data size is around 150 KB-200 KB, or even 1 MB-2 MB. It is in the same AWS region. I use the latest version of Microsoft.Extensions.Caching.StackExchangeRedis.

@NickCraver
Collaborator

Do you mean keys in a single fetch, or in the database total? If total - how many are being fetched at a time? I'm trying to get an idea of the per-fetch payload size here.

@sinanopcommerce

@NickCraver total keys, not a single fetch. Per fetch it can be 2-4k keys for 1k HTTP hits.

@NickCraver
Collaborator

2-4k keys? How large is each?

@sinanopcommerce

Yes, for 1k HTTP hits it can be a 2-4k key fetch, and the size can be 10 KB-150 KB per key.

@NickCraver
Collaborator

@sinanopcommerce If I'm understanding you right, that's 2,000 keys * 10 KB at the low end and 4,000 keys * 150 KB at the high end, or roughly 20 MB to 600 MB per fetch. That's a hell of an upper end and would exhaust bandwidth for most cloud services. I'd recommend keys not being that large or lowering batch sizes, else you're not going to get bytes across the wire before the timeout under load.

@sinanopcommerce

sinanopcommerce commented Oct 16, 2022

@NickCraver you understood me perfectly. Yes, the CMS I am using has huge architectural issues for the distributed cache. You can close this issue.

But I assume a fetch of 1k keys will also cause the timeout issue, even if the size of each key is 10 KB.

@NickCraver
Collaborator

@sinanopcommerce I wouldn't expect 1k keys at 10 KB to time out - that's about 10 MB - but under load it might. It's really "do I have enough bandwidth from app server to Redis and back?" - if not, it can't keep up, and the timeout is what tells you that. Larger timeouts can survive spikes in a pinch, but that's a bandaid for sure. I hope you can help tweak some of the cache layout to make it friendlier!
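As a back-of-envelope sketch (the 1 Gbps app-server-to-Redis link is just an assumed number):

```csharp
using System;

const double bytesPerSecond = 1e9 / 8; // assumed 1 Gbps link, roughly 125 MB/s

double lowMb  = 1_000 * 10  / 1024.0;  // 1,000 keys * 10 KB  -> roughly 10 MB
double highMb = 4_000 * 150 / 1024.0;  // 4,000 keys * 150 KB -> roughly 586 MB

Console.WriteLine($"low:  ~{lowMb:F0} MB -> {lowMb * 1024 * 1024 / bytesPerSecond:F2} s on the wire");
Console.WriteLine($"high: ~{highMb:F0} MB -> {highMb * 1024 * 1024 / bytesPerSecond:F1} s on the wire");
```

At the high end that is several seconds of pure transfer time, which already blows through a 5000ms timeout before any server-side or thread pool delays are added.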

@sinanopcommerce

@NickCraver I will try. There is no option except a hybrid cache. I also tried "System.Threading.ThreadPool.MinThreads": 10000, but no luck. By the way, thank you for your response.
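For reference, the programmatic equivalent of that runtimeconfig knob looks roughly like this (10,000 just mirrors the value tried above; raising minimums only helps if the pool is actually starved, and a much smaller number is usually saner):

```csharp
using System.Threading;

// Keep the current IOCP minimum and raise only the worker minimum.
ThreadPool.GetMinThreads(out _, out var minIocp);
ThreadPool.SetMinThreads(workerThreads: 10_000, completionPortThreads: minIocp);
```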

@NickCraver
Collaborator

If you have ideas and want more specific advice, feel free to open an issue and we can discuss a bit more - I'm not sure how to help with the current setup, but if you have architecture ideas we're happy to give any advice we can. Thanks for the quick responses here!

Closing this just to tidy up.

@Alik2015

Exactly the same issue here. Seeing way too many of these with AWS ElastiCache Redis. Exact same error. Keys are around 1 MB each.
