store: i/o timeouts with new memcached config #1979
cc @pracucci @GiedriusS (: |
Thanks @kamilhristov for the report. I see from the config that you're using AWS ElastiCache: which instance type are you using for ElastiCache? |
Could you also show the following metrics graphed?
Random thoughts: |
So, from your graphs I can see that the 90th percentile latency is good, while the 100th is not. In my experience with AWS, performance on t2/t3 instances is unpredictable, so as a next step I would suggest trying an m5 before iterating further on the root cause analysis. Could you give it a try, please? |
I've increased max idle connections to 1k and max_get_multi_batch_size to 1k. Now I am getting:
Along with:
Current connections graph. The drop is after restarting Thanos Store with new parameters. And new screenshot with irates 📈 |
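(For reference, the parameters being tuned here live in the memcached section of the Thanos Store index-cache config. A minimal sketch using the field names documented for the Thanos memcached client; the address and the other values are placeholders, and only the two 1k settings come from the comment above:)

```yaml
type: MEMCACHED
config:
  addresses: ["my-elasticache-endpoint:11211"]   # placeholder address
  timeout: 500ms
  max_idle_connections: 1000        # raised to 1k in the comment above
  max_async_concurrency: 20
  max_async_buffer_size: 10000
  max_get_multi_concurrency: 100
  max_get_multi_batch_size: 1000    # raised to 1k in the comment above
  dns_provider_update_interval: 10s
```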
This can be easily fixed by increasing |
I still have no clue about this. Do memcached metrics show anything catching your attention? Among the metrics directly exported by memcached you should see the items fetched (we batch reads on the thanos side): what's the rate there? |
Another hypothesis is that the set operations are saturating the fetches. Just for testing, could you try the following settings? |
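(The exact test settings suggested above aren't preserved. Since the hypothesis is that asynchronous set operations are crowding out fetches, one way to test it is to throttle the client-side write path via the async settings of the same config block; the values below are hypothetical:)

```yaml
type: MEMCACHED
config:
  addresses: ["my-elasticache-endpoint:11211"]   # placeholder address
  # Hypothetical test values: limit concurrent async SETs so GETs are not starved.
  max_async_concurrency: 5
  max_async_buffer_size: 1000
  max_get_multi_concurrency: 100
  max_get_multi_batch_size: 100
```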
I've also asked about the memcached instance type. Could you also confirm Thanos Store is not running on a t2/t3 instance, please? I don't think so, but just to confirm. |
I've run some benchmarks in our cluster, using two tools:
Both benchmarks were run against the same memcached server (1 instance), running in the same Kubernetes cluster as the benchmarking tool. The results
Comparison on |
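(The two benchmarking tools and the result tables aren't preserved above. As an illustration only, a standalone memcached benchmark of this kind can be run with memtier_benchmark; the endpoint and workload parameters below are assumptions, not the setup used in this thread:)

```sh
# Hypothetical example: GET-heavy workload (1 SET per 10 GETs) against a single
# memcached instance, 4 threads x 50 clients, 1 KiB values, for 60 seconds.
memtier_benchmark \
  --server=memcached.example.svc.cluster.local --port=11211 \
  --protocol=memcache_text \
  --threads=4 --clients=50 \
  --ratio=1:10 \
  --data-size=1024 \
  --test-time=60
```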
Sorry for the late response and thanks for spending time on this.
r5.large
Then with r5.xlarge max latency is similar but we get more throughput.
CloudWatch is showing ~100,000,000 GET commands per minute when executing the benchmark. It might be something in our infrastructure that is causing this. I will investigate more on Monday but will most likely stick with the in-memory cache for now. It seems like a better option for our scale. Perhaps when our Thanos data grows bigger, it will make more sense to offload the index cache to an external service. No need to invest more time on this, as it seems it is not caused by Thanos Store itself. Thanks again for helping 🙇 |
Thanks for all your help. I will continue to spend a bit more time on this myself, and I will get back to you in case I find out anything.
Agree. The in-memory cache is the best option for the majority of the use cases. |
I have played around with the new memcached support and my anecdotal evidence says that this happens when memcached cannot keep up because not enough memory has been allocated to it and/or the timeouts are too low. Increasing the memory size with |
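(The flag referenced above is cut off. For illustration, memcached's memory and connection limits are set with its standard startup flags; the values below are arbitrary examples, not a recommendation:)

```sh
# Hypothetical example: 2 GiB of item memory, up to 4096 connections,
# 4 worker threads, and a 4 MiB maximum item size.
memcached -m 2048 -c 4096 -t 4 -I 4m
```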
Could this be caused by the key size?
I was able to reproduce the timeouts when executing the benchmark script with a larger data size range. Keep in mind that this doesn't even hit the limit, so the actual Thanos Store keys are even bigger. |
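(If large keys or items are the suspect: memcached limits keys to 250 bytes and items to 1 MiB by default, the latter raised with the -I flag shown earlier. Recent Thanos versions also expose a client-side max_item_size field in the memcached config; check the docs for your version. A hedged sketch:)

```yaml
type: MEMCACHED
config:
  addresses: ["my-elasticache-endpoint:11211"]   # placeholder address
  max_item_size: 4MiB   # assumed field name/format; should not exceed memcached's -I limit
```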
This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions. |
Facing the same issue. Any documentation or work done on this? @pracucci
@kuberkaul Could you check the tips in the following conversation: https://cloud-native.slack.com/archives/CL25937SP/p1585919142164500? Although I haven't tested it, I'm working on related configuration in thanos-io/kube-thanos#106. If you still think this is a valid bug, we can reopen the issue. |
I talked offline to @kuberkaul and shared some tips. In parallel, I'm investigating a potential issue causing unfair prioritisation within the memcached client due to the usage of the gate. I will open a PR to fix it if confirmed (will work on it shortly). |
Hi, also facing the same issue 😞 Any further update on this? Been looking through some of the suggestions / tips from here and from the slack channel, and have been tweaking memcache configurations, yet I'm still running into timeout or |
This log means that a query request has been cancelled. For example, if you open a Grafana dashboard and refresh it while panels are still loading, the previous queries are cancelled; if they're cancelled while waiting for a memcached connection slot, you get that error logged. |
@pracucci Ah, that makes sense -- thanks for the explanation. I'm still running into the timeout errors though :/ any further development or thoughts on that? |
I'm also currently experiencing this issue while tuning Thanos Store components. Perhaps this issue can be re-opened? |
Closing for now as promised, let us know if you need this to be reopened! 🤗 |
Are there any updates on this? Also suffering from this problem. Memcached 1.6.12; various configs of the Thanos cache and of Memcached itself have been tested. |
I am currently digging into this one, so reopening. Still valid.
Some errors from my side:
There are also the good old i/o timeout and read/write broken pipe errors. |
#4742 should greatly improve the situation. @Antiarchitect if you have time, maybe you could try running Thanos with that pull request? |
@GiedriusS - that's problematic in my situation, as I use Bitnami Thanos images with a lot of magic inside. So I hope your PR will be accepted and I'll try the new release where it's included as soon as possible. |
Hello 👋 Looks like there was no activity on this issue for the last two months. |
Is there any update on this? I am also getting the below error while trying to get the historical data.
I have two queries:
My Thanos Store memcached configurations are given below.
bucket caching config - memcached
I had set the value 0 for ... Thanks in advance |
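(The configuration itself isn't preserved above. For reference, the store gateway's "bucket caching config - memcached" generally has the shape below; field names follow the Thanos caching-bucket and memcached-client docs, and every value is a placeholder rather than the poster's actual setting; verify against your Thanos version's documentation:)

```yaml
type: MEMCACHED
config:
  addresses: ["dnssrv+_memcached._tcp.memcached.monitoring.svc.cluster.local"]  # placeholder
  timeout: 500ms
  max_idle_connections: 100
  max_get_multi_concurrency: 100
  max_get_multi_batch_size: 100
chunk_subrange_size: 16000
chunk_subrange_ttl: 24h
metafile_exists_ttl: 2h
metafile_doesnt_exist_ttl: 15m
metafile_content_ttl: 24h
```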
Hello 👋 Looks like there was no activity on this issue for the last two months.
Still valid! |
After trying all the suggestions from this post without any success, one of our engineers eventually found out that the issue was due to a connection problem with the Memcached nodes. We are running Thanos in EKS and are using ElastiCache Memcached with 8 nodes. We have 18 thanos-store pods running on Kubernetes nodes. Logs:
nslookup and netcat:
Outcome: As it turns out, we were sometimes able to run netcat successfully from our Kubernetes nodes, while other attempts would just hang and could not connect to the Memcached cluster. After restarting the problematic Memcached nodes, things started working and now all thanos-store pods are able to connect to Memcached. We don't know the root cause of the connection issue, but it's a step in the right direction. Maybe this helps some other folks! |
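(The logs and command output above aren't preserved. A connectivity check of this kind is typically just DNS resolution plus a TCP probe of each Memcached node from the affected Kubernetes nodes; the endpoint and IP below are placeholders:)

```sh
# Hypothetical example: resolve the ElastiCache configuration endpoint,
# then probe one Memcached node on port 11211 with a 3-second timeout.
nslookup my-cluster.xxxxxx.cfg.use1.cache.amazonaws.com
nc -vz -w 3 10.0.12.34 11211
```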
Perhaps the network bandwidth has been exhausted; this is our situation. |
Based on comments in this Cortex issue: thanos-io/thanos#1979 I figured lowering parallelism can fix timeout errors. Signed-off-by: Jakub Sokołowski <[email protected]>
As this issue is still open, I'm taking the opportunity to describe a similar fault that maybe somebody else has come across. Last year we deployed our on-prem observability cluster for dev, stg and prd. The three applications we're currently running to observe are Loki for logs, Mimir for metrics and Tempo for traces. A Grafana server and Keycloak are also part of the same cluster, with a backend MariaDB in a Galera cluster, all in K8s RKE2. As the cluster is spread across three different sites, we have worker nodes split evenly between the two main sites, plus a small site catering for split-brain. Thus we're using zone-aware replication for the Loki, Mimir and Tempo ingesters and for Mimir's store gateway and Alertmanager. For the backend we're storing this in MinIO, deployed in the same cluster, with one MinIO cluster in each site and replication running in between. MinIO is stored only on two sites, using active-active replication between the two main sites.

All had been working fine up until a week ago, when Mimir in PROD suddenly started to behave badly. We first noticed this when the store gateway gave the first error message: bucket-index file too old. I had to log on to MinIO with mc and delete the bucket-index along with its previous versions, of which there were a lot, surprisingly so when I thought I had a lifecycle rule in place to only allow 5 non-current versions. Anyhow, after deleting these we were able to query the data again, although this time with horrendous performance. We have 30 days of retention in MinIO, and although we can retrieve data, it takes forever. The logs for the store gateway pods give this.
I've checked the memory and CPU on the worker nodes catering for MinIO: these have 4 CPUs and 16 GB each, four on each site, and not one gives any indication that we're running out of memory or CPU. I have deleted Mimir completely, deleted the cache data in the PVC for the store gateway, and deleted the MinIO deployment, all to no avail. I'm out of ideas now. Could it be that the bucket, which is now 5+ TB, is too big for the store gateway to trundle through? That doesn't make sense to me, as S3 buckets can be petabytes in size. |
@percy78 Try to restart your Memcached nodes one at a time or find out which node the above IP with the connection timeout belongs to. This "solved" the issue for us |
@robert-becker-hs good idea, however I tried that already. All worker nodes were rebooted last week. Basically we know that it more or less works; it's just that it now takes minutes to retrieve the metrics, whereas it only took seconds before. I just upgraded MinIO to the latest Helm chart, 5.2.0, from the previous version 5.1.0, but no improvement. I checked the Mimir Helm chart and it's a bit outdated here, 5.0.0, while the latest one is 5.3.0, so I will test this, but I'm not confident it will solve the issue. |
I just restarted the index-cache pods and now I can see this from the store gateway. |
My issue has now been resolved, albeit with data loss. I simply created a new bucket in MinIO with active replication between the two sites and then changed Mimir to point to this new bucket. Consequently we lost the historical data, but now I can see metrics from Friday morning, i.e. three days back, and it's quick again. |
👋 Hello again and thanks for the quick response in #1974
Thanos version used: latest master with #1975
Object Storage Provider: AWS S3
What happened: I am testing the new memcached config for Thanos Store and I am seeing a lot of timeouts. The timeouts are gone when setting the limit to 10s 🤕
Memcached servers look OK and the number of connections doesn't exceed max_idle_connections.
Thanos Store process and Memcached servers are in the same AZ.
I tried the default config and also tweaked most of the parameters but to no avail.
Any recommendations?
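(For context, the "limit" referred to above is presumably the timeout field of the index-cache memcached config; a minimal sketch with a placeholder address:)

```yaml
type: MEMCACHED
config:
  addresses: ["my-memcached:11211"]  # placeholder
  timeout: 500ms   # per the report above, the i/o timeout errors disappear only when this is raised to 10s
```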
Arguments:
thanos store --index-cache.config-file=/app/cache-config.yaml --chunk-pool-size=6GB --objstore.config-file=/app/objstore-config.yaml
Config:
Logs: