[Store gateway] Highly increased latency on "get_range" bucket operations on 0.30.2 #6540
Comments
Do you have any way to check request latencies in AWS S3 directly? I wonder if the way they are measured has changed between versions. Also, has the number of requests gone up after the upgrade?
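Something along these lines would show both the per-operation request rate and the p99 latency as seen from the Thanos side. This is only a sketch, assuming the standard `thanos_objstore_bucket_operations_total` and `thanos_objstore_bucket_operation_duration_seconds` metrics exposed by the store gateway (double-check the names against your `/metrics` endpoint), written as a Prometheus recording-rule file:

```yaml
# Sketch: recording rules over the store gateway's own objstore metrics.
# Metric names assume recent Thanos versions; verify against /metrics.
groups:
  - name: thanos-bucket-operation-overview
    rules:
      # Requests per second against object storage, broken down by operation
      # (get, get_range, iter, exists, attributes, ...).
      - record: job_operation:thanos_objstore_bucket_operations:rate5m
        expr: |
          sum by (job, operation) (
            rate(thanos_objstore_bucket_operations_total[5m])
          )
      # p99 duration per operation, as measured by the Thanos objstore client.
      - record: job_operation:thanos_objstore_bucket_operation_duration_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, operation, le) (
              rate(thanos_objstore_bucket_operation_duration_seconds_bucket[5m])
            )
          )
```

Comparing both series before and after the upgrade should answer the request-count question without AWS-side data; on the S3 side, CloudWatch request metrics (opt-in per bucket) or server access logs would be the closest equivalent.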
I couldn't find any AWS-side latency metrics. We seem to be observing this on only 2 of our 6 environments, and these are the environments that are the most loaded in terms of metrics/queries. I don't think this has to do with a change in how measurements are done, because we've seen the impact on systems like Grafana after the upgrade.
@thomas-maurice could you try 0.29.1 so that we can look at metrics and try to narrow down the exact release that changed this behavior?
@mjimeneznet would you be kind enough to try 0.30.1, please?
Hi @douglascamata, we've performed the upgrade across one environment, moving the Thanos image in both the Prometheus operator and the Thanos components. Following these upgrades, we've observed a noticeable increase in chunk size and in occurrences of the get_range bucket operation. As a result, this increase has triggered the ThanosStoreObjstoreOperationLatencyHigh alert.
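For context on what that alert measures: the `ThanosStoreObjstoreOperationLatencyHigh` rule in the thanos-mixin fires on the p99 of the same bucket-operation duration histogram. Roughly paraphrased below; the exact selectors, threshold, and `for` duration depend on the mixin version you deploy, so treat this as an approximation rather than the upstream definition:

```yaml
# Approximate shape of the mixin alert -- not the literal upstream rule.
- alert: ThanosStoreObjstoreOperationLatencyHigh
  expr: |
    (
      histogram_quantile(
        0.99,
        sum by (job, le) (
          rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~".*thanos-store.*"}[5m])
        )
      ) > 2
    and
      sum by (job) (
        rate(thanos_objstore_bucket_operation_duration_seconds_count{job=~".*thanos-store.*"}[5m])
      ) > 0
    )
  for: 10m
  labels:
    severity: warning
```

A sustained jump in `get_range` p99 past the threshold is exactly what trips it.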
Ok, so as a summary:
Now, onto some investigation: the only notable change to my eyes from 0.29 to 0.30.1 in terms of the Thanos Store GW is #5837. I would recommend trying to bump the hidden CLI flag
Kind reminder: do not run 0.30.1 for long in a production environment. You want 0.30.2 there ASAP to have a very important fix from #6086. Thanks a lot for testing it, though.
I appreciate your attention to the issue. We did downgrade to 0.29.0, as it is more stable for us for now. Thank you 😊
@douglascamata we tried 0.30.2 with
@thomas-maurice could you post the contents of the objstore configuration you are using?
@fpetkovski it is pretty stock, we didn't customise it:

```yaml
---
type: s3
config:
  bucket: [BUCKET NAME]
  endpoint: [S3 ENDPOINT]
  aws_sdk_auth: true
```

And that's it, so I am assuming the store gateway is working with the default config values.
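For reference, with a config that minimal everything else falls back to the client defaults. If HTTP transport behaviour ever needs ruling out, the S3 objstore config exposes an `http_config` block; the field names below follow the Thanos storage documentation, and the values are only roughly the documented defaults, so take this as a sketch rather than a recommendation:

```yaml
type: s3
config:
  bucket: [BUCKET NAME]
  endpoint: [S3 ENDPOINT]
  aws_sdk_auth: true
  # values below are illustrative approximations of the defaults
  http_config:
    idle_conn_timeout: 1m30s       # how long idle connections to S3 are kept around
    response_header_timeout: 2m    # max wait for S3 response headers per request
    max_idle_conns: 100            # connection pool sizing
    max_idle_conns_per_host: 100
```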
@fpetkovski any ideas of things we could try to troubleshoot this further?
The only thing that comes to mind is to try out 0.32 once we release it. We are waiting on #6317 before we can cut a new release, but there have been many changes since 0.30.2 and the issue might have been addressed already.
Okay :) Will do!
@thomas-maurice with 0.32 released, have you had a chance to try it out?
@lasermoth not yet, I'll update the issue when we've had time to try it out!
Hello! Any information on whether this has been fixed in 0.32.5?
There were some improvements, but the general method of iterating the bucket has not changed. There is an idea of defaulting to the previous behaviour and enabling the current one (for the cases where it helps) with a hidden flag, and, longer term, adding a bucket index so we can sync cheaply, but no work has been done on that yet. EDIT: I'm a potato, this was about something else, please disregard.
Hello! Sorry for the late reply, but yes, this was fixed in subsequent Thanos versions! I'm closing this.
Thanos, Prometheus and Golang version used:
Object Storage Provider:
Amazon S3
What happened:
We recently upgraded Thanos from 0.28.1 to 0.30.2 (we didn't upgrade to the newer 0.31.X version because of the deduplication bug #6257). After the upgrade we have seen a dramatic increase in the latency of `get_range` operations on the store gateway. We didn't spot it initially because the environments we tested in were not very data intensive; however, after upgrading our bigger clusters we noticed an increase of the p99 from a few hundred milliseconds to sometimes over 2 or even 5 seconds, depending on the environment, as shown on the graph below.

Our store gateway pods are configured as follows:
What you expected to happen:
No significant change in the global bucket operation latencies.
How to reproduce it (as minimally and precisely as possible):
Not sure how to reproduce; in our case it was as simple as upgrading the running container from 0.28.1 to 0.30.2. But we noticed that this behaviour appears on our two busiest clusters.
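For completeness, in a Kubernetes/EKS setup the upgrade is just an image-tag bump on the store gateway workload. A hypothetical manifest, with resource and container names and the config-file path chosen for illustration only:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-store                  # illustrative name
spec:
  serviceName: thanos-store
  replicas: 1
  selector:
    matchLabels:
      app: thanos-store
  template:
    metadata:
      labels:
        app: thanos-store
    spec:
      containers:
        - name: thanos-store
          # the upgrade was just bumping this tag from v0.28.1 to v0.30.2
          image: quay.io/thanos/thanos:v0.30.2
          args:
            - store
            - --objstore.config-file=/etc/thanos/objstore.yml   # illustrative path
```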
Full logs to relevant components:
Anything else we need to know:
We are running on Amazon EKS