store: Store gateway consuming lots of memory / OOMing #448
Do you have the compactor running? If not, that is expected, as you might have
millions of small blocks in your bucket, stored in a very inefficient way.
On Fri, 27 Jul 2018, 17:57 Felipe Cavalcanti <
[email protected]> wrote:
… *Thanos, Prometheus and Golang version used*
thanos v0.1.0rc2
*What happened*
Thanos-store is consuming 50gb of memory during startup
*What you expected to happen*
Thanos-store does not consume so much memory for starting up
*Full logs to relevant components*
store:
level=debug ts=2018-07-27T15:51:21.415788856Z caller=cluster.go:132 component=cluster msg="resolved peers to following addresses" peers=100.96.232.51:10900,100.99.70.149:10900,100.110.182.241:10900,100.126.12.148:10900
level=debug ts=2018-07-27T15:51:21.416254389Z caller=store.go:112 msg="initializing bucket store"
level=warn ts=2018-07-27T15:52:05.28837034Z caller=bucket.go:240 msg="loading block failed" id=01CKE41VDSJMSAJMN6N6K8SABE err="new bucket block: load index cache: download index file: copy object to file: write /var/thanos/store/01CKE41VDSJMSAJMN6N6K8SABE/index: cannot allocate memory"
level=warn ts=2018-07-27T15:52:05.293692332Z caller=bucket.go:240 msg="loading block failed" id=01CKE41VE4XXTN9N55YPCJSPP2 err="new bucket block: load index cache: download index file: copy object to file: write /var/thanos/store/01CKE41VE4XXTN9N55YPCJSPP2/index: cannot allocate memory"
*Anything else we need to know*
One thing that's happening is that my thanos-compactor consumes way too
much RAM as well; the last time it ran, it used up to 60 GB of memory.
*I run store with this args:*
containers:
- args:
  - store
  - --log.level=debug
  - --tsdb.path=/var/thanos/store
  - --s3.endpoint=s3.amazonaws.com
  - --s3.access-key=xxx
  - --s3.bucket=xxx
  - --cluster.peers=thanos-peers.monitoring.svc.cluster.local:10900
  - --index-cache-size=2GB
  - --chunk-pool-size=8GB
*Environment*:
- OS (e.g. from /etc/os-release): kubernetes running on debian
|
hi @Bplotka |
Yes, fix is in review: #529 |
But huge memory usage on startup for the store gateway is mainly caused by fine-grained blocks. I think your compactor has not compacted everything yet. Might that be the case? |
@xjewer no:
"It" means the compactor in #448 (comment) (: so #529 is actually the fix for this |
oh, I missed the details, ok then 😀 |
TL;DR - We are currently seeing thanos-store consuming incredibly large amounts of memory during initial sync and then being OOM killed. It is not releasing any memory as it is performing the initial sync and there is very likely a memory leak. Memory leak is likely to be occurring in https://github.com/improbable-eng/thanos/blob/v0.1.0/pkg/block/index.go#L105-L154

Thanos, Prometheus and Golang version used
What happened
What you expected to happen
Full logs to relevant components
Anything else we need to know

Our thanos S3 bucket is currently 488.54404481872916G, 15078 objects in size.
We've noticed that thanos-store doesn't progress past the
We've modified the goroutine count for how many blocks are being processed concurrently. It is currently hardcoded
The goroutine count for
Through some debugging, we've identified the loading of the index cache as the location of the memory leak - https://github.com/improbable-eng/thanos/blob/v0.1.0/pkg/store/bucket.go#L1070
By commenting out that function from the
We then ran some pprof heap analysis on the thanos-store as the memory leak was occurring and it identified
The function in question - https://github.com/improbable-eng/thanos/blob/v0.1.0/pkg/block/index.go#L105-L154.
The heap graph above suggests that the leak is in the json encoding/decoding of the index file and for some reason is not releasing memory. |
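For readers following the point about the goroutine count: the sketch below shows, in generic Go, how a fixed-size semaphore bounds how many blocks are loaded at once, which in turn bounds how many index files are held in memory at the same moment. This is not the Thanos code; `loadBlocks`, `errSimulated`, and the block IDs are hypothetical names made up for illustration, and the `load` callback stands in for whatever per-block work the store does.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errSimulated is a hypothetical error used only to demonstrate failure handling.
var errSimulated = errors.New("simulated load failure")

// loadBlocks calls load once per block ID, running at most `concurrency`
// loads at the same time. It returns the IDs whose load returned an error.
func loadBlocks(ids []string, concurrency int, load func(id string) error) []string {
	if concurrency < 1 {
		concurrency = 1
	}
	sem := make(chan struct{}, concurrency) // counting semaphore
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		failed []string
	)
	for _, id := range ids {
		wg.Add(1)
		sem <- struct{}{} // blocks while `concurrency` loads are in flight
		go func(id string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when this load finishes
			if err := load(id); err != nil {
				mu.Lock()
				failed = append(failed, id)
				mu.Unlock()
			}
		}(id)
	}
	wg.Wait()
	return failed
}

func main() {
	ids := []string{"block-a", "block-b", "block-c"} // hypothetical block IDs
	failed := loadBlocks(ids, 2, func(id string) error {
		if id == "block-b" {
			return errSimulated
		}
		return nil
	})
	fmt.Println("failed:", failed)
}
```

Lowering the limit trades startup time for a lower peak: with a limit of 1, only one block's index is being processed at any moment. It does not help if memory from finished loads is never released, which is the situation this comment describes.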
Any update on this? We have blocks that after compaction but before downsampling are 400GB which means either we run a massively expensive AWS instance or just add a massive swap file. |
Any update on this? We can't use Thanos at the scale that we want to because of this. |
Any update on this? Can we have a timeline for this fix? |
Thanks guys for this, especially @awprice for the detailed work. This is interesting, as our heap profiles were totally different: they pointed at the expected place, fetching bytes into the buffer for the actual series in the query. Maybe you have bigger indexes and not much traffic on the query side? You said:
This is a totally reasonable number. Can you check your biggest blocks? How large are they, and notably, what is the index size? (: |
Also, sorry for the delay; I totally missed this issue |
@bwplotka Apologies for the late reply, here is some info on our largest blocks/index sizes: Largest block: 14 GiB. Largest index: 7 GiB. |
Let's get back to this. We need a better OOM flow for our store gateway. Some improvements that need to be done:
Lots of work, so help is wanted (: |
Two more info items:
|
FYI: google/cadvisor#2242 |
Nice, but I would say digging into why Go itself made such a decision would be more useful? |
Also golang/go#28466
|
We need to move Thanos to Go 1.12.5: prometheus/prometheus#5524 |
☝️ Deleted the comment as it does not help to resolve this particular issue for the community (: |
We're also seeing massive memory consumption by Thanos :( What impact should we expect if any by reducing |
Any update on this? |
Also mention that VictoriaMetrics uses less RAM than Thanos Store Gateway - see thanos-io/thanos#448 for details.
FYI: This issue was closed as a major rewrite landed on master after 0.10.0. It's still experimental but you can enable it via https://github.com/thanos-io/thanos/blob/master/cmd/thanos/store.go#L78 ( We are still working on various benchmarks, especially around query resource usage, but functionally it should work! (: Please try it out on dev/testing/staging environments and give us feedback! ❤️ |
Hey @bwplotka, I just tried here and I got:
What am I doing wrong? 🤔 |
@caarlos0 Hi, this feature is not included in v0.10.1 release. You can use the latest master branch docker image to try it.
|
Oh ok, sorry, I misread 🙏 |
How is the startup latency? I assume you use the experimental flag as well,
right?
Can you send me the heap profile? (:
Kind Regards,
Bartek
…On Tue, 11 Feb 2020 at 21:19, Jorge Arco ***@***.***> wrote:
Got this master-2020-01-25-cf4e4500 running for some time. 50% memory
improvement. Great work and thanks to all people involved.
[image: chart]
<https://camo.githubusercontent.com/1e2d43903f158f2043f569467d88d0ef3b11ef93/68747470733a2f2f692e696d6775722e636f6d2f676b36554459652e706e67>
|
Yes, experimental flag is enabled. I'll not touch this in the next days as I'm quite busy with other stuff but will try to provide you the profile |
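Since a heap profile was requested above, here is a minimal, generic Go sketch of how a process can dump its own heap profile with the standard `runtime/pprof` package. It only illustrates what such a profile is; it says nothing about how a particular Thanos build exposes profiles. The helper name `heapProfileBytes` and the output file name `heap.pprof` are assumptions made for this example.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
)

// heapProfileBytes forces a garbage collection so the profile reflects
// up-to-date allocation statistics, then returns the encoded heap profile.
func heapProfileBytes() ([]byte, error) {
	runtime.GC()
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := heapProfileBytes()
	if err != nil {
		fmt.Fprintln(os.Stderr, "writing heap profile:", err)
		os.Exit(1)
	}
	// heap.pprof is a hypothetical output path chosen for this example.
	if err := os.WriteFile("heap.pprof", data, 0o644); err != nil {
		fmt.Fprintln(os.Stderr, "saving heap profile:", err)
		os.Exit(1)
	}
	fmt.Printf("wrote %d bytes to heap.pprof\n", len(data))
}
```

The resulting file can be inspected with `go tool pprof heap.pprof`, and it is small enough to attach to an issue comment.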
I am getting the below error with thanos store when using the latest master branch docker image (quay.io/thanos/thanos:master-2020-01-25-cf4e4500) with the --experimental.enable-index-header flag enabled. I am using Kubernetes for the Thanos deployment.
I was using the same bucket before with the thanos-store docker image improbable/thanos:v0.3.2 and there were no this |
We see the same access denied error as uvaisibrahim when moving to the new release. S3-backed storage, no changes other than updating the image we're running and adding the --experimental.enable-index-header flag. If I revert to v0.10.1 and remove the experimental flag, things start up and run (though they need a lot of memory to do so) |
Try just the new release without the flag. This error, which is really the client not being able to talk to S3, does not have anything to do with the experimental feature. (: It might be a misconfiguration. |
Yes, it's not flag specific, but just changing the image to use the new release causes the error. No other settings are changed. It's a k8s deploy, so the config is identical except for the image change. Maybe there are other changes in the master-2020-01-25-cf4e4500 release that introduce this error, but I figured it was worth mentioning since it seemed to be the identical error that at least one other user on the release was seeing. |
FYI, the issue is with the updated Thanos version not working with existing configs. With the experimental flag, my store's memory usage is reduced by 27% and so far everything seems to be healthy and functioning as expected. |
Thanks for that info @genericgithubuser! Would you like to open a PR to update the |
In case people still need this, you can now test with the container v0.11.0-rc.1. |
We're testing the changes currently in our testing/staging environment in an EKS cluster. The memory has reduced by 60-70%. I'll keep you updated after more tests. |
Thanos, Prometheus and Golang version used
thanos v0.1.0rc2
What happened
Thanos-store is consuming 50gb of memory during startup
What you expected to happen
Thanos-store does not consume so much memory for starting up
Full logs to relevant components
store:
Anything else we need to know
Some time after initialization the RAM usage goes down to normal levels, something around 8 GB
Another thing that's happening is that my thanos-compactor consumes way too much RAM as well; the last time it ran, it used up to 60 GB of memory.
I run store with this args:
Environment: