Thanos receive fails with "no space left on device" #7391
Comments
We are facing the same issue with Thanos receivers; it's not yet clear what the cause is:
ts=2024-08-21T17:00:04.446859956Z caller=db.go:1014 level=error component=receive component=multi-tsdb tenant=XXXX msg="compaction failed" err="preallocate: no space left on device"
We have 3 replicas, each mounting a 10GB volume. I checked the PVC and only about 50% of it is used. Will decreasing the retention to 12h fix the issue?
@narutolied If you set the retention time to 0, Prometheus falls back to the default value of 15 days. @atayfour Prometheus applies retention a bit differently: time-based retention is block based, and a block is only deleted when the newest block's time minus that block's time exceeds the retention period (see the sketch below). So if you have no blocks yet, or only a single block, I believe that block will not be deleted even though it is past the retention time. In your situation, increasing the volume size is probably the easiest way to proceed.
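For illustration, here is a minimal Go sketch of that block-based retention rule. The Block type, field names, and blocksBeyondRetention function are hypothetical and greatly simplified (the real logic lives in Prometheus's tsdb package); the point is only that with zero or one block, no block ever becomes old enough relative to the newest block to be deleted.

```go
package main

import "fmt"

// Block is a hypothetical stand-in for a TSDB block's time range,
// used only to illustrate the rule; not the real Prometheus tsdb type.
type Block struct {
	ULID string
	MaxT int64 // maximum timestamp covered by the block, in milliseconds
}

// blocksBeyondRetention sketches block-based time retention: a block becomes
// a deletion candidate only when the newest block's MaxT minus that block's
// MaxT exceeds the retention window. With zero or one block the difference
// never exceeds the window, so nothing is deleted.
func blocksBeyondRetention(blocks []Block, retentionMs int64) []Block {
	if len(blocks) == 0 || retentionMs <= 0 {
		return nil
	}
	// Assume blocks[0] is the newest block (largest MaxT).
	newest := blocks[0].MaxT
	var deletable []Block
	for _, b := range blocks[1:] {
		if newest-b.MaxT > retentionMs {
			deletable = append(deletable, b)
		}
	}
	return deletable
}

func main() {
	const retention12h = int64(12 * 60 * 60 * 1000)
	blocks := []Block{
		{ULID: "block-newest", MaxT: 1716084000000},
		{ULID: "block-13h-old", MaxT: 1716037200000}, // 13h older than the newest block
	}
	// Only the 13h-old block is past a 12h retention window.
	fmt.Println(blocksBeyondRetention(blocks, retention12h))
}
```

So a shorter retention only helps once enough blocks have accumulated for the age check to trigger; it does nothing for the head/WAL data that is filling the volume right now.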
FYI, prometheus/prometheus#10015 is the corresponding issue in Prometheus TSDB.
Thanos, Prometheus and Golang version used:
Thanos v0.35.0 and Prometheus v2.48.0
Object Storage Provider: Azure Blob
What happened: The receive pod ran for a couple of days without errors, then it started to crash loop. The receive is running on one cluster; the compactor is running on a different cluster.
All the Thanos store components are using the same storage config (Azure Blob Storage).
What you expected to happen: The receive in cluster #2 keeps running, the same way as the other receives in clusters #1 and #3.
How to reproduce it (as minimally and precisely as possible):
Full logs to relevant components:
16062400000 ulid=01HY6R27XEJASRDRJZPCQFH4MM
ts=2024-05-26T03:08:38.106234246Z caller=repair.go:56 level=info component=receive component=multi-tsdb tenant=default-tenant msg="Found healthy block" mint=1716062400012 maxt=1716069600000 ulid=01HY6YXZHGXYNHB0V23SR7HTFR
ts=2024-05-26T03:08:38.106255535Z caller=repair.go:56 level=info component=receive component=multi-tsdb tenant=default-tenant msg="Found healthy block" mint=1716069600029 maxt=1716076800000 ulid=01HY75SPSN64JGPNEVMPH9JY5H
ts=2024-05-26T03:08:38.10627417Z caller=repair.go:56 level=info component=receive component=multi-tsdb tenant=default-tenant msg="Found healthy block" mint=1716076800032 maxt=1716084000000 ulid=01HY7CNDQXW6RZ9RA39029ZX1G
ts=2024-05-26T03:08:38.106915603Z caller=receive.go:601 level=info component=receive msg="shutting down storage"
ts=2024-05-26T03:08:38.106926284Z caller=receive.go:605 level=info component=receive msg="storage is flushed successfully"
ts=2024-05-26T03:08:38.1069309Z caller=receive.go:611 level=info component=receive msg="storage is closed"
ts=2024-05-26T03:08:38.106943423Z caller=http.go:91 level=info component=receive service=http/server component=receive msg="internal server is shutting down" err="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.106963196Z caller=receive.go:693 level=info component=receive component=uploader msg="uploading the final cut block before exiting"
ts=2024-05-26T03:08:38.106983989Z caller=receive.go:702 level=info component=receive component=uploader msg="the final cut block was uploaded" uploaded=0
ts=2024-05-26T03:08:38.107007441Z caller=http.go:110 level=info component=receive service=http/server component=receive msg="internal server is shutdown gracefully" err="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.107022125Z caller=intrumentation.go:81 level=info component=receive msg="changing probe status" status=not-healthy reason="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.107064152Z caller=grpc.go:138 level=info component=receive service=gRPC/server component=receive msg="internal server is shutting down" err="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.10708308Z caller=grpc.go:151 level=info component=receive service=gRPC/server component=receive msg="gracefully stopping internal server"
ts=2024-05-26T03:08:38.107113074Z caller=grpc.go:164 level=info component=receive service=gRPC/server component=receive msg="internal server is shutdown gracefully" err="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.107129198Z caller=intrumentation.go:81 level=info component=receive msg="changing probe status" status=not-healthy reason="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.107211886Z caller=main.go:171 level=error err="open /var/thanos/receive/default-tenant/wal/00001125: no space left on device\nopening storage\nmain.startTSDBAndUpload.func1\n\t/bitnami/blacksmith-sandox/thanos-0.35.0/src/github.com/thanos-io/thanos/cmd/thanos/receive.go:643\ngithub.com/oklog/run.(*Group).Run.func1\n\t/bitnami/blacksmith-sandox/thanos-0.35.0/pkg/mod/github.com/oklog/[email protected]/group.go:38\nruntime.goexit\n\t/opt/bitnami/go/src/runtime/asm_amd64.s:1650\nreceive command failed\nmain.main\n\t/bitnami/blacksmith-sandox/thanos-0.35.0/src/github.com/thanos-io/thanos/cmd/thanos/main.go:171\nruntime.main\n\t/opt/bitnami/go/src/runtime/proc.go:267\nruntime.goexit\n\t/opt/bitnami/go/src/runtime/asm_amd64.s:1650"
Anything else we need to know: