Thanos receive fails "no space left on device" #7391

Status: Open

kbajy opened this issue May 26, 2024 · 4 comments

kbajy commented May 26, 2024

Thanos, Prometheus and Golang version used:
Thanos v0.35.0 and Prometheus v2.48.0

Object Storage Provider: Azure Blob

What happened: The receive pod ran for a couple of days without errors, then it started to crash-loop. The receive is running on one cluster; the compactor is running on a different cluster.

  • Cluster 1: Thanos query and Thanos receive, without the "no space left on device" issue
  • Cluster 2: Thanos query and Thanos receive, with the "no space left on device" issue
  • Cluster 3: Thanos query, ruler, store gateway, the compactor and the receive, no issue

All the Thanos store components use the same storage config (Azure Blob Storage).

What you expected to happen: The receive in cluster #2 keeps running, the same way the receives in clusters #1 and #3 do.

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components:

Logs

16062400000 ulid=01HY6R27XEJASRDRJZPCQFH4MM
ts=2024-05-26T03:08:38.106234246Z caller=repair.go:56 level=info component=receive component=multi-tsdb tenant=default-tenant msg="Found healthy block" mint=1716062400012 maxt=1716069600000 ulid=01HY6YXZHGXYNHB0V23SR7HTFR
ts=2024-05-26T03:08:38.106255535Z caller=repair.go:56 level=info component=receive component=multi-tsdb tenant=default-tenant msg="Found healthy block" mint=1716069600029 maxt=1716076800000 ulid=01HY75SPSN64JGPNEVMPH9JY5H
ts=2024-05-26T03:08:38.10627417Z caller=repair.go:56 level=info component=receive component=multi-tsdb tenant=default-tenant msg="Found healthy block" mint=1716076800032 maxt=1716084000000 ulid=01HY7CNDQXW6RZ9RA39029ZX1G
ts=2024-05-26T03:08:38.106915603Z caller=receive.go:601 level=info component=receive msg="shutting down storage"
ts=2024-05-26T03:08:38.106926284Z caller=receive.go:605 level=info component=receive msg="storage is flushed successfully"
ts=2024-05-26T03:08:38.1069309Z caller=receive.go:611 level=info component=receive msg="storage is closed"
ts=2024-05-26T03:08:38.106943423Z caller=http.go:91 level=info component=receive service=http/server component=receive msg="internal server is shutting down" err="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.106963196Z caller=receive.go:693 level=info component=receive component=uploader msg="uploading the final cut block before exiting"
ts=2024-05-26T03:08:38.106983989Z caller=receive.go:702 level=info component=receive component=uploader msg="the final cut block was uploaded" uploaded=0
ts=2024-05-26T03:08:38.107007441Z caller=http.go:110 level=info component=receive service=http/server component=receive msg="internal server is shutdown gracefully" err="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.107022125Z caller=intrumentation.go:81 level=info component=receive msg="changing probe status" status=not-healthy reason="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.107064152Z caller=grpc.go:138 level=info component=receive service=gRPC/server component=receive msg="internal server is shutting down" err="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.10708308Z caller=grpc.go:151 level=info component=receive service=gRPC/server component=receive msg="gracefully stopping internal server"
ts=2024-05-26T03:08:38.107113074Z caller=grpc.go:164 level=info component=receive service=gRPC/server component=receive msg="internal server is shutdown gracefully" err="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.107129198Z caller=intrumentation.go:81 level=info component=receive msg="changing probe status" status=not-healthy reason="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.107211886Z caller=main.go:171 level=error err="open /var/thanos/receive/default-tenant/wal/00001125: no space left on device\nopening storage\nmain.startTSDBAndUpload.func1\n\t/bitnami/blacksmith-sandox/thanos-0.35.0/src/github.com/thanos-io/thanos/cmd/thanos/receive.go:643\ngithub.com/oklog/run.(*Group).Run.func1\n\t/bitnami/blacksmith-sandox/thanos-0.35.0/pkg/mod/github.com/oklog/[email protected]/group.go:38\nruntime.goexit\n\t/opt/bitnami/go/src/runtime/asm_amd64.s:1650\nreceive command failed\nmain.main\n\t/bitnami/blacksmith-sandox/thanos-0.35.0/src/github.com/thanos-io/thanos/cmd/thanos/main.go:171\nruntime.main\n\t/opt/bitnami/go/src/runtime/proc.go:267\nruntime.goexit\n\t/opt/bitnami/go/src/runtime/asm_amd64.s:1650"

Anything else we need to know:

  • receive
    • --log.level=info
    • --log.format=logfmt
    • --grpc-address=0.0.0.0:10901
    • --http-address=0.0.0.0:10902
    • --remote-write.address=0.0.0.0:19291
    • --objstore.config=$(OBJSTORE_CONFIG)
    • --tsdb.path=/var/thanos/receive
    • --label=replica="$(NAME)"
    • --label=receive="true"
    • --tsdb.retention=15d
    • --receive.local-endpoint=127.0.0.1:10901
    • --receive.hashrings-file=/var/lib/thanos-receive/hashrings.json
    • --receive.replication-factor=1
@atayfour

We are facing the same issue with Thanos receivers. It's not clear yet what the issue is.

ts=2024-08-21T17:00:04.446859956Z caller=db.go:1014 level=error component=receive component=multi-tsdb tenant=XXXX msg="compaction failed" err="preallocate: no space left on device"

- receive
    - --log.level=warn
    - --log.format=logfmt
    - --grpc-address=0.0.0.0:10901
    - --http-address=0.0.0.0:10902
    - --remote-write.address=0.0.0.0:19291
    - --receive.replication-factor=2
    - --tsdb.retention=1d
    - --label=receive="true"
    - --objstore.config-file=/config/thanos-store.yml
    - --tsdb.path=/var/thanos/receive
    - --receive.default-tenant-id=default
    - --label=receive_replica="$(NAME)"
    - --receive.local-endpoint=$(NAME).thanos-receive.$(NAMESPACE).svc.cluster.local:10901
    - --receive.hashrings-file=/var/lib/thanos-receive/hashrings.json

We have 3 replicas, each mounting a 10GB volume. I checked the PVC, and only about 50% of it was used.

Will decreasing the retention to 12h fix the issue?
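
Whether 12h is enough depends mostly on the ingestion rate: local disk for a receive replica scales roughly with retention × ingested samples/s × bytes per sample (around 1-2 bytes per sample on average, per the Prometheus storage documentation), plus headroom for the WAL and for compaction, which temporarily needs space for both the source blocks and the new block it writes. A back-of-the-envelope sketch with a made-up ingestion rate; substitute the real rate from the receiver's own metrics (assuming it exposes the usual TSDB metrics, e.g. rate(prometheus_tsdb_head_samples_appended_total[1h])):

```go
// sizing.go: a rough estimate of local disk needed per receive replica for a
// given retention, based on the rule of thumb from the Prometheus storage
// docs. The ingestion rate below is a hypothetical example value.
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		samplesPerSecond = 50_000 // hypothetical ingestion rate for one replica
		bytesPerSample   = 2.0    // Prometheus docs: roughly 1-2 bytes per sample on disk
	)

	for _, retention := range []time.Duration{12 * time.Hour, 24 * time.Hour, 15 * 24 * time.Hour} {
		needed := retention.Seconds() * samplesPerSecond * bytesPerSample
		fmt.Printf("retention %-9s -> ~%.1f GiB (plus headroom for WAL and compaction)\n",
			retention, needed/(1<<30))
	}
}
```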

@narutolied

We have the same problem.
We changed --tsdb.retention=0d.
Everything seems fine, but storage still builds up and the receive crashes.

yeya24 (Contributor) commented Nov 11, 2024

@narutolied If you set the retention time to 0, Prometheus will set it to the default value of 15 days.

@atayfour Prometheus applies retention a bit differently. Time-based retention is block based: a block is deleted only when the newest block's max time minus that block's max time is greater than the retention. So if you don't have a completed block yet, or you only have one block, I believe that block will not be deleted even though it is past the retention time (see the sketch after this comment).

In your situation, increasing the volume size is probably the easiest way to proceed.
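
To make that rule concrete, here is a small standalone Go sketch of the behaviour described above (an illustration only, not the actual Prometheus TSDB code):

```go
// retention.go: illustrates block-based time retention: a block is dropped
// only once the newest block's max time is more than `retention` ahead of
// that block's max time, so a lone block is never deleted.
package main

import (
	"fmt"
	"time"
)

type block struct {
	name    string
	maxTime time.Time
}

// blocksToDelete mirrors the rule: delete block b iff newestMaxTime - b.maxTime > retention.
func blocksToDelete(blocks []block, retention time.Duration) []string {
	if len(blocks) == 0 {
		return nil
	}
	newest := blocks[0].maxTime
	for _, b := range blocks {
		if b.maxTime.After(newest) {
			newest = b.maxTime
		}
	}
	var deletable []string
	for _, b := range blocks {
		if newest.Sub(b.maxTime) > retention {
			deletable = append(deletable, b.name)
		}
	}
	return deletable
}

func main() {
	now := time.Now()
	retention := 24 * time.Hour

	// A single block is never past retention relative to itself, so it stays.
	only := []block{{"old-block", now.Add(-48 * time.Hour)}}
	fmt.Println("single old block:", blocksToDelete(only, retention)) // []

	// Once a newer block exists, the old one becomes eligible for deletion.
	withNewer := append(only, block{"new-block", now})
	fmt.Println("with a newer block:", blocksToDelete(withNewer, retention)) // [old-block]
}
```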

yeya24 (Contributor) commented Nov 15, 2024

FYI, prometheus/prometheus#10015 is the corresponding issue in the Prometheus TSDB.
