Thanos receive fails "no space left on device" #7391

Status: Open

kbajy opened this issue May 26, 2024 · 4 comments

kbajy commented May 26, 2024

Thanos, Prometheus and Golang version used:
Thanos v0.35.0 and Prometheus v2.48.0

Object Storage Provider: Azure Blob

What happened: The receive pod ran for a couple of days without errors, then it started to crash-loop. The receive is running on one cluster; the compactor is running on a different cluster.

  • Cluster 1: Thanos query and Thanos receive, without the "no space left on device" issue
  • Cluster 2: Thanos query and Thanos receive, with the "no space left on device" issue
  • Cluster 3: Thanos query, ruler, store gateway, the compactor and the receive, no issue

All the Thanos store components use the same storage config (Azure Blob Storage).

What you expected to happen: The receive in cluster #2 keeps running, the same way the receives in clusters #1 and #3 do.

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components:

Logs

16062400000 ulid=01HY6R27XEJASRDRJZPCQFH4MM
ts=2024-05-26T03:08:38.106234246Z caller=repair.go:56 level=info component=receive component=multi-tsdb tenant=default-tenant msg="Found healthy block" mint=1716062400012 maxt=1716069600000 ulid=01HY6YXZHGXYNHB0V23SR7HTFR
ts=2024-05-26T03:08:38.106255535Z caller=repair.go:56 level=info component=receive component=multi-tsdb tenant=default-tenant msg="Found healthy block" mint=1716069600029 maxt=1716076800000 ulid=01HY75SPSN64JGPNEVMPH9JY5H
ts=2024-05-26T03:08:38.10627417Z caller=repair.go:56 level=info component=receive component=multi-tsdb tenant=default-tenant msg="Found healthy block" mint=1716076800032 maxt=1716084000000 ulid=01HY7CNDQXW6RZ9RA39029ZX1G
ts=2024-05-26T03:08:38.106915603Z caller=receive.go:601 level=info component=receive msg="shutting down storage"
ts=2024-05-26T03:08:38.106926284Z caller=receive.go:605 level=info component=receive msg="storage is flushed successfully"
ts=2024-05-26T03:08:38.1069309Z caller=receive.go:611 level=info component=receive msg="storage is closed"
ts=2024-05-26T03:08:38.106943423Z caller=http.go:91 level=info component=receive service=http/server component=receive msg="internal server is shutting down" err="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.106963196Z caller=receive.go:693 level=info component=receive component=uploader msg="uploading the final cut block before exiting"
ts=2024-05-26T03:08:38.106983989Z caller=receive.go:702 level=info component=receive component=uploader msg="the final cut block was uploaded" uploaded=0
ts=2024-05-26T03:08:38.107007441Z caller=http.go:110 level=info component=receive service=http/server component=receive msg="internal server is shutdown gracefully" err="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.107022125Z caller=intrumentation.go:81 level=info component=receive msg="changing probe status" status=not-healthy reason="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.107064152Z caller=grpc.go:138 level=info component=receive service=gRPC/server component=receive msg="internal server is shutting down" err="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.10708308Z caller=grpc.go:151 level=info component=receive service=gRPC/server component=receive msg="gracefully stopping internal server"
ts=2024-05-26T03:08:38.107113074Z caller=grpc.go:164 level=info component=receive service=gRPC/server component=receive msg="internal server is shutdown gracefully" err="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.107129198Z caller=intrumentation.go:81 level=info component=receive msg="changing probe status" status=not-healthy reason="opening storage: open /var/thanos/receive/default-tenant/wal/00001125: no space left on device"
ts=2024-05-26T03:08:38.107211886Z caller=main.go:171 level=error err="open /var/thanos/receive/default-tenant/wal/00001125: no space left on device\nopening storage\nmain.startTSDBAndUpload.func1\n\t/bitnami/blacksmith-sandox/thanos-0.35.0/src/github.com/thanos-io/thanos/cmd/thanos/receive.go:643\ngithub.com/oklog/run.(*Group).Run.func1\n\t/bitnami/blacksmith-sandox/thanos-0.35.0/pkg/mod/github.com/oklog/[email protected]/group.go:38\nruntime.goexit\n\t/opt/bitnami/go/src/runtime/asm_amd64.s:1650\nreceive command failed\nmain.main\n\t/bitnami/blacksmith-sandox/thanos-0.35.0/src/github.com/thanos-io/thanos/cmd/thanos/main.go:171\nruntime.main\n\t/opt/bitnami/go/src/runtime/proc.go:267\nruntime.goexit\n\t/opt/bitnami/go/src/runtime/asm_amd64.s:1650"

Anything else we need to know:

  • receive
    • --log.level=info
    • --log.format=logfmt
    • --grpc-address=0.0.0.0:10901
    • --http-address=0.0.0.0:10902
    • --remote-write.address=0.0.0.0:19291
    • --objstore.config=$(OBJSTORE_CONFIG)
    • --tsdb.path=/var/thanos/receive
    • --label=replica="$(NAME)"
    • --label=receive="true"
    • --tsdb.retention=15d
    • --receive.local-endpoint=127.0.0.1:10901
    • --receive.hashrings-file=/var/lib/thanos-receive/hashrings.json
    • --receive.replication-factor=1
@atayfour

We are facing the same issue with Thanos receivers. It's not clear yet what the issue is.

ts=2024-08-21T17:00:04.446859956Z caller=db.go:1014 level=error component=receive component=multi-tsdb tenant=XXXX msg="compaction failed" err="preallocate: no space left on device"

- receive
    - --log.level=warn
    - --log.format=logfmt
    - --grpc-address=0.0.0.0:10901
    - --http-address=0.0.0.0:10902
    - --remote-write.address=0.0.0.0:19291
    - --receive.replication-factor=2
    - --tsdb.retention=1d
    - --label=receive="true"
    - --objstore.config-file=/config/thanos-store.yml
    - --tsdb.path=/var/thanos/receive
    - --receive.default-tenant-id=default
    - --label=receive_replica="$(NAME)"
    - --receive.local-endpoint=$(NAME).thanos-receive.$(NAMESPACE).svc.cluster.local:10901
    - --receive.hashrings-file=/var/lib/thanos-receive/hashrings.json

We have 3 replicas, each mounting a 10GB volume. I checked the PVC, and only about 50% of it was used.

Will decreasing the retention to 12h fix the issue?
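
Whether 12h is enough depends mostly on the ingestion rate: local disk for a receive replica scales roughly with retention × ingested samples/s × bytes per sample (around 1-2 bytes per sample on average, per the Prometheus storage documentation), plus headroom for the WAL and for compaction, which temporarily needs space for both the source blocks and the new block it writes. A back-of-the-envelope sketch with a made-up ingestion rate; substitute the real rate from the receiver's own metrics (assuming it exposes the usual TSDB metrics, e.g. rate(prometheus_tsdb_head_samples_appended_total[1h])):

```go
// sizing.go: a rough estimate of local disk needed per receive replica for a
// given retention, based on the rule of thumb from the Prometheus storage
// docs. The ingestion rate below is a hypothetical example value.
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		samplesPerSecond = 50_000 // hypothetical ingestion rate for one replica
		bytesPerSample   = 2.0    // Prometheus docs: roughly 1-2 bytes per sample on disk
	)

	for _, retention := range []time.Duration{12 * time.Hour, 24 * time.Hour, 15 * 24 * time.Hour} {
		needed := retention.Seconds() * samplesPerSecond * bytesPerSample
		fmt.Printf("retention %-9s -> ~%.1f GiB (plus headroom for WAL and compaction)\n",
			retention, needed/(1<<30))
	}
}
```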

@narutolied

We have the same problem.
We changed --tsdb.retention=0d.
Everything seems fine, but storage still builds up and the receive crashes.

yeya24 (Contributor) commented Nov 11, 2024

@narutolied If you set the retention time to 0, Prometheus will set it to the default value of 15 days.

@atayfour Prometheus applies retention a bit differently. Time-based retention is block based: a block is deleted only when the newest block's max time minus that block's max time is greater than the retention. So if you don't have a completed block yet, or you only have one block, I believe that block will not be deleted even though it is past the retention time (see the sketch after this comment).

In your situation, increasing the volume size is probably the easiest way to proceed.
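
To make that rule concrete, here is a small standalone Go sketch of the behaviour described above (an illustration only, not the actual Prometheus TSDB code):

```go
// retention.go: illustrates block-based time retention: a block is dropped
// only once the newest block's max time is more than `retention` ahead of
// that block's max time, so a lone block is never deleted.
package main

import (
	"fmt"
	"time"
)

type block struct {
	name    string
	maxTime time.Time
}

// blocksToDelete mirrors the rule: delete block b iff newestMaxTime - b.maxTime > retention.
func blocksToDelete(blocks []block, retention time.Duration) []string {
	if len(blocks) == 0 {
		return nil
	}
	newest := blocks[0].maxTime
	for _, b := range blocks {
		if b.maxTime.After(newest) {
			newest = b.maxTime
		}
	}
	var deletable []string
	for _, b := range blocks {
		if newest.Sub(b.maxTime) > retention {
			deletable = append(deletable, b.name)
		}
	}
	return deletable
}

func main() {
	now := time.Now()
	retention := 24 * time.Hour

	// A single block is never past retention relative to itself, so it stays.
	only := []block{{"old-block", now.Add(-48 * time.Hour)}}
	fmt.Println("single old block:", blocksToDelete(only, retention)) // []

	// Once a newer block exists, the old one becomes eligible for deletion.
	withNewer := append(only, block{"new-block", now})
	fmt.Println("with a newer block:", blocksToDelete(withNewer, retention)) // [old-block]
}
```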

yeya24 (Contributor) commented Nov 15, 2024

FYI, prometheus/prometheus#10015 is the corresponding issue in the Prometheus TSDB.
