
Cannot access thanos metrics #688

Closed
robvalca opened this issue Dec 18, 2018 · 7 comments

@robvalca

Versions

thanos, version 0.1.0 (branch: HEAD, revision: ebb58c2)
build user: root@393679b8c49c
build date: 20180918-17:02:30
go version: go1.10.3

prometheus, version 2.5.0 (branch: HEAD, revision: 67dc912ac8b24f94a1fc478f352d25179c94ab9b)
build user: root@578ab108d0b9
build date: 20181106-11:40:44
go version: go1.11.1

Our setup

1 x prometheus server + thanos sidecar (8c, 16G RAM)
2 x thanos queriers (4c, 8G RAM)
1 x thanos-store + thanos compactor (8c, 16G RAM)
backend: S3 ceph+radosgw
OS: CentOS 7.6

Issue

While debugging other issues I've realized that lately I cannot access metrics older than 15d (the Prometheus local retention), even though we have a few months (~140G) of metrics in S3 and it worked fine before. From the Thanos dashboard I get the following message when I select a time range of 1w or longer (the metrics are shown, but there is no data older than 15d):

receive series: rpc error: code = Aborted desc = fetch series for block 01CYM85GN9Q7XRA5HZJJSCRARD: add chunk preload: reference sequence 0 out of range

The Thanos store is looping with the following warning:

Dec 18 09:19:36 cephthanos-store thanos[11993]: level=warn ts=2018-12-18T08:19:36.704180131Z caller=bucket.go:240 msg="loading block failed" id=01CYX8J17RW9EAECCDSV026JD2 err="new bucket block: load index cache: download index file: get file: The specified key does not exist."

And the Thanos compactor seems broken:

Dec 18 09:02:53 cephthanos-store thanos[13556]: level=error ts=2018-12-18T08:02:53.638097634Z caller=compact.go:201 msg="critical error detected; halting" err="compaction failed: compaction: compact blocks [/var/lib/thanos/compactor/compact/0@{monitor=\"cern\",replica=\"A\"}/01CYKKJB05D515N47V38JH7KJY /var/lib/thanos/compactor/compact/0@{monitor=\"cern\",replica=\"A\"}/01CYKTE22ZY6W7RK4JK5V9EFE7 /var/lib/thanos/compactor/compact/0@{monitor=\"cern\",replica=\"A\"}/01CYM19SAKRWFWYVYYCMV3SSDM /var/lib/thanos/compactor/compact/0@{monitor=\"cern\",replica=\"A\"}/01CYM85GN9Q7XRA5HZJJSCRARD]: write compaction: chunk 8 not found: reference sequence 0 out of range"

Possible cause

Lately we had problems with the compactor crashing due to OOM, so eventually I ran it from a different (bigger) machine, and this was probably not a good idea :)

I've also upgraded to Prometheus v2.5.0, but everything was working fine after that, so I don't think it is related. I don't have any specific configuration for Thanos or Prometheus, just default values.

Any help/comments on how to unblock the situation? Thank you in advance!

Roberto

@bwplotka
Member

Lately we had problems with the compactor crashing due to OOM, so eventually I ran it from a different (bigger) machine, and this was probably not a good idea :)

Why was that a bad idea? (:

@bwplotka
Member

bwplotka commented Dec 18, 2018

I've also upgraded to Prometheus v2.5.0, but everything was working fine after that,

Well, compaction is done only after some time, and it seems like 01CYM85GN9Q7XRA5HZJJSCRARD was affected during compaction...

Something you can do is move the 01CYM85GN9Q7XRA5HZJJSCRARD block somewhere else (another bucket, or download it and remove it?) and see if that's the only problematic one (then restart the store and compactor).
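
For example with s3cmd, roughly something like this (bucket names are placeholders and I haven't tested this exact invocation, so treat it as a sketch):

# copy the suspicious block to a safe location, then remove it from the bucket Thanos reads
s3cmd cp --recursive s3://<your-bucket>/01CYM85GN9Q7XRA5HZJJSCRARD/ s3://<your-backup-bucket>/01CYM85GN9Q7XRA5HZJJSCRARD/
s3cmd del --recursive s3://<your-bucket>/01CYM85GN9Q7XRA5HZJJSCRARD/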

Can you paste its meta.json and an overview of the 01CYM85GN9Q7XRA5HZJJSCRARD directory here?

Looking into the error message: "reference sequence 0 out of range" is interesting, because it means that somehow this block does not have ANY chunk files. Printing what the 01CYM85GN9Q7XRA5HZJJSCRARD directory has inside would be useful. It looks like a partial/malformed upload to me.
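
Something like this should show what the block contains (again only a sketch, with a placeholder bucket name):

# list the block directory and fetch its meta.json for inspection
s3cmd ls s3://<your-bucket>/01CYM85GN9Q7XRA5HZJJSCRARD/
s3cmd get s3://<your-bucket>/01CYM85GN9Q7XRA5HZJJSCRARD/meta.json /tmp/meta.json
cat /tmp/meta.json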

@bwplotka
Member

Also, please update to Thanos v0.2.0 or master, because there are fixes for partial upload issues.

@bwplotka
Member

Related to #377

@robvalca
Author

Lately we had problems with the compactor crashing due to OOM, so eventually I ran it from a different (bigger) machine, and this was probably not a good idea :)

Why was that a bad idea? (:

I thought that maybe the temporary files that the compactor stores locally were important for the compaction process, and I did not copy them to the new host, but I guess that is not a problem, right?

I've also upgraded to Prometheus v2.5.0, but everything was working fine after that,

Well, compaction is done only after some time, and it seems like 01CYM85GN9Q7XRA5HZJJSCRARD was affected during compaction...

Something you can do is move the 01CYM85GN9Q7XRA5HZJJSCRARD block somewhere else (another bucket, or download it and remove it?) and see if that's the only problematic one (then restart the store and compactor).

Can you paste its meta.json and an overview of the 01CYM85GN9Q7XRA5HZJJSCRARD directory here?

Looking into the error message: "reference sequence 0 out of range" is interesting, because it means that somehow this block does not have ANY chunk files. Printing what the 01CYM85GN9Q7XRA5HZJJSCRARD directory has inside would be useful. It looks like a partial/malformed upload to me.

Yes, the directory seems incomplete in S3; there is no chunks folder:

2018-12-19 07:46  25334736   s3://prometheus-storage-backup/01CYM85GN9Q7XRA5HZJJSCRARD/index
2018-12-19 07:46       424   s3://prometheus-storage-backup/01CYM85GN9Q7XRA5HZJJSCRARD/meta.json

This is the content of the meta.json file:

{
	"version": 1,
	"ulid": "01CYM85GN9Q7XRA5HZJJSCRARD",
	"minTime": 1544709600000,
	"maxTime": 1544716800000,
	"stats": {
		"numSamples": 67046375,
		"numSeries": 277748,
		"numChunks": 558786
	},
	"compaction": {
		"level": 1,
		"sources": [
			"01CYM85GN9Q7XRA5HZJJSCRARD"
		]
	},
	"thanos": {
		"labels": {
			"monitor": "cern",
			"replica": "A"
		},
		"downsample": {
			"resolution": 0
		},
		"source": "sidecar"
	}
}

So I've just moved the folder and now I can access old metrics again ;).

However, the store still complains about another block (it's in my first post). I've checked it on S3, and it seems to have data, but the index file is missing:

[root@cephthanos-store tmp]# s3cmd ls s3://prometheus-storage/01CYX8J17RW9EAECCDSV026JD2/
                       DIR   s3://prometheus-storage/01CYX8J17RW9EAECCDSV026JD2/chunks/
2018-12-17 05:06       424   s3://prometheus-storage/01CYX8J17RW9EAECCDSV026JD2/meta.json

and the meta.json:

{
	"version": 1,
	"ulid": "01CYX8J17RW9EAECCDSV026JD2",
	"minTime": 1545012000000,
	"maxTime": 1545019200000,
	"stats": {
		"numSamples": 67483319,
		"numSeries": 279468,
		"numChunks": 562311
	},
	"compaction": {
		"level": 1,
		"sources": [
			"01CYX8J17RW9EAECCDSV026JD2"
		]
	},
	"thanos": {
		"labels": {
			"monitor": "cern",
			"replica": "A"
		},
		"downsample": {
			"resolution": 0
		},
		"source": "sidecar"
	}
}

And the compactor is complaining too:

Dec 19 09:11:16 cephthanos-store systemd: Started Thanos Compact Component.
Dec 19 09:11:16 cephthanos-store thanos: level=info ts=2018-12-19T08:11:16.417040248Z caller=compact.go:228 msg="starting compact node"
Dec 19 09:11:16 cephthanos-store thanos: level=info ts=2018-12-19T08:11:16.417297344Z caller=main.go:244 msg="Listening for metrics" address=0.0.0.0:10903
Dec 19 09:11:16 cephthanos-store thanos: level=info ts=2018-12-19T08:11:16.417289769Z caller=compact.go:811 msg="start sync of metas"
Dec 19 09:11:17 cephthanos-store thanos: level=info ts=2018-12-19T08:11:17.823147031Z caller=compact.go:817 msg="start of GC"
Dec 19 09:11:20 cephthanos-store thanos: level=error ts=2018-12-19T08:11:20.380739137Z caller=main.go:161 msg="running command failed" err="error executing compaction: compaction failed: compaction: gather index issues for block /var/lib/thanos/compactor/compact/0@{monitor=\"cern\",replica=\"A\"}/01CYX8J17RW9EAECCDSV026JD2: open index file: try lock file: open /var/lib/thanos/compactor/compact/0@{monitor=\"cern\",replica=\"A\"}/01CYX8J17RW9EAECCDSV026JD2/index: no such file or directory"

Is it possible to re-create that index file, or should I get rid of the block?

Also, please update to Thanos v0.2.0 or master, because there are fixes for partial upload issues.

Yes, that's one of my new year's wishes :)

Thank you a lot for your great work and support!

Roberto

@bwplotka
Member

I thought that maybe the temporary files that the compactor stores locally were important for the compaction process, and I did not copy them to the new host, but I guess that is not a problem, right?

Tmp files can be removed. They are leftovers from the compactor being restarted in the middle of some job.
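
If you want to clean them up, something like this should do (path taken from your logs; adjust it to whatever --data-dir you use, and only run it with the compactor stopped):

# with the compactor stopped, wipe its local working directory; it will re-sync blocks from the bucket on the next run
rm -rf /var/lib/thanos/compactor/compact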

Please upgrade to v0.2.0, as some of those bugs are fixed. As for the index, you cannot recreate it, because you have essentially lost all the label names and the locations of the encoded samples.
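
If you decide to drop it, that would be roughly (a sketch using the bucket name from your listing; copy the block somewhere first if you want to be extra safe):

# delete the block that lost its index file
s3cmd del --recursive s3://prometheus-storage/01CYX8J17RW9EAECCDSV026JD2/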

I think you are hitting an old issue that is already fixed, from when we were not correctly checking an error: #403, so please update (:

@robvalca
Author

I deleted the block and now everything seems fine. I will update to 0.2.0 ASAP.
Thanks again for the great support :)
