
Cannot access thanos metrics #688

Closed
robvalca opened this issue Dec 18, 2018 · 7 comments

@robvalca

Versions

thanos, version 0.1.0 (branch: HEAD, revision: ebb58c2)
build user: root@393679b8c49c
build date: 20180918-17:02:30
go version: go1.10.3

prometheus, version 2.5.0 (branch: HEAD, revision: 67dc912ac8b24f94a1fc478f352d25179c94ab9b)
build user: root@578ab108d0b9
build date: 20181106-11:40:44
go version: go1.11.1

Our setup

1 x prometheus server + thanos sidecar (8c, 16G RAM)
2 x thanos queriers (4c, 8G RAM)
1 x thanos-store + thanos compactor (8c, 16G RAM)
backend: S3 ceph+radosgw
OS: CentOS 7.6

Issue

While debugging other issues I've realized that lately I cannot access metrics older than 15d (the Prometheus local retention), even though we have a few months (~140G) of metrics in S3 and it worked fine before. From the Thanos dashboard I get the following message when I select a time range of 1w or longer (the metrics are shown, but there is no data older than 15d):

receive series: rpc error: code = Aborted desc = fetch series for block 01CYM85GN9Q7XRA5HZJJSCRARD: add chunk preload: reference sequence 0 out of range

The Thanos store is looping with the following warning:

Dec 18 09:19:36 cephthanos-store thanos[11993]: level=warn ts=2018-12-18T08:19:36.704180131Z caller=bucket.go:240 msg="loading block failed" id=01CYX8J17RW9EAECCDSV026JD2 err="new bucket block: load index cache: download index file: get file: The specified key does not exist."

And the Thanos compactor seems broken:

Dec 18 09:02:53 cephthanos-store thanos[13556]: level=error ts=2018-12-18T08:02:53.638097634Z caller=compact.go:201 msg="critical error detected; halting" err="compaction failed: compaction: compact blocks [/var/lib/thanos/compactor/compact/0@{monitor=\"cern\",replica=\"A\"}/01CYKKJB05D515N47V38JH7KJY /var/lib/thanos/compactor/compact/0@{monitor=\"cern\",replica=\"A\"}/01CYKTE22ZY6W7RK4JK5V9EFE7 /var/lib/thanos/compactor/compact/0@{monitor=\"cern\",replica=\"A\"}/01CYM19SAKRWFWYVYYCMV3SSDM /var/lib/thanos/compactor/compact/0@{monitor=\"cern\",replica=\"A\"}/01CYM85GN9Q7XRA5HZJJSCRARD]: write compaction: chunk 8 not found: reference sequence 0 out of range"

Possible cause

Lately we had problems with the compactor crashing due to OOM, so eventually I ran it from a different (bigger) machine, and this was probably not a good idea :)

I've also upgraded to Prometheus v2.5.0, but everything was working fine after that, so I don't think it is related. I don't have any specific configuration for Thanos or Prometheus, just default values.

Any help/comments on how to unblock the situation? Thank you in advance!

Roberto

@bwplotka
Member

Lately we had problems with the compactor crashing due to OOM, so eventually I ran it from a different (bigger) machine, and this was probably not a good idea :)

Why was that a bad idea? (:

@bwplotka
Member

bwplotka commented Dec 18, 2018

I've also upgraded to Prometheus v2.5.0, but everything was working fine after that,

Well, compaction is done only after some time, and it seems like 01CYM85GN9Q7XRA5HZJJSCRARD was affected during compaction...

Something you can do is move the 01CYM85GN9Q7XRA5HZJJSCRARD block somewhere else (another bucket, or download it and remove it?) and see if that's the only problematic one (then restart the store and compactor).
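
For example with s3cmd, roughly something like this (bucket names are placeholders and I haven't tested this exact invocation, so treat it as a sketch):

# copy the suspicious block to a safe location, then remove it from the bucket Thanos reads
s3cmd cp --recursive s3://<your-bucket>/01CYM85GN9Q7XRA5HZJJSCRARD/ s3://<your-backup-bucket>/01CYM85GN9Q7XRA5HZJJSCRARD/
s3cmd del --recursive s3://<your-bucket>/01CYM85GN9Q7XRA5HZJJSCRARD/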

Can you paste its meta.json and an overview of the 01CYM85GN9Q7XRA5HZJJSCRARD directory here?

Looking into the error message: "reference sequence 0 out of range" is interesting, because it means that somehow this block does not have ANY chunk files. Printing what the 01CYM85GN9Q7XRA5HZJJSCRARD directory has inside would be useful. It looks like a partial/malformed upload to me.
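
Something like this should show what the block contains (again only a sketch, with a placeholder bucket name):

# list the block directory and fetch its meta.json for inspection
s3cmd ls s3://<your-bucket>/01CYM85GN9Q7XRA5HZJJSCRARD/
s3cmd get s3://<your-bucket>/01CYM85GN9Q7XRA5HZJJSCRARD/meta.json /tmp/meta.json
cat /tmp/meta.json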

@bwplotka
Member

Also, please update to Thanos v0.2.0 or master, because there are fixes for partial upload issues.

@bwplotka
Member

Related to #377

@robvalca
Author

Lately we had problems with the compactor crashing due to OOM, so eventually I ran it from a different (bigger) machine, and this was probably not a good idea :)

Why was that a bad idea? (:

I thought that maybe the temporary files that the compactor stores locally were important for the compaction process, and I did not copy them to the new host, but I guess that is not a problem, right?

I've also upgraded to Prometheus v2.5.0, but everything was working fine after that,

Well, compaction is done only after some time, and it seems like 01CYM85GN9Q7XRA5HZJJSCRARD was affected during compaction...

Something you can do is move the 01CYM85GN9Q7XRA5HZJJSCRARD block somewhere else (another bucket, or download it and remove it?) and see if that's the only problematic one (then restart the store and compactor).

Can you paste its meta.json and an overview of the 01CYM85GN9Q7XRA5HZJJSCRARD directory here?

Looking into the error message: "reference sequence 0 out of range" is interesting, because it means that somehow this block does not have ANY chunk files. Printing what the 01CYM85GN9Q7XRA5HZJJSCRARD directory has inside would be useful. It looks like a partial/malformed upload to me.

Yes, the directory seems incomplete in S3; there is no chunks folder:

2018-12-19 07:46  25334736   s3://prometheus-storage-backup/01CYM85GN9Q7XRA5HZJJSCRARD/index
2018-12-19 07:46       424   s3://prometheus-storage-backup/01CYM85GN9Q7XRA5HZJJSCRARD/meta.json

This is the content of the meta.json file:

{
	"version": 1,
	"ulid": "01CYM85GN9Q7XRA5HZJJSCRARD",
	"minTime": 1544709600000,
	"maxTime": 1544716800000,
	"stats": {
		"numSamples": 67046375,
		"numSeries": 277748,
		"numChunks": 558786
	},
	"compaction": {
		"level": 1,
		"sources": [
			"01CYM85GN9Q7XRA5HZJJSCRARD"
		]
	},
	"thanos": {
		"labels": {
			"monitor": "cern",
			"replica": "A"
		},
		"downsample": {
			"resolution": 0
		},
		"source": "sidecar"
	}
}

So I've just moved the folder and now I can access old metrics again ;).

However, the store still complains about another block (it's in my first post). I've checked it on S3, and it seems to have data, but the index file is missing:

[root@cephthanos-store tmp]# s3cmd ls s3://prometheus-storage/01CYX8J17RW9EAECCDSV026JD2/
                       DIR   s3://prometheus-storage/01CYX8J17RW9EAECCDSV026JD2/chunks/
2018-12-17 05:06       424   s3://prometheus-storage/01CYX8J17RW9EAECCDSV026JD2/meta.json

and the meta.json:

{
	"version": 1,
	"ulid": "01CYX8J17RW9EAECCDSV026JD2",
	"minTime": 1545012000000,
	"maxTime": 1545019200000,
	"stats": {
		"numSamples": 67483319,
		"numSeries": 279468,
		"numChunks": 562311
	},
	"compaction": {
		"level": 1,
		"sources": [
			"01CYX8J17RW9EAECCDSV026JD2"
		]
	},
	"thanos": {
		"labels": {
			"monitor": "cern",
			"replica": "A"
		},
		"downsample": {
			"resolution": 0
		},
		"source": "sidecar"
	}
}

And the compactor is complaining too:

Dec 19 09:11:16 cephthanos-store systemd: Started Thanos Compact Component.
Dec 19 09:11:16 cephthanos-store thanos: level=info ts=2018-12-19T08:11:16.417040248Z caller=compact.go:228 msg="starting compact node"
Dec 19 09:11:16 cephthanos-store thanos: level=info ts=2018-12-19T08:11:16.417297344Z caller=main.go:244 msg="Listening for metrics" address=0.0.0.0:10903
Dec 19 09:11:16 cephthanos-store thanos: level=info ts=2018-12-19T08:11:16.417289769Z caller=compact.go:811 msg="start sync of metas"
Dec 19 09:11:17 cephthanos-store thanos: level=info ts=2018-12-19T08:11:17.823147031Z caller=compact.go:817 msg="start of GC"
Dec 19 09:11:20 cephthanos-store thanos: level=error ts=2018-12-19T08:11:20.380739137Z caller=main.go:161 msg="running command failed" err="error executing compaction: compaction failed: compaction: gather index issues for block /var/lib/thanos/compactor/compact/0@{monitor=\"cern\",replica=\"A\"}/01CYX8J17RW9EAECCDSV026JD2: open index file: try lock file: open /var/lib/thanos/compactor/compact/0@{monitor=\"cern\",replica=\"A\"}/01CYX8J17RW9EAECCDSV026JD2/index: no such file or directory"

Is it possible to re-create that index file, or should I get rid of the block?

Also, please update to Thanos v0.2.0 or master, because there are fixes for partial upload issues.

Yes, that's one of my new year's wishes :)

Thank you a lot for your great work and support!

Roberto

@bwplotka
Member

I thought that maybe the temporary files that the compactor stores locally were important for the compaction process, and I did not copy them to the new host, but I guess that is not a problem, right?

Tmp files can be removed. They are leftovers from the compactor being restarted in the middle of some job.
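
If you want to clean them up, something like this should do (path taken from your logs; adjust it to whatever --data-dir you use, and only run it with the compactor stopped):

# with the compactor stopped, wipe its local working directory; it will re-sync blocks from the bucket on the next run
rm -rf /var/lib/thanos/compactor/compact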

Please upgrade to v0.2.0, as some of those bugs are fixed. As for the index, you cannot recreate it, because you have essentially lost all the label names and the locations of the encoded samples.
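
If you decide to drop it, that would be roughly (a sketch using the bucket name from your listing; copy the block somewhere first if you want to be extra safe):

# delete the block that lost its index file
s3cmd del --recursive s3://prometheus-storage/01CYX8J17RW9EAECCDSV026JD2/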

I think you are hitting an old issue that is already fixed, from when we were not correctly checking an error: #403, so please update (:

@robvalca
Author

I deleted the block and now everything seems fine. I will update to 0.2.0 ASAP.
Thanks again for the great support :)
