Help with thanos on huge prometheus environment #569
Thank you for this write-up! So many details <3 Some initial questions:
Are you describing your current setup? If yes, which of those Prometheuses are actually uploading to S3? All of them? And do they have 30 days of local retention as well? Is local compaction enabled?
Awesome, super useful data. Sounds like a valid bug to me, let's think about how to improve this.
Do you mean for query purposes? Do you have numbers for how much it takes during idle time?
Interesting observation, I wonder if that hits downsampled data. The panic itself has to be fixed; I did not have time to dig into it properly.
Nice finding. I feel it is because you have lots of series; we cannot downsample that dimension. Are those numbers for the biggest 2w blocks? Re 4: yea, Trickster is not ideal - but have you looked at the query distribution, where the latency comes from? Have you looked at the max concurrent queries flag that is available for thanos-query as well as Prometheus? If we are missing some instrumentation there, we need to add that as well. We actually hit the same issue recently; upping the max concurrent queries helped a lot. Also we fixed this: #563. Thanks for writing it up in detail, we really appreciate the feedback. We also appreciate contributions if you want to improve this quicker (((: Cheers!
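(For reference, a minimal sketch of the concurrency flags being referred to; the values are illustrative and exact flag names may differ between versions:)

```sh
# Thanos querier: cap the number of queries evaluated concurrently (illustrative value).
thanos query \
  --query.max-concurrent=40 \
  --store=<store-api-address>

# Prometheus: the equivalent knob on the Prometheus side.
prometheus \
  --query.max-concurrency=40 \
  --config.file=prometheus.yml
```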
Thank you for the quick response )
Yes, they all upload to S3 because they have different types of metrics, and only in S3 are they all in one place.
I think it's possible to use a tmp prefix for a new, not fully created data block in S3 and then rename it, but I'm not sure if that's possible with S3. And this does not fix the situation when old compacted blocks are deleted by the compactor and not synced properly from S3, but maybe that's a different problem, not sure.
Not only for queries; mostly it consumes the most memory during initialization of the S3 bucket, where we have 900GB of data blocks. During idle it mostly uses 160-200GB with this bucket.
This was using data that was not downsampled by Thanos; it was uploaded by hand from Prometheus just to store it in S3. But sometimes we have this problem with small data that was already compacted and downsampled by Thanos; not sure yet why it occurs.
No, these blocks have data for only 2 days. These blocks were compacted from 2h blocks by the compactor, and then on the second day they started being downsampled, but the size is still almost the same. I just wonder if this is normal, because I believed it would be much smaller than the original block.
Yes, I tried to use this option and increased it to 300 concurrent queries, but it looks like this problem is because of the huge number of data sources and the connection speed. I will try to run query and store with some of the Prometheus servers on one server to test if this improves the query speed.
So are you saying that your Grafana -> thanos-query asks
Worth grabbing some traces maybe?
Sorry, I was not clear. Before Thanos we used local compaction by Prometheus itself with max.tsdb=1d and min.tsdb=2h. Before running Thanos we tested whether we could upload all existing data blocks compacted by Prometheus. During init of the store component, Thanos consumed nearly 200-250GB of RAM to load 900GB of metrics. This setup used only one statically configured store on the query component, and this store was S3. After this we tested whether we could use these metrics by running only Thanos. It worked well, but sometimes it fails when running queries bigger than a 30-day interval; maybe it fails because of other things, not sure.
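(As an aside, the usual sidecar setup keeps local compaction disabled by pinning min and max block duration to 2h; a rough sketch of what that looks like, where paths and URLs are placeholders and flag names may vary slightly by version:)

```sh
# Prometheus with local compaction effectively disabled (2h blocks only).
prometheus \
  --config.file=prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.retention=30d \
  --storage.tsdb.min-block-duration=2h \
  --storage.tsdb.max-block-duration=2h

# Thanos sidecar uploading those 2h blocks to object storage.
thanos sidecar \
  --tsdb.path=/prometheus \
  --prometheus.url=http://localhost:9090 \
  --objstore.config-file=bucket.yml
```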
Tested Prometheus and the store component in AWS with an S3 store and it's working well almost all the time, so it looks like it's a slow network issue; will close this issue for now. Also, one last question, @bwplotka: does it make sense to add local storage (block device) for Thanos that would be synced from S3 if some speedup is needed?
All of those issues are fine, or what exactly? (: Just want to make sure all is covered.
Can you be more specific? For which Thanos component? The sidecar?
For issue 1 I created a ticket: #564
I mean all components. I deployed Prometheus with Thanos store and query in AWS and it worked fast and stable.
Nice! For context, where were you running before, when you were having these issues?
That makes sense only for a small number of blocks. The main advantage of using object storage is that it is cheap and reliable. Downloading ALL blocks in the store gateway (when only a very small % will actually be queried) is a waste of your disk and time. Also, it is not cheap to store petabytes of data on your local storage (: and you can store that much data on object storage just fine, with relatively quick access.
OK, I think it's time to close this issue as it's not really actionable, plus very old. We have improved many things over the last year and we are rewriting store GW block loading in #1471, so things are changing here (: Let's add an issue / notes to an existing issue for each item separately if it's still problematic. Let me know if that makes sense!
FYI, the following article could help with better understanding Thanos issues on a large-scale Prometheus setup.
I don't think there is anything wrong with running Thanos on a huge Prometheus environment. If anything, Thanos makes it much easier. The main reason for this issue looked like "high memory consumption". The issue was reported a long time ago, for the v0.1.0 version of Thanos (!). Plus we are working on improving it even more here: #1471 (: I also wouldn't personally suggest running remote-write streaming for a high volume of data, but it's definitely worth knowing your options. Thanos now supports remote write as well: https://thanos.io/proposals/201812_thanos-remote-receive.md/
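(For anyone who wants to explore that remote-write option, a minimal sketch of the Prometheus side might look like the following; the receive endpoint URL is a placeholder, and the full receive setup is described in the linked proposal:)

```yaml
# prometheus.yml fragment - hypothetical Thanos receive endpoint.
remote_write:
  - url: "http://thanos-receive.example.com:19291/api/v1/receive"
    queue_config:
      max_samples_per_send: 2000  # worth tuning for high-volume setups
```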
We need your help, and it will be much appreciated if you could look at these problems and give some advice on how to better build an environment using Thanos and our infrastructure, or maybe how to properly configure it. We can give any type of logs or info you need.
There are several problems we would like to address with Thanos:
We use Thanos with 5 Prometheus servers with the following structure:
We have several problems after almost one week of testing with production metrics:
We tried to play with the configuration parameters indexCacheSize and chunkPoolSize, but no luck, it still fails sometimes. As I understand it, it fails mostly when we try to run a query for metrics over a long time range, more than 30 days.
(Original block: 01CSF730JSJR3Z9EJXXFZTJTR1, compacted block: 01CSGX47CZCX85NK622SSWQ1CZ.)
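(For illustration, those parameters correspond to store gateway flags along these lines; the values here are hypothetical and flag names may differ between Thanos versions:)

```sh
# Store gateway with explicit index cache and chunk pool sizes (hypothetical values).
thanos store \
  --data-dir=/var/thanos/store \
  --objstore.config-file=bucket.yml \
  --index-cache-size=2GB \
  --chunk-pool-size=6GB
```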
In order to speed up the process we tried to use Trickster as a cache between Grafana and Thanos, and it helps a lot. It is still 2+ times slower, but there is another issue: if one of the stores is down, it will cache a wrong result and will not overwrite it when the store is up again. This issue is more for the proxy authors. This information is just to give you an idea of why solving these problems blocks us, and why simple caching with Trickster did not work out.
Having all these problems, we are blocked from moving on to production, as we anticipate many more problems when many people start to query and run Grafana dashboards simultaneously.