
store: Thanos store consumes almost 100G of memory at startup #325

Closed
mihailgmihaylov opened this issue May 8, 2018 · 16 comments

@mihailgmihaylov commented May 8, 2018

We have a strange issue with the memory consumption of Thanos Store in our Thanos setup.
Currently, we have a working setup of Prometheus and alertmanager as follows:

  • 3 Prometheus 2.2.1 pods (in a statefulSet) with TSDB as backend. Docker image: quay.io/prometheus/prometheus:v2.2.1. Each Prometheus pod has a thanos sidecar. The docker image of the sidecar is: improbable/thanos:master (image id 2950dff67c9a).
  • 3 Alertmanager v0.15 pods (in a cluster). Docker image: quay.io/prometheus/alertmanager:v0.15.0-rc.1
  • 2 Thanos Query pods.
    The Kubernetes cluster runs on AWS EC2 instances, with S3 object storage as the Thanos storage backend.

I am trying to add the thanos-store component, however there is a very strange issue with its memory consumption.
We started the Prometheus servers from scratch 7 days ago. Each Prometheus pod consumes on average 1G of memory:
[screenshot: prom-beforethanos-7d-data-on-demo-test]

When I start the thanos-store and include it in the Thanos cluster, its memory consumption skyrockets.
It starts from 0, consumes almost all resources on the node, and ends with an OOMKill:
[screenshot: prom-thanos-store-7d-data-on-demo-test]
After dedicating one 122G node solely to the thanos-store, I managed to fit the pod in. After the initial burst, the memory eventually settles at about 60G and stays there:
[screenshot: prom-thanos-store-7d-data-on-demo-test-120gnode]

I changed some options, such as storage.tsdb.retention (from 30d to 24h) and chunk-pool-size (from 2G to 512MB), but without any effect.

Here is some detailed information about our configuration:
Store:

      - name: thanos-store
        image: improbable/thanos:master
        env:
        - name: S3_ACCESS_KEY
          value: '***'
        - name: S3_SECRET_KEY
          value: '***'
        - name: S3_BUCKET
          value: '***'
        - name: S3_ENDPOINT
          value: 's3.eu-west-1.amazonaws.com'
        args:
        - "store"
        - "--log.level=debug"
        - "--tsdb.path=/var/thanos/store"
        - "--cluster.peers=monitoring-thanos-peers:10900"
        - "--chunk-pool-size=512MB"
        - "--index-cache-size=64MB"

Prometheus:

      - name: prometheus
        image: quay.io/prometheus/prometheus:v2.2.1
        args:
          - '--storage.tsdb.retention=24h'
          - '--storage.tsdb.no-lockfile'
          - '--storage.tsdb.path=/prometheus/data'
          - '--storage.tsdb.min-block-duration=2h'
          - '--storage.tsdb.max-block-duration=2h'
          - '--config.file=/etc/prometheus-shared/prometheus.yml'
          - '--web.enable-admin-api'
          - '--web.enable-lifecycle'
          - '--web.route-prefix=/'

Sidecar:

      - name: thanos-sidecar
        image: improbable/thanos:master
        env:
        - name: S3_ACCESS_KEY
          value: '***'
        - name: S3_SECRET_KEY
          value: '***'
        - name: S3_BUCKET
          value: '***'
        - name: S3_ENDPOINT
          value: 's3.eu-west-1.amazonaws.com'
        args:
        - "sidecar"
        - "--tsdb.path=/prometheus/data"
        - "--prometheus.url=http://localhost:9090"
        - "--cluster.peers=monitoring-thanos-peers:10900"
        - "--reloader.config-file=/etc/prometheus/prometheus.yml.tmpl"
        - "--reloader.config-envsubst-file=/etc/prometheus-shared/prometheus.yml"

I cannot see anything in the thanos-sidecar logs that gives us meaningful information:

level=debug ts=2018-05-08T13:11:22.604134111Z caller=cluster.go:117 msg="resolved peers to following addresses" peers=100.108.0.9:10900,100.112.0.7:10900,100.122.0.6:10900,100.122.0.9:10900,100.108.0.4:10900
level=debug ts=2018-05-08T13:11:22.611780542Z caller=store.go:147 msg="initializing bucket store"
level=debug ts=2018-05-08T13:18:16.247390978Z caller=store.go:151 msg="bucket store ready" init_duration=6m53.635610515s
level=info ts=2018-05-08T13:18:16.247636366Z caller=store.go:224 msg="starting store node"
level=debug ts=2018-05-08T13:18:16.252193094Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CCZY1EQKVGH5Y9A7BSTV80NP addr=100.119.0.4:10900
level=debug ts=2018-05-08T13:18:16.258597094Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CCZRRJQX37W64W2V0BYH9QG6 addr=100.108.0.4:10900
level=debug ts=2018-05-08T13:18:16.258640807Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CCZRRQWC61PSNSENQZR4M0M2 addr=100.122.0.9:10900
level=debug ts=2018-05-08T13:18:16.258661307Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CCZWET40SSP0VNGP0PVY247G addr=100.112.0.7:10900
level=debug ts=2018-05-08T13:18:16.258690371Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CCZWPM08E2TMT7A2NQ2KXRGK addr=100.108.0.9:10900
level=debug ts=2018-05-08T13:18:16.258727886Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CCZWXS7TRT7DC9XHQEBXPJCG addr=100.122.0.6:10900
level=debug ts=2018-05-08T13:18:16.268763726Z caller=cluster.go:201 component=cluster msg="joined cluster" peers=6

The sidecar logs do show warnings, but I found out that this is because Prometheus takes about 1 minute to load. While Prometheus is loading, the sidecar cannot get the configuration from api/v1/status/config and emits these logs:

level=info ts=2018-05-08T12:43:43.105302282Z caller=sidecar.go:315 msg="starting sidecar" peer=
level=info ts=2018-05-08T12:43:43.105425435Z caller=reloader.go:77 component=reloader msg="started watching config file for changes" in=/etc/prometheus/prometheus.yml.tmpl out=/etc/prometheus-shared/prometheus.yml
level=warn ts=2018-05-08T12:43:43.112071103Z caller=sidecar.go:144 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="decode response: invalid character 'S' looking for beginning of value"
level=warn ts=2018-05-08T12:43:45.106530306Z caller=sidecar.go:144 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="decode response: invalid character 'S' looking for beginning of value"
level=warn ts=2018-05-08T12:43:47.106532299Z caller=sidecar.go:144 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="decode response: invalid character 'S' looking for beginning of value"
level=warn ts=2018-05-08T12:43:49.106570044Z caller=sidecar.go:144 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="decode response: invalid character 'S' looking for beginning of value"
level=warn ts=2018-05-08T12:43:51.168940045Z caller=sidecar.go:144 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="decode response: invalid character 'S' looking for beginning of value"
level=warn ts=2018-05-08T12:43:53.168366187Z caller=sidecar.go:144 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="decode response: invalid character 'S' looking for beginning of value"
level=warn ts=2018-05-08T12:43:55.10644909Z caller=sidecar.go:144 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="decode response: invalid character 'S' looking for beginning of value"
level=warn ts=2018-05-08T12:43:57.168280363Z caller=sidecar.go:144 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="decode response: invalid character 'S' looking for beginning of value"
level=warn ts=2018-05-08T12:43:59.106533152Z caller=sidecar.go:144 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="decode response: invalid character 'S' looking for beginning of value"
level=info ts=2018-05-08T12:44:00.908563252Z caller=reloader.go:180 component=reloader msg="Prometheus reload triggered" cfg_in=/etc/prometheus/prometheus.yml.tmpl cfg_out=/etc/prometheus-shared/prometheus.yml rule_dir=
level=info ts=2018-05-08T13:00:13.188416171Z caller=shipper.go:185 msg="upload new block" id=01CCZXCMDGN4AM8G8MTB6F9DX7
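
That decode error is simply what encoding/json reports when the response body is not JSON at all - presumably Prometheus serves a plain-text page starting with 'S' (something like "Service Unavailable") while it is still starting up. A minimal Go reproduction:

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Decoding a non-JSON body that starts with 'S' produces exactly the
	// error seen in the sidecar logs above.
	var v map[string]interface{}
	err := json.Unmarshal([]byte("Service Unavailable"), &v)
	fmt.Println(err) // invalid character 'S' looking for beginning of value
}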

Finally, the TSDB samples are not that many, so this memory overuse is definitely not OK:
[screenshot: prom-thanos-store-7d-data-on-demo-test-tsdb]

Any help would be appreciated!
I am very eager to make it work since it is a great tool!

@bwplotka added the bug label May 8, 2018
@bwplotka (Member) commented May 8, 2018

  1. Ok, first of all: TODO for me -> publish master-<commit-sha> tags and avoid latest or master images. Otherwise, I have no idea how to track an image ID like 2950dff67c9a ): Approximately when was it pulled? (:

  2. Prometheus retention should not matter for this issue

  3. You have a really small chunk pool -> basically, index cache size (could be ~1GB) + chunk pool size (whatever you have spare) + some 1-2GB buffer = the maximum memory the store should use.

I would first of all change the index cache; 64MB might be way too low. Give it 1GB. For the chunk pool you can give 20GB, which should be enough. Both of these numbers should limit the store's memory consumption.. maybe an extremely small value like 64MB results in something unexpected? (See the example flags at the end of this comment.)

  4. (Important) Before changing anything, can you get a dump of the heap and goroutines? Just run go tool pprof -symbolize=remote -alloc_space thanos "http://<thanos-store>:<store http port>/debug/pprof/heap" and the top 5 command. Same for debug/pprof/goroutine. This will give us the exact place of the memory usage or goroutine leak.

  5. You did not include the thanos-store logs, which are the most interesting here. Can we have some? (:

  6. What about the go_goroutines graph for thanos-store? How does it look? Probably skyrocketing as well?

BTW, thanks for so much input! It sounds quite bad; hopefully we find the root cause soon. (:
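
For example, keeping your other flags and just bumping the caches (these sizes are only a suggestion; tune them to what you have spare):

        args:
        - "store"
        - "--log.level=debug"
        - "--tsdb.path=/var/thanos/store"
        - "--cluster.peers=monitoring-thanos-peers:10900"
        - "--chunk-pool-size=20GB"
        - "--index-cache-size=1GB"

With the rule of thumb above, that gives roughly 1GB + 20GB + a 1-2GB buffer, so around 23GB as the expected ceiling for the store.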

@mihailgmihaylov (Author) commented May 9, 2018

Thank you for the fast reply!

  1. Initially, I worked with the latest Docker image, but I hit an issue that was fixed in master only several weeks ago, so I had to switch to the master image. I pulled the image yesterday, 08.05, at approximately 11am CET.
  2. I changed the retention to 24h

3/4. Before changing the cache and chunk pool size, I did a dump of the heap and goroutines. Initially, I tried with -alloc_space thanos in the command as you said, but I got thanos: open thanos: no such file or directory in the output:

$ go tool pprof -symbolize=remote -alloc_space thanos "http://thanos-store:80/debug/pprof/heap"
Fetching profile over HTTP from http://thanos-store.demo-test.receipt-labs.com:80/debug/pprof/heap
thanos: open thanos: no such file or directory
Fetched 1 source profiles out of 2
Saved profile in /Users/mihailgmihaylov/pprof/pprof.thanos.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz
File: thanos
Type: alloc_space
Time: May 9, 2018 at 10:21am (EEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 5
Showing nodes accounting for 361.58GB, 81.26% of 444.98GB total
Dropped 625 nodes (cum <= 2.22GB)
Showing top 5 nodes out of 62
      flat  flat%   sum%        cum   cum%
  184.88GB 41.55% 41.55%   184.88GB 41.55%  encoding/json.(*Decoder).refill
   63.05GB 14.17% 55.72%    63.05GB 14.17%  reflect.unsafe_NewArray
   44.97GB 10.11% 65.82%    44.97GB 10.11%  github.com/improbable-eng/thanos/pkg/block.ReadIndexCache.func1
   42.42GB  9.53% 75.36%    43.67GB  9.81%  encoding/json.(*decodeState).literalStore
   26.27GB  5.90% 81.26%    26.27GB  5.90%  reflect.mapassign
(pprof)

So I ditched the alloc_space option:

$ go tool pprof -symbolize=remote "http://thanos-store:80/debug/pprof/heap"
Fetching profile over HTTP from http://thanos-store:80/debug/pprof/heap
Saved profile in /Users/mihailgmihaylov/pprof/pprof.thanos.alloc_objects.alloc_space.inuse_objects.inuse_space.005.pb.gz
File: thanos
Type: inuse_space
Time: May 9, 2018 at 10:55am (EEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 5
Showing nodes accounting for 50324.67MB, 99.79% of 50430.30MB total
Dropped 114 nodes (cum <= 252.15MB)
Showing top 5 nodes out of 20
      flat  flat%   sum%        cum   cum%
19491.33MB 38.65% 38.65% 50374.16MB 99.89%  github.com/improbable-eng/thanos/pkg/block.ReadIndexCache
12482.44MB 24.75% 63.40% 12482.44MB 24.75%  encoding/json.(*decodeState).literalStore
12212.84MB 24.22% 87.62% 12212.84MB 24.22%  reflect.mapassign
 5375.55MB 10.66% 98.28%  5375.55MB 10.66%  reflect.unsafe_NewArray
  762.51MB  1.51% 99.79%   762.51MB  1.51%  reflect.unsafe_New
(pprof)
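
(Side note: the thanos: open thanos: no such file or directory message above is presumably just pprof failing to open a local binary named thanos, passed as a positional argument, rather than anything to do with -alloc_space; since -symbolize=remote is used, the binary argument can simply be dropped, e.g.:)

go tool pprof -symbolize=remote -alloc_space "http://thanos-store:80/debug/pprof/heap"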

top 5 (debug/pprof/goroutines):

$ go tool pprof -symbolize=remote "http://thanos-store:80/debug/pprof/goroutine"
Fetching profile over HTTP from http://thanos-store:80/debug/pprof/goroutine
Saved profile in /Users/mihailgmihaylov/pprof/pprof.thanos.goroutine.002.pb.gz
File: thanos
Type: goroutine
Time: May 9, 2018 at 10:54am (EEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 5
Showing nodes accounting for 26, 100% of 26 total
Showing top 5 nodes out of 69
      flat  flat%   sum%        cum   cum%
        24 92.31% 92.31%         24 92.31%  runtime.gopark
         1  3.85% 96.15%          1  3.85%  runtime.notetsleepg
         1  3.85%   100%          1  3.85%  runtime/pprof.writeRuntimeProfile
         0     0%   100%          2  7.69%  bufio.(*Reader).Read
         0     0%   100%          1  3.85%  github.com/improbable-eng/thanos/pkg/cluster.warnIfAlone
  5. Thanos-store logs:
level=debug ts=2018-05-08T13:11:22.604134111Z caller=cluster.go:117 msg="resolved peers to following addresses" peers=100.108.0.9:10900,100.112.0.7:10900,100.122.0.6:10900,100.122.0.9:10900,100.108.0.4:10900
level=debug ts=2018-05-08T13:11:22.611780542Z caller=store.go:147 msg="initializing bucket store"
level=debug ts=2018-05-08T13:18:16.247390978Z caller=store.go:151 msg="bucket store ready" init_duration=6m53.635610515s
level=info ts=2018-05-08T13:18:16.247636366Z caller=store.go:224 msg="starting store node"
level=debug ts=2018-05-08T13:18:16.252193094Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CCZY1EQKVGH5Y9A7BSTV80NP addr=100.119.0.4:10900
level=debug ts=2018-05-08T13:18:16.258597094Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CCZRRJQX37W64W2V0BYH9QG6 addr=100.108.0.4:10900
level=debug ts=2018-05-08T13:18:16.258640807Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CCZRRQWC61PSNSENQZR4M0M2 addr=100.122.0.9:10900
level=debug ts=2018-05-08T13:18:16.258661307Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CCZWET40SSP0VNGP0PVY247G addr=100.112.0.7:10900
level=debug ts=2018-05-08T13:18:16.258690371Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CCZWPM08E2TMT7A2NQ2KXRGK addr=100.108.0.9:10900
level=debug ts=2018-05-08T13:18:16.258727886Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CCZWXS7TRT7DC9XHQEBXPJCG addr=100.122.0.6:10900
level=debug ts=2018-05-08T13:18:16.268763726Z caller=cluster.go:201 component=cluster msg="joined cluster" peers=6
level=debug ts=2018-05-08T14:53:08.114805896Z caller=delegate.go:88 component=cluster received=NotifyLeave node=01CCZWET40SSP0VNGP0PVY247G addr=100.112.0.7:10900
level=debug ts=2018-05-08T14:53:37.014178887Z caller=delegate.go:88 component=cluster received=NotifyLeave node=01CCZWPM08E2TMT7A2NQ2KXRGK addr=100.108.0.9:10900
level=debug ts=2018-05-08T14:55:08.992652564Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CD03YSNT5B0C1MK9WFS06ZC0 addr=100.108.0.8:10900
level=debug ts=2018-05-08T14:58:31.459201575Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CD0451BRBRNCZXHVKP6C82HS addr=100.112.0.7:10900
level=warn ts=2018-05-08T15:00:18.997211577Z caller=bucket.go:233 msg="loading block failed" id=01CD048BQ8VA7PY7K2ET378QQW err="new bucket block: load meta: download meta.json: copy object to file: The specified key does not exist."
level=debug ts=2018-05-08T19:58:06.712039888Z caller=bucket.go:754 msg="series query processed" stats="&{blocksQueried:3 postingsTouched:6 postingsTouchedSizeSum:12484 postingsFetched:6 postingsFetchedSizeSum:12484 postingsFetchCount:6 postingsFetchDurationSum:841965794 seriesTouched:6 seriesTouchedSizeSum:359 seriesFetched:6 seriesFetchedSizeSum:279968 seriesFetchCount:3 seriesFetchDurationSum:191563543 chunksTouched:6 chunksTouchedSizeSum:971 chunksFetched:6 chunksFetchedSizeSum:298380 chunksFetchCount:3 chunksFetchDurationSum:332463129 getAllDuration:458034833 mergedSeriesCount:6 mergedChunksCount:6 mergeDuration:53300}"
level=warn ts=2018-05-09T01:00:18.445088563Z caller=bucket.go:233 msg="loading block failed" id=01CD16JZZ5C6E80ZV5ZENWAM29 err="new bucket block: load meta: download meta.json: copy object to file: The specified key does not exist."
  6. Strangely, the goroutines are not that many. I used the metric from node_exporter and it showed a variation from 15 to 30. Also, this is /debug/pprof:
    [screenshot: screen shot 2018-05-09 - goroutines]

Also, I think it is worth mentioning that I am not setting up Thanos from scratch but actually migrating some data from when Prometheus was running alone. Furthermore, today I ran the thanos compactor and apparently I have managed to create duplicate records in the S3 store:
msg="running command failed" err="compaction: pre compaction overlap check: overlaps found while gathering blocks"
I am now cleaning them up, but I cannot see how this is involved in the memory issue.

Finally, when I changed chunk-pool-size and index-cache-size as you advised, there was almost no change in the top 5 for heap and goroutine. The number of heap objects (in debug/pprof/) was significantly lower (1037), but it started building up again.

@bwplotka (Member) commented May 9, 2018

Thanks!

Can we do one more thing with pprof? Let's grab the heap again, run the pdf command, and look at the full overview, please. That would be helpful.
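
For example (host and port are placeholders), something like:

$ go tool pprof -symbolize=remote "http://<thanos-store>:<store http port>/debug/pprof/heap"
(pprof) pdf

The pdf command should write the graph to a file like profile001.pdf.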

@bwplotka (Member) commented May 9, 2018

The numbers are insane 0.o

@bwplotka (Member) commented May 9, 2018

This is weird:
level=warn ts=2018-05-08T15:00:18.997211577Z caller=bucket.go:233 msg="loading block failed" id=01CD048BQ8VA7PY7K2ET378QQW err="new bucket block: load meta: download meta.json: copy object to file: The specified key does not exist."

Looking now into whether the above error can cause some memory leak.

@mihailgmihaylov (Author) commented May 9, 2018

Here is the pdf of the heap:
profile001.pdf

Yes, I noticed that issue with the key, too. I found the block in S3, and it is there with a meta.json that looks fine. However, since I am now running the compactor and it fails every 1h with overlaps found while gathering blocks, I decided that this may somehow be connected to that issue.

I may have some issues with my data, so I thought of dumping production prometheus data to staging and setting up Thanos, but I postponed it for the time being.

@bwplotka (Member) commented May 9, 2018

I wonder if that is related to the 64MB cache limit.

I will look closer into the code soon; thanks for all the details. This should be enough for now.

@bwplotka (Member) commented May 9, 2018

First of all the meta.json error:

  • Minio-go gives us The specified key does not exist. on Get.Read(byte) while downloading the meta.json file from S3. It corresponds to:
if objectName != "" {
	errResp = ErrorResponse{
		StatusCode: resp.StatusCode,
		Code:       "NoSuchKey",
		Message:    "The specified key does not exist.",
		BucketName: bucketName,
		Key:        objectName,
	}
}

It seems that there is simply no 01CD048BQ8VA7PY7K2ET378QQW/meta.json, but there is a 01CD048BQ8VA7PY7K2ET378QQW directory in your S3 bucket.

  1. Can you confirm that?
  2. Is there any dir that actually has a meta.json?
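
For example (bucket name is a placeholder), something like this with the AWS CLI would show what is actually under that ULID:

$ aws s3 ls s3://<your-bucket>/01CD048BQ8VA7PY7K2ET378QQW/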

All of this indicates that we had some partial uploads, which is not good. What is the compactor saying exactly? It might indicate the same issue. Only the thanos sidecars or the compactor upload blocks, so we need all the logs we have, especially for the thanos-sidecars.. Maybe some thanos-sidecar had trouble while uploading a block?

This, however, does not explain the memory leak yet. The quickest way to tell might be to add more integration tests with a mocked bucket store. I can look at that tomorrow; nothing obvious from the code.

The memory leak is in ReadIndexCache, so after meta.json is loaded. Maybe a partially uploaded index file causes this, if that is what happened? Also, the code suggests that the small index cache does not matter here either.
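
To make that concrete (a minimal, hypothetical sketch, not the actual Thanos code or cache format), decoding a large JSON index-cache file in one go is exactly the kind of thing that shows up as encoding/json.(*Decoder).refill, reflect.unsafe_NewArray and reflect.mapassign in the heap profiles above:

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// indexCache mimics a per-block JSON cache mapping symbols and label values;
// the field names are illustrative only, not the real format.
type indexCache struct {
	Symbols     map[string]string   `json:"symbols"`
	LabelValues map[string][]string `json:"label_values"`
}

func main() {
	f, err := os.Open("index.cache.json") // hypothetical file name
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	var c indexCache
	// The decoder buffers the input (refill) and allocates every map entry and
	// slice element up front, so peak memory tracks the decoded size of the
	// file, regardless of the configured --index-cache-size.
	if err := json.NewDecoder(f).Decode(&c); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("symbols=%d, label names=%d\n", len(c.Symbols), len(c.LabelValues))
}

With many blocks loaded at startup, those allocations add up quickly.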

@mihailgmihaylov (Author) commented:

As far as meta.json is concerned, there is a 01CD048BQ8VA7PY7K2ET378QQW/meta.json and there is a meta.json in every block dir in the S3 bucket.
The content of the file looks fine as well:

{
	"version": 1,
	"ulid": "01CD16JZZ5C6E80ZV5ZENWAM29",
	"minTime": 1525816800000,
	"maxTime": 1525824000000,
	"stats": {
		"numSamples": 34724882,
		"numSeries": 145225,
		"numChunks": 289847
	},
	"compaction": {
		"level": 1,
		"sources": [
			"01CD16JZZ5C6E80ZV5ZENWAM29"
		]
	},
	"thanos": {
		"labels": {
			"monitor": "prometheus",
			"replica": "monitoring-prometheus-1"
		},
		"downsample": {
			"resolution": 0
		}
	}
}

Also, here is how the compactor fails after about 40 minutes of work:

level=error ts=2018-05-09T15:13:36.413829592Z caller=main.go:147 msg="running command failed" err="compaction: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1524830400000, maxt: 1524832200000, range: 30m0s, blocks: 13663]: <ulid: 01CCDH9041A4DKYF04VY3VASFH, mint: 1524830400000, maxt: 1524837600000, range: 2h0m0s>, <ulid: 01CC47S7XHE61BCBWQJTHFHZFB, mint: 1524830400000, maxt: 1524837600000, range: 2h0m0s>, <ulid: 01CCJYQYMXFB1QA4DDXV0EZAJJ, mint: 1524830400000, maxt: 1524837600000, range: 2h0m0s>, <ulid: 01CC3ZMV0AS32MM5EANW3PW8XP, mint: 1524830400000, maxt: 1524837600000, range: 2h0m0s>, <ulid: 01CC43TN1EY7Q1VN0GDCZ5P6WS, mint: 1524830400000, maxt: 1524837600000, range: 2h0m0s>, 
....

For now I have dropped the compactor cronjob.

I would agree, though, that the sidecar or the compactor somehow messed up the data, because these messages are far from normal, but I don't know how.

I made an experiment and pointed the store and sidecars at an empty bucket, and everything works fine (there is no old data and the new blocks are uploaded fine). However, I don't know whether the spikes will come back once there are a lot of indexes.

I started backing up the production Prometheus EBS volumes to try to set up a Prometheus server with that data on staging. Then I will try to run the sidecar containers and afterwards the store. I will monitor for upload errors and other issues and get back to you with the results.

Once again, I very much appreciate the help!

@bwplotka (Member) commented May 10, 2018

I think we have two separate issues here.

One is some memory leak when a partial upload happens (for example, if the index is only uploaded halfway, or something like that) - this needs to be investigated; I am now writing some local tests that will help confirm it.

The second is scary, because if there was no upload error on the sidecar/compactor and meta.json was eventually uploaded, we might be hitting this issue: #298, i.e. S3 not really being strongly consistent for write-then-read. (In theory it is, with some caveats, but rereading about this, we might actually be hitting those caveats.) This means we need a solution for this as well; I think we need to invest some time in it too.

@mihailgmihaylov (Author) commented:

Today I did a bunch of tests with 3 months of data.
I could not reproduce the memory leak issue, but I could not migrate my 3 months of data to S3 either.
That led me to the question: can Thanos migrate old data?

I read in the description that the sidecar backs up data, but does it do so for legacy blocks? Like in my case, where I had a retention interval of 90d: I deployed the thanos sidecar but did not change the retention, so that the data would not be removed. Then I deployed the thanos store and switched to 24h retention, but no data was uploaded; it was just deleted.

That raises the question of how I managed to get all those blocks into S3 last time. I remember that I made several changes at once - changed the retention to 24h and changed min-block-duration and max-block-duration to very small values.

I continue with my checks and will try to reproduce the issue.

@bwplotka (Member) commented:

Yup, data migration is a valid use case and we need to figure out how to do it safely: #206

@bwplotka self-assigned this May 11, 2018
@mihailgmihaylov (Author) commented:

I finally managed to do it!
Having figured out that the memory leak is most probably caused by corrupted or incorrectly uploaded data in S3, yesterday I took the time to babysit the upload.

I manually edited the meta.json of each block in Prometheus, setting the compaction level to 1. Then I deleted the blocks from thanos.shipper.json one by one and made sure that they were consistently uploaded to S3.
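
For illustration (just a sketch of the edit I made, not an official migration procedure; the original level here is only an example), the compaction section of an already locally-compacted block went from something like

	"compaction": {
		"level": 3,
		...
	},

to

	"compaction": {
		"level": 1,
		...
	},

presumably because the sidecar only ships blocks that have not been compacted yet.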

Now all the data is in the block storage. I switched the Prometheus retention period back to 24h, thus pruning the local data. The Thanos store now occupies a negligible 2G of memory (which is actually saved from the Prometheus memory consumption).

I will perform a few other tests before rolling this out to production, but it looks very promising.
Although the memory leak is not explained, I think that in standard day-to-day operation this data corruption is not an issue.

Thank you so much for the help!

@bwplotka (Member) commented:

Wow, yeah you handled data migration perfectly, awesome!

So now the store node can access the old data just fine? Without any memory leak?

Added more tests to make sure all objstore implementations are equal: #327

Any chance you could check out the branch and run the tests while having the S3 env vars exported? go test ./pkg/... -v ^^ This will run all tests against a tmp bucket. No worries if you don't have time for it; I will set up something later.
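
For example (assuming the tests pick up the same S3_* variables used in the manifests above; the branch name is a placeholder):

$ git checkout <branch>
$ export S3_ACCESS_KEY=*** S3_SECRET_KEY=*** S3_BUCKET=*** S3_ENDPOINT=s3.eu-west-1.amazonaws.com
$ go test ./pkg/... -v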

@mihailgmihaylov (Author) commented:

Yes, no memory leak issues now - the Thanos store occupies about 700MB of memory, and when I load it with heavy queries it tops out at around 2G, which is very efficient. When this happened in Prometheus, we had an OOMKill.

I have a snapshot of the EBS volumes and the S3 bucket with the broken setup.
I will try to reproduce it again, run the tests, and get back to you.

@bwplotka (Member) commented:

This seems to be resolved.
