receive: high cpu when upgrading from 0.12.2 with old data #2793

sepich · 2020-06-23T06:53:09Z

Thanos, Prometheus and Golang version used:
thanos, version 0.13.0 (branch: HEAD, revision: adf6fac)
build user: circleci@b8cd18f8b553
build date: 20200622-10:04:50
go version: go1.14.2

Object Storage Provider:
GCP

What happened:
We're using 3 thanos-receive v0.12.2 pods running with --receive.replication-factor=3.
At 3:30 i've restarted one pod (green line) as v0.13.0, and it's cpu usage doubled:

Memory usage is 5-10% higher, which is fine.
Here is another graph, from node where pod has been running:

What you expected to happen:
Statistically negligible resource usage change between v0.12.2 and v0.13.0 for receive, as for other thanos components.

How to reproduce it (as minimally and precisely as possible):
We run receive with such args:

          args:
          - receive
          - |
            --objstore.config=type: GCS
            config:
              bucket: jb-thanos
          - --tsdb.path=/data
          - --tsdb.retention=1d
          - --tsdb.min-block-duration=30m  # have to be >2x(prometheus side 15m), https://github.com/thanos-io/thanos/issues/2114
          - --tsdb.max-block-duration=30m
          - --label=replica="$(NAME)"
          - --receive.local-endpoint=$(NAME).thanos-receive.monitoring.svc.cluster.local:10901
          - --receive.hashrings-file=/cfg/hashrings.json
          - --receive.replication-factor=3

Full logs to relevant components:
These are new events, which were not in v0.12.2. Also, they written each ~15sec comparing to ~15min in prometheus v0.19.0 with almost the same settings for tsdb:

Logs

level=info ts=2020-06-23T00:37:15.498358444Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=308.339725ms
level=info ts=2020-06-23T00:37:22.528425096Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=287.226708ms
level=info ts=2020-06-23T00:37:25.321693858Z caller=head.go:734 component=receive tenant=default-tenant component=tsdb msg="WAL checkpoint complete" first=51 last=52 duration=2.793211656s
level=info ts=2020-06-23T00:37:34.117011112Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=305.063766ms
level=info ts=2020-06-23T00:37:42.249223736Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=353.091759ms
level=info ts=2020-06-23T00:37:45.317240183Z caller=head.go:734 component=receive tenant=default-tenant component=tsdb msg="WAL checkpoint complete" first=53 last=54 duration=3.067957396s

Anything else we need to know:

The text was updated successfully, but these errors were encountered:

brancz · 2020-06-23T13:52:41Z

could you take a cpu profile of this?

sepich · 2020-06-23T17:13:56Z

Please find 30s cpu profiles attached, they were taken almost at the same time:
Archive.zip

bwplotka · 2020-06-23T18:32:38Z

Thanks, looks like GC is the current bottleneck:

DIFF 0.12.2 vs 0.13.0:
profile001.pdf

bwplotka · 2020-06-29T13:14:04Z

We can repro this on our side, looking into this more (:

sepich · 2020-06-29T14:15:55Z

JFYI,
i've backported #2621, #2679 to v0.12.2 and it does not lead to cpu usage increase.

bwplotka · 2020-06-29T14:23:50Z

I think it's multiTSDB so #2012

bwplotka · 2020-06-29T17:05:47Z

It's not particularly in v0.13.0, but this code does not help for sure: https://github.com/prometheus/prometheus/blob/b788986717e1597452ca25e5219510bb787165c7/tsdb/db.go#L812

TSDB we use, forces GC multiple times. It might make sense for single TSDB, but not really for multiple.

sepich · 2020-06-30T08:41:26Z

they written each ~15sec comparing to ~15min in prometheus v2.19.0 with almost the same settings for tsdb:

Turns out this high cpu usage happens when i'm start thanos-receive v0.13.0 on non-empty tsdb folder from v0.12.2 (which has many subfolders like 01EC0YX4APXB1MGA39DWFTS96C, etc)
This leads to each subfolder being converted to separate tsdb:

...
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1KGAFNHYWZ6Z8CPMZ8TQQ6"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1HSC4SJ9KSZRYENXX9PS18"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1G2E84CNBJ3NFY7P4PE5RW"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1EBGTBP4HRW2WGG9KEXN7P"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1CMKR2521YKPS1283WGBKJ"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1AXMY36Z98AQ64QQRD10MP"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC196PXDRFDA39FW476KR7SJ"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC17FSYJY8AWTYXCP3QW9J9W"}
...

And then log is full of component=tsdb msg="Head GC completed" events.
If I wipe tsdb folder before starting of v0.13.0 - then it consumes the same cpu as v0.12.2

bwplotka · 2020-06-30T17:24:32Z

Also #2823

bwplotka · 2020-06-30T17:28:13Z

This looks like something we experienced as well, but not when I recall this we wiped the dir indeed.

It's migration issue from non-multiTSDB to multiTSDB.

This can be easily repro in unit test when porting old receive. Let's fix it, might be annoying for Receive users who want to upgrade to 0.13+

bwplotka · 2020-06-30T17:29:42Z

I think for you you can ensure it works with higher CPU long enough for WAL from 0.12 to be non-existent then wipe those dirs. In the mean time we will look on fix. Still help wanted (:

brancz · 2020-07-01T06:15:12Z

Those look like the migration didn't work correctly. The dirs look like they are block directories. Just to make sure though, you are not running a multi tenant setup right?

sepich · 2020-07-01T09:01:04Z

Yes, we're not running multi tenancy, and these are just block dirs.

stale · 2020-07-31T09:53:11Z

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale · 2020-08-07T14:15:11Z

Closing for now as promised, let us know if you need this to be reopened! 🤗

bwplotka · 2020-08-07T14:16:38Z

Still the migration path has to be fixed in theory.

stale · 2020-09-06T15:11:05Z

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale · 2020-09-13T16:09:08Z

Closing for now as promised, let us know if you need this to be reopened! 🤗

bwplotka changed the title ~~receive: high cpu in v0.13.0~~ receive: high cpu when upgrading from 0.12.2 with old WAL Jun 30, 2020

bwplotka changed the title ~~receive: high cpu when upgrading from 0.12.2 with old WAL~~ receive: high cpu when upgrading from 0.12.2 with old data Jun 30, 2020

bwplotka added bug help wanted labels Jun 30, 2020

stale bot added the stale label Jul 31, 2020

stale bot closed this as completed Aug 7, 2020

bwplotka reopened this Aug 7, 2020

stale bot removed the stale label Aug 7, 2020

stale bot added the stale label Sep 6, 2020

stale bot closed this as completed Sep 13, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

receive: high cpu when upgrading from 0.12.2 with old data #2793

receive: high cpu when upgrading from 0.12.2 with old data #2793

sepich commented Jun 23, 2020

brancz commented Jun 23, 2020

sepich commented Jun 23, 2020

bwplotka commented Jun 23, 2020 •

edited

Loading

bwplotka commented Jun 29, 2020

sepich commented Jun 29, 2020

bwplotka commented Jun 29, 2020

bwplotka commented Jun 29, 2020

sepich commented Jun 30, 2020 •

edited

Loading

bwplotka commented Jun 30, 2020

bwplotka commented Jun 30, 2020

bwplotka commented Jun 30, 2020

brancz commented Jul 1, 2020

sepich commented Jul 1, 2020

stale bot commented Jul 31, 2020

stale bot commented Aug 7, 2020

bwplotka commented Aug 7, 2020

stale bot commented Sep 6, 2020

stale bot commented Sep 13, 2020

receive: high cpu when upgrading from 0.12.2 with old data #2793

receive: high cpu when upgrading from 0.12.2 with old data #2793

Comments

sepich commented Jun 23, 2020

brancz commented Jun 23, 2020

sepich commented Jun 23, 2020

bwplotka commented Jun 23, 2020 • edited Loading

bwplotka commented Jun 29, 2020

sepich commented Jun 29, 2020

bwplotka commented Jun 29, 2020

bwplotka commented Jun 29, 2020

sepich commented Jun 30, 2020 • edited Loading

bwplotka commented Jun 30, 2020

bwplotka commented Jun 30, 2020

bwplotka commented Jun 30, 2020

brancz commented Jul 1, 2020

sepich commented Jul 1, 2020

stale bot commented Jul 31, 2020

stale bot commented Aug 7, 2020

bwplotka commented Aug 7, 2020

stale bot commented Sep 6, 2020

stale bot commented Sep 13, 2020

bwplotka commented Jun 23, 2020 •

edited

Loading

sepich commented Jun 30, 2020 •

edited

Loading