tools: Added thanos bucket tool rewrite (e.g allowing block series deletions). #3421

Merged: kakkoyun merged 1 commit into master from bucket-delete-series on Nov 24, 2020

Conversation

bwplotka
Member

@bwplotka bwplotka commented Nov 6, 2020

Added rewrite tool:

thanos tools bucket rewrite --no-dry-run \
  --id 01DN3SK96XDAEKRB1AN30AAW6E \
  --objstore.config "
type: FILESYSTEM
config:
  directory: <local dir>
" \
  --rewrite.to-delete-config "
- matchers: \"{__name__!~\\\".*total\\\"}\"
" 
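The triple backslashes in the command above come from nesting a YAML double-quoted string inside a shell double-quoted string. As a minimal sketch, assuming the config is generated programmatically rather than typed by hand, Go's `%q` verb produces the YAML-level quoting for this matcher. The helper name `toDeleteConfig` is hypothetical and not part of Thanos, and `%q` follows Go's escaping rules, which agree with YAML double-quoted scalars for quotes and backslashes but are not guaranteed to for every input.

```go
package main

import "fmt"

// toDeleteConfig renders a one-entry deletion config in the YAML shape used
// by the command above. Hypothetical helper for illustration only.
func toDeleteConfig(matchers string) string {
	// %q wraps the value in double quotes and escapes inner quotes,
	// so the selector survives as a single YAML string.
	return fmt.Sprintf("- matchers: %q\n", matchers)
}

func main() {
	// The raw selector, written once without any escaping.
	fmt.Print(toDeleteConfig(`{__name__!~".*total"}`))
}
```

Passing the result through a file or a single-quoted shell argument avoids the second layer of escaping that the inline command needs.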

Not that inefficient for a first iteration. Deleting ALL gauges from a 2-week block with 8.3 million series, while also writing the changelog file, took ~1m30s.

Head of the changelog for the prod block I used:

Deleted {To="1007c280a27d8401", __name__="alerts", _id="f10e61fd-069c-49eb-8ed9-b1633c6259de", alertname="etcdMemberCommunicationSlow", alertstate="firing", endpoint="etcd-metrics", instance="10.35.145.22:9979", job="etcd", namespace="openshift-etcd", pod="etcd-member-control-plane-1", prometheus="openshift-monitoring/k8s", prometheus_replica="prometheus-telemeter-1", service="etcd", severity="warning"} [{1568135206056 1568178646000}]
Deleted {To="1007c280a27d8401", __name__="alerts", _id="f10e61fd-069c-49eb-8ed9-b1633c6259de", alertname="etcdMemberCommunicationSlow", alertstate="firing", endpoint="etcd-metrics", instance="10.35.145.52:9979", job="etcd", namespace="openshift-etcd", pod="etcd-member-control-plane-2", prometheus="openshift-monitoring/k8s", prometheus_replica="prometheus-telemeter-1", service="etcd", severity="warning"} [{1568135206056 1568178646000}]
Deleted {To="10378dab77f14db3", __name__="alerts", _id="d6856f61-6fee-4f3c-a563-34f3d3f86e4a", alertname="etcdMemberCommunicationSlow", alertstate="firing", endpoint="etcd-metrics", instance="192.168.1.43:9979", job="etcd", namespace="openshift-etcd", pod="etcd-member-master-0.bm.int.shifti.us", prometheus="openshift-monitoring/k8s", prometheus_replica="prometheus-telemeter-1", service="etcd", severity="warning"} [{1568144193553 1568156913564}]
Deleted {To="10378dab77f14db3", __name__="alerts", _id="d6856f61-6fee-4f3c-a563-34f3d3f86e4a", alertname="etcdMemberCommunicationSlow", alertstate="firing", endpoint="etcd-metrics", instance="192.168.1.45:9979", job="etcd", namespace="openshift-etcd", pod="etcd-member-master-2.bm.int.shifti.us", prometheus="openshift-monitoring/k8s", prometheus_replica="prometheus-telemeter-1", service="etcd", severity="warning"} [{1568138793744 1568153673645}]
Deleted {To="1039efb496924464", __name__="alerts", _id="dee71917-dcd6-495f-bc59-59037658d9f4", alertname="etcdMemberCommunicationSlow", alertstate="firing", endpoint="etcd-metrics", instance="10.48.4.213:9979", job="etcd", namespace="openshift-etcd", pod="etcd-member-control-plane-2", prometheus="openshift-monitoring/k8s", prometheus_replica="prometheus-telemeter-1", service="etcd", severity="warning"} [{1568499368241 1568499548241}]
Deleted {To="10c3d0fe3c4349f9", __name__="alerts", _id="3fec22a4-ce8b-4380-a36d-2bce5f61223e", alertname="etcdMemberCommunicationSlow", alertstate="firing", endpoint="etcd-metrics", instance="172.31.136.170:9979", job="etcd", namespace="openshift-etcd", pod="etcd-member-ip-172-31-136-170.us-west-1.compute.internal", prometheus="openshift-monitoring/k8s", prometheus_replica="prometheus-telemeter-1", service="etcd", severity="warning"} [{1568117253564 1568117313545}]
Deleted {To="10feb7ef9262777e", __name__="alerts", _id="b572f112-998b-45b9-ab2f-d80f99356723", alertname="etcdMemberCommunicationSlow", alertstate="firing", endpoint="etcd-metrics", instance="10.3.158.139:9979", job="etcd", namespace="openshift-etcd", pod="etcd-member-ip-10-3-158-139.eu-west-3.compute.internal", prometheus="openshift-monitoring/k8s", prometheus_replica="prometheus-telemeter-1", service="etcd", severity="warning"} [{1567802082565 1567802622565}]
Deleted {To="1170e6974f79ad38", __name__="alerts", _id="c03103eb-1571-498d-b1fd-70587b445faa", alertname="etcdMemberCommunicationSlow", alertstate="firing", endpoint="etcd-metrics", instance="10.0.138.59:9979", job="etcd", namespace="openshift-etcd", pod="etcd-member-ip-10-0-138-59.ec2.internal", prometheus="openshift-monitoring/k8s", prometheus_replica="prometheus-telemeter-1", service="etcd", severity="warning"} [{1567804893521 1567870173521}]
Deleted {To="1170e6974f79ad38", __name__="alerts", _id="c03103eb-1571-498d-b1fd-70587b445faa", alertname="etcdMemberCommunicationSlow", alertstate="firing", endpoint="etcd-metrics", instance="10.0.164.93:9979", job="etcd", namespace="openshift-etcd", pod="etcd-member-ip-10-0-164-93.ec2.internal", prometheus="openshift-monitoring/k8s", prometheus_replica="prometheus-telemeter-1", service="etcd", severity="warning"} [{1567805133521 1568470936744}]
Deleted {To="11c401372d21c33f", __name__="alerts", _id="b6acd906-6fc0-4d26-b4e6-e29d412c43ae", alertname="etcdMemberCommunicationSlow", alertstate="firing", endpoint="etcd-metrics", instance="192.168.50.11:9979", job="etcd", namespace="openshift-etcd", pod="etcd-member-master-1", prometheus="openshift-monitoring/k8s", prometheus_replica="prometheus-telemeter-1", service="etcd", severity="warning"} [{1568427608246 1568789465437}]
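Each changelog line ends with the deleted time ranges, which look like Unix millisecond timestamps. The sketch below shows one way to read them back. It assumes the `{minT maxT}` layout seen in the sample above is stable, which this PR does not promise, and `parseIntervals` and `interval` are hypothetical names. The line in `main` is a shortened version of a sample entry.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"time"
)

// interval is a time range in Unix milliseconds, as printed in the changelog.
type interval struct{ MinT, MaxT int64 }

// intervalRe matches "{<digits> <digits>}"; the label set never has this shape
// in the sample lines, but that is an assumption, not a guarantee.
var intervalRe = regexp.MustCompile(`\{(\d+) (\d+)\}`)

// parseIntervals extracts every "{minT maxT}" pair from one changelog line.
func parseIntervals(line string) ([]interval, error) {
	var out []interval
	for _, m := range intervalRe.FindAllStringSubmatch(line, -1) {
		minT, err := strconv.ParseInt(m[1], 10, 64)
		if err != nil {
			return nil, err
		}
		maxT, err := strconv.ParseInt(m[2], 10, 64)
		if err != nil {
			return nil, err
		}
		out = append(out, interval{MinT: minT, MaxT: maxT})
	}
	return out, nil
}

// msToTime converts Unix milliseconds to a UTC time.Time.
func msToTime(ms int64) time.Time {
	return time.Unix(ms/1000, (ms%1000)*int64(time.Millisecond)).UTC()
}

func main() {
	line := `Deleted {__name__="alerts", job="etcd"} [{1568135206056 1568178646000}]`
	ivs, err := parseIntervals(line)
	if err != nil {
		panic(err)
	}
	for _, iv := range ivs {
		from, to := msToTime(iv.MinT), msToTime(iv.MaxT)
		fmt.Printf("%s -> %s (%s)\n", from.Format(time.RFC3339), to.Format(time.RFC3339), to.Sub(from))
	}
}
```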

cc @SuperQ, you might find it useful (:

Signed-off-by: Bartlomiej Plotka [email protected]

NOTE: Left for future PRs:

  • Relabelling
  • Symbols removal / rebuild
  • Merge with block.Repair.

@bwplotka bwplotka force-pushed the bucket-delete-series branch 4 times, most recently from f82dfad to 2ad9593 Compare November 8, 2020 22:04
bwplotka added a commit to prometheus/prometheus that referenced this pull request Nov 8, 2020
Required for CLI for deletions in Thanos: thanos-io/thanos#3421

Signed-off-by: Bartlomiej Plotka <[email protected]>
@bwplotka bwplotka force-pushed the bucket-delete-series branch 4 times, most recently from 402131e to b7aefd0 Compare November 9, 2020 00:37
@bwplotka bwplotka marked this pull request as ready for review November 9, 2020 00:40
@bwplotka bwplotka force-pushed the bucket-delete-series branch from b7aefd0 to 44f00ae Compare November 9, 2020 00:45
@bwplotka
Member Author

bwplotka commented Nov 9, 2020

Example log:

ts=2020-11-09T01:02:57.31884196Z caller=level.go:63 level=info msg="loading bucket configuration"
ts=2020-11-09T01:02:57.822533681Z caller=level.go:63 level=info msg="downloading block" source=01DN3SK96XDAEKRB1AN30AAW6E
ts=2020-11-09T01:03:10.891061066Z caller=level.go:63 level=info msg="changelog will be available" file=/tmp/thanos-rewrite/01EPN8EF1AAER6WZ6H0KSXDAEF/change.log
ts=2020-11-09T01:03:10.901763718Z caller=level.go:63 level=info msg="starting rewrite for block" source=01DN3SK96XDAEKRB1AN30AAW6E new=01EPN8EF1AAER6WZ6H0KSXDAEF toDelete="\n- matchers: \"{__name__!~\\\".*total\\\"}\"\n"
modify [{[__name__!~".*total"] []}]
ts=2020-11-09T01:03:22.477395782Z caller=level.go:63 level=info msg="processed 10.00% of 8377876 series"
ts=2020-11-09T01:03:33.502961899Z caller=level.go:63 level=info msg="processed 20.00% of 8377876 series"
ts=2020-11-09T01:03:46.514366225Z caller=level.go:63 level=info msg="processed 30.00% of 8377876 series"
ts=2020-11-09T01:04:01.191199992Z caller=level.go:63 level=info msg="processed 40.00% of 8377876 series"
ts=2020-11-09T01:04:12.062844251Z caller=level.go:63 level=info msg="processed 50.00% of 8377876 series"
ts=2020-11-09T01:04:22.684655823Z caller=level.go:63 level=info msg="processed 60.00% of 8377876 series"
ts=2020-11-09T01:04:36.599985847Z caller=level.go:63 level=info msg="processed 70.00% of 8377876 series"
ts=2020-11-09T01:04:49.444169472Z caller=level.go:63 level=info msg="processed 80.00% of 8377876 series"
ts=2020-11-09T01:05:02.105350091Z caller=level.go:63 level=info msg="processed 90.00% of 8377876 series"
ts=2020-11-09T01:05:18.180069182Z caller=level.go:63 level=info msg="processed 100.00% of 8377876 series"
ts=2020-11-09T01:05:18.180177192Z caller=level.go:63 level=info msg="wrote new block after modifications; flushing" source=01DN3SK96XDAEKRB1AN30AAW6E new=01EPN8EF1AAER6WZ6H0KSXDAEF
ts=2020-11-09T01:05:18.437332093Z caller=level.go:63 level=info msg="uploading new block" source=01DN3SK96XDAEKRB1AN30AAW6E new=01EPN8EF1AAER6WZ6H0KSXDAEF
ts=2020-11-09T01:05:18.4957816Z caller=level.go:63 level=info msg=uploaded source=01DN3SK96XDAEKRB1AN30AAW6E new=01EPN8EF1AAER6WZ6H0KSXDAEF

@bwplotka bwplotka force-pushed the bucket-delete-series branch 4 times, most recently from de03c9b to 40ca72d Compare November 9, 2020 10:44
@bwplotka
Member Author

bwplotka commented Nov 9, 2020

@bwplotka bwplotka force-pushed the bucket-delete-series branch 3 times, most recently from c34fb32 to 9104db5 Compare November 9, 2020 11:28
@chadlwilson

This looks really interesting! Excuse my likely ignorance, but I have a couple of observations/queries

  1. is there any particular reason that motivates use of the metadata.DeletionRequest format in the --rewrite.to-delete-config.* flag design here? Since logically this is doing similar stuff that a Prometheus relabel config might be doing on people's proms, I wonder whether this format (or a subset of it) might be a suitable DSL; even if action: drop is all it supports initially? Perhaps such relabel config is rather complex to translate to the actions you need to take inside the tooling - or this is to be consistent with existing "prior art" for such block-oriented tooling, however. Just a thought 👍
  2. how are you anticipating that users will select the blocks that require rewriting/deletion of series? What should I do, if I don't know which block IDs have the problematic time series in them? Can we suggest how that could be achieved in the docs somehow?

bwplotka added a commit to prometheus/prometheus that referenced this pull request Nov 9, 2020
* Exposed DeletionIterator and CompactMetas functions.

Required for CLI for deletions in Thanos: thanos-io/thanos#3421

Signed-off-by: Bartlomiej Plotka <[email protected]>

* Removed Thanos usage mentions.

Signed-off-by: Bartlomiej Plotka <[email protected]>
@bwplotka
Member Author

bwplotka commented Nov 9, 2020

Good questions, @chadlwilson.

is there any particular reason that motivates use of the metadata.DeletionRequest format in the --rewrite.to-delete-config.* flag design here? Since logically this is doing similar stuff that a Prometheus relabel config might be doing on people's proms, I wonder whether this format (or a subset of it) might be a suitable DSL; even if action: drop is all it supports initially? Perhaps such relabel config is rather complex to translate to the actions you need to take inside the tooling - or this is to be consistent with existing "prior art" for such block-oriented tooling, however. Just a thought +1

Good point. Relabelling is amazing, although a bit hard for users to understand and configure correctly. Still, the plan is to add relabelling on top of the existing deletion request. The main problem for me is that relabelling was designed for labels as they are NOW (discovery, scraping), so it has no time awareness. That makes it hard to define relabelling for a certain time range (e.g. delete series only from 2:00pm to 3:00pm).

Unless we are OK with always applying it to the whole block, or we extend relabelling to be time-interval based. Any preferences?

how are you anticipating that users will select the blocks that require rewriting/deletion of series? What should I do, if I don't know which block IDs have the problematic time series in them? Can we suggest how that could be achieved in the docs somehow?

Let's add that in later PRs. Usually, you need to remove series when a particular block has some corruption or problem. To print block IDs for certain labels, thanos tools bucket ls and thanos tools bucket inspect are useful. On top of that, thanos tools bucket web is an amazing UI for viewing things cc @kunal-kushwaha @prmsrswt (: But good point about documenting this.

@yeya24
Contributor

yeya24 commented Nov 9, 2020

how are you anticipating that users will select the blocks that require rewriting/deletion of series? What should I do, if I don't know which block IDs have the problematic time series in them? Can we suggest how that could be achieved in the docs somehow?

One useful feature in promtool is the promtool tsdb analyze command, which analyzes a single TSDB block and shows information such as the highest-cardinality labels. It would be good to support this for blocks in object storage.

But if we can expose this information via the UI, that would be perfect.

@yeya24
Contributor

yeya24 commented Nov 10, 2020

If we specify a block that has no series matching the given deletion requests, will we still write a new block and flush it to the object store?
This is not a problem since we support dry-run. I'm just curious whether we can improve the logic to avoid creating a duplicate block.

@bwplotka bwplotka force-pushed the bucket-delete-series branch 4 times, most recently from 7cf2814 to 5590442 Compare November 12, 2020 17:02
@bwplotka bwplotka requested a review from yeya24 November 12, 2020 17:02
@bwplotka
Member Author

Fixed all, PTAL (:

Contributor

@yeya24 yeya24 left a comment

LGTM! It is really amazing work, but it would be good to have maintainers take a look as well since the change is huge. Maybe @kakkoyun @brancz @pracucci?

@bwplotka bwplotka force-pushed the bucket-delete-series branch 2 times, most recently from 7c740eb to a638117 Compare November 13, 2020 10:43
@bwplotka
Member Author

PTAL 🤗

@bwplotka bwplotka changed the title tools: Added thanos bucket tool rewrite (e.g allowing block series deletions and relabelling). tools: Added thanos bucket tool rewrite (e.g allowing block series deletions). Nov 13, 2020
@bwplotka
Member Author

ping (:

pkg/compactv2/compactor.go (outdated)

symbols, set, err := compactSeries(ctx, sReaders...)
if err != nil {
return errors.Wrapf(err, "compact series from %v", func() string {
Member

:)

Member Author

sad (:

@bwplotka bwplotka force-pushed the bucket-delete-series branch 2 times, most recently from 61d1ce2 to 6fa2dc0 Compare November 20, 2020 17:34
@bwplotka
Member Author

Is this rdy to be merged? 🤗

@bwplotka bwplotka force-pushed the bucket-delete-series branch from 6fa2dc0 to 870a9ce Compare November 23, 2020 18:48
@bwplotka bwplotka force-pushed the bucket-delete-series branch from 870a9ce to d0f3fbe Compare November 23, 2020 18:49
@kakkoyun
Member

@bwplotka Are the failing tests the flaky ones? In any case, I have re-run them. We'll see soon enough.

Member

@kakkoyun kakkoyun left a comment

LGTM 💯 The flaky docs checks are fixed on master.

@kakkoyun kakkoyun merged commit 7b45f70 into master Nov 24, 2020
@bwplotka bwplotka deleted the bucket-delete-series branch November 24, 2020 15:31
Oghenebrume50 pushed a commit to Oghenebrume50/thanos that referenced this pull request Dec 7, 2020
…es deletions). (thanos-io#3421)

Signed-off-by: Bartlomiej Plotka <[email protected]>
Signed-off-by: Oghenebrume50 <[email protected]>