
Enable Receiver to extract Tenant from a label present in incoming timeseries #7081

Open

verejoel opened this issue Jan 19, 2024 · 21 comments

@verejoel
Contributor

Is your proposal related to a problem?

One way to support a multi-tenant ruler is to enable receivers to infer the tenant not from HTTP headers, but from labels present in the incoming time series. See the comment in issue #5133 for more information.

As well as helping move forward with the multi-tenant ruler, this would help enable multi-tenancy in clients that deliver telemetry. For example, if we currently want to ship telemetry using the OpenTelemetry Collector's Prometheus remote write exporter, we would need to configure one exporter per tenant (I am aware that the headers setter extension can set headers dynamically, but that only works if the whole request context belongs to a single tenant).

Describe the solution you'd like

Introduce a new CLI flag for the receiver, --incoming-tenant-label-name. The Thanos receiver would then search each incoming time series for this label, building a map from the unique tenant names discovered in the label values to the slices of time series belonging to each tenant. Each entry of the map can then be distributed according to the hashring config (a minimal sketch follows the notes below).

The process is summarised in this flow chart:
[flow chart: extract the tenant from the configured label per series, group series by tenant, then route each group via the hashring]

Notes:

  • If a custom tenant header is specified and present in the original request, or the THANOS-TENANT header is present, the header takes precedence over per-series label lookup even if the new CLI flag is set (this behaviour could also be made configurable)
  • The design shown in the flow chart is not very efficient, as we'd loop over the incoming time series twice: once to assign tenancy, and again to check the request size. A smarter design could probably avoid this 😅
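A minimal Go sketch of the grouping step described above. The TimeSeries type and the splitByTenantLabel helper are hypothetical stand-ins for illustration only, not the Thanos receive implementation:

```go
package main

import "fmt"

// TimeSeries is a simplified stand-in for the remote-write series type.
type TimeSeries struct {
	Labels map[string]string
}

// splitByTenantLabel groups series by the value of tenantLabel.
// Series without the label fall back to defaultTenant, mirroring the
// receiver's existing default-tenant behaviour.
func splitByTenantLabel(series []TimeSeries, tenantLabel, defaultTenant string) map[string][]TimeSeries {
	out := make(map[string][]TimeSeries)
	for _, ts := range series {
		tenant := defaultTenant
		if v, ok := ts.Labels[tenantLabel]; ok && v != "" {
			tenant = v
		}
		out[tenant] = append(out[tenant], ts)
	}
	return out
}

func main() {
	req := []TimeSeries{
		{Labels: map[string]string{"__name__": "up", "tenant_id": "team-a"}},
		{Labels: map[string]string{"__name__": "up", "tenant_id": "team-b"}},
		{Labels: map[string]string{"__name__": "up"}},
	}
	for tenant, batch := range splitByTenantLabel(req, "tenant_id", "default-tenant") {
		// Each per-tenant batch would then be distributed according to the hashring config.
		fmt.Println(tenant, len(batch))
	}
}
```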

Describe alternatives you've considered

Dynamically manipulating the THANOS-TENANT header upstream (using an OTel collector, for example).

Additional context

We'd need to modify the behaviour of the receiveHTTP handler, in particular where we extract the tenant from the HTTP request.

@fpetkovski
Contributor

This would be a really cool feature indeed! We've tried to build a proxy that extracts the tenant from a label and sends one request per tenant with the appropriate header, but it overwhelmed receivers and was not worth the hassle. Having the feature natively built into Thanos is the way to go.

@verejoel
Contributor Author

verejoel commented Jan 20, 2024 via email

@MichaHoffmann
Contributor

Thanks Filip - do you think the proposal for how the feature would work makes sense? I'm a little worried about adding too much overhead to the routing receivers and causing requests to get backed up.

Wouldn't this be evaluated on the ingesting receiver when deciding which tenant the sample should be written to?

I would imagine that instead of one local write, we would inspect the request, group it by tenant, and issue multiple local writes here:

h.sendLocalWrite(ctx, writeDestination, params.tenant, localWrites[writeDestination], responses)

Right? That would happen on the ingester, I think!
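To illustrate what that could look like on the ingesting side, here is a hedged Go sketch. The localWrite type and the sendLocalWrite function are hypothetical stand-ins (not the actual handler method quoted above), and the per-tenant grouping is assumed to have already happened from the configured label:

```go
package main

import (
	"context"
	"fmt"
)

// localWrite is a stand-in for the per-tenant data an ingester would flush locally.
type localWrite struct {
	tenant  string
	samples int
}

// sendLocalWrite is a hypothetical stand-in for the receiver's local-write path.
func sendLocalWrite(ctx context.Context, w localWrite) error {
	fmt.Printf("writing %d samples into tenant %q\n", w.samples, w.tenant)
	return nil
}

func main() {
	// Instead of one local write for the whole request, issue one write per
	// tenant discovered from the configured label.
	grouped := map[string]int{"team-a": 120, "team-b": 7}

	ctx := context.Background()
	for tenant, n := range grouped {
		if err := sendLocalWrite(ctx, localWrite{tenant: tenant, samples: n}); err != nil {
			fmt.Println("local write failed for tenant", tenant, ":", err)
		}
	}
}
```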

@fpetkovski
Contributor

The ingesting receiver does not always have access to the hashring (e.g. in router-ingester split mode). So routers need to know which ingester to send samples to.

@verejoel
Contributor Author

verejoel commented Jan 23, 2024

Had some discussions with @MichaHoffmann. We came to the conclusion that the proposed implementation would completely break the current rate-limiting concept. For example, if 1 tenant in a batch of 20 is over its limit, what should Thanos do? A 429 will be retried and result in the whole batch being ingested again. If we drop all metrics in the batch because 1 tenant is over its limit, we have created a noisy-neighbour problem. But if we accept the metrics from the valid 19/20 tenants, then we will have out-of-order issues, since the retried batch would re-send samples that were already accepted.

So the current design is incompatible with per-tenant rate limiting as it stands.
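A small Go sketch of the dilemma, assuming a hypothetical per-tenant limiter (overLimit and the numbers are made up, not Thanos code); it only illustrates why a single HTTP status code cannot represent a mixed per-tenant outcome:

```go
package main

import "fmt"

// overLimit is a stand-in for a per-tenant rate-limiter decision.
func overLimit(tenant string, samples int) bool {
	// Hypothetical: only "team-b" is over its limit in this example.
	return tenant == "team-b" && samples > 100
}

func main() {
	batches := map[string]int{"team-a": 50, "team-b": 500, "team-c": 10}

	var accepted, rejected []string
	for tenant, n := range batches {
		if overLimit(tenant, n) {
			rejected = append(rejected, tenant)
		} else {
			accepted = append(accepted, tenant)
		}
	}

	// The whole remote-write request still gets exactly one status code:
	//  - 429 for everyone: the client retries and re-sends the accepted data too.
	//  - 200 for everyone: the over-limit tenant is never throttled, or its data
	//    is silently dropped and its peers pay for a noisy neighbour.
	//  - accept the valid tenants and reject the rest: the retry re-sends already
	//    ingested samples, producing out-of-order/duplicate errors.
	fmt.Println("accepted:", accepted, "rejected:", rejected)
}
```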

@sepich
Contributor

sepich commented Jan 24, 2024

Another way to do it is via the existing --receive.relabel-config in the router stage, by exposing the tenant as an internal label (e.g. __meta_tenant_id) and allowing it to be modified:

- |
    --receive.relabel-config=
    - source_labels: [prometheus]
      target_label: __meta_tenant_id

The same way it is done in Mimir:
grafana/mimir#4725

Example of changes needed in Thanos:
44a0728#diff-42c21b7b04cc61ab0cda17794cc1efff14802e0e89a85503d28601e721c1dd31R849
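A dependency-free Go sketch of this alternative, using a simplified stand-in for Prometheus relabel rules (only the source_labels → target_label copy is modelled); the rule and label names mirror the snippet above and this is not the actual Thanos implementation:

```go
package main

import "fmt"

// relabelRule is a simplified stand-in for a Prometheus relabel config:
// copy the value of one source label into a target label.
type relabelRule struct {
	SourceLabel string
	TargetLabel string
}

// apply copies SourceLabel's value into TargetLabel, like a default
// `replace` action with regex (.*) and replacement $1.
func (r relabelRule) apply(lbls map[string]string) {
	if v, ok := lbls[r.SourceLabel]; ok {
		lbls[r.TargetLabel] = v
	}
}

func main() {
	rule := relabelRule{SourceLabel: "prometheus", TargetLabel: "__meta_tenant_id"}

	series := map[string]string{"__name__": "up", "prometheus": "team-a"}
	rule.apply(series)

	// The router would then read the internal label to pick the tenant
	// before distributing the series via the hashring.
	fmt.Println("tenant:", series["__meta_tenant_id"])
}
```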

@verejoel
Contributor Author

@sepich I like that approach. Do you know how Mimir handles per-tenant limits in that situation?

@MichaHoffmann
Contributor

MichaHoffmann commented Jan 25, 2024

@verejoel I think it has the same issues with the rate limit, since all the samples still come from one remote-write request!

@GiedriusS
Member

Implemented a PoC for this; it works really well. A few caveats:

  • the tenant in the Receiver's HTTP metrics is embedded into the http.Handler, so everything I see now falls under the default-tenant tenant 🤷
  • __meta_tenant_id is a bad idea because after relabel_configs all labels beginning with __meta are trimmed. You can add it in metric_relabel_configs, but for some reason that doesn't apply to meta-metrics like up

@verejoel
Contributor Author

@GiedriusS any chance you can open a PR for this issue? Would love to try it out.

@GiedriusS
Member

@verejoel check out #7256

@pvlltvk

pvlltvk commented May 19, 2024

@GiedriusS Hi!
I've tried this functionality in v0.35.0, but with no luck. Does it work in router + ingestor mode?

@irizzant

irizzant commented May 27, 2024

I've tried --receive.split-tenant-label-name too but it's not working.

Thanos is deployed with the Bitnami Helm chart; here is the receive configuration:

- receive
            - --log.level=info
            - --log.format=logfmt
            - --grpc-address=0.0.0.0:10901
            - --http-address=0.0.0.0:10902
            - --remote-write.address=0.0.0.0:19291
            - --objstore.config=$(OBJSTORE_CONFIG)
            - --tsdb.path=/var/thanos/receive
            - --label=replica="$(NAME)"
            - --label=receive="true"
            - --tsdb.retention=15d
            - --receive.local-endpoint=127.0.0.1:10901
            - --receive.hashrings-file=/var/lib/thanos-receive/hashrings.json
            - --receive.replication-factor=1
            - --receive.split-tenant-label-name="tenant_id"

I created a ServiceMonitor which adds the label tenant_id=test, but metrics come out with tenant_id=default-tenant.

@MichaHoffmann
Contributor

Tenant might be a reserved label iirc:

--receive.tenant-label-name="tenant_id"
Label name through which the tenant will be announced.

@irizzant

So, if I understand correctly, it's impossible to use --receive.tenant-label-name="tenant_id" because tenant_id is a reserved label?

@irizzant

Also, Receive is adding a tenant label along with tenant_id to the TSDB metrics; is this expected?

@benjaminhuo
Contributor

benjaminhuo commented Aug 19, 2024

@verejoel check out #7256

It's an important feature, so does this PR cover everything in this proposal? @GiedriusS @verejoel
What else is still needed for a tenant to remote write evaluated metrics to different ingesters based on a specific label in the metrics?

Thanks!

@verejoel
Contributor Author

@benjaminhuo I tried using this with the stateless ruler and the router/ingester setup. It doesn't seem to work as expected, as the ALERTS metrics disappear. I still need to devote some time to working out why that might be the case - it could be that they get written to some default tenant that is not queryable in our setup. I'd be interested to hear if anyone else has managed to get this working with the stateless ruler, as this is one of the ways to enable a multi-tenanted ruler based on internal metric labels.

@yeya24
Contributor

yeya24 commented Nov 3, 2024

@verejoel Is this still an issue for you when using this feature with the stateless ruler? Stateless ruler labels are kind of hard to configure, but if you can share your configuration, we can help you debug a bit.

It would also be nice to create a doc for users on how to troubleshoot this. I will create an issue.

@verejoel
Contributor Author

verejoel commented Nov 3, 2024

@yeya24 That sounds good; I will dedicate some time to it this week. Let me know which issue, and I will post my findings there.

@mac-alkira

Currently experiencing this same issue. Was an issue ever created?
