Implement multi-tenant Ruler: multitsdb and multiagent #5133

Open
saswatamcode opened this issue Feb 7, 2022 · 9 comments
Labels
component: rule dont-go-stale Label for important issues which tells the stalebot not to close them feature request/improvement proposal

Comments

@saswatamcode
Member

saswatamcode commented Feb 7, 2022

Is your proposal related to a problem?

Currently, the Thanos Ruler has no built-in support for multi-tenancy, unlike Receive. This creates issues in setups where we want to isolate tenants and store each tenant's rule-evaluated metrics in its own tsdb instance. The only workaround is running a Ruler per tenant, which is simpler but wasteful of resources.

Also, in the case of Stateless Rulers, multi-tenancy is harder to achieve, as different tenants might need different remote-write configurations (writing to separate locations with separate HTTP headers like THANOS-TENANT).

For example, consider a Receive with multiple tenants, to which a single Ruler needs to remote-write multi-tenant rule-based metrics so that they land in each tenant's Receive tsdb. The Ruler cannot set a different HTTP header per tenant, so all of its metrics are treated by Receive as a single new default tenant and a new tsdb gets created.
[Image: Ruler_Multi_Tenant_Problem diagram]

(Note: This is a separate problem from ensuring that Ruler only selects data from one tenant while evaluating rules.)

Describe the solution you'd like

A potential solution would be using the Receive multitsdb approach in Ruler, with the same tenancy flags as Receive (--receive.default-tenant-id, --receive.tenant-label-name). The Ruler would then be tenant-aware and store evaluated metrics in a separate tsdb instance per tenant, using the tenant_id label to identify which rule-based series belongs to which tenant (assuming the rule file configuration specifies the tenant label for each rule).
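To make the idea concrete, here is a hypothetical sketch (not actual Thanos code) of routing rule-evaluated series into a per-tenant storage keyed by a tenant label, the way Receive's multitsdb does. All names here are illustrative:

```python
# Hypothetical sketch: route rule-evaluated series into per-tenant
# storage by a tenant label. Names are illustrative, not Thanos APIs.

DEFAULT_TENANT = "default-tenant"   # cf. --receive.default-tenant-id
TENANT_LABEL = "tenant_id"          # cf. --receive.tenant-label-name

class MultiTSDB:
    """One (toy) storage list per tenant, created lazily on first write."""

    def __init__(self):
        self.tenants = {}

    def append(self, series_labels, value):
        tenant = series_labels.get(TENANT_LABEL, DEFAULT_TENANT)
        # Strip the tenant label before storing, since tenancy is now
        # expressed by which tsdb instance the series lives in.
        stored = {k: v for k, v in series_labels.items() if k != TENANT_LABEL}
        self.tenants.setdefault(tenant, []).append((stored, value))
        return tenant
```

A series without the tenant label would fall back to the default tenant, mirroring Receive's behavior for unlabeled requests.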

This can be extended to the Stateless Ruler to allow separate remote write configs for each tenant. This would start an agent, i.e., a WAL-only storage, for each tenant, which remote-writes only to the locations configured for that tenant. In essence, a multiagent package would be needed to handle this.
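The multiagent shape could look something like the following illustrative-only sketch: one WAL-only agent per tenant, each with its own remote-write endpoints and HTTP headers (e.g. THANOS-TENANT). Class and field names are hypothetical:

```python
# Illustrative-only sketch of a "multiagent": one WAL-only agent per
# tenant, each with its own remote-write endpoints and headers.

class TenantAgent:
    def __init__(self, tenant, endpoints, headers):
        self.tenant = tenant
        self.endpoints = endpoints  # only this tenant's remote-write URLs
        self.headers = headers      # e.g. {"THANOS-TENANT": tenant}
        self.wal = []               # stand-in for a WAL-only storage

class MultiAgent:
    def __init__(self, configs):
        # configs: {tenant: {"endpoints": [...], "headers": {...}}}
        self.agents = {
            tenant: TenantAgent(tenant, cfg["endpoints"], cfg["headers"])
            for tenant, cfg in configs.items()
        }

    def write(self, tenant, sample):
        # A sample lands only in its tenant's WAL, so it can only ever
        # be shipped to that tenant's configured endpoints.
        self.agents[tenant].wal.append(sample)
```

The point of the structure is isolation: a sample written for one tenant never reaches another tenant's WAL or endpoints.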

The addition of multitsdb to Ruler could also be skipped, as the Scalable Rule proposal already lists the removal of the embedded tsdb in its work plan! :)
[Image: Rules_multiagent diagram]

Describe alternatives you've considered

Running a Ruler for each tenant.

Open to feedback and suggestions! If there are existing solutions/configuration options for achieving the same result which will be easier to implement than the above idea, that would be great too! 🙂

@matej-g
Collaborator

matej-g commented Feb 8, 2022

We discussed this briefly with @saswatamcode, with one more alternative suggested by me: have a separate remote write config for each tenant, set the tenant header, and use relabeling to forward only the metrics applicable to that tenant. However, this is not really a systematic solution and requires always setting up the remote write config for each tenant manually. The proposed solution seems reasonable to me 👍.
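For illustration, the relabeling alternative amounts to a per-tenant filter in the spirit of Prometheus's write_relabel_configs "keep" action: each tenant's remote-write config keeps only series whose tenant label matches. This is a toy model, not real relabeling code:

```python
import re

# Toy model of the relabeling alternative: each tenant's remote-write
# pipeline "keep"s only series whose tenant label matches a regex, in
# the spirit of Prometheus write_relabel_configs. Illustrative only.

def keep_for_tenant(series, tenant_regex, tenant_label="tenant_id"):
    """Return only the series this tenant's remote write should forward."""
    pattern = re.compile(tenant_regex)
    return [s for s in series if pattern.fullmatch(s.get(tenant_label, ""))]
```

The drawback noted above is visible here: every new tenant needs its own config with its own regex, maintained by hand.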

@bwplotka
Member

bwplotka commented Mar 9, 2022

Hey, just trying to understand the main problem we are discussing here.

The only possible way might be using a Ruler for each tenant which is simpler but wasteful of resources.

Do we have any data on this? For stateless rulers there is not much baseline overhead in this situation. I would even say the more problematic case is the opposite extreme, where one tenant has too many rules and alerts for one ruler.

A potential solution would be using the Receive multitsdb in Ruler and having the same flags for tenancy as Receive

Do you mean sending things to Receive that uses multitsdb or literally using multitsdb code?

This would start an agent, i.e, a WAL-only storage for each tenant which remote-writes to only locations that were configured for that tenant. In essence, a multiagent package, would be needed to be able to handle this.

I would really avoid doing that - multi-tsdb is already a tough idea: every new TSDB has significant startup and reload costs. I'm not sure we want to replicate that idea in the agent code.

Also, in the case of using Stateless Rulers, it's harder to achieve multi-tenancy, as different tenants might need different configurations while remote writing (write to separate locations with separate HTTP headers like THANOS-TENANT).

Right. We need essentially something like this:

[image: diagram omitted]

I feel we should have multi-tenant rulers that can handle rules for any number of tenants (tenant agnostic), and build tenancy with label-aware sharding on the receiver. The Receive router already checks EACH series in a write request and distributes it with the hashring - so why not check the tenant label there?
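The suggestion above can be sketched as follows: the router, which already hashes each series, additionally reads the tenant from a per-series label before placing it on the hashring. The hashing scheme here is illustrative, not Thanos's actual hashring implementation:

```python
import hashlib

# Sketch of tenant-aware sharding on the Receive router: pick the
# tenant from a per-series label, then hash the series to a node.
# Hashing scheme is illustrative, not Thanos's actual hashring.

def route(series_labels, nodes, tenant_label="tenant_id",
          default_tenant="default-tenant"):
    tenant = series_labels.get(tenant_label, default_tenant)
    key = tenant + "/" + ",".join(
        f"{k}={v}" for k, v in sorted(series_labels.items())
        if k != tenant_label
    )
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return tenant, nodes[digest % len(nodes)]
```

With this shape, the ruler itself stays tenant-agnostic and tenancy lives entirely in the routing layer.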

@stale

stale bot commented Jun 12, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jun 12, 2022
@matej-g matej-g removed the stale label Jun 13, 2022
@stale

stale bot commented Aug 13, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Aug 13, 2022
@matej-g matej-g added dont-go-stale Label for important issues which tells the stalebot not to close them and removed stale labels Aug 15, 2022
@yeya24
Contributor

yeya24 commented Oct 24, 2022

Would love to see this moving forward.
A general sharder is really something we need in Thanos. Cortex has something similar using the Ring. In Thanos, we have the hashring only on the receiver side, so if we want to distribute work like rule evaluation or compaction jobs, we don't have a good way to do it now.

@saswatamcode
Member Author

Yup! I'm writing a proposal + poc for this currently. Will land soon! 🙂

@benjaminhuo
Contributor

benjaminhuo commented Oct 24, 2022

Yup! I'm writing a proposal + poc for this currently. Will land soon! 🙂

Looking forward to this feature!

@anarcher

How is ruler sharding going? :-) As a Cortex user, I found this feature useful.

@benjaminhuo
Contributor

I feel we should have multi-tenant rulers that can do any number of tenants rules (tenant agnostic) and we build tenancy with label aware sharding on receiver. Receive router already checks EACH series in write request and distribute with hashring - so why not checking tenant label there?

Does #7256 already implement this feature? @bwplotka @GiedriusS
