Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*: Support multi-tenancy on object storage using custom prefix #1318

Closed
github-edouard-devouge opened this issue Jul 10, 2019 · 65 comments
Closed

Comments

@github-edouard-devouge
Copy link

Support multi-tenancy on object storage using custom prefix

Context / Use case

On a multi-cluster kubernetes setup with multiple prometheus instances (one per namespace) using PrometheusOperator. Each prometheus instance embed its own thanos sidecar and for performance & scalability purpose data chuncks are sharded accross multiple s3 buckets.
For each prometheus instance, there is a dedicated thanos store and a thanos compactor working on its own bucket. One thanos query is setup to map/reduce promql queries across multiple stores and thanos-sidecars.

Problems

  • With this setup there is a proliferation of s3 buckets.
  • Bucket creation lifecycle is not as dynamic as kubernetes resources (kubectl apply versus terraform apply). Require use of multiple tools.

Question / feature request

Is there a way to share a bucket across multiple thanos instances (keeping the sharding property) ?

Otherwise, a cool feature could be the ability to add a custom prefix on thanos objects (objstore.config). So we could take advantage of s3 bucket multi-tenancy.

  • e.g : s3://<my-bucket-name>/**<custom-prefix>**/<thanos-objects>
@github-edouard-devouge
Copy link
Author

github-edouard-devouge commented Jul 31, 2019

Design proposal to discuss

Scope :

  • Enrich objstore configuration to support a custom prefix in s3 object storage key name
  • Focus on AWS S3 implementation for a first contribution

Expected result

[Before] Current flat bucket file tree :

  bucket-root
  ├── 01DH3V0E80CBEK8SK00A93SH7Z
  │    ├── chunks
  │    ├── index
  │    └── meta.json
  ├── 01DG4BDF8R3FDSZ3EM6MVHVTFX
  │    ├── chunks
  │    ├── index
  │    └── meta.json
  ├──  01DH3V01JXFHC0C9ABCM7MZGP4
  │    ├── chunks
  │    ├── index
  │    └── meta.json
  └── debug
       └── metas
           ├── 01DH3V0E80CBEK8SK00A93SH7Z.json
           ├── 01DG4BDF8R3FDSZ3EM6MVHVTFX.json
           └──  01DH3V01JXFHC0C9ABCM7MZGP4.json

[After] Expected prefixed bucket file tree :

  bucket-root
  └── my
      └── custom
          └── prefix
              ├── 01DH3V0E80CBEK8SK00A93SH7Z
              │    ├── chunks
              │    ├── index
              │    └── meta.json
              ├── 01DG4BDF8R3FDSZ3EM6MVHVTFX
              │    ├── chunks
              │    ├── index
              │    └── meta.json
              ├──  01DH3V01JXFHC0C9ABCM7MZGP4
              │    ├── chunks
              │    ├── index
              │    └── meta.json
              └── debug
                   └── metas
                       ├── 01DH3V0E80CBEK8SK00A93SH7Z.json
                       ├── 01DG4BDF8R3FDSZ3EM6MVHVTFX.json
                       └──  01DH3V01JXFHC0C9ABCM7MZGP4.json

Implementation steps

  1. Add a new prefix flag in objstore.config for type s3
type: S3
config:
    prefix: "my/custom/prefix/"
    bucket: ""
    endpoint: ""
    region: ""
    access_key: ""
    insecure: false
  [...]
  1. Write some unit tests in pkg/objstore/s3/s3_test.go

  2. Edit pkg/objstore/s3/s3.go to :

    • Extend the configuration data structure type Config struct with Prefix string yaml:"perfix"
    • Validate prefix format :
    func validate(conf Config) error {
      [...]
      if conf.Prefix != "" && len(conf.Prefix)>=1024 {
    		return errors.New("prefix is too long (limited by Amazon at 1024 bytes long)")
    	}
    	return nil
    }
    [...]
    prefix = strings.TrimSuffix(prefix, "/") + "/"
    • Append prefix string to object name for each minio call requiring an object name. For instance:

      // Delete removes the object with the given name.
      func (b *Bucket) Delete(ctx context.Context, name string) error {
      	return b.client.RemoveObject(b.name, prefix + name)
      }
  3. Update the doc : docs/storage.md

  4. Create a PR

Forecats of impacted files :

@povilasv
Copy link
Member

povilasv commented Aug 2, 2019

The proposal LGTM 🥇

@Nathan-Okolita-Aimtheory

@edevouge is there any progress with this issue?

@github-edouard-devouge
Copy link
Author

@Nathan-Okolita-Aimtheory: a pull request ( #1392 ) implementing this feature is work-in-progress by @jaredallard

@brancz
Copy link
Member

brancz commented Aug 20, 2019

I'm not against this per se, but we already have a multi-tenancy mechanism approved in the thanos receive component, that is label-set based. If we do something about multi-tenancy in the object store bucket, then that should be label-set based as well, as opposed to path based.

@Nathan-Okolita-Aimtheory

@brancz would your label based implementation look like this in the Bucket.yml:
type: S3 config: label: "tenant_label" bucket: "bucket" endpoint: "s3" region: "region" access_key: "access_key" secret_key: "secret_key" insecure: false signature_version2: true
Would the Thanos store gateway need to index the entire s3 bucket to return just the label desired?

@brancz
Copy link
Member

brancz commented Aug 20, 2019

A tenant is a label-set, so not just a single label. So that specific part would rather be something like:

tenantLabels:
- label1
- label2

@bigkraig
Copy link
Contributor

@brancz I'm not fully familiar with the label based tenants, but segregating in the bucket path allows you to use one bucket & policies for multiple Thanos deployments, that may not want to share Thanos services.

At least in my case, I have different teams with different products. By leveraging path restrictions I can reduce my S3 bucket sprawl. It also means that teams that may want to share a common query endpoint can do so, without exposing themselves to cardinality bombs by bad deployments. If someone does something bad, it only impacts their bucket location.

@edevouge @jaredallard are either of you continuing to work on the existing PR?

@dlahoza
Copy link

dlahoza commented Dec 11, 2019

These changes would allow lifecycle management inside the multipurpose buckets.
As an example, we use an S3 bucket per cluster, and we maintain multiple clusters in a single account.
It's convenient to use a single bucket, replicate it, delete it with all the data at once.
And the best part is that we can apply a life-cycle policy on the Thanos data by using the prefix of a subdirectory.
And this is crucial for us.

@AnandDevan
Copy link

We host all monitoring related configs, backup etc. under single bucket and would like to use that bucket to host thanos data as well. This feature would be super useful.

@domino7
Copy link

domino7 commented Jan 16, 2020

Elimination of multiple buckets maintenance is definitive adventage.
One more requirement here will be to ensure that clients will not affect each other performance and stability (eg with query like {"__name__"=~".+"} setup for single client/hashring should be affected at most. It shouldn't be necessary to index and query metrics from all clients - which may be the case for labels-based segregation suggested by @brancz ). Solution with /path prefix and dedicated read access(stores, queriers) for each client seems to address this issues.

I played a little bit with PR from @Dlag (PR mentioned above) and menaged to setup configuration with read and write access, however I spotted that compaction is affected when accessing data from s3 subdirectories, failing with write compaction: chunk 8 not found: segment index 0 out of range error similar to #1300 . Setup was tested with thanos 0.10 with additional s3 prefixes configuration (configuration with thanos v0.10 and dedicated s3 buckets instead of subdirectories works fine for the same dataset).
That is one more thing that should be verified for this feature.

@stale
Copy link

stale bot commented Feb 15, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@stale stale bot added the stale label Feb 15, 2020
@Pryz
Copy link

Pryz commented Feb 20, 2020

Kind of the same use case here where I need to set a prefix because Thanos data will not be the only data I would stored in the bucket. Seems like two PRs have been closed for this issue but I'm having hard time to understand what the blocker is.

How can we move forward here ? :)

@stale stale bot removed the stale label Feb 20, 2020
@stale
Copy link

stale bot commented Mar 21, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@stale stale bot added the stale label Mar 21, 2020
@Nathan-Okolita-Aimtheory

I am still hoping this issue will be picked up. We have an environment with many Prometheus instances many of which are deployed through automation. Having to provision many s3 buckets is not a very appealing prospect.

@stale stale bot removed the stale label Mar 23, 2020
@stale
Copy link

stale bot commented Apr 22, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@stale stale bot added the stale label Apr 22, 2020
@chadlwilson
Copy link

I think there is still a lot of interest here. This would make it a lot easier to operate a single Thanos that can aggregate metrics for N teams' prometheuses.

@stale stale bot removed the stale label Apr 23, 2020
@pracucci
Copy link
Contributor

Introducing the ability to configure an optional bucket prefix would also be a nice step forward towards Cortex blocks storage and Thanos bucket store interoperability.

@bwplotka
Copy link
Member

Thanks for the explanations @markmsmith @roidelapluie @dsayan154 I think I am happy with some simple prefix to make sure you can store blocks deeper in bucket (or have it compatible with Cortex).

@underrun:

Can you clarify how to avoid the issue @mei-ling is having without prefixes, which, if i understand, would allow multiple store gateway / compactor to work on the same bucket - one per cluster as desired in that case?

You just point multiple sidecars to the same bucket, same directory. Since Prometheus has unique external labels we can distinguish between blocks. Directory does not matter for Thanos. (:

I'm also a little confused by how people are using "tenancy" here as i feel like that's almost the opposite of this issue. does "multi-tenant" here mean having one set of infrastructure across an arbitrary number of prometheus monitored by one "thanos" with possibly multiple store gateway / compactor? This is the use case I feel like prefixes help with. Or does it mean having multiple teams with different sets of infrastructure who want to share one "thanos" for only monitoring what they are responsible for?

I think both things are the same right? Multiple metric sources (e.g Prometheus + sidecar) from multiple potentially isolated teams, uploading to the same bucket, thus allowing whatever set of compactors/store gateways thanks to sharding.

@dsayan154

In case, a tenant stops subscribing to this solution, the tenant cluster's data should be deleted from the objectore store.

Deletion of data for a defined prefix is a valid issue/ request you can put on Thanos and we can guide/help you write reliable CLI / tooling to make it work. (complexity no 1) (: You can even build on top of that and automated this purely with some deletion API tenant. Building on top of prefix is really not just hacking prefixes for multitenancy. Our semantics for multitenancy is around labels, so would be nice if your setup will follow the same rules so we can work together on those complexities you mentioned (: Per complexity number 2. You are right this is some scalability limitation, however further away (you can have easily 100k items for iter API and it's quite large number of blocks (something like 1y of ~100 tenants) for Thanos with healthy compaction). This is however easy to solve by copying the blocks new dirs if needed etc. Let's create an issue to discuss potential solutions. What I am saying is that prefix will not solve this (very potential) problem: What if within tenant you have that number of blocks?

@roidelapluie

With the prefix, it is possible to use the excellent Thanos block viewer on a cortex backend. Currently, Cortex does not really have a nice UI like Thanos to visualize this.

YES! cc @Oghenebrume50 @prmsrswt @kunal-kushwaha for motivation to get block viewer natively for Prometheus (and Complex a well!)

@stale
Copy link

stale bot commented Dec 13, 2020

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Dec 13, 2020
@pracucci
Copy link
Contributor

Still valid

@stale stale bot removed the stale label Dec 14, 2020
@kakkoyun kakkoyun changed the title Support multi-tenancy on object storage using custom prefix *: Support multi-tenancy on object storage using custom prefix Feb 12, 2021
@stale
Copy link

stale bot commented Apr 18, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Apr 18, 2021
@midnightconman
Copy link
Contributor

Still valid?

@stale stale bot removed the stale label Apr 19, 2021
@pete-leese
Copy link

What is the current status of this? :)

@stale
Copy link

stale bot commented Jul 19, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jul 19, 2021
@dsayan154
Copy link

still valid

@stale stale bot removed the stale label Jul 20, 2021
@oronsh
Copy link
Contributor

oronsh commented Sep 5, 2021

Why won't you do it dynamic rather than config based?
This way, each store api could handle multiple prefixes for multiple tenants.
I guess this would need support in the compactor as well.

@stale
Copy link

stale bot commented Jan 9, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jan 9, 2022
@markmsmith
Copy link

Still needed.

@stale stale bot removed the stale label Jan 9, 2022
@yuha0
Copy link

yuha0 commented Feb 4, 2022

Hi @bwplotka

You just point multiple sidecars to the same bucket, same directory. Since Prometheus has unique external labels we can distinguish between blocks. Directory does not matter for Thanos. (:

This will put a huge and unnecessary pressure on each thanos store process, as well as the singleton compactor process.

For instance, say now I create a new cluster with a new thanos deployment in it, pointing to an existing bucket where other existing old clusters are using as metrics storage backend, the thanos store will immediately require a huge amount of resources (disk, CPU, memory) to process all the existing metrics data in that bucket, even when the new cluster hasn't even started generating a single data point.

This problem is more obvious on the compactor component since it is required to be a singleton process. Losing this one single process means losing data compaction globally, which is a huge risk.

Our current mitigation is just to create one bucket for each "tenant" (Kubernetes cluster) so each one can mind their own resource requirements for thanos store and compactor. But this means a lot of buckets.

@tomereli
Copy link

tomereli commented Apr 10, 2022

Any update on this? Is it at all planned to be added at some point of time? @bwplotka

@bwplotka
Copy link
Member

Hi, all again! Thanks all for keeping this up to date - it's really important to solve all the important cases you might have. Please let us know if this is still valid and for what reasons--we never said no to this, but honestly, we still don't see a clear need. Let me explain:

Currently, all Thanos components support multi-tenancy thanks to external labels:

  1. Receivers uploads block with tenant external label per tenant's block
  2. All bucket read components like Store and Compactor have selector flag mechanism that allows you to choose what blocks to handle. You can totally run any of those and say "give me blocks only for tenant X". You can even shard instances of store per tenants. All explained here (let us know if things are unclear we can expand on this document). IMO this solves cases like:

For instance, say now I create a new cluster with a new thanos deployment in it, pointing to an existing bucket where other existing old clusters are using as metrics storage backend, the thanos store will immediately require a huge amount of resources (disk, CPU, memory) to process all the existing metrics data in that bucket, even when the new cluster hasn't even started generating a single data point.

  1. Tools exists that inject matchers into query e.g https://github.com/prometheus-community/prom-label-proxy (we might consider adding this natively in Querier)

The only need for this kind implementation might be useful when we hit limit of number of directories to scan in single bucket, Prefix might in theory help here for scan speed up. But these reasons seem to be very different to what we see from responses here and might have different solutions (e.g having 2 buckets etc.)

@roidelapluie mentioned

With the prefix, it is possible to use the excellent thanos block viewer on a cortex backend. Currently Cortex does not really have a nice UI like thanos to visualize this.

If I understand it right the thanos tool commands might be needed to be extend by selector feature too.

Let us know if we miss anything here 🤗 Thanks!

@hanjm
Copy link
Member

hanjm commented Apr 29, 2022

Thanks your detailed note. I think the solutions like having 2 or more buckets need a heavy resource operatation . when create a new tenant , using bucket prefix is more easily than apply create a bucket.

@tomereli
Copy link

The reason bucket prefix support is appealing to us and probably others, is since we use silo multi-tenancy model for our tenants, meaning that each tenant's bucket must not be accessible to another tenant.
By this we also mean Thanos itself - we want to enforce this with resource policies in AWS based on JWT custom claims, which will decouple authorization logic from code entirely.
Having single bucket and using Thanos multitenancy just doesn't fit this use case so we're forced to use a Thanos per tenant plus a bucket per tenant.

@markmsmith
Copy link

The use case I posted in May of 2020 is still relevant today, and the lack of this feature blocks us from reducing the number of buckets we need to spin up for each tenant within an AWS account:
#1318 (comment)

@yeya24
Copy link
Contributor

yeya24 commented Jun 24, 2022

Closed by #5337.

@yeya24 yeya24 closed this as completed Jun 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests