[SLO] [R&D] Test strategy for "properly" creating dot-prefixed SLO indices without user permissions #200953

Open
jasonrhodes opened this issue Nov 20, 2024 · 4 comments

@jasonrhodes
Member

jasonrhodes commented Nov 20, 2024

Following on from a longer conversation happening in #196340

Problem context

When a user creates an SLO, they need permission to read data from the selected source indices, e.g. logs for an SLO based on log data, etc. The user does NOT need permissions to create or manage transforms or ingest pipelines, even though an SLO is made up of these Elasticsearch primitives. We achieve this split authorization by using an ES feature called "secondary authorization". The transform APIs and the ingest pipeline APIs both support this authorization model. We made this change in August of this past year.

Using secondary auth, the following actions and credentials are used during the life of an SLO:

| # | Action | Credentials |
|---|--------|-------------|
| 1 | Create and manage transforms | System |
| 2 | 🔒 Run transform: read from source data | User |
| 3 | ❌ Run transform: create new dest index | User |
| 4 | Run transform: write to dest index | User |
| 5 | Create and manage ingest pipelines | System |
| 6 | Run pipeline action: enrich data before writing to dest index | User |
| 7 | ❌ Run pipeline action: roll over data, i.e. create new monthly dest index | User |

This lets users create and manage SLOs without needing privileges to create and manage all transforms and ingest pipelines. However, a recent Elasticsearch change blocks the creation of dot-prefixed indices with user credentials. As the rows marked ❌ above show, this blocks actions [3] and [7], and it required us to allow-list our SLO index patterns to get around the restriction.
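For readers less familiar with secondary authorization, here is a minimal sketch of what the split-credential transform request might look like. It only builds a request descriptor and does not contact a cluster. The helper name `buildCreateTransformRequest`, the `EsRequest` shape, and the credential strings are hypothetical; the `es-secondary-authorization` header is the one Elasticsearch reads for secondary credentials. This is not the actual Kibana SLO code.

```typescript
// Sketch only: describes the HTTP request, it does not send it.
interface EsRequest {
  method: "PUT";
  path: string;
  headers: Record<string, string>;
  body: unknown;
}

// Hypothetical helper. `systemAuth` and `userAuth` are complete
// Authorization header values (for example "ApiKey <encoded key>").
function buildCreateTransformRequest(
  transformId: string,
  transformBody: unknown,
  systemAuth: string,
  userAuth: string
): EsRequest {
  return {
    method: "PUT",
    path: `/_transform/${encodeURIComponent(transformId)}`,
    headers: {
      // Primary credentials: authorize creating/managing the transform (action 1).
      Authorization: systemAuth,
      // Secondary credentials: what the transform runs as (actions 2-4).
      "es-secondary-authorization": userAuth,
      "Content-Type": "application/json",
    },
    body: transformBody,
  };
}

// Usage with placeholder values (assumptions, not real identifiers).
const exampleRequest = buildCreateTransformRequest(
  "slo-example",
  { source: { index: ["logs-*"] }, dest: { index: ".slo-observability.sli-v3.4" } },
  "ApiKey system-placeholder",
  "ApiKey user-placeholder"
);
console.log(exampleRequest.path, Object.keys(exampleRequest.headers));
```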

Proposed solution

Looking into this, it seems to me that only action [2], denoted by 🔒 , requires user privileges (to prevent users from creating SLOs that read from data that they do not have access to). If that's true, I'm proposing the following solution:

  1. According to the ML team, we can pre-create the dest index for a given transform and it won't attempt to create that index. If we do that in advance using the system user, we solve action [3], as it will no longer create the index as part of the running transforms.
  2. Based on my understanding of this permission model, it's not necessary for ingest pipelines to use "user credentials" at all. If we remove the use of secondary auth from the ingest pipeline and let the pipelines be created, managed, and run using the system user, we solve action [7].

In other words, we would end up with a table like this:

| # | Action | Credentials |
|---|--------|-------------|
| 1 | Create and manage transforms | System |
| 2 | 🔒 Run transform: read from source data | User |
| 3 | Run transform: create new dest index | N/A |
| 4 | Run transform: write to dest index | User |
| 5 | Create and manage ingest pipelines | System |
| 6 | Run pipeline action: enrich data before writing to dest index | System |
| 7 | Run pipeline action: roll over data, i.e. create new monthly dest index | System |

The goal of this issue is for engineers to test these assumptions with a POC, and to make sure the above understanding and assumptions sound correct to other stakeholders.
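As a starting point for the POC, step 1 of the proposal (pre-create the dest index with system credentials, then create the transform) could be sketched as an ordered request plan. The function name `planSloInstall`, the option names, and the credential placeholders are all hypothetical, and the transform body is passed through untouched; whether the running transform really skips index creation in this case is exactly the assumption to be tested.

```typescript
// Sketch only: returns the requests in the order they would be issued.
interface PlannedRequest {
  method: "PUT";
  path: string;
  headers: Record<string, string>;
  body: unknown;
}

interface InstallOptions {
  destIndex: string;      // dot-prefixed SLO destination index
  transformId: string;
  transformBody: unknown; // full transform definition, built elsewhere
  systemAuth: string;     // Authorization header value for the system user
  userAuth: string;       // Authorization header value for the SLO creator
}

function planSloInstall(opts: InstallOptions): PlannedRequest[] {
  return [
    {
      // Step A: system credentials only, so the dot-prefix restriction
      // on user credentials should never come into play (action 3).
      method: "PUT",
      path: `/${encodeURIComponent(opts.destIndex)}`,
      headers: { Authorization: opts.systemAuth },
      body: {},
    },
    {
      // Step B: create the transform; the user credentials travel as
      // secondary auth so reads from source data stay user-scoped (action 2).
      method: "PUT",
      path: `/_transform/${encodeURIComponent(opts.transformId)}`,
      headers: {
        Authorization: opts.systemAuth,
        "es-secondary-authorization": opts.userAuth,
      },
      body: opts.transformBody,
    },
  ];
}
```

Keeping the plan as plain data makes it easy to assert in a test that the index creation never carries user credentials.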

@jasonrhodes jasonrhodes added the Team:obs-ux-management Observability Management User Experience Team label Nov 20, 2024
@elasticmachine
Contributor

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

@dominiqueclarke
Contributor

dominiqueclarke commented Dec 3, 2024

I've conducted some initial research. Here's a table comparing the credentials we expected based on our assumptions with the actual credentials observed in testing:

| # | Action | Expected Credentials | Actual Credentials |
|---|--------|----------------------|--------------------|
| 1 | Create and manage transforms | System | System |
| 2 | 🔒 Run transform: read from source data | User | User |
| 3 | Run transform: create new dest index | N/A | N/A |
| 4 | Run transform: write to dest index | User | User |
| 5 | Create and manage ingest pipelines | System | System |
| 6 | Run pipeline action: enrich data before writing to dest index | System | User (the same privileges as the user who created the SLO) |
| 7 | Run pipeline action: roll over data, i.e. create new monthly dest index | System | User (the same privileges as the user who created the SLO) |

Because of this, the initial plan to utilize the system user simply to install the ingest pipeline is untenable.

Here are some solutions we've looked into to overcome this:

1. Transitioning to an ILM-based approach

We explored using ILM to manage the SLO rollup index and roll the data over to a new write index every 30 days or when the primary shard size is greater than 50GB. No change would be made to the summary index. Here is the migration path we would take:

  1. bump v3.3 to v3.4
  2. Remove processor from ingest pipeline
  3. Create the ILM policy with the system user on resource installation
  4. Apply the ILM policy to the rollup index component template on resource installation
  5. Create the first index by appending -00001 to SLO_DESTINATION_INDEX_NAME for the SLO rollup data, set is_write_index: true, and configure the index as an alias for SLO_DESTINATION_INDEX_NAME

This way, older SLOs on v3.3 would continue using the ingest-pipeline-based rollover until updated. Eventually, those SLOs would need to be reset manually.
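Steps 3 to 5 of the ILM migration path could be sketched roughly as the request bodies below. This is an illustration under assumptions: the policy name `slo-rollup-policy` and the function names are hypothetical, and the value used for `SLO_DESTINATION_INDEX_NAME` is borrowed from the index name mentioned under option 2. The `-00001` suffix follows the step as written above. The `index.lifecycle.name` and `index.lifecycle.rollover_alias` settings are standard ILM index settings.

```typescript
// Example value only; the real constant lives in the SLO plugin.
const SLO_DESTINATION_INDEX_NAME = ".slo-observability.sli-v3.4";
// Hypothetical policy name.
const SLO_ILM_POLICY_NAME = "slo-rollup-policy";

// Step 3: ILM policy body (roll over after 30 days or 50GB primary shard size).
function buildIlmPolicy() {
  return {
    policy: {
      phases: {
        hot: {
          actions: {
            rollover: { max_age: "30d", max_primary_shard_size: "50gb" },
          },
        },
      },
    },
  };
}

// Step 4: settings to merge into the rollup index component template.
function buildLifecycleSettings(policyName: string, alias: string) {
  return {
    "index.lifecycle.name": policyName,
    "index.lifecycle.rollover_alias": alias,
  };
}

// Step 5: first backing index, with the destination name as its write alias.
function buildBootstrapIndex(alias: string): {
  index: string;
  body: { aliases: Record<string, { is_write_index: boolean }> };
} {
  return {
    index: `${alias}-00001`,
    body: { aliases: { [alias]: { is_write_index: true } } },
  };
}

console.log(JSON.stringify(buildBootstrapIndex(SLO_DESTINATION_INDEX_NAME), null, 2));
```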

Pros:

  1. Elegant solution based on our established ILM system

Cons:

  1. Only works in stateful, which currently doesn't restrict dot-prefixed index creation anyway.

2. Handling index rollover in task manager

We are currently exploring using task manager to roll over the index manually once it is more than 30 days old or its primary shard size is greater than 50GB. Here is the migration path we would take:

  1. bump v3.3 to v3.4
  2. Remove processor from ingest pipeline
  3. Create alias .slo-observability.sli-v3.4 to .slo-observability.sli-v3.4* instead of creating the index directly. The alias will create the index since it does not exist.
  4. Create a task calling .slo-observability.sli-v3.4/_rollover when conditions are met. Keep in mind that stateful and serverless expose different index stats APIs, so evaluating these conditions differs between the two.
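The condition check in step 4 could look something like the sketch below. It assumes the task has already fetched the write index's creation time and largest primary shard size from whichever stats API is available; the `WriteIndexStats` shape and the function names are hypothetical, and 50GB is interpreted as 50 * 1024^3 bytes.

```typescript
// Hypothetical, normalized view of the stats the task would collect
// (stateful and serverless would each need their own adapter).
interface WriteIndexStats {
  createdAtMs: number;           // index creation time, epoch milliseconds
  primaryShardSizeBytes: number; // size of the largest primary shard
}

const MAX_AGE_MS = 30 * 24 * 60 * 60 * 1000;    // 30 days
const MAX_PRIMARY_SHARD_BYTES = 50 * 1024 ** 3; // 50GB

// True when either condition from the migration path is met.
function shouldRollover(stats: WriteIndexStats, nowMs: number): boolean {
  const tooOld = nowMs - stats.createdAtMs > MAX_AGE_MS;
  const tooBig = stats.primaryShardSizeBytes > MAX_PRIMARY_SHARD_BYTES;
  return tooOld || tooBig;
}

// Describes the rollover call the task would make against the alias.
function buildRolloverRequest(alias: string): { method: "POST"; path: string } {
  return { method: "POST", path: `/${alias}/_rollover` };
}

// Usage: decide, then describe the call.
const sampleStats: WriteIndexStats = {
  createdAtMs: Date.UTC(2024, 10, 1),
  primaryShardSizeBytes: 10 * 1024 ** 3,
};
if (shouldRollover(sampleStats, Date.UTC(2024, 11, 15))) {
  console.log(buildRolloverRequest(".slo-observability.sli-v3.4"));
}
```

Keeping the decision in a pure function means only the stats adapter has to differ between stateful and serverless.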

Pros:

  1. Works in both stateful and serverless

Cons:

  1. Essentially recreates ILM. Not an elegant solution

Moving forward

We will continue to investigate option 2, but at this point I think it's fair to say revisiting the conversation with the ES team about tweaking secondary auth may be valuable. Option 1 is untenable because of serverless, and option 2 is less ideal than desired. A solution at the ES level may be more elegant than moving forward with option 2.

@jasonrhodes
Member Author

> Because of this, the initial plan to utilize the system user simply to install the ingest pipeline is untenable.

Do we have a full, clear understanding of exactly why the ingest pipelines can't be made to use the system user for creating the new rollover indices? I assume they are just using the transform user specification, so the fact that we are telling the transform to "run" using the user credentials is passing that along to the ingest pipeline, and there's no way to separately control the ingest pipeline's auth choices? I just want to make sure we have that clearly understood and documented as we consider other options.

> I think it's fair to say revisiting the conversation with the ES team about tweaking secondary auth may be valuable

I agree. The overlap of choices we've made at the platform level has created an untenable situation with transforms + dot-prefix indices + serverless + ILM, and I don't think adding to task manager's capacity to recreate ILM in serverless is what we want to spend time on doing and/or maintaining. To be honest, I think we may prefer to continue with the "allowlist" solution rather than introducing task manager based ILM. I think it's valuable to understand what that option looks like, but I can't see us pushing it to production.

@jasonrhodes
Member Author

To close this out for now, I'd like to (a) get an answer to this: "Do we have a full, clear understanding of exactly why the ingest pipelines can't be made to use the system user for creating the new rollover indices?", and (b) create an issue for the ES team (or comment on an existing issue if one is already there) to start exploring changes to secondary auth, again.

None of this is urgent for v9 so we don't need to push on it right this minute, but having those things in place will help for whenever we need to address this again next year.
