[SLO] [R&D] Test strategy for "properly" creating dot-prefixed SLO indices without user permissions #200953

jasonrhodes · 2024-11-20T15:06:02Z

Following on from a longer conversation happening in #196340

Problem context

When a user creates an SLO, they need permission to read data from the selected source indices, e.g. logs for an SLO based on log data, etc. The user does NOT need permissions to create or manage transforms or ingest pipelines, even though an SLO is made up of these Elasticsearch primitives. We achieve this split authorization by using an ES feature called "secondary authorization". The transform APIs and the ingest pipeline APIs both support this authorization model. We made this change in August of this past year.
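
For readers less familiar with secondary auth, here is a minimal sketch of what a request carrying both sets of credentials could look like at the HTTP level. The `es-secondary-authorization` header is the one Elasticsearch reads for secondary credentials; the function name and the API key values are hypothetical, and this is not the code Kibana actually uses.

```typescript
// Hypothetical helper: builds headers for a request where the system
// credentials authorize the API call itself and the user's credentials
// are attached as secondary authorization.
function buildSecondaryAuthHeaders(systemApiKey: string, userApiKey: string) {
  return {
    // Primary credentials: allowed to create and manage the transform.
    Authorization: `ApiKey ${systemApiKey}`,
    // Secondary credentials: what the transform runs as afterwards.
    "es-secondary-authorization": `ApiKey ${userApiKey}`,
  };
}

// Placeholder keys, for illustration only.
const exampleHeaders = buildSecondaryAuthHeaders("sys-key", "user-key");
```

The point of the split is that the stored transform then runs with the user's privileges, which is what keeps action [2] below safe.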

Using secondary auth, the following actions and credentials are used during the life of an SLO:

	Action	Credentials
1	Create and manage transforms	System
2	🔒 Run transform: read from source data	User
3	❌ Run transform: create new dest index	User
4	Run transform: write to dest index	User
5	Create and manage ingest pipelines	System
6	Run pipeline action: enrich data before writing to dest index	User
7	❌ Run pipeline action: roll over data, i.e. create new monthly dest index	User

This lets users create and manage SLOs without needing privileges to create and manage all transforms and ingest pipelines. However, a recent Elasticsearch change blocks the creation of dot-prefixed indices with user credentials. As the rows marked ❌ above show, this blocks actions [3] and [7], and it required us to allow-list our SLO index patterns to get around the restriction.

Proposed solution

Looking into this, it seems to me that only action [2], denoted by 🔒, requires user privileges (to prevent users from creating SLOs that read from data they do not have access to). If that's true, I'm proposing the following solution:

According to the ML team, we can pre-create the dest index for a given transform, and the transform won't attempt to create that index. If we do that in advance using the system user, we solve action [3], because the running transform no longer creates the index itself.
Based on my understanding of this permission model, it's not necessary for ingest pipelines to use "user credentials" at all. If we remove the use of secondary auth from the ingest pipeline and let the pipelines be created, managed, and run using the system user, we solve action [7].
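
A rough sketch of the pre-creation step, assuming it runs before the transform is started. `IndexAdmin` is a hypothetical, synchronous stand-in for an index API bound to the system user (a real Elasticsearch client is asynchronous), and the in-memory fake exists only to exercise the logic.

```typescript
// Hypothetical minimal shape of an index API bound to system credentials.
interface IndexAdmin {
  exists(index: string): boolean;
  create(index: string): void;
}

// Create the transform's dest index up front so the transform, which
// runs with user credentials, never has to create it.
function ensureDestIndex(systemAdmin: IndexAdmin, index: string): "created" | "already_exists" {
  if (systemAdmin.exists(index)) {
    return "already_exists";
  }
  systemAdmin.create(index);
  return "created";
}

// In-memory fake used only for illustration.
function makeFakeAdmin(): IndexAdmin & { names: string[] } {
  const names: string[] = [];
  return {
    names,
    exists: (index: string): boolean => names.indexOf(index) !== -1,
    create: (index: string): void => {
      names.push(index);
    },
  };
}
```

Calling `ensureDestIndex` twice with the same name is a no-op the second time, which matters if resource installation can run more than once.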

In other words, we would end up with a table like this:

	Action	Credentials
1	Create and manage transforms	System
2	🔒 Run transform: read from source data	User
3	~~Run transform: create new dest index~~	N/A
4	Run transform: write to dest index	User
5	Create and manage ingest pipelines	System
6	Run pipeline action: enrich data before writing to dest index	System
7	Run pipeline action: roll over data, i.e. create new monthly dest index	System

The goal of this issue is for engineers to test these assumptions with a POC, and to make sure the above understanding and assumptions sound correct to other stakeholders.

elasticmachine · 2024-11-20T15:06:04Z

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

dominiqueclarke · 2024-12-03T14:28:35Z

I've conducted some initial research. Here's a table showing the credentials we expected based on our assumptions versus the actual credentials observed in testing:

	Action	Expected Credentials	Actual Credentials
1	Create and manage transforms	System	System
2	🔒 Run transform: read from source data	User	User
3	~~Run transform: create new dest index~~	N/A	N/A
4	Run transform: write to dest index	User	User
5	Create and manage ingest pipelines	System	System
6	Run pipeline action: enrich data before writing to dest index	System	User (the same privileges as the user who created the SLO)
7	Run pipeline action: roll over data, i.e. create new monthly dest index	System	User (the same privileges as the user who created the SLO)
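
To make the gap explicit, the table above can be encoded as data and the diverging rows computed. This is only a restatement of the table; the type and function names are made up for illustration.

```typescript
type CredentialKind = "System" | "User" | "N/A";

interface ActionRow {
  id: number;
  action: string;
  expected: CredentialKind;
  actual: CredentialKind;
}

// Rows copied from the table above.
const credentialRows: ActionRow[] = [
  { id: 1, action: "Create and manage transforms", expected: "System", actual: "System" },
  { id: 2, action: "Run transform: read from source data", expected: "User", actual: "User" },
  { id: 3, action: "Run transform: create new dest index", expected: "N/A", actual: "N/A" },
  { id: 4, action: "Run transform: write to dest index", expected: "User", actual: "User" },
  { id: 5, action: "Create and manage ingest pipelines", expected: "System", actual: "System" },
  { id: 6, action: "Run pipeline action: enrich data", expected: "System", actual: "User" },
  { id: 7, action: "Run pipeline action: roll over data", expected: "System", actual: "User" },
];

// Returns the ids of rows where testing contradicted the assumption.
function findMismatches(rows: ActionRow[]): number[] {
  return rows.filter((row) => row.expected !== row.actual).map((row) => row.id);
}

console.log(findMismatches(credentialRows)); // [ 6, 7 ]
```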

Because of this, the initial plan to utilize the system user simply to install the ingest pipeline is untenable.

Here are some solutions we've looked into to overcome this:

1. Transitioning to an ILM-based approach

We explored using ILM to manage the SLO rollup index and roll the data over to a new write index every 30 days or when the primary shard size exceeds 50GB. No change would be made to the summary index. Here is the migration path we would take:

Bump v3.3 to v3.4
Remove processor from ingest pipeline
Create the ILM policy with the system user on resource installation
Apply the ILM policy to the rollup index component template on resource installation
Create the first index by appending -00001 to SLO_DESTINATION_INDEX_NAME for the SLO rollup data, set is_write_index: true, and configure the index as an alias for SLO_DESTINATION_INDEX_NAME

This way, older SLOs on v3.3 would continue using the ingest-pipeline-based rollover until they are updated. Eventually, those SLOs would need to be reset manually.
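
As a sketch of the request bodies these steps imply: the rollover thresholds and the `-00001` suffix come from the steps above, and `max_age`, `max_primary_shard_size`, `index.lifecycle.name`, `index.lifecycle.rollover_alias` and `is_write_index` are standard Elasticsearch fields. The function names and the policy name are hypothetical, and the alias value in the usage lines is assumed to match SLO_DESTINATION_INDEX_NAME.

```typescript
// ILM policy body: roll over after 30 days or a 50GB primary shard.
function buildRolloverPolicy() {
  return {
    policy: {
      phases: {
        hot: {
          actions: {
            rollover: { max_age: "30d", max_primary_shard_size: "50gb" },
          },
        },
      },
    },
  };
}

// Settings to add to the rollup index component template so new
// indices pick up the policy and know which alias to roll over.
function buildLifecycleSettings(policyName: string, aliasName: string) {
  return {
    "index.lifecycle.name": policyName,
    "index.lifecycle.rollover_alias": aliasName,
  };
}

// First backing index, with the destination name acting as a write alias.
function buildBootstrapIndex(aliasName: string) {
  return {
    index: `${aliasName}-00001`,
    aliases: { [aliasName]: { is_write_index: true } },
  };
}

// Assumed values, for illustration only.
const sloAlias = ".slo-observability.sli-v3.4";
const bootstrap = buildBootstrapIndex(sloAlias);
const settings = buildLifecycleSettings("slo-rollup-policy", sloAlias);
```

All three bodies would be sent with the system user during resource installation, so no index creation happens under user credentials.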

Pros:

Elegant solution based on our established ILM system

Cons:

Only works in stateful, which currently doesn't restrict dot index creation anyway.

2. Handling index rollover in task manager

We are currently exploring using task manager to roll over the index manually once it is more than 30 days old or its primary shard size is greater than 50GB. Here is the migration path we would take:

Bump v3.3 to v3.4
Remove processor from ingest pipeline
Create alias .slo-observability.sli-v3.4 to .slo-observability.sli-v3.4* instead of creating the index directly. The alias will create the index since it does not exist.
Create a task that calls .slo-observability.sli-v3.4/_rollover when the conditions are met. Keep in mind that stateful and serverless have different index stats APIs for evaluating these conditions.
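
The check such a task would run can be sketched as a pure function, assuming the task has already fetched the write index's creation time and largest primary shard size from whichever stats API applies. The `WriteIndexStats` shape and the function names are hypothetical; the thresholds are the ones stated above.

```typescript
// Hypothetical, already-normalized stats for the current write index.
interface WriteIndexStats {
  createdAtMs: number; // index creation time, epoch milliseconds
  primaryShardSizeBytes: number; // size of the largest primary shard
}

const MAX_AGE_MS = 30 * 24 * 60 * 60 * 1000; // 30 days
const MAX_PRIMARY_SHARD_BYTES = 50 * 1024 * 1024 * 1024; // 50GB

// True when either rollover condition from the plan above is met.
function shouldRollOver(stats: WriteIndexStats, nowMs: number): boolean {
  const tooOld = nowMs - stats.createdAtMs > MAX_AGE_MS;
  const tooBig = stats.primaryShardSizeBytes > MAX_PRIMARY_SHARD_BYTES;
  return tooOld || tooBig;
}

// Path the task would POST to when shouldRollOver returns true.
function rolloverPath(alias: string): string {
  return `/${alias}/_rollover`;
}
```

Keeping the decision separate from the stats lookup means the stateful and serverless variants only differ in how they fill in `WriteIndexStats`.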

Pros:

Works in both stateful and serverless

Cons:

Essentially recreates ILM. Not an elegant solution.

Moving forward

We will continue to investigate option 2, but at this point I think it's fair to say revisiting the conversation with the ES team about tweaking secondary auth may be valuable. Option 1 is untenable because of serverless, and option 2 is less ideal than desired. A solution at the ES level may be more elegant than moving forward with option 2.

jasonrhodes · 2024-12-03T14:56:44Z

> Because of this, the initial plan to utilize the system user simply to install the ingest pipeline is untenable.

Do we have a full, clear understanding of exactly why the ingest pipelines can't be made to use the system user for creating the new rollover indices? I assume they are just using the transform user specification, so the fact that we are telling the transform to "run" using the user credentials is passing that along to the ingest pipeline, and there's no way to separately control the ingest pipeline's auth choices? I just want to make sure we have that clearly understood and documented as we consider other options.

> I think it's fair to say revisiting the conversation with the ES team about tweaking secondary auth may be valuable

I agree. The overlap of choices we've made at the platform level has created an untenable situation with transforms + dot-prefix indices + serverless + ILM, and I don't think adding to task manager's capacity to recreate ILM in serverless is what we want to spend time on doing and/or maintaining. To be honest, I think we may prefer to continue with the "allowlist" solution rather than introducing task manager based ILM. I think it's valuable to understand what that option looks like, but I can't see us pushing it to production.

jasonrhodes · 2024-12-11T14:29:11Z

To close this out for now, I'd like to (a) get an answer to this: "Do we have a full, clear understanding of exactly why the ingest pipelines can't be made to use the system user for creating the new rollover indices?", and (b) create an issue for the ES team (or comment on an existing issue if one is already there) to start exploring changes to secondary auth, again.

None of this is urgent for v9 so we don't need to push on it right this minute, but having those things in place will help for whenever we need to address this again next year.

jasonrhodes added the Team:obs-ux-management Observability Management User Experience Team label Nov 20, 2024

jasonrhodes mentioned this issue Nov 20, 2024 in [SLO] Rename indices without dot prefix #196340 (Open, 6 tasks)

dominiqueclarke self-assigned this Nov 25, 2024
