
[RAC] RFC: Index naming and hierarchy #98912

Open
banderror opened this issue Apr 30, 2021 · 81 comments

Comments

@banderror
Contributor

banderror commented Apr 30, 2021

Related to:

#93729
#95903
#98353
elastic/elasticsearch#72181

Summary

There are more and more questions and concerns being raised regarding the rule monitoring implementation for RAC. I'm working on #98353, which implements an "event log" abstraction within the rule registry that will be used for writing and reading both alerts and rule execution logs.

This RFC proposes a naming convention and structure for RAC indices and a hierarchy for rule registries, and lists a few open questions and concerns.

Proposal

The index alias naming convention would be similar to the Elastic data stream naming scheme:

{prefix}-{consumer}.{additional.log.name}-{kibana space}

where:

  • {prefix} will be .alerts by default. Users will be able to override it in Kibana config and set it to any other value, e.g. .alerts-xyz or .whatever (for compatibility with legacy Kibana multitenancy).
  • {consumer} will indicate the rule/alert consumer in terms of Alerting framework (more info). We will probably have security, observability and stack as consumers.
  • {additional.log.name} will be used to specify concrete indices for alerts, execution events, metrics and whatnot. We will probably have only alerts and events as concrete logs. It might be good to have short and sweet names without - or . in them (perhaps _ is ok).
  • {kibana space} will indicate space id, for example default. In case of space agnostic alerts, the no.space placeholder can be used.

Examples of concrete index names

For clarity, this section contains the concrete index names that would be created by the Security and Observability solutions.

Security:

.alerts-security.alerts-{kibana space}-000001        // (1) 
.alerts-security.events-{kibana space}-000001        // (2) 

(1) The Alert documents that support human workflow and are updatable.
(2) Rule-specific execution events and metrics created by Security rules to enhance our observability of alerting.

Technically it will be possible to derive child logs from these alerts and events, e.g. .alerts-security.alerts.ml-{kibana space}, although we don't think we need this in Security at this point.

Observability:

.alerts-observability.{apm,uptime,metrics,logs}.alerts-{kibana space}-000001   // (1)
.alerts-observability.{apm,uptime,metrics,logs}.events-{kibana space}-000001   // (2)

.alerts-observability.metrics.alerts-no.space-000001   // (3)

(1) The Alert documents that are updatable. Same exact semantics as for Security (1)
(2) Supporting documents (evaluation) for the Alerts + execution logs to be used for the Observability of Alerting
(3) Example of a space agnostic alert index. This can be used by space-agnostic rule types, like the ones Stack Monitoring might need. The no.space "space" is not a valid Kibana space name, so this pattern can be used as a placeholder.
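
For illustration, a minimal Dev Tools sketch (not the actual rule registry code) of how one of these aliases could be bootstrapped: the first backing index gets the -000001 suffix and the alias is attached as a write alias, so subsequent rollovers can be managed automatically.

## Hypothetical bootstrap of the default-space Security alerts log
PUT /.alerts-security.alerts-default-000001
{
  "aliases": {
    ".alerts-security.alerts-default": { "is_write_index": true }
  }
}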

Diagram

Here's a structural diagram showing some rule execution dependencies in the context of RAC and how the proposed indices fit the whole picture:

[Diagram: RAC rule execution dependencies and the proposed indices (diagram source link)]

@banderror banderror added the Theme: rac label obsolete label Apr 30, 2021
@banderror banderror self-assigned this Apr 30, 2021
@botelastic botelastic bot added the needs-team Issues missing a team label label Apr 30, 2021
@banderror
Contributor Author

@dgieselaar
Member

dgieselaar commented Apr 30, 2021

Thanks @banderror for putting this up! Couple of questions:

  • Do we need the version in the index name? I added this, but I copied it from the event log, and I'm not sure whether we actually need it. Can you think of any scenarios?
  • I think we're broadly in agreement that it will be .alerts-* by default, with a configuration option to point it to a different index, e.g. to match kibana.index. Do you mind updating the RFC to reflect that?
  • What do the index names look like for stack rules?
  • Do we need separate indices for alerts/events? I think it makes sense. But it makes some things a little harder, like figuring out the index target. I assume that whoever has access to e.g. security alerts, would have access to other events (rule monitoring, evaluations, state changes) as well. If that is the case, the user would have to be granted permission to .alerts-*-security, and that asterisk is greedy I think, so administrators might accidentally grant broader privileges than intended. Something like .alerts-security*, or .alerts-security-myspace* is perhaps more reasonable. We could also use .alerts-security.alerts-myspace and .alerts-security.events-myspace. Ideally we can stay close to the data stream naming scheme, if we decide to switch to that at some point.

@spong
Member

spong commented May 1, 2021

With regards to 1. The problem with the .kibana- prefix: in async discussions it was determined with a fair degree of confidence that using .alerts-* indices is going to work for our use cases. We will provide a user-configurable kibana.yml override value, similar to kibana.index, which may only be available in 7.x, but will allow legacy multi-tenancy users who have segmented their Kibana entities via kibana.index to continue to do that, so long as they also specify this new configuration option. Setting it to, for example, "xyz", will store alerts in .alerts-xyz-*, etc. The RBAC implications here would require T-Grid to have a toggle for using the Kibana Security Model (feature privileges and the kibana_system user), or ES Index Privileges. We will need to scope this accordingly with the RBAC efforts.

@tsg
Contributor

tsg commented May 3, 2021

If that is the case, the user would have to be granted permission to .alerts-*-security, and that asterisk is greedy I think, so administrators might accidentally grant broader privileges than intended. Something like .alerts-security*, or .alerts-security-myspace* is perhaps more reasonable. We could also use .alerts-security.alerts-myspace and .alerts-security.events-myspace. Ideally we can stay close to the data stream naming scheme, if we decide to switch to that at some point.

++, I like something like .alerts-security.events-myspace and .alerts-security.alerts-myspace so security shows up before event/alerts. The . is to stay close to the datastream naming model, like @dgieselaar suggested.

It might be good for all registries to have a short and sweet name without - or . in it (perhaps _ is ok).

What do the index names look like for stack rules?

Something like .alerts-stack.alerts-default? or .alerts-core.alerts-default?

Do we need the version in the index name? I added this, but I copied it from the event log, and I'm not sure whether we actually need it. Can you think of any scenarios?

I'm not sure on this one; the data streams don't include the version, so I'm thinking we start without it first. We can add it later if we really need it.

@banderror
Contributor Author

Thank you for the comments, this is very helpful 👍

Kibana version in the name

Do we need the version in the index name? I added this, but I copied it from the event log, and I'm not sure whether we actually need it. Can you think of any scenarios?

I'm not sure on this one; the data streams don't include the version, so I'm thinking we start without it first. We can add it later if we really need it.

I just followed the existing implementations as well, so I'm not sure. Off the top of my head, I'd imagine this version number could be helpful for document migrations. On the other hand, migrations could be built on top of a different number specifically used for tracking changes in the schema. For example, the migration system for the .siem-signals index uses the version of the corresponding index template. When we bump this version in the code, documents get reindexed into a new index with the new template, and the alias is updated to point to this new index. I'm not very familiar with this implementation, but I think this is the rough idea behind it.
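
To make that mechanism a bit more concrete, here is a hedged sketch of the reindex-and-alias-swap step (index names are illustrative, this is not the actual .siem-signals migration code):

## Reindex documents into an index created from the new template version
POST /_reindex
{
  "source": { "index": ".siem-signals-default-000001" },
  "dest": { "index": ".siem-signals-default-000002" }
}

## Point the alias at the new index
POST /_aliases
{
  "actions": [
    { "remove": { "index": ".siem-signals-default-000001", "alias": ".siem-signals-default" } },
    { "add": { "index": ".siem-signals-default-000002", "alias": ".siem-signals-default", "is_write_index": true } }
  ]
}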

Other than that, I don't have any ideas regarding use cases for Kibana version in the index name. I would support your suggestions and remove it for now 👍

Multiple indices for alerts, rule execution events etc

Do we need separate indices for alerts/events?

I think yes, but I'm also open for any objections. Why I'd say separate indices are a better option:

  • Alerts and execution events will likely have very different schemas: subset of standard ECS, custom fields added, and so the resulting ES mappings will be different. At least this is true for Security.
  • Execution events for Observability and Security will probably have slightly different schemas as well. They will have something in common and also something specific to each solution. So I think it would be good to define a common alerts schema and a common execution events schema, build a root registry using them, and then extend those schemas in the child solution registries. We need a common schema for the unified app, and solution-specific extensions to be used within solutions. Having separate schemas seems cleaner and easier to maintain.
  • Separate types of data in separate indices = more lightweight indices, more efficient queries.
  • At this point we don't have any UI where we would combine alerts and execution events in a single list or table.
  • Ability to define different ILM policies for alerts and execution events (see the sketch below). In Security, I think we need to store alerts for as long as possible, while execution events like status changes and metrics such as how much time was spent on indexing are probably not something that needs to be kept for a long time.
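
As a rough illustration of that last point, the two logs could get their own ILM policies - long retention for alerts, short retention for execution events (policy names and retention values below are placeholders, not agreed-upon numbers):

## Keep alerts around; only roll over when the index grows large
PUT /_ilm/policy/alerts-default-policy
{
  "policy": {
    "phases": {
      "hot": { "actions": { "rollover": { "max_size": "50gb", "max_age": "30d" } } }
    }
  }
}

## Roll over execution events frequently and delete them after a month
PUT /_ilm/policy/events-default-policy
{
  "policy": {
    "phases": {
      "hot": { "actions": { "rollover": { "max_age": "7d" } } },
      "delete": { "min_age": "30d", "actions": { "delete": {} } }
    }
  }
}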

Naming in general

I like the suggested data stream naming scheme! So if I got it right, this is what we're gonna have:

{alerts prefix}-{name of the log}-{kibana space}

{alerts prefix} will be .alerts by default. Users will be able to override it in Kibana config and set it to any other value, e.g. .alerts-xyz or .whatever.

{name of the log} will represent the hierarchy of logs used in rule registries. The name will be a combination of parts concatenated with .: the first part will be the solution name, the second the type of the log.

Examples: security.alerts, security.events (or security.execlog?), observability.alerts etc.

My questions regarding this naming:

  • Do you think .events is a good name? Looks a bit ambiguous to me, in the sense that "events" in terms of detection engine are the source objects that detection rules are executed against.
  • Are you expecting to have child logs in observability, e.g. observability.events.{rule type} or observability.alerts.{service name} or something like that?
  • Do you think {kibana space} should always be at the end of the index alias? Could there be arguments for {alerts prefix}-{name of the log}-{kibana space}-{name of the child log}? Just want to make sure we're not missing anything here. This naming convention may impact the implementation of [RAC] Rule monitoring: Event Log for Rule Registry #98353

It might be good for all registries to have a short and sweet name without - or . in it (perhaps _ is ok).

Agree 👍 I will add this check to the implementation.

Stack rules

What do the index names look like for stack rules?

Something like .alerts-stack.alerts-default? or .alerts-core.alerts-default?

I'm not really aware of any requirements for stack rules, maybe I've missed that part. To clarify, stack rules are the rules that can be created from the Stack Management UI (/app/management/insightsAndAlerting/triggersActions/rules)? I can see there are many different types of them, maybe these types would require dedicated indices, or maybe not.

.alerts-stack.alerts-{space} or .alerts-core.alerts-{space} sounds good to me. Or .alerts-stack.alerts.{rule type}-{space} or something like that.

Do we have any requirements/plans for stack rules? In terms of RAC, stack rules == rules which will be created directly from the unified alerting app?

@banderror
Contributor Author

Oh yes, and I will of course update the RFC, just want us to agree on most of the details.
Thanks again for asking all these questions and giving suggestions, this is a lot of helpful info for me.

@dgieselaar @spong @tsg

@pmuellr
Member

pmuellr commented May 3, 2021

Trying to understand the difference between the "alerts" and "execlog" and current event log indices.

I assume "alerts" is intended to hold data regarding the alert being run - for index threshold, that would include the threshold being tested against, the value calculated from the ES aggs call to compare to the threshold, etc. And so the "execlog" indices would be like the current event log, which just captures the execution times/duration, status, etc.

Which I think means the event log itself becomes unneeded, eventually.

But I'd like to understand the field differences between the event log and execlog. Because I'm wondering if we can live with the current event log for now, especially given the following:

At this point we don't have any UI where we would combine alerts and execution events in a single list or table.

@pmuellr
Member

pmuellr commented May 3, 2021

re: kibana version in index name

We did this for the event log, because it solved a problem for us, and we noticed other Kibana apps doing this - something in o11y, but not sure exactly what.

The problem is: "migrating" indices when Kibana is upgraded. Obviously (I hope), we weren't planning on doing ".kibana-style" migrations of the event log, but we were worried about situations where we might want to change index mappings. Would we hit scenarios where we'd want mapping changes that wouldn't work well with the existing data, potentially requiring a re-index (and even that might not be able to "fix" something)?

Adding a version to the name makes this problem go away! We always create a new index template, alias, and initial index when Kibana is updated. And then we end up using .kibana-event-log-* in queries over ALL the different versions.

Obviously, this doesn't handle every case. If the structure between Kibana versions changes "too much", we could be in a position where we wouldn't be able to validly query old data, and similar sorts of problems. But this felt like the best thing to do, back when we wrote this.

If we want to explore not using the version in the index name, then I think we need to have a really good story for what happens when the mappings change when Kibana is updated. Haven't thought too much about this (since it's not a problem for the event log).
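
For reference, a hedged illustration of that versioned pattern (not the exact event log implementation): each Kibana version gets its own template, alias and initial index, and reads span all versions through a wildcard.

## Hypothetical bootstrap for one Kibana version
PUT /.kibana-event-log-7.13.0-000001
{
  "aliases": { ".kibana-event-log-7.13.0": { "is_write_index": true } }
}

## Queries go over all versions at once
GET /.kibana-event-log-*/_search
{
  "query": { "match_all": {} }
}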

@ymao1
Contributor

ymao1 commented May 3, 2021

  • What do the index names look like for stack rules?

@dgieselaar Is this question driven by the desire to view/manage alerts from stack rules and alerts from within each solution? If so, is it just limited to stack rules? Is there the desire to view/manage alerts from o11y rules inside security and security alerts from o11y? At one point, I saw the suggestion to determine the index to write to based on the consumer of the alert, not the producer. Is that something that is up for consideration?

@tsg
Contributor

tsg commented May 3, 2021

At one point, I saw the suggestion to determine the index to write to based on the consumer of the alert, not the producer. Is that something that is up for consideration?

@ymao1 ++, I think using the consumer makes more sense. This way we create alerts directly in the indices where we need them, and we don't have to query across solutions.

@tsg
Contributor

tsg commented May 3, 2021

@pmuellr For eventlog, do you store the full version including the patch level (e.g. 7.11.2)? This is what Beats used to do before the new indexing strategy. It does help on upgrades, but it can mean creating a lot of indices in case of frequent upgrades.

The question is whether we really need it, or whether it is enough to trigger an ILM rotation when we do an upgrade that changes the mapping. This might get complicated if we have multiple Kibana instances and they aren't upgraded all at once (is that permitted?).

FWIW, when the index naming strategy was discussed, I pressed on the need of including the version: https://github.com/elastic/observability-dev/issues/283#issuecomment-527372212

My only reason for going without it first is that this is what the new indexing strategy does, and it can be added later.

@gmmorris
Contributor

gmmorris commented May 5, 2021

At one point, I saw the suggestion to determine the index to write to based on the consumer of the alert, not the producer. Is that something that is up for consideration?

@ymao1 ++, I think using the consumer makes more sense. This way we create alerts directly in the indices where we need them, and we don't have to query across solutions.

That's good to hear, as this question is one of the main blockers for migrating Stack Rules to Alerts-as-Data.
Can we proceed with the assumption that the consumer will be in the index name? Who should we work with to make sure this is reflected in the Rules Registry/ Alert indices?

@banderror
Contributor Author

Regarding consumer vs producer, could you explain what that actually means? What would be examples of producers and consumers in terms of RAC?

From the current code of detection engine, I can see that:

  • when we register our rule types (siem.signals and siem.notifications), we specify producer: 'siem'
  • when we create rule instances, we specify consumer: 'siem'

So seems like for our rules producer and consumer are always the same thing - our app.

Is that assumed to change in some way? Would stack rules be able to generate alerts for solutions? Does the naming already discussed here (.alerts-{solution}*) fit? I mean, can we treat the solution name as a consumer name?

@tsg @ymao1 @gmmorris

@banderror
Contributor Author

Who should we work with to make sure this is reflected in the Rules Registry/ Alert indices?

@spong @dgieselaar @banderror :) I will incorporate all the feedback from this RFC to #98353

@banderror
Contributor Author

@spong regarding version in the index name and Tudor's comment #98912 (comment)

I'd say maybe we should stick to the same approach as we already have in the .siem-signals implementation - unless there are any known issues with it and our signals migration?

@ymao1
Contributor

ymao1 commented May 5, 2021

@banderror The alerting framework maintains the idea of producer and consumer, where the producer is the solution creating the rule type (security, uptime, apm, stackRules, etc) and consumer is essentially the location within Kibana where the user is creating the rule of that type. You are correct in that right now, there are not many examples of rules with different consumers and producers, as security rules defined by security are only created and viewed within security. I believe we want to allow for this capability though, where within security, a user could create a rule of either
a security rule type or a stack rule type. If we allow for this, we would want the user to see alerts from stack rules that were created from within the security solution.

If the index schema is based on the producer, where security-produced rules are written to .alerts-security-* and stack-produced rules are written to .alerts-stack-*, then the RAC client would need to broaden the indices that it queries over to get all alert data for rules created within a consumer (solution). In addition, if we're giving users privileges to specific index prefixes in order for them to create ad-hoc visualizations, we might be limiting the alerts they see in that manner as well.

@banderror
Contributor Author

Oh I see that now, thank you @ymao1 for the clear explanation.

I believe we want to allow for this capability though, where within security, a user could create a rule of either
a security rule type or a stack rule type. If we allow for this, we would want the user to see alerts from stack rules that were created from within the security solution.

Gotcha. Maybe it means that a rule type (a stack rule type in this case), instead of indexing alerts directly, will need to use some kind of an indexing "strategy" injected into it, which would know how to properly index the alert into the destination alerts-as-data index (and would respect the document schema and mappings). Otherwise we would need to have the same mappings in all alerts-as-data indices, which I'm not sure would be feasible or not.

Do you think this naming might work .alerts-{consumer}.alerts-{kibana space}? We will probably still have security, observability and stack as consumers?

@banderror
Contributor Author

Regarding Kibana version in the name and migrations/rollovers. This is how it's implemented in Security for .siem-signals index.

Basically, we don't have Kibana version in the index name, but instead we maintain index template version in the code, and we have two mechanisms for two cases which both use this version:

  1. There's an index rollover logic for non-breaking changes (see the sketch after this list). For example, if we add a new field to the index template, we bump its version and the app will trigger an index rollover automatically - without any explicit action required from the user.
  2. There's a migration logic (route, reindexing) for breaking changes. If we introduce breaking changes to the schema/template, then the users will have to call the migration API. The API does actual reindexing of documents into a new index, not just simple rollover.
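
A minimal sketch of case (1), assuming a write alias per log: once the updated index template is in place, a rollover creates a new backing index that picks up the new mappings, without reindexing existing documents.

## Hypothetical write alias managed by the app; the new backing index
## (-000002, -000003, ...) is created from the updated template
POST /.alerts-security.alerts-default/_rollover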

We could use the same or similar approach for RAC indices. Or maybe there are cons to that, like:

This might get complicated if we have multiple Kibana instances and they aren't upgraded all at once (is that permitted?).

@dgieselaar
Member

re:

This might get complicated if we have multiple Kibana instances and they aren't upgraded all at once (is that permitted?).

We could alleviate this concern by scheduling a task, as that is guaranteed to be picked up by a single Kibana instance (I think Gidi or Patrick suggested that).

@dgieselaar
Member

dgieselaar commented May 5, 2021

@banderror The alerting framework maintains the idea of producer and consumer, where the producer is the solution creating the rule type (security, uptime, apm, stackRules, etc) and consumer is essentially the location within Kibana where the user is creating the rule of that type. You are correct in that right now, there are not many examples of rules with different consumers and producers, as security rules defined by security are only created and viewed within security. I believe we want to allow for this capability though, where within security, a user could create a rule of either
a security rule type or a stack rule type. If we allow for this, we would want the user to see alerts from stack rules that were created from within the security solution.

If the index schema is based on the producer, where security-produced rules are written to .alerts-security-* and stack-produced rules are written to .alerts-stack-*, then the RAC client would need to broaden the indices that it queries over to get all alert data for rules created within a consumer (solution). In addition, if we're giving users privileges to specific index prefixes in order for them to create ad-hoc visualizations, we might be limiting the alerts they see in that manner as well.

++ on this - which is why I think we should always include all technical fields in the shared component template. That should hopefully be a few dozen only. But that could allow users to point any rule to any index, and all the other stuff would be metadata, which may or may not need a runtime field to be queryable.
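
A hedged sketch of that layering (the field names are illustrative, not the finalized set of technical fields): a shared component template carries the common technical fields, and each solution-specific index template composes it.

## Shared technical fields used by every alerts-as-data index
PUT /_component_template/alerts-technical-fields
{
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "kibana.rac.alert.status": { "type": "keyword" },
        "kibana.rac.alert.owner": { "type": "keyword" }
      }
    }
  }
}

## A consumer-specific index template composed of the shared fields
PUT /_index_template/alerts-security-alerts
{
  "index_patterns": [".alerts-security.alerts-*"],
  "composed_of": ["alerts-technical-fields"]
}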

@gmmorris
Contributor

gmmorris commented May 5, 2021

re:

This might get complicated if we have multiple Kibana instances and they aren't upgraded all at once (is that permitted?).

We could alleviate this concern by scheduling a task, as that is guaranteed to be picked up by a single Kibana instance (I think Gidi or Patrick suggested that).

That was me, but even that's not super obvious, we'd need to understand the exact flow you want to support.

BTW rolling upgrades are not supported, so it's less about instances being upgraded separately, and more about more than one instance being booted at the same time.

@banderror
Contributor Author

I updated the proposal in the description based on all your feedback. Thank you!
Let me know if I forgot to mention anything.

@dgieselaar
Member

dgieselaar commented May 5, 2021

Fwiw, consumer is currently not Observability for rule types, but APM/Uptime etc. My suggestion would be to not tightly couple this to the alerting framework's interpretation of consumer. Generally, I feel we should avoid technically depending on the index name, and treat it as a scoping mechanism for allowing administrators to more easily grant access to subsets of data. Preferably we use a query when we query alerts instead of reconstructing the entire index alias. Not sure if that is being suggested here, but wanted to call that out.
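
To illustrate the query-based scoping suggested here (the field names are assumptions, not a finalized schema), a consumer/space filter over the broad pattern rather than a reconstructed alias:

## Scope by fields instead of by index name
GET /.alerts-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "kibana.rac.alert.owner": "apm" } },
        { "term": { "kibana.space_ids": "default" } }
      ]
    }
  }
}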

@spong
Member

spong commented May 6, 2021

++ on this - which is why I think we should always include all technical fields in the shared component template. That should hopefully be a few dozen only. But that could allow users to point any rule to any index, and all the other stuff would be metadata, which may or may not need a runtime field to be queryable.

@dgieselaar, is the thought then that solutions would need to explicitly allow-list/enable which stack rules they support and then we'd combine those component templates with the solution-specific component templates so the solution indices have all the necessary fields to support stack rules? Or would solutions just include _all stack rule component templates by default so there's no ambiguity between which solutions support which stack rules?


@banderror -- updated RFC LGTM! 👍 May want to have a section with regards to storing version in _meta as opposed to the index name as you detailed here, but other than that I think we might be good to go! 🙂
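
For reference, a small sketch of the _meta alternative mentioned above (the key and value are illustrative): the index template carries its own schema version, which the app can compare at startup to decide whether a rollover or migration is needed.

## Hypothetical schema version stored in the mappings rather than the index name
PUT /_index_template/alerts-security-alerts
{
  "index_patterns": [".alerts-security.alerts-*"],
  "template": {
    "mappings": {
      "_meta": { "schema_version": 2 }
    }
  }
}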

@gmmorris
Contributor

gmmorris commented May 6, 2021

Fwiw, consumer is currently not Observability for rule types, but APM/Uptime etc. My suggestion would be to not tightly couple this to the alerting framework's interpretation of consumer.

@dgieselaar - Wouldn't diverging here make it far harder though? RBAC is already a complicated mechanism, if we start diverging on this (using something other than FeatureID in consumers/producers) we're adding another moving part to this mechanism.

I'm not necessarily objecting here, but flagging that this would come at a cost to maintainability/reliability, and we should step with caution.

cc @ymao1

@jasonrhodes
Member

Cool, I understand how users can control their queries to scope which alerts are returned in both of those cases. I'm less clear on the case @MikePaquette mentioned as one of the deal breakers above:

So any RAC indexing scheme that requires analysts to learn/use these filters for scoping rules/alerts/cases would be a deal-breaker.

A visualization built on the alerts index (or an embedded Lens viz in a solution UI) isn't going to be auto-scoped to alerts in the current space. I just want to confirm that we don't expect that having the space ID in the index names is going to solve this either, as far as I understand.

@tsg
Contributor

tsg commented May 28, 2021

A visualization built on the alerts index (or an embedded Lens viz in a solution UI) isn't going to be auto-scoped to alerts in the current space. I just want to confirm that we don't expect that having the space ID in the index names is going to solve this either, as far as I understand.

If the user has wide access to the alerting indices and they use Discover or Visualize, there is a manual step in both cases: either selecting the right index to query (easier), or using the correct filter (a bit harder, but manageable). This filter needs to be added any time the user does something with alerts as data (ML jobs on top, rules on top of rules). So the overhead adds up a bit, but I'm not sure how big of a problem it will be in practice.

For embedded Lens, we can easily pass the correct index glob to Lens, so it works out of the box. I think we should also be able to pass the filter to Lens, so except for a bit more work in our code, I think this is the same.

So far, this is adding the filters for convenience, but it's not secure in any way (the user can still access all alerts via Discover).

The more common use case, I think, is that the Administrator will define ES roles so that the Analyst can only see the alerts from a particular space. This is where the difference is, IMO:

If we separate the alerts by spaces in different indices, it's very easy for the Administrator to configure the roles correctly.

If we don't separate the alerts, it's no longer possible to do this at all at the Basic or Gold license levels. It is possible in Platinum+ via DLS, but the configuration is also more complex.
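
To make the licensing difference concrete, a hedged sketch of the two role configurations (index names and the space field are illustrative):

## Per-space indices: a plain index privilege is enough (works at Basic/Gold)
POST /_security/role/space1_alerts_reader
{
  "indices": [
    { "names": [".alerts-security.alerts-space1*"], "privileges": ["read"] }
  ]
}

## Shared index without the space in the name: the same isolation needs
## document level security (Platinum+), filtering on a space field
POST /_security/role/space1_alerts_reader_dls
{
  "indices": [
    {
      "names": [".alerts-security.alerts-default*"],
      "privileges": ["read"],
      "query": { "term": { "kibana.space_ids": "space1" } }
    }
  ]
}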

I'm worried in particular about the following scenario:

  1. A user creates a Rule in a space, notices that the Rule only shows up in that space, so assumes it's the same for the resulting alerts.
  2. They want to use dashboards on alerts, so they create an ES role to open up access to .alerts-*
  3. The user notices that anyone from any space can easily access alerts from other spaces, so they open a bug report.
  4. We tell them that it works as intended and that the only way to separate alerts is to upgrade to Platinum
  5. If they are Basic or a Gold customer, the user feels like we're trying to squeeze an upgrade out of them

If we would separate the alerts into indices by space, the admin would be able to define the right permissions in step 2.

I agree that on a technical level we'd prefer to not have the space id in the index name, but I want to make sure that everyone understands the consequences here.

@tsg
Contributor

tsg commented May 28, 2021

The benefit of not including the space ID in the index-name is that this will work when we support sharing alerts in multiple spaces

This is the part where I'm not fully in agreement. I see the index name addition as an extra convenience for users that access data directly. It can and should be totally ignored by the Kibana handlers code (by querying .alerts-*). So it shouldn't be able to break any current or future feature, because our code will ignore it.

The shared alerts feature, when we add it, will only rely on the field in the documents, not on the index name. The space id that we put in the index name can arbitrarily be one of the spaces, or no.space.

@kobelb
Contributor

kobelb commented May 28, 2021

We tell them that it works as intended and that the only way to separate alerts is to upgrade to Platinum

With the proposal I made previously, it wouldn't be the only way to separate alerts. They could choose to use the "Alerting namespace" feature to separate alerts into different indices. This comes with the risk of having so many "Alerting namespaces" that they have too many small indices and have an over-sharding issue. If that's the case, larger Alerting indices with DLS comes with an advantage.

If they are Basic or a Gold customer, the user feels like we're trying to squeeze an upgrade out of them

At a certain level of scale, our users end up needing a Platinum license. I don't think that's a bad thing. DLS was created to address this specific use-case where we want users to have access to a subset of documents in an index.

This is the part that I'm not fully in agreement. I see the index name addition as an extra convenience for users that access data directly.

If users rely on the index-name, then it would break their Dashboards and Visualizations as soon as we allow alerts to be shared in multiple spaces. We will put ourselves in a corner here where we need a breaking change to allow alerts to be shared in multiple spaces. I would like for us to have the flexibility of allowing alerts to be shared in multiple spaces within the 8.x timeframe, as we have many users who would like this.

@tsg
Contributor

tsg commented May 28, 2021

Thanks for the answers @kobelb.

With the proposal I made previously, it wouldn't be the only way to separate alerts. They could choose to use the "Alerting namespace" feature to separate alerts into different indices. This comes with the risk of having so many "Alerting namespaces" that they have too many small indices and have an over-sharding issue. If that's the case, larger Alerting indices with DLS comes with an advantage.

This does address my main concern, so I'd be generally happy with it, but it sounds like we're pushing the problem of maintaining the indexing strategy to our users instead of having it defined by us.

If users rely on the index-name, then it would break their Dashboards and Visualizations as soon as we allow alerts to be shared in multiple spaces. We will put ourselves in a corner here where we need a breaking change to allow alerts to be shared in multiple spaces. I would like for us to have the flexibility of allowing alerts to be shared in multiple spaces within the 8.x timeframe, as we have many users who would like this.

Ok, I can see the risk here, but if we allow users to specify their own arbitrary namespace, they will potentially hit the same problem, except that we won't be able to help with mitigation because we can't know what strategy they adopted.

Also, we do have space-ids in the GA-ed security solution already. So we can say that we're trading a breaking change in 8.x that we might need, for a breaking change in 7.x now.

@kobelb
Contributor

kobelb commented May 28, 2021

This does address my main concern, so I'd be generally happy with it, but it sounds like we're pushing the problem of maintaining the indexing strategy to our users instead of having it defined by us.

It's true... And I think that's the major flaw with my proposal. However, given the fact that we can't predefine an indexing strategy that will work for all users because it's really Alerting usage dependent, I think it's the best option we have.

Ok, I can see the risk here, but if we allow users to specify their own arbitrary namespace, they will potentially hit the same problem, except that we won't be able to help with mitigation because we can't know what strategy they adopted.

When you say "the same problem", are you referring to users switching their namespacing strategy and then having to go update all of their Dashboards and Visualizations to use the new namespacing strategy? If so, I do agree that it'd be a frustrating user experience, and I think we should recommend that users only create visualizations/dashboards against .alerts-* to minimize the impact; however, it wouldn't be a breaking change that Elastic is forcing on the users. It's up to them to control their namespacing strategy.

Also, we do have space-ids in the GA-ed security solution already. So we can say that we're trading a breaking change in 8.x that we might need, for a breaking change in 7.x now.

We have every intention of allowing alerts to be shared in multiple spaces in the 8.x timeframe. We have the opportunity of preparing for this future in the 8.0 release. Otherwise, we're going to have to wait until 9.x to allow alerts to be shared in multiple spaces.

@cyrille-leclerc
Contributor

Cyrille: It's very important for the Observability alerting use case that alerts are data like any other observability signal (logs, metrics, traces, synthetic tests...), and that alerts can be used through Discover, through processing pipelines, through ML...

Brandon: Do you mind elaborating on this, @cyrille-leclerc? Do you foresee a majority of Observability users needing to interact with these alerts using Discover? I've heard ML discussed as well, but I wasn't aware of any immediate plans to build ML jobs on the alerts themselves. Also, this is the first time that I've seen processing pipelines come up as well.

It's a common pattern to process alerts in pipelines to do the following:

  • Enrichment to add context to alerts
  • Correlation of alerts with other alerts or other events
  • Noise Reduction

Another pattern is to process alerts in data pipelines to mark a rule as flapping when it causes too many alert state changes per time interval. Marking a rule as "flapping" causes a change in the notification strategy (e.g. only notify once every 30 mins).
Another pattern is to post process alerts to report on Mean Time To Resolution.

For all the use cases described above, we anticipate post-processing alerts in pipelines that are very similar to the pipelines used on our "primal signals": logs, metrics, traces, or security events.
We see alerts as "signal", maybe "secondary signals" but still signals.
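
For the enrichment case specifically, a minimal illustration (the pipeline and field names are hypothetical, not an agreed design): an ingest pipeline that adds context to alert documents on their way in, the same way pipelines are used for logs or metrics.

PUT /_ingest/pipeline/alerts-enrichment
{
  "description": "Add context to alerts before indexing",
  "processors": [
    { "set": { "field": "labels.environment", "value": "production" } },
    { "rename": { "field": "host.hostname", "target_field": "host.name", "ignore_missing": true } }
  ]
}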

Does it make sense, @kobelb?

@joshdover joshdover added Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels May 31, 2021
@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@botelastic botelastic bot removed the needs-team Issues missing a team label label May 31, 2021
@banderror
Contributor Author

I did a small test of the two approaches that we're discussing here - index name with spaceId in it vs without - as it was not clear to me how these 2 approaches would affect Kibana visualizations and dashboards. Let me share what I got.

These are the indices and simplistic documents I used for testing (Kibana Dev Tools syntax):

## Approach 1: spaceId in the index name + as a field in the document.
## Document level security is possible, but not required.
## Users with Basic/Gold licenses can set up RBAC for Dashboards using index level privileges.

PUT /.alerts-with.space-space1/_doc/1
{ "spaceId": "space1", "value": 10 }
PUT /.alerts-with.space-space1/_doc/2
{ "spaceId": "space1", "value": 20 }

PUT /.alerts-with.space-space2/_doc/3
{ "spaceId": "space2", "value": 30 }
PUT /.alerts-with.space-space2/_doc/4
{ "spaceId": "space2", "value": 40 }

## Approach 2: no spaceId in the index name, it's only available as a field.
## Document level security is required for RBAC.
## Users with Basic/Gold licenses can't set up RBAC for Dashboards.

PUT /.alerts-without.space-default/_doc/1
{ "spaceId": "space1", "value": 10 }
PUT /.alerts-without.space-default/_doc/2
{ "spaceId": "space1", "value": 20 }
PUT /.alerts-without.space-default/_doc/3
{ "spaceId": "space2", "value": 30 }
PUT /.alerts-without.space-default/_doc/4
{ "spaceId": "space2", "value": 40 }

I also created two Kibana users: the 1st one has access to space1, the 2nd one to space2 (each has access to both types of indices: .alerts-with.space* and .alerts-without.space*). These are their role settings in the UI:

User 1 having access to space1:

User 2 having access to space2:

I added two index patterns to be able to query those indices in visualizations:

Finally, I created two visualizations (in the form of tables) - a table per index pattern, and combined them into a dashboard.

This is how this dashboard looks if you log in as a superuser (you have access to all of the above indices):

If you log in as User 1, you will see only documents from space1 in the dashboard:

Same for User 2:

Notice that the visualization (table) itself does not specify a space id, it uses an index pattern which doesn't include space id - in both approaches that we are comparing:

@jasonrhodes
Member

@banderror nice! So both of these cases appear to do what we want them to do, it's just that one requires Platinum-only document-level security, and the other is achievable just with index name matching permissions, is that accurate?

One thing I have trouble keeping in mind is how these relate to the idea of Kibana RBAC, i.e. customers who only use feature controls to set up a role. So if they give a user access to a feature called "alerts" (or whatever it would be called), would this kind of space-awareness exist for them due to the fact that the internal user will only read documents from the current space somehow? I'm not clear on how this works, but it seemed to be part of earlier conversations, and I want to be super clear about how those things relate to each other. Thanks!

@kobelb
Contributor

kobelb commented Jun 2, 2021

One thing I have trouble keeping in mind is how these relate to the idea of Kibana RBAC, i.e. customers who only use feature controls to set up a role. So if they give a user access to a feature called "alerts" (or whatever it would be called), would this kind of space-awareness exist for them due to the fact that the internal user will only read documents from the current space somehow?

Correct. When users access alerts using custom UIs and HTTP APIs in Kibana that are integrated with the Kibana RBAC model, the user will only be able to see the subset of the alerts that they have access to:

[Diagram: Hybrid Indices]

@banderror
Contributor Author

@banderror nice! So both of these cases appear to do what we want them to do, it's just that one requires Platinum-only document-level security, and the other is achievable just with index name matching permissions, is that accurate?

@jasonrhodes Yep, I think so.

More and more, I feel it would be great to stick to @kobelb's suggestion (no space id in the name + document-level security). It would simplify index management - no bootstrapping of indices per space on the fly would be needed, e.g. in rule executors or route handlers or any other "lazy bootstrapping" logic. I might be missing/forgetting something, but the only blocker (maybe not) is being able to isolate alerts (at least security detection alerts) from different spaces in custom visualizations and dashboards for users with a Basic license. Basically, like @MikePaquette said in #98912 (comment), spaces should work out of the box for Basic-licensed users, without any "surprises".

This makes me think:

  • Is it possible to "downgrade" document-level security feature down to Basic license?
  • Is it possible to set document-level security via Kibana feature privileges model (I'm not familiar with it)?

I think this could be a win-win solution if this is possible.

Also, if we stick to DLS for space isolation, this probably would be a foundational thing for all the future custom indices in Kibana, e.g. based on the next implementation of saved objects etc.

What do you think? @kobelb @tsg

@kobelb
Contributor

kobelb commented Jun 9, 2021

Basically, like @MikePaquette said in #98912 (comment), spaces should work out of the box for Basic-licensed users, without any "surprises".

Spaces will work out of the box for a majority of basic-licensed users. The only exception is when users need to use the "escape hatch" to query the hidden .alerting-* indices. Users will already be taking a fair amount of risk by accessing these indices directly because their document structure is honestly likely to change. We should aim to minimize the breaking changes that we make to these indices but given the large variety of data that they're storing, it's likely to happen.

Is it possible to "downgrade" document-level security feature down to Basic license?

We can open up a licensing issue to discuss moving DLS to the Basic license. However, DLS is currently a Platinum level feature, and moving it to Basic is a rather large jump.

Is it possible to set document-level security via Kibana feature privileges model (I'm not familiar with it)?

Unfortunately, no. When users are querying the Elasticsearch documents directly, Kibana's RBAC model doesn't apply.

@marshallmain
Contributor

marshallmain commented Jun 9, 2021

For a little more context on why I'm jumping in late on this issue: while @banderror is on vacation I'm going to be attempting to push forward the consolidated rule registry data client implementation. As part of that effort, I'm working on enhancing the index template creation, versioning, and bootstrapping strategy. Whether or not space IDs can potentially be included in concrete index names has significant implications on when/how the indices are bootstrapped.

I agree with @jasonrhodes that a lot of this feels like trying to fit a too-small sheet to a bed. We have competing requirements that, to my knowledge, cannot be satisfied in the current version of Kibana with a Basic license.

  1. Alerts should be space aware and users should not be able to see alerts that are not in their space - even in Discover, Lens, etc
  2. Alert indices should be compact and we should keep the number of indices minimized, as much as possible

After reading through all the comments here, I've reached the opposite conclusion from @banderror - I think optionally including the space ID in the index name comes with significant advantages and few downsides.

To summarize what I'm proposing:

  1. Solutions can optionally include the space ID in index names to create space-specific indices
  2. RAC RBAC should automatically provide users access to the underlying indices for solutions and spaces that they have access to
    • This prevents users from accessing, in any way, alerts that they should not have access to
  3. Lazily create the concrete indices per space to avoid polluting the cluster with unused indices, but create the shared (per solution) index template on Kibana startup
  4. Possibly reduce the number of shards per index to mitigate the over sharding problem
  5. Possibly include the space name in kibana.space or the appropriate field so that when Kibana "data with saved objects constraints" arrives, existing alerts will work and users can simply convert their dashboards to query .alerts-* rather than the space specific index.

Advantages

  1. Basic license customers can use index-level permissions to restrict access to alerts by space
  2. Space IDs in index names can be automatically managed by solutions vs. the custom namespace proposal where users would have to manually define and maintain their namespaces.
  3. RAC RBAC UI can automatically give users permissions to the correct indices based on the spaces they have access to, providing a curated permissions experience for admins

Disadvantages

  1. More indices and shards
    • However, this can be mitigated by lazily creating the concrete indices per space. We can use a single template per solution and share it across all the spaces, only creating the actual indices when the first alert is written for a particular space. Also, alert indices by nature tend to be relatively small compared to the rest of the cluster so we may be able to use a smaller number of shards by default for alert indices to mitigate the over-sharding problem.
  2. Potentially poor support for multi-space alerts
    • I'm not sure if this disadvantage is really related to including the space ID in the index name. If we can decide on what field to populate with space names (e.g. kibana.space or similar) in order to support multi-space alerts, then we can populate that field now with a single space name regardless of whether the space ID is included in the index name. Multi-space support, when it arrives, should simply ignore whatever space ID is in the index name and rely on the field in the document.

Basic-licensed users of the security solution currently enjoy the capability to have separate workspaces for "dev/prod" and other organizational groupings. In each workspace (Kibana space) they have independent sets of rules/alerts/cases data. To remove this popular use case from the Basic tier user would be a breaking change for many users, and would likely be a significant impediment to future adoption of the solution. So any RAC indexing scheme that removes the ability of basic-tier users to have separate scoped workspaces for rules/alerts/cases would be a dealbreaker.

I agree with @MikePaquette here - taking features away from existing users in a new release would be an awful user experience. This hard requirement for the Security Solution means we must have either space ID or a custom namespace in the index name until we have some way in Kibana of controlling access to data indices by space (e.g. solutions might register a data index as "space-controlled" and Kibana then adds a "space filter" to every request that hits that index). I think using custom namespaces to implement this capability would add additional complexity and cognitive load to users who just want to replicate their existing environment since they would have to manually create and maintain a custom namespace for each Kibana space.

It's also not clear to me what the desired user workflows around multi-space alerts in security would be, or if there's a clear need for multi-space security alerts (where a single underlying document is shared). There are cases where we would want an alert to be created in multiple spaces so the rule would be multi-space in a sense, but until an analyst checks the alert we can't know whether the alert state (open/closed etc) should be shared across spaces. To support this, we would likely want to create copies of the alert in each space and build out a feature "close alert in all spaces" that analysts could choose in the UI as an alternative to "close alert (implicitly only in this space)". In this scenario, alert documents are still single-space. (brief slack thread on the topic)

@kobelb
Contributor

kobelb commented Jun 9, 2021

I've stated this before, but I don't think we should be making poor architectural decisions because of license levels. Including the space-id in the index-name has critical disadvantages that we should not overlook:

  1. If we include the consumer and the space-id in the index-name, we will be contributing to over sharding. If we have 1000 spaces with 10 consumers, we will end up with 10,000 indices. If the only answer we will have for our users is "use fewer spaces", this is a bad answer.
  2. If we include the space-id in the index-name, it will cause a confusing user experience or entirely block multi-space alerting rules and alerts.

@marshallmain
Contributor

marshallmain commented Jun 9, 2021

Removing features for existing customers is a non-starter IMO, and an architecture that requires this would entirely block the ability for the security solution to use the RAC infrastructure. Requiring existing customers to pay for a higher tier license to continue receiving the same feature set constitutes "removing features" in my mind. Without space IDs in the index name, the security solution would not be able to migrate its full feature set to RAC infrastructure until Kibana provides "data with saved objects constraints".

If we include the consumer and the space-id in the index-name, we will be contributing to over sharding. If we have 1000 spaces with 10 consumers, we will end up with 10,000 indices. If the only answer we will have for our users is "use fewer spaces", this is a bad answer.

This is why the concrete indices would be created as-needed. If users do create alerts in 1000 different spaces and they end up with indices for all of them, it's reasonable to think that they are separating the alerts intentionally and would want to be able to have strong separation between the spaces without jumping through extra hoops. If only a few of their spaces are actually used for alerts, then only a few indices are created. This is a tried-and-true strategy that the security solution has used to workaround the lack of "data with saved objects constraints" in Kibana.

If we include the space-id in the index-name, it will cause a confusing user experience or entirely block multi-space alerting rules and alerts.

I don't see how this blocks multi space alerting. The alerts that live in these indices with space IDs are and will always be single space alerts. When multi space alerting becomes available, we can create a new index without the space ID in the name and write multi-space alerts to that index instead. We can write single space alerts there as well, and tell users that dashboards that were pointing to .alerts-...-<space id> should now omit the space ID. If we're ok with the data structure of docs in these hidden indices changing and breaking dashboards, then a change to the index pattern should be fine as well.

Having "legacy" single space alerts in an index that has the space ID in the name should also be less confusing to users than either manually managing namespaces to match up with Kibana spaces or adding document level security.

In general I don't think we should be making immediate sacrifices to user experience in the hopes of being forwards-compatible with multi-space alerting, given that multi space alerting hasn't been fully designed yet.

@kobelb
Contributor

kobelb commented Jun 9, 2021

Removing features for existing customers is a non-starter IMO, and an architecture that requires this would entirely block the ability for the security solution to use the RAC infrastructure. Requiring existing customers to pay for a higher tier license to continue receiving the same feature set constitutes "removing features" in my mind. Without space IDs in the index name, the security solution would not be able to migrate its full feature set to RAC infrastructure until Kibana provides "data with saved objects constraints".

We remove features all the time (generally after a reasonable period with deprecation notices). The security solution can migrate a very large amount of its features to the RAC infrastructure even before Kibana provides "data with saved object constraints". The only features that I'm aware of that are blocked are using applications like Discover, Visualize, Lens with the alert-data using Kibana's RBAC model.

This is a tried-and-true strategy that the security solution has used to workaround the lack of "data with saved objects constraints" in Kibana.

Just because it works with the scale of security detections right now, doesn't mean it will continue to work. Additionally, we currently have some alerts for stack monitoring that are being created in every space.

I don't see how this blocks multi space alerting. The alerts that live in these indices with space IDs are and will always be single space alerts. When multi space alerting becomes available, we can create a new index without the space ID in the name and write multi-space alerts to that index instead.

This reinforces the argument I was making previously. Multi-space alerts are incompatible with alerts indices with space IDs in their names. We have a potential workaround where we write single space alerts to alert indices with a space ID in their name, but we have to stop doing that as soon as these alerts exist in multiple spaces. Whenever we would switch which index these alerts are being written to, we would break any features that were reading from the original single space alert indexes.

In general I don't think we should be making immediate sacrifices to user experience in the hopes of being forwards-compatible with multi-space alerting, given that multi space alerting hasn't been fully designed yet.

What immediate sacrifices are we making? My primary goal is to prevent us from generalizing a solution that the detection engine has implemented that puts us in a corner and prevents us from implementing multi-space alerting. We have a fairly good idea of how we want this to work, it's not some long-term, multiple years out project. We have users that want multi-space alerting as soon as possible.

@marshallmain
Contributor

It's important to note that in the proposal the space ID in the index name is optional.

What immediate sacrifices are we making?

Customers that use spaces to segment alerts and ensure that analysts in certain roles can only see the alerts they are supposed to see will suddenly lose the ability to create the proper permissions at the Basic level. Currently this can be achieved using index level permissions.

Multi-space alerts are incompatible with alerts indices with space IDs in their names. We have a potential workaround where we write single space alerts to alert indices with a space ID in their name, but we have to stop doing that as soon as these alerts exist in multiple spaces. Whenever we would switch which index these alerts are being written to, we would break any features that were reading from the original single space alert indexes.

We wouldn't need to switch the index that features are reading from - we can build our features to ignore the space ID in the index name and simply read from .alerts-...-*, filtering on the kibana.space field. The point of including the space ID in the index name is to provide users with a way to manage permissions and access to alerts at the Basic license level, not to have our features only query the space specific index. When multi space alerts arrive, the features will query the same index pattern and filter on the same field.

Users will have to update the index pattern they use for custom visualizations and dashboards, but that's an acceptable breakage with hidden indices.

Is there a specific operation that is not possible or logically does not make sense if the index name contains the space ID? The only one that comes to mind would be adding an additional space to a specific alert document's kibana.space field after alert creation, but that seems like an unlikely use case and something we could easily disallow on any alerts in space-specific indices (i.e. alerts created before multi-space alerts are released).

Additionally, we currently have some alerts for stack monitoring that are being created in every space.

This is why, in my proposal, including the space ID in the index name is an optional feature, determined by the rule writers within each solution. The security solution can continue to use the space ID, and solutions that don't need it can use a single index.

The only features that I'm aware of that are blocked are using applications like Discover, Visualize, Lens with the alert-data using Kibana's RBAC model.

The key feature that's blocked is the ability to prevent users in certain roles from being able to access alerts that they should not access at the Basic tier.

@kobelb
Contributor

kobelb commented Jun 9, 2021

@marshallmain if we don't include the space-id in the index-name, the only situations where we won't have space-based isolation and authorization is when end-users access the indices directly. Are we in agreement on this?

As far as I'm aware, this is isolated to a very few edge-case situations: Discover, Visualize, Lens, etc. I believe this will also be the case with "alerts on alerts", at least for the short-term. Are there other significant use cases that I'm missing?

@jasonrhodes
Member

Basic-licensed users of the security solution currently enjoy the capability to have separate workspaces for "dev/prod" and other organizational groupings. In each workspace (Kibana space) they have independent sets of rules/alerts/cases data. To remove this popular use case from the Basic tier user would be a breaking change for many users, and would likely be a significant impediment to future adoption of the solution.

This feels like the important comment to just make sure we've fully addressed. Security users with a Basic license currently use Kibana spaces to segment their alerts. From a solution perspective, it sounds like we're saying that will still be possible without using Space ID in the name, is that right? The particular magic that makes that work never seems to stick in my memory, but we should avoid a false dichotomy if this is possible in both solutions.

The security solution can migrate a very large amount of its features to the RAC infrastructure even before Kibana provides "data with saved object constraints". The only features that I'm aware of that are blocked are using applications like Discover, Visualize, Lens with the alert-data using Kibana's RBAC model.

I think I understand what you're saying here, but just to be clear, Security can migrate "a very large amount of its features", but the ones that they can't migrate will be dropped; they can't continue to support those features in the old model while also migrating the others. In other words, once they migrate to the new indices, any feature that's not migratable will be removed from the UI.

So the question is: what are the features that will be dropped? It sounds like it would not be anything in the Security UI, but rather, custom visualizations, dashboards, etc. that end users built that relied on Space ID being in the index name in order to properly segment their queries, is that correct? If we can clarify that officially, then I think we can pull @MikePaquette back in to see whether that's still a deal-breaker.

@jasonrhodes
Member

This conversation has moved into our architecture design document. I'll summarize the complete results of that discussion here ASAP and close this ticket.

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022