Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[2.x Backport] Optimized Privilege Evaluation #4898

Open
wants to merge 19 commits into
base: 2.x
Choose a base branch
from

Conversation

nibix
Copy link
Collaborator

@nibix nibix commented Nov 13, 2024

Description

This implements the optimized privilege evaluation as described in #3870 and backports the changes from #4380 to the 2.x branch.

Important: Before this can be release, the OpenSearch version in the LEGACY_HEADERS_UNNECESSARY_AS_OF property must be checked to be in sync with the actual release version. See review comment at #4898 (review)

This introduces de-normalized data structures that are optimized for the checks that need to be done during privilege evaluation. Additionally, certain objects (like DLS queries) are prepared ahead of time, as early as possible in order to minimize the overhead during actual privilege evaluation.

This is a big change set - in order to facilitate the review, I have split it into three major commits:

  • Optimized action privilege evaluation
  • Optimized DLS/FLS/FM privilege evaluation
  • Removal of unused code

The code is extensively commented - I hope that will help during review.

  • Category: Enhancement
  • Why these changes are required?

Performance tests indicate that the OpenSearch security layer adds a noticeable overhead to the indexing throughput of an OpenSearch cluster. The overhead may vary depending on the number of indices, the use of aliases, the number of roles and the size of the user object. The goal of these changes is to improve privilege evaluation performance and to make it less dependent on the number of indices, etc.

  • What is the old behavior before changes and new behavior after changes?

No significant behavioral changes in the "happy case", when privileges are present.

The undocumented config option config.dynamic.multi_rolespan_enabled is no longer evaluated. The code now behaves like it is always set to true - that is the former default. See #4495 for details.

Some slight changes are present in error cases:

  • More detailed error messages for missing privileges, showing a index/action matrix of missing privileges
  • Errors in the role configuration might be reported (as error log messages) more early, directly after the configuration was applied
  • The DLS/FLS implementation now defaults to a "deny by default" implementation. This is not relevant for normal cases. This will be only relevant if index requests pass through privileges evaluator even though there are no roles which grant privileges to the requested indices. Note: This would only happen in case of a bug in the code. In the previous versions, the DLS/FLS implementation would grant full access to the indices. Now, the DLS/FLS implementation acts as a second barrier, denying access to the indices.

Issues Resolved

This is a backport from #4380

Testing

Check List

  • New functionality includes testing
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

nibix added 18 commits November 13, 2024 04:45
Signed-off-by: Nils Bandener <[email protected]>
Signed-off-by: Nils Bandener <[email protected]>
Signed-off-by: Nils Bandener <[email protected]>
Signed-off-by: Nils Bandener <[email protected]>
Signed-off-by: Nils Bandener <[email protected]>
Signed-off-by: Nils Bandener <[email protected]>
Signed-off-by: Nils Bandener <[email protected]>
Signed-off-by: Nils Bandener <[email protected]>
Signed-off-by: Nils Bandener <[email protected]>
Signed-off-by: Nils Bandener <[email protected]>
Signed-off-by: Nils Bandener <[email protected]>
…d_privileges.include_indices

See discussion in opensearch-project#4380 (comment)

Signed-off-by: Nils Bandener <[email protected]>
Signed-off-by: Nils Bandener <[email protected]>
@nibix nibix mentioned this pull request Nov 13, 2024
3 tasks
@nibix nibix changed the title Optimized Privilege Evaluation [2.x Backport] Optimized Privilege Evaluation Nov 13, 2024
* Defines the first OpenSearch version which does not need the legacy headers
* TODO this needs to be adapted
*/
static final Version LEGACY_HEADERS_UNNECESSARY_AS_OF = Version.V_2_19_0;
Copy link
Collaborator Author

@nibix nibix Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: Before this PR is released, it must be made sure that the attribute LEGACY_HEADERS_UNNECESSARY_AS_OF refers to the OpenSearch version this functionality is released in. Otherwise, DLS/FLS won't properly work in mixed clusters with older versions.

This also needs to be forward-ported to main then.

Copy link

codecov bot commented Nov 13, 2024

Codecov Report

Attention: Patch coverage is 73.42760% with 545 lines in your changes missing coverage. Please review.

Project coverage is 66.71%. Comparing base (7efe29b) to head (3c88111).
Report is 3 commits behind head on 2.x.

Files with missing lines Patch % Lines
...ecurity/privileges/dlsfls/DlsFlsLegacyHeaders.java 1.42% 68 Missing and 1 partial ⚠️
...search/security/configuration/DlsFlsValveImpl.java 61.04% 42 Missing and 25 partials ⚠️
...privileges/dlsfls/AbstractRuleBasedPrivileges.java 82.77% 33 Missing and 18 partials ⚠️
...ensearch/security/privileges/ActionPrivileges.java 87.72% 30 Missing and 18 partials ⚠️
...security/configuration/DlsFlsFilterLeafReader.java 43.90% 42 Missing and 4 partials ⚠️
...earch/security/privileges/dlsfls/FieldMasking.java 72.56% 40 Missing and 5 partials ⚠️
...g/opensearch/security/privileges/IndexPattern.java 54.34% 38 Missing and 4 partials ⚠️
...earch/security/privileges/PrivilegesEvaluator.java 69.89% 17 Missing and 11 partials ⚠️
...ch/security/privileges/dlsfls/FieldPrivileges.java 81.34% 16 Missing and 9 partials ⚠️
...urity/privileges/dlsfls/FlsStoredFieldVisitor.java 37.83% 20 Missing and 3 partials ⚠️
... and 24 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##              2.x    #4898      +/-   ##
==========================================
+ Coverage   63.87%   66.71%   +2.84%     
==========================================
  Files         330      342      +12     
  Lines       23136    23157      +21     
  Branches     3747     3659      -88     
==========================================
+ Hits        14777    15450     +673     
+ Misses       6527     5863     -664     
- Partials     1832     1844      +12     
Files with missing lines Coverage Δ
...ty/configuration/ConfigurationLoaderSecurity7.java 67.64% <100.00%> (+0.23%) ⬆️
...ecurity/configuration/ConfigurationRepository.java 74.42% <100.00%> (ø)
...urity/configuration/PrivilegesInterceptorImpl.java 59.77% <100.00%> (+2.45%) ⬆️
...va/org/opensearch/security/configuration/Salt.java 100.00% <ø> (ø)
...rity/configuration/SystemIndexSearcherWrapper.java 91.52% <100.00%> (+0.14%) ⬆️
...nsearch/security/dlic/rest/api/RolesApiAction.java 89.58% <100.00%> (-2.58%) ⬇️
...org/opensearch/security/filter/SecurityFilter.java 65.87% <100.00%> (-0.48%) ⬇️
...ch/security/privileges/PitPrivilegesEvaluator.java 96.15% <100.00%> (-0.15%) ⬇️
...urity/privileges/RestLayerPrivilegesEvaluator.java 93.75% <100.00%> (-0.37%) ⬇️
.../security/privileges/dlsfls/DlsFlsBaseContext.java 100.00% <100.00%> (ø)
... and 40 more

... and 4 files with indirect coverage changes

@krishna-ggk
Copy link
Contributor

krishna-ggk commented Nov 25, 2024

Thank you for the significant contribution - the improvements look great 🙌

Would you be open to incorporating a feature flag here as well? In a previous major improvement, we found that feature flagging would have allowed us to roll it out more gradually.

@nibix
Copy link
Collaborator Author

nibix commented Nov 25, 2024

@krishna-ggk

Would you be open to incorporating a feature flag here as well? In a previous major improvement, we found that feature flagging would have allowed us to roll it out more gradually.

From a technical point of view, of course it will be possible. I just have a few detail questions:

  • Are you thinking about a flag that can be changed at runtime, without needing to restart the nodes, or a flag that needs the nodes to be restarted?
  • How shall the rollout strategy be? Ship it disabled and let users enable it optionally? Or ship it enabled and let users disable it just in case?
  • Is the plan then to keep it till the last version of OpenSearch 2.x and remove the feature flag for 3.0?

Still, I guess we have to ask ourselves the question whether it is feasible and economical. Unfortunately, adding a feature flag will require a quite a bit of effort and will significantly increase maintenance effort due to a lot of code duplication. From the top of my head, we need to do this:

  • Restore the currently deleted class SecurityRoles from https://github.com/opensearch-project/security/blob/2.x/src/main/java/org/opensearch/security/securityconf/ConfigModelV7.java - this will be roughly 1000 additional LOC
  • Duplicate significant parts of PrivilegesEvaluator as it has been changed fundamentally. That will be roughly 500 additional LOC.
  • Duplicate PitPrivilegesEvaluator (96 LOC), SystemIndexAccessEvaluator (320 LOC), TermsAggregationEvaluator (100 LOC)
  • Duplicate significant parts of RestLayerPrivilegesEvaluator (100 LOC)
  • Duplicate most of the DLS/FLS code (2000+ LOC)
  • Revise the backwards compat concept for DLS/FLS (at the moment it assumes that the change of implementation happens with a specific OpenSearch release version)
  • Duplicate all integration tests related to privileges and DLS/FLS and run it both with the feature flag enabled and disabled
  • Keep in mind for future features and bug fixes related to privilege evaluation and DLS/FLS, two code parts need to be considered

Would maybe releasing a beta version an alternative solution? Especially if the feature flag should be disabled upfront, there won't be so much significant difference to a beta version. In both cases, users would need to actively decide that they want to use that feature and that they are willing to accept any potential risks.

@krishna-ggk
Copy link
Contributor

@krishna-ggk

Would you be open to incorporating a feature flag here as well? In a previous major improvement, we found that feature flagging would have allowed us to roll it out more gradually.

Generally, of course. I just have a few detail questions:

Thanks @nibix .

  • Are you thinking about a flag that can be changed at runtime, without needing to restart the nodes, or a flag that needs the nodes to be restarted?

Ideally, a runtime flag would be preferable, but a restart-required flag could also work as a simple mitigation if issues arises.

  • How shall the rollout strategy be? Ship it disabled and let users enable it optionally? Or ship it enabled and let users disable it just in case?

Given this is a performance improvement rather than a new feature where we're gathering user feedback, I'd recommend shipping it enabled by default. That way, users can easily disable it if any issues come up, without needing to opt-in.

  • Is the plan then to keep it till the last version of OpenSearch 2.x and remove the feature flag for 3.0?

Since there are no compatibility concerns, we could maintain the feature flag for 2-4 minor versions and then remove it. I'm open to other perspectives on the right timeline as well.

JFTR: Unfortunately, adding a feature flag will require a quite a bit of effort and will significantly increase maintenance effort due to a lot of code duplication. From the top of my head, we need to do this:
..

  • Keep in mind for future features and bug fixes related to privilege evaluation and DLS/FLS, two code parts need to be considered

Thanks. These are very valid points and tradeoff considerations. I understand the additional effort required to implement and maintain a feature flag along with maintenance burden due to increased code duplication. However, given the core nature of the privilege evaluation logic being changed, having that extra safeguard in the short-to-medium term could be quite valuable.

Perhaps a middle ground would be to include the feature flag for 2-3 release cycles, with the default set to "on" starting in 2.19. That way, we get the benefits of the flag without an overly long-lived maintenance burden. What are your thoughts on that approach?

@nibix
Copy link
Collaborator Author

nibix commented Nov 27, 2024

One complication of a feature flag would be the handling of DLS/FLS backwards compatibility - in mixed clusters.

This is a bit tricky. There are several ways to tackle this thinkable. Each with different issues. So, I'd like to hear your feedback @cwperks @DarshitChanpura @krishna-ggk

We need to consider backwards compatibility on mixed clusters because nodes on OpenSearch versions without this change rely on the presence of thread context headers to determine whether DLS/FLS restrictions need to be applied or not:

private void setDlsHeaders(EvaluatedDlsFlsConfig dlsFls, ActionRequest request) {
if (!dlsFls.getDlsQueriesByIndex().isEmpty()) {
Map<String, Set<String>> dlsQueries = dlsFls.getDlsQueriesByIndex();
if (request instanceof ClusterSearchShardsRequest && HeaderHelper.isTrustedClusterRequest(threadContext)) {
threadContext.addResponseHeader(
ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER,
Base64Helper.serializeObject((Serializable) dlsQueries)
);
if (log.isDebugEnabled()) {
log.debug("added response header for DLS info: {}", dlsQueries);
}
} else {
if (threadContext.getHeader(ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER) != null) {
Object deserializedDlsQueries = Base64Helper.deserializeObject(
threadContext.getHeader(ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER),
threadContext.getTransient(ConfigConstants.USE_JDK_SERIALIZATION)
);
if (!dlsQueries.equals(deserializedDlsQueries)) {
throw new OpenSearchSecurityException(
ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER + " does not match (SG 900D)"
);
}
} else {
threadContext.putHeader(
ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER,
Base64Helper.serializeObject((Serializable) dlsQueries)
);
if (log.isDebugEnabled()) {
log.debug("attach DLS info: {}", dlsQueries);
}
}
}
}
}

The new implementation does not need these headers any more. But it needs to know when in talks to nodes with the old implementation. In that case the headers need to be sent. If they would not be sent, the search results from that nodes would contain data the respective user is not authorized to see.

At the moment, the new implementation uses the OpenSearch version supplied in the connection to determine whether a node needs the headers or not:

/**
* Writes the prepared DLS/FLS headers into the given map IF this method deems that it is necessary.
* To be called when a transport message is sent to another node, i.e. in TransportInterceptor.interceptSender().
*/
public void performHeaderDecoration(Transport.Connection connection, TransportRequest request, Map<String, String> headerMap) {
if (connection.getVersion().onOrAfter(LEGACY_HEADERS_UNNECESSARY_AS_OF)) {
// Target node is new enough -> no headers to be applied
return;
}
if (request instanceof ActionRequest) {
// The legacy implementation will create the information by itself in DlsFlsValve if an ActionRequest is received
// Thus, if we have an ActionRequest, we do not need to get active either
return;
}
if (dlsHeader != null) {
headerMap.put(ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER, dlsHeader);
}
if (flsHeader != null) {
headerMap.put(ConfigConstants.OPENDISTRO_SECURITY_FLS_FIELDS_HEADER, flsHeader);
}
if (fmHeader != null) {
headerMap.put(ConfigConstants.OPENDISTRO_SECURITY_MASKED_FIELD_HEADER, fmHeader);
}
}

If we introduce a feature flag, we can no longer use the condition connection.getVersion().onOrAfter(LEGACY_HEADERS_UNNECESSARY_AS_OF) to determine whether we need to send headers because it not only depends on the version but also on the value of the flag.

Option 1: Feature flag in opensearch.yml

This is a feature flag which can only be changed with a rolling restart. Thus, there might be mixed cluster conditions with some nodes having the feature flag on and some nodes having the feature flag off.

If we use such a feature flag, we'd need to somehow know the value of the feature flag on the remote side of the connection. At the moment, I am not aware of such a mechanism. Is there one which I am not aware of? @cwperks @DarshitChanpura @krishna-ggk

I could imagine a kind of "capabilities" mechanism stored in the cluster state which tells me the capabilities of each node and also whether all the nodes in the cluster have that capability. But that would be a complete new feature needing implementation.

Option 2: Feature flag in config.yml

This is a feature flag which can be changed during runtime. Changes to config.yml are broadcasted to the nodes - thus, the implementation can be switched semi-instantly. The issue is here still however the "semi". The config update process is still an async one. There will be a short time window where we also have a mixed cluster with the issues described for option 1.

@krishna-ggk
Copy link
Contributor

Agreed @nibix, backward compatibility is valid point and infact can remain if there is cross-cluster setup with mixed versions.

As one thought, could we continue to populate headers irrespective for subsequent versions until we remove the old code at which point we can add back the LEGACY_HEADERS_UNNECESSARY_AS_OF for that version? I understand this may remove some performance benefit, but wondering if it is significant enough.

@nibix
Copy link
Collaborator Author

nibix commented Dec 2, 2024

@krishna-ggk

As one thought, could we continue to populate headers irrespective for subsequent versions until we remove the old code at which point we can add back the LEGACY_HEADERS_UNNECESSARY_AS_OF for that version? I understand this may remove some performance benefit, but wondering if it is significant enough.

Yes, that should be possible. You are right that this means that users with DLS, FLS or field masking active won't have significant performance improvements in that phase.

@cwperks
Copy link
Member

cwperks commented Dec 3, 2024

I had a PR that I previously closed for Fls/Dls/Field Masking improvements: #4706

Would it make sense to re-open that to get the optimization put forth in that PR?

I'd also like to try to get #4665 into the 2.x branch as well and wondering if I should re-open that PR and target it to 2.x.

@nibix
Copy link
Collaborator Author

nibix commented Dec 4, 2024

@cwperks

I had a PR that I previously closed for Fls/Dls/Field Masking improvements: #4706

Would it make sense to re-open that to get the optimization put forth in that PR?

Well - if the goal is managing risks caused by code changes, then I am not sure if the changes from #4706 would come with a lower risk than the risks implied by this PR. IMHO, FLS and field masking do have a less than optimal test coverage. This makes any change in this regard risky. Maybe, another strategy of managing the risk introduced by this PR is just to invest in new tests to increase test coverage especially for FLS and field masking?

Concretely about the strategy from #4706: I am worried whether the index information obtained by the request index resolution can be sufficiently trusted in order to be usable for building the DLS/FLS/FM maps. There are are already now examples where this information cannot be used:

Potentially, there are more similar cases we do not know about, either in OpenSearch core or in plugins.

The new DLS/FLS/FM implementation does not suffer from this problem, as it moves the access controls down to the actual places where the access happens - at this place it will be always 100% clear which index is accessed.

I still think the two most promising ways to tackle unknown risks are:

  • Intensified testing. Analyze potential sources of issues and try to cover these by tests. Ideally, automated, otherwise manual. Of course, this can only cover "unknowns" which we can identify and thus make "knowns". "Unknowns" which remain "unknowns" are not covered here :)
  • As described in [2.x Backport] Optimized Privilege Evaluation #4898 (comment): Feature flag with a kind of capability information stored in the cluster state and thus made available via a Connection object. That would be a core change - which however also other future works would benefit from, IMHO.

Both however require a bit of work.

@cwperks
Copy link
Member

cwperks commented Dec 4, 2024

Maybe, another strategy of managing the risk introduced by this PR is just to invest in new tests to increase test coverage especially for FLS and field masking?

Increased test coverage with more scenarios like Scroll, PIT and CCS with FLS/DLS/FM scenarios would be a good thing to have. The PR I raised did add a scenario for scroll, but not for PIT or CCS

@krishna-ggk
Copy link
Contributor

Thanks folks. I'm still responding only to partial aspects mentioned as I'm currently traveling. (I'd love to collaborate more on this with you all more deeply but stuck this week due to travel)

I still think the two most promising ways to tackle unknown risks are:

  • Intensified testing. Analyze potential sources of issues and try to cover these by tests. Ideally, automated, otherwise manual. Of course, this can only cover "unknowns" which we can identify and thus make "knowns". "Unknowns" which remain "unknowns" are not covered here :)
  • As described in [2.x Backport] Optimized Privilege Evaluation #4898 (comment): Feature flag with a kind of capability information stored in the cluster state and thus made available via a Connection object. That would be a core change - which however also other future works would benefit from, IMHO.

Both however require a bit of work.

Agreed on the overall direction of addressing the risk. However I still think a feature flag to turn off is super helpful safeguard since the area we are improving is core authz capability that has security implications to users in case of any regression.

I'm wondering if we can perhaps take the approach of validating correctness in couple minor versions before in addition to increased coverage and then target performance gain for subsequent version where we cleanup old code? I'm open to other approaches too, but just falling back to this as a potential approach to reduce risk.

@nibix
Copy link
Collaborator Author

nibix commented Dec 6, 2024

@krishna-ggk

I'm wondering if we can perhaps take the approach of validating correctness in couple minor versions before in addition to increased coverage and then target performance gain for subsequent version where we cleanup old code?

From my point of view, that should be possible. If users want to have immediate performance gains, one could even add a second flag which turns off the headers. However users would need to take some precaution to change the flag only at the right state.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants