
Add cache for application privileges #55836

Merged (57 commits, Jun 29, 2020)

Conversation


@ywangd ywangd commented Apr 28, 2020

Add caching support for application privileges to reduce the number of round-trips to the security index when building application privilege descriptors.

A few key points of the changes:

  • Per the discussion in #54317 (Add cache for application privileges), the main cache is keyed by concrete application name and its values are sets of application privilege descriptors.
  • A secondary cache is also added to map a set of application name expressions (i.e. possibly containing wildcards) to the set of matching concrete application names. A minimal sketch of this two-layer lookup follows this list.
  • Due to the cache design, privilege retrieval in NativePrivilegeStore is changed to always fetch all privilege documents for a given application.
  • The change applies everywhere, including the "get privileges" and "has privileges" APIs and CompositeRolesStore (for authentication).
    • Initially I wanted to exclude CompositeRolesStore from the caching. But that would mean no code could be deleted from NativePrivilegeStore; we would basically have to add "retrieving by application name" on top of the existing query logic. For simplicity, I later decided to include it so that the query code could be largely simplified.
  • Added an API to clear the privilege cache, which is also used internally when adding/deleting privileges. These operations already clear the role cache; the privilege cache invalidation is added on top of that.
  • Added a security index state listener to clear the cache on state changes.
  • Docs added
  • YAML tests added
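
To make the two-layer design concrete, here is a minimal, purely illustrative sketch in plain Java. It uses ConcurrentHashMap as a stand-in for the actual Elasticsearch cache class, and all names are hypothetical rather than taken from NativePrivilegeStore:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative stand-in for the two caches described above; not the actual
// NativePrivilegeStore implementation.
class TwoLevelPrivilegeCache {
    // expression set (may contain wildcards) -> matching concrete application names
    private final Map<Set<String>, Set<String>> applicationNamesCache = new ConcurrentHashMap<>();
    // concrete application name -> its privilege descriptors (modelled as plain Strings here)
    private final Map<String, Set<String>> descriptorsCache = new ConcurrentHashMap<>();

    // First look up which concrete applications a request's expressions resolve to...
    Set<String> cachedConcreteNames(Set<String> applicationExpressions) {
        return applicationNamesCache.get(applicationExpressions);
    }

    // ...then fetch the descriptors per concrete application from the main cache.
    Set<String> cachedDescriptors(String concreteApplicationName) {
        return descriptorsCache.get(concreteApplicationName);
    }

    // On a miss, the store fetches all privilege documents for the applications
    // from the security index and then populates both layers.
    void cacheFetchResult(Set<String> expressions, Map<String, Set<String>> fetchedByApplication) {
        descriptorsCache.putAll(fetchedByApplication);
        applicationNamesCache.put(Set.copyOf(expressions), Set.copyOf(fetchedByApplication.keySet()));
    }
}
```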

Resolves: #54317

@ywangd ywangd added >enhancement :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC v8.0.0 v7.8.0 labels Apr 28, 2020
@ywangd ywangd requested a review from tvernum April 28, 2020 03:16
@elasticmachine (Collaborator)

Pinging @elastic/es-security (:Security/Authorization)

Comment on lines 170 to 172
// Avoid caching potential stale results.
// TODO: It is still possible that cache gets invalidated immediately after the if check
if (invalidationCounter == numInvalidation.get()) {
ywangd (Member Author):

This pattern is used in CompositeRolesStore. However, it is still possible that the cache gets invalidated immediately after the if check. The window for this to happen is much smaller, but it still exists in theory.

The other pattern is to use ListenableFuture. This would solve the stale entry problem here because the future itself is removed from the cache, so adding items to the removed future has no impact on the cache. However, this pattern could potentially have a deadlock issue: if the thread computing the future crashes, will all the other threads waiting for it get stuck? If so, I'd prefer the first pattern, since a stale entry (with a very low chance of occurring) is less harmful than a deadlock.
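
For reference, a minimal sketch of the invalidation-counter pattern being discussed (hypothetical names; the real CompositeRolesStore/NativePrivilegeStore code differs in detail):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Sketch only: load-then-cache guarded by an invalidation counter. The race
// mentioned above remains: an invalidation can still land right after the check.
class CounterGuardedCache<K, V> {
    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();
    private final AtomicLong numInvalidation = new AtomicLong();

    V getOrLoad(K key, Function<K, V> loader) {
        final V cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        // Remember the counter before the (potentially slow) load.
        final long invalidationCounter = numInvalidation.get();
        final V loaded = loader.apply(key);
        // Avoid caching potentially stale results: only cache if no invalidation
        // happened while we were loading.
        if (invalidationCounter == numInvalidation.get()) {
            cache.put(key, loaded);
        }
        return loaded;
    }

    void invalidateAll() {
        numInvalidation.incrementAndGet();
        cache.clear();
    }
}
```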

Contributor:

I think a ReadWriteLock would solve the problem. Treat the invalidator as the writer and the cache population as the reader. The invalidator would need exclusive access, but we could support multiple populators.

I'll need to think about it again when it's not midnight, but I think it's reasonable (if used alongside the invalidationCounter so the lock window is small).

Comment on lines 241 to 242
applicationNamesCache.invalidateAll();
final Set<String> uniqueNames = Set.copyOf(updatedApplicationNames);
ywangd (Member Author):

For invalidation, the application names cache is always completely invalidated, since it would take some effort to identify only the applicable entries. We could do that, but the gain may not be much.

Contributor:

Remind me, what's Kibana's typical usage pattern for querying? Does it use wildcards for the application name?

If so, I think fully invalidating the name cache whenever a single application is invalidated (even one that might not exist) would effectively invalidate the whole cache, because a query for kibana* would end up not using any cache at all.

ywangd (Member Author):

Kibana always sends a single concrete application name, kibana-.kibana, so it should be fine for Kibana's typical usage.

But let me know if you think it is still necessary. The logic would look something like the following:

foreach cache key (type is Set<String>)
   foreach key member
     foreach application
        if (key member == application) or (key member is a wildcard and matches application)
          invalidate the cache key

Java code would be

StreamSupport.stream(applicationNamesCache.keys().spliterator(), false)
    .filter(keys -> keys.contains("*")
        || Sets.intersection(keys, uniqueNames).isEmpty() == false
        || keys.stream().filter(k -> k.endsWith("*")).anyMatch(
            k -> uniqueNames.stream().anyMatch(n -> n.regionMatches(false, 0, k, 0, k.length()-1))))
    .forEach(applicationNamesCache::invalidate);

Contributor:

Let's leave it - realistically we're talking about Kibana only, so clearing all applications isn't actually going to hurt anyone, and the only time we will clear the cache is on privilege update which happens when you install a new Kibana version.

listener.onResponse(Collections.emptySet());

} else {
final Tuple<Set<String>, Map<String, Set<ApplicationPrivilegeDescriptor>>> cacheStatus;
Contributor:

Warning bells go off for me when I see complex Tuples like this (though I'm guilty of using them as well).

I'd prefer we avoided it entirely, but if we really need it, can we assign the members to appropriately named local vars as soon as possible after the method returns?

ywangd (Member Author):

This complexity is due to an attempt to optimize the number of documents fetched from the index. It can be simplified if we always fetch everything when hitting the index is unavoidable (as you suggested below).

ywangd (Member Author):

Once we remove this optimization, the Tuple<Set<String>, Map<String, Set<ApplicationPrivilegeDescriptor>>> data structure is no longer necessary, so the complexity will definitely be reduced.

} else {
privilegesStore.invalidate(Arrays.asList(request.getApplicationNames()));
}
rolesStore.invalidateAll();
Contributor:

I'm not sure about this. It seems like this API ends up doing something other than what it was supposed to, just because we assume that the caller wants it.
I understand why - if the privileges have changed then the roles cache is probably wrong - but it seems like we're chaining side-effects together.

ywangd (Member Author):

From just the API point of view, you are right that these two should not be tied together. There are valid use cases where the user only wants to actively clear the privileges cache. I did this because the two are always tied together in NativePrivilegeStore, since it was already clearing the role cache before my change.

I tried to avoid nested callbacks (clear the role cache, then clear the privileges cache). It seems OK and more efficient when looking at NativePrivilegeStore alone, but it does feel wrong from a pure API standpoint.

I could either just go with the nested callback, or create a transport-layer-only action to clear both caches, so it is not exposed at the REST layer and still keeps the efficiency at the transport layer; but this does lead to some code redundancy. Another option is to have a query parameter for the clear privilege cache API which, when set to true, clears both caches. What do you think?

import java.io.IOException;
import java.util.List;

public class ClearPrivilegesCacheResponse extends BaseNodesResponse<ClearPrivilegesCacheResponse.Node>
Contributor:

Separately from this PR, it feels like we could consolidate these duplicate classes into a common base class.

ywangd (Member Author):

A common base class for all ClearXxxCacheResponse?

Contributor:

Yes. Not a priority, but there's a bunch of copy paste here that we could ditch.


ywangd commented Apr 30, 2020

Resolves #54317

ywangd and others added 5 commits April 30, 2020 12:39
…ecurity/authz/store/NativePrivilegeStore.java

Co-Authored-By: Tim Vernum <[email protected]>
…ecurity/action/privilege/TransportClearPrivilegesCacheAction.java

Co-Authored-By: Tim Vernum <[email protected]>
@ywangd ywangd requested a review from tvernum June 2, 2020 14:02

ywangd commented Jun 2, 2020

@elasticmachine run elasticsearch-ci/1


ywangd commented Jun 3, 2020

After discussion with @tvernum, it is agreed that a ReadWriteLock is necessary to achieve maximum correctness. We cannot completely avoid caching a stale result, but we can ensure that if a stale result is cached, it is invalidated as soon as possible, and this is where the ReadWriteLock comes in. It works as follows:

  • Before caching a result, acquire a read lock
  • Before invalidating the cache, acquire a write lock

Acquiring the read lock ensures that any invalidation request is held off until the current result has been cached. If the current result is stale, e.g. because the index was updated while the result was being cached, the locking mechanism guarantees that the invalidation will happen after the current caching action finishes. In other words, the time window for a stale result to stay in the cache is as small as possible.

The code is updated accordingly.
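
For reference, a minimal sketch of the locking scheme described above (hypothetical names and plain Java collections instead of the Elasticsearch cache class; the actual NativePrivilegeStore code is more involved):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only: populators publish under the read lock, invalidators clear under
// the write lock, so an invalidation always waits for in-flight cache puts.
class LockGuardedCache<K, V> {
    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();
    private final AtomicLong numInvalidation = new AtomicLong();
    private final ReadWriteLock invalidationLock = new ReentrantReadWriteLock();

    long invalidationCounter() {
        return numInvalidation.get();
    }

    // Called by a "populator" after it has loaded fresh data from the index.
    void maybeCache(K key, V value, long counterBeforeLoad) {
        invalidationLock.readLock().lock(); // hold off invalidators while publishing
        try {
            if (counterBeforeLoad == numInvalidation.get()) {
                cache.put(key, value);
            }
        } finally {
            invalidationLock.readLock().unlock();
        }
    }

    // Called by the "invalidator" on index state changes or explicit clearing.
    void invalidateAll() {
        invalidationLock.writeLock().lock(); // waits for in-flight puts to finish
        try {
            numInvalidation.incrementAndGet();
            cache.clear();
        } finally {
            invalidationLock.writeLock().unlock();
        }
    }
}
```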


ywangd commented Jun 15, 2020

Just pushed two more updates:

  • Added a TTL (default 24h) to the caches. This setting acts as a safety net for rare cases where things go out of sync; it ensures that out-of-sync info will eventually be cleared out.
  • Used a single cache size setting for both caches, because the two-layer caching is an implementation detail and it does not feel right to force users to be aware of it. A rough sketch of what such settings might look like follows this list.
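
The sketch below uses the standard Elasticsearch Setting builders; the setting keys and the size default are assumptions for illustration only, not copied from the PR (the 24h TTL default matches the comment above):

```java
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.unit.TimeValue;

// Sketch only: keys and the size default are hypothetical.
public final class PrivilegeCacheSettings {
    public static final Setting<TimeValue> CACHE_TTL_SETTING = Setting.timeSetting(
        "xpack.security.authz.store.privileges.cache.ttl",      // hypothetical key
        TimeValue.timeValueHours(24),
        Setting.Property.NodeScope);

    // A single size setting shared by both caches, since the two-layer design
    // is an implementation detail.
    public static final Setting<Integer> CACHE_SIZE_SETTING = Setting.intSetting(
        "xpack.security.authz.store.privileges.cache.max_size", // hypothetical key
        10_000,                                                  // assumed default
        Setting.Property.NodeScope);

    private PrivilegeCacheSettings() {}
}
```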


@tvernum tvernum left a comment


LGTM, but I think we can be a bit smarter with the locking.

// Always completely invalidate application names cache due to wildcard
applicationNamesCache.invalidateAll();
uniqueNames.forEach(descriptorsCache::invalidate);
}
Contributor:

I think we can release the lock immediately after incrementing numInvalidation. It will mean that, in theory, we could invalidate things that we don't need to, but would mean holding a lock for less time.

Did you consider the trade-off of how long to lock for vs perfect cache accuracy?

If we keep the lock around the whole invalidation process, then I think the calculation of uniqueNames should happen before the lock is acquired.

ywangd (Member Author):

Thanks @tvernum. Your comment is very insightful. You are right that we could minimize the locking time in this case. The possibility of invalidating more than necessary should be very low: a getPrivileges thread would need to read the already-incremented value of numInvalidation, perform its search query, and cache the result, all before the cache is actually invalidated. The chance of that is extremely low. I have updated the code to minimize the locking for both invalidate and invalidateAll.

A more likely scenario is "unnecessarily skipping putting a result in the cache". We only cache when numInvalidation has not changed, but this value changes for both partial and full cache invalidation. In the case of a partial invalidation, the entries being invalidated may not be relevant to the entries we want to cache, yet the code skips caching them anyway, for simplicity. This, however, has nothing to do with the locking, i.e. the same situation existed before we added the locking. Overall, combined with how Kibana behaves, I think this is an acceptable trade-off because:

  • The chance is still low
  • We always fully invalidate applicationNamesCache even for a partial invalidation, and we decided to keep it this way for simplicity.

Also moved the descriptorsCache != null check before the read lock, so we do not acquire the lock only to find out there is no cache to use. It is an edge-case optimization, but it is easy to add.
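
As a sketch of the narrowed lock window discussed here (hypothetical names again, building on the earlier LockGuardedCache sketch):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only: the write lock is held just long enough to bump the counter;
// the actual clearing happens outside the lock.
class NarrowLockCache<K, V> {
    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();
    private final AtomicLong numInvalidation = new AtomicLong();
    private final ReentrantReadWriteLock invalidationLock = new ReentrantReadWriteLock();

    void invalidateAll() {
        invalidationLock.writeLock().lock();
        try {
            // Any populator that read the counter before this point will now
            // refuse to cache its (possibly stale) result.
            numInvalidation.incrementAndGet();
        } finally {
            invalidationLock.writeLock().unlock();
        }
        // Clearing outside the lock keeps the exclusive section tiny; at worst a
        // concurrent populator's freshly loaded entry is dropped unnecessarily.
        cache.clear();
    }
}
```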

@ywangd ywangd merged commit 38185e5 into elastic:master Jun 29, 2020
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Jul 1, 2020
Add caching support for application privileges to reduce the number of round-trips to the security index when building application privilege descriptors.

Privilege retrieval in NativePrivilegeStore is changed to always fetch all privilege documents for a given application. The caching is applied everywhere, including the "get privileges" and "has privileges" APIs and CompositeRolesStore (for authentication).
ywangd added a commit that referenced this pull request Jul 2, 2020
Add caching support for application privileges to reduce the number of round-trips to the security index when building application privilege descriptors.

Privilege retrieval in NativePrivilegeStore is changed to always fetch all privilege documents for a given application. The caching is applied everywhere, including the "get privileges" and "has privileges" APIs and CompositeRolesStore (for authentication).
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Aug 4, 2020
Relates: elastic/elasticsearch#55836

Derive ClearCachedRealmsResponse from NodesResponseBase to
expose NodeStatistics.
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Aug 5, 2020
Relates: elastic/elasticsearch#55836

Derive ClearCachedRealmsResponse from NodesResponseBase to
expose NodeStatistics.
github-actions bot pushed a commit to elastic/elasticsearch-net that referenced this pull request Aug 5, 2020
Relates: elastic/elasticsearch#55836

Derive ClearCachedRealmsResponse from NodesResponseBase to
expose NodeStatistics.
github-actions bot pushed a commit to elastic/elasticsearch-net that referenced this pull request Aug 5, 2020
Relates: elastic/elasticsearch#55836

Derive ClearCachedRealmsResponse from NodesResponseBase to
expose NodeStatistics.
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Aug 5, 2020
Relates: elastic/elasticsearch#55836

Derive ClearCachedRealmsResponse from NodesResponseBase to
expose NodeStatistics.

Co-authored-by: Russ Cam <[email protected]>
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Aug 5, 2020
Relates: elastic/elasticsearch#55836

Derive ClearCachedRealmsResponse from NodesResponseBase to
expose NodeStatistics.

Co-authored-by: Russ Cam <[email protected]>
Labels: >enhancement, :Security/Authorization (Roles, Privileges, DLS/FLS, RBAC/ABAC), Team:Security (meta label for security team), v7.9.0, v8.0.0-alpha1
Projects: None yet
Development: Successfully merging this pull request may close these issues: Add cache for application privileges.
6 participants