
Add cache for application privileges #54317

Closed
tvernum opened this issue Mar 27, 2020 · 7 comments · Fixed by #55836
Labels
>enhancement :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC Team:Security Meta label for security team

Comments

@tvernum
Contributor

tvernum commented Mar 27, 2020

When we added Application Privileges we intentionally didn't cache them, because the semantics of a distributed cache can be tricky (we've had to tweak the user & roles caching logic several times) and we weren't sure how necessary it would be.

We've recently seen a few places where that lack of caching had an impact, and it's probably time to bring it in.

@tvernum tvernum added >enhancement :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC labels Mar 27, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-security (:Security/Authorization)

@ywangd
Member

ywangd commented Apr 20, 2020

It is probably obvious, but my overall approach will be as follows. Please raise any concerns before I commit too deeply to the changes. Thanks.

@tvernum
Contributor Author

tvernum commented Apr 21, 2020

I don't think it's necessary to use the ListenableFuture stuff from the CUPR; a simpler cache is probably sufficient.

The tricky bit to think through will be what you cache by. The main method on NPS is

getPrivileges(Collection<String> applications, Collection<String> names, listener)

With special cases of (see the sketch below):

  • Both Collections are empty (fetch all)
  • names is empty (fetch all privileges for application(s))
  • applications contains "*" (fetch for all applications)
  • An element in applications ends with "*" (fetch by prefix)
  • applications is empty (fetch for all applications)
  • fall back to fetch-by-id if none of the applications are a prefix match, and there are explicit names
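For reference, this is roughly how those request shapes could map to fetch strategies; a sketch with hypothetical names, not the actual NativePrivilegeStore logic:

```java
import java.util.Collection;

class FetchStrategy {
    // Hypothetical dispatch over the special cases listed above.
    static String of(Collection<String> applications, Collection<String> names) {
        if (applications == null || applications.isEmpty() || applications.contains("*")) {
            return "fetch-all-applications";                    // empty or "*" means everything
        }
        if (applications.stream().anyMatch(app -> app.endsWith("*"))) {
            return "prefix-search";                             // at least one application is a prefix pattern
        }
        if (names == null || names.isEmpty()) {
            return "fetch-all-privileges-for-applications";     // concrete applications, no names
        }
        return "fetch-by-id";                                   // concrete applications and explicit names
    }
}
```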

So we could have two approaches to caching

  1. Cache by query, so the cache key is Tuple<Set<String>, Set<String>>
  2. Cache by id, so the cache key is Tuple<String,String>

The former will be more efficient for repeated searches on the same values, but useless when the query is slightly different. For example, if we have an up-to-date cache entry for { Set.of("*"), Set.of() }, then we know the answer to { Set.of("kibana-.kibana"), Set.of() }, but the cache would not know that, and we would not get any benefit from it unless we do special handling.

Caching-by-id will mean that a wildcard search will populate the cache so that a future request to get a named privilege will be fast, but it would not really help with caching of the original wildcard query.

Cache-by-query is trickier to keep in sync with updates to privileges. You'd either

  1. have to be really smart about working out which queries would be affected by the update, or
  2. just clear the whole cache whenever a privilege was created/modified.

I suspect it might be worth having both caches, but we'd need to think it through.
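For illustration only, the two key shapes could be declared roughly like this (a sketch assuming the x-pack core descriptor type and Tuple, not a committed design):

```java
import java.util.Map;
import java.util.Set;

import org.elasticsearch.common.collect.Tuple;
import org.elasticsearch.xpack.core.security.authz.privilege.ApplicationPrivilegeDescriptor;

class CandidateCaches {
    // 1. cache by query: key is (applications, names), value is the full result of that query
    Map<Tuple<Set<String>, Set<String>>, Set<ApplicationPrivilegeDescriptor>> byQuery;

    // 2. cache by id: key is (application, privilege name), value is a single descriptor
    Map<Tuple<String, String>, ApplicationPrivilegeDescriptor> byId;
}
```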

@ywangd
Member

ywangd commented Apr 23, 2020

@tvernum Thanks a lot for the discussion. Here is my analysis.

TL;DR
Based on existing use cases, the cache should be keyed by application name (literal, no wildcard) and the values should be a set of ApplicationPrivilegeDescriptor.

The NativePrivilegeStore#getPrivileges method is called in the following three places:

  1. CompositeRolesStore for Role building during authentication
  2. GetPrivileges API
  3. HasPrivileges API

For use case 1 (CompositeRolesStore), since a Role is itself cached, I think it is reasonable not to cache its underlying call to getPrivileges, i.e. we could have a separate method for getting privileges which skips caching altogether. This also has the benefit of avoiding some complexity in the caching design, since CompositeRolesStore is where complicated combinations of applications and privilege names can be passed to the getPrivileges call.

For use case 2 (GetPrivileges), it can only take a single application name, which can contain a wildcard. The wildcard must be at the end of the string, i.e. app-* is a valid wildcard and is translated to a prefix search that fetches all applications beginning with app-, but *-app will be searched literally and match nothing. For privilege names, it can accept a comma-separated list, but wildcards do not work for them.
When Kibana launches, it fires a get privileges call with only its application name, which effectively calls getPrivileges("kibana", null) and gets all of Kibana's privilege documents back.

For use case 3 (HasPrivileges), it always passes null for the privilege names. It allows multiple application names, but does not support wildcards in application names (this is enforced by the request validator). Effectively, this usage can always be treated as multiple calls to getPrivileges, each with a single application name, i.e. multiple getPrivileges(String, null).
Kibana is a heavy user of this API; it basically calls it every time a user clicks something in the UI, e.g. switching apps, switching tabs within the same app, or executing a search in the console. All of these translate to a getPrivileges("kibana", null), which fetches all privilege documents belonging to the kibana application.

Based on the above analysis, I'd propose that the cache should be keyed by the application name (literal name, no wildcard) and the values should be a set of ApplicationPrivilegeDescriptor. We could have the values be document IDs instead of ApplicationPrivilegeDescriptor, but it may not be worth the indirection: the documents are of decent size, so the cost of deserialisation is likely to be a big part of the overall process, and it would be better if caching could avoid it altogether.

With this design, use case 3 is easily covered. If a single application is required, it is a direct retrieval from the cache. If multiple applications are required and some of them are not available in the cache, we can fetch only the unavailable applications.

Use case 2 can also be catered for reasonably well. If the application name is not a wildcard, e.g. _security/privilege/app-1/read,write, it becomes more or less like use case 3. Since the cache has all privileges for an application, we can also easily pick out the required privileges.

If the application name does contain a wildcard, e.g. _security/privilege/app-*, we can check which entries in the cache match the wildcard. This is easily achievable since it is simply a prefix check. In the subsequent search, we can exclude the matched entries from the search criteria. This has the downside that a search is performed regardless: if app-* can only match app-1 and we already have app-1 in the cache, then without the search we cannot know whether the wildcard would match anything else. A remediation is to have a second cache which is keyed by application wildcard expression and whose values are the concrete applications that match the wildcard, e.g. key app-* with values Set("app-1", "app-2"). With this in place, we can avoid the search if the first cache has all the descriptors we need.

The cache will be invalidated when there are changes to the privilege documents. We could invalidate everything whenever there is a change, but it should be relatively easy to invalidate just the applications that have been updated and leave the others untouched.
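A rough sketch of that shape, using plain maps just to show the structure (the real implementation would presumably use the internal Cache with size limits):

```java
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.elasticsearch.xpack.core.security.authz.privilege.ApplicationPrivilegeDescriptor;

class ApplicationKeyedCache {
    // primary cache: literal application name -> all descriptors for that application
    private final Map<String, Set<ApplicationPrivilegeDescriptor>> descriptors = new ConcurrentHashMap<>();
    // secondary cache: wildcard expression -> the concrete application names it expanded to
    private final Map<String, Set<String>> wildcardExpansions = new ConcurrentHashMap<>();

    // invalidate only the applications touched by a put/delete privilege change
    void invalidate(Collection<String> updatedApplications) {
        updatedApplications.forEach(descriptors::remove);
        // any cached wildcard expansion may now be stale, so drop them all
        wildcardExpansions.clear();
    }
}
```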

@ywangd
Member

ywangd commented Apr 24, 2020

My above analysis has two oversights:

First oversight
It assumes the first request to populate the cache is a "has privileges" call. In that case, it is simple to just fill the cache with the required applications. However, if the first request is a "get privileges" call, not every privilege of an application is necessarily requested. This is tricky for an application-name-keyed cache, because the list of retrieved privileges is then incomplete for that application.

A simple solution is to always fetch all privileges for an application regardless of what the request asks for. This makes things simple to handle but has the downside of retrieving more than is actually needed. An even simpler solution is to not cache the "get privileges" use case at all; from Kibana's perspective, that is probably not too bad since it calls "has privileges" all the time.

Another solution might be caching the incomplete list and explicitly marking those cached entries to be excluded from subsequent retrievals. This, however, adds extra complexity.
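A minimal sketch of the "always fetch everything for the application, then filter" option mentioned above (hypothetical helper names, and a plain map standing in for the real cache):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

import org.elasticsearch.xpack.core.security.authz.privilege.ApplicationPrivilegeDescriptor;

abstract class FetchAllThenFilter {
    private final Map<String, Set<ApplicationPrivilegeDescriptor>> cache = new ConcurrentHashMap<>();

    Set<ApplicationPrivilegeDescriptor> getPrivileges(String application, Set<String> names) {
        // the cache always holds the complete set for an application, whatever the request asked for
        final Set<ApplicationPrivilegeDescriptor> all =
            cache.computeIfAbsent(application, this::fetchAllFromSecurityIndex);
        if (names == null || names.isEmpty()) {
            return all;
        }
        // narrow the complete set down to the privilege names the caller asked for
        return all.stream().filter(d -> names.contains(d.getName())).collect(Collectors.toSet());
    }

    // stands in for a search against the security index
    abstract Set<ApplicationPrivilegeDescriptor> fetchAllFromSecurityIndex(String application);
}
```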

Second oversight
A single cache value could become unbounded. Since the cache is keyed by application name and the value is the complete list of associated privileges, that list could grow arbitrarily long. Simply limiting the number of cached applications may not do what we need, since a single application could blow up the cache on its own.

A simple and crude solution is to set an upper limit on the number of ApplicationPrivilegeDescriptors an application can have; if the limit is exceeded, we don't cache that application. Another possibility is to have the cache limited by the actual resources it consumes (e.g. via Lucene's Accountable classes), but I am not sure how feasible that is.

Alternatively, as Tim suggested, the cache can be keyed by "query", i.e. Tuple(Set(application), Set(privilege name)), and the values are a list of Strings (document IDs). Instead of a scroll search, we can then mget the documents. It is also possible to have a second cache keyed by document ID whose value is the ApplicationPrivilegeDescriptor, to reduce the number and size of round-trips to the security index.

The first "query" cache could suffer from the same problem of potentially unbounded value entry (list of matched documents IDs). But since it is a list of simple strings, the issue is not as bad. It also has the problem of duplication since two sets of document IDs from two different queries could overlap, e.g. query of "*" compared to query of "Kibana". There is no simple solution for this and a complex solution is unlikely worth the effort.

Another alternative is to have the cache keyed by Tuple(application, privilege) with the corresponding ApplicationPrivilegeDescriptor as the value. However, I feel this is the least useful option: by itself it cannot determine whether the cached entries are complete for a query, and if we add tracking for that, it effectively becomes the previous alternative suggested by Tim.

In summary, we have the following two designs for the cache:

  1. Map<String, Set<ApplicationPrivilegeDescriptor>>, where the key is the literal application name.
    A secondary cache could be Map<String, Set<String>>, where the key is an application wildcard and the value is the set of concrete application names the wildcard expands to.

  2. Map<Tuple<Set<String>,Set<String>>, Set<String>>, where the key is a tuple of "application wildcard" and "privilege names", and the value is a set of document IDs.
    A secondary cache could be Map<String, ApplicationPrivilegeDescriptor>, where the key is the document ID.

They both have pros and cons. I personally feel the first option is more attractive overall, but I am open to suggestions.
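For comparison, a rough sketch of how option 2 might be wired up, with the mget idea made explicit (the abstract helpers are placeholders, not real APIs):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.elasticsearch.common.collect.Tuple;
import org.elasticsearch.xpack.core.security.authz.privilege.ApplicationPrivilegeDescriptor;

abstract class QueryKeyedLookup {
    // primary cache: query -> matching document IDs
    private final Map<Tuple<Set<String>, Set<String>>, Set<String>> queryToDocIds = new ConcurrentHashMap<>();
    // secondary cache: document ID -> parsed descriptor
    private final Map<String, ApplicationPrivilegeDescriptor> docIdToDescriptor = new ConcurrentHashMap<>();

    Set<ApplicationPrivilegeDescriptor> get(Tuple<Set<String>, Set<String>> query) {
        // one ID-only search when the query has not been seen before
        final Set<String> docIds = queryToDocIds.computeIfAbsent(query, this::searchForMatchingDocIds);
        // resolve IDs via the secondary cache; misses could be fetched in a single mget
        return resolveDescriptors(docIds, docIdToDescriptor);
    }

    abstract Set<String> searchForMatchingDocIds(Tuple<Set<String>, Set<String>> query);

    abstract Set<ApplicationPrivilegeDescriptor> resolveDescriptors(
        Set<String> docIds, Map<String, ApplicationPrivilegeDescriptor> idCache);
}
```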

@tvernum
Contributor Author

tvernum commented Apr 24, 2020

since a Role is itself cached, I think it is reasonable not to cache its underlying call to getPrivileges

Is this true for API Keys? Since we currently have a per-API-key role, and we know that we may have many API keys with the same role descriptors (but unique cache-ids), it feels like some caching here would be useful.
However, we know we'd like to improve this pseudo-role caching, so maybe it's not a big deal.

A simple solution is to always fetch all privileges for an application regardless of what the request asks

I don't think that's unreasonable given that has_privileges needs it. It would greatly simplify the code in the NativePrivilegeStore if everything became a load-by-application query.

Another possibility is to have the cache limited by actual resource it consumes

This is easy to do. You pass a weigher Function object into the cache when you build it, and it is used to calculate the cost (weight) of each entry (as a long). The units are irrelevant - they can be estimated bytes, or just the number of objects (e.g. value.size() + 1 for the key). The cache has a maximum total weight that forces eviction. The weight is not used to determine which entry to evict (it's always the oldest entry), but it prevents the cache from becoming too large.
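For illustration, a minimal sketch of a weight-limited cache built this way, assuming the internal org.elasticsearch.common.cache.CacheBuilder API (the limit and weigher values here are hypothetical):

```java
import java.util.Set;

import org.elasticsearch.common.cache.Cache;
import org.elasticsearch.common.cache.CacheBuilder;
import org.elasticsearch.xpack.core.security.authz.privilege.ApplicationPrivilegeDescriptor;

class WeightedDescriptorCache {
    // weight each entry by the number of descriptors it holds, plus one for the key;
    // the units are arbitrary, only the maximum total weight matters for eviction
    private final Cache<String, Set<ApplicationPrivilegeDescriptor>> cache =
        CacheBuilder.<String, Set<ApplicationPrivilegeDescriptor>>builder()
            .setMaximumWeight(10_000)                                       // hypothetical limit
            .weigher((application, descriptors) -> descriptors.size() + 1L)
            .build();
}
```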

I personally feel the first option is overall more attractive

That's my gut feel also.

@ywangd
Member

ywangd commented Apr 24, 2020

since a Role is itself cached, I think it is reasonable not to cache its underlying call to getPrivileges

Is this true for API Keys?

I think improving pseudo-role caching is a better way of handling this. In addition, even without it, the price to pay here is one call per API key at its first authentication. This is a known initial cost, not an ongoing one, so I think it is affordable. If we get role cache thrashing, then we have a bigger problem than just the additional round trip for application privileges.

I don't think that's unreasonable given that has_privileges needs it. It would greatly simplify the code in the NativePrivilegeStore if everything became a load-by-application query.

If we do make this change across the board, the caching mechanism will work for the CompositeRolesStore usage as well, i.e. API key role caching can be improved even in its current form. There is a balance to strike between fetching too much content (and deserialising it) and fetching less frequently.

This is easy to do. You pass a weigher Function object into the cache when you build it, and it is used to calculate the cost (weight) of each entry (as a long). The units are irrelevant - they can be estimated bytes, or just the number of objects (e.g. value.size() + 1 for the key). The cache has a maximum total weight that forces eviction. The weight is not used to determine which entry to evict (it's always the oldest entry), but it prevents the cache from becoming too large.

Awesome!

Thanks for the discussions. I think I am ready to get hands-on now. I'll proceed with option 1 and cover use cases 2 and 3 (the get and has privileges calls).
