
Dynamic Audit Policy KEP #1259

Closed
wants to merge 1 commit

Conversation

@shturec commented Sep 26, 2019

The proposal is to enhance the Dynamic Audit Backend feature with an audit policy API that allows selecting subsets of requests, with rules for assigning audit levels.

See the API types out-of-tree implementation for reference.

@k8s-ci-robot (Contributor)

Welcome @shturec!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 26, 2019
@k8s-ci-robot (Contributor)

Hi @shturec. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/auth Categorizes an issue or PR as relevant to SIG Auth. labels Sep 26, 2019
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shturec
To complete the pull request process, please assign liggitt
You can assign the PR to them by writing /assign @liggitt in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shturec (Author) commented Sep 26, 2019

/assign @liggitt

@tallclair (Member)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 26, 2019
@tallclair (Member) left a comment

I didn't get through the whole KEP, but will make another review pass on Monday.

Can you link the google doc with the original discussions in the PR description?

- policies are defined with the `auditregistration.k8s.io/v1alpha1` [`AuditSink`](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.13/#auditsink-v1alpha1-auditregistration) API, which has a [`Policy`](https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/auditregistration/types.go#L77) structure for that purpose.
- Policies define the default level of audit detail for requests to the API server.
- A level can also be assigned to a subset of requests selected by rules.
- Rules are decoupled from policies into dedicated `auditregistration.k8s.io/v1alpha1` `AuditClass` API objects.
Member

Isn't a rule an {audit class, level} tuple that is embedded in the policy?

Author

Yes, but I'm describing this in the context of the previous bullet. What I mean here are the rules encoded in the AuditClasses. It's been a challenge generally to clarify which rules I mean: policy rules vs. audit class rules. Hopefully the word combinations I used below make it clearer.

keps/sig-auth/20190926-dynamic-audit-policy.md (three outdated review threads, resolved)

Requests matched by a rule are *selected*. Rules define the criteria to match requests. The way they are interpreted is no different from file policies.

Selection rules are typed:
Member

I'm still not sold on categorizing the rules like this.

@shturec (Author) Sep 29, 2019

Please state your concerns.

The design driver was to make them as explicit as possible. Requiring supplementary documentation on which selector combinations and values are actually valid is not exactly dev-friendly. I'm open to suggestions on how to achieve that in other (better?) ways.

Member

My concerns are mostly around usability, since the approaches we're debating are equivalent in expressiveness.

If I want to include resources X, Y, and Z and don't care about which namespaces they're in, I need to know whether each one is cluster-scoped or namespace-scoped. This is also error-prone, as I don't think we can validate that the real resources have the same scope as the one listed here. I also think there is some redundancy in the current approach.

Thinking about this a bit more, I think my main concern is around separating cluster & namespace scoped resources. Here's an alternative approach (strawman):

// GroupResourceSelector selects resources within a single API group.
type GroupResourceSelector struct {
  // +optional
  Group string
  // +optional
  Resources []ResourceSelector
  // +optional
  Scope ScopeSelector
}

// ResourceSelector selects a resource kind, optionally narrowed to
// subresources and specific object names.
type ResourceSelector struct {
  Kind string
  // Subresources on this Kind.
  // nil = no subresources.
  // '*' = all subresources & non-subresourced
  // '' = non-subresourced
  // +optional
  Subresources []string
  // +optional
  ObjectNames []string
}

// The scope selector makes the cluster vs. namespaced scope more explicit.
// +union
type ScopeSelector struct {
  // +unionDiscriminator
  Scope ScopeType
  // Namespaces applies only when Scope is "Namespaced".
  // +optional
  Namespaces []string
}

type ScopeType string

const (
  ScopeTypeAny        ScopeType = "Any" // default
  ScopeTypeCluster    ScopeType = "Cluster"
  ScopeTypeNamespaced ScopeType = "Namespaced"
)

This still feels too clunky to me. I'll keep thinking about it...

@shturec (Author) Oct 1, 2019

I see where you're going. Admissionregistration has some nice concepts about request selection we can borrow. I like the idea of merging namespace into the scope of the GroupResource(Selector) structure. Sounds like that's where it belongs indeed.
I have two concerns though (or maybe one, if redundancy is not a thing anymore):

  1. With the proposed approach you can still express the contradiction (see the validation sketch after this list):

       scope:
         scope: Cluster
         namespaces: ["x","y","z"]

     which is like the following possible error you have to keep an eye on today in file policies:

       - group: ""
         resources: ["configmaps"]
         namespaces: [""]

     On the bright side, I agree your proposal is more explicit than file policies are, and it does not build on the special semantics of an empty namespace string in a namespaces array. That's still an improvement over file policy.
  2. You can't avoid verbs and users selector redundancy even in this proposal. As long as we do any kind of breakdown into selector types, they have to be repeated in every single one, because they can always be part of the criteria.
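
A minimal validation sketch for the contradiction in concern 1, assuming the strawman types above; the validateScopeSelector helper and the audit package name are illustrative, not part of any proposed API.

package audit

import "fmt"

// validateScopeSelector rejects the contradictory combination of Cluster
// scope with a non-empty namespaces list, analogous to the file-policy
// pitfall quoted above.
func validateScopeSelector(s ScopeSelector) error {
  switch s.Scope {
  case ScopeTypeCluster:
    // Cluster-scoped selection cannot be narrowed by namespace.
    if len(s.Namespaces) > 0 {
      return fmt.Errorf("namespaces must be empty when scope is %q", ScopeTypeCluster)
    }
  case ScopeTypeNamespaced, ScopeTypeAny, "":
    // Namespaces, if set, narrow the selection; empty means all namespaces.
  default:
    return fmt.Errorf("unknown scope %q", s.Scope)
  }
  return nil
}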

@shturec (Author) Oct 3, 2019

Reading up about unions, this starts to take shape for me too. Anyway, a couple of questions:

  • What does it mean if Scope in GroupResourceSelector is undefined? Same as the Any scope?
  • How does the Any scope type translate to the current policy model? The tuple of a cluster resource rule and an any-namespace rule (with undefined namespace) should be the equivalent in the file policy model, am I right?

Also, unions seem to be in beta now, right? I saw some code parsing the tags, but not exploiting the results. I also didn't see a standard resource using it (for now). My expectation was that annotating unions would surface them more explicitly in the generated API documentation, and that it would somehow ease validation. What's the actual plan?

@shturec (Author) Oct 10, 2019

Let's recap and drive this to the finish line.

The original proposal in this KEP follows the RBAC approach of distinguishing between cluster and namespaced resources with dedicated selector struct types. Unlike RBAC, it distinguishes non-resource request selectors as a separate selector type, and models "any" namespace selection in the resource selector type rather than in the cluster resource selector type.

The proposed change borrows the scope concept from the Dynamic Admission Control approach to request selection rules, reducing the two types of resource selection structs to one by using a discriminator for the resource scope (cluster|namespaced|any). In addition, it introduces a more explicit and fine-grained selection model for resources (now including subresources explicitly).

The example from the KEP, implemented as proposed here, would look like this:

apiVersion: auditregistration.k8s.io/v1alpha1
kind: AuditClass
metadata:
  name: sensitive-things
spec:
  rules:
  - userGroups: ["system:masters"]
    verbs: ["create","patch","update","delete"]
    groupResourceSelectors: # match secrets and configmaps in the kube-system namespace
    - group: ""
      resources:
      - kind: secrets
      - kind: configmaps
      scope: Namespaced
      namespaces: ["kube-system"]
  - groupResourceSelectors: # match auditclass cluster resources
    - group: "auditregistration.k8s.io"
      resources:
      - kind: auditclasses
      scope: Cluster
---
apiVersion: auditregistration.k8s.io/v1alpha1
kind: AuditClass
metadata:
  name: noisy-lowrisk-things
spec:
  rules:
  - groupResourceSelectors: # match configmaps named controller-leader in the kube-system namespace
    - group: ""
      resources:
      - kind: configmaps
        objectNames: ["controller-leader"]
      scope: Namespaced
      namespaces: ["kube-system"]
  - users: ["system:kube-proxy"]
    verbs: ["watch"]
    groupResourceSelectors: # match services and endpoints cluster-wide
    - group: ""
      resources:
      - kind: services
      - kind: endpoints
  - userGroups: ["system:authenticated"]
    nonResourceSelectors: # match requests to URL path patterns "/api*" or "/version"
    - urls: ["/api*","/version"]

It seems we won't be able to take advantage of the unions enhancement for now, but the foundations for adopting it later will be laid down.

Finally, I would like to propose two more improvements (a Go sketch of both follows after the examples below):

  1. Change from flat Users and UserGroups to an extensible Subject
    The main reason is to be consistent with how we treat, for example, resources in GroupResourceSelector. The criteria composition there is with OR, while on the rule level it's AND. Introducing a list of subjects, each of which can be a User, a UserGroup (or a service account or anything else in the future), each with a set of filter values, is more coherent than the current approach with flat Users and UserGroups. Rewriting the example above:

  - subjects:
    - type: User
      names: ["system:kube-proxy"]
    verbs: ["watch"]

Note that the proposed user criteria composition is semantically no different from the one in the static policy.

  2. Selection of generated or high-volume objects by label

I have concerns about the 1.17 timeline, so at this stage the minimum I want is to confirm this improvement and ensure it can be introduced in a subsequent API update. For that I will need the namespaces list elements to be composite objects.

Generated and high-volume objects are a common case, for example, for providers of managed environments. Think of namespaces for consumers' control planes. In this use case, the lack of a convenient way to select a potentially enormous number of objects (e.g. namespaces) is a bigger concern than somebody applying or removing a label, since cluster control is concentrated and controlled anyway.

Also, considering that one of the major scenarios is debugging, there is a higher demand for flexibility in selecting audited objects, including generated ones. For example, you'd be interested in what it is that does funny things to the "my-app" pods of a deployment, not in all pods in a namespace in general, especially in large solutions. Of course, this would be encouraged only in clusters where it is safe to rely on object labels.

I would like to change the namespaces selector to comprise composite structs allowing dynamic selection based on labels, similar to what we have in dynamic admission control here and here.

Considering the example above, a selector would then look like this:

groupResourceSelectors: # match secrets and configmaps in the kube-system namespace
- group: ""
  resources: 
  - kind: secrets
  - kind: configmaps
  scope: Namespaced
  namespaces: 
  - name: kube-system

or using a namespace selector as proposed:

groupResourceSelectors: # match secrets and configmaps in namespaces labeled as "environment:prod" or "environment:staging"
- group: ""
  resources: 
  - kind: secrets
  - kind: configmaps
  scope: Namespaced
  namespaces: 
  - matchExpressions:
    - key: environment
      operator: In
      values: ["prod","staging"]

Similarly, a matchExpressions field could be added to ResourceSelector to achieve the same at the object level, as here.
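
A minimal Go sketch of both proposed improvements, assuming the strawman types above; the type and field names (Subject, SubjectType, NamespaceSelector) are illustrative, not final API.

package audit

import (
  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// SubjectType discriminates the kind of subject a rule matches.
type SubjectType string

const (
  SubjectTypeUser      SubjectType = "User"
  SubjectTypeUserGroup SubjectType = "UserGroup"
  // Further types (e.g. ServiceAccount) could be added later.
)

// Subject replaces the flat Users and UserGroups fields; the names within
// one subject are OR-ed, consistent with resources in GroupResourceSelector,
// while criteria on the rule level are AND-ed.
type Subject struct {
  Type  SubjectType
  Names []string
}

// NamespaceSelector turns each entry of the namespaces list into a composite
// object, so namespaces can be selected by exact name or by label, similar
// to the namespaceSelector in dynamic admission control.
type NamespaceSelector struct {
  // Name selects a single namespace by exact name.
  // +optional
  Name string
  // MatchExpressions selects namespaces by label.
  // +optional
  MatchExpressions []metav1.LabelSelectorRequirement
}

With this shape, the namespaces entries in the examples above (the "- name: kube-system" form and the "- matchExpressions: ..." form) map directly onto NamespaceSelector.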

## Summary

The proposal is to enhance the [Dynamic Audit Backend](https://kubernetes.io/docs/tasks/debug-application-cluster/audit/#dynamic-backend) feature with an **audit policy** that has the following characteristics:
- policies are defined with the `auditregistration.k8s.io/v1alpha1` [`AuditSink`](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.13/#auditsink-v1alpha1-auditregistration) API, which has a [`Policy`](https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/auditregistration/types.go#L77) structure for that purpose.
Member

Is the policy a Kubernetes resource, or is it a type embedded in the audit webhook resource?

Author

A type embedded in the audit webhook resource (AuditSink > Policy).


Introducing a dependency from an `AuditSink` policy on another runtime object (`AuditClass`) by reference raises concerns about maintaining the consistency of this relationship. For example, what should be the status of an `AuditSink` comprising a policy that refers to a non-existing `AuditClass`? How is this addressed when the sink is created, and when the `AuditClass` is deleted at runtime?

This proposal borrows conceptually from how `Pod` resources treat the resources they use, such as secrets. `AuditSink` objects are created even if the referenced `AuditClass` objects are not yet available at the time, but the corresponding dynamic backend does not send events to that sink until all references are available. Deleting an `AuditClass` object without removing existing policy references to it is blocked.
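
A minimal sketch of the deletion guard described above; the trimmed-down types are hypothetical (the withAuditClass field name is borrowed from examples later in this thread), and the real out-of-tree API may differ.

package audit

import "fmt"

// Hypothetical, trimmed-down shapes of the proposed types.
type PolicyRule struct {
  WithAuditClass string // reference to an AuditClass by name
  Level          string
}

type AuditSink struct {
  Name  string
  Rules []PolicyRule
}

// validateAuditClassDeletion blocks deletion of an AuditClass while any
// AuditSink policy still references it, implementing the "deletion is
// blocked" behavior described in the KEP text above.
func validateAuditClassDeletion(className string, sinks []AuditSink) error {
  for _, sink := range sinks {
    for _, rule := range sink.Rules {
      if rule.WithAuditClass == className {
        return fmt.Errorf("audit class %q is still referenced by audit sink %q", className, sink.Name)
      }
    }
  }
  return nil
}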
Member

I disagree. IMO the audit class should be thought of as a separate, async system:

  1. Request is received
  2. Audit classifier labels the request with all audit classes that it matches
  3. (per webhook) audit the request at the level of the first rule that matches the request.

Under this model, if the AuditClass resource is missing, then no requests will have that class label, so that rule in the policy will be ignored.

WDYT?

Author

That's probably where we have to start. Let's elaborate.

Audit classifier labels the request with all audit classes that it matches

What are the criteria for a classifier to match a request to a class dynamically at runtime? A request has no supplementary metadata. So the only thing left is to figure out whether a class has a rule that matches the request. Then again, how is that different from step 3 and from what is already proposed here?

Considering the context, I would understand the proposal if step 2 read "Audit classifier labels the request with all audit classes". Was that what you actually meant, by chance?

@shturec (Author) Sep 29, 2019

I'm confused about the labels part too. So you are saying that class (rule set) references in a policy should not be normative, but a kind of recommendation? If classes are delivered by one role (say an operator, together with the deployment of application or extension bundles), and the policies are defined by a more privileged role (say an administrator), that means operators can influence what's being audited. Deleting a class outright could hide their activities for good. And that would not sit well if you wanted only administrators to control auditing.

But I must admit that adhering to the role separation implied in the KEP has other disadvantages. Consider administrator and operator roles in the context above, with policy design fully in control of the admin and rules only provided by operators, e.g. together with deployed apps/extensions. That is kind of a rephrasing of what we have now with the static policy file, but in a slightly different form. I mean, that approach implies the administrator has certain app/extension domain knowledge. How would they know whether, why, and at what level they have to include an app/extension rule set (from a class) provided by an operator, without understanding its semantics and audit requirements? But concrete apps/extensions may well be out of scope for their role too. See the contradiction? We are imposing concrete domain knowledge on a role scope that should be agnostic to it.

Maybe this is not the right role breakdown, or other forms of audit object breakdown and/or mechanisms are needed. I am eager to hear improvement suggestions.

Member

Let me try to give an example to illustrate what I was proposing:

Suppose I have the following audit classes (pseudo-code):

- name: tim-requests
  users: ["tim"]
- name: bob-requests
  users: ["bob"]
- name: get-requests
  verbs: ["get"]
- name: secrets-requests
  resources: ["secrets"]
- name: pod-requests
  resources: ["pods"]

And my audit policy looks like:

- level: None
  class: "foo-requests" // we don't have "foo-requests"
- level: Metadata
  class: "secrets-requests"
- level: RequestResponse
  class: "tim-requests"
- level: None
  class: "get-requests"

1. Request Received: Then a request comes in:

[user=tim] GET /api/v1/namespaces/default/secrets/foo

2. Classification: The system "labels" (attaches to the context) the matching audit classes:

- tim-requests
- get-requests
- secrets-requests

3. Policy Decision: Then the audit policy takes the first matching rule: secrets-requests --> Metadata

The classification step (2) has no notion of a "foo-requests" class, so it doesn't care about it. It only considers the classes that are defined. The policy decision step (3) doesn't care what audit classes are defined, only what classes the request is labeled with. The request isn't labeled with "foo-requests" just like it isn't labeled with "bob-requests" - from the policy decision perspective there's no difference between those 2 situations.

Does this help clarify?
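
A minimal Go sketch of the classify-then-decide flow just described; all types and names here are illustrative stand-ins, not the real audit machinery or proposed API.

package audit

// Request carries the attributes the classifier matches on.
type Request struct {
  User, Verb, Resource string
}

// AuditClass pairs a class name with its selection rules, represented here
// as a single predicate for brevity.
type AuditClass struct {
  Name    string
  Matches func(Request) bool
}

// PolicyRule binds a class reference to an audit level.
type PolicyRule struct {
  Class string
  Level string // e.g. "None", "Metadata", "RequestResponse"
}

// classify labels the request with every defined class whose rules match it
// (step 2). A class that is not defined simply never appears in the result.
func classify(req Request, classes []AuditClass) map[string]bool {
  labels := map[string]bool{}
  for _, c := range classes {
    if c.Matches(req) {
      labels[c.Name] = true
    }
  }
  return labels
}

// decide applies the first policy rule whose class the request is labeled
// with (step 3). A rule naming an undefined class is skipped, exactly like
// a rule naming a class the request doesn't match.
func decide(labels map[string]bool, policy []PolicyRule, defaultLevel string) string {
  for _, rule := range policy {
    if labels[rule.Class] {
      return rule.Level
    }
  }
  return defaultLevel
}

With the example classes and policy above, classify labels tim's GET on the secret with tim-requests, get-requests, and secrets-requests, and decide skips the undefined foo-requests rule and returns Metadata.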

@shturec (Author) Oct 1, 2019

Yes, thank you. That is actually in line with the proto-implementation that was prepared for the initial policy proposal (except for subtleties such as classes being attached to the context, but in the end it works as described), and I can imagine a good deal of it will be reused. Let me confirm a few things with you though:

  • Step 2 does not occur on a per-request basis. I am using a cached and synchronized AuditClasses state (using informers, roughly as sketched after this list) to figure out what's available. I don't check the state synchronously with a list call per request. That would be overkill.
  • When I refer to an effective policy or compiled policy in the KEP, I mean a policy produced from the classes identified in step 2 and the policy levels, then normalized into a ready-to-use state for step 3. It is updated on changes to the audit sink or the (referenced) classes.
  • For step 3 I use the API server's existing audit checker mechanism to figure out the level and stage from the compiled policy and the request.
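
A rough sketch of the cached, informer-based AuditClass state mentioned in the first bullet; the GVR matches the API group used in this KEP, but the function and the recompile hook are assumptions, since the proto-implementation lives out of tree.

package audit

import (
  "time"

  "k8s.io/apimachinery/pkg/runtime/schema"
  "k8s.io/client-go/dynamic"
  "k8s.io/client-go/dynamic/dynamicinformer"
  "k8s.io/client-go/tools/cache"
)

var auditClassGVR = schema.GroupVersionResource{
  Group:    "auditregistration.k8s.io",
  Version:  "v1alpha1",
  Resource: "auditclasses",
}

// watchAuditClasses keeps a local cache of AuditClass objects in sync and
// triggers recompilation of the effective policy on every change, so the
// per-request path never lists AuditClasses synchronously.
func watchAuditClasses(client dynamic.Interface, recompile func(), stopCh <-chan struct{}) cache.SharedIndexInformer {
  factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 10*time.Minute)
  informer := factory.ForResource(auditClassGVR).Informer()
  informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    AddFunc:    func(obj interface{}) { recompile() },
    UpdateFunc: func(oldObj, newObj interface{}) { recompile() },
    DeleteFunc: func(obj interface{}) { recompile() },
  })
  factory.Start(stopCh)
  return informer
}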

It seems we should focus the discussion on who controls what (my second comment). I am not completely fond of the options for resource and responsibility breakdown as they stand now.

I'm starting to think we should elaborate the following.
An admin can strictly define the cluster resources they care about as class references in a policy.
Operators can contribute classes with domain-specific rules, together with audit levels, that are then dynamically composed into a policy, should an admin allow that.
Consider our earlier discussion of my proposal for composition by labels/annotations as an enabler for that.
Of course, we then need to decide on the ordering and level issues. But I think that nicely solves the responsibility-mix problem we have now. And it is in line with the concept of dynamically resolved effective rules, but without shifting control away from admins. Overall, we are discussing dynamism vs. control, and with this proposal we delegate the decision to the owners of the concrete cluster setup.

WDYT? Should I sketch it more precisely?

@shturec (Author) Oct 10, 2019

Let's recap and finish this.

My proposal is to allow three options:

  1. The administrator is fully responsible for the policy composition. They could also control the auditclasses' life-cycle management. Alternatively, operators could deploy audit classes and negotiate with the administrator to include them in the policy (in case the two roles are distinct). If another role can control the auditclasses, their removal needs to be controlled by the system while they are referenced by a policy. That would guarantee the administrator's control over the policy.
  2. Administrators have exclusive control, but allow operators to contribute to a policy dynamically. This allows defining some minimal rules that take precedence, as an enforced compliance framework for example. In this case the policy features some hard-wired referenced rules, most probably deployed by the administrator themselves. And it can also feature rules that are dynamically contributed by operators with their solutions. This option allows a great level of flexibility and separation of concerns. Operators can helm solutions and auditclasses in and out of the cluster, focusing on solution compliance, and administrators can focus on cluster compliance, non-disruptively to each other (a compilation sketch follows below):

      rules:
      - withAuditClass: sensitive-things
        level: Metadata
      - selector: # the rule matches all auditclasses labeled with "environment: prod"
        - environment: prod

    In this example an administrator allows auditclasses labeled with environment: prod to contribute rules to the policy. These rules will be considered after the rules defined in the sensitive-things auditclass.
    A note about the levels of dynamically added classes: to preserve the separation of concerns we enjoy here, levels need to be defined either in the classes or annotated/labeled by operators at deploy time. Either way, despite being less explicit, this is not the administrator's responsibility or concern.
    A note about order: the order of rules from dynamically selected auditclasses in the effective policy will not be guaranteed. We could introduce annotations/labels that operators could use to define this explicitly, but IMO we first need to confirm it is required.
  3. An administrator can opt to have a policy populated with rules fully dynamically, i.e. the same as above, but without the hard-wired auditclass reference. That can be achieved either with an agreed label on all classes, or we could model it more or less explicitly, e.g. with an empty selector. No control over the availability of auditclass resources is exercised in this case, as you suggested.

In this way we avoid mixing role responsibilities and at the same time achieve a compelling level of dynamism.
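
A minimal sketch of how option 2's effective policy could be compiled, with illustrative, simplified types; the label matching, the level carried on the classes, and the compileEffectivePolicy helper are assumptions, not settled design.

package audit

// Illustrative, simplified shapes; not the proposed API.
type ClassRule struct{ /* request selection criteria elided */ }

type AuditClass struct {
  Name   string
  Labels map[string]string
  Level  string // level carried with the class for dynamic contribution
  Rules  []ClassRule
}

type PolicyEntry struct {
  WithAuditClass string            // hard-wired class reference
  Level          string            // level set by the administrator
  Selector       map[string]string // labels selecting contributing classes
}

type LeveledRule struct {
  Rule  ClassRule
  Level string
}

// matches reports whether every selector label is present on the class.
func matches(selector, labels map[string]string) bool {
  for k, v := range selector {
    if labels[k] != v {
      return false
    }
  }
  return true
}

// compileEffectivePolicy walks the policy entries in order: a hard-wired
// withAuditClass reference contributes that class's rules at the
// administrator-set level, while a selector entry pulls in every matching
// class at the class's own level. Order among dynamically selected classes
// is not guaranteed, as noted above.
func compileEffectivePolicy(entries []PolicyEntry, classes []AuditClass) []LeveledRule {
  byName := map[string]AuditClass{}
  for _, c := range classes {
    byName[c.Name] = c
  }
  var out []LeveledRule
  for _, e := range entries {
    if e.WithAuditClass != "" {
      if c, ok := byName[e.WithAuditClass]; ok {
        for _, r := range c.Rules {
          out = append(out, LeveledRule{Rule: r, Level: e.Level})
        }
      }
      continue
    }
    for _, c := range classes {
      if matches(e.Selector, c.Labels) {
        for _, r := range c.Rules {
          out = append(out, LeveledRule{Rule: r, Level: c.Level})
        }
      }
    }
  }
  return out
}

An empty selector map matches every class, which also covers option 3's fully dynamic population.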

Member

Sorry, I'm having a hard time following this thread. Let me try to recap the issue:

  1. Administrators should be able to define an audit policy in a secure way that can only be altered by other admins or highly privileged users (e.g. security team).
  2. Operators want to be able to extend the audit policy for a service they operate. For example, I added a new type and I want to make sure it is audited appropriately, or I added a new service that makes a lot of requests, and I want to omit it to cut down on noise.

I think the problem is that these 2 use-cases are fundamentally at odds. I suppose there might be some way for the admin to define a policy on how/what the operator can modify, but I can't think of a good way to define such a policy. Ultimately, I think something like gatekeeper or OPA might be the best tool here, to allow arbitrary restrictions on what audit classes a user can define. IMO we should focus on the first case.

There is also a third use case:
3) I want to add some special-case logging for my service, e.g. for debugging.

As it stands, there isn't a good way for an administrator to allow this securely. Even if a user is allowed to edit a webhook created for them, they could abuse that to read everything (secrets) in the cluster. In this case, I think what we really need is namespaced audit webhooks, and we should consider that out-of-scope for this KEP.

Does this answer your questions, or am I way off base here?

@shturec (Author) Oct 16, 2019

You are not way off, but we are still thinking in different terms about how to achieve it. I am convinced that we can address the scenarios with clear separation and simpler means, with the policy's "ease of reasoning" being the only partially open topic. I think I've laid out my point sufficiently in my last post above, so I will not repeat it. I suggest we have a touch-base call to find common ground, as doing that offline seems to be going nowhere.

@shturec (Author) commented Sep 30, 2019

I added a reference to the draft with the comments, both in the KEP metadata and in the References section.

@tallclair (Member)

/assign @lavalamp
As the original proposer of the AuditClass approach.

@shturec shturec force-pushed the dynamic-policy-kep branch 6 times, most recently from c17c550 to 774d252 Compare October 3, 2019 09:16
@shturec (Author) commented Oct 3, 2019

Added:

  • a reference to the API out-of-tree implementation
  • short inline code snippets illustrating the proposal as you read
  • a section discussing compatibility with the file policy in the same cluster

@shturec shturec force-pushed the dynamic-policy-kep branch from 774d252 to 722f650 Compare October 3, 2019 09:32
@k8s-ci-robot k8s-ci-robot removed the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 3, 2019
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Oct 3, 2019
@shturec (Author) commented Oct 10, 2019

Since the API freeze for 1.17 is approaching fast, I would like to close any major open topics w.r.t. the API design and runtime by the end of this week.
Thanks!

@shturec (Author) commented Oct 14, 2019

Proposed:

  • extensible subjects
  • label selectors to capture a subset of high-volume or generated objects, e.g. for debugging

Waiting for feedback before the deadline in order to merge this into the KEP.
See the Design Proposal supplementary document for a preview.

@shturec shturec force-pushed the dynamic-policy-kep branch from 722f650 to 3841e14 Compare October 15, 2019 00:20
@shturec (Author) commented Oct 15, 2019

@lavalamp, @tallclair: still waiting for approval before the deadline slips.

@tallclair (Member) left a comment

I think we need to focus on simplifying this. As it stands, it's hard to reason about, and it's going to be hard to get everyone to agree in a reasonable timeframe. Can you think of any ways to break this problem down into smaller pieces that can be addressed independently?

I'll follow up with a comment with some ideas.

- "@shturec"
owning-sig: sig-auth
participating-sigs:
- sig-apimachinery
Member

Add sig-instrumentation


In environments such as managed clusters (e.g. GKE), there are multiple roles interested in controlling certain audit aspects, and those interests do not necessarily overlap, or at least not completely. When a policy controlled by only one of the roles is not the actual goal, a single policy is a limitation that cannot adequately address the interests of all auditing stakeholders.

### Insight into cluster activities on the spot
Member

I'm worried that we're conflating too many use cases, and that's why this KEP is so long and complicated. I agree that audit logging is useful for this, but I don't think sig-auth is the right owner for it. There are a lot of different ways one could instrument a Kubernetes deployment (e.g. distributed tracing).

@tallclair (Member)

Was brainstorming about this, and have some more radical ideas. This is a big departure from what we've discussed so far, but might actually get at the use cases more specifically, and we might be able to tackle it in smaller pieces.


We've mostly been thinking of this in terms of level of detail, noise, how to slice up requests, and how to grant access to the policy.

Let me attempt to reframe the problem in terms of the fundamental properties of the request.

  1. The consequences of the request - i.e. what could happen if this request was abused? This could be a sort of "severity" measurement, but it also might not be strictly ordered; e.g. an object that holds PII and a secret that holds the documentation site's access credentials might be tagged as "PII" and "access-credentials" respectively.
  2. The sensitivity of the request - i.e. what are the consequences of accessing the audit logs? The sensitivity could be used to determine both the level of detail included in the audit logs and who is allowed to access that log entry.
  3. The trust of the requester - i.e. what are the chances that the requester is compromised or acting maliciously?

I think that (1) and (2) are properties on API objects, or the target of a request. (3) is clearly a property of the requester, or the source of a request.

Observation: (1) and (2) are properties of the resources, and shouldn't vary across log destinations. Given that, what if these properties were attached to the objects themselves? E.g. when I create a resource, I decide how sensitive it is and what bad things could happen if it's accessed maliciously.

Attaching a trust level to a requester is harder, and needs to come from the authentication mechanism. For service accounts, it's easy enough to make it a property on the SA resource. For external users, it would need to come from the external system (or just take some constant baseline).

Now, revisiting some use cases:

  1. Cluster admin: Set up cluster-wide audit policy, register webhook, log everything, set severity threshold (how noisy) and level of detail.
    • Also: tag cluster-scoped objects & system resources with severity & sensitivity.
  2. Operator: Can set the severity & sensitivity on objects they own. Doesn't need to worry about policy beyond that.
  3. Application developer: same as the operator?

The one piece that's missing from this is how to address a noisy request. E.g. a service that's constantly reading a secret that it owns. Auditing each of those requests doesn't tell me anything - I expect that service to be reading that secret. This could be part of the audit policy, but that doesn't scale as well, since the admin controls the policy. It could be part of authorization - when a user is granted permission, the user granting the permission decides how trusted the action is. It could also be tied to the target - e.g. the owner of the secret declares (on the secret itself) which requesters are trusted. I need to think more about these before picking a favorite...


Sorry this is all very rambly. Just trying to get some thoughts written down. Moving forward on any of these ideas is going to require some broader high-bandwidth conversations (i.e. sig meetings).

@shturec (Author) commented Oct 16, 2019

Can you think of any ways to break this problem down into smaller pieces that can be addressed independently?

Not before the KEP freeze.

@spiffxp mentioned this pull request Jan 1, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 14, 2020
@palnabarun (Member)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 14, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 13, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 13, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
