Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

design doc for network policies #29814

Merged
merged 1 commit into from
Nov 1, 2024

Conversation

jubrad
Copy link
Contributor

@jubrad jubrad commented Oct 1, 2024

Motivation

Design doc for network policies.

This PR adds a known-desirable feature.
https://github.com/MaterializeInc/database-issues/issues/4637

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@jubrad jubrad force-pushed the network-policies-design-doc branch 4 times, most recently from c269b08 to f78411c Compare October 2, 2024 16:59
@jubrad jubrad requested a review from pH14 October 2, 2024 16:59
@jubrad jubrad force-pushed the network-policies-design-doc branch 4 times, most recently from 9f0db7a to dd0e74a Compare October 3, 2024 13:55
Copy link
Contributor

@pH14 pH14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing this up @jubrad! I'll defer to others for the SQL syntax proposals, but structurally this all seems sounds to me. Feels like it tackles just the right amount of scope for v1, while leaving stepping stones to future additions as needed

@jubrad jubrad requested review from pH14 and ParkMyCar October 3, 2024 14:35
@jubrad jubrad marked this pull request as ready for review October 3, 2024 14:35
Copy link
Member

@ParkMyCar ParkMyCar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woohoo! Thanks for writing this @jubrad, looking forward to getting it built!

Comment on lines +29 to +30
- Policy inheritance from associated roles. IE if 'bob' is a member of role 'eng'
we will not apply policies from role 'eng' to 'bob'.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW Role specific defaults, e.g. ALTER ROLE parker SET search_path do not get inherited either


### Where network policies get applied.

Network policies could be applied at many layers of our stack, from network firewalls or security groups that intercept traffic before it hits application subnets, to k8s or cilium network policies, balancers, or within the database itself. The above solutions choose to implement policies within the database itself. This comes with some disadvantages. For instance, this layer does not have auto-scaling and requires the database to do some work for each denied request. For this reason, it makes sense to shift the policies left. One possible shift is to the balancer layer. In this scenario, balancers would support both HTTP and pgwire load-balancing as well as network policy enforcement. Balancers have auto-scaling and are relatively stateless. A large number of out-of-policy requests to a balancer would likely not impact any ongoing connections. The biggest challenge with implementing network policies in the balancer is that they do not have access to the policies or roles, which are stored in the database. To move network policies to the balancers we would need some way of sharing all the policies and roles for all the environments a balancer is proxying. Another place we could shift these policies would be to a WAF or network firewalls. Neither one of these seems reasonable to implement for both pgwire and HTTP in a multi-tenant ingress layer, but this could be revisited for private ingress. It would still have the same issues of keeping policies up-to-date as the balancer.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might not be relevant but something that comes to mind here when thinking about shifting policies left is how we store SECRETs in an external system. Point being, there is already precedent for environmentd to store and fetch data from something other than the Catalog or Persist. I could imagine doing something similar for communicating network policies with balancers or something even further left

## Open questions

#### Single default policy for all resources?
Should there be different default policies for users, sources, and sinks, or should a single default policy be applied to all resources once those resources start supporting policies? It may be difficult to roll out new resources if we only have one default, but it does seem nicer in the long run.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know how AWS or other cloud providers handle network policies for their various services? Seems like a similar design space maybe? How to handle Webhook sources might be a point in the "per-resource type" strategy instead of a single default

Sources and sinks are planned as a follow-up to user-based policies, but it remains an open question how we provide a user-friendly mechanism for webhook sources where it may be hard to find a list of IPs if the webhook request is coming from
a third party.

#### The story on lockout is a bit weak.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO explicitly pointing users to our formal support ticket process gets us pretty far

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never seen this! Agree. We do need to be cautious about how we validate users are who they say we are, but this is a great mechanism for getting our attention.

formal support ticket process gets us pretty far

Comment on lines 65 to 82
```sql
CREATE NETWORK POLICY OFFICE_01 (
RULE ( ACTION=ALLOW, SOURCE="10.0.0.0/32", COMMENT="OFFICE IP - 2024-9-28" )
);
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good to me, I do wonder if we want to separate out COMMENT though and instead point folks towards something like:

COMMENT ON NETWORK POLICY office_01 is 'OFFICE IP - 2024-9-28'

Just so it's the same as other objects. Regardless we should probably pass this through the SQL council!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we'd want the comment to be on each rule which is not a resource that's stored outside of the policy.

... maybe description is a better name for this field.

FWIW my inspiration was, in part, aws security groups

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think comment is good—PostgreSQL (and therefore Materialize) uses "comment" to mean what AWS calls "description" pretty routinely.

Here's my recommendation: force every rule to have a name. Something like this would align with how you declare replicas for unmanaged clusters:

CREATE NETWORK POLICY OFFICE_01 (
  RULES(
    nikhils_desktop (ACTION=ALLOW, SOURCE="10.0.0.0/32"),
    justins_laptop (ACTION=ALLOW, SOURCE="10.0.0.2/32"),
  )
 );

Then you could name a specific rule in the COMMENT statement:

COMMENT ON NETWORK RULE office_01.justins_laptop is 'Laptop assigned to Justin Bradfield on 2024-01-03'

It also tees us up to support altering network policies one rule at a time, like:

ALTER NETWORK POLICY DROP RULE justins_laptop

We don't need to support that first thing, but it's nice to have the option to introduce the granular editing syntax.

Copy link
Contributor Author

@jubrad jubrad Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the alter policy would actually be:
ALTER NETWORK POLICY office_01 DROP RULE justins_laptop

If we go down this path we would probably want to treat rules more like cluster replicas, I was thinking about them more like brokers in a kafka connection. They'll probably want unique names and IDs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the alter policy would actually be:
ALTER NETWORK POLICY office_01 DROP RULE justins_laptop

Ah whoops, yes!

If we go down this path we would probably want to treat rules more like cluster replicas, I was thinking about them more like brokers in a kafka connection. They'll probably want unique names and IDs?

Yes, exactly! I think you could punt on IDs for now, but yeah you'd want to ensure unique names.

@jubrad jubrad changed the title WIP design doc for network policies design doc for network policies Oct 3, 2024
Copy link
Contributor

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking really good. Thanks for writing this up, @jubrad!

A new `NetworkPolicy` resource will be added to the catalog.
```rust
struct NetworkPolicy {
id: NetworkPolicyId
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does NetworkPolicyId look like? I'm assuming something that looks like enum RoleId or enum ClusterId, so roughly:

enum NetworkPolicyId {
    System(u64),
    User(u64),
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will just be a GlobalId

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it'll need to be its own type! Today GlobalIds are only for things that go in the items collection in the catalog—i.e., things in schemas, rather than things that exist in the global namespace (clusters, roles, databases). @ParkMyCar is also refactoring things to introduce a CatalogItemId, but again that will still be just for objects that exist in schemas.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case I think it'll probably be similar to ClusterId

enum NetworkPolicyId {
    System(u64),
    User(u64),
}

I'll update the doc.

Comment on lines 48 to 59
enum NetworkPolicyRule {
Ingress {
action: NetworkPolicyRuleAction,
source: IpNet,
comment: String
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will there ever be a case in Materialize where a single policy will be used in a context where there need to be both ingress and egress rules? With sources, we have push sources (webhooks) and pull sources (Kafka), but never both; it seems intuitive to me that a network policy on a webhook source would restrict inbound connections while a network policy on a Kafka source would restrict outbound connections.

If you buy this argument, I think concretely the difference would be using a generic term like ADDRESSES rather than SOURCE or TARGET or DESTINATION in network rules, and then whether the policy applied in the inbound or outbound direction would be a function of whether it was attached to a role, pull source, or push source.

    RULE (ACTION = ALLOW, ADDRESSES = 'cidr')

Copy link
Contributor Author

@jubrad jubrad Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you can intuit whether a policy/rule should be ingress or egress... two reasons:

  1. If we want to introduce ingress sinks, then sinks would support both ingress and egress making it impossible to define a single policy default for sinks.
  2. We want to be able to share policies between resource types without having to ensure some color match.

Thought for bullet 2
Let's say a user bob has a specific network policy bob_policy, that allows ingress access from IP 1.2.3.4/32.
It should follow that bob cannot create a source or sink that would expand their policy scope; i.e, any source or sink they create should inherit the policy applied to the user. This requires that the policy for bob be applicable to all resource types bob can create.

Alternatives for enum

  • have a different ingress and egress rule set for policies,
  • add direction to NetworkPolicyRule and change source to address

I like alternative 2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. If we want to introduce ingress sinks, then sinks would support both ingress and egress making it impossible to define a single policy default for sinks.

Ah this is a very good point. Let's keep the distinction between egress and ingress rules then, and prepare to allow mixing and matching ingress and egress rules in a single policy. I also like the alternative you've proposed (adding a direction to each rule).

Let's say a user bob has a specific network policy bob_policy, that allows ingress access from IP 1.2.3.4/32.
It should follow that bob cannot create a source or sink that would expand their policy scope; i.e, any source or sink they create should inherit the policy applied to the user. This requires that the policy for bob be applicable to all resource types bob can create.

Hm, I'm not sure this follows for me! I can imagine connecting as a user that's limited by a network policy to ingress from office IPs only, but wanting to create a webhook source that allows ingress from some EC2 instance somewhere that I've set up.

Copy link
Contributor Author

@jubrad jubrad Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah... I agree with you the network policies a user is able to associate with a given source/sink should be controlled by the policies on which the user has usage policies for, not the users existing policies. If one wants to lock that down to only the policy of the user, they could grant the user only usage privileges to that policy.

Comment on lines 65 to 82
```sql
CREATE NETWORK POLICY OFFICE_01 (
RULE ( ACTION=ALLOW, SOURCE="10.0.0.0/32", COMMENT="OFFICE IP - 2024-9-28" )
);
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think comment is good—PostgreSQL (and therefore Materialize) uses "comment" to mean what AWS calls "description" pretty routinely.

Here's my recommendation: force every rule to have a name. Something like this would align with how you declare replicas for unmanaged clusters:

CREATE NETWORK POLICY OFFICE_01 (
  RULES(
    nikhils_desktop (ACTION=ALLOW, SOURCE="10.0.0.0/32"),
    justins_laptop (ACTION=ALLOW, SOURCE="10.0.0.2/32"),
  )
 );

Then you could name a specific rule in the COMMENT statement:

COMMENT ON NETWORK RULE office_01.justins_laptop is 'Laptop assigned to Justin Bradfield on 2024-01-03'

It also tees us up to support altering network policies one rule at a time, like:

ALTER NETWORK POLICY DROP RULE justins_laptop

We don't need to support that first thing, but it's nice to have the option to introduce the granular editing syntax.


Example syntax for updating the default_network_policy
```sql
ALTER SYSTEM SET default_network_policy = OFFICE_01;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way our variable inheritance works I think you just want to call this network_policy; variables that aren't overridden at the role level default to the system parameter of the same name. This is how e.g. cluster works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I think I'm not fully understanding this.

My understanding for clusters is that there is a system var for cluster that selects which cluster will be selected by default on login. Users can update this var to select a different cluster. I think we have a different story for policies. Customers shouldn't be able to set their policy through a session var. In order to apply a different policy to a role you'd have to alter the role and set a new policy, or alter the default policy for everyone.

I'm happy to change the name to network_policy just want to make sure we're agreeing on the flow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Customers shouldn't be able to set their policy through a session var. In order to apply a different policy to a role you'd have to alter the role and set a new policy, or alter the default policy for everyone.

💯 agree with this.

The reason I think it's still ideal to use network_policy for both the system and role variable is because you're going to need to add special cases to deny inheritance as a session variable. By default, anything settable as a role variable (e.g. cluster) is also settable as a session variable. If you add a system variable named default_network_policy and a role variable named network_policy, you're going to need to add special cases in a few places to handle the inheritance properly. Whereas if you just use network_policy, the existing inheritance should work correctly for the system/role level, and you'll only need to add one special case to ensure that users can't override the network_policy at the session level.

To mitigate user lockouts, we will prevent users from altering their network policy in a way that will block their current `client_ip`. In the case of a lockout, we would need to modify an admin role using the `mz_system` and temporarily set a network policy that either allowed global access for that user or allowed access to a particular IP they provide.

### Possible downsides
This design presents a highly configurable solution that guarantees no access to data and is likely the easiest mechanism to implement, however, it does have some downsides. The largest downside is in the guarantee it provides. The best level of network restriction we could provide is that no network traffic reaches the database. The proposed solution only guarantees that no connection can be established with the data plane (coordinator). This has some implications for DOS attacks which must be handled outside the scope of these policies.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a downside worth accepting! Makes the implementation much more straightforward, and we can always add on an eventually consistent L3/L4 firewall that reads these policies and can more efficiently enforce them on incoming connections.

- Moving the `SystemVar` from a `Vec<IpNet>` to an `Ident` pointing to a `NetworkPolicy` resource
- Modifying roles to have an `Option<NetworkPolicy>` and adding the SQL to set this policy
- Adding validations to prevent lock out
- Following up with sinks/sources
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love how you've scoped down each incremental steps here!

@jubrad jubrad force-pushed the network-policies-design-doc branch from dd0e74a to dd7de2c Compare October 14, 2024 14:12
@jubrad jubrad self-assigned this Oct 14, 2024
doc/developer/design/20240925_network_policies.md Outdated Show resolved Hide resolved

```

Users will be able to create `NetworkPolicies` directly. A user must have `CREATENETWORKPOLICY` privileges to create, modify, or destroy network policies. Network policies will be limited to 25 rules. This will be controlled by an LD flag. `NetworkPolicyRules` must be created through a policy. The policy rules implementation will initially only contain an `Allow` variant, but we should be an enum to allow for a `Deny` variant in the future. `NetworkPolicyRule` will hold a `NetworkPolicyRuleDirection` enum to allow for both ingress and egress policies. The names of `NetworkPolicyRules` must be unique within a policy. A `NetworkPolicyRules` will contain a single `IpNet`. We will allow for comments on `NetworkPolicies` as well as individual `NetworkPolicyRules`, the latter will not be implemented initially.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will there be a corresponding privilege for viewing network policies?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean they will be globally readable, or will CREATENETWORKPOLICY also be required for reading policy info?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have any such restrictions on resources like clusters, so probably no restrictions here either.

@jubrad jubrad force-pushed the network-policies-design-doc branch from 7f05bac to e0a865e Compare October 28, 2024 18:19
@jubrad jubrad mentioned this pull request Oct 30, 2024
5 tasks
@jubrad jubrad force-pushed the network-policies-design-doc branch from e0a865e to 2c98463 Compare November 1, 2024 02:50
@jubrad jubrad enabled auto-merge November 1, 2024 02:51
@jubrad jubrad merged commit cfb7218 into MaterializeInc:main Nov 1, 2024
9 checks passed
@jubrad jubrad mentioned this pull request Nov 1, 2024
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants