design doc for network policies
jubrad committed Oct 2, 2024
1 parent c80b754 commit c269b08
Showing 1 changed file with 147 additions and 0 deletions.
147 changes: 147 additions & 0 deletions doc/developer/design/20240925_network_policies.md
@@ -0,0 +1,147 @@
# Network Policies

- Associated:
  - https://github.com/MaterializeInc/database-issues/issues/7062
  - https://github.com/MaterializeInc/database-issues/issues/4637
  - https://github.com/MaterializeInc/materialize/pull/29739
  - https://github.com/MaterializeInc/materialize/pull/29179

## The Problem
Customers would like to restrict access to Materialize by IP address.
https://github.com/MaterializeInc/database-issues/issues/4637

## Success Criteria
- Customers can define a global policy that restricts access to their Materialize environments based on the IP address of the client attempting to connect.
- Materialize support can unlock an environment where policies prevent access.
- The console is aware of whether an environment's network policies are blocking a connection it is trying to make.
- Users will be able to adjust network policies in the console.

Nice to haves:
- Per user network policies.
- Preventing user lockout by inspecting the current user's IP and the policies they are attempting to apply.
- Termination of active connections based on newly applied policies.
- Policies that restrict egress or ingress of source/sink traffic.

## Out of Scope
- Preventing out-of-policy traffic from reaching Environmentd.
(Note that this means network policies will not prevent DDoS attacks.)
- Policy inheritance from associated roles. I.e., if 'bob' is a member of role 'eng',
we will not apply policies from role 'eng' to 'bob'.
- Restricting global API access.
- Restricting access to Frontegg.


## Solution Proposal

#### Overview
The proposed solution is to use role-based policies with a default network policy that applies to any role without a policy. This can initially be implemented as a global default policy and later be extended to per-user and per-source/sink policies. The policy will be applied when an attempt is made to establish a new client connection with the coordinator.

#### New Resources
A new `NetworkPolicy` resource will be added to the catalog.
```rust
// `IpNet` is assumed to come from the `ipnet` crate.
struct NetworkPolicy {
    id: NetworkPolicyId,
    name: String,
    rules: Vec<NetworkPolicyRule>,
}

// An enum so that egress rules can be added later.
enum NetworkPolicyRule {
    Ingress {
        action: NetworkPolicyRuleAction,
        source: IpNet,
        comment: String,
    },
}

enum NetworkPolicyRuleAction {
    Allow,
    // Deny - may be added later
}
```

Users will be able to create `NetworkPolicies` directly. `NetworkPolicyRules` must be created through a policy. The rule action will initially only have an `Allow` variant, but it should be an enum to allow for a `Deny` variant in the future. Similarly, `NetworkPolicyRule` will be an enum to allow for both ingress and egress rules, although only ingress rules will be supported initially. `NetworkPolicyRule::Ingress` will also contain a single `IpNet` and a comment text field. Comments have become a standard for rules and greatly increase the manageability and auditability of policies.

Example syntax for creating a network policy
```sql
CREATE NETWORK POLICY OFFICE_01 (
    RULE ( ACTION=ALLOW, SOURCE='10.0.0.0/32', COMMENT='OFFICE IP - 2024-9-28' )
);
```

Network policies will initially only be assignable to roles. This will later be extended to sources and sinks. By default, only superusers will be able to modify and assign network policies, but it will also be possible to assign a network policy to a resource if one's role has usage privileges on the network policy and privileges to modify the resource. To prevent network policies from becoming too large and decreasing performance, we'll limit the number of rules in a given policy to 25.

Example syntax for assigning a network policy to a role
```sql
ALTER ROLE BOB SET network_policy = OFFICE_01;
```
* Policies can only be applied to login roles. There is no policy inheritance: only the policy assigned to the role the user logged in as will be checked.

Along with network policies, a new `SystemVar` (`default_network_policy`) will be added that points to a specific `NetworkPolicy`. This system var will only be modifiable by `mz_system` and superusers. If a resource does not have a network policy, the default policy will be applied.

Example syntax for updating the default_network_policy
```sql
ALTER SYSTEM SET default_network_policy = OFFICE_01;
```
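The fallback from a role's policy to the default can be sketched as follows. This is a minimal illustration, not the actual catalog code: `PolicyCatalog`, its fields, and `effective_policy` are hypothetical names, and policies are identified by name only.

```rust
use std::collections::HashMap;

/// Hypothetical view of the catalog state needed to pick a policy.
pub struct PolicyCatalog {
    /// Role name -> policy assigned via `ALTER ROLE ... SET network_policy`.
    pub role_policies: HashMap<String, String>,
    /// Value of the `default_network_policy` system variable.
    pub default_network_policy: String,
}

impl PolicyCatalog {
    /// Returns the name of the policy that applies to the login role.
    /// Only the login role is consulted; role membership is not followed,
    /// matching the "no policy inheritance" rule above.
    pub fn effective_policy(&self, login_role: &str) -> &str {
        self.role_policies
            .get(login_role)
            .map(String::as_str)
            .unwrap_or(self.default_network_policy.as_str())
    }
}
```

A role with an assigned policy resolves to that policy; every other role resolves to the default.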


### Policy Enforcement
In `coord::handle_startup`, the user will be inspected to see whether they have a network policy. If the user does not have a policy, the policy specified by `default_network_policy` will be applied. If the user's `client_ip` is allowed by the policy, the connection will continue normally. If the `client_ip` is denied by the policy, `handle_startup` will return an `AdapterError::UserSessionsDenied`. This error will be handled by the protocol layer (`HTTP`, `pgwire`) to give the user an L7 response. In the case of `HTTP` this will be a `403 Forbidden`. Additionally, the response body will contain JSON data describing the failure, for example:
```json
{
    "message": "session denied",
    "code": "MZ011",
    "detail": "Access denied for address 1.2.3.4"
}
```
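A rough sketch of the startup check, using only the standard library. Everything here is an assumption made for illustration: `Cidr` is a hypothetical stand-in for `IpNet`, `StartupError` only mimics the shape of `AdapterError::UserSessionsDenied`, and `check_startup` and `http_status` are invented names rather than the coordinator's real functions.

```rust
use std::net::IpAddr;

/// Hypothetical stand-in for `IpNet`: a network address plus prefix length.
#[derive(Debug, Clone, Copy)]
pub struct Cidr {
    pub addr: IpAddr,
    pub prefix_len: u8,
}

impl Cidr {
    /// Returns true if `ip` falls inside this network.
    pub fn contains(&self, ip: IpAddr) -> bool {
        match (self.addr, ip) {
            (IpAddr::V4(net), IpAddr::V4(ip)) => {
                let len = u32::from(self.prefix_len.min(32));
                // /0 is handled separately: shifting a u32 by 32 bits overflows.
                let mask = if len == 0 { 0 } else { u32::MAX << (32 - len) };
                (u32::from(net) & mask) == (u32::from(ip) & mask)
            }
            (IpAddr::V6(net), IpAddr::V6(ip)) => {
                let len = u32::from(self.prefix_len.min(128));
                let mask = if len == 0 { 0 } else { u128::MAX << (128 - len) };
                (u128::from(net) & mask) == (u128::from(ip) & mask)
            }
            // An IPv4 rule never matches an IPv6 client, and vice versa.
            _ => false,
        }
    }
}

/// Assumed shape of the denial error returned from startup.
#[derive(Debug, PartialEq)]
pub enum StartupError {
    UserSessionsDenied { client_ip: IpAddr },
}

/// Allows the session only if at least one allow rule contains the client IP.
pub fn check_startup(allow: &[Cidr], client_ip: IpAddr) -> Result<(), StartupError> {
    if allow.iter().any(|net| net.contains(client_ip)) {
        Ok(())
    } else {
        Err(StartupError::UserSessionsDenied { client_ip })
    }
}

/// HTTP status the protocol layer would report for a startup result.
pub fn http_status(result: &Result<(), StartupError>) -> u16 {
    match result {
        Ok(()) => 200,
        Err(StartupError::UserSessionsDenied { .. }) => 403,
    }
}
```

Because only `Allow` rules exist, a policy with no matching rule denies the connection. In this sketch an empty rule list therefore denies everyone; whether an empty policy should be permitted at all is not settled by this document.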

When a 403 is returned with a `session denied` message, the console should report that network policies are blocking the user's access to their environment. Access restrictions will not be applied to the Global API or Frontegg, so their UI components may still load.

### Handling lockout.
To mitigate user lockouts, we will prevent users from altering their own network policy in a way that would block their current `client_ip`. In the case of a lockout, we would need to modify an admin role using the `mz_system` user and temporarily set a network policy that either allows global access for that user or allows access from a particular IP they provide.
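One way the self-lockout guard could look, as a sketch under stated assumptions: rules are simplified to IPv4 `(network, prefix length)` pairs, and `AllowRule`, `rule_allows`, and `guard_against_lockout` are hypothetical names, not part of the proposal.

```rust
use std::net::Ipv4Addr;

/// Hypothetical IPv4-only allow rule: network address and prefix length.
pub type AllowRule = (Ipv4Addr, u8);

/// Returns true if `ip` falls inside the rule's network.
fn rule_allows(rule: &AllowRule, ip: Ipv4Addr) -> bool {
    let len = u32::from(rule.1.min(32));
    // /0 is handled separately: shifting a u32 by 32 bits overflows.
    let mask = if len == 0 { 0 } else { u32::MAX << (32 - len) };
    (u32::from(rule.0) & mask) == (u32::from(ip) & mask)
}

/// Rejects a policy change that would block the session making the change.
pub fn guard_against_lockout(
    new_rules: &[AllowRule],
    client_ip: Ipv4Addr,
) -> Result<(), String> {
    if new_rules.iter().any(|rule| rule_allows(rule, client_ip)) {
        Ok(())
    } else {
        Err(format!(
            "new network policy would block the current session's address {}",
            client_ip
        ))
    }
}
```

The check runs against the rules about to be applied, before they take effect, so a rejected change leaves the existing policy in place.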


### Possible downsides
This design presents a highly configurable solution that guarantees out-of-policy clients get no access to data, and it is likely the easiest mechanism to implement. However, it does have some downsides. The largest is in the guarantee it provides. The best level of network restriction we could provide is that no out-of-policy network traffic reaches the database. The proposed solution only guarantees that no connection can be established with the data plane (coordinator). This has implications for DoS attacks, which must be handled outside the scope of these policies.


## Minimal Viable Prototype

Minimally, this feature can be implemented with a single `SystemVar` (`default_network_policy_allow_list`), configurable by `superusers`, that contains an allow-list of CIDRs (`Vec<IpNet>`).

```sql
ALTER SYSTEM SET default_network_policy_allow_list = '100.10.0.0/28,100.10.128.0/28';
```

The policy will be checked at `coord::handle_startup`, will respond with L7 errors on denial, and will apply to all users' connections to the system.
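For illustration, parsing the comma-separated value and checking a client address might look like the sketch below. It is IPv4-only and uses the standard library in place of `IpNet`; `parse_cidr`, `parse_allow_list`, and `is_allowed` are hypothetical helpers, not the prototype's actual code.

```rust
use std::net::Ipv4Addr;

/// Parses one `a.b.c.d/len` entry into (network, prefix length).
fn parse_cidr(entry: &str) -> Result<(Ipv4Addr, u8), String> {
    let (addr, len) = entry
        .trim()
        .split_once('/')
        .ok_or_else(|| format!("missing prefix length in '{}'", entry))?;
    let addr: Ipv4Addr = addr
        .parse()
        .map_err(|e| format!("bad address in '{}': {}", entry, e))?;
    let len: u8 = len
        .parse()
        .map_err(|e| format!("bad prefix length in '{}': {}", entry, e))?;
    if len > 32 {
        return Err(format!("prefix length out of range in '{}'", entry));
    }
    Ok((addr, len))
}

/// Parses the comma-separated value of the prototype system variable.
pub fn parse_allow_list(value: &str) -> Result<Vec<(Ipv4Addr, u8)>, String> {
    value.split(',').map(parse_cidr).collect()
}

/// Returns true if any entry in the allow-list contains `client_ip`.
pub fn is_allowed(allow_list: &[(Ipv4Addr, u8)], client_ip: Ipv4Addr) -> bool {
    allow_list.iter().any(|&(net, len)| {
        // /0 is handled separately: shifting a u32 by 32 bits overflows.
        let mask = if len == 0 { 0 } else { u32::MAX << (32 - u32::from(len)) };
        (u32::from(net) & mask) == (u32::from(client_ip) & mask)
    })
}
```

With the example value above, each `/28` entry covers 16 addresses, so `100.10.0.5` is allowed while `100.10.0.16` is not.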

PR for minimum prototype: https://github.com/MaterializeInc/materialize/pull/29739

Adding on to this we can move the `SystemVar` from a `Vec<IpNet>` to an `Ident` pointing to a `NetworkPolicy` resource. Again, we can still use a single default. Then we can start allowing assignment to roles followed by sinks/sources.


## Alternatives

### What policies apply to.
A common alternative approach is to have a global allow-list. This approach was considered and will be the initially delivered solution, but adding per-user policies with a default had similar complexity and added clarity around the scope of the policies; i.e., they only impact users, not sources, or sinks.


### Where network policies get applied.

Network policies could be applied at many layers of our stack. We are choosing the layer closest to the data. However, this layer does not have the same auto-scaling as the layers in front of it, and it still requires the database to do some work for each denied request. For this reason, it may have made sense to apply policies in the Balancers. In this scenario, balancers would support both HTTP and pgwire load-balancing as well as network policy enforcement. Balancers have auto-scaling and are relatively stateless. A large number of out-of-policy requests to a balancer would likely not impact any ongoing connections. The biggest challenge with implementing network policies in the balancers is that they do not have access to the policies or roles, which are stored in the database. To move network policies to the balancers, we would need some way of sharing all the policies and roles for all the environments a balancer is proxying. It may also be possible to provide access restrictions via a WAF or network firewalls, but neither seems reasonable to implement for both pgwire and HTTP in a multi-tenant ingress layer. This could be revisited for private ingress.


## Open questions

#### Single default policy for all resources?
Should there be different default policies for users, sources, and sinks, or should a single default policy be applied to all resources once those resources start supporting policies? It may be difficult to roll out new resources if we only have one default, but it does seem nicer in the long run.

#### How do we handle webhook sources?
Sources and sinks are planned as a follow-up to user-based policies, but it remains an open question how we provide a user-friendly mechanism for webhook sources, where it may be hard to find a list of IPs if the webhook request is coming from a third party.

#### The story on lockout is a bit weak
We may want to provide a more programmatic way to handle this, but we can wait and see if this becomes a problem.

#### Should we support per-database or per-cluster policies?
I think the answer to this is just no, at least not right now. These sorts of things can be enforced with RBAC for the foreseeable future. It may be worth revisiting if we ever get per-cluster use-case isolation where envd isn't the single access point.
