Skip to content

Commit

Permalink
address review comments
Browse files Browse the repository at this point in the history
  • Loading branch information
richabanker committed Jun 7, 2023
1 parent cde58c3 commit 1a2cbf8
Showing 1 changed file with 80 additions and 30 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -109,6 +109,7 @@ tags, and then generate with `hack/update-toc.sh`.
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Network location of apiservers](#network-location-of-apiservers)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
<!-- /toc -->

Expand Down Expand Up @@ -204,7 +205,7 @@ API server change:

- If the request is for a group/version/resource the apiserver doesn't have
locally (we can use the StorageVersion API), it will proxy the request to
one of the apiservers that is listed in the object. If an apiserver fails
one of the apiservers that is listed in the [ServerStorageVersion](https://github.com/kubernetes/kubernetes/blob/release-1.27/pkg/apis/apiserverinternal/types.go#L64) object. If an apiserver fails
to respond, then we will return a 503 (there is a small
possibility of a race between the controller registering the apiserver
with the resources it can serve and receiving a request for a resource
Expand Down Expand Up @@ -270,33 +271,42 @@ This might be a good place to talk about core concepts and how they relate.

### Risks and Mitigations

Cluster admins might not read the release notes and realize they should enable
network/firewall connectivity between apiservers. In this case clients will
receive 503s instead of transparently being proxied. 503 is still safer than
today's behavior.
1. **Network connectivity isues between apiservers**

Cluster admins might not read the release notes and realize they should enable network/firewall connectivity between apiservers. In this case clients will receive 503s instead of transparently being proxied. 503 is still safer than today's behavior. We will clearly document the steps needed to enable the feature and also include steps to verify that the feature is working as intended. Looking at the following exposed metrics can help wth that
1. `kubernetes_uvip_proxied_request_total` to monitor the number of UVIP-proxies requests. This metric can tell us the number of requests that were successfully proxied and the ones that failed
2. `apiserver_request_total` to check the success/error status of the requests

Requests will consume egress bandwidth for 2 apiservers when proxied. We can cap
the number if needed, but upgrades aren't that frequent and few resources are
changed on releases, so these requests should not be common. We will count them
with a metric.
2. **Increase in egress bandwidth**

Requests will consume egress bandwidth for 2 apiservers when proxied. We can cap the number if needed, but upgrades aren't that frequent and few resources are changed on releases, so these requests should not be common. We will count them with a metric.

There could be a large volume of requests for a specific resource which might result in the identified apiserver being unable to serve the proxied requests. This scenario should not occur too frequently, since resource types which have large request volume should not be added or removed during an upgrade -- that would cause other problems, too.
3. **Increase in request traffic directed at destination kube-apiserver**

We should ensure at most one proxy, rather than proxying the request over and over again (if the source apiserver has an incorrect understanding of what the destination apiserver can serve).
There could be a large volume of requests for a specific resource which might result in the identified apiserver being unable to serve the proxied requests. This scenario should not occur too frequently, since resource types which have large request volume should not be added or removed during an upgrade - that would cause other problems, too.

To prevent server-side request forgeries we will not give control over information about apiserver IP/endpoint and the trust bundle (used to authenticate server while proxying) to users via REST APIs.
4. **Indefinite rerouting of the request**

We should ensure at most one proxy, rather than proxying the request over and over again (if the source apiserver has an incorrect understanding of what the destination apiserver can serve). To do this, we will add a new header such as `X-Kubernetes-UVIP-Rerouted:true` to the request once it is determined that the request cannot be served by the local apiserver and should therefore be proxied.
We will remove this header after the request is received by the destination apiserver (i.e. after the proxy has happened once) at which point it will be served locally.

5. **Putting IP/endpoint and trust bundle control in user hands in REST APIs**

To prevent server-side request forgeries we will not give control over information about apiserver IP/endpoint and the trust bundle (used to authenticate server while proxying) to users via REST APIs.

## Design Details

### Aggregation Layer

![Alt text](https://user-images.githubusercontent.com/26771552/243512491-47c79882-1f32-4067-8e10-f8e16faa4997.png?raw=true "Optional Title")

1. A new filter will be added to the [handler chain] of the aggregation layer. This filter will maintain an internal map with the key being the group-version-resource and the value being a list of server IDs of apiservers that are capable of serving that group-version-resource
1. This internal map is populated using an informer for StorageVersion objects. An event handler will be added for this informer that will get the apiserver ID of the requested group-version-resource and update the internal map accordingly

2. This filter will pass on the request to the next handler in the local aggregator chain, if:
1. It is a non resource request
2. The StorageVersion informer cache hasn't synced yet or if `StorageVersionManager.Completed()` has returned false. We will serve error 503 in this case
3. The request has a header that indicates that this request has been proxied once already. If for some reason the resource is not found locally, we will serve error 503
3. The request has a header `X-Kubernetes-UVIP-Rerouted:true` that indicates that this request has been proxied once already. If for some reason the resource is not found locally, we will serve error 503
4. No StorageVersion was retrieved for it, meaning the request is for an aggregated API or for a custom resource
5. If the local apiserver ID is found in the list of serviceable-by server IDs from the internal map

Expand All @@ -306,39 +316,49 @@ To prevent server-side request forgeries we will not give control over informati

5. If the proxy call fails for network issues or any reason, we serve 503 with error `Error while proxying request to destination apiserver`

6. We will also add a poststarthook for the apiserver to ensure that it does not start serving requests until we are done creating/updating SV objects

[handler chain]:https://github.com/kubernetes/kubernetes/blob/fc8f5a64106c30c50ee2bbcd1d35e6cd05f63b00/staging/src/k8s.io/apiserver/pkg/server/config.go#L639

#### StorageVersion enhancement needed

StorageVersion API currently tells us whether a particular StorageVersion can be read from etcd by the listed apiserver. We will enhance this API to also include apiserver ID of the server that can serve this StorageVersion.

#### Identifying destination apiserver's network location

We will use the already existing [masterlease reconciler](https://github.com/kubernetes/kubernetes/blob/master/pkg/controlplane/reconcilers/lease.go) to store/retrieve the IPs and ports for kube-apiservers. Major reasons to use this are
With the enhancement, the new [ServerStorageVersion](https://github.com/kubernetes/kubernetes/blob/release-1.27/pkg/apis/apiserverinternal/types.go#L62-L73) object will have this structure

1. masterlease reconciler already stores kube-apiserver IPs currently
2. this information is not exposed to users in an API that can be used maliciously
3. existing code to handle lifecycle of the masterleases is convenient

How the masterlease reconciler will be used is as follows:

1. We will use the already existing IP in Endpoints.Subsets.Addresses of the masterlease by default
```
type ServerStorageVersion struct {
// The ID of the reporting API server.
APIServerID string
// The API server encodes the object to this version
// when persisting it in the backend (e.g., etcd).
EncodingVersion string
// The API server can decode objects encoded in these versions.
// The encodingVersion must be included in the decodableVersions.
DecodableVersions []string
// Versions that can be served by the reporting API server
ServedVersions []string
}
```

2. For users with network configurations that would not allow Endpoints.Subsets.Addresses to be reachable from a kube-apiserver, we will introduce a new --bind-peer-ip flag to kube-apiserver. We will store its value as an annotation on the masterlease and use this to route the request to the right destination server
#### Identifying destination apiserver's network location

3. We will also expose the IP and port information of the kube-apiservers as annotations in APIserver identity lease object for visibility/debugging purposes
We have listed the approaches to store/retrieve apiservers' ip and port information in the [Alternatives](#alternatives) section.

4. We will also use an egress dialer for network connections made to peer kube-apiservers. For this, will create a new type for the network context to be used for peer kube-apiserver connections ([xref](https://github.com/kubernetes/kubernetes/blob/release-1.27/staging/src/k8s.io/apiserver/pkg/apis/apiserver/types.go#L56-L71))
We will use an egress dialer for network connections made to peer kube-apiservers. For this, will create a new type for the network context to be used for peer kube-apiserver connections ([xref](https://github.com/kubernetes/kubernetes/blob/release-1.27/staging/src/k8s.io/apiserver/pkg/apis/apiserver/types.go#L56-L71))

#### Proxy transport between apiservers and authn

For the mTLS between source and destination apiservers, we will do the following

1. For server authentication by the client (source apiserver) : the client needs to validate the server certs (presented by the destination apiserver), for which it will
1. look at the CA bundle of the authority that signed those certs. We will introduce a new flag --peer-ca-file that must be passed to the kube-apiserver to verify the other kube-apiserver's server certs
1. For server authentication by the client (source apiserver) : the client needs to validate the [server certs](https://github.com/kubernetes/kubernetes/blob/release-1.27/staging/src/k8s.io/apiserver/pkg/server/options/serving.go#L59) (presented by the destination apiserver), for which it will
1. look at the CA bundle of the authority that signed those certs. We will introduce a new flag --peer-ca-file that **is required be passed (for this feature)** to the kube-apiserver that will be used to verify the presented server certs
2. look at the ServerName `kubernetes.default.svc` for SNI to verify server certs against

2. For client authentication by the server (destination apiserver) : destination apiserver will check the source apiserver certs to determine that the proxy request is from an authenticated client. The destination apiserver will use requestheader authentication (and NOT client cert authentication) for this using the kube-aggregator proxy client cert/key and the --requestheader-client-ca-file passed to the apiserver upon bootstrap
2. The server (destination apiserver) will check the client (source apiserver) certs to determine that the proxy request is from an authenticated client. We will use requestheader authentication (and NOT client cert authentication) for this. The client (source apiserver) will provide the [proxy-client certfiles](https://github.com/kubernetes/kubernetes/blob/release-1.27/cmd/kube-apiserver/app/options/options.go#L222-L233) to the server (destination apiserver) which will verify the presented certs using the CA bundle provided in the [--requestheader-client-ca-file](https://github.com/kubernetes/kubernetes/blob/release-1.27/staging/src/k8s.io/apiserver/pkg/server/options/authentication.go#L125-L128) passed to the apiserver upon bootstrap

### Discovery Merging
TODO: detailed description of discovery merging. (not scheduled until beta.)
Expand Down Expand Up @@ -403,8 +423,11 @@ https://storage.googleapis.com/k8s-triage/index.html
In the first alpha phase, the integration tests are expected to be added for:

- The behavior with feature gate turned on/off
- Request is proxied to an apiserver that is able to handle it
- Validation where an apiserver tries to serve a request that has already been proxied once
- Validation where an apiserver tries to call a peer but actually calls itself (to simulate a networking configuration where this happens on accident), and the test fails
- Validation where an apiserver tries to call a peer but actually calls itself (to simulate a networking configuration where this happens on accident), and the test fails with error 503
- Validation where a request that cannot be served by any apiservers is received, and is passed on locally that eventually gets handled by the NotFound handler resulting in 404 error
- Validation where apiserver is mis configured and is proxied to an incorrect peer resulting in 503 error

##### e2e tests

Expand Down Expand Up @@ -926,6 +949,33 @@ Why should this KEP _not_ be implemented?

## Alternatives

### Network location of apiservers

1. Use [endpoint masterlease](https://github.com/kubernetes/kubernetes/blob/release-1.27/pkg/controlplane/reconcilers/lease.go)
1. We will use the already existing IP in Endpoints.Subsets.Addresses of the masterlease by default
2. For users with network configurations that would not allow Endpoints.Subsets.Addresses to be reachable from a kube-apiserver, we will introduce a new optional --bind-peer-ip flag to kube-apiserver. We will store its value as an annotation on the masterlease and use this to route the request to the right destination server
3. We will also expose the IP and port information of the kube-apiservers as annotations in APIserver identity lease object for visibility/debugging purposes

* Pros
1. masterlease reconciler already stores kube-apiserver IPs currently
2. this information is not exposed to users in an API that can be used maliciously
3. existing code to handle lifecycle of the masterleases is convenient

* Cons
1. using masterlease will include making some changes to the legacy code that does the endpoint reconciliation which is known to be brittle

2. Use [coordination.v1.Lease](https://github.com/kubernetes/kubernetes/blob/release-1.27/staging/src/k8s.io/client-go/informers/coordination/v1/lease.go)
1. By default, we can store the [External Address](https://github.com/kubernetes/kubernetes/blob/release-1.27/staging/src/k8s.io/apiserver/pkg/server/config.go#L149) of apiservers as labels in the [APIServerIdentity Lease](https://github.com/kubernetes/kubernetes/blob/release-1.27/pkg/controlplane/instance.go#L559-L577) objects.
2. If `--peer-bind-address` flag is specified for the kube-apiserver, we will store its value in the APIServerIdentity Lease label
3. We will retrieve this information in the UVIP handler using an informer cache for these lease objects

* Pros
1. Simpler solution, does not modify any legacy code that can cause unintended bugs
2. Since in approach 1 we decided we want to store the apiserver IP, port in the APIServerIdentity lease object anyway for visibility to the user, we will be just making this change once in the APIServerIdentity lease instead of both here and in masterleases

* Cons
1. If we take this approach, there is a risk of putting control of the apiserver IP, port information in user hands. This info, can be used maliciously to route a request to a rogue ip:port

<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
Expand Down

0 comments on commit 1a2cbf8

Please sign in to comment.