KEP-0275: Cluster to noncluster connection

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in [submariner-io/enhancements] (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in submariner-io/website, for publication to submariner-io
  • Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Motivation

The egress source IP of traffic from a pod to a destination outside the Kubernetes cluster is not a fixed value. The pod IP can change across pod restarts. In addition, when packets leave the cluster, some CNI plugins translate (SNAT) the source address to the node IP, which can also change across pod restarts. However, many devices and applications use IP-based ACLs to restrict incoming traffic, for security reasons or bandwidth limitations. As a result, such ACLs outside the k8s cluster will block packets from the pod, causing a connectivity issue. To resolve this issue, we need a feature to assign a particular static egress source IP to one or more particular pods.

Related discussions can be found here and here. In addition, a PoC implementation can be found here.

Goals

  • Provide users with an official and common way to assign a static egress source IP for packets sent from one or more pods to destinations outside the k8s cluster.
    • Scope1: Access from pods in one k8s cluster to servers outside the cluster
    • Scope2: Access from pods in one of multiple k8s clusters to servers outside the clusters

Non-Goals

  • TBD

Proposal

User Stories (optional)

Story 1

There is an existing database server in an on-premise data center which restricts access by source IP. A new application deployed on k8s in the same data center needs to access the database server (Scope1).

Story 2

There is an existing database server in an on-premise data center which restricts access by source IP. A new application deployed on k8s in a different cloud needs to access the database server (Scope2).

Notes/Constraints/Caveats (optional)

Risks and Mitigations

Security Risks

  • As this proposal provides users with a way to change source IP addresses, and source IPs can be used to restrict access, care must be taken to prevent malicious users from setting arbitrary source IP addresses.
    • The user-facing API should ensure that only the right sources can be assigned the right source IPs,
    • Tunneling components should only allow access from the right sources.

Performance and Scalability Risks

  • This proposal provides a kind of tunneling between pods and external servers, so there will be a performance overhead,
  • The number of tunnels needed will be the number of combinations of pods and external servers, so performance scalability needs to be considered,
  • Scalability of the number of source IPs consumed should also be considered, especially for ingress access. To allow ingress access, a combination of source IP and port is dedicated to that access, so the source IP can't be reused to listen on the same port for another purpose. As a result, source IPs will easily be exhausted if there is a requirement to use a specific port to access multiple pods. (For egress access, on the other hand, the targetIP can be shared across tunnels, because each tunnel consumes a clusterIP as a dedicated resource rather than the targetIP.)

UX

There are two types of actors in this use case: cluster managers and users.

  • Cluster managers provide users with a set of IPs that can be consumed as a targetIP,
  • Users consume a targetIP to allow sets of pods to access external servers at that targetIP.

Design Details

User facing API

A new API, ExternalService, is introduced.

ExternalService:

type ExternalService struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   ExternalServiceSpec   `json:"spec,omitempty"`
    Status ExternalServiceStatus `json:"status,omitempty"`
}

ExternalServiceSpec and ExternalServiceStatus are defined as below.

ExternalServiceSpec:

type ExternalServiceSpec struct {
    TargetIP string               `json:"targetIP"`
    Sources  []Source             `json:"sources"`
    Ports    []corev1.ServicePort `json:"ports"`
}

type Source struct {
    Service  ServiceRef `json:"service"`
    SourceIP string     `json:"sourceIP"`
}

type ServiceRef struct {
    Namespace string `json:"namespace,omitempty"`
    Name      string `json:"name,omitempty"`
}

ExternalServiceStatus:

type ExternalServiceStatus struct {
}

TODO: Consider adding fields that will be informative for users to know the status.

Note that there are a few things that need to be considered, possibly later:

  • The Source struct needs an identifier for the cluster, to decide in which cluster the pod exists, if this needs to work across clusters,
  • The ExternalServiceSpec struct needs an identifier for the cluster, to decide which cluster can access the TargetIP, if this needs to work across clusters,
  • SourceIP in ExternalServiceSpec shouldn't be exposed to users directly, to prevent users from specifying malicious IP addresses. Instead, a concept like PersistentVolume, PersistentVolumeClaim, and StorageClass can be applied. For example, by defining SourceIP, SourceIPClaim, and SourceIPClass, cluster managers would specify a range of IPs in a SourceIPClass. Then, users can consume a SourceIP by specifying a SourceIPClaim, which has a SourceIP bound to it. See Design Consideration of SourceIPClass.

An example ExternalService is shown below:

apiVersion: submariner.io/v1alpha1
kind: ExternalService
metadata:
  name: my-externalservice
spec:
  targetIP: 192.168.122.139
  sources:
    - service:
        namespace: ns1
        name: my-service1
      sourceIP: 192.168.122.200
    - service:
        namespace: ns2
        name: my-service2
      sourceIP: 192.168.122.201
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000

The above example defines that:

  • Access to the targetPort of the service named metadata.name will be forwarded to the port of the targetIP, if the sources are pods associated with the service,
  • The source IP of packets from a pod associated with the service will be the sourceIP defined for that service,
  • Access from the targetIP to the service's port on the sourceIP will be forwarded to the service.

In the above case:

  • Access to my-externalservice.external-services:8000 will be forwarded to 192.168.122.139:8000 if the sources are pods associated with my-service1 or my-service2,
  • The source IP of packets from the pods associated with my-service1 will be 192.168.122.200, and that of the pods associated with my-service2 will be 192.168.122.201,
  • For reverse access, access from 192.168.122.139 to 192.168.122.200:80 will be forwarded to my-service1:80, and access to 192.168.122.201:80 will be forwarded to my-service2:80 (if both my-service1 and my-service2 define port 80).

Note that the ExternalService resource is a namespaced resource and users will create this resource in their own namespace.
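
As a usage sketch, an application pod reaches the external server through the forwarder service's DNS name (my-externalservice.external-services in the above example) rather than through 192.168.122.139 directly, and the pod must be one of the pods backing my-service1 or my-service2 for the forwarding and source-IP mapping to apply. The client image, labels, and environment variable names below are hypothetical and only illustrate the connection endpoint.

apiVersion: v1
kind: Pod
metadata:
  name: db-client
  namespace: ns1
  labels:
    app: my-service1                  # assumes my-service1 selects this label
spec:
  containers:
    - name: client
      image: example.com/db-client    # hypothetical client image
      env:
        - name: DB_HOST               # hypothetical variable; the relevant part is the DNS name
          value: my-externalservice.external-services
        - name: DB_PORT
          value: "8000"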

Design Consideration of SourceIPClass

To prevent users from directly specifying a sourceIP in an ExternalService, SourceIP in the Source struct needs to be replaced with SourceIPClaimName, which references the name of a SourceIPClaim in the same namespace.

Source struct:

type Source struct {
    Service           ServiceRef `json:"service"`
    SourceIPClaimName string     `json:"sourceipclaimname"`
}

Then, SourceIPClass, SourceIPClaim, and SourceIP will be defined as below:

SourceIPClass:

type SourceIPClass struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   SourceIPClassSpec   `json:"spec,omitempty"`
}

type SourceIPClassSpec struct {
    Ranges  []Range             `json:"ranges"`
}

type Range struct {
    Start  string `json:"start"`
    End    string `json:"end"`
}

SourceIPClaim:

type SourceIPClaim struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   SourceIPClaimSpec   `json:"spec,omitempty"`
    Status SourceIPClaimStatus `json:"status,omitempty"`
}

type SourceIPClaimSpec struct {
    SourceIPClassName string       `json:"sourceipclass"`
    SourceIP          string       `json:"sourceip,omitempty"`
}

type SourceIPClaimStatus struct {
    Conditions status.Conditions `json:"conditions"`
    Phase      string            `json:"phase"`
}

SourceIP:

type SourceIP struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   SourceIPSpec   `json:"spec,omitempty"`
    Status SourceIPStatus `json:"status,omitempty"`
}

type SourceIPSpec struct {
    SourceIP string               `json:"sourceip"`
    ClaimRef ObjectReference      `json:"claimref"`
}

type SourceIPStatus struct {
    Conditions status.Conditions `json:"conditions"`
    Phase      string            `json:"phase"`
}

An example use case is shown below.

An admin will create a SourceIPClass like below:

apiVersion: submariner.io/v1alpha1
kind: SourceIPClass
metadata:
  name: my-source-ip-class
spec:
  ranges:
    - start: 192.168.122.1
      end: 192.168.122.100
    - start: 192.168.122.200
      end: 192.168.122.210

A user will create a SourceIPClaim like below:

apiVersion: submariner.io/v1alpha1
kind: SourceIPClaim
metadata:
  name: my-source-ip-claim
  namespace: ns1
spec:
  sourceIPClassName: my-source-ip-class

Then, a kind of provisioner will create a SourceIP like below:

apiVersion: submariner.io/v1alpha1
kind: SourceIP
metadata:
  name: my-source-ip-claim-XXXXX
  namespace: ns1
spec:
  sourceIP: 192.168.122.1
  claimRef:
    kind: SourceIPClaim
    name: my-source-ip-claim
    namespace: ns1
status:
  phase: bound

After that, the user can consume the SourceIP via the SourceIPClaim in an ExternalService like below:

apiVersion: submariner.io/v1alpha1
kind: ExternalService
metadata:
  name: my-externalservice
spec:
  targetIP: 192.168.122.139
  sources:
    - service:
        namespace: ns1
        name: my-service1
      sourceIPClaimName: my-source-ip-claim
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000

Note that there is still room to discuss a model regarding whether a sourceIP can be shared

  • within a namespace,
  • across namespaces.

The simplest model would be to deny both. This can be achieved by

  • only allowing one SourceIPClaim to be consumed by one ExternalService,
  • only binding one SourceIP to one SourceIPClaim.

However, to maximize the utilization of IP addresses, these constraints might need to be relaxed.

Implementation

There are three main components:

  • operator: It creates and deletes forwarder pods and keeps the configurations for the forwarder and gateway up to date. Configurations are passed using Forwarder CRDs and Gateway CRDs. These CRDs aren't user-facing APIs and are expected to be used only by the forwarder and gateway,
  • forwarder: It runs in a forwarder pod created by the operator, and is accessible via a service from the pods associated with the services defined as sources in the ExternalService. It is created per external server. For egress it receives packets from pods and forwards them to the gateway which has the sourceIP, and for ingress it receives packets from the gateway and forwards them to the pods,
  • gateway: It runs on the gateway node and has the sourceIP assigned. For egress it receives packets from the forwarder pod and forwards them to the targetIP, and for ingress it receives packets from the targetIP and forwards them to the forwarder pod.

The basic idea for egress packets is that pods access a forwarder pod via a service, which then forwards the packets to the specified external server via the gateway which has the sourceIP. As a result, the external server which has the targetIP sees the packets coming from the sourceIP. Ingress packets follow the reverse path: external servers access the gateway, which forwards the packets to a pod via the forwarder pod. As a result, the pod sees the packets coming from the targetIP. See here for the egress flow and here for the ingress flow.

The implementation of each component is discussed below:

Operator

The operator is the component in charge of:

  • keeping a forwarder pod per ExternalService,
  • keeping the mapping of podIP and TargetIP up to date.

To keep a forwarder pod per ExternalService, a forwarder pod should be created on the ExternalService's creation and deleted on the ExternalService's deletion. To keep the mapping of podIP and TargetIP up to date, the podIPs for all of the spec.sources.service entries in the ExternalService need to be checked regularly, and the corresponding mapping should be updated when the podIPs for a service change. This can be achieved by watching k8s Endpoints, and the k8s operator pattern can be applied to implement it. The mappings created by the operator need to be handled by the forwarder and gateway, and they can be shared by using the non-user-facing APIs below, namely the ForwarderSpec and GatewaySpec. ForwarderSpec should be implementation-agnostic so that the forwarder and gateway can choose any implementation. (However, RelayPort is specific to the implementation using ssh and iptables discussed in the forwarder and gateway sections below; this needs to be improved.)

ForwarderSpec:

type ForwarderSpec struct {
    EgressRules  []ForwarderRule `json:"egressrules"`
    IngressRules []ForwarderRule `json:"ingressrules"`
    ForwarderIP  string          `json:"forwarderip,omitempty"`
}

type ForwarderRule struct {
    Protocol        string     `json:"protocol,omitempty"`
    SourceIP        string     `json:"sourceip,omitempty"`
    TargetPort      string     `json:"targetport,omitempty"`
    DestinationIP   string     `json:"destinationip,omitempty"`
    DestinationPort string     `json:"destinationport,omitempty"`
    Gateway         GatewayRef `json:"gateway"`
    GatewayIP       string     `json:"gatewayip,omitempty"`
    RelayPort       string     `json:"relayPort,omitempty"`
}

type GatewayRef struct {
    Namespace string `json:"namespace,omitempty"`
    Name      string `json:"name,omitempty"`
}

type ForwarderStatus struct {
    Conditions     status.Conditions `json:"conditions"`
    RuleGeneration int               `json:"rulegeneration,omitempty"`
    SyncGeneration int               `json:"syncgeneration,omitempty"`
}
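
For illustration only, a Forwarder resource generated by the operator for the earlier ExternalService example might look roughly like the following. The kind name, API group/version, namespace, and all concrete values (pod IP, gateway IP, relay port) are assumptions made for this sketch, not definitions from this proposal.

apiVersion: submariner.io/v1alpha1
kind: Forwarder
metadata:
  name: my-externalservice
  namespace: external-services        # assumed operator-managed namespace (see the note below)
spec:
  forwarderip: 10.96.0.20             # hypothetical IP of the forwarder pod
  egressrules:
    - protocol: TCP
      sourceip: 10.244.1.15           # hypothetical podIP backing my-service1, kept up to date by the operator
      targetport: "8000"
      destinationip: 192.168.122.139  # targetIP of the ExternalService
      destinationport: "8000"
      gateway:
        namespace: external-services
        name: my-gateway              # hypothetical Gateway resource name
      gatewayip: 192.168.122.50       # hypothetical IP of the gateway node
      relayPort: "2049"               # hypothetical relay port for the ssh/iptables implementation
  ingressrules: []                    # ingress rules would describe the reverse path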

GatewaySpec:

type GatewaySpec struct {
    EgressRules  []GatewayRule `json:"egressrules"`
    IngressRules []GatewayRule `json:"ingressrules"`
    GatewayIP    string        `json:"gatewayip,omitempty"`
}

type GatewayRule struct {
    Protocol        string       `json:"protocol,omitempty"`
    SourceIP        string       `json:"sourceip,omitempty"`
    TargetPort      string       `json:"targetport,omitempty"`
    DestinationPort string       `json:"destinationport,omitempty"`
    DestinationIP   string       `json:"destinationip,omitempty"`
    Forwarder       ForwarderRef `json:"forwarder"`
    ForwarderIP     string       `json:"forwarderip,omitempty"`
    RelayPort       string       `json:"relayport,omitempty"`
}

type ForwarderRef struct {
    Namespace string `json:"namespace,omitempty"`
    Name      string `json:"name,omitempty"`
}

type GatewayStatus struct {
    Conditions     status.Conditions `json:"conditions"`
    RuleGeneration int               `json:"rulegeneration,omitempty"`
    SyncGeneration int               `json:"syncgeneration,omitempty"`
}
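
Similarly, a corresponding Gateway resource for the same example might look roughly like the following; again, the kind name, API group/version, namespace, and concrete values are assumptions made only for this sketch.

apiVersion: submariner.io/v1alpha1
kind: Gateway
metadata:
  name: my-gateway                    # hypothetical name
  namespace: external-services        # assumed operator-managed namespace (see the note below)
spec:
  gatewayip: 192.168.122.50           # hypothetical IP of the gateway node
  egressrules:
    - protocol: TCP
      sourceip: 192.168.122.200       # sourceIP assigned to my-service1 in the ExternalService
      targetport: "8000"
      destinationip: 192.168.122.139  # targetIP of the ExternalService
      destinationport: "8000"
      forwarder:
        namespace: external-services
        name: my-externalservice
      forwarderip: 10.96.0.20         # hypothetical IP of the forwarder pod
      relayport: "2049"               # hypothetical relay port for the ssh/iptables implementation
  ingressrules: []                    # ingress rules would describe the reverse path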

Note that the ForwarderSpec and GatewaySpec should be created from the information that users define in their namespaces, but they should be created in a different namespace to prevent them from being modified by malicious users. Both of them can be created in the operator's namespace; however, they might need separate namespaces if there are security concerns.

Forwarder

Forwarder is a component that is in charge of forwarding:

  • egress packets from pods to gateways,
  • ingress packets from gateways to pods.

It regularly reads the Forwarder CRD for the forwarder and updates the forwarding rules. One example implementation for achieving this forwarding uses ssh port-forwarding and iptables rules. See here for the egress flow and here for the ingress flow.

Gateway

Gateway is a component that is in charge of forwarding:

  • egress packets from forwarder pods to external servers,
  • ingress packets from external servers to forwarder pods.

It regularly reads the Gateway CRD for the gateway and updates the forwarding rules. The example implementation is the same as for the forwarder.

Test Plan

  • TBD

Graduation Criteria

  • TBD

Upgrade / Downgrade Strategy

  • TBD

Version Skew Strategy

  • TBD

Production Readiness Review Questionnaire

  • TBD

Feature enablement and rollback

This section must be completed when targeting alpha to a release.

  • How can this feature be enabled / disabled in a live cluster?

    • Feature gate (also fill in values in kep.yaml)
      • Feature gate name:
      • Components depending on the feature gate:
    • Other
      • Describe the mechanism:
      • Will enabling / disabling the feature require downtime of the control plane?
      • Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume Dynamic Kubelet Config feature is enabled).
  • Does enabling the feature change any default behavior? Any change of default behavior may be surprising to users or break existing automations, so be extremely careful here.

  • Can the feature be disabled once it has been enabled (i.e. can we rollback the enablement)? Also set disable-supported to true or false in kep.yaml. Describe the consequences on existing workloads (e.g. if this is runtime feature, can it break the existing applications?).

  • What happens if we reenable the feature if it was previously rolled back?

  • Are there any tests for feature enablement/disablement? The e2e framework does not currently support enabling and disabling feature gates. However, unit tests in each component dealing with managing data created with and without the feature are necessary. At the very least, think about conversion tests if API types are being modified.

Rollout, Upgrade and Rollback Planning

This section must be completed when targeting beta graduation to a release.

  • How can a rollout fail? Can it impact already running workloads? Try to be as paranoid as possible - e.g. what if some components will restart in the middle of rollout?

  • What specific metrics should inform a rollback?

  • Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? Describe the manual testing that was done and the outcomes. Longer term, we may want to require automated upgrade/rollback tests, but we are missing a bunch of machinery and tooling and can't do that now.

  • Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? Even if applying deprecation policies, they may still surprise some users.

Monitoring requirements

This section must be completed when targeting beta graduation to a release.

  • How can an operator determine if the feature is in use by workloads? Ideally, this should be a metric. Operations against the Kubernetes API (e.g. checking if there are objects with field X set) may be a last resort. Avoid logs or events for this purpose.

  • What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

    • Metrics
      • Metric name:
      • [Optional] Aggregation method:
      • Components exposing the metric:
    • Other (treat as last resort)
      • Details:
  • What are the reasonable SLOs (Service Level Objectives) for the above SLIs? At a high level this usually will be in the form of "high percentile of SLI per day <= X". It's impossible to provide comprehensive guidance, but at a very high level (they need more precise definitions) these may be things like:

    • per-day percentage of API calls finishing with 5XX errors <= 1%
    • 99% percentile over day of absolute value from (job creation time minus expected job creation time) for cron job <= 10%
    • 99.9% of /health requests per day finish with 200 code
  • Are there any missing metrics that would be useful to have to improve observability of this feature? Describe the metrics themselves and the reasons they weren't added (e.g. cost, implementation difficulties, etc.).

Dependencies

This section must be completed when targeting beta graduation to a release.

  • Does this feature depend on any specific services running in the cluster? Think about both cluster-level services (e.g. metrics-server) as well as node-level agents (e.g. specific version of CRI). Focus on external or optional services that are needed. For example, if this feature depends on a cloud provider API, or upon an external software-defined storage or network control plane.

    For each of these, fill in the following, thinking both about running user workloads and creating new ones, as well as about cluster-level services (e.g. DNS):

    • [Dependency name]
      • Usage description:
        • Impact of its outage on the feature:
        • Impact of its degraded performance or high error rates on the feature:

Scalability

For alpha, this section is encouraged: reviewers should consider these questions and attempt to answer them.

For beta, this section is required: reviewers must answer these questions.

For GA, this section is required: approvers should be able to confirm the previous answers based on experience in the field.

  • Will enabling / using this feature result in any new API calls? Describe them, providing:

    • API call type (e.g. PATCH pods)
    • estimated throughput
    • originating component(s) (e.g. Kubelet, Feature-X-controller) focusing mostly on:
    • components listing and/or watching resources they didn't before
    • API calls that may be triggered by changes of some Kubernetes resources (e.g. update of object X triggers new updates of object Y)
    • periodic API calls to reconcile state (e.g. periodic fetching state, heartbeats, leader election, etc.)
  • Will enabling / using this feature result in introducing new API types? Describe them providing:

    • API type
    • Supported number of objects per cluster
    • Supported number of objects per namespace (for namespace-scoped objects)
  • Will enabling / using this feature result in any new calls to cloud provider?

  • Will enabling / using this feature result in increasing size or count of the existing API objects? Describe them providing:

    • API type(s):
    • Estimated increase in size: (e.g. new annotation of size 32B)
    • Estimated amount of new objects: (e.g. new Object X for every existing Pod)
  • Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? Think about adding additional work or introducing new steps in between (e.g. need to do X to start a container), etc. Please describe the details.

  • Will enabling / using this feature result in a non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? Things to keep in mind include: additional in-memory state, additional non-trivial computations, excessive access to disks (including increased log volume), significant amounts of data sent and/or received over the network, etc. Think through this both in small and large cases, again with respect to the supported limits.

Troubleshooting

The Troubleshooting section currently serves the Playbook role. We may consider splitting it into a dedicated Playbook document (potentially with some monitoring details). For now we leave it here.

This section must be completed when targeting beta graduation to a release.

  • How does this feature react if the API server and/or etcd is unavailable?

  • What are other known failure modes? For each of them fill in the following information by copying the below template:

    • [Failure mode brief description]
      • Detection: How can it be detected via metrics? Stated another way: how can an operator troubleshoot without logging into a master or worker node?
      • Mitigations: What can be done to stop the bleeding, especially for already running user workloads?
      • Diagnostics: What are the useful log messages and their required logging levels that could help debugging the issue? Not required until feature graduated to Beta.
      • Testing: Are there any tests for failure mode? If not describe why.
  • What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

Alternatives

Infrastructure Needed (optional)