- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in [submariner-io/enhancements] (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- (R) Graduation criteria is in place
- (R) Production readiness review completed
- Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in submariner-io/website, for publication to submariner-io
- Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
The egress source IP of packets sent from a pod to destinations outside the Kubernetes cluster is not a fixed value: the pod IP can change across pod restarts, and when packets leave the cluster, some CNI plugins translate (SNAT) them to appear as the node IP, which can also change across pod restarts. However, many devices and software systems use IP-based ACLs to restrict incoming traffic, for security reasons and for bandwidth limitation. As a result, such ACLs outside the k8s cluster will block packets from the pod, which causes connectivity issues. To resolve this, we need a feature to assign a particular static egress source IP to one or more particular pods.
Related discussions can be found here and here. In addition, a PoC implementation can be found here.
- Provide users with an official and common way to assign a static egress source IP to packets sent from one or more pods to destinations outside the k8s cluster.
- Scope1: Access from pods in one k8s cluster to servers outside the cluster
- Scope2: Access from pods in one of multiple k8s clusters to servers outside the clusters
- TBD
There is an existing database server in an on-premises data center which restricts access by source IP. A new application deployed on k8s in the same data center needs to access the database server (Scope1).
There is an existing database server in an on-premises data center which restricts access by source IP. A new application deployed on k8s in a different cloud needs to access the database server (Scope2).
- As this proposal provides users with a way to change source IP addresses, and source IPs can be used to restrict access, it must carefully prevent malicious users from setting arbitrary source IP addresses:
  - the user-facing API should only allow the right sources to be assigned the right source IPs,
  - tunneling components should only allow access from the right sources,
- This proposal provides a kind of tunneling between pods and external servers, so there will be a performance overhead,
- The number of tunnels needed will be the number of combinations of pods and external servers, so performance scalability needs to be considered,
- Scalability of the number of source IPs consumed also needs to be considered, especially for ingress access. To allow ingress access, a combination of source IP and port is dedicated to that access, so the source IP can't be reused to listen on the same port for any other purpose. As a result, source IPs will be exhausted easily if there is a requirement to use a specific port to access multiple pods. (For egress access, on the other hand, the targetIP can be shared across tunnels, because each tunnel consumes a clusterIP as a dedicated resource rather than the targetIP.)
There will be two types of actors in this use case: cluster managers and users.

- Cluster managers provide users with a set of IPs that can be consumed as a targetIP,
- Users consume a targetIP to let sets of pods access the external servers at that targetIP.
A new API, `ExternalService`, is introduced.
ExternalService:
```go
type ExternalService struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   ExternalServiceSpec   `json:"spec,omitempty"`
    Status ExternalServiceStatus `json:"status,omitempty"`
}
```
`ExternalServiceSpec` and `ExternalServiceStatus` are defined as below.
ExternalServiceSpec:
```go
type ExternalServiceSpec struct {
    TargetIP string               `json:"targetIP"`
    Sources  []Source             `json:"sources"`
    Ports    []corev1.ServicePort `json:"ports"`
}

type Source struct {
    Service  ServiceRef `json:"service"`
    SourceIP string     `json:"sourceIP"`
}

type ServiceRef struct {
    Namespace string `json:"namespace,omitempty"`
    Name      string `json:"name,omitempty"`
}
```
ExternalServiceStatus:
```go
type ExternalServiceStatus struct {
}
```
TODO: Consider adding fields that are informative for users to know the status.
Note that there are two things that need to be considered, maybe later:

- The `Source` struct needs to have an identifier for the cluster, to decide in which cluster the pod exists, and the `ExternalServiceSpec` struct needs to have an identifier for the cluster, to decide which cluster can access the `TargetIP`, if this needs to work across clusters,
- `SourceIP` in `ExternalServiceSpec` shouldn't be exposed to users directly, to avoid malicious IP addresses being specified by users. Instead, a concept like `PersistentVolume`, `PersistentVolumeClaim`, and `StorageClass` can be applied. For example, by defining `SourceIP`, `SourceIPClaim`, and `SourceIPClass`, cluster managers will specify ranges of IPs in a `SourceIPClass`. Then, users can consume a `SourceIP` by specifying a `SourceIPClaim`, which has a `SourceIP` bound. See Design Consideration of SourceIPClass.
An example of `ExternalService` is shown below:
```yaml
apiVersion: submariner.io/v1alpha1
kind: ExternalService
metadata:
  name: my-externalservice
spec:
  targetIP: 192.168.122.139
  sources:
    - service:
        namespace: ns1
        name: my-service1
      sourceIP: 192.168.122.200
    - service:
        namespace: ns2
        name: my-service2
      sourceIP: 192.168.122.201
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
```
The above example defines that:

- Access to the `targetPort` of the service named `metadata.name` will be forwarded to the `port` of `targetIP`, if the sources are the pods associated with the `service`,
- The source IP of the packets from the pods associated with the `service` will be the `sourceIP` defined for that `service`,
- Access from `targetIP` to the `service`'s port on `sourceIP` will be forwarded to the `service`.
In the above case:

- Access to `my-externalservice.external-services:8000` will be forwarded to `192.168.122.139:8000`, if the sources are the pods associated with `my-service1` or `my-service2`,
- The source IP of the packets from the pods associated with `my-service1` will be `192.168.122.200`, and that of the pods associated with `my-service2` will be `192.168.122.201`,
- For reverse access, access from `192.168.122.139` to `192.168.122.200:80` will be forwarded to `my-service1:80`, and access to `192.168.122.201:80` will be forwarded to `my-service2:80` (if both `my-service1` and `my-service2` define port 80).
Note that `ExternalService` is a namespaced resource and users will create it in their own namespace.
To avoid `sourceIP` being specified directly in `ExternalService` by users, `SourceIP` in the `Source` struct needs to be changed to `SourceIPClaimName`, which references the name of a `SourceIPClaim` in the same namespace.

`Source` struct:
```go
type Source struct {
    Service           ServiceRef `json:"service"`
    SourceIPClaimName string     `json:"sourceIPClaimName"`
}
```
Then, `SourceIPClass`, `SourceIPClaim`, and `SourceIP` will be defined as below:
SourceIPClass:
```go
type SourceIPClass struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec SourceIPClassSpec `json:"spec,omitempty"`
}

type SourceIPClassSpec struct {
    Ranges []Range `json:"ranges"`
}

type Range struct {
    Start string `json:"start"`
    End   string `json:"end"`
}
```
SourceIPClaim:
```go
type SourceIPClaim struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   SourceIPClaimSpec   `json:"spec,omitempty"`
    Status SourceIPClaimStatus `json:"status,omitempty"`
}

type SourceIPClaimSpec struct {
    SourceIPClassName string `json:"sourceIPClassName"`
    SourceIP          string `json:"sourceIP,omitempty"`
}

type SourceIPClaimStatus struct {
    Conditions status.Conditions `json:"conditions"`
    Phase      string            `json:"phase"`
}
```
SourceIP:
```go
type SourceIP struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   SourceIPSpec   `json:"spec,omitempty"`
    Status SourceIPStatus `json:"status,omitempty"`
}

type SourceIPSpec struct {
    SourceIP string                 `json:"sourceIP"`
    ClaimRef corev1.ObjectReference `json:"claimRef"`
}

type SourceIPStatus struct {
    Conditions status.Conditions `json:"conditions"`
    Phase      string            `json:"phase"`
}
```
An example use case is shown below.
An admin will create a `SourceIPClass` like below:
```yaml
apiVersion: submariner.io/v1alpha1
kind: SourceIPClass
metadata:
  name: my-source-ip-class
spec:
  ranges:
    - start: 192.168.122.1
      end: 192.168.122.100
    - start: 192.168.122.200
      end: 192.168.122.210
```
A user will create a `SourceIPClaim` like below:
```yaml
apiVersion: submariner.io/v1alpha1
kind: SourceIPClaim
metadata:
  name: my-source-ip-claim
  namespace: ns1
spec:
  sourceIPClassName: my-source-ip-class
```
Then, a provisioner-like component will create a `SourceIP` like below:
```yaml
apiVersion: submariner.io/v1alpha1
kind: SourceIP
metadata:
  name: my-source-ip-claim-XXXXX
  namespace: ns1
spec:
  sourceIP: 192.168.122.1
  claimRef:
    kind: SourceIPClaim
    name: my-source-ip-claim
    namespace: ns1
status:
  phase: bound
```
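The binding step performed by such a provisioner could look like the minimal sketch below: given a `SourceIPClaim` and an IP allocated from the referenced class, it builds the `SourceIP` object that records the binding. The struct definitions are trimmed stubs of the API above, the name suffix and `bound` phase follow the example, and a real controller would create the object through the Kubernetes API rather than just returning it:

```go
// Minimal sketch of binding a SourceIPClaim to an allocated IP by building
// the SourceIP object. Types are simplified stubs of the API above.
package main

import "fmt"

type ObjectReference struct{ Kind, Name, Namespace string }

type SourceIPClaim struct{ Name, Namespace, SourceIPClassName string }

type SourceIP struct {
    Name, Namespace string
    SourceIP        string
    ClaimRef        ObjectReference
    Phase           string
}

func bind(claim SourceIPClaim, allocatedIP, suffix string) SourceIP {
    return SourceIP{
        Name:      claim.Name + "-" + suffix, // e.g. my-source-ip-claim-XXXXX
        Namespace: claim.Namespace,
        SourceIP:  allocatedIP,
        ClaimRef:  ObjectReference{Kind: "SourceIPClaim", Name: claim.Name, Namespace: claim.Namespace},
        Phase:     "bound",
    }
}

func main() {
    claim := SourceIPClaim{Name: "my-source-ip-claim", Namespace: "ns1", SourceIPClassName: "my-source-ip-class"}
    fmt.Printf("%+v\n", bind(claim, "192.168.122.1", "a1b2c"))
}
```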
After that, the user can consume the `SourceIP` via the `SourceIPClaim` in an `ExternalService` like below:
```yaml
apiVersion: submariner.io/v1alpha1
kind: ExternalService
metadata:
  name: my-externalservice
spec:
  targetIP: 192.168.122.139
  sources:
    - service:
        namespace: ns1
        name: my-service1
      sourceIPClaimName: my-source-ip-claim
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
```
Note that there is still room to discuss whether a `sourceIP` can be shared:

- within a namespace,
- across namespaces.

The simplest model would be to deny both. It can be achieved by:

- only allowing one `SourceIPClaim` to be consumed in one `ExternalService`,
- only binding one `SourceIP` to one `SourceIPClaim`.

However, to maximize the utilization of IP addresses, these constraints might need to be relaxed.
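Under the simplest model, the first constraint could be enforced by a validating webhook or by the operator's reconcile loop; a minimal sketch of such a check is shown below. The function name and its inputs are illustrative only and are not part of the proposed API:

```go
// Minimal sketch of enforcing "one SourceIPClaim per ExternalService":
// reject an ExternalService whose sources reference a claim that is already
// consumed by another ExternalService in the same namespace.
package main

import "fmt"

// claimsInUse maps claim name -> name of the ExternalService that already consumes it.
func validateClaims(externalService string, claimNames []string, claimsInUse map[string]string) error {
    for _, claim := range claimNames {
        if owner, ok := claimsInUse[claim]; ok && owner != externalService {
            return fmt.Errorf("SourceIPClaim %q is already consumed by ExternalService %q", claim, owner)
        }
    }
    return nil
}

func main() {
    inUse := map[string]string{"my-source-ip-claim": "other-externalservice"}
    err := validateClaims("my-externalservice", []string{"my-source-ip-claim"}, inUse)
    fmt.Println(err) // rejected: claim already consumed elsewhere
}
```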
There are mainly three components:

- operator: It creates and deletes the forwarder pod and keeps the configurations for the forwarder and gateway up-to-date. Configurations are passed by using Forwarder CRDs and Gateway CRDs. These CRDs aren't user-facing APIs and are expected to be used only by the forwarder and gateway,
- forwarder: It runs on a forwarder pod created by the operator and is accessible via a service from the pods which are associated with the services defined as `sources` in `ExternalService`. It is created per external server. It receives packets from pods and forwards them to the gateway which has the `sourceIP` for egress, and receives packets from the gateway and forwards them to pods for ingress,
- gateway: It runs on the gateway node and has the `sourceIP` assigned. It receives packets from the forwarder pod and forwards them to the `targetIP` for egress, and receives packets from the `targetIP` and forwards them to the forwarder pod for ingress.
The base idea for egress packets is that pods access a forwarder pod via a service, which then forwards the packets to the specified external server via the gateway that has the `sourceIP`. As a result, the external server which has the `targetIP` will see the packets coming from the `sourceIP`.

For ingress packets, this is done in reverse. External servers access the gateway, which then forwards the packets to a pod via the forwarder pod. As a result, the pod will see the packets coming from the `targetIP`.

See here for the egress flow and here for the ingress flow.
Implementations for each component are discussed below:
The operator is a component that is in charge of:

- keeping one forwarder pod per `ExternalService`,
- keeping the mapping of `podIP` and `TargetIP` up-to-date.
For keeping one forwarder pod per `ExternalService`, a forwarder pod should be created on `ExternalService` creation and deleted on `ExternalService` deletion.

For keeping the mapping of `podIP` and `TargetIP` up-to-date, the `podIP`s for all the `spec.sources.service` entries in the `ExternalService` need to be checked regularly, and the corresponding mapping should be updated whenever the `podIP`s for a service change. This can be achieved by watching the k8s `Endpoints`, and the k8s operator pattern can be applied to implement it. The mappings created by the operator need to be handled by the forwarder and gateway, and they can be shared with them by using the non-user-facing APIs, `ForwarderSpec` and `GatewaySpec`, defined below (a sketch of this rule generation is included at the end of this section). `ForwarderSpec` should be implementation-agnostic so that the forwarder and gateway can choose any implementation. (However, `RelayPort` is specific to the implementation using ssh and iptables discussed in the forwarder and gateway sections below. It needs to be improved.)
ForwarderSpec:
```go
type ForwarderSpec struct {
    EgressRules  []ForwarderRule `json:"egressrules"`
    IngressRules []ForwarderRule `json:"ingressrules"`
    ForwarderIP  string          `json:"forwarderip,omitempty"`
}

type ForwarderRule struct {
    Protocol        string     `json:"protocol,omitempty"`
    SourceIP        string     `json:"sourceip,omitempty"`
    TargetPort      string     `json:"targetport,omitempty"`
    DestinationIP   string     `json:"destinationip,omitempty"`
    DestinationPort string     `json:"destinationport,omitempty"`
    Gateway         GatewayRef `json:"gateway"`
    GatewayIP       string     `json:"gatewayip,omitempty"`
    RelayPort       string     `json:"relayPort,omitempty"`
}

type GatewayRef struct {
    Namespace string `json:"namespace,omitempty"`
    Name      string `json:"name,omitempty"`
}

type ForwarderStatus struct {
    Conditions     status.Conditions `json:"conditions"`
    RuleGeneration int               `json:"rulegeneration,omitempty"`
    SyncGeneration int               `json:"syncgeneration,omitempty"`
}
```
GatewaySpec:
```go
type GatewaySpec struct {
    EgressRules  []GatewayRule `json:"egressrules"`
    IngressRules []GatewayRule `json:"ingressrules"`
    GatewayIP    string        `json:"gatewayip,omitempty"`
}

type GatewayRule struct {
    Protocol        string       `json:"protocol,omitempty"`
    SourceIP        string       `json:"sourceip,omitempty"`
    TargetPort      string       `json:"targetport,omitempty"`
    DestinationPort string       `json:"destinationport,omitempty"`
    DestinationIP   string       `json:"destinationip,omitempty"`
    Forwarder       ForwarderRef `json:"forwarder"`
    ForwarderIP     string       `json:"forwarderip,omitempty"`
    RelayPort       string       `json:"relayport,omitempty"`
}

type ForwarderRef struct {
    Namespace string `json:"namespace,omitempty"`
    Name      string `json:"name,omitempty"`
}

type GatewayStatus struct {
    Conditions     status.Conditions `json:"conditions"`
    RuleGeneration int               `json:"rulegeneration,omitempty"`
    SyncGeneration int               `json:"syncgeneration,omitempty"`
}
```
Note that `ForwarderSpec` and `GatewaySpec` should be created from the information that users define in their namespaces, but they should be created in a different namespace to prevent them from being modified by malicious users. Both of them could be created in the operator's namespace; however, they might need separate namespaces if there are security concerns.
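As mentioned above, the operator turns the watched `Endpoints` into forwarding rules. The following is a minimal sketch of that mapping for the egress direction: one `ForwarderRule` per pod IP behind a source service, pointing at the target. The structs are trimmed copies of the non-user-facing API above, and the field usage (`SourceIP` = pod IP, `DestinationIP`/`DestinationPort` = targetIP/port) as well as the gateway namespace used in `main` are assumptions for illustration:

```go
// Minimal sketch of the operator's rule generation for egress: one rule per
// pod IP observed in the Endpoints of a source service.
package main

import "fmt"

type GatewayRef struct{ Namespace, Name string }

type ForwarderRule struct {
    Protocol        string
    SourceIP        string // pod IP taken from the Endpoints of the source service
    DestinationIP   string // targetIP of the ExternalService
    DestinationPort string
    Gateway         GatewayRef
}

func buildEgressRules(podIPs []string, targetIP, port, protocol string, gw GatewayRef) []ForwarderRule {
    rules := make([]ForwarderRule, 0, len(podIPs))
    for _, ip := range podIPs {
        rules = append(rules, ForwarderRule{
            Protocol:        protocol,
            SourceIP:        ip,
            DestinationIP:   targetIP,
            DestinationPort: port,
            Gateway:         gw,
        })
    }
    return rules
}

func main() {
    // Namespace and name below are hypothetical values for illustration.
    gw := GatewayRef{Namespace: "external-services-operator", Name: "my-externalservice"}
    rules := buildEgressRules([]string{"10.0.1.5", "10.0.2.7"}, "192.168.122.139", "8000", "TCP", gw)
    fmt.Printf("%+v\n", rules)
}
```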
The forwarder is a component that is in charge of forwarding:

- egress packets from pods to gateways,
- ingress packets from gateways to pods.

It regularly reads the `Forwarder` CRD for the forwarder and updates the forwarding rules.
One example implementation for achieving this forwarding is to use ssh port-forwarding and iptables rules (a sketch is shown below). See here for the egress flow and here for the ingress flow.
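The following is a minimal sketch of how the forwarder pod might set up one egress rule with ssh and iptables. The exact rule layout, port numbers, ssh user, and gateway address are assumptions for illustration and not the PoC's code: an ssh local port-forward relays traffic to `targetIP:targetPort` via the gateway, and an iptables REDIRECT sends traffic arriving on the service port to that local relay port.

```go
// Illustrative forwarder-side setup for one egress rule using ssh + iptables.
package main

import (
    "fmt"
    "os/exec"
)

func setupEgress(gatewayIP, targetIP, targetPort, servicePort, relayPort string) error {
    // ssh -N -L <relayPort>:<targetIP>:<targetPort> tunnel@<gatewayIP>
    // A real implementation would supervise this process and restart it on failure.
    tunnel := exec.Command("ssh", "-N",
        "-L", fmt.Sprintf("%s:%s:%s", relayPort, targetIP, targetPort),
        fmt.Sprintf("tunnel@%s", gatewayIP))
    if err := tunnel.Start(); err != nil {
        return fmt.Errorf("starting ssh tunnel: %w", err)
    }

    // Redirect traffic hitting the forwarder's service port to the local relay port.
    redirect := exec.Command("iptables", "-t", "nat", "-A", "PREROUTING",
        "-p", "tcp", "--dport", servicePort,
        "-j", "REDIRECT", "--to-ports", relayPort)
    return redirect.Run()
}

func main() {
    // 192.168.122.250 is a hypothetical gateway node address.
    if err := setupEgress("192.168.122.250", "192.168.122.139", "8000", "8000", "10080"); err != nil {
        fmt.Println("setup failed:", err)
    }
}
```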
The gateway is a component that is in charge of forwarding:

- egress packets from forwarder pods to external servers,
- ingress packets from external servers to forwarder pods.

It regularly reads the `Gateway` CRD for the gateway and updates the forwarding rules. An example implementation is the same as for the forwarder (see the sketch below).
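On the gateway side, the key step for egress is rewriting the source address so the external server sees the reserved `sourceIP`. The sketch below shows one possible SNAT rule for the values in the `ExternalService` example; as above, the exact rule layout is an assumption for illustration only.

```go
// Illustrative gateway-side SNAT setup for one egress rule: connections
// heading to targetIP:port leave the gateway with the reserved sourceIP.
package main

import (
    "fmt"
    "os/exec"
)

func setupEgressSNAT(targetIP, targetPort, sourceIP string) error {
    cmd := exec.Command("iptables", "-t", "nat", "-A", "POSTROUTING",
        "-d", targetIP, "-p", "tcp", "--dport", targetPort,
        "-j", "SNAT", "--to-source", sourceIP)
    return cmd.Run()
}

func main() {
    // Values from the ExternalService example: traffic to the database at
    // 192.168.122.139:8000 should leave the gateway with source 192.168.122.200.
    if err := setupEgressSNAT("192.168.122.139", "8000", "192.168.122.200"); err != nil {
        fmt.Println("setup failed:", err)
    }
}
```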
- TBD
- TBD
- TBD
- TBD
- TBD
This section must be completed when targeting alpha to a release.
- How can this feature be enabled / disabled in a live cluster?
  - Feature gate (also fill in values in `kep.yaml`)
    - Feature gate name:
    - Components depending on the feature gate:
  - Other
    - Describe the mechanism:
    - Will enabling / disabling the feature require downtime of the control plane?
    - Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume the `Dynamic Kubelet Config` feature is enabled.)
- Does enabling the feature change any default behavior? Any change of default behavior may be surprising to users or break existing automations, so be extremely careful here.
- Can the feature be disabled once it has been enabled (i.e. can we rollback the enablement)? Also set `disable-supported` to `true` or `false` in `kep.yaml`. Describe the consequences on existing workloads (e.g. if this is a runtime feature, can it break existing applications?).
- What happens if we reenable the feature if it was previously rolled back?
- Are there any tests for feature enablement/disablement? The e2e framework does not currently support enabling and disabling feature gates. However, unit tests in each component dealing with managing data created with and without the feature are necessary. At the very least, think about conversion tests if API types are being modified.
This section must be completed when targeting beta graduation to a release.
- How can a rollout fail? Can it impact already running workloads? Try to be as paranoid as possible - e.g. what if some components will restart in the middle of rollout?
- What specific metrics should inform a rollback?
- Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? Describe manual testing that was done and the outcomes. Longer term, we may want to require automated upgrade/rollback tests, but we are missing a bunch of machinery and tooling and can't do that now.
- Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? Even if applying deprecation policies, they may still surprise some users.
This section must be completed when targeting beta graduation to a release.
- How can an operator determine if the feature is in use by workloads? Ideally, this should be a metric. Operations against the Kubernetes API (e.g. checking if there are objects with field X set) may be a last resort. Avoid logs or events for this purpose.
- What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  - Metrics
    - Metric name:
    - [Optional] Aggregation method:
    - Components exposing the metric:
  - Other (treat as last resort)
    - Details:
- What are the reasonable SLOs (Service Level Objectives) for the above SLIs? At a high level this will usually be in the form of "high percentile of SLI per day <= X". It's impossible to provide comprehensive guidance, but at the very high level (they need more precise definitions) those may be things like:
  - per-day percentage of API calls finishing with 5XX errors <= 1%
  - 99th percentile over day of absolute value from (job creation time minus expected job creation time) for cron job <= 10%
  - 99.9% of /health requests per day finish with 200 code
- Are there any missing metrics that would be useful to have to improve observability of this feature? Describe the metrics themselves and the reasons they weren't added (e.g. cost, implementation difficulties, etc.).
This section must be completed when targeting beta graduation to a release.
- Does this feature depend on any specific services running in the cluster? Think about both cluster-level services (e.g. metrics-server) as well as node-level agents (e.g. specific version of CRI). Focus on external or optional services that are needed. For example, if this feature depends on a cloud provider API, or upon an external software-defined storage or network control plane.

For each of these, fill in the following, thinking both about running user workloads and creating new ones, as well as about cluster-level services (e.g. DNS):
  - [Dependency name]
    - Usage description:
      - Impact of its outage on the feature:
      - Impact of its degraded performance or high error rates on the feature:
For alpha, this section is encouraged: reviewers should consider these questions and attempt to answer them.
For beta, this section is required: reviewers must answer these questions.
For GA, this section is required: approvers should be able to confirm the previous answers based on experience in the field.
- Will enabling / using this feature result in any new API calls? Describe them, providing:
  - API call type (e.g. PATCH pods)
  - estimated throughput
  - originating component(s) (e.g. Kubelet, Feature-X-controller)
    focusing mostly on:
    - components listing and/or watching resources they didn't before
    - API calls that may be triggered by changes of some Kubernetes resources (e.g. update of object X triggers new updates of object Y)
    - periodic API calls to reconcile state (e.g. periodic fetching state, heartbeats, leader election, etc.)
- Will enabling / using this feature result in introducing new API types? Describe them, providing:
  - API type
  - Supported number of objects per cluster
  - Supported number of objects per namespace (for namespace-scoped objects)
- Will enabling / using this feature result in any new calls to the cloud provider?
- Will enabling / using this feature result in increasing size or count of the existing API objects? Describe them, providing:
  - API type(s):
  - Estimated increase in size: (e.g. new annotation of size 32B)
  - Estimated amount of new objects: (e.g. new Object X for every existing Pod)
- Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? Think about adding additional work or introducing new steps in between (e.g. need to do X to start a container), etc. Please describe the details.
- Will enabling / using this feature result in a non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? Things to keep in mind include: additional in-memory state, additional non-trivial computations, excessive access to disks (including increased log volume), significant amount of data sent and/or received over network, etc. Think through this both in small and large cases, again with respect to the supported limits.
The Troubleshooting section serves the `Playbook` role as of now. We may consider splitting it into a dedicated `Playbook` document (potentially with some monitoring details). For now, we leave it here.
This section must be completed when targeting beta graduation to a release.
- How does this feature react if the API server and/or etcd is unavailable?
- What are other known failure modes? For each of them, fill in the following information by copying the below template:
  - [Failure mode brief description]
    - Detection: How can it be detected via metrics? Stated another way: how can an operator troubleshoot without logging into a master or worker node?
    - Mitigations: What can be done to stop the bleeding, especially for already running user workloads?
    - Diagnostics: What are the useful log messages and their required logging levels that could help debug the issue? Not required until feature graduated to beta.
    - Testing: Are there any tests for failure mode? If not, describe why.
- What steps should be taken if SLOs are not being met to determine the problem?