Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Egress using IPs from a separate subnet #5799

Merged
merged 2 commits into from
Jan 5, 2024

Conversation

tnqn
Copy link
Member

@tnqn tnqn commented Dec 14, 2023

By default, it's assumed that the IPs allocated from the pool are in the
same subnet as the Node IPs. In some cases, users want to use IPs in
different subnets as Egress IPs. Additionally, users may want to use
VLAN tagging to segment the Egress traffic and the Node traffic.

The commit implements the requirements by introducing an optional field,
subnetInfo, to the ExternalIPPool resource. The subnetInfo field
contains the subnet attributes of the IPs in this pool. When using a
different subnet:

  • gateway and prefixLength must be set. Antrea will route Egress
    traffic to the specified gateway when the destination is not in the
    same subnet of the Egress IP, otherwise route it to the destination
    directly.

  • Optionally, you can specify vlan if the underlying network is
    expecting it. Once set, Antrea will tag Egress traffic leaving the
    Egress Node with the specified VLAN ID. Correspondingly, it's
    expected that reply traffic towards these Egress IPs are also tagged
    with the specified VLAN ID when arriving the Egress Node.

The implementation involves VLAN sub-interfaces and policy routing.

  • For a given subnet with a VLAN ID, a separate VLAN sub-interface will
    be created to hold the Egress IPs allocated from it. Egress traffic
    and its reply traffic will be sent over and received from the VLAN
    sub-interface for proper tagging and untagging.

  • For a given subnet, a separate route table will be created, routing
    the selected Egress traffic to the specified gateway, or to its
    neighbor.

  • For multiple Egress IPs associated allocated from the same subnet, a
    separate IP rule will be created for each Egress IP, matching its pkt
    mark and looking up the shared route table.

The feature is gated by the alpha "EgressSeparateSubnet" feature gate.

@luolanzone luolanzone added this to the Antrea v1.15 release milestone Dec 15, 2023
@tnqn tnqn force-pushed the egress-vlan branch 6 times, most recently from bbb649e to 524e43b Compare December 19, 2023 15:53
@tnqn tnqn changed the title Support tagging Egress traffic with VLAN Support Egress using IPs from a separate subnet Dec 19, 2023
@tnqn tnqn marked this pull request as ready for review December 19, 2023 16:15
@antoninbas
Copy link
Contributor

antoninbas commented Dec 19, 2023

typos in PR description:
s/users may want to use VLAN taggaing/users may want to use VLAN tagging
s/when the destination is not in the same subnet of the Egress IP/when the destination is not in the same subnet as the Egress IP
s/For multiple Egress IPs associated allocated from the same subnet/For multiple Egress IPs allocated from the same subnet
s/reply traffic towards these Egress IPs are also tagged with the specified VLAN ID when arriving the Egress Node./reply traffic towards these Egress IPs is also tagged with the specified VLAN ID when arriving at the Egress Node.

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initial review, I am not completely done yet

@@ -210,6 +210,8 @@ type ExternalIPPool struct {
type ExternalIPPoolSpec struct {
// The IP ranges of this IP pool, e.g. 10.10.0.0/24, 10.10.10.2-10.10.10.20, 10.10.10.30-10.10.10.30.
IPRanges []IPRange `json:"ipRanges"`
// The Subnet info of this IP pool. If set, all IP ranges in the IP pool should share the same subnet attributes.
SubnetInfo *SubnetInfo `json:"subnetInfo,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since ExternalIPPool is used by other features besides Egress (at least by ServiceExternalIP), will this field ever be relevant for these other features? For example, VLAN tagging / untagging for external Service traffic?

Should the documentation mention that this field is only used when an IP is allocated from the pool for the Egress feature, and is ignored otherwise?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't verified whether it can be used by ServiceExternalIP but likely infeasible because ServiceExternalIP requires the IP to be not assigned to the Node physically, which can't work with VLAN sub-interface according to my experiments when implementing this feature.

I have added a comment to the field about this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think better to reuse SubnetIPRange to be consistent with IPPool?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered this in the beginning but found some deficiencies in this way:

  1. With a list of SubnetIPRange in a single pool, it‘s possible to have different subnets in a pool, which makes implementation complex and inefficient: for example, when we want to know the subnet info of an IP, we have to loop over all ranges to figure it out; whenever there is a change in any of the IP ranges, we have to resync all resources associated with IPs allocated from it, increasing the span of the a single pool.
  2. It may be unclear to users that they should create a pool for each subnet or a pool for all subnets. From API's perspective, one object for one resource is clearer and simpler.
  3. It's redundant to fill the same subnet info when multiple ip ranges sharing the subnet.

I talked about this in the community meeting https://www.youtube.com/watch?v=q3vaVzzH6WM, starting from 20:42.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With your points, should we update IPPool to have a single SubnetInfo?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I discussed with @gran-vmv, I think there was no specific reason why IPPool was made so, we can update it when bumping it up to beta.

Comment on lines +231 to +235
// Gateway IP for this subnet, e.g. 10.10.1.1.
Gateway string `json:"gateway"`
// Prefix length for the subnet, e.g. 24.
PrefixLength int32 `json:"prefixLength"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if it would be more natural for people to explicitly have to provide a CIDR for the subnet and a separate IP address (part of the CIDR) for the gateway, as opposed to deducing the subnet from the gateway IP + prefix length. This may be more in line with how people are used to interact with ip tools? But I don't have a strong opinion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refered to SubnetInfo in IPPool for this, though not completely. Let me check how the struct was determined and how such configuration was provided in other systems.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems when IPPool was designed, it just asked the least information from users to construct a subnet. Asking a subnet may be a bit tedious for users as they may have provided the same information in ipRanges once. For example, the 2nd config below would look strange, and we need have one more validation to ensure the provided subnet matches the ip ranges.

  ipRanges:
  - cidr: 10.10.0.0/24
  subnetInfo:
    gateway: 10.10.0.1
    prefixLength: 24
    vlan: 10
  ipRanges:
  - cidr: 10.10.0.0/24
  subnetInfo:
    gateway: 10.10.0.1
    subnet: 10.10.0.0/24
    vlan: 10

prefixLength:
type: integer
minimum: 1
maximum: 128
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should it be 127? I think you reject 128 for IPv6 addresses is the validation webhook

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +117 to +118
start := net.ParseIP(ipRange.Start)
end := net.ParseIP(ipRange.End)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of curiosity, do we ever validate that start <= end?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guess no, will add it via another PR, I found some other issues in validation when implementing this.

},
},
{
name: "Adding unmatched SubnetInfo should not be allowed",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the name of this test case is invalid, this is a valid case as shown by the expectedResponse

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for catching it

Comment on lines 75 to 76
minEgressTable = 101
maxEgressTable = 120
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could we add Route to the name, e.g. minEgressRouteTable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 73 to 74
// minEgressTable to maxEgressTable are the route table IDs that can be configured on a Node for Egress traffic.
// Each distinct subnet uses one route table. 20 subnets should be enough.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we allocate any other table ID for any other feature? I don't think we do, except maybe for policyOnly mode, which doesn't support Egress anyway. Just wondering if we should define this in a "central" place, like we do for packet marks, OVS registers, etc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, moved it to a common package., will see if there are other usages and move them to the same file if yes via another PR.

@@ -70,6 +70,11 @@ const (
// maxEgressMark is the maximum mark of Egress IPs can be configured on a Node.
maxEgressMark = 255

// minEgressTable to maxEgressTable are the route table IDs that can be configured on a Node for Egress traffic.
// Each distinct subnet uses one route table. 20 subnets should be enough.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know the limit is per Node theoretically, but I was wondering if we should mention in the Egress documentation that we recommend staying under 20 subnets across all ExternalIPPools (when explicitly providing subnetInfo).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, added a note in egress.md:

Currently, the maximum number of different subnets that can be supported in a
cluster is 20, which should be sufficient for most cases. If you need to have
more subnets, please raise an issue with your use case, and we will consider
revising the limit based on that.

docs/egress.md Outdated
* Optionally, you can specify `vlan` if the underlying network is expecting it.
Once set, Antrea will tag Egress traffic leaving the Egress Node with the
specified VLAN ID. Correspondingly, it's expected that reply traffic towards
these Egress IPs are also tagged with the specified VLAN ID when arriving the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/are also tagged/is also tagged
s/when arriving the Egress Node/when arriving at the Egress Node

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if pool.Spec.SubnetInfo == nil {
return
}
c.queue.Add(pool.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like I may be asking a basic question, but how can we share the same workqueue between Egress resources and ExternalIPPool resources?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, the code was added at the last minite and I haven't tested it.
Will add an unit test to cover it.

@tnqn tnqn force-pushed the egress-vlan branch 3 times, most recently from 7672b02 to f2d525a Compare December 20, 2023 04:53
@luolanzone luolanzone added the action/release-note Indicates a PR that should be included in release notes. label Dec 20, 2023
@tnqn tnqn force-pushed the egress-vlan branch 4 times, most recently from c46f060 to c6024ac Compare December 27, 2023 11:46
@tnqn tnqn force-pushed the egress-vlan branch 3 times, most recently from fe8cb8c to ad1fe76 Compare January 2, 2024 15:53
@@ -210,6 +210,8 @@ type ExternalIPPool struct {
type ExternalIPPoolSpec struct {
// The IP ranges of this IP pool, e.g. 10.10.0.0/24, 10.10.10.2-10.10.10.20, 10.10.10.30-10.10.10.30.
IPRanges []IPRange `json:"ipRanges"`
// The Subnet info of this IP pool. If set, all IP ranges in the IP pool should share the same subnet attributes.
SubnetInfo *SubnetInfo `json:"subnetInfo,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think better to reuse SubnetIPRange to be consistent with IPPool?

pkg/controller/externalippool/validate.go Show resolved Hide resolved
pkg/agent/types/net.go Show resolved Hide resolved
pkg/agent/route/route_linux.go Outdated Show resolved Hide resolved
pkg/agent/ipassigner/ip_assigner_linux.go Show resolved Hide resolved
pkg/agent/ipassigner/ip_assigner_linux.go Outdated Show resolved Hide resolved
Copy link
Contributor

@jianjuns jianjuns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain to me how stale rules are cleaned up after agent restarts?

docs/feature-gates.md Outdated Show resolved Hide resolved
pkg/agent/controller/egress/egress_controller.go Outdated Show resolved Hide resolved
By default, it's assumed that the IPs allocated from the pool are in the
same subnet as the Node IPs. In some cases, users want to use IPs in
different subnets as Egress IPs. Additionally, users may want to use
VLAN taggaing to segment the Egress traffic and the Node traffic.

The commit implements the requirements by introducing an optional field,
`subnetInfo`, to the ExternalIPPool resource. The `subnetInfo` field
contains the subnet attributes of the IPs in this pool. When using a
different subnet:

* `gateway` and `prefixLength` must be set. Antrea will route Egress
  traffic to the specified gateway when the destination is not in the
  same subnet of the Egress IP, otherwise route it to the destination
  directly.

* Optionally, you can specify `vlan` if the underlying network is
  expecting it. Once set, Antrea will tag Egress traffic leaving the
  Egress Node with the specified VLAN ID. Correspondingly, it's
  expected that reply traffic towards these Egress IPs are also tagged
  with the specified VLAN ID when arriving the Egress Node.

The implementation involves VLAN sub-interfaces and policy routing.

* For a given subnet with a VLAN ID, a separate VLAN sub-interface will
  be created to hold the Egress IPs allocated from it. Egress traffic
  and its reply traffic will be sent over and received from the VLAN
  sub-interface for proper tagging and untagging.

* For a given subnet, a separate route table will be created, routing
  the selected Egress traffic to the specified gateway, or to its
  neighbor.

* For multiple Egress IPs associated allocated from the same subnet, a
  separate IP rule will be created for each Egress IP, matching its pkt
  mark and looking up the shared route table.

The feature is gated by the alpha "EgressSeparateSubnet" feature gate.

Signed-off-by: Quan Tian <[email protected]>
@tnqn
Copy link
Member Author

tnqn commented Jan 3, 2024

Could you explain to me how stale rules are cleaned up after agent restarts?

It's done by RestoreEgressRoutesAndRules, we simply remove all routes and rules associated with table IDs between 101 and 120. Later we can try a more graceful way, but we reallocate marks anyway, so there is no big difference before we can make marks permanent.

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one thing that came up during the community meeting 2 weeks ago was whether overlapping subnets (CIDRs) could exist for different VLAN IDs. IIRC, this is not supported (and perhaps some parts of the code make that assumption). Do we have any validation for this or any meaningful error log? Any need to mention it in the documentation?

@@ -66,11 +69,17 @@ where:
--subnets: a subnet creates a separate Docker bridge network (named 'antrea-<idx>') with the assigned subnet. A worker
Node will be connected to one of those network. Default is empty: all worker Nodes connected to the default Docker
bridge network created by kind.
--vlan-subnets: specifies the subnets of the VLAN to which all Nodes will be connected, in addition to the primary network.
The IP expression of the subnet will be used as the gateway IP. For example, '--vlan-subnets 10.100.100.1/24' means
10.100.100.1/24 will be assigned to the VLAN sub-interface of the network.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the last sentence a bit confusing (specifically "VLAN sub-interface of the network"). Maybe: "means that a VLAN sub-interface will be created on the primary Docker bridge, and it will be assigned the 10.100.100.1/24 address"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

The IP expression of the subnet will be used as the gateway IP. For example, '--vlan-subnets 10.100.100.1/24' means
10.100.100.1/24 will be assigned to the VLAN sub-interface of the network.
--vlan-id: specifies the ID of the VLAN to which all Nodes will be connected, in addition to the primary network. Note,
'--vlan-subnets' and '--vlan-id' must be specified together.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe give an error if only one is specified (instead of silently ignoring it)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +278 to +279
docker_run_with_host_net iptables -t filter -D FORWARD -i $bridge_interface -o $interface_name -j ACCEPT || true
docker_run_with_host_net iptables -t filter -D FORWARD -o $bridge_interface -i $interface_name -j ACCEPT || true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logically, it would make more sense to delete the iptables rules before deleting the interface? I don't know what happens to iptables rules referencing an interface that doesn't exist.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.
BTW, it's ok to reference an non-existing interface in iptables rules.

@@ -466,6 +482,38 @@ func (data *TestData) RunCommandOnNodeExt(nodeName, cmd string, envs map[string]
return data.provider.RunCommandOnNodeExt(nodeName, cmd, envs, stdin, sudo)
}

func (data *TestData) collectExternalInfo() error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the function never returns an error, I don't know if this was intended

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added error when input is invalid.

Comment on lines +192 to +193
externalServerIPv4 string
externalServerIPv6 string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that for existing Egress e2e tests, we use a network namespace to simulate an external server (see getCommandInFakeExternalNetwork). It's too bad that we have 2 separate mechanisms, but I assume that:

  1. the netns way doesn't work for VLAN testing?
  2. the new way is specific to Kind, hence we want to keep the netns way as well for Jenkins tests?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. yes
  2. we can change the other test to lerverage the external server, that's also why the external server is not made specific to vlan configuration, but I'd like to do it via another PR, the current PR is already too large. I feel it's not worth to run the test in a hacky way on non-kind testbeds.

// It can be used to determine whether it's safe to delete an interface when it's no longer used.
const vlanInterfacePrefix = "antrea-ext."

// assignee is the unit that IPs are assigned to. All IPs from the same subnet share an assignee.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if the distinction is meaningful here, but I think there is one assignee per VLAN ID, not per "subnet". Maybe the term "subnet" is a bit misleading here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

// The field must not be nil.
logicalInterface *net.Interface
// link is used for IP link management and IP address add/del operation. The field can be nil if IPs don't need to
// assigned to an interface physically.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/to assigned/to be assigned

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


func (as *assignee) destroy() error {
if err := netlink.LinkDel(as.link); err != nil {
return fmt.Errorf("error deleting interface %v", as.link)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not wrapping the actual error is intentional here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

}
}
}
func (a *ipAssigner) InitIPs(desired map[string]*crdv1b1.SubnetInfo) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous version of InitIPs would acquire the mutex for the entire function execution, but now we acquire the mutex for individual calls to AssignIP / UnassignIP. I assume this is ok based on how the function is meant to be used (called once for initialization, and not concurrently with other calls), but I just wanted to double-check with you.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your understanding is correct, added a comment to make it clear.

It's not thread-safe and should only be called once for initialization before calling other methods.

return nil
}

func (a *ipAssigner) GetInterfaceID(subnetInfo *crdv1b1.SubnetInfo) (int, bool) {
as, _ := a.getAssignee(subnetInfo, false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not protected by read lock?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my bad, missed it when adding the method

@@ -59,6 +59,21 @@ type Interface interface {
// DeleteSNATRule should delete rule to SNAT outgoing traffic with the mark.
DeleteSNATRule(mark uint32) error

// RestoreEgressRoutesAndRules restores the routes and rules configured on the system for Egress to the cache.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Egress -> Egresses?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -980,6 +993,38 @@ func (c *Client) listIPRoutesOnGW() ([]netlink.Route, error) {
return routes, nil
}

// RestoreEgressRoutesAndRules simply deletes all IP routes and rules created for Egress for now.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Egresses

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

docs/egress.md Outdated
@@ -198,6 +199,52 @@ The `ipRanges` field contains a list of IP ranges representing the available IPs
of this IP pool. Each IP range may consist of a `cidr` or a pair of `start` and
`end` IPs (which are themselves included in the range).

### SubnetInfo

By default, it's assumed that the IPs allocated from the pool are in the same
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the pool -> an ExternalIPPool

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

docs/egress.md Outdated
these Egress IPs is also tagged with the specified VLAN ID when arriving at the
Egress Node.

An example of ExternalIPPool using a different subnet is as below:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

different -> non-default

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

**Note**: Specifying different subnets is currently in alpha version. To use
this feature, users should enable the `EgressSeparateSubnet` feature gate.
Currently, the maximum number of different subnets that can be supported in a
cluster is 20, which should be sufficient for most cases. If you need to have
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The limit is per Node, not per cluster?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's per Node, however it will be risky for users to use more than 20 subnets in a cluster as they don't have control over how the Egress IPs are distributed, and in worst case (like most Egress candidate Nodes crash and Egress are scheduled to a few remaining Nodes) it may sometimes work and sometimes not. So I feel we should just ask users to not use 20 subnets in a cluster before we support it.

Another reason is if say 20 subnets per Node, we need to elaborate they can't assume the cluster limit is not 20*Nodes...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we can start from 20 per cluster. But I assume there can be cases that users configure different VLANs on different sets of Egress Nodes. This seems more likely to happen when Nodes are connected to physical network. We can see if users ask for more VLANs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Copy link
Member Author

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antoninbas @jianjuns thanks for your review. I have addressed all comments in the 2nd commit.

@@ -66,11 +69,17 @@ where:
--subnets: a subnet creates a separate Docker bridge network (named 'antrea-<idx>') with the assigned subnet. A worker
Node will be connected to one of those network. Default is empty: all worker Nodes connected to the default Docker
bridge network created by kind.
--vlan-subnets: specifies the subnets of the VLAN to which all Nodes will be connected, in addition to the primary network.
The IP expression of the subnet will be used as the gateway IP. For example, '--vlan-subnets 10.100.100.1/24' means
10.100.100.1/24 will be assigned to the VLAN sub-interface of the network.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

The IP expression of the subnet will be used as the gateway IP. For example, '--vlan-subnets 10.100.100.1/24' means
10.100.100.1/24 will be assigned to the VLAN sub-interface of the network.
--vlan-id: specifies the ID of the VLAN to which all Nodes will be connected, in addition to the primary network. Note,
'--vlan-subnets' and '--vlan-id' must be specified together.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +278 to +279
docker_run_with_host_net iptables -t filter -D FORWARD -i $bridge_interface -o $interface_name -j ACCEPT || true
docker_run_with_host_net iptables -t filter -D FORWARD -o $bridge_interface -i $interface_name -j ACCEPT || true
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.
BTW, it's ok to reference an non-existing interface in iptables rules.

Comment on lines +192 to +193
externalServerIPv4 string
externalServerIPv6 string
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. yes
  2. we can change the other test to lerverage the external server, that's also why the external server is not made specific to vlan configuration, but I'd like to do it via another PR, the current PR is already too large. I feel it's not worth to run the test in a hacky way on non-kind testbeds.

@@ -466,6 +482,38 @@ func (data *TestData) RunCommandOnNodeExt(nodeName, cmd string, envs map[string]
return data.provider.RunCommandOnNodeExt(nodeName, cmd, envs, stdin, sudo)
}

func (data *TestData) collectExternalInfo() error {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added error when input is invalid.

@@ -59,6 +59,21 @@ type Interface interface {
// DeleteSNATRule should delete rule to SNAT outgoing traffic with the mark.
DeleteSNATRule(mark uint32) error

// RestoreEgressRoutesAndRules restores the routes and rules configured on the system for Egress to the cache.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -980,6 +993,38 @@ func (c *Client) listIPRoutesOnGW() ([]netlink.Route, error) {
return routes, nil
}

// RestoreEgressRoutesAndRules simply deletes all IP routes and rules created for Egress for now.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

docs/egress.md Outdated
@@ -198,6 +199,52 @@ The `ipRanges` field contains a list of IP ranges representing the available IPs
of this IP pool. Each IP range may consist of a `cidr` or a pair of `start` and
`end` IPs (which are themselves included in the range).

### SubnetInfo

By default, it's assumed that the IPs allocated from the pool are in the same
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

docs/egress.md Outdated
these Egress IPs is also tagged with the specified VLAN ID when arriving at the
Egress Node.

An example of ExternalIPPool using a different subnet is as below:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

**Note**: Specifying different subnets is currently in alpha version. To use
this feature, users should enable the `EgressSeparateSubnet` feature gate.
Currently, the maximum number of different subnets that can be supported in a
cluster is 20, which should be sufficient for most cases. If you need to have
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's per Node, however it will be risky for users to use more than 20 subnets in a cluster as they don't have control over how the Egress IPs are distributed, and in worst case (like most Egress candidate Nodes crash and Egress are scheduled to a few remaining Nodes) it may sometimes work and sometimes not. So I feel we should just ask users to not use 20 subnets in a cluster before we support it.

Another reason is if say 20 subnets per Node, we need to elaborate they can't assume the cluster limit is not 20*Nodes...

@tnqn
Copy link
Member Author

tnqn commented Jan 4, 2024

I think one thing that came up during the community meeting 2 weeks ago was whether overlapping subnets (CIDRs) could exist for different VLAN IDs. IIRC, this is not supported (and perhaps some parts of the code make that assumption). Do we have any validation for this or any meaningful error log? Any need to mention it in the documentation?

Let me try if it can work with overlapping subnets, will add validation or documentation if it can't. But I assume this can be in another PR given it's not common to use overlapping external IPs.

Signed-off-by: Quan Tian <[email protected]>
@tnqn
Copy link
Member Author

tnqn commented Jan 4, 2024

I think one thing that came up during the community meeting 2 weeks ago was whether overlapping subnets (CIDRs) could exist for different VLAN IDs. IIRC, this is not supported (and perhaps some parts of the code make that assumption). Do we have any validation for this or any meaningful error log? Any need to mention it in the documentation?

Let me try if it can work with overlapping subnets, will add validation or documentation if it can't. But I assume this can be in another PR given it's not common to use overlapping external IPs.

It can't work, created #5842 for enhancing the validation.

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only 2 minor comments / questions

apiGroups: ["crd.antrea.io"]
apiVersions: ["v1alpha2"]
apiVersions: ["v1alpha2", "v1beta1"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this change before, should it be in a separate PR?

Copy link
Member Author

@tnqn tnqn Jan 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this is a bug and thought it couldn't work without listing v1beta1 here, however, it turns out working even without it because the matchPolicy defaults to Equivalent. So at least for now this could work with or without v1beta1.
I keep the change because I need to add "CREATE" anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, no need for the separate PR then

fb = fb.Action().GotoTable(EgressQoSTable.GetID())
} else {
fb = fb.MatchCTStateNew(true).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also didn't see this change before. I assume it's not specific to this PR. Was it just a redundant match?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's related to the patch. Previously we only need the first packet of a connection to be marked because SNAT applies to the whole connection. But for policy routing, we need all packets of a connection to be marked.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo in PR description that can be fixed when merging: s/taggaing/tagging

LGTM

@tnqn
Copy link
Member Author

tnqn commented Jan 4, 2024

typo in PR description that can be fixed when merging: s/taggaing/tagging

thanks, corrected description, will edit commit message when merging.

@tnqn
Copy link
Member Author

tnqn commented Jan 5, 2024

/test-all
/test-ipv6-all
/test-ipv6-only-all

@tnqn
Copy link
Member Author

tnqn commented Jan 5, 2024

/test-windows-all

@tnqn tnqn merged commit 43427a1 into antrea-io:main Jan 5, 2024
52 of 60 checks passed
@tnqn tnqn deleted the egress-vlan branch January 5, 2024 06:04
hongliangl added a commit to hongliangl/antrea that referenced this pull request Jun 26, 2024
In certain e2e tests (e.g., Egress, ServiceExternalIP, NodePort/LoadBalancer
Service), an external client/server is required.

Currently, we use either a network namespace or a Node deployed with Antrea for
this purpose. However, a network namespace can only be created on a single Node to
do the tests within the Node, and using a Node deployed with Antrea, whose network
configurations are affected by Antrea, potentially impacting related e2e tests.

A similar functionality was introduced in antrea-io#5799, an external container is created
after Kind cluster setup to serve as an external server/client for most e2e tests.

For more complex e2e tests (e.g., involving an FRR router), the requirements include:

- K8s-managed creation and deletion of the FRR router.
- A network environment for the FRR router unaffected by Antrea.
- Maximizing reuse of existing test framework code.

To meet these needs, this commit introduces an option to add an extra worker Node
to the Kind cluster where Antrea will not be deployed. This allows deploying a host
network Pod on that Node, ensuring a clean network environment unaffected by Antrea.

Signed-off-by: Hongliang Liu <[email protected]>
hongliangl added a commit to hongliangl/antrea that referenced this pull request Jun 26, 2024
In certain e2e tests (e.g., Egress, ServiceExternalIP, NodePort/LoadBalancer
Service), an external client/server is required.

Currently, we use either a network namespace or a Node deployed with Antrea for
this purpose. However, a network namespace can only be created on a single Node to
do the tests within the Node, and using a Node deployed with Antrea, whose network
configurations are affected by Antrea, potentially impacting related e2e tests.

A similar functionality was introduced in antrea-io#5799, an external container is created
after Kind cluster setup to serve as an external server/client for most e2e tests.

For more complex e2e tests (e.g., involving an FRR router), the requirements include:

- K8s-managed creation and deletion of the FRR router.
- A network environment for the FRR router unaffected by Antrea.
- Maximizing reuse of existing test framework code.

To meet these needs, this commit introduces an option to add an extra worker Node
to the Kind cluster where Antrea will not be deployed. This allows deploying a host
network Pod on that Node, ensuring a clean network environment unaffected by Antrea.

Signed-off-by: Hongliang Liu <[email protected]>
hongliangl added a commit to hongliangl/antrea that referenced this pull request Jun 27, 2024
In certain e2e tests (e.g., Egress, ServiceExternalIP, NodePort/LoadBalancer
Service), an external client/server is required.

Currently, we use either a network namespace or a Node deployed with Antrea for
this purpose. However, a network namespace can only be created on a single Node to
do the tests within the Node, and using a Node deployed with Antrea, whose network
configurations are affected by Antrea, potentially impacting related e2e tests.

A similar functionality was introduced in antrea-io#5799, an external container is created
after Kind cluster setup to serve as an external server/client for most e2e tests.

For more complex e2e tests (e.g., involving an FRR router), the requirements include:

- K8s-managed creation and deletion of the FRR router.
- A network environment for the FRR router unaffected by Antrea.
- Maximizing reuse of existing test framework code.

To meet these needs, this commit introduces an option to add an extra worker Node
to the Kind cluster where Antrea will not be deployed. This allows deploying a host
network Pod on that Node, ensuring a clean network environment unaffected by Antrea.

Signed-off-by: Hongliang Liu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/release-note Indicates a PR that should be included in release notes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants