add design for managing IPs on secondary networks #39

Closed · wants to merge 3 commits
1 change: 1 addition & 0 deletions README.md
@@ -97,6 +97,7 @@ For more information about Operators, see the
- [how-ironic-works](design/how-ironic-works.md)
- [image-ownership](design/image-ownership.md)
- [worker-config-drive](design/worker-config-drive.md)
- [secondary-network-ipam](design/secondary-network-ipam.md)

## Around the Web

181 changes: 181 additions & 0 deletions design/secondary-network-ipam.md
@@ -0,0 +1,181 @@
<!--
This work is licensed under a Creative Commons Attribution 3.0
Unported License.

http://creativecommons.org/licenses/by/3.0/legalcode
-->

# secondary-network-ipam

## Status

provisional

## Table of Contents

<!--ts-->
* [secondary-network-ipam](#secondary-network-ipam)
* [Status](#status)
* [Table of Contents](#table-of-contents)
* [Summary](#summary)
* [Motivation](#motivation)
* [Goals](#goals)
* [Non-Goals](#non-goals)
* [Proposal](#proposal)
* [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
* [Risks and Mitigations](#risks-and-mitigations)
* [Design Details](#design-details)
* [Work Items](#work-items)
* [Dependencies](#dependencies)
* [Test Plan](#test-plan)
* [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
* [Version Skew Strategy](#version-skew-strategy)
* [Drawbacks [optional]](#drawbacks-optional)
* [Alternatives](#alternatives)
* [References](#references)

<!-- Added by: dhellmann, at: Mon Jun 17 12:54:02 EDT 2019 -->

<!--te-->

## Summary

metal3 needs to manage IP addresses on the secondary network to ensure
that supporting applications such as Ceph have persistent addresses on
each host.

## Motivation

### Goals

1. Configure secondary network interfaces on all hosts in the same way.
1. Support PXE booting hosts for provisioning.
1. Support static IPs on all hosts on secondary networks so the metal3
components are not locked to running on the master hosts.

### Non-Goals

1. Integrate with external IPAM solutions.
1. Describe how to manage the IP of, or access to, the web server
   hosting the image(s) to be provisioned.

## Proposal

### Implementation Details/Notes/Constraints

Ceph, and potentially other supporting services that use the secondary
network in some deployments, get confused if a client IP changes. We
therefore want to ensure that those IPs do not change.

We use dnsmasq to manage PXE booting servers during
provisioning. dnsmasq will not bind to an interface managed by
dhclient, so at least some of the hosts must have statically allocated
IPs on the secondary network to allow us to run dnsmasq at all. This
also means it is not sufficient to manage DHCP reservations to ensure
a given host always receives the same IP.

When we implement host discovery, we will want to allow discovered
hosts to use the part of the IP range on the provisioning network
that is not used for static allocations, so that a user does not have
to clean up static allocations for hosts they never add to their
cluster.

To meet all of these requirements, we need to configure the secondary
network interfaces on each host with a static IP address.

### Risks and Mitigations

We need to ensure the DHCP address range and static address range do
not overlap. We should be able to ensure that with careful management
of the CIDRs.

[inwinstack/ipam](https://github.com/inwinstack/ipam) may not be
stable or reliable, and we would have to either fix it, fork it, or
build a replacement.

## Design Details

We need to divide the subnet range for the provisioning network
between a set of addresses we can use for DHCP and a set for static
IPs.

We need the installer to allocate IPs for the master nodes as it
provisions them, and to record that information in the kubernetes
database so those same IPs are not used for other hosts later.

We need to store the subnet CIDR and existing allocations in the
kubernetes database somewhere so new IPs can be allocated when hosts
are provisioned.
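
As a sketch of the kind of record we need (the schema below is
hypothetical, not the actual inwinstack/ipam API), a pool resource
would carry the subnet CIDR, the range reserved for static use, and
the addresses the installer already allocated to the masters:

```yaml
# Hypothetical Pool resource; the group, version, and field names are
# assumptions that would need to be mapped onto whatever IPAM
# controller we adopt.
apiVersion: ipam.metal3.io/v1alpha1
kind: Pool
metadata:
  name: provisioning-static
spec:
  cidr: 172.22.0.0/24
  # Static range only; dnsmasq owns 172.22.0.10-172.22.0.100 for DHCP.
  rangeStart: 172.22.0.101
  rangeEnd: 172.22.0.250
status:
  # Allocations recorded by the installer for the master nodes.
  allocations:
    master-0: 172.22.0.101
    master-1: 172.22.0.102
    master-2: 172.22.0.103
```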

The [inwinstack/ipam](https://github.com/inwinstack/ipam) controller
provides `Pool` and `IP` resources for allocating IPs from address
ranges. We should evaluate it to see if we can use it for managing the
IP allocations.

The machine-api-provider-baremetal controller is responsible for
making decisions about how to configure a host, so it should request
IPs for secondary networks, assign them to the interfaces, and pass
the relevant settings through the ignition configuration. It will
need to create host-specific ignition configuration resources,
because the configuration will differ for each host. It should also
set the `Machine` as an owner of the `IP` so that the reservation is
deleted when the `Machine` is deleted.
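
A sketch of the allocation resource the machine controller would
create follows; the schema is again hypothetical, and only the
`ownerReferences` mechanism is standard Kubernetes. The address the
IPAM controller fills in would then be rendered into the
host-specific ignition configuration before the host boots.

```yaml
# Hypothetical IP resource created by the machine controller. The
# ownerReference ties its lifetime to the Machine, so the reservation
# is garbage-collected when the Machine is deleted.
apiVersion: ipam.metal3.io/v1alpha1
kind: IP
metadata:
  name: worker-0-provisioning
  ownerReferences:
    - apiVersion: machine.openshift.io/v1beta1
      kind: Machine
      name: worker-0
      uid: 1c75e96a-...   # UID of the owning Machine (placeholder)
spec:
  pool: provisioning-static
status:
  address: 172.22.0.104   # filled in by the IPAM controller
```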

### Work Items

1. Ensure the IP ranges for secondary networks are captured by the
installer and saved to the kubernetes database as `Pool` resources.
1. Ensure the installer registers the IP allocations for masters.
1. Ensure the IPAM service is deployed along with the other metal3
components.
1. Update the metal3 machine controller to allocate IPs and create
host-specific ignition configurations containing the IPs.

Member:
A few of these work items are OpenShift / CoreOS specific: the "installer" and "ignition" references.

The machine controller is not involved today in creating the ignition config (or cloud-init config, or whatever user data is in use), so maybe this should go somewhere else.

I expect management of secondary interfaces should be done by something else, like https://github.com/nmstate/kubernetes-nmstate

An IPAM component allocating addresses for secondary network interfaces could then create the CRs that specify that the interface should be configured with that IP, and applying the configuration would be done by kubernetes-nmstate.

Member Author:
Yeah, I was trying to make sure I didn't miss anything. Should I move this doc to an internal location?

Member:
No need for it to be internal. It just depends on whether we come up with something that's more generally useful, or an OpenShift-specific integration thing. If it's just for OpenShift integration, the openshift-metal3 github org has a docs repo as well that hasn't been used much yet.

Speaking of OpenShift-specific solutions and how to do network config ... I wonder if the existing MachineConfig resource is enough, provided by the machine-config-operator. We can create those resources to drop new config files on nodes, or to replace existing config files.

1. Create image to hold IPAM operator.
1. Add IPAM operator to metal3 deployment.

### Dependencies

* [inwinstack/ipam](https://github.com/inwinstack/ipam)

### Test Plan

No special requirements

### Upgrade / Downgrade Strategy

Add IPAM operator to deployment configuration
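
Assuming the metal3 components are deployed from a kustomize
configuration, this could be as small as adding the operator's
manifests to the deployment; the file names here are hypothetical:

```yaml
# kustomization.yaml (sketch): deploy the IPAM operator alongside the
# existing metal3 components.
resources:
  - baremetal-operator.yaml
  - ipam-operator.yaml   # new: Deployment for the IPAM operator
```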

### Version Skew Strategy

N/A

## Drawbacks [optional]

This further complicates the configuration for the metal3 components
by adding yet another container/Pod/Deployment.

## Alternatives

We could require an external DHCP and IPAM solution for the secondary
networks, as we do for the primary network. This complicates
deployments and requires more services running outside of the cluster
to know about implementation details of the cluster in order to have
the external DHCP server pass PXE requests to the dnsmasq instance
that is part of the metal3 deployment, and which might change hosts
and IPs if the pod is restarted.

Contributor:
Well, I guess we could do this, but I think it would require writing a new DHCP option backend for ironic and having some kind of agent on the DHCP server to update the configurations. Currently dnsmasq and ironic need to have access to the same filesystem.

Member Author:
Yeah, this is an alternative so the details won't matter as much because we aren't going to do it.

Member:
Ironic is working with static configuration in this case, and as long as it points back to a VIP that ironic can be on, then the world is a happy place. If not, then... we'll need to be able to set configuration.

Member:
AFAIK ironic only manages the PXE configuration - the dnsmasq configuration is static and in a different container to the ironic conductor?

Member:
The dnsmasq configuration is static - yes; in a different container to the ironic conductor - yes (although built from ironic's Dockerfile for some reason). What ironic does modify is the PXE configuration, but this is not related to the discussion.

Contributor:

Right, I see; I was confused. It's really just the TFTP config we are updating, so we could separate those if we wanted.

We could monitor the DHCP reservations given by dnsmasq and ensure
they are configured to be persistent, then also use those IPs to set
static addresses on the hosts during provisioning. This would leave a
reservation to be cleaned up when a host is removed, which might be
tricky for a discovered host that is never actually provisioned.

We could have the dnsmasq container (or another container) manage an
IP using a "lifetime" setting, as described in [this alternative
proposal](https://github.com/metal3-io/metal3-docs/pull/38). That
approach leaves an opportunity for two hosts to end up with the same
IP if fencing doesn't work properly or if a timeout is too long.

## References

None