Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace tectonic-network-operator with cluster-network-operator #600

Merged
merged 3 commits into from
Nov 12, 2018

Conversation

squeed
Copy link
Contributor

@squeed squeed commented Nov 2, 2018

This removes tectonic-network-operator and replaces it with the cluster-network-operator.

This won't work quite yet, because there is a bug in origin. But how we generate the configuration is up for review.

Network configuration is a funny beast. We basically can't change it once the cluster is up. However, there is a distinct "happy path" along with a large number of possible settings tweaks. So, right now this generates a sane network configuration from the installer network configuration.

We probably want to figure out a better way to do that at some point; the installer config doesn't capture even the full set of IP address options.

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 2, 2018
Gopkg.toml Show resolved Hide resolved
@squeed
Copy link
Contributor Author

squeed commented Nov 5, 2018

Okay, refactored a bit. This change is now somewhat broader in scope.

I deprecated the badly-named podCIDR variable in the install-config, and added a correct multiple-cluster-cidr variable. We do need to think if we want to expose the complete network-operator API directly in the install-config, or if we want to just support the happy-path and leave the rest to customization.

Another open question is what the user should do when they change their cidrs. We actually will support expanding the service and cluster ip ranges. These changes will have to be propagated to their various consumers.

@abhinavdahiya
Copy link
Contributor

I deprecated the badly-named podCIDR variable in the install-config, and added a correct multiple-cluster-cidr variable. We do need to think if we want to expose the complete network-operator API directly in the install-config, or if we want to just support the happy-path and leave the rest to customization.

InstallConfig is supposed to provide very simple options. Why expose []ClusterCIDR ie list of these and if the NetworkOperator supports expanding these in running cluster allow users to give only one and customize later on.

Copy link
Contributor

@abhinavdahiya abhinavdahiya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

X

@squeed
Copy link
Contributor Author

squeed commented Nov 5, 2018

InstallConfig is supposed to provide very simple options. Why expose []ClusterCIDR ie list of these and if the NetworkOperator supports expanding these in running cluster allow users to give only one and customize later on.

Oh, it's so much muddier than that.

First of all, the network-operator currently doesn't support rolling out any changes, and it probably won't for a while. So we can't support that. Also, we won't ever support changes for non-Openshift-maintained network plugins. I'd also just like to avoid network changes as much as possible. They're always risky, even when done correctly.

Right now, the address configurations are read in to the installer, then split out in to the install-config, NetworkConfig.networkoperator.openshift.io, and Cluster.cluster.k8s.io. Too many things, right now, read from install-config, including the kube-apiserver-operator. Right now things are fine because IPs are immutable.

In the future, downstream consumers stop reading install-config and just use Cluster.cluster.k8s.io for addressing information - and we can just manage that object in the network-operator. The Cluster object isn't exclusively the domain of the network-operator, though, so we need to be somewhat subtle.

Even when we reach that ideal state, it's not clear how we should plumb through IP space configuration in the installer. Since IP blocks are general-case immutable, users do need to choose them correctly (or else!), so we need it to be part of even the happy-path configuration. However, any subsequent changes will presumably be made to the NetworkConfig object directly, leaving the install-config object in cluster out of date. Is that OK? Is there a way to give the installer IP configuration without writing it to the cluster's install-config?

@squeed squeed force-pushed the network-operator branch 2 times, most recently from ee2ed91 to c0d51ec Compare November 5, 2018 20:20
@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 6, 2018
@openshift-bot openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 6, 2018
@squeed
Copy link
Contributor Author

squeed commented Nov 6, 2018

Filed openshift/cluster-kube-apiserver-operator#106 to stop reading cidrs from the install-config. This is not a blocker.

@squeed
Copy link
Contributor Author

squeed commented Nov 6, 2018

I had to refactor this slightly one more time: the network-operator was setting KUBERNETES_SVC_HOST to 127.0.0.1, but that doesn't work because the bootstrap apiserver is still active. So I now have the installer creating an additional config map.

The alternative would be to special-casing the network-operator and running it on the bootstrap node. That seems uglier.

@squeed squeed changed the title [wip] Replace tectonic-network-operator with cluster-network-operator Replace tectonic-network-operator with cluster-network-operator Nov 6, 2018
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 6, 2018
@squeed squeed changed the title Replace tectonic-network-operator with cluster-network-operator [wip] Replace tectonic-network-operator with cluster-network-operator Nov 6, 2018
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 6, 2018
@squeed
Copy link
Contributor Author

squeed commented Nov 6, 2018

A few more things to work out:

  • Can the network-operator run in runlevel 1? I don't think so, but would love to be wrong
  • If not, is there a better way to pass the apiserver URL?

return errors.Wrapf(err, "Could not generate ClusterNetworkingConfig")
}

cluster := clusterv1a1.Cluster{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To add some context: currently we need to create the cluster object because the actuator interface that we use for creating machines expects it to exist. However we don’t rely on it at all and we aim to convince upstream to have a clear separation between "machine api" and "cluster api" and decouple it from the actuator interface.
Eventually we might want to consider and approach that progressively moves the installer definition/infra into a cluster actuator driven by that object so this might be a kind of an starting point but that's way far from scope and we need to be aware that the “cluster api” is effectively a “machine api” atm for us

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I'm thinking this might be relevant for a 3.11 to 4.0 upgrade

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. This is a bit of a bigger topic, then. Other operators need a way to discover the servicecidrs and clustercidrs for their own configuration. It's ugly to have every operator parsing out the install configuration. Let's meet up and talk about this. Note that pulling clustercidr / servicecidr from installconfig is a big problem for the apiserver-operator, see openshift/cluster-kube-apiserver-operator#/105

@knobunc
Copy link

knobunc commented Nov 9, 2018

@lucab do you know why there is any relabeling going on? It sounds wrong for the pod to be relabeling the host filesystem. Or am I misunderstanding what the problem is. Thanks!

@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 10, 2018
@openshift-bot openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 12, 2018
* Generate cluster-network-operator config from install-config
* Refactor install-config to better reflect network config
* remove tectonic-network-operator
* Remove temporary kube-proxy and cvo override
This is to break an import loop between pkg/assets/machines and
pkg/assets/manifests.
@knobunc
Copy link

knobunc commented Nov 12, 2018

@squeed what if we had an init container run the loop to test?

@squeed
Copy link
Contributor Author

squeed commented Nov 12, 2018

Getting closer. In the bad runs, the container processes are started with context container_t. In the good runs, they start with spc_t. Not sure why this would change between executions. Thoughts, @mrunalp?

@knobunc
Copy link

knobunc commented Nov 12, 2018

@rhatdan Do you have any idea? Thanks.

@squeed
Copy link
Contributor Author

squeed commented Nov 12, 2018

@lucab suggests removing runAsUser: 0, which I'm trying now.

@squeed
Copy link
Contributor Author

squeed commented Nov 12, 2018

Just waiting for the network operator changes to be picked up by the release, then we can retest.

@abhinavdahiya
Copy link
Contributor

/retest

@squeed
Copy link
Contributor Author

squeed commented Nov 12, 2018

That seems to have been it.

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 12, 2018
@squeed
Copy link
Contributor Author

squeed commented Nov 12, 2018

All green. Just needs a lgtm.

@abhinavdahiya
Copy link
Contributor

The PR is green and changes look fine. We can iterate on this. :yay:

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 12, 2018
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, squeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 12, 2018
@openshift-merge-robot openshift-merge-robot merged commit 54f341c into openshift:master Nov 12, 2018
@rhatdan
Copy link

rhatdan commented Nov 12, 2018

Might have discovered a bug though. If RunAsUser was causing privileged to be ingored?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.