Use kubenet instead of weavenet for GCP provider #107
Conversation
Hi @kcoronado. Thanks for your PR. I'm waiting for a kubernetes or kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @krousey @jessicaochen
/cc @maisem
@kcoronado: GitHub didn't allow me to request PR reviews from the following users: maisem. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
/cc @maisem
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Mostly questions to understand the what and why of this review.
If the cluster create command now fails, does that mean the config map PR broke cluster creation? If so, I would want to bump an issue tracking having a presubmit that actually creates a cluster.
cloud/google/serviceaccount.go
Outdated
@@ -65,6 +65,10 @@ func (gce *GCEClient) CreateMachineControllerServiceAccount(cluster *clusterv1.C
	if err != nil {
		return fmt.Errorf("couldn't grant permissions to service account: %v", err)
	}
	err = run("gcloud", "projects", "add-iam-policy-binding", project, "--member=serviceAccount:"+email, "--role=roles/iam.serviceAccountActor")
I need a bit of context. Why does the machine controller service account need to impersonate other accounts?
Once we got the master set up, there was an error message in the machine controller saying it needed access to the default service account when it was trying to create the node, so I added it here.
I am also confused by this. Why does the machine controller service account need to be able to impersonate other service accounts? Shouldn't it just need the appropriate scopes to do what it needs and that's it? Or is this the permission that lets it create instances with service accounts?
Also, iam.serviceAccountActor has been deprecated: https://cloud.google.com/iam/docs/service-accounts#the_service_account_actor_role
Yeah it looks like this is the permission to create instances with service accounts. Since I removed the service account from new worker nodes, I can get rid of this role. This is assuming we aren't going to have the machine controller spin up new masters, though.
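For reference, if the binding were kept rather than dropped, a minimal sketch of moving it off the deprecated role could look like this. roles/iam.serviceAccountUser is the replacement the linked deprecation notice points to, and the run helper, project, and email values are assumed to be the same ones used in the serviceaccount.go diff above.

// Sketch only: bind the non-deprecated serviceAccountUser role instead of
// serviceAccountActor. run is assumed to be the existing helper in this
// package that shells out to gcloud and returns an error.
func grantServiceAccountUser(project, email string) error {
	err := run("gcloud", "projects", "add-iam-policy-binding", project,
		"--member=serviceAccount:"+email,
		"--role=roles/iam.serviceAccountUser")
	if err != nil {
		return fmt.Errorf("couldn't grant permissions to service account: %v", err)
	}
	return nil
}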
cloud/google/machineactuator.go
Outdated
},
Labels: labels,
ServiceAccounts: []*compute.ServiceAccount{
Need some context, why do we need a specified service account now?
There was some error in the API server when spinning up the cluster that said it couldn't fetch the initial token and that it couldn't find the default service account. I don't remember the error message exactly, but it was preventing the API server from starting. I think @mkjelland was working on replacing this default service account with something more specific, though.
I don't think every node needs this account. Just the masters since they are the ones that will be making GCE API calls.
I think we need the service account because the controller manager will be using the GCE API to handle load balancers and pod CIDR routing.
I added a check so only masters get the service account added.
cloud/google/machineactuator.go
Outdated
@@ -256,10 +256,15 @@ func (gce *GCEClient) Create(cluster *clusterv1.Cluster, machine *clusterv1.Mach
	if gce.machineClient == nil {
		labels[BootstrapLabelKey] = "true"
	}
	tags := []string{"https-server"}
	if !util.IsMaster(machine) {
		tags = append(tags, fmt.Sprintf("%s-worker", cluster.Name))
What uses this %s-worker tag?
The nodes need this %s-worker tag because the firewall rules that GCP creates for LoadBalancers are applied to the node tag (which is specified in the /etc/kubernetes/cloud-config file).
How come we apply this tag to nodes via gce but masters via the startup script? Why can't nodes and masters use the same method of getting the tag applied?
Both nodes and masters get tags applied via GCE. They both get the https-server tag but only nodes get %s-worker. The master startup script creates a cloud-config file where it specifies the node-tags as %s-worker, but that doesn't add the tag to the master. I think the cloud-config is just for GCP to know how to create the firewall rules and maybe do some other networking things. I'm not 100% sure.
If nodeTags represent the machines that should have firewall opened for load balancers, the master machine should be part of that set. In open source, masters are schedulable nodes.
https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/gce/gce.go#L128
nodeTags []string // List of tags to use on firewall rules for load balancers
https://github.com/kubernetes/kubernetes/blob/f49f799dbda6ce47d5d5709b73bede68d3ccde0f/pkg/cloudprovider/providers/gce/gce_loadbalancer_external.go#L952
TargetTags: hostTags,
Allowed: []*compute.FirewallAllowed{
	{
		// TODO: Make this more generic. Currently this method is only
		// used to create firewall rules for loadbalancers, which have
		// exactly one protocol, so we can never end up with a list of
		// mixed TCP and UDP ports. It should be possible to use a
		// single firewall rule for both a TCP and UDP lb.
		IPProtocol: strings.ToLower(string(ports[0].Protocol)),
		Ports:      allowedPorts,
	},
I'll add the tag to the master. This change will be moved to a new PR though.
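A minimal sketch of that follow-up, assuming the tag string stays cluster.Name + "-worker" as in the diff above; it simply drops the IsMaster check so masters receive the same node tag that load-balancer firewall rules target.

// machineTags is a sketch only: every machine, master or worker, gets the
// %s-worker node tag so the firewall rules the GCE cloud provider creates
// for LoadBalancer services also cover a schedulable master; the existing
// https-server tag is kept.
func machineTags(cluster *clusterv1.Cluster) []string {
	return []string{"https-server", fmt.Sprintf("%s-worker", cluster.Name)}
}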
To answer your question, I think the config map PR broke cluster creation. It has been a while since I started working on that though so I don't totally remember. But before that PR, I don't think I was ever able to spin up a node after the master was created. So the cluster create command executed successfully, but the cluster wasn't completely functional. I think it would be a good idea to have a presubmit for creating a cluster.
To expand on @jessicaochen's questions, this PR's title and description state one objective, without much context: switching the network provider from weave-net to kubenet. Looking at the diff, though, the actual changes are at least:
There is a lot bundled up in this single PR! I think it would make the changes much easier to reason about if they were broken up into smaller, single-purpose diffs with clear rationales for why the changes are desirable. It also sounds like the very first PR should probably contain just the changes needed to fix any regressions introduced by #104, which probably does not require changing the network provider at all. Does that sound reasonable?
echo "exit 101" > /usr/sbin/policy-rc.d | ||
chmod +x /usr/sbin/policy-rc.d | ||
trap "rm /usr/sbin/policy-rc.d" RETURN | ||
apt install -y docker.io |
Unless there's an alias I didn't notice, this looks broken. Shouldn't the command be apt-get, not apt?
Ah, @krousey schooled me. Apparently apt does work now and I am just an old person clinging to the past. Might be worthwhile to standardize on one or the other, though, because this file now uses a mixture of them both.
Good point, I changed it to apt-get.
apt-transport-https \
cloud-utils \
prips

function install_configure_docker () {
Wouldn't similar changes be required for the nodes?
I added the change in for nodes.
@pipejakob turns out cluster creation was not as broken as I thought it was using weavenet. It creates the master and says it creates the node, but the node doesn't exist. The command runs to completion though. I double checked what was happening before #104 was merged and this is the same functionality as before, so there weren't any regressions (I'll update the description). I must have just messed things up myself when I was testing it.

In terms of all the changes, I agree there are several things going on. I can split out the dynamic tags for GCE workers since that's adding load balancing functionality, and this doesn't prevent the cluster from starting successfully. I don't think it makes as much sense to break the rest of the changes up since they're all changes needed to fix errors when starting a cluster with kubenet. I listed the error/reason for most of the changes in the first commit in this PR, but I could move them to the PR description for better visibility. Would that work?
- add a worker tag to worker nodes so firewall rules created for loadbalancers will be applied to the worker nodes.
- add the default service account to new VM instances.
- add serviceAccountActor role to the machine controller service account so the master can spin up new nodes.
- add a function to the startup scripts to prevent docker from starting on installation so we can configure DOCKER_OPTS and then start it manually. This is needed to prevent it from messing with the IP tables.
The flag was replaced with a workaround to create the missing file that triggered the error. The cluster was functioning when all preflight errors were ignored, but in case another error comes up in the future, we don't want to accidentally ignore it.
These changes will be in a separate PR.
I addressed all the comments. I'll update the PR description with what changes were added and why.
cloud/google/machineactuator.go
Outdated
@@ -266,10 +266,21 @@ func (gce *GCEClient) Create(cluster *clusterv1.Cluster, machine *clusterv1.Mach
	if gce.machineClient == nil {
		labels[BootstrapLabelKey] = "true"
	}
	serviceAccounts := []*compute.ServiceAccount{nil}
	if util.IsMaster(machine) {
I'm assuming that the service account is needed for the machine controller. The assumption here is that the machine controller only runs on the master. Could we call out that assumption in a comment?
It's actually needed for Kubernetes' cloud-provider-specific code. K8s needs to be able to read the compute API for faster node deletion, create advanced routes for pod CIDRs on nodes, etc.
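For example, the requested comment could sit directly above the check, spelling out both reasons raised in this thread (wording is only a suggestion; the surrounding code is from the diff above).

serviceAccounts := []*compute.ServiceAccount{nil}
// Only master machines get a service account attached: the machine controller
// runs on the master, and the Kubernetes GCE cloud-provider code there needs
// the compute API for fast node deletion, advanced routes for pod CIDRs, and
// load balancers. Worker nodes don't need these permissions.
if util.IsMaster(machine) {
	// ...populate serviceAccounts as in the diff above...
}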
lgtm from me after you add the comment @jessicaochen wanted.
lgtm.
/ok-to-test
/lgtm
/hold
Remove hold after you have done a last E2E validation.
/ok-to-test
rebuilt everything and created a cluster successfully.
/approved
/approve |
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jessicaochen, kcoronado. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Defer the patch
What this PR does / why we need it:
This PR modifies the GCP provider to create clusters that use kubenet instead of weave. Clusters are not being created successfully (create commands would complete and spin up a master, but no nodes), so this change will spin up a working master and node(s).
Specific changes:
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
#81
Fixes #83
Fixes #115
Special notes for your reviewer:
Release note:
@kubernetes/kube-deploy-reviewers