
Bug 2084450: Add unit/file for AWS to compute instance provider-id and pass it to the kubelet #3162

Merged
1 commit merged into openshift:master on May 30, 2022

Conversation

@damdo (Member) commented May 25, 2022

- What I did
I added an AWS-specific systemd unit (aws-kubelet-providerid.service) and file (/usr/local/bin/aws-kubelet-providerid) for generating the AWS instance provider-id (stored in the KUBELET_PROVIDERID env var), in order to pass it as the --provider-id argument to the kubelet service binary.

We needed to add this flag, and make it non-empty only on AWS, so that node syncing (specifically backing-instance detection) works via provider-id detection. This covers cases where the node hostname doesn't match the expected private-dns-name (e.g. when a custom DHCP Option Set with an empty domain-name is used).
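For illustration, here is a minimal Go sketch of the computation the new unit performs: read the availability zone and instance ID and join them into the aws:///<zone>/<instance-id> provider-id. This is only a sketch under assumptions (the actual change ships a systemd unit plus a helper file under /usr/local/bin; the IMDSv1-style metadata endpoints below are assumptions, not taken from this PR):

// providerid_sketch.go: hypothetical illustration of how the provider-id
// handed to the kubelet could be computed on AWS. Not the MCO file itself.
package main

import (
    "fmt"
    "io"
    "net/http"
)

// imds fetches a value from the EC2 instance metadata service
// (IMDSv1-style GET, assumed here for brevity).
func imds(path string) (string, error) {
    resp, err := http.Get("http://169.254.169.254/latest/meta-data/" + path)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    b, err := io.ReadAll(resp.Body)
    return string(b), err
}

func main() {
    zone, err := imds("placement/availability-zone")
    if err != nil {
        panic(err)
    }
    id, err := imds("instance-id")
    if err != nil {
        panic(err)
    }
    // e.g. KUBELET_PROVIDERID=aws:///eu-central-1a/i-065897aad3875d4cd,
    // which the kubelet then receives via --provider-id=${KUBELET_PROVIDERID}.
    fmt.Printf("KUBELET_PROVIDERID=aws:///%s/%s\n", zone, id)
}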

Should fix: https://bugzilla.redhat.com/show_bug.cgi?id=2084450
Reference to an upstream issue with context: kubernetes/cloud-provider-aws#384

- How to verify it
Try the reproduction steps available at: https://bugzilla.redhat.com/show_bug.cgi?id=2084450#c0 while launching a cluster with this MCO PR included.
Verify that the issue is not reproducible anymore.

- Description for the changelog
Add systemd units/files for AWS specific kubelet service


I tested this manually and here's what I got:

Every 2.0s: oc -n openshift-machine-api get machine.machine.openshift.io,nodes -o wide                                                                                                                                                        Damianos-MacBook-Pro: Wed May 25 17:53:47 2022

NAME                                                                            PHASE     TYPE         REGION         ZONE            AGE   NODE                                            PROVIDERID                                 STATE
machine.machine.openshift.io/ddonati-test111-tlq5x-master-0                     Running   m6i.xlarge   eu-central-1   eu-central-1a   68m   ip-10-0-138-106.eu-central-1.compute.internal   aws:///eu-central-1a/i-065897aad3875d4cd   running
machine.machine.openshift.io/ddonati-test111-tlq5x-master-1                     Running   m6i.xlarge   eu-central-1   eu-central-1b   68m   ip-10-0-180-99.eu-central-1.compute.internal    aws:///eu-central-1b/i-006f4c08cfbf1bf29   running
machine.machine.openshift.io/ddonati-test111-tlq5x-master-2                     Running   m6i.xlarge   eu-central-1   eu-central-1c   68m   ip-10-0-218-116.eu-central-1.compute.internal   aws:///eu-central-1c/i-08e52d41b18dac1c2   running
machine.machine.openshift.io/ddonati-test111-tlq5x-worker-eu-central-1a-xsg4h   Running   m6i.xlarge   eu-central-1   eu-central-1a   62m   ip-10-0-149-5.eu-central-1.compute.internal     aws:///eu-central-1a/i-0cd1a4e5a7bb7f7f9   running
machine.machine.openshift.io/ddonati-test111-tlq5x-worker-eu-central-1b-fwssv   Running   m6i.xlarge   eu-central-1   eu-central-1b   62m   ip-10-0-172-133.eu-central-1.compute.internal   aws:///eu-central-1b/i-0944ed2824e6dba03   running
machine.machine.openshift.io/ddonati-test111-tlq5x-worker-eu-central-1c-tln2r   Running   m6i.xlarge   eu-central-1   eu-central-1c   40m   ip-10-0-219-246                                 aws:///eu-central-1c/i-0df46c39b1323c47b   running  <--- NEW

NAME                                                 STATUS   ROLES    AGE   VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
node/ip-10-0-138-106.eu-central-1.compute.internal   Ready    master   67m   v1.23.3+ad897c4   10.0.138.106   <none>        Red Hat Enterprise Linux CoreOS 411.85.202205192031-0 (Ootpa)   4.18.0-348.23.1.el8_5.x86_64   cri-o://1.24.0-45.rhaos4.11.gitccef160.el8
node/ip-10-0-149-5.eu-central-1.compute.internal     Ready    worker   54m   v1.23.3+ad897c4   10.0.149.5     <none>        Red Hat Enterprise Linux CoreOS 411.85.202205192031-0 (Ootpa)   4.18.0-348.23.1.el8_5.x86_64   cri-o://1.24.0-45.rhaos4.11.gitccef160.el8
node/ip-10-0-172-133.eu-central-1.compute.internal   Ready    worker   58m   v1.23.3+ad897c4   10.0.172.133   <none>        Red Hat Enterprise Linux CoreOS 411.85.202205192031-0 (Ootpa)   4.18.0-348.23.1.el8_5.x86_64   cri-o://1.24.0-45.rhaos4.11.gitccef160.el8
node/ip-10-0-180-99.eu-central-1.compute.internal    Ready    master   68m   v1.23.3+ad897c4   10.0.180.99    <none>        Red Hat Enterprise Linux CoreOS 411.85.202205192031-0 (Ootpa)   4.18.0-348.23.1.el8_5.x86_64   cri-o://1.24.0-45.rhaos4.11.gitccef160.el8
node/ip-10-0-218-116.eu-central-1.compute.internal   Ready    master   68m   v1.23.3+ad897c4   10.0.218.116   <none>        Red Hat Enterprise Linux CoreOS 411.85.202205192031-0 (Ootpa)   4.18.0-348.23.1.el8_5.x86_64   cri-o://1.24.0-45.rhaos4.11.gitccef160.el8
node/ip-10-0-219-246                                 Ready    worker   34m   v1.23.3+ad897c4   10.0.219.246   <none>        Red Hat Enterprise Linux CoreOS 411.85.202205192031-0 (Ootpa)   4.18.0-348.23.1.el8_5.x86_64   cri-o://1.24.0-45.rhaos4.11.gitccef160.el8 <--- NEW
$ oc -n openshift-cloud-controller-manager logs -f aws-cloud-controller-manager-86778cf86b-gwttj
I0525 15:19:49.855913       1 node_controller.go:391] Initializing node ip-10-0-219-246 with cloud provider
I0525 15:19:50.045112       1 node_controller.go:493] Adding node label from cloud provider: beta.kubernetes.io/instance-type=m6i.xlarge
I0525 15:19:50.045456       1 node_controller.go:494] Adding node label from cloud provider: node.kubernetes.io/instance-type=m6i.xlarge
I0525 15:19:50.045606       1 node_controller.go:505] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/zone=eu-central-1c
I0525 15:19:50.045634       1 node_controller.go:506] Adding node label from cloud provider: topology.kubernetes.io/zone=eu-central-1c
I0525 15:19:50.045688       1 node_controller.go:516] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/region=eu-central-1
I0525 15:19:50.045708       1 node_controller.go:517] Adding node label from cloud provider: topology.kubernetes.io/region=eu-central-1
I0525 15:19:50.066905       1 node_controller.go:455] Successfully initialized node ip-10-0-219-246 with cloud provider
I0525 15:19:50.067622       1 event.go:294] "Event occurred" object="ip-10-0-219-246" kind="Node" apiVersion="v1" type="Normal" reason="Synced" message="Node synced successfully"
I0525 15:20:40.349078       1 controller.go:265] Node changes detected, triggering a full node sync on all loadbalancer services
I0525 15:20:40.349210       1 controller.go:741] Syncing backends for all LB services.
I0525 15:20:40.349292       1 controller.go:804] Updating backends for load balancer openshift-ingress/router-default with node set: map[ip-10-0-138-106.eu-central-1.compute.internal:{} ip-10-0-149-5.eu-central-1.compute.internal:{} ip-10-0-172-133.eu-central-1.compute.internal:{} ip-10-0-180-99.eu-central-1.compute.internal:{} ip-10-0-218-116.eu-central-1.compute.internal:{} ip-10-0-219-246:{}]
I0525 15:20:40.522259       1 aws_loadbalancer.go:1462] Instances added to load-balancer a781e5af36d8f4550baf5e3776505e3c
I0525 15:20:40.705767       1 controller.go:748] Successfully updated 1 out of 1 load balancers to direct traffic to the updated set of nodes
I0525 15:20:40.705790       1 event.go:294] "Event occurred" object="openshift-ingress/router-default" kind="Service" apiVersion="v1" type="Normal" reason="UpdatedLoadBalancer" message="Updated load balancer with new hosts"
  • The new machine machine.machine.openshift.io/ddonati-test111-tlq5x-worker-eu-central-1c-tln2r is correctly Provisioned and goes into Running state
  • A new node node/ip-10-0-219-246 is created for the machine, which is then synced up and linked correctly

openshift-ci bot (Contributor) commented May 25, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 25, 2022
@openshift-ci openshift-ci bot requested review from jkyros and sinnykumari May 25, 2022 10:45
@damdo damdo marked this pull request as ready for review May 25, 2022 16:39
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 25, 2022
@damdo damdo changed the title Add systemd units/files for AWS specific kubelet service Add unit/file AWS to compute instance provider-id and pass it to the kubelet May 25, 2022
@damdo damdo changed the title Add unit/file AWS to compute instance provider-id and pass it to the kubelet Add unit/file for AWS to compute instance provider-id and pass it to the kubelet May 25, 2022
@openshift-ci openshift-ci bot requested a review from yuqi-zhang May 25, 2022 16:41
@JoelSpeed (Contributor) left a comment


LGTM, I'd like to see what @sinnykumari and @mdbooth think

@lobziik (Contributor) commented May 25, 2022

lgtm

@@ -40,6 +40,7 @@ contents: |
--volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec \
{{cloudConfigFlag . }} \
--hostname-override=${KUBELET_NODE_NAME} \
+  --provider-id=${KUBELET_PROVIDERID} \
A reviewer (Contributor) commented on the added line:
Just curious, since you are adding this to all platforms, how does this affect non-AWS platforms? Would this be empty? Does that cause issues?

@damdo (Member, Author) replied:

Yes, this would be empty on all platforms aside from AWS.
And judging by the behaviour of the kubelet, this is only assigned to node.Spec.ProviderID if it is not empty ("").
https://github.com/kubernetes/kubernetes/blob/39c76ba2edeadb84a115cc3fbd9204a2177f1c28/pkg/kubelet/kubelet_node_status.go#L374

So it should be pretty safe.
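For context, the guard at the linked line behaves roughly like the following paraphrased sketch (illustration only, not the exact upstream code), which is why an empty KUBELET_PROVIDERID on non-AWS platforms is effectively a no-op:

// applyProviderID paraphrases the linked kubelet_node_status.go behaviour:
// the --provider-id value is copied onto the Node spec only when non-empty.
package sketch

import corev1 "k8s.io/api/core/v1"

func applyProviderID(node *corev1.Node, providerID string) {
    if providerID != "" {
        node.Spec.ProviderID = providerID
    }
}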

@kikisdeliveryservice kikisdeliveryservice changed the title Add unit/file for AWS to compute instance provider-id and pass it to the kubelet Bug 2084450: Add unit/file for AWS to compute instance provider-id and pass it to the kubelet May 25, 2022
@openshift-ci openshift-ci bot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 25, 2022
openshift-ci bot (Contributor) commented May 25, 2022

@damdo: This pull request references Bugzilla bug 2084450, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @sunzhaohua2


@openshift-ci openshift-ci bot requested a review from sunzhaohua2 May 25, 2022 21:15
@kikisdeliveryservice (Contributor) commented:

For visibility, since this changes kubelet templates:
@rphillips PTAL

@mdbooth (Contributor) commented May 30, 2022

I'm definitely in favour of this, and whatever approach we go for I will aim to add it to OpenStack. This is a much more robust way for CCM to initially match a Node to a kubelet. Otherwise we're left matching the instance name to the node name. As we've seen, node name is fraught with edge cases so a simple UUID match is very attractive.

API-wise this just means that the kubelet will add the ProviderID to the initial Node instead of CCM doing it after the initial fuzzy match by node name, so the end result is the same but without the initial edge cases.

The only thing that stopped me doing this last time was that we discussed adding this to the kubelet config file which was going to be a much more involved change in MCO. I would like to see us using providerID in kubelet before CCM is GA, so I would also be in favour of merging this simpler approach quickly and tackling the larger task of moving to a config file separately.

@mdbooth (Contributor) left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 30, 2022
@damdo (Member, Author) commented May 30, 2022

/retest

openshift-ci bot (Contributor) commented May 30, 2022

@damdo: all tests passed!



@sinnykumari (Contributor) commented:
PR looks good; I'd still like to get a review from the node team since they own the kubelet part.
/assign @rphillips
/lgtm

openshift-ci bot (Contributor) commented May 30, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: damdo, mdbooth, sinnykumari


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 30, 2022
@sinnykumari sinnykumari removed their assignment May 30, 2022
@openshift-merge-robot openshift-merge-robot merged commit 6ceb3af into openshift:master May 30, 2022
openshift-ci bot (Contributor) commented May 30, 2022

@damdo: All pull requests linked via external trackers have merged:

Bugzilla bug 2084450 has been moved to the MODIFIED state.


@sinnykumari (Contributor) commented:
don't know why bot thought lgtm means add approved label :/

@andreaskaris (Contributor) commented:
I think this broke stuff.

That's from the cloud-network-config-controller pod:

2022-05-31T06:54:25.394180800Z E0531 06:54:25.394140       1 controller.go:165] error syncing 'ip-10-0-213-157.us-west-1.compute.internal': error retrieving the private IP configuration for node: ip-10-0-213-157.us-west-1.compute.internal, err: the URI is not expected: aws://us-west-1b/i-0c19616e4c5427e53, requeuing in node workqueue
2022-05-31T06:55:01.718861205Z E0531 06:55:01.718820       1 controller.go:165] error syncing 'ip-10-0-184-199.us-west-1.compute.internal': error retrieving the private IP configuration for node: ip-10-0-184-199.us-west-1.compute.internal, err: the URI is not expected: aws://us-west-1c/i-0f7b0659dce846ed8, requeuing in node workqueue
2022-05-31T06:55:06.355822553Z E0531 06:55:06.355780       1 controller.go:165] error syncing 'ip-10-0-213-157.us-west-1.compute.internal': error retrieving the private IP configuration for node: ip-10-0-213-157.us-west-1.compute.internal, err: the URI is not expected: aws://us-west-1b/i-0c19616e4c5427e53, requeuing in node workqueue
2022-05-31T06:56:23.638990397Z I0531 06:56:23.638951       1 node_controller.go:82] corev1.Node: 'ip-10-0-184-199.us-west-1.compute.internal' in work queue no longer exists
2022-05-31T06:56:23.638990397Z I0531 06:56:23.638975       1 controller.go:160] Dropping key 'ip-10-0-184-199.us-west-1.compute.internal' from the node workqueue
2022-05-31T06:56:28.276665181Z I0531 06:56:28.276626       1 node_controller.go:82] corev1.Node: 'ip-10-0-213-157.us-west-1.compute.internal' in work queue no longer exists
2022-05-31T06:56:28.276665181Z I0531 06:56:28.276648       1 controller.go:160] Dropping key 'ip-10-0-213-157.us-west-1.compute.internal' from the node workqueue
2022-05-31T07:18:36.233407135Z E0531 07:18:36.233356       1 leaderelection.go:367] Failed to update lock: Put "https://api-int.ci-op-z2mw0vdr-be673.aws-2.ci.openshift.org:6443/api/v1/namespaces/openshift-cloud-network-config-controller/configmaps/cloud-network-config-controller-lock": read tcp 10.130.0.14:42286->10.0.131.50:6443: read: connection reset by peer

That's from the cloud-network-config-controller's code; before this change, the node's providerID looked like this: aws:///us-west-2a/i-008447f243eead273

//  This is what the node's providerID looks like on AWS
//      spec:
//        providerID: aws:///us-west-2a/i-008447f243eead273
//  i.e: zone/instanceID
func (a *AWS) getInstance(node *corev1.Node) (*ec2.Instance, error) {
    providerData := strings.Split(node.Spec.ProviderID, "/")
    if len(providerData) != 5 {
        return nil, UnexpectedURIError(node.Spec.ProviderID)
    }

But this has now changed to look like this (note the number of slashes):

[akaris@linux failed-egressip]$ omg get node ip-10-0-168-208.ec2.internal -o yaml | grep -i providerid
        f:providerID: {}
  providerID: aws://us-east-1d/i-036464a5bf998bef9

Likely culprit: #3162 | https://bugzilla.redhat.com/show_bug.cgi?id=2084450
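To make the mismatch concrete, a small illustrative snippet (not taken from the controller) shows why the len(providerData) != 5 check trips: the expected aws:///<zone>/<id> form splits into 5 fields on "/", while the aws://<zone>/<id> form the nodes now carry splits into only 4:

package main

import (
    "fmt"
    "strings"
)

func main() {
    expected := "aws:///us-west-2a/i-008447f243eead273" // three slashes after "aws:"
    current := "aws://us-west-1b/i-0c19616e4c5427e53"   // only two slashes after "aws:"
    fmt.Println(len(strings.Split(expected, "/")))      // 5 -> passes the getInstance check
    fmt.Println(len(strings.Split(current, "/")))       // 4 -> UnexpectedURIError
}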


@damdo (Member, Author) commented May 31, 2022

@andreaskaris I think you are right.
I just checked, and there are 3 slashes in the schema (in the Node spec) when the provider-id was not passed explicitly but was inferred through the nodeName lookup:

spec:
  providerID: aws:///us-west-1a/i-0aebe929106a0c6e9

@andreaskaris (Contributor) commented May 31, 2022

I'm currently filing a bug. But if we both agree, let's fix this in the MCO (instead of adjusting the CNCC) :-) I already have a fixup ready, but I'm also OK if you post it yourself.

@andreaskaris (Contributor) commented May 31, 2022

damdo added a commit to damdo/machine-config-operator that referenced this pull request May 31, 2022:
The change in openshift#3162 is missing a third `/` in the `aws:///` schema. This fixes it.
@damdo (Member, Author) commented May 31, 2022

@andreaskaris I was about to open a PR, but you beat me to it. Thanks for opening a bug and filing a fix!

neisw pushed a commit to neisw/machine-config-operator that referenced this pull request May 31, 2022:
This reverts commit 6ceb3af, reversing changes made to b52e75e.
deads2k added a commit that referenced this pull request May 31, 2022:
Revert "Merge pull request #3162 from damdo/BZ2084450-2"
andreaskaris added a commit to andreaskaris/machine-config-operator that referenced this pull request Jun 1, 2022
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
bugzilla/severity-medium: Referenced Bugzilla bug's severity is medium for the branch this PR is targeting.
bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
lgtm: Indicates that a PR is ready to be merged.