
HA cluster will not join second control-plane node when aws-encryption-provider is running #3019

Closed
scottdhowell3 opened this issue May 5, 2020 · 10 comments · Fixed by #3053
Assignees: randomvariable
Labels: area/bootstrap, kind/bug, lifecycle/active, priority/awaiting-more-evidence, priority/important-soon
Milestone: v0.3.x

Comments

scottdhowell3 commented May 5, 2020

/kind bug

What steps did you take and what happened:

  1. Add the aws-encryption-provider.yaml static pod manifest as a base64-encoded file to the KubeadmConfigSpec
  2. Add encryption-config.yaml as a base64-encoded file to the KubeadmConfigSpec (a sketch of how these files can be supplied appears after the log excerpt below)
  3. Create KMS key in AWS
  4. Spin up 3 control-plane and 1 worker node cluster using the cluster-api-aws-provider
  5. First control-plane node comes up correctly and can be seen with kubectl
  6. Worker node joins the initial control-plane node in the cluster
  7. Second control-plane node's kubelet fails with this error:
May 05 15:34:12 ip-10-90-20-215.ec2.internal systemd[1]: Starting kubelet: The Kubernetes Node Agent...
May 05 15:34:12 ip-10-90-20-215.ec2.internal kubelet[4583]: F0505 15:34:12.917439    4583 server.go:198] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file "/var/lib/kubele
May 05 15:34:12 ip-10-90-20-215.ec2.internal systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
May 05 15:34:12 ip-10-90-20-215.ec2.internal systemd[1]: Unit kubelet.service entered failed state.
May 05 15:34:12 ip-10-90-20-215.ec2.internal systemd[1]: kubelet.service failed.
May 05 15:34:23 ip-10-90-20-215.ec2.internal systemd[1]: kubelet.service holdoff time over, scheduling restart.
May 05 15:34:23 ip-10-90-20-215.ec2.internal systemd[1]: Started kubelet: The Kubernetes Node Agent.
May 05 15:34:23 ip-10-90-20-215.ec2.internal systemd[1]: Starting kubelet: The Kubernetes Node Agent...
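
For reference, a minimal sketch of how files like these can be supplied through a KubeadmControlPlane's kubeadmConfigSpec (Cluster API v1alpha3 field names; the paths, owner, and permissions shown here are assumptions rather than our exact manifest):

apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: KubeadmControlPlane
metadata:
  name: my-cluster-control-plane        # placeholder name
spec:
  kubeadmConfigSpec:
    files:
    # static pod manifest for the KMS plugin (step 1)
    - path: /etc/kubernetes/manifests/aws-encryption-provider.yaml
      owner: root:root
      permissions: "0600"
      encoding: base64
      content: <base64 of aws-encryption-provider.yaml>
    # EncryptionConfiguration read by kube-apiserver (step 2)
    - path: /etc/kubernetes/encryption-config.yaml
      owner: root:root
      permissions: "0600"
      encoding: base64
      content: <base64 of encryption-config.yaml>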

What did you expect to happen:
We expected the second control-plane node to join the cluster along with the third one.

Anything else you would like to add:
encryption-config.yaml for api-server

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
    - secrets
    providers:
    - kms:
        name: aws-encryption-provider
        endpoint: unix:///var/run/kmsplugin/socket.sock
        cachesize: 1000
        timeout: 3s
    - identity: {}
aws-encryption-provider.yaml
apiVersion: v1
kind: Pod
metadata:
  name: aws-encryption-provider
  namespace: kube-system
spec:
  containers:
  - image: payitadmin/aws-encryption-provider:latest
    name: aws-encryption-provider
    command:
    - /aws-encryption-provider
    - --key=arn:aws:kms:<account_specific_arn>
    - --region=us-east-1
    - --listen=/var/run/kmsplugin/socket.sock
    - --health-port=:8083
    ports:
    - containerPort: 8083
      protocol: TCP
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8083
    volumeMounts:
    - mountPath: /var/run/kmsplugin
      name: var-run-kmsplugin
  hostNetwork: true
  volumes:
  - name: var-run-kmsplugin
    hostPath:
      path: /var/run/kmsplugin
      type: DirectoryOrCreate
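
For completeness, a hedged sketch of how these two pieces are typically wired into kube-apiserver through kubeadm's ClusterConfiguration (this would sit inside the same kubeadmConfigSpec; the volume names and the /etc/kubernetes/encryption-config.yaml path are assumptions):

clusterConfiguration:
  apiServer:
    extraArgs:
      encryption-provider-config: /etc/kubernetes/encryption-config.yaml
    extraVolumes:
    - name: encryption-config           # assumed volume name
      hostPath: /etc/kubernetes/encryption-config.yaml
      mountPath: /etc/kubernetes/encryption-config.yaml
      readOnly: true
      pathType: File
    - name: kmsplugin                   # socket directory shared with the aws-encryption-provider pod
      hostPath: /var/run/kmsplugin
      mountPath: /var/run/kmsplugin
      pathType: DirectoryOrCreate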

Environment:

  • Cluster-api-provider-aws version: 0.3.3
  • Kubernetes version: 1.17.5
  • OS: Amazon Linux 2
randomvariable (Member) commented May 5, 2020

Bit hard to tell here. If we can get the logs from the following, it would help find out what's going on:

  • Logs from the Cluster API AWS controller
kubectl logs -n capa-system deployments/capa-controller-manager manager
  • Logs from the kubeadmcontrolplane controller
kubectl logs -n capi-kubeadm-control-plane-system deployments/capi-kubeadm-control-plane-controller-manager manager
  • Logs from the kubeadm bootstrap controller
kubectl logs -n capi-kubeadm-bootstrap-system deployments/capi-kubeadm-bootstrap-controller-manager manager

If you can SSH to the failed 2nd and 3rd control plane instances, then the contents of /var/log/cloud-init-output.log would be helpful.

Also, can you verify the version of the AWS controller you're using? 0.3.3 is a long time ago. I assume that's Cluster API v0.3.3 and a 0.5 series of Cluster API Provider AWS?

scottdhowell3 (Author) commented May 5, 2020

> Bit hard to tell here. If we can get the logs from the following, it would help find out what's going on:
>
> Logs from the Cluster API AWS controller
> kubectl logs -n capa-system deployments/capa-controller-manager manager
> Logs from the kubeadmcontrolplane controller
> kubectl logs -n capi-kubeadm-control-plane-system deployments/capi-kubeadm-control-plane-controller-manager manager
> Logs from the kubeadm bootstrap controller
> kubectl logs -n capi-kubeadm-bootstrap-system deployments/capi-kubeadm-bootstrap-controller-manager manager
>
> If you can SSH to the failed 2nd and 3rd control plane instances, then the contents of /var/log/cloud-init-output.log would be helpful.
>
> Also, can you verify the version of the AWS controller you're using? 0.3.3 is a long time ago. I assume that's Cluster API v0.3.3 and a 0.5 series of Cluster API Provider AWS?

ClusterAPI v0.3.5
CAPA v0.5.3
kubeadmincontrolplane_controller.txt
kubeadm_bootstrap_controller.txt
cloud-inti-output.txt
cluster_api_aws_controller.txt

msawka commented May 5, 2020

Here are the cloud-init-output.log files from a server that works, and a server that doesn't.

works.txt
broken.txt

We have a static pod defined (/etc/kubernetes/manifests/aws-encryption-provider.yaml) to start the encryption provider, which looks like it might be causing an issue with the kubeadm join?
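
If the static pod manifest really is what trips up kubeadm join, the chain would be: kubeadm join refuses to run when /etc/kubernetes/manifests is not empty (preflight error DirAvailable--etc-kubernetes-manifests), and since join is what writes /var/lib/kubelet/config.yaml, the kubelet error above would just be a downstream symptom. A quick way to check on the broken node (generic debugging commands, not taken from the attachments):

ls /etc/kubernetes/manifests/                        # is aws-encryption-provider.yaml already there before join runs?
grep -i preflight /var/log/cloud-init-output.log     # look for DirAvailable--etc-kubernetes-manifests
journalctl -u kubelet --no-pager | tail -n 50        # config.yaml errors persist until a join succeeds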

ncdc transferred this issue from kubernetes-sigs/cluster-api-provider-aws on May 6, 2020
k8s-ci-robot added the kind/bug label on May 6, 2020
CecileRobertMichon (Contributor) commented May 6, 2020

/priority awaiting-more-evidence
/area bootstrap
/assign @randomvariable

k8s-ci-robot added the priority/awaiting-more-evidence and area/bootstrap labels on May 6, 2020
vincepri (Member) commented May 6, 2020

/milestone v0.3.x

k8s-ci-robot added this to the v0.3.x milestone on May 6, 2020
scottdhowell3 (Author) commented:

Do we need to provide anything else for troubleshooting?

randomvariable (Member) commented:

@scottdhowell3 Nope. Had a foreshortened week due to public holidays last week; will sort this out this week.

randomvariable (Member) commented May 12, 2020

/priority important-soon

k8s-ci-robot added the priority/important-soon label on May 12, 2020
randomvariable (Member) commented May 12, 2020

/lifecycle active

k8s-ci-robot added the lifecycle/active label on May 12, 2020
rikirolly commented:

@scottdhowell3
I am trying to install aws-encryption-provider without using EKS.
Could you please provide a description of how to implement the first two points?

  1. Add aws-encryption-config.yaml static pod as a base64 encoded file to KubeAdmConfigSpec
  2. Add encryption-config.yaml as a base64 encoded file to KubeAdmConfigSpec
