
HA cluster will not join second control-plane node when aws-encryption-provider is running #3019

Closed
scottdhowell3 opened this issue May 5, 2020 · 10 comments · Fixed by #3053
Assignees: randomvariable
Labels: area/bootstrap, kind/bug, lifecycle/active, priority/awaiting-more-evidence, priority/important-soon
Milestone: v0.3.x

Comments

scottdhowell3 commented May 5, 2020

/kind bug

What steps did you take and what happened:

  1. Add the aws-encryption-provider.yaml static pod manifest as a base64-encoded file to the KubeadmConfigSpec
  2. Add encryption-config.yaml as a base64-encoded file to the KubeadmConfigSpec (a sketch of how these files can be supplied appears after the log excerpt below)
  3. Create KMS key in AWS
  4. Spin up 3 control-plane and 1 worker node cluster using the cluster-api-aws-provider
  5. First control-plane node comes up correctly and can be seen with kubectl
  6. Worker node joins the initial control-plane node in the cluster
  7. Second control-plane node's kubelet fails with this error:
May 05 15:34:12 ip-10-90-20-215.ec2.internal systemd[1]: Starting kubelet: The Kubernetes Node Agent...
May 05 15:34:12 ip-10-90-20-215.ec2.internal kubelet[4583]: F0505 15:34:12.917439    4583 server.go:198] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file "/var/lib/kubele
May 05 15:34:12 ip-10-90-20-215.ec2.internal systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
May 05 15:34:12 ip-10-90-20-215.ec2.internal systemd[1]: Unit kubelet.service entered failed state.
May 05 15:34:12 ip-10-90-20-215.ec2.internal systemd[1]: kubelet.service failed.
May 05 15:34:23 ip-10-90-20-215.ec2.internal systemd[1]: kubelet.service holdoff time over, scheduling restart.
May 05 15:34:23 ip-10-90-20-215.ec2.internal systemd[1]: Started kubelet: The Kubernetes Node Agent.
May 05 15:34:23 ip-10-90-20-215.ec2.internal systemd[1]: Starting kubelet: The Kubernetes Node Agent...
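
For reference, a minimal sketch of how files like these can be supplied through a KubeadmControlPlane's kubeadmConfigSpec (Cluster API v1alpha3 field names; the paths, owner, and permissions shown here are assumptions rather than our exact manifest):

apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: KubeadmControlPlane
metadata:
  name: my-cluster-control-plane        # placeholder name
spec:
  kubeadmConfigSpec:
    files:
    # static pod manifest for the KMS plugin (step 1)
    - path: /etc/kubernetes/manifests/aws-encryption-provider.yaml
      owner: root:root
      permissions: "0600"
      encoding: base64
      content: <base64 of aws-encryption-provider.yaml>
    # EncryptionConfiguration read by kube-apiserver (step 2)
    - path: /etc/kubernetes/encryption-config.yaml
      owner: root:root
      permissions: "0600"
      encoding: base64
      content: <base64 of encryption-config.yaml>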

What did you expect to happen:
We expected the second control-plane node to join the cluster along with the third one.

Anything else you would like to add:
encryption-config.yaml for api-server

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
    - secrets
    providers:
    - kms:
        name: aws-encryption-provider
        endpoint: unix:///var/run/kmsplugin/socket.sock
        cachesize: 1000
        timeout: 3s
    - identity: {}
aws-encryption-provider.yaml
apiVersion: v1
kind: Pod
metadata:
  name: aws-encryption-provider
  namespace: kube-system
spec:
  containers:
  - image: payitadmin/aws-encryption-provider:latest
    name: aws-encryption-provider
    command:
    - /aws-encryption-provider
    - --key=arn:aws:kms:<account_specific_arn>
    - --region=us-east-1
    - --listen=/var/run/kmsplugin/socket.sock
    - --health-port=:8083
    ports:
    - containerPort: 8083
      protocol: TCP
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8083
    volumeMounts:
    - mountPath: /var/run/kmsplugin
      name: var-run-kmsplugin
  hostNetwork: true
  volumes:
  - name: var-run-kmsplugin
    hostPath:
      path: /var/run/kmsplugin
      type: DirectoryOrCreate
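
For completeness, a hedged sketch of how these two pieces are typically wired into kube-apiserver through kubeadm's ClusterConfiguration (this would sit inside the same kubeadmConfigSpec; the volume names and the /etc/kubernetes/encryption-config.yaml path are assumptions):

clusterConfiguration:
  apiServer:
    extraArgs:
      encryption-provider-config: /etc/kubernetes/encryption-config.yaml
    extraVolumes:
    - name: encryption-config           # assumed volume name
      hostPath: /etc/kubernetes/encryption-config.yaml
      mountPath: /etc/kubernetes/encryption-config.yaml
      readOnly: true
      pathType: File
    - name: kmsplugin                   # socket directory shared with the aws-encryption-provider pod
      hostPath: /var/run/kmsplugin
      mountPath: /var/run/kmsplugin
      pathType: DirectoryOrCreate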

Environment:

  • Cluster-api-provider-aws version: 0.3.3
  • Kubernetes version: 1.17.5
  • OS: Amazon Linux 2
randomvariable (Member) commented May 5, 2020

Bit hard to tell here. If we can get the logs from the following, it would help find out what's going on:

  • Logs from the Cluster API AWS controller
kubectl logs -n capa-system deployments/capa-controller-manager manager
  • Logs from the kubeadmcontrolplane controller
kubectl logs -n capi-kubeadm-control-plane-system deployments/capi-kubeadm-control-plane-controller-manager manager
  • Logs from the kubeadm bootstrap controller
kubectl logs -n capi-kubeadm-bootstrap-system deployments/capi-kubeadm-bootstrap-controller-manager manager

If you can SSH to the failed 2nd and 3rd control plane instances, then the contents of /var/log/cloud-init-output.log would be helpful.

Also, can you verify the version of the AWS controller you're using? 0.3.3 is a long time ago. I assume that's Cluster API v0.3.3 and a 0.5 series of Cluster API Provider AWS?

scottdhowell3 (Author) commented May 5, 2020

> Bit hard to tell here. If we can get the logs from the following, it would help find out what's going on:
>
> Logs from the Cluster API AWS controller
> kubectl logs -n capa-system deployments/capa-controller-manager manager
> Logs from the kubeadmcontrolplane controller
> kubectl logs -n capi-kubeadm-control-plane-system deployments/capi-kubeadm-control-plane-controller-manager manager
> Logs from the kubeadm bootstrap controller
> kubectl logs -n capi-kubeadm-bootstrap-system deployments/capi-kubeadm-bootstrap-controller-manager manager
>
> If you can SSH to the failed 2nd and 3rd control plane instances, then the contents of /var/log/cloud-init-output.log would be helpful.
>
> Also, can you verify the version of the AWS controller you're using? 0.3.3 is a long time ago. I assume that's Cluster API v0.3.3 and a 0.5 series of Cluster API Provider AWS?

ClusterAPI v0.3.5
CAPA v0.5.3
kubeadmincontrolplane_controller.txt
kubeadm_bootstrap_controller.txt
cloud-inti-output.txt
cluster_api_aws_controller.txt

msawka commented May 5, 2020

Here are the cloud-init-output.log files from a server that works, and a server that doesn't.

works.txt
broken.txt

We have a static pod defined (/etc/kubernetes/manifests/aws-encryption-provider.yaml) to start the encryption provider, which looks like it might be causing an issue with the kubeadm join?
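
If the static pod manifest really is what trips up kubeadm join, the chain would be: kubeadm join refuses to run when /etc/kubernetes/manifests is not empty (preflight error DirAvailable--etc-kubernetes-manifests), and since join is what writes /var/lib/kubelet/config.yaml, the kubelet error above would just be a downstream symptom. A quick way to check on the broken node (generic debugging commands, not taken from the attachments):

ls /etc/kubernetes/manifests/                        # is aws-encryption-provider.yaml already there before join runs?
grep -i preflight /var/log/cloud-init-output.log     # look for DirAvailable--etc-kubernetes-manifests
journalctl -u kubelet --no-pager | tail -n 50        # config.yaml errors persist until a join succeeds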

ncdc transferred this issue from kubernetes-sigs/cluster-api-provider-aws on May 6, 2020
k8s-ci-robot added the kind/bug label on May 6, 2020
CecileRobertMichon (Contributor) commented May 6, 2020

/priority awaiting-more-evidence
/area bootstrap
/assign @randomvariable

k8s-ci-robot added the priority/awaiting-more-evidence and area/bootstrap labels on May 6, 2020
vincepri (Member) commented May 6, 2020

/milestone v0.3.x

k8s-ci-robot added this to the v0.3.x milestone on May 6, 2020
scottdhowell3 (Author) commented:

Do we need to provide anything else for troubleshooting?

randomvariable (Member) commented:

@scottdhowell3 Nope. Had a foreshortened week due to public holidays last week; will sort this out this week.

randomvariable (Member) commented May 12, 2020

/priority important-soon

k8s-ci-robot added the priority/important-soon label on May 12, 2020
randomvariable (Member) commented May 12, 2020

/lifecycle active

k8s-ci-robot added the lifecycle/active label on May 12, 2020
rikirolly commented:

@scottdhowell3
I am trying to install aws-encryption-provider without using EKS.
Could you please provide a description of how to implement the first two points?

  1. Add aws-encryption-config.yaml static pod as a base64 encoded file to KubeAdmConfigSpec
  2. Add encryption-config.yaml as a base64 encoded file to KubeAdmConfigSpec
