[BUG] Kubectl client outside of HA/multi-master Epiphany cluster fails to connect to server with invalid certificate #1520

Closed
ks4225 opened this issue Aug 3, 2020 · 8 comments

Comments

ks4225 commented Aug 3, 2020

Describe the bug
On an HA / multi-master cluster, issuing kubectl commands from a machine outside the cluster (e.g. a CI agent) will sometimes fail with a certificate error. The suspicion is that HAProxy on the k8s master machines routes the kubectl request to an apiserver whose certificate does not match the server address configured on the external machine.

To Reproduce
Steps to reproduce the behavior:

  1. Build an Epiphany cluster with HA / multi-master (3 masters in this case)
  2. Copy the kube config from one of the k8s master machines to an external machine (as part of this, localhost in the kube config needs to be replaced with a reachable master address; see the sketch after these steps)
  3. Issue kubectl commands from the external machine, which will fail periodically (depending on how traffic is routed)
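
For step 2, a minimal sketch of pointing the copied kubeconfig at a master; <master-ip> is a placeholder for whichever kubernetes_master address the external machine can reach, and the file name admin.conf is an assumption:

$ # replace localhost in the copied kubeconfig with a reachable master address
$ sed -i 's/localhost/<master-ip>/' admin.conf
$ kubectl --kubeconfig admin.conf get nodes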

Expected behavior
It should be possible to issue kubectl commands from the external machine that work consistently.

Config files
Key aspects of the config are:

components:
    kubernetes_master:
      count: 3
...
use_ha_control_plane: true

OS (please complete the following information):

  • OS: Ubuntu

Cloud Environment (please complete the following information):

  • Cloud Provider: MS Azure

Additional context
Add any other context about the problem here.

cc @jsmith085 @sunshine69

@ks4225
Copy link
Author

ks4225 commented Aug 3, 2020

Example error message is:
Unable to connect to the server: x509: certificate is valid for ###, ###, not ### (where ### are IPs)

sk4zuzu commented Aug 4, 2020

Thank you for reporting the issue, @ks4225 !

I've checked and, indeed, kubeconfig handling differs between non-HA and HA clusters.

I believe two things need to be done to fix the problem:

  1. In new clusters: during the kubeadm init run we need to provide additional cert SANs in the kubeadm config file/map (a rough sketch follows below).
  2. In existing clusters that already have this issue: we need to modify the config map and regenerate the certificates.

All this should be done during the epicli apply run.
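
A rough sketch of what the kubeadm side could look like; this is not the Epiphany implementation, and <client-address> is a placeholder for the address external machines use to reach the apiserver:

$ cat kubeadm-san.yml
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  certSANs:
  - "<client-address>"
$ # back up the old apiserver cert/key, then regenerate it with the extra SANs
$ sudo mkdir -p /root/pki-backup && sudo mv /etc/kubernetes/pki/apiserver.{crt,key} /root/pki-backup/
$ sudo kubeadm init phase certs apiserver --config kubeadm-san.yml
$ # restart the kube-apiserver static pod afterwards and mirror the change into the
$ # kubeadm-config ConfigMap, e.g. kubectl -n kube-system edit cm kubeadm-config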

As a temporary workaround, some kind of TCP proxy can be used, for example:

$ ssh -L 3446:localhost:3446 [email protected] -N
$ kubectl --kubeconfig admin.conf get nodes,pods -A
NAME        STATUS   ROLES    AGE     VERSION
node/x1a1   Ready    master   58m     v1.18.6
node/x1a2   Ready    master   10m     v1.18.6
node/x1a3   Ready    master   9m12s   v1.18.6
node/x1b1   Ready    <none>   56m     v1.18.6

NAMESPACE              NAME                                             READY   STATUS    RESTARTS   AGE
kube-system            pod/coredns-74c98659f4-5c6tj                     1/1     Running   0          57m
kube-system            pod/coredns-74c98659f4-hc7fw                     1/1     Running   0          57m
kube-system            pod/etcd-x1a1                                    1/1     Running   0          58m
kube-system            pod/etcd-x1a2                                    1/1     Running   0          10m
kube-system            pod/etcd-x1a3                                    1/1     Running   0          9m1s
kube-system            pod/kube-apiserver-x1a1                          1/1     Running   1          58m
kube-system            pod/kube-apiserver-x1a2                          1/1     Running   0          10m
kube-system            pod/kube-apiserver-x1a3                          1/1     Running   0          9m1s
kube-system            pod/kube-controller-manager-x1a1                 1/1     Running   2          58m
kube-system            pod/kube-controller-manager-x1a2                 1/1     Running   0          10m
kube-system            pod/kube-controller-manager-x1a3                 1/1     Running   0          9m1s
kube-system            pod/kube-flannel-ds-amd64-5cmmr                  1/1     Running   0          9m12s
kube-system            pod/kube-flannel-ds-amd64-9wk8s                  1/1     Running   0          58m
kube-system            pod/kube-flannel-ds-amd64-btbmt                  1/1     Running   1          10m
kube-system            pod/kube-flannel-ds-amd64-j7s4c                  1/1     Running   0          56m
kube-system            pod/kube-proxy-5zvck                             1/1     Running   1          56m
kube-system            pod/kube-proxy-nfgld                             1/1     Running   1          58m
kube-system            pod/kube-proxy-q5rnd                             1/1     Running   0          9m12s
kube-system            pod/kube-proxy-ww4tf                             1/1     Running   0          10m
kube-system            pod/kube-scheduler-x1a1                          1/1     Running   2          58m
kube-system            pod/kube-scheduler-x1a2                          1/1     Running   0          10m
kube-system            pod/kube-scheduler-x1a3                          1/1     Running   0          9m1s
kubernetes-dashboard   pod/dashboard-metrics-scraper-667d84869b-tv8d2   1/1     Running   0          57m
kubernetes-dashboard   pod/kubernetes-dashboard-78fbf9d49c-qs7nr        1/1     Running   0          57m

It's not very convenient though :(

atsikham commented Aug 11, 2020

Hello @ks4225,
There is a simple workaround: use kubectl with --insecure-skip-tls-verify, for example kubectl --insecure-skip-tls-verify get nodes.
I will continue working on the final solution.
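
If you would rather not pass the flag on every invocation, the same workaround can be persisted in the kubeconfig; a sketch, where <cluster-name> is a placeholder for the cluster entry in the copied admin.conf:

$ # the embedded CA data has to be removed first, because kubectl rejects a
$ # kubeconfig that combines a certificate authority with insecure mode
$ kubectl config unset clusters.<cluster-name>.certificate-authority-data
$ kubectl config set-cluster <cluster-name> --insecure-skip-tls-verify=true
$ kubectl get nodes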

ks4225 commented Aug 13, 2020

Thank you for the update @tolikt.

We have actually been using --insecure-skip-tls-verify already. Good to hear it's the recommended workaround.

mkyc commented Aug 25, 2020

@przemyslavic @atsikham why is it back in the pipeline? Can you leave a comment?

przemyslavic commented Aug 25, 2020

I did some testing by following the instructions posted here to reproduce the issue. I deployed an HA cluster with public IP addresses on Azure, then logged into one machine (other than a master/node), copied admin.conf from one of the masters, replaced localhost with the private IP address of the master node, and then ran kubectl. I am getting the same error that is described in this task. Support for public IPs will probably be removed here for security reasons, but I think @atsikham will be able to provide more details about the fix.
The result of running kubectl get nodes several times (it succeeds or fails depending on which master the request is routed to):

NAME                                            STATUS   ROLES    AGE   VERSION
ci-devhaazurubuflannel-kubernetes-master-vm-0   Ready    master   21h   v1.18.6
ci-devhaazurubuflannel-kubernetes-master-vm-1   Ready    master   22h   v1.18.6
ci-devhaazurubuflannel-kubernetes-master-vm-2   Ready    master   22h   v1.18.6
ci-devhaazurubuflannel-kubernetes-node-vm-0     Ready    <none>   21h   v1.18.6
ci-devhaazurubuflannel-kubernetes-node-vm-1     Ready    <none>   21h   v1.18.6
ci-devhaazurubuflannel-kubernetes-node-vm-2     Ready    <none>   21h   v1.18.6
[operations@ci-devhaazurubuflannel-logging-vm-0 ~]$ kubectl get nodes
Unable to connect to the server: x509: certificate is valid for 10.96.0.1, 10.1.1.9, 51.xx.yy.72, 51.xx.yy.71, 51.xx.yy.68, 127.0.0.1, not 10.1.1.6
[operations@ci-devhaazurubuflannel-logging-vm-0 ~]$ kubectl get nodes
Unable to connect to the server: x509: certificate is valid for 10.96.0.1, 10.1.1.7, 127.0.0.1, 51.xx.yy.72, 51.xx.yy.71, 51.xx.yy.68, not 10.1.1.6
[operations@ci-devhaazurubuflannel-logging-vm-0 ~]$ kubectl get nodes
NAME                                            STATUS   ROLES    AGE   VERSION
ci-devhaazurubuflannel-kubernetes-master-vm-0   Ready    master   21h   v1.18.6
ci-devhaazurubuflannel-kubernetes-master-vm-1   Ready    master   22h   v1.18.6
ci-devhaazurubuflannel-kubernetes-master-vm-2   Ready    master   22h   v1.18.6
ci-devhaazurubuflannel-kubernetes-node-vm-0     Ready    <none>   21h   v1.18.6
ci-devhaazurubuflannel-kubernetes-node-vm-1     Ready    <none>   21h   v1.18.6
ci-devhaazurubuflannel-kubernetes-node-vm-2     Ready    <none>   21h   v1.18.6

przemyslavic commented Aug 26, 2020

Reported a related issue: [BUG] Duplicated SANs for K8s apiserver certificate #1587

Should be fixed in kubeadm v1.19 - kubernetes/kubernetes#92753
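
To check which SANs an apiserver certificate actually carries, it can be inspected directly on a master; /etc/kubernetes/pki/apiserver.crt is the standard kubeadm location (assumed here to apply to Epiphany-built masters as well):

$ # list the Subject Alternative Names baked into this master's apiserver certificate
$ sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 'Subject Alternative Name'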

@przemyslavic

The fix has been tested. Now there should be no issues with running kubectl commands on an HA cluster.
