
Token CA hash does not match the Cluster CA certificate hash for Quick Start Guide #3214

Closed
jordan-lumley opened this issue Aug 4, 2022 · 21 comments

Comments

@jordan-lumley

jordan-lumley commented Aug 4, 2022

Environmental Info:
RKE2 Version: rke2 version v1.23.9+rke2r1 (2d206eb)
go version go1.17.5b7

Node(s) CPU architecture, OS, and Version: 3 VM nodes under a Proxmox controller.

Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal

Linux pos01.prd 5.13.0-40-generic #45~20.04.1-Ubuntu SMP Mon Apr 4 09:38:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: 1 server and 2 worker nodes

Describe the bug:
I am following the quick start guide step by step and keep running into CA cert errors. I have spent 4 days destroying the VMs and starting from scratch, time and time again. Sometimes worker node 1 will join but node 2 will fail with these errors. At the time of writing this ticket, worker node 1 is getting the error but node 2 joined successfully.

Steps To Reproduce:
Just following the quick start guide.

  • Installed RKE2:
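For reference, this is roughly what the quick start guide prescribes; the exact commands that were run are not shown in the issue, so treat this as an assumed reconstruction (the token placeholder is illustrative):

# Server node (per the quick start guide):
curl -sfL https://get.rke2.io | sh -
systemctl enable rke2-server.service
systemctl start rke2-server.service

# Agent nodes (assumed): point the agent config at the server, then install and start the agent
mkdir -p /etc/rancher/rke2
cat <<'EOF' > /etc/rancher/rke2/config.yaml
server: https://192.168.100.39:9345
token: <contents of /var/lib/rancher/rke2/server/node-token from the server>
EOF
curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE="agent" sh -
systemctl enable rke2-agent.service
systemctl start rke2-agent.service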

Expected behavior:
The nodes to join successfully.

Actual behavior:

-- Logs begin at Thu 2022-08-04 16:33:11 UTC, end at Thu 2022-08-04 18:46:35 UTC. --
Aug 04 18:15:41 pos02.prd systemd[1]: Starting Rancher Kubernetes Engine v2 (agent)...
Aug 04 18:15:41 pos02.prd sh[48430]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Aug 04 18:15:41 pos02.prd sh[48431]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Aug 04 18:15:41 pos02.prd rke2[48435]: time="2022-08-04T18:15:41Z" level=warning msg="not running in CIS mode"
Aug 04 18:15:41 pos02.prd rke2[48435]: time="2022-08-04T18:15:41Z" level=info msg="Starting rke2 agent v1.23.9+rke2r1 (2d206eba8d018>
Aug 04 18:15:41 pos02.prd rke2[48435]: time="2022-08-04T18:15:41Z" level=info msg="Running load balancer rke2-agent-load-balancer 12>
Aug 04 18:15:41 pos02.prd rke2[48435]: time="2022-08-04T18:15:41Z" level=error msg="token CA hash does not match the Cluster CA cert>
Aug 04 18:15:43 pos02.prd rke2[48435]: time="2022-08-04T18:15:43Z" level=error msg="token CA hash does not match the Cluster CA cert>
Aug 04 18:15:45 pos02.prd rke2[48435]: time="2022-08-04T18:15:45Z" level=error msg="token CA hash does not match the Cluster CA cert>
Aug 04 18:15:47 pos02.prd rke2[48435]: time="2022-08-04T18:15:47Z" level=error msg="token CA hash does not match the Cluster CA cert>
Aug 04 18:15:49 pos02.prd rke2[48435]: time="2022-08-04T18:15:49Z" level=error msg="token CA hash does not match the Cluster CA cert>
Aug 04 18:15:51 pos02.prd rke2[48435]: time="2022-08-04T18:15:51Z" level=error msg="token CA hash does not match the Cluster CA

Additional context / logs:
The logs above are from the beginning of the journalctl output on the 1st worker node ONLY. The full error message is:
Aug 04 18:47:11 pos02.prd rke2[48435]: time="2022-08-04T18:47:11Z" level=error msg="token CA hash does not match the Cluster CA certificate hash: 537e7956a582a6ac14a6e0c5eb7961030efd8c9590550b5b580230e213237868 != 5dc21fa3143eea7ca48f468cdfb5c36553b8b0e184c7fbb9db9d1777d66bfed1"

@brandond
Member

brandond commented Aug 4, 2022

What is the --server URL set to on the agents? Did you put an SSL-terminating load-balancer in front of your server, or does the URL point directly at the server node?

@jordan-lumley
Author

jordan-lumley commented Aug 4, 2022

What is the --server URL set to on the agents? Did you put an SSL-terminating load-balancer in front of your server, or does the URL point directly at the server node?

config.yaml:

server: https://192.168.100.39:9345 
token: K10537e7956a582a6ac14a6e0c5eb7961030efd8c9590550b5b580230e213237868::server:57b288f04abf15074b85b2fdf5c17403

This is pointing directly to the main server node.

@jordan-lumley
Author

jordan-lumley commented Aug 4, 2022

Also worth noting: over the course of the week and the multiple attempts, either node no. 2 or node no. 3 will fail to join with errors such as this one. This time, node no. 3 joined perfectly fine.

@jordan-lumley
Author

Not sure if this will help, but here is the output of curl -kv https://192.168.100.39:9345 (the server node).

From worker 1 (failing to join):

*   Trying 192.168.100.39:9345...
* TCP_NODELAY set
* Connected to 192.168.100.39 (192.168.100.39) port 9345 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: O=rke2; CN=rke2
*  start date: Mar  4 15:48:59 2022 GMT
*  expire date: Mar  4 15:48:59 2023 GMT
*  issuer: CN=rke2-server-ca@1646408939
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
> GET / HTTP/1.1
> Host: 192.168.100.39:9345
> User-Agent: curl/7.68.0
> Accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Thu, 04 Aug 2022 19:08:50 GMT
< Content-Length: 19
< 
404 page not found
* Connection #0 to host 192.168.100.39 left intact

From worker node 2 (successful join):

*   Trying 192.168.100.39:9345...
* TCP_NODELAY set
* Connected to 192.168.100.39 (192.168.100.39) port 9345 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_CHACHA20_POLY1305_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: O=rke2; CN=rke2
*  start date: Aug  4 18:01:20 2022 GMT
*  expire date: Aug  4 18:01:20 2023 GMT
*  issuer: CN=rke2-server-ca@1659636080
*  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x557fe121dd80)
> GET / HTTP/2
> Host: 192.168.100.39:9345
> user-agent: curl/7.68.0
> accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
< HTTP/2 404 
< content-type: text/plain; charset=utf-8
< x-content-type-options: nosniff
< content-length: 19
< date: Thu, 04 Aug 2022 19:08:46 GMT
< 
404 page not found
* Connection #0 to host 192.168.100.39 left intact

@brandond
Member

brandond commented Aug 4, 2022

Do you have the exact same token on both agents? Did you copy it directly from the server node? Does the token match the contents of /var/lib/rancher/rke2/server/token on the server?

What do you get from curl -ks https://192.168.100.39:9345/cacerts | sha256sum on both agents? The sha256sum should match the bit immediately following K10 in the token value: 537e7956a582a6ac14a6e0c5eb7961030efd8c9590550b5b580230e213237868 - if it does not, then you don't have the correct token for your server.
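Spelled out as a quick check (a sketch based on the commands above; the grep/sed pipeline at the end is just one assumed way to extract the embedded hash, not something from the RKE2 docs):

# On the server: the authoritative token, including the embedded CA hash
cat /var/lib/rancher/rke2/server/token

# On each agent: hash the CA bundle actually served by the server
curl -ks https://192.168.100.39:9345/cacerts | sha256sum

# The full token has the form K10<sha256 of /cacerts>::server:<passphrase>,
# so the sha256sum above should equal the hex string between "K10" and "::server:".
# One way to pull that piece out of the agent's config for comparison:
grep '^token:' /etc/rancher/rke2/config.yaml | sed -e 's/.*K10//' -e 's/::server:.*//'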

@jordan-lumley
Author

I used the same config.yaml on both worker nodes, but the result of the curl command you asked me to run is different.

Screenshots:
Worker 1
image

Worker 2
image

How is that possible?

@brandond
Member

brandond commented Aug 4, 2022

Are these VMs deployed to different environments or something? Are you sure 192.168.100.39 reaches the same host from both nodes? Do you have an HTTP proxy or something else that might be interfering?

@jordan-lumley
Author

jordan-lumley commented Aug 4, 2022

These are VMs, yes. How would you like me to verify that they're hitting the same host? Willing to try whatever you need me to try in order to troubleshoot. We have an external-facing proxy that handles basic web traffic, but this is just a local-to-local connection, so the proxy should be avoided at this point.

@brandond
Member

brandond commented Aug 4, 2022

That's a bit beyond what I can help you troubleshoot; it seems pretty clear that something is wrong with your environment - but you might try just the basics:

  • What is the complete output of curl -vks https://192.168.100.39:9345/cacerts - is there a bit in there about * Uses proxy env variable HTTPS_PROXY == ?
  • If you ssh to that IP from both of the agents, do you end up on the same host?
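A quick sketch of those checks (usernames and commands here are illustrative assumptions, not from the thread):

# On each agent: is anything forcing curl through a proxy?
env | grep -i proxy
curl -vks https://192.168.100.39:9345/cacerts | head

# On each agent: confirm the IP actually reaches the same machine
ssh root@192.168.100.39 hostname   # should print the same hostname from both agents
ip neigh show 192.168.100.39       # compare the MAC address each agent sees for that IP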

@jordan-lumley
Author

I destroyed the 1st worker node and am going to try to rebuild the VM real quick. I'll report back in 5-10 minutes.

@jordan-lumley
Author

jordan-lumley commented Aug 4, 2022

So, surprisingly enough, after the double-digit number of times I deleted the Proxmox VMs and restarted the process over again, this time it worked. However, the proxy won't stay open :( why meeeee lol. Not sure if the join was actually correct if it's showing this now.

From my machine running kubectl proxy:
image

Server:
image

Worker 1:
image

@jordan-lumley
Author

image

@jordan-lumley
Author

Should I open a separate issue for this next problem I'm running into?

@brandond
Member

brandond commented Aug 4, 2022

no, it looks like another aspect of the same thing. Is something going on with the network between the VMs?

@dsyorkd

dsyorkd commented Aug 29, 2022

@jordan-lumley did you figure this out? I'm seeing much the same. The only difference I can tell is that whenever I use node-ip in the server config, pods like coredns and nginx are never able to complete; leaving out node-ip, my cluster starts but uses the wrong adapter/IP for communication.

@CagiHub

CagiHub commented Aug 31, 2022

This one helped me:
SSH into the agent machine and run this command:

systemctl restart rke2-agent.service

Hope it helps someone :)

@jostmart

jostmart commented Oct 6, 2022

I think there is a clarity problem in the documentation. If you add a token to your initial server's /etc/rancher/rke2/config.yaml, that token is not picked up by RKE2. Instead, another token will be generated and written to /var/lib/rancher/rke2/server/node-token.

172.20.10.100 is a load balancer pointing towards 172.20.10.15.

Example from my initial master:

cat /var/lib/rancher/rke2/server/node-token
K10c7d825d6fece904bc2755bd226b99f829cd12a23c7884dd095a6a02f7a03d953::server:7ccd56357e2b6663b846a7493a7124c2

curl -ks https://172.20.10.100:9345/cacerts | sha256sum
c7d825d6fece904bc2755bd226b99f829cd12a23c7884dd095a6a02f7a03d953  -

curl -ks https://172.20.10.15:9345/cacerts | sha256sum
c7d825d6fece904bc2755bd226b99f829cd12a23c7884dd095a6a02f7a03d953  -

cat /etc/rancher/rke2/config.yaml
tls-san:
  - 172.20.10.100

And this is from my second master:

cat /etc/rancher/rke2/config.yaml
server: https://172.20.10.100:9345
token: K10c7d825d6fece904bc2755bd226b99f829cd12a23c7884dd095a6a02f7a03d953::server:7ccd56357e2b6663b846a7493a7124c2
tls-san:
  - 172.20.10.100

When using curl to compare cacerts like above, you need to remember that even an error (empty) output will be piped to sha256sum. The example below looks like a faulty token, but it's in fact an empty reply from a non-existent service:

curl -ks https://172.20.10.16:9345/cacerts | sha256sum
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  -
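That value is simply the SHA-256 of empty input, which you can confirm locally:

printf '' | sha256sum
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  -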

I'd like to know how to force the token on my initial master, because I'm templating config.yml with Ansible and deploying multiple clusters. Like jordan-lumley, I've been reinstalling VMs, though more like triple-digit rollbacks of VM snapshots, to figure out what's wrong with the installation documentation.

Even after figuring out the issue with the token, I still don't get the same ca-cert on my 3 masters:

curl -ks https://172.20.10.15:9345/cacerts
-----BEGIN CERTIFICATE-----
MIIBejCCAR+gAwIBAgIBADAKBggqhkjOPQQDAjAkMSIwIAYDVQQDDBlya2UyLXNl
...
6WhkmwtevTAKBggqhkjOPQQDAgNJADBGAiEAyXfeVDXlQO/vwDFYxuERdhOOMUki
Rzf13NaySaDmflACIQDNjjhHxFS1Dv5VrzH6zSZFULDAqcPXVqKdVbkXVbqbqg==
-----END CERTIFICATE-----

curl -ks https://172.20.10.16:9345/cacerts
-----BEGIN CERTIFICATE-----
MIIBejCCAR+gAwIBAgIBADAKBggqhkjOPQQDAjAkMSIwIAYDVQQDDBlya2UyLXNl
...
6WhkmwtevTAKBggqhkjOPQQDAgNJADBGAiEAyXfeVDXlQO/vwDFYxuERdhOOMUki
Rzf13NaySaDmflACIQDNjjhHxFS1Dv5VrzH6zSZFULDAqcPXVqKdVbkXVbqbqg==
-----END CERTIFICATE-----

And the last one is different:

curl -ks https://172.20.10.17:9345/cacerts
-----BEGIN CERTIFICATE-----
MIIBeDCCAR+gAwIBAgIBADAKBggqhkjOPQQDAjAkMSIwIAYDVQQDDBlya2UyLXNl
...
XmnwTGSh3DAKBggqhkjOPQQDAgNHADBEAiAo1QsU6Ga4QwtJww9pp8mZvQ12PC8f
0cJ6vQeLXOLwVgIgcQzXiMYwLr1iaSuzwv8/xJdtbwOyYqmVRbGoc9Qst0E=
-----END CERTIFICATE-----

Looking at the tokens again on the failing master:

cat /var/lib/rancher/rke2/server/node-token
K10509ef8bd4ce536d2a800c5400051dc0717edad0b52fa4181a259792c8a84dc24::server:1025fb77676a7200f6953e9d0630d618


grep token /etc/rancher/rke2/config.yml
token: K10c7d825d6fece904bc2755bd226b99f829cd12a23c7884dd095a6a02f7a03d953::server:7ccd56357e2b6663b846a7493a7124c2

Why is /var/lib/rancher/rke2/server/node-token populated with another token? I am 100% certain I initialized config.yml with the proper token before starting anything on the master.

This is how I initialized the last master:

4  mkdir -p /etc/rancher/rke2
5  vim /etc/rancher/rke2/config.yaml
6  curl -sfL https://get.rke2.io | sh -
7  cat /etc/rancher/rke2/config.yaml
8  systemctl enable rke2-server.service
9  systemctl start rke2-server.service

Step 5 is where I cut and paste content from a working server.
Step 6 works fine.
Step 7 is where I verify that the content has not changed since step 5.
Step 9 fails.

After the failed step, journalctl tells me:

Oct 07 09:32:42 k8s-rancher3 sh[1870]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Oct 07 09:32:42 rancher3 sh[1870]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Oct 07 09:32:42 k8s-rancher3 rke2[1881]: time="2022-10-07T09:32:42+02:00" level=warning msg="not running in CIS mode"
Oct 07 09:32:42 k8s-rancher3 rke2[1881]: time="2022-10-07T09:32:42+02:00" level=info msg="Starting rke2 v1.24.6+rke2r1 (473cc354adecd235f00f0d80828611b0556b83e5)"
Oct 07 09:32:42 k8s-rancher3 rke2[1881]: time="2022-10-07T09:32:42+02:00" level=fatal msg="starting kubernetes: preparing server: CA cert validation failed: Get \"https://172.20.10.100:9345/cacerts\": x509: ce
Oct 07 09:32:42 k8s-rancher3 systemd[1]: rke2-server.service: main process exited, code=exited, status=1/FAILURE
Oct 07 09:32:42 k8s-rancher3 systemd[1]: Failed to start Rancher Kubernetes Engine v2 (server).

I'm not using NetworkManager, which is what causes that complaint about nm-cloud-setup. But the complaint about cacerts is probably related to why my certificate doesn't match.

After a while, the installation starts anyway and I see the third master initialize and join the cluster of masters.

@jostmart

jostmart commented Oct 24, 2022

This error still exists in rke2_version: v1.25.3+rke2r1 on an RPM-based OS. @brandond, could you give me some insight into why the token from the configuration file is not picked up by the initial server?

@brandond
Member

brandond commented Oct 24, 2022

why the token from the configuration file is not picked up by the initial server?

You can't change the passphrase portion of the token after the fact. I can't really follow the sequence that you described above, but if you have a short (passphrase-only) token set in the config file at the initial startup of the cluster, that token will be honored. If you want to later update the short token to a full K10 format token after the CA hash has been calculated, that's fine, but it doesn't really make a difference.

Note that it only makes sense to set the short token on the first server, not a full K10 format token. Since the first server to start up generates new CA certificates, and the full token includes a hash of the CA certificates, there is no way that your manually generated K10 token's embedded hash is ever going to match the actual hash of the certificates.
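To illustrate the distinction (a sketch with made-up values, assuming a generic setup rather than the exact configs from this thread):

# First server: only a short passphrase makes sense here, because the CA
# certificates (and therefore the hash embedded in a K10 token) don't exist yet.
cat /etc/rancher/rke2/config.yaml
token: my-shared-passphrase
tls-san:
  - 172.20.10.100

# Additional servers and agents: either the same short passphrase, or the full
# K10 token that the first server writes to /var/lib/rancher/rke2/server/node-token
# once its CA has been generated.
cat /etc/rancher/rke2/config.yaml
server: https://172.20.10.100:9345
token: my-shared-passphrase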

@jostmart

So after switching to a passphrase instead of a full token for the token parameter, things work. Thanks for the clarification about the difference between a K10-format token and a passphrase for the token parameter!

@caroline-suse-rancher
Contributor

Closing, as this does not appear to be a bug with rke2.
