
Token CA hash does not match the Cluster CA certificate hash for Quick Start Guide #3214

Closed
jordan-lumley opened this issue Aug 4, 2022 · 21 comments

Comments

@jordan-lumley

jordan-lumley commented Aug 4, 2022

Environmental Info:
RKE2 Version: rke2 version v1.23.9+rke2r1 (2d206eb)
go version go1.17.5b7

Node(s) CPU architecture, OS, and Version: 3 VM nodes under a Proxmox controller.

Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal

Linux pos01.prd 5.13.0-40-generic #45~20.04.1-Ubuntu SMP Mon Apr 4 09:38:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: 1 server and 2 worker nodes

Describe the bug:
I am following the quick start guide step by step and keep running into CA cert errors. I have spent 4 days destroying the VMs and starting from scratch, time and time again. Sometimes worker node 1 will join but node 2 will fail with these errors. At the time of writing this ticket, worker node 1 is getting the error but node 2 joined successfully.

Steps To Reproduce:
Just following the quick start guide.

  • Installed RKE2:
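For reference, this is roughly what the quick start guide prescribes; the exact commands that were run are not shown in the issue, so treat this as an assumed reconstruction (the token placeholder is illustrative):

# Server node (per the quick start guide):
curl -sfL https://get.rke2.io | sh -
systemctl enable rke2-server.service
systemctl start rke2-server.service

# Agent nodes (assumed): point the agent config at the server, then install and start the agent
mkdir -p /etc/rancher/rke2
cat <<'EOF' > /etc/rancher/rke2/config.yaml
server: https://192.168.100.39:9345
token: <contents of /var/lib/rancher/rke2/server/node-token from the server>
EOF
curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE="agent" sh -
systemctl enable rke2-agent.service
systemctl start rke2-agent.service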

Expected behavior:
The nodes to join successfully.

Actual behavior:

-- Logs begin at Thu 2022-08-04 16:33:11 UTC, end at Thu 2022-08-04 18:46:35 UTC. --
Aug 04 18:15:41 pos02.prd systemd[1]: Starting Rancher Kubernetes Engine v2 (agent)...
Aug 04 18:15:41 pos02.prd sh[48430]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Aug 04 18:15:41 pos02.prd sh[48431]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Aug 04 18:15:41 pos02.prd rke2[48435]: time="2022-08-04T18:15:41Z" level=warning msg="not running in CIS mode"
Aug 04 18:15:41 pos02.prd rke2[48435]: time="2022-08-04T18:15:41Z" level=info msg="Starting rke2 agent v1.23.9+rke2r1 (2d206eba8d018>
Aug 04 18:15:41 pos02.prd rke2[48435]: time="2022-08-04T18:15:41Z" level=info msg="Running load balancer rke2-agent-load-balancer 12>
Aug 04 18:15:41 pos02.prd rke2[48435]: time="2022-08-04T18:15:41Z" level=error msg="token CA hash does not match the Cluster CA cert>
Aug 04 18:15:43 pos02.prd rke2[48435]: time="2022-08-04T18:15:43Z" level=error msg="token CA hash does not match the Cluster CA cert>
Aug 04 18:15:45 pos02.prd rke2[48435]: time="2022-08-04T18:15:45Z" level=error msg="token CA hash does not match the Cluster CA cert>
Aug 04 18:15:47 pos02.prd rke2[48435]: time="2022-08-04T18:15:47Z" level=error msg="token CA hash does not match the Cluster CA cert>
Aug 04 18:15:49 pos02.prd rke2[48435]: time="2022-08-04T18:15:49Z" level=error msg="token CA hash does not match the Cluster CA cert>
Aug 04 18:15:51 pos02.prd rke2[48435]: time="2022-08-04T18:15:51Z" level=error msg="token CA hash does not match the Cluster CA

Additional context / logs:
The logs above are from the beginning of the journalctl output on the 1st worker node ONLY. The full error message is:
Aug 04 18:47:11 pos02.prd rke2[48435]: time="2022-08-04T18:47:11Z" level=error msg="token CA hash does not match the Cluster CA certificate hash: 537e7956a582a6ac14a6e0c5eb7961030efd8c9590550b5b580230e213237868 != 5dc21fa3143eea7ca48f468cdfb5c36553b8b0e184c7fbb9db9d1777d66bfed1"

@brandond
Member

brandond commented Aug 4, 2022

What is the --server URL set to on the agents? Did you put an SSL-terminating load-balancer in front of your server, or does the URL point directly at the server node?

@jordan-lumley
Author

jordan-lumley commented Aug 4, 2022

What is the --server URL set to on the agents? Did you put an SSL-terminating load-balancer in front of your server, or does the URL point directly at the server node?

config.yaml:

server: https://192.168.100.39:9345 
token: K10537e7956a582a6ac14a6e0c5eb7961030efd8c9590550b5b580230e213237868::server:57b288f04abf15074b85b2fdf5c17403

This is pointing directly to the main server node.

@jordan-lumley
Author

jordan-lumley commented Aug 4, 2022

Also worth noting: over the course of the week and the multiple attempts, either node no. 2 or node no. 3 will fail to join with errors such as this one. This time, node no. 3 joined perfectly fine.

@jordan-lumley
Author

Not sure if this will help, but here is the output of curl -kv https://192.168.100.39:9345 (the server node).

From worker 1 (failing to join):

*   Trying 192.168.100.39:9345...
* TCP_NODELAY set
* Connected to 192.168.100.39 (192.168.100.39) port 9345 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: O=rke2; CN=rke2
*  start date: Mar  4 15:48:59 2022 GMT
*  expire date: Mar  4 15:48:59 2023 GMT
*  issuer: CN=rke2-server-ca@1646408939
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
> GET / HTTP/1.1
> Host: 192.168.100.39:9345
> User-Agent: curl/7.68.0
> Accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Thu, 04 Aug 2022 19:08:50 GMT
< Content-Length: 19
< 
404 page not found
* Connection #0 to host 192.168.100.39 left intact

From worker node 2 (successful join):

*   Trying 192.168.100.39:9345...
* TCP_NODELAY set
* Connected to 192.168.100.39 (192.168.100.39) port 9345 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_CHACHA20_POLY1305_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: O=rke2; CN=rke2
*  start date: Aug  4 18:01:20 2022 GMT
*  expire date: Aug  4 18:01:20 2023 GMT
*  issuer: CN=rke2-server-ca@1659636080
*  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x557fe121dd80)
> GET / HTTP/2
> Host: 192.168.100.39:9345
> user-agent: curl/7.68.0
> accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
< HTTP/2 404 
< content-type: text/plain; charset=utf-8
< x-content-type-options: nosniff
< content-length: 19
< date: Thu, 04 Aug 2022 19:08:46 GMT
< 
404 page not found
* Connection #0 to host 192.168.100.39 left intact

@brandond
Member

brandond commented Aug 4, 2022

Do you have the exact same token on both agents? Did you copy it directly from the server node? Does the token match the contents of /var/lib/rancher/rke2/server/token on the server?

What do you get from curl -ks https://192.168.100.39:9345/cacerts | sha256sum on both agents? The sha256sum should match the bit immediately following K10 in the token value: 537e7956a582a6ac14a6e0c5eb7961030efd8c9590550b5b580230e213237868 - if it does not, then you don't have the correct token for your server.
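Spelled out as a quick check (a sketch based on the commands above; the grep/sed pipeline at the end is just one assumed way to extract the embedded hash, not something from the RKE2 docs):

# On the server: the authoritative token, including the embedded CA hash
cat /var/lib/rancher/rke2/server/token

# On each agent: hash the CA bundle actually served by the server
curl -ks https://192.168.100.39:9345/cacerts | sha256sum

# The full token has the form K10<sha256 of /cacerts>::server:<passphrase>,
# so the sha256sum above should equal the hex string between "K10" and "::server:".
# One way to pull that piece out of the agent's config for comparison:
grep '^token:' /etc/rancher/rke2/config.yaml | sed -e 's/.*K10//' -e 's/::server:.*//'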

@jordan-lumley
Author

I used the same config.yaml on both worker nodes, but the result of the curl command you asked me to run is different.

Screenshots:
Worker 1
image

Worker 2
image

How is that possible?

@brandond
Member

brandond commented Aug 4, 2022

Are these VMs deployed to different environments or something? Are you sure 192.168.100.39 reaches the same host from both nodes? Do you have an HTTP proxy or something else that might be interfering?

@jordan-lumley
Author

jordan-lumley commented Aug 4, 2022

These are VMs, yes. How would you like me to verify that they're hitting the same host? Willing to try whatever you need me to try in order to troubleshoot. We have an external-facing proxy that handles basic web traffic, but this is just a local-to-local connection, so the proxy should be avoided at this point.

@brandond
Member

brandond commented Aug 4, 2022

That's a bit beyond what I can help you troubleshoot; it seems pretty clear that something is wrong with your environment - but you might try just the basics:

  • What is the complete output of curl -vks https://192.168.100.39:9345/cacerts - is there a bit in there about * Uses proxy env variable HTTPS_PROXY == ?
  • If you ssh to that IP from both of the agents, do you end up on the same host?
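A quick sketch of those checks (usernames and commands here are illustrative assumptions, not from the thread):

# On each agent: is anything forcing curl through a proxy?
env | grep -i proxy
curl -vks https://192.168.100.39:9345/cacerts | head

# On each agent: confirm the IP actually reaches the same machine
ssh root@192.168.100.39 hostname   # should print the same hostname from both agents
ip neigh show 192.168.100.39       # compare the MAC address each agent sees for that IP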

@jordan-lumley
Author

I destroyed the 1st worker node and am going to try to rebuild the VM real quick. I'll report back in 5-10 minutes.

@jordan-lumley
Author

jordan-lumley commented Aug 4, 2022

So, surprisingly enough, after the double-digit number of times I deleted the Proxmox VMs and restarted the process over again, this time it worked. However, the proxy won't stay open :( why meeeee lol. Not sure if the join was actually correct if it's showing this now.

From my machine running kubectl proxy:
image

Server:
image

Worker 1:
image

@jordan-lumley
Author

image

@jordan-lumley
Author

Should I open a separate issue for this next problem I'm running into?

@brandond
Member

brandond commented Aug 4, 2022

no, it looks like another aspect of the same thing. Is something going on with the network between the VMs?

@dsyorkd

dsyorkd commented Aug 29, 2022

@jordan-lumley did you figure this out? I'm seeing much the same. The only difference I can tell is that whenever I use node-ip in the server config, pods like coredns and nginx are never able to complete; leaving out node-ip, my cluster starts but uses the wrong adapter/IP for communication.

@CagiHub

CagiHub commented Aug 31, 2022

This one helped me:
SSH into the agent machine and run this command:

systemctl restart rke2-agent.service

Hope it helps someone :)

@jostmart

jostmart commented Oct 6, 2022

I think there is a clarity problem in the documentation. If you add a token to your initial server's /etc/rancher/rke2/config.yaml, that token is not picked up by RKE2. Instead, another token will be generated and written to /var/lib/rancher/rke2/server/node-token.

172.20.10.100 is a load balancer pointing towards 172.20.10.15.

Example from my initial master:

cat /var/lib/rancher/rke2/server/node-token
K10c7d825d6fece904bc2755bd226b99f829cd12a23c7884dd095a6a02f7a03d953::server:7ccd56357e2b6663b846a7493a7124c2

curl -ks https://172.20.10.100:9345/cacerts | sha256sum
c7d825d6fece904bc2755bd226b99f829cd12a23c7884dd095a6a02f7a03d953  -

curl -ks https://172.20.10.15:9345/cacerts | sha256sum
c7d825d6fece904bc2755bd226b99f829cd12a23c7884dd095a6a02f7a03d953  -

cat /etc/rancher/rke2/config.yaml
tls-san:
  - 172.20.10.100

And this is from my second master:

cat /etc/rancher/rke2/config.yaml
server: https://172.20.10.100:9345
token: K10c7d825d6fece904bc2755bd226b99f829cd12a23c7884dd095a6a02f7a03d953::server:7ccd56357e2b6663b846a7493a7124c2
tls-san:
  - 172.20.10.100

When using curl to compare cacerts like above, you need to remember that even an error (empty) output will be piped to sha256sum. The example below looks like a faulty token, but it's in fact an empty reply from a non-existent service:

curl -ks https://172.20.10.16:9345/cacerts | sha256sum
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  -
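That value is simply the SHA-256 of empty input, which you can confirm locally:

printf '' | sha256sum
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  -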

I'd like to know how to force the token on my initial master, because I'm templating config.yml with Ansible and deploying multiple clusters. Like jordan-lumley, I've been reinstalling VMs, though more like triple-digit rollbacks of VM snapshots, to figure out what's wrong with the installation documentation.

Even after figuring out the issue with the token, I still don't get the same ca-cert on my 3 masters:

curl -ks https://172.20.10.15:9345/cacerts
-----BEGIN CERTIFICATE-----
MIIBejCCAR+gAwIBAgIBADAKBggqhkjOPQQDAjAkMSIwIAYDVQQDDBlya2UyLXNl
...
6WhkmwtevTAKBggqhkjOPQQDAgNJADBGAiEAyXfeVDXlQO/vwDFYxuERdhOOMUki
Rzf13NaySaDmflACIQDNjjhHxFS1Dv5VrzH6zSZFULDAqcPXVqKdVbkXVbqbqg==
-----END CERTIFICATE-----

curl -ks https://172.20.10.16:9345/cacerts
-----BEGIN CERTIFICATE-----
MIIBejCCAR+gAwIBAgIBADAKBggqhkjOPQQDAjAkMSIwIAYDVQQDDBlya2UyLXNl
...
6WhkmwtevTAKBggqhkjOPQQDAgNJADBGAiEAyXfeVDXlQO/vwDFYxuERdhOOMUki
Rzf13NaySaDmflACIQDNjjhHxFS1Dv5VrzH6zSZFULDAqcPXVqKdVbkXVbqbqg==
-----END CERTIFICATE-----

And the last one is different:

curl -ks https://172.20.10.17:9345/cacerts
-----BEGIN CERTIFICATE-----
MIIBeDCCAR+gAwIBAgIBADAKBggqhkjOPQQDAjAkMSIwIAYDVQQDDBlya2UyLXNl
...
XmnwTGSh3DAKBggqhkjOPQQDAgNHADBEAiAo1QsU6Ga4QwtJww9pp8mZvQ12PC8f
0cJ6vQeLXOLwVgIgcQzXiMYwLr1iaSuzwv8/xJdtbwOyYqmVRbGoc9Qst0E=
-----END CERTIFICATE-----

Looking at the tokens again on the failing master:

cat /var/lib/rancher/rke2/server/node-token
K10509ef8bd4ce536d2a800c5400051dc0717edad0b52fa4181a259792c8a84dc24::server:1025fb77676a7200f6953e9d0630d618


grep token /etc/rancher/rke2/config.yml
token: K10c7d825d6fece904bc2755bd226b99f829cd12a23c7884dd095a6a02f7a03d953::server:7ccd56357e2b6663b846a7493a7124c2

Why is /var/lib/rancher/rke2/server/node-token populated with another token? I am 100% certain I initialized config.yml with the proper token before starting anything on the master.

This is how I initialized the last master:

4  mkdir -p /etc/rancher/rke2
5  vim /etc/rancher/rke2/config.yaml
6  curl -sfL https://get.rke2.io | sh -
7  cat /etc/rancher/rke2/config.yaml
8  systemctl enable rke2-server.service
9  systemctl start rke2-server.service

Step 5 is where I cut and paste content from a working server.
Step 6 works fine.
Step 7 is where I verify that the content has not changed since step 5.
Step 9 fails.

After the failed step, journalctl tells me:

Oct 07 09:32:42 k8s-rancher3 sh[1870]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Oct 07 09:32:42 rancher3 sh[1870]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Oct 07 09:32:42 k8s-rancher3 rke2[1881]: time="2022-10-07T09:32:42+02:00" level=warning msg="not running in CIS mode"
Oct 07 09:32:42 k8s-rancher3 rke2[1881]: time="2022-10-07T09:32:42+02:00" level=info msg="Starting rke2 v1.24.6+rke2r1 (473cc354adecd235f00f0d80828611b0556b83e5)"
Oct 07 09:32:42 k8s-rancher3 rke2[1881]: time="2022-10-07T09:32:42+02:00" level=fatal msg="starting kubernetes: preparing server: CA cert validation failed: Get \"https://172.20.10.100:9345/cacerts\": x509: ce
Oct 07 09:32:42 k8s-rancher3 systemd[1]: rke2-server.service: main process exited, code=exited, status=1/FAILURE
Oct 07 09:32:42 k8s-rancher3 systemd[1]: Failed to start Rancher Kubernetes Engine v2 (server).

I'm not using NetworkManager, which is what causes that complaint about nm-cloud-setup. But the complaint about cacerts is probably related to why my certificate doesn't match.

After a while, the installation starts anyway and I see the third master initialize and join the cluster of masters.

@jostmart

jostmart commented Oct 24, 2022

This error still exists in rke2_version: v1.25.3+rke2r1 on an RPM-based OS. @brandond, could you give me some insight into why the token from the configuration file is not picked up by the initial server?

@brandond
Member

brandond commented Oct 24, 2022

why the token from the configuration file is not picked up by the initial server?

You can't change the passphrase portion of the token after the fact. I can't really follow the sequence that you described above, but if you have a short (passphrase-only) token set in the config file at the initial startup of the cluster, that token will be honored. If you want to later update the short token to a full K10 format token after the CA hash has been calculated, that's fine, but it doesn't really make a difference.

Note that it only makes sense to set the short token on the first server, not a full K10 format token. Since the first server to start up generates new CA certificates, and the full token includes a hash of the CA certificates, there is no way that your manually generated K10 token's embedded hash is ever going to match the actual hash of the certificates.
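To illustrate the distinction (a sketch with made-up values, assuming a generic setup rather than the exact configs from this thread):

# First server: only a short passphrase makes sense here, because the CA
# certificates (and therefore the hash embedded in a K10 token) don't exist yet.
cat /etc/rancher/rke2/config.yaml
token: my-shared-passphrase
tls-san:
  - 172.20.10.100

# Additional servers and agents: either the same short passphrase, or the full
# K10 token that the first server writes to /var/lib/rancher/rke2/server/node-token
# once its CA has been generated.
cat /etc/rancher/rke2/config.yaml
server: https://172.20.10.100:9345
token: my-shared-passphrase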

@jostmart

So after switching to a passphrase instead of a full token for the token parameter, things work. Thanks for the clarification about the difference between a K10-format token and a passphrase for the token parameter!

@caroline-suse-rancher
Contributor

Closing, as this does not appear to be a bug with rke2.
