Getting container logs has TLS issues, reconfiguration breaks etcd, situation unclear #6679
Comments
While researching more I think #3551 (comment) is similar. Now the question would be how to actually migrate existing clusters to use all those arguments…
I tried to migrate the current setup to the new setup by stopping 2 of 3 control-plane nodes, then changing the commandline args on the last one to use those arguments. Sadly control-plane-0 (also with node-ip/advertise-ip/external-ip) could not join the cluster anymore with error:
So it looks like cp-1 was still part of the etcd cluster and did not get removed… but before cp-0 can join the etcd again, it needs to be in a healthy state.
Next thing I'll probably try is to completely kill the control plane, restore an etcd backup, and then try to make it HA again.
K3s issues kubelet certificates using the addresses that it is aware of when the node starts up. If you have something like the hetzner cloud provider that is adding or changing the node addresses while initializing the node, the kubelet cert obviously won't be valid for those, as they were not known at the time the certificate was requested. You should ensure that all the addresses are provided as --node-ip or --node-external-ip so that the SANs on the certs will match the addresses set by the cloud provider.
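As an illustration of that suggestion, a minimal sketch of the relevant server flags (the flag names are real k3s options; the concrete values are placeholders based on this cluster's addressing, not taken from the actual setup):

```
# Sketch: make both the private and the public address known to k3s at startup,
# so the kubelet serving certificate is issued with SANs covering them.
k3s server \
  --node-ip 10.0.1.1 \
  --node-external-ip <public-ipv4> \
  --advertise-address 10.0.1.1
```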
Yeah well, it somehow all worked with k3s 1.21, apparently because the surrounding setup did not change. Something in 1.22+ changed how IP addresses are used. But I'll try to migrate the setup to one which explicitly sets the IP addresses via cli params. The only hurdle currently is the migration of the etcd, but I'll try out changing it to a single-node one using …
Okay, so I tested and played around with this a lot now. The good thing is that k3s is not too complex. The migration strategy from the "wrong" IP setup with an HA control plane looks roughly like this:
In the end etcd should talk via private IPs and the certs will also contain the private IPs.
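The individual steps are not quoted here, but roughly, such a migration can be sketched as follows (the flags are real k3s options; the exact sequence is an outline, not a verbatim recipe):

```
# Rough outline, run on the last remaining server; adjust for your setup.
systemctl stop k3s
k3s server --cluster-reset        # shrink the embedded etcd back to a single member
# add --node-ip / --advertise-address / --node-external-ip to the k3s args, then:
systemctl start k3s
# finally re-join control-plane-1/2 with the same IP flags and --server https://10.0.1.1:6443
```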
Environmental Info
K3s Version: k3s version v1.23.15+k3s1 (50cab3b) (on control-plane-0; v1.23.14+k3s1 on control-plane-1 and -2)
Update: I could reproduce the container logs issue also on v1.22 control-plane clusters already
Node(s) CPU architecture, OS, and Version:
Linux core-control-plane-0 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
3 Servers
11 Agents
Each server has private and public IPv4 and IPv6 addresses. They are Hetzner Cloud servers and k3s is installed with cloud-init and curl -sfL https://get.k3s.io with the "right" parameters.
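For context, parameters are usually handed to that install script via environment variables; a minimal sketch (the values below are illustrative placeholders, not the actual cloud-init contents):

```
# Sketch: how flags are commonly passed to the get.k3s.io installer.
curl -sfL https://get.k3s.io | \
  INSTALL_K3S_EXEC="server --node-ip 10.0.1.1 --node-external-ip <public-ipv4>" sh -s -
```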
The Bug/Story

I'm not sure if it's a single bug or where exactly it is hiding… but here's the story so far:
I have a weird situation which involves TLS certificates, Node-IPs and etcd.
This setup was now working pretty well for some time, but now I noticed some strange behaviour:
The three control plane servers have private IPs (10.0.1.1, 10.0.1.2, 10.0.1.3) next to the public ipv4s and ipv6s.
Getting container logs from control-plane servers leads to a TLS issue
Yesterday while I was upgrading from v1.22 to v1.23.15 I noticed some strange behavior when using kubectl to get container logs from pods running on control-plane servers. This issue doesn't exist on the agents (I updated one agent to 1.23 to test it):
So I was digging and found out that I only pass --node-ip to the agent servers on startup, but not to the control-plane servers. Maybe that leads to the problem that they don't use that IP for their certificates?

Doing a
openssl s_client -connect 10.0.1.1:10250 < /dev/null | openssl x509 -text
shows me that the SANs are indeed only: DNS:core-control-plane-0, DNS:localhost, IP Address:127.0.0.1, IP Address:<public-ipv4>, IP Address:2A01:4F8:1C1C:E8D2:0:0:0:1
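The same SAN list can also be read straight off the kubelet serving certificate on the node itself (assuming the default k3s data-dir; the path may differ on a customized install):

```
# Inspect the kubelet serving cert's SANs on disk (default k3s location assumed).
sudo openssl x509 -noout -text \
  -in /var/lib/rancher/k3s/agent/serving-kubelet.crt | grep -A1 'Subject Alternative Name'
```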
So I thought… Aha, let's add --node-ip to the startup of the control-plane servers. Then the next strange thing happened:

Adding node-ip to an existing control-plane server makes it unable to join etcd again
So I added --node-ip 10.0.1.1 to the commandline args (systemctl edit k3s.service --full and restarting it).
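In unit terms the edit amounts to something like this (a sketch; the real ExecStart generated by the installer contains more arguments than shown here):

```
# Excerpt of k3s.service after "systemctl edit k3s.service --full" (sketch only):
ExecStart=/usr/local/bin/k3s \
    server \
    --node-ip 10.0.1.1
# then: systemctl daemon-reload && systemctl restart k3s
```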
On control-plane-1 I can now see this:

On control-plane-0 (where I added the node-ip) I see those logs:
(which repeat multiple times.)
Full startup logs until repeating error
Removing node-ip from the k3s args and restarting makes it join the etcd cluster again.
So that's my first big question: Is there a migration path to using --node-ip on servers?

--tls-san doesn't help
I also tried adding --tls-san to that server's startup commands but that didn't fix getting the logs. Maybe that is only evaluated on cluster-init?

Recreation instead of Reconfiguration fails differently
I also tried to add control-plane-0 as a fresh member to the cluster (deleting the server and creating it again, but now already using --node-ip and also --advertise-address, both pointing to the private ipv4).

Full log of control-plane-0
The interesting part in the control-plane-1 logs looks like this:
So I wonder: why does control-plane-0 register in etcd with its public-ipv4? I found this comment which says it should not be the case: #2533 (comment)
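One way to check which peer URLs etcd actually has registered for each member is to query the embedded etcd directly (etcdctl is not bundled with k3s and has to be installed separately; the certificate paths below are the k3s defaults):

```
# List etcd members and the peer URLs they registered with.
sudo ETCDCTL_API=3 etcdctl member list -w table \
  --endpoints https://127.0.0.1:2379 \
  --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key
```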
So I guess the setup somehow thinks that the public IPs are the private ones?