
Non-backwards compatible change: CP using NLB #90

Open
bellis-ai opened this issue Aug 8, 2023 · 9 comments

Comments

@bellis-ai

In a recent update, the control plane was changed to use an NLB instead of a classic load balancer. Those upgrading the module to the latest version will hit the following failure sequence:

  • CP Loadbalancer is upgraded to NLB
  • Target groups are created for the NLB
  • The autoscaling group has "ignore_changes" set on its "load_balancer" and "target_groups" properties, so it ignores the change to the new load balancer.
  • Because the autoscaling group never picks up the new load balancer settings, instances are not registered with the NLB's target groups and the control plane fails.

Not sure how to fix.
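For context, the ignore_changes behaviour described above corresponds to a Terraform lifecycle block roughly like this (a sketch only; the module's actual resource and attribute names may differ):

```hcl
resource "aws_autoscaling_group" "control_plane" {
  # ... name, launch template, subnets, etc. ...

  lifecycle {
    # Drift in these attributes is deliberately ignored, which is why
    # Terraform never attaches the ASG to the new NLB's target groups.
    ignore_changes = [load_balancers, target_group_arns]
  }
}
```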

@bellis-ai
Author

Looks like it's a matter of just changing the autoscaling group to use the new NLB, importing it back into state, and then adding the security group (name ends in -cp) to each control plane instance.
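A sketch of that sequence with the Terraform and AWS CLIs; the resource address, ASG name, and target group ARN below are placeholders, not the module's real identifiers:

```shell
# Re-read the real ASG into state after pointing its configuration at
# the new NLB target groups.
terraform state rm aws_autoscaling_group.control_plane
terraform import aws_autoscaling_group.control_plane my-cluster-cp-asg

# Alternatively, attach the target group out-of-band with the AWS CLI:
aws autoscaling attach-load-balancer-target-groups \
  --auto-scaling-group-name my-cluster-cp-asg \
  --target-group-arns arn:aws:elasticloadbalancing:...:targetgroup/cp/abc123
```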

@adamacosta
Collaborator

adamacosta commented Aug 11, 2023

I haven't yet worked out how to handle a graceful migration to the NLB. Beware that if you just update in place, the new load balancer will have a different DNS name from the old one, which will invalidate the server certificate served by the Kube api-server: that certificate has the old load balancer's DNS name in its SAN list, placed there automatically by our module.
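A quick way to see which names are in the SAN list of the certificate the api-server currently serves (the host below is a placeholder for your load balancer's DNS name):

```shell
# Print the Subject Alternative Name list of the served certificate.
openssl s_client -connect old-clb.example.com:6443 </dev/null 2>/dev/null \
  | openssl x509 -noout -text \
  | grep -A1 'Subject Alternative Name'
```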

I believe, but have not yet tested, that what you have to do is:

  • Create the new NLB outside of Terraform first
  • Grab the DNS name from AWS and add it to the tls-san list in /etc/rke2/config.yaml on each of the control plane nodes
  • Cycle rke2 on the control plane nodes to generate a new certificate that will include this
  • Import the load balancer into Terraform state
  • Then do the rest of what you're saying above

Alternatively, if you have a custom URL and DNS record for the api-server and have already included that in the tls-san list, none of this will matter.
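For step 2 above, the edit is just appending the NLB's DNS name under tls-san in /etc/rke2/config.yaml; a sketch with placeholder DNS names:

```yaml
# /etc/rke2/config.yaml (fragment; both DNS names are placeholders)
tls-san:
  - old-clb-1234567890.us-east-1.elb.amazonaws.com        # existing entry
  - new-nlb-0123456789abcdef.elb.us-east-1.amazonaws.com  # new NLB's name
```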

@bellis-ai
Author

Thank you so much! I was just encountering this problem when trying to cycle out the old master nodes -- none were joining the cluster! I'll try this now.

@bellis-ai
Author

@adamacosta When you say

> Cycle rke2 on the control plane nodes to generate a new certificate that will include this

what do you mean exactly? Restart the systemd service? How do I cycle rke2? I'm not very experienced with manual deployment of RKE2, so I'd like to know exactly what I need to restart.

@adamacosta
Collaborator

adamacosta commented Aug 11, 2023

Yes, run systemctl restart rke2-server on each control plane node after editing the config.yaml file. That should generate a new certificate that includes the added tls-san entry for the new load balancer. New nodes should be able to join after that.
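On each node that looks like the following; the verification step assumes the default RKE2 data directory:

```shell
# After editing /etc/rke2/config.yaml on this node:
sudo systemctl restart rke2-server

# Confirm the regenerated serving certificate includes the new SAN:
sudo openssl x509 -noout -text \
  -in /var/lib/rancher/rke2/server/tls/serving-kube-apiserver.crt \
  | grep -A1 'Subject Alternative Name'
```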

@bellis-ai
Author

I feel like something's missing. Any connection to port 9345 after the config change and rke2-server restart results in a TLS error ("SSL23_GET_SERVER_HELLO"); connections to 6443 still go through. I feel like there's a cert I'm missing here...

@bellis-ai
Author

So it looks like the changes are indeed propagated to serving-kube-apiserver.crt, but whatever certificate the supervisor is serving does not change. I have no idea how to force it to.
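One way to confirm this is to compare the certificate served on each port (the host below is a placeholder for a control plane node):

```shell
# Compare the SANs served by the api-server (6443) and supervisor (9345).
for port in 6443 9345; do
  echo "== port $port =="
  openssl s_client -connect cp-node.example.com:$port </dev/null 2>/dev/null \
    | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
done
```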

@bellis-ai
Author

Figured it out. You have to invalidate the cached certificate data by deleting /var/lib/rancher/rke2/server/tls/dynamic-cert.json. No idea why this isn't done automatically when the certificate data is different.
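So the complete per-node fix, putting this thread together (paths assume the default RKE2 data directory):

```shell
# 1. Add the new NLB's DNS name to tls-san in /etc/rke2/config.yaml.
# 2. Drop the cached dynamic listener certificate so it is regenerated:
sudo rm /var/lib/rancher/rke2/server/tls/dynamic-cert.json
# 3. Restart so both serving and supervisor certs pick up the new SAN:
sudo systemctl restart rke2-server
```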

@adamacosta
Collaborator

Hey, thanks for figuring that out. Apologies for not following this more closely. I did get around to trying it, and it worked fine in terms of hitting the api-server, but I only ran it on a single host, so the supervisor process would have been unused anyway. I'm not going to close this right away, because we should capture this in a real migration doc.
