
[Bug]: Job for k3s.service randomly fails during terraform deploy #1073

Closed
tschuering opened this issue Oct 31, 2023 · 11 comments
Labels
bug Something isn't working

Comments

@tschuering
Contributor

Description

Hey guys, I am new to the kube-hetzner project and first I want to say: great work, we can finally move away from spending far too much money on EKS! :)

I like to automate certain things, which is why we are currently facing the following issue. I set up a very simple k3s cluster via your example kube.tf, and the deployment randomly hangs with "Job for k3s.service failed because the control process exited with error code." after, for example, I destroy the cluster and redeploy it. I always checked that everything in the Hetzner cloud project had been deleted. This happens purely at random: sometimes the deploy goes through without a single error, sometimes it just doesn't.

The goal is to have our own Terraform wrapper module around yours for internal reuse, so we can create clusters in Hetzner cloud the way we want and later attach different Helm stacks on top. We would like to automate that module so that deploy and destroy actually run whenever changes are merged into the internal module in GitLab.

The Helm stacks will later use that internal module as a requirement and should deploy automatically, even across different branches used for testing.

You can also see in the screenshot the error that prevents the systemd service from starting on the first cluster node. I am using the latest v2.9.0 release. If you have any hint on where else to look when this error happens, let me know; I am pretty new to k3s.

I am posting only the module "kube-hetzner" part, as the rest is the same standard stuff from the example kube.tf.

Kube.tf file

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token
  source = "kube-hetzner/kube-hetzner/hcloud"
  ssh_public_key = file("...")
  ssh_private_key = file("...")
  control_plane_nodepools = [
    {
      name        = "control-plane-fsn1",
      server_type = "cax21",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "control-plane-nbg1",
      server_type = "cax21",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "control-plane-hel1",
      server_type = "cax21",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    {
      name        = "agent",
      server_type = "cpx11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 3
    },
  ]

  load_balancer_type     = "lb11"
  load_balancer_location = "fsn1"

  ingress_controller = "nginx"
  initial_k3s_channel = "v1.27"
  cluster_name = "test"
  restrict_outbound_traffic = false
  cni_plugin = "cilium"
  enable_cert_manager = true
  dns_servers = ["1.1.1.1", "1.0.0.1"]
}

Screenshots

Screenshot 2023-10-31 at 04:25:49

Platform

Mac

tschuering added the bug label on Oct 31, 2023
@Silvest89
Contributor

@mysticaltech any idea? It seems it cannot connect to the main control plane node.

@mysticaltech
Collaborator

@Silvest89 No idea. That seems like a simple enough cluster; it should work.

@tschuering Please see the debug section in the README; basically you need to SSH into the failing node and have a look at the k3s logs.

Please do so and let us know.
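
For example, assuming root SSH access with the key configured in kube.tf (the node IP is a placeholder):

# SSH into the control plane node whose k3s.service failed
ssh root@<node-public-ip>

# Check why the unit failed and read its logs
systemctl status k3s
journalctl -u k3s -e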

@tschuering
Contributor Author

@mysticaltech @Silvest89 Hey, I overlooked something when I stripped down my kube.tf to post it here, and it was also the cause of this error, although I do not yet understand why it is an issue; it is also an important setting for our use case.

I had set:
firewall_kube_api_source = ["X.X.X.X/32"]

This is our own company VPN. If I unset it, everything works fine and I can delete and deploy again and again; if I set it, the deploy fails as above, although a second run after the initial failed deploy goes through. Maybe that is enough of a hint for you, because the k3s logs on the control plane did not output more than what I posted above. The VPN IP is definitely correct: after deploying without the setting and redeploying with it, I can only reach the Kube API when connected to the VPN.

To be honest, I am not deep enough into k3s to understand why this causes an issue. Maybe try it on your side by setting this option for an initial deployment and see what happens.

@mysticaltech
Collaborator

@tschuering So during deployment we need to use kubectl, that's why. However, right after deployment you can change it to your own VPN IP and that's it.

To be extra clear, you need to change the value and apply again after the initial deployment; that will reconfigure the firewall for you. You never need to touch the firewall manually.
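
In other words, the sequence looks roughly like this (the CIDR is just the placeholder from this thread):

# First apply: leave firewall_kube_api_source unset in kube.tf so the module's
# own kubectl/provisioning steps can reach the Kube API while the cluster bootstraps.
terraform apply

# Then edit kube.tf to restrict the API to your VPN, e.g.
#   firewall_kube_api_source = ["X.X.X.X/32"]
# and apply again; the module reconfigures the Hetzner firewall for you.
terraform apply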

@tschuering
Contributor Author

Yeah, I totally understand the second part. But I do not yet understand why, because I thought that when you use kubectl you are working on the control plane machine via SSH, which is not affected by the firewall at all. Or are you querying kubectl via the external IP rather than the internal one? I probably have some misconception here.

@tschuering
Contributor Author

@mysticaltech @Silvest89 The same error also appears when autoscaler_nodepools is predefined and you run a first deployment against an empty Hetzner cloud project. I have commented mine out for now, but yeah. I get the feeling that the SSH firewalling is not the issue here, but rather some ordering of when Terraform resources are executed and certain variables are set in the config. 😃

@tschuering
Contributor Author

tschuering commented Nov 3, 2023

I might have found the issue. When k3s.service is started on the first node, the server value in /etc/rancher/k3s/config.yaml is set to the second control plane node, which it surely cannot connect to because that node is not running yet, since k3s.service has not been started there. Am I right or wrong? When I remove the server value, k3s.service on the first node starts.

What should this value be set to? On the other nodes it is set to the first node's address ending in .255. But when it is set on the first node, the service does not start; when I remove the value, the service starts fine.

@mysticaltech @Silvest89 Any idea? Please note that I grepped out the token when I cat'ed the config. From what I can see, the server value is not set anywhere in your code either? See the second screenshot. And on the second Terraform run it also detects that a server value is set and removes it. I am very confused.

From https://docs.k3s.io/cli/server
--server value, -s value (cluster) Server to connect to, used to join a cluster [$K3S_URL]

Screenshot 2023-11-03 at 03:55:37

Screenshot 2023-11-03 at 04:00:12

@mysticaltech
Collaborator

mysticaltech commented Nov 3, 2023

@tschuering Thanks for your debugging. This is correct: the first control plane starts alone at first, then its config.yaml is changed to point to the second control plane, if available, to provide HA. This is only used for re-initialization if ever needed. The changes you made in the PR look good. If that fixes your issue, then perfect. Will test and release as soon as it is validated.
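
If anyone wants to verify this on their own cluster, a quick check on the first control plane node could look like this (the peer address is purely illustrative):

# Show the join-related entries of the k3s config
grep -E '^(server|token):' /etc/rancher/k3s/config.yaml
# server: https://<second-control-plane-ip>:6443   <- hypothetical output
# token: <redacted>
# If that peer is not running yet, k3s.service on this node cannot join it;
# without the server entry, the first node bootstraps the cluster on its own.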

@andinger

andinger commented Nov 6, 2023

I'm facing the same issue ... can't wait for the fix to be released :-) I have to create a new cluster ...

@gor181

gor181 commented Nov 6, 2023

Yep, same issue here. Will wait for the release. Appreciate the hard work, guys.

mysticaltech added a commit that referenced this issue on Nov 6, 2023
@mysticaltech
Collaborator

The fix by @tschuering was just released in v2.9.2.

Run terraform init -upgrade and apply again to deploy.
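
That is, from your Terraform project directory:

terraform init -upgrade
terraform apply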
