
[Bug]: Job for k3s.service randomly fails during terraform deploy #1073

Closed
tschuering opened this issue Oct 31, 2023 · 11 comments
Labels
bug Something isn't working

Comments

@tschuering
Contributor

Description

Hey guys, I am new to the kube-hetzner project and first I want to say: great work, we can finally move away from spending far too much money on EKS! :)

I like to automate certain things, which is why we are currently facing the following issue. I set up a very simple k3s cluster via your example kube.tf, and the deployment randomly hangs with "Job for k3s.service failed because the control process exited with error code." after, for example, I destroy the cluster and redeploy it. I always checked that everything in the Hetzner cloud project had been deleted. This happens purely at random: sometimes the deploy goes through without a single error, sometimes it just doesn't.

The goal is to have our own Terraform wrapper module around yours for internal reuse, so we can create clusters in Hetzner cloud the way we want and later attach different Helm stacks on top. We would like to automate that module so that deploy and destroy actually run whenever changes are merged into the internal module in GitLab.

The Helm stacks will later use that internal module as a requirement and should deploy automatically, even across different branches used for testing.

You can also see in the screenshot the error that prevents the systemd service from starting on the first cluster node. I am using the latest v2.9.0 release. If you have any hint on where else to look when this error happens, let me know; I am pretty new to k3s.

I am posting only the module "kube-hetzner" part, as the rest is the same standard stuff from the example kube.tf.

Kube.tf file

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token
  source = "kube-hetzner/kube-hetzner/hcloud"
  ssh_public_key = file("...")
  ssh_private_key = file("...")
  control_plane_nodepools = [
    {
      name        = "control-plane-fsn1",
      server_type = "cax21",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "control-plane-nbg1",
      server_type = "cax21",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "control-plane-hel1",
      server_type = "cax21",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    {
      name        = "agent",
      server_type = "cpx11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 3
    },
  ]

  load_balancer_type     = "lb11"
  load_balancer_location = "fsn1"

  ingress_controller = "nginx"
  initial_k3s_channel = "v1.27"
  cluster_name = "test"
  restrict_outbound_traffic = false
  cni_plugin = "cilium"
  enable_cert_manager = true
  dns_servers = ["1.1.1.1", "1.0.0.1"]
}

Screenshots

Screenshot 2023-10-31 at 04:25:49

Platform

Mac

tschuering added the bug label on Oct 31, 2023
@Silvest89
Contributor

@mysticaltech any idea? It seems it cannot connect to the main control plane node.

@mysticaltech
Collaborator

@Silvest89 No idea. That seems like a simple enough cluster; it should work.

@tschuering Please see the debug section in the README; basically you need to SSH into the failing node and have a look at the k3s logs.

Please do so and let us know.
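
For example, assuming root SSH access with the key configured in kube.tf (the node IP is a placeholder):

# SSH into the control plane node whose k3s.service failed
ssh root@<node-public-ip>

# Check why the unit failed and read its logs
systemctl status k3s
journalctl -u k3s -e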

@tschuering
Contributor Author

@mysticaltech @Silvest89 Hey, I overlooked something when I stripped down my kube.tf to post it here, and it was also the cause of this error, although I do not yet understand why it is an issue; it is also an important setting for our use case.

I had set:
firewall_kube_api_source = ["X.X.X.X/32"]

This is our own company VPN. If I unset it, everything works fine and I can delete and deploy again and again; if I set it, the deploy fails as above, although a second run after the initial failed deploy goes through. Maybe that is enough of a hint for you, because the k3s logs on the control plane did not output more than what I posted above. The VPN IP is definitely correct: after deploying without the setting and redeploying with it, I can only reach the Kube API when connected to the VPN.

To be honest, I am not deep enough into k3s to understand why this causes an issue. Maybe try it on your side by setting this option for an initial deployment and see what happens.

@mysticaltech
Collaborator

@tschuering So during deployment we need to use kubectl, that's why. However, right after deployment you can change it to your own VPN IP and that's it.

To be extra clear, you need to change the value and apply again after the initial deployment; that will reconfigure the firewall for you. You never need to touch the firewall manually.
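
In other words, the sequence looks roughly like this (the CIDR is just the placeholder from this thread):

# First apply: leave firewall_kube_api_source unset in kube.tf so the module's
# own kubectl/provisioning steps can reach the Kube API while the cluster bootstraps.
terraform apply

# Then edit kube.tf to restrict the API to your VPN, e.g.
#   firewall_kube_api_source = ["X.X.X.X/32"]
# and apply again; the module reconfigures the Hetzner firewall for you.
terraform apply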

@tschuering
Contributor Author

Yeah, I totally understand the second part. But I do not yet understand why, because I thought that when you use kubectl you are working on the control plane machine via SSH, which is not affected by the firewall at all. Or are you querying kubectl via the external IP rather than the internal one? I probably have some misconception here.

@tschuering
Contributor Author

@mysticaltech @Silvest89 The same error also appears when autoscaler_nodepools is predefined and you run a first deployment against an empty Hetzner cloud project. I have commented mine out for now, but yeah. I get the feeling that the SSH firewalling is not the issue here, but rather some ordering of when Terraform resources are executed and certain variables are set in the config. 😃

@tschuering
Contributor Author

tschuering commented Nov 3, 2023

I might have found the issue. When k3s.service is started on the first node, the server value in /etc/rancher/k3s/config.yaml is set to the second control plane node, which it surely cannot connect to because that node is not running yet, since k3s.service has not been started there. Am I right or wrong? When I remove the server value, k3s.service on the first node starts.

What should this value be set to? On the other nodes it is set to the first node's address ending in .255. But when it is set on the first node, the service does not start; when I remove the value, the service starts fine.

@mysticaltech @Silvest89 Any idea? Please note that I grepped out the token when I cat'ed the config. From what I can see, the server value is not set anywhere in your code either? See the second screenshot. And on the second Terraform run it also detects that a server value is set and removes it. I am very confused.

From https://docs.k3s.io/cli/server
--server value, -s value (cluster) Server to connect to, used to join a cluster [$K3S_URL]

Screenshot 2023-11-03 at 03:55:37

Screenshot 2023-11-03 at 04:00:12

@mysticaltech
Collaborator

mysticaltech commented Nov 3, 2023

@tschuering Thanks for your debugging. This is correct: the first control plane starts alone at first, then its config.yaml is changed to point to the second control plane, if available, to provide HA. This is only used for re-initialization if ever needed. The changes you made in the PR look good. If that fixes your issue, then perfect. Will test and release as soon as it is validated.
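
If anyone wants to verify this on their own cluster, a quick check on the first control plane node could look like this (the peer address is purely illustrative):

# Show the join-related entries of the k3s config
grep -E '^(server|token):' /etc/rancher/k3s/config.yaml
# server: https://<second-control-plane-ip>:6443   <- hypothetical output
# token: <redacted>
# If that peer is not running yet, k3s.service on this node cannot join it;
# without the server entry, the first node bootstraps the cluster on its own.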

@andinger

andinger commented Nov 6, 2023

I'm facing the same issue ... can't wait for the fix to be released :-) I have to create a new cluster ...

@gor181

gor181 commented Nov 6, 2023

Yep, same issue here. Will wait for the release. Appreciate the hard work, guys.

mysticaltech added a commit that referenced this issue on Nov 6, 2023
@mysticaltech
Collaborator

The fix by @tschuering was just released in v2.9.2.

Run terraform init -upgrade and apply again to deploy.
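
That is, from your Terraform project directory:

terraform init -upgrade
terraform apply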
