
k0sctl is not handling node removal correctly #603

Closed

danielskowronski opened this issue Dec 4, 2023 · 10 comments

Versions

  • k0s: v1.28.3+k0s.0
  • k0sctl: v0.16.0 (7e8c272)

References

Summary

k0sctl silently does not support the situation where a controller entry disappears from spec.hosts in k0sctl.yaml. Additionally, k0sctl treats k0sctl.yaml as the source of truth and does not check every controller. In the worst case this leads to split-brain: two clusters exist where the control plane is supposed to be HA. It is especially visible in workflows like Terraform that expect controller removal to be a single run.

Details

All scenarios assume the starting point is a set of 3 controller VMs (c01, c02, c03) and 4 worker VMs (w01, ...) with static IPs assigned. All VMs are fresh EC2 instances started from the latest Ubuntu AMIs before every scenario. Static IPs ensure that etcd problems are immediately visible. The first controller is always targeted so that any leadership issues are immediately visible.

Cluster health and membership can be verified by running k0s etcd member-list on each VM, or by checking whether each controller can use k0s kubectl to obtain information about the cluster.
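
For reference, a minimal verification sketch along these lines (assuming root SSH access to each controller and k0s in the PATH; c01-c03 are the example hostnames from above):

# Every controller should report the same etcd member list and see all workers.
for host in c01 c02 c03; do
  echo "=== $host ==="
  ssh "root@$host" "k0s etcd member-list"
  ssh "root@$host" "k0s kubectl get nodes -o wide"
done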

Scenario 1: procedure to leave ETCD is followed -> works

A cluster is created using k0sctl apply and verified to be working. If you follow the procedure to k0s etcd leave targeting the leader and execute k0sctl apply right away, it'll fail, but the cluster is left intact.

After that, if you complete the full controller removal by executing k0s stop; k0s reset; reboot on the leader and then run k0sctl apply, it works as expected by adding c01 back to the cluster. All operations are verified to be OK.
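
Condensed, the manual removal steps used in this scenario look roughly like this (a sketch, assuming the commands are run as root on the controller being removed, c01):

k0s etcd leave   # remove this controller from the etcd cluster first
k0s stop         # stop the k0s service
k0s reset        # wipe k0s state from the host
reboot           # reboot as part of the reset procedure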

Scenario 2: controller is removed from k0sctl.yaml without any additional operations -> fails, but not catastrophically

A cluster is created using k0sctl apply and verified to be working. If you remove the leader c01 from the YAML file and run k0sctl apply, it looks as if it was removed from the cluster (e.g. the log says "2 controllers in total").

However, all 3 controllers remain in the cluster and there's no outage.

If you re-add the c01 entry to the hosts list in the same form as before and run k0sctl apply, it is "added back" to the cluster (the log says "3 controllers in total"). Nothing has changed: the cluster still has 3 controllers and works fine.

Scenario 3: controller is externally wiped and k0sctl runs on unchanged file -> breaks cluster

A cluster is created using k0sctl apply and verified to be working. The leader VM is destroyed by external means; the user may not be aware of that. A fresh VM now exists with the same IP address (the hostname may be different). k0sctl apply is executed on a YAML file that either wasn't changed, or was changed but with the ssh section intact (i.e. hostname and environment could have changed, for example when Terraform rebuilds an EC2 instance). The effect is the following:

  • c01, which is the VM that was destroyed previously, is still considered by k0sctl to be the leader; since it's empty, k0sctl installs a new cluster there and attempts to join c02+c03, which fails; the etcd installed there only recognizes c01 as a member; kubectl shows no workers
  • c02 and c03 still form a cluster that has the old c01 membership registered; this is easy to diagnose if you set a globally unique hostname when the OS is installed (e.g. EC2 resource ID or VM birthdate); the cluster on c02+c03 can't talk to c01 because the etcd clusters are different; kubectl shows all workers
  • for the whole setup, following https://docs.k0sproject.io/v1.28.4+k0s.0/high-availability/, there is now a load balancer pointing to 3 VMs: c01, c02 and c03 - all "healthy" - so traffic directed at the k8s control plane (both end-user traffic and traffic originating from k0s controllers) randomly hits either the fresh cluster on c01 or the old degraded cluster on c02+c03 -> this is effectively split-brain

This seems to be solved in the current main branch (39674d59b2f9546f83c74127dd64fb9dd553fad5), but that only lowers the severity to "fails, but not catastrophically". Re-runs of the command do not trigger the recently added etcd leave.

Actual problems in one list

  1. k0sctl does not handle a controller being removed from the spec file
    • it only works if you set the reset flag, but that does not match the apply command description and is completely incompatible with stateful systems like Terraform
  2. k0sctl does not handle changes from the outside world
    • in other words, it only works if it's the only thing that can manipulate any resource related to the k0s cluster
    • you have to manually detect drift and apply changes (e.g. k0s etcd leave) so that the real world matches k0sctl.yaml before running apply
    • the new unreleased version only stops the cluster from crashing; it does not solve the missing node-replacement capability
  3. k0sctl blindly assumes that whatever the spec says is the leader is always the leader
    • what's missing is the ability to ask all controllers what they think the cluster state is
    • with v0.16.0, only the host that was the leader when k0sctl was last run is validated
  4. k0s/k0sctl rely solely on the IP address to form the etcd cluster
    • maybe it should use Metadata.MachineID
    • it seems that if you set ssh.address to a hostname (which you can make globally unique), then etcd cannot start
@danielskowronski (Author)

Additionally, it seems that adding the reset flag, running apply, and then removing the controller from the YAML followed by another apply does not trigger a working node removal from etcd and the ControlNode object.

The attached zip has 3 phases: bootstrap, reset leader and remove leader, each with the input k0sctl.yaml, logs from k0sctl apply and kubectl get ControlNode -o yaml, plus the final state of etcd membership.

k0sctl_603_reset.zip

@kke (Contributor) commented Dec 13, 2023

k0sctl blindly assumes that whatever the spec says is the leader is always the leader

It just goes through all controllers in the config and picks the first one that has k0s running and isn't marked to be reset. If one can't be found, the first controller is used as the "leader". There shouldn't be any special treatment for the leader; it's just a "randomly" picked controller that is used for running commands that need to be run on a controller.

		// Pick the first controller that reports to be running and persist the choice
		for _, h := range controllers {
			if !h.Reset && h.Metadata.K0sBinaryVersion != nil && h.Metadata.K0sRunningVersion != nil {
				s.k0sLeader = h
				break
			}
		}

		// Still nil?  Fall back to first "controller" host, do not persist selection.
		if s.k0sLeader == nil {
			return controllers.First()
		}

with v0.16.0, only the host that used to be leader when k0sctl was last run is validated

Hmm, validated how?

adding the reset flag, running apply and then removing the controller from the YAML followed by another apply does not trigger a working node removal from etcd and the ControlNode object

The ControlNode objects are autopilot's, so it seems deleting a kubernetes node does not trigger a removal from autopilot; I don't know how autopilot manages removed nodes.

I think k0sctl should maybe do etcd leave before/after kubectl delete node or maybe k0s reset should do that on its own?
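
The manual equivalent an operator can run today is roughly this (a sketch, run on a surviving controller; <node-name> and <peer-address> are placeholders for the controller being removed):

k0s kubectl delete node <node-name>            # only applies if the controller is registered as a node (e.g. controller+worker)
k0s etcd leave --peer-address <peer-address>   # drop it from the etcd cluster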


This is btw automatically done when needed:

      environment:
        ETCD_UNSUPPORTED_ARCH: arm

Your arch seems to be arm64; 64-bit arm is supported in etcd 3.5.0+, which is included in k0s v1.22.1+k0s.0 and newer.

@pschichtel

I just removed a controller node by:

  1. adding reset: true
  2. running apply

result: it removed the node from k8s and reset the host, but it did not remove the etcd member.

I manually removed the member afterwards with etcdctl member remove <id>
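
Roughly, the cleanup was along these lines (a sketch, run on a surviving controller; the endpoint and certificate paths are assumptions based on k0s defaults and may differ per installation):

export ETCDCTL_API=3
# list members and note the hex ID of the stale one
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/k0s/pki/etcd/ca.crt \
  --cert=/var/lib/k0s/pki/apiserver-etcd-client.crt \
  --key=/var/lib/k0s/pki/apiserver-etcd-client.key \
  member list
# remove it by that ID
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/k0s/pki/etcd/ca.crt \
  --cert=/var/lib/k0s/pki/apiserver-etcd-client.crt \
  --key=/var/lib/k0s/pki/apiserver-etcd-client.key \
  member remove <id>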

@kke (Contributor) commented May 15, 2024

c01, which is the VM that was destroyed previously, is still considered by k0sctl to be the leader, since it's empty

It should pick a controller that is already running as the leader unless none of them are running, in which case the first is chosen:

// Pick the first controller that reports to be running and persist the choice
for _, h := range controllers {
    if !h.Reset && h.Metadata.K0sBinaryVersion != nil && h.Metadata.K0sRunningVersion != nil {
        s.k0sLeader = h
        break
    }
}
// Still nil? Fall back to first "controller" host, do not persist selection.
if s.k0sLeader == nil {
    return controllers.First()
}

you have to manually detect drift and apply changes (e.g. k0s etcd leave) so that the real world matches k0sctl.yaml before running apply

k0sctl reset is trying to perform a leave when resetting a non-leader controller:

if !p.NoLeave {
    log.Debugf("%s: leaving etcd...", h)
    etcdAddress := h.SSH.Address
    if h.PrivateAddress != "" {
        etcdAddress = h.PrivateAddress
    }
    if err := h.Exec(h.Configurer.K0sCmdf("etcd leave --peer-address %s --datadir %s", etcdAddress, h.K0sDataDir()), exec.Sudo(h)); err != nil {
        log.Warnf("%s: failed to leave etcd: %s", h, err.Error())
    }
    log.Debugf("%s: leaving etcd completed", h)
}
but that of course doesn't apply to a node that was just randomly wiped.

I just removed a controller node by:
adding reset: true
running apply
result: it removed the node from k8s, reset the host, but it did not remove the etcd member.

I manually removed the member afterwards with etcdctl member remove

This could be a bug: does the k0s etcd leave (from above) not work, or is it not done when performing a reset 🤔

k0s/k0sctl rely solely on the IP address to form the etcd cluster
maybe it should use Metadata.MachineID
it seems that if you set ssh.address to a hostname (which you can make globally unique) then etcd cannot start

k0s just removed usage of machine id: k0sproject/k0s#4230 and that is reflected in k0sctl: #697 - so no more MachineID.

For the root problem of performing apply when a fresh machine has replaced one that previously existed with the same IP, there needs to be some check for this. k0sctl should get a list of controllers known to exist (or maintain such a list on its own and distribute it to all of the controllers?) and, when apply encounters a host that has no k0s running where the controller list says there should be one, either error out and refuse to apply or try to somehow resolve the situation (just do a member remove?).
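
In the meantime, a rough manual drift check is possible (a sketch; the surviving controller and the controller addresses from k0sctl.yaml are placeholders, and it assumes root SSH access):

# For every address etcd still lists as a member, verify k0s is actually running there.
MEMBERS=$(ssh root@<surviving-controller> "k0s etcd member-list")
echo "$MEMBERS"
for addr in <controller-addresses-from-k0sctl.yaml>; do
  if echo "$MEMBERS" | grep -q "$addr"; then
    ssh "root@$addr" "k0s status" >/dev/null 2>&1 \
      || echo "drift: $addr is an etcd member but k0s is not running there"
  fi
done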

@twz123 (Member) commented May 16, 2024

I think k0sctl should maybe do etcd leave before/after kubectl delete node or maybe k0s reset should do that on its own?

Letting k0s reset do it seems tempting. The challenge here is that k0s reset doesn't have a way to reach out to the cluster it belonged to without starting its own etcd/apiserver yet again, which is probably not what you would expect. K0s configures the etcd client endpoints (port 2379) to listen on loopback interfaces; only the etcd peer endpoints (port 2380) are bound to all interfaces. To my knowledge, there's no way to send client requests to etcd via peer endpoints. We could, however, try to connect to the other API server endpoints and use the new EtcdMember custom resource. K0s would need to cache those API server endpoints on disk, though. We're already doing such a thing on the workers if NLLB is enabled. Anyhow, adding leave support to reset is not something that can be done casually. It's probably easier for cluster management automation (k0smotron, k0sctl and so on) to do this, since they know more about the cluster than a single k0s controller that's no longer running.
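
For illustration, the binding difference can be seen on any controller (a sketch; requires the ss utility):

# etcd client port 2379 should only appear bound to loopback,
# while peer port 2380 is bound to all interfaces.
ss -tlnp | grep -E ':2379|:2380'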

@kke (Contributor) commented May 17, 2024

k0sctl already does k0s etcd leave before running k0s reset, and there's a --no-leave option to skip that; this has been there for a long time. But it doesn't help with hosts that have been wiped from existence.

@kke (Contributor) commented May 17, 2024

Looks like there's another bug when a new controller is introduced; unsure if it happens always or only when it replaces a previously existing one. It seems the apply fails with "empty content on file write" when k0sctl writes k0s configs on the still-existing controllers.

@danielskowronski (Author)

The issue is still present on k0s v1.30.1+k0s.0 installed with k0sctl v0.17.8.

Install and prepare for tests

Local installation via bootloose (hence the unusual ssh addresses) with a haproxy LB for the control plane set up as before; k0sctl.yaml as follows:

---
apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: k0sbootloose
spec:
  hosts:
    - ssh:
        address: 127.0.0.1
        user: root
        port: 12002
        keyPath: ./cluster-key
      role: controller
      hostname: c0
      environment:
        ETCD_UNSUPPORTED_ARCH: arm
      reset: false
    - ssh:
        address: 127.0.0.1
        user: root
        port: 12003
        keyPath: ./cluster-key
      role: controller
      hostname: c1
      environment:
        ETCD_UNSUPPORTED_ARCH: arm
      reset: false
    - ssh:
        address: 127.0.0.1
        user: root
        port: 12004
        keyPath: ./cluster-key
      role: controller
      hostname: c2
      environment:
        ETCD_UNSUPPORTED_ARCH: arm
      reset: false
    - ssh:
        address: 127.0.0.1
        user: root
        port: 12005
        keyPath: ./cluster-key
      role: worker
      privateAddress: 192.168.67.211
      hostname: w0
      reset: false
    - ssh:
        address: 127.0.0.1
        user: root
        port: 12006
        keyPath: ./cluster-key
      role: worker
      hostname: w1
      reset: false
    - ssh:
        address: 127.0.0.1
        user: root
        port: 12007
        keyPath: ./cluster-key
      role: worker
      hostname: w2
      reset: false
  k0s:
    version: v1.30.1+k0s.0
    dynamicConfig: false
    config:
      apiVersion: k0s.k0sproject.io/v1beta1
      kind: ClusterConfig
      metadata:
        creationTimestamp: null
        name: k0sbootloose
      spec:
        api:
          externalAddress: k0s-lb0
          sans:
            - 127.0.0.1
            - k0s-lb0

For reference, this is the localhost port mapping and internal network IP list:

NAME    HOSTNAME  PORTS          IP
k0s-c0  c0        0->{22 12002}  172.18.0.4
k0s-c1  c1        0->{22 12003}  172.18.0.5
k0s-c2  c2        0->{22 12004}  172.18.0.6
k0s-w0  w0        0->{22 12005}  172.18.0.7
k0s-w1  w1        0->{22 12006}  172.18.0.8
k0s-w2  w2        0->{22 12007}  172.18.0.9 

Installation from scratch works fine, verified by re-running k0sctl apply with no actions taken.

Tested cluster health with:

for port in 12002 12003 12004; do ssh root@127.0.0.1 -p $port -i cluster-key -- "echo -n \$HOSTNAME' '; k0s etcd member-list"; done

Reported expected results:

c0 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}
c1 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}
c2 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}

Then the first controller, which is usually the leader (k0s-c0 in this instance), was destroyed (manually, with docker rm -f).

Scenario 1 - controller VM reappears, attempt to use k0sctl on the same yaml

Immediately after removal, the VM is re-created (using bootloose create, which only created the missing container; nothing else was touched).

After that, the first k0sctl apply with the config unchanged was run. It fails with:

==> Running phase: Connect to hosts
[ssh] 127.0.0.1:12007: connected
[ssh] 127.0.0.1:12003: connected
[ssh] 127.0.0.1:12005: connected
[ssh] 127.0.0.1:12002: connected
[ssh] 127.0.0.1:12004: connected
[ssh] 127.0.0.1:12006: connected
==> Running phase: Detect host operating systems
[ssh] 127.0.0.1:12004: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12006: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12005: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12002: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12003: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12007: is running Ubuntu 22.04.3 LTS
==> Running phase: Acquire exclusive host lock
==> Running phase: Prepare hosts
[ssh] 127.0.0.1:12002: updating environment
[ssh] 127.0.0.1:12003: updating environment
[ssh] 127.0.0.1:12004: updating environment
[ssh] 127.0.0.1:12006: is a container, applying a fix
[ssh] 127.0.0.1:12007: is a container, applying a fix
[ssh] 127.0.0.1:12005: is a container, applying a fix
[ssh] 127.0.0.1:12002: reconnecting to apply new environment
[ssh] 127.0.0.1:12003: reconnecting to apply new environment
[ssh] 127.0.0.1:12004: reconnecting to apply new environment
[ssh] 127.0.0.1:12002: is a container, applying a fix
[ssh] 127.0.0.1:12004: is a container, applying a fix
[ssh] 127.0.0.1:12003: is a container, applying a fix
==> Running phase: Gather host facts
[ssh] 127.0.0.1:12003: using c1 from configuration as hostname
[ssh] 127.0.0.1:12006: using w1 from configuration as hostname
[ssh] 127.0.0.1:12004: using c2 from configuration as hostname
[ssh] 127.0.0.1:12002: using c0 from configuration as hostname
[ssh] 127.0.0.1:12007: using w2 from configuration as hostname
[ssh] 127.0.0.1:12005: using w0 from configuration as hostname
[ssh] 127.0.0.1:12003: discovered eth0 as private interface
[ssh] 127.0.0.1:12002: discovered eth0 as private interface
[ssh] 127.0.0.1:12004: discovered eth0 as private interface
[ssh] 127.0.0.1:12007: discovered eth0 as private interface
[ssh] 127.0.0.1:12006: discovered eth0 as private interface
[ssh] 127.0.0.1:12002: discovered 172.18.0.4 as private address
[ssh] 127.0.0.1:12003: discovered 172.18.0.5 as private address
[ssh] 127.0.0.1:12004: discovered 172.18.0.6 as private address
[ssh] 127.0.0.1:12007: discovered 172.18.0.9 as private address
[ssh] 127.0.0.1:12006: discovered 172.18.0.8 as private address
==> Running phase: Validate hosts
==> Running phase: Gather k0s facts
[ssh] 127.0.0.1:12003: found existing configuration
[ssh] 127.0.0.1:12004: found existing configuration
[ssh] 127.0.0.1:12003: is running k0s controller version v1.30.1+k0s.0
[ssh] 127.0.0.1:12004: is running k0s controller version v1.30.1+k0s.0
[ssh] 127.0.0.1:12005: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w0 has joined
[ssh] 127.0.0.1:12007: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w2 has joined
[ssh] 127.0.0.1:12006: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w1 has joined
==> Running phase: Validate facts
==> Running phase: Download k0s on hosts
[ssh] 127.0.0.1:12002: downloading k0s v1.30.1+k0s.0
==> Running phase: Install k0s binaries on hosts
[ssh] 127.0.0.1:12002: validating configuration
[ssh] 127.0.0.1:12003: validating configuration
[ssh] 127.0.0.1:12004: validating configuration
==> Running phase: Configure k0s
[ssh] 127.0.0.1:12002: installing new configuration
* Running clean-up for phase: Acquire exclusive host lock
* Running clean-up for phase: Install k0s binaries on hosts
[ssh] 127.0.0.1:12002: cleaning up k0s binary tempfile
==> Apply failed
apply failed - log file saved to /Users/.../Library/Caches/k0sctl/k0sctl.log: failed on 2 hosts:
 - [ssh] 127.0.0.1:12003: command failed: empty content for write file /tmp/tmp.5o1UwzNPvx
 - [ssh] 127.0.0.1:12004: command failed: empty content for write file /tmp/tmp.isCY4kY55Z

The test command shows that c0 has no etcd running, but c1 and c2 still expect c0 as a cluster member:

c0 Error: can't list etcd cluster members: open /var/lib/k0s/pki/apiserver-etcd-client.crt: no such file or directory
c1 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}
c2 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}

After that, k0sctl apply is run again. This time it bootstraps a new cluster on c0, leaving c1+c2 as they were:

==> Running phase: Connect to hosts
[ssh] 127.0.0.1:12003: connected
[ssh] 127.0.0.1:12002: connected
[ssh] 127.0.0.1:12007: connected
[ssh] 127.0.0.1:12004: connected
[ssh] 127.0.0.1:12005: connected
[ssh] 127.0.0.1:12006: connected
==> Running phase: Detect host operating systems
[ssh] 127.0.0.1:12007: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12004: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12005: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12002: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12006: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12003: is running Ubuntu 22.04.3 LTS
==> Running phase: Acquire exclusive host lock
==> Running phase: Prepare hosts
[ssh] 127.0.0.1:12004: updating environment
[ssh] 127.0.0.1:12002: updating environment
[ssh] 127.0.0.1:12003: updating environment
[ssh] 127.0.0.1:12007: is a container, applying a fix
[ssh] 127.0.0.1:12006: is a container, applying a fix
[ssh] 127.0.0.1:12005: is a container, applying a fix
[ssh] 127.0.0.1:12004: reconnecting to apply new environment
[ssh] 127.0.0.1:12002: reconnecting to apply new environment
[ssh] 127.0.0.1:12003: reconnecting to apply new environment
[ssh] 127.0.0.1:12002: is a container, applying a fix
[ssh] 127.0.0.1:12003: is a container, applying a fix
[ssh] 127.0.0.1:12004: is a container, applying a fix
==> Running phase: Gather host facts
[ssh] 127.0.0.1:12005: using w0 from configuration as hostname
[ssh] 127.0.0.1:12004: using c2 from configuration as hostname
[ssh] 127.0.0.1:12007: using w2 from configuration as hostname
[ssh] 127.0.0.1:12006: using w1 from configuration as hostname
[ssh] 127.0.0.1:12002: using c0 from configuration as hostname
[ssh] 127.0.0.1:12003: using c1 from configuration as hostname
[ssh] 127.0.0.1:12003: discovered eth0 as private interface
[ssh] 127.0.0.1:12002: discovered eth0 as private interface
[ssh] 127.0.0.1:12004: discovered eth0 as private interface
[ssh] 127.0.0.1:12007: discovered eth0 as private interface
[ssh] 127.0.0.1:12006: discovered eth0 as private interface
[ssh] 127.0.0.1:12006: discovered 172.18.0.8 as private address
[ssh] 127.0.0.1:12003: discovered 172.18.0.5 as private address
[ssh] 127.0.0.1:12007: discovered 172.18.0.9 as private address
[ssh] 127.0.0.1:12002: discovered 172.18.0.4 as private address
[ssh] 127.0.0.1:12004: discovered 172.18.0.6 as private address
==> Running phase: Validate hosts
==> Running phase: Gather k0s facts
[ssh] 127.0.0.1:12002: found existing configuration
[ssh] 127.0.0.1:12004: found existing configuration
[ssh] 127.0.0.1:12003: found existing configuration
[ssh] 127.0.0.1:12003: is running k0s controller version v1.30.1+k0s.0
[ssh] 127.0.0.1:12004: is running k0s controller version v1.30.1+k0s.0
[ssh] 127.0.0.1:12005: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w0 has joined
[ssh] 127.0.0.1:12006: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w1 has joined
[ssh] 127.0.0.1:12007: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w2 has joined
==> Running phase: Validate facts
[ssh] 127.0.0.1:12002: validating configuration
[ssh] 127.0.0.1:12003: validating configuration
[ssh] 127.0.0.1:12004: validating configuration
==> Running phase: Install controllers
[ssh] 127.0.0.1:12002: validating api connection to https://k0s-lb0:6443
[ssh] 127.0.0.1:12003: generating token
[ssh] 127.0.0.1:12002: writing join token
[ssh] 127.0.0.1:12002: installing k0s controller
[ssh] 127.0.0.1:12002: updating service environment
[ssh] 127.0.0.1:12002: starting service
[ssh] 127.0.0.1:12002: waiting for the k0s service to start
[ssh] 127.0.0.1:12002: waiting for kubernetes api to respond
==> Running phase: Release exclusive host lock
==> Running phase: Disconnect from hosts
==> Finished in 27s
k0s cluster version v1.30.1+k0s.0 is now installed
Tip: To access the cluster you can now fetch the admin kubeconfig using:
     k0sctl kubeconfig

Split-brain is further confirmed by the etcd test command:

c0 {"members":{"c0":"https://172.18.0.4:2380"}}
c1 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}
c2 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}

Scenario 2 - controller VM reappears, attempt to use k0sctl with reset flag

Immediately after removal, the VM is re-created (using bootloose create, which only created the missing container; nothing else was touched).

After that, the config is changed to set the reset flag to true on the re-created controller. k0sctl apply claims success:

==> Running phase: Connect to hosts
[ssh] 127.0.0.1:12006: connected
[ssh] 127.0.0.1:12004: connected
[ssh] 127.0.0.1:12002: connected
[ssh] 127.0.0.1:12005: connected
[ssh] 127.0.0.1:12007: connected
[ssh] 127.0.0.1:12003: connected
==> Running phase: Detect host operating systems
[ssh] 127.0.0.1:12002: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12005: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12007: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12003: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12004: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12006: is running Ubuntu 22.04.3 LTS
==> Running phase: Acquire exclusive host lock
==> Running phase: Prepare hosts
[ssh] 127.0.0.1:12004: updating environment
[ssh] 127.0.0.1:12002: updating environment
[ssh] 127.0.0.1:12003: updating environment
[ssh] 127.0.0.1:12006: is a container, applying a fix
[ssh] 127.0.0.1:12005: is a container, applying a fix
[ssh] 127.0.0.1:12007: is a container, applying a fix
[ssh] 127.0.0.1:12002: reconnecting to apply new environment
[ssh] 127.0.0.1:12003: reconnecting to apply new environment
[ssh] 127.0.0.1:12004: reconnecting to apply new environment
[ssh] 127.0.0.1:12004: is a container, applying a fix
[ssh] 127.0.0.1:12002: is a container, applying a fix
[ssh] 127.0.0.1:12003: is a container, applying a fix
==> Running phase: Gather host facts
[ssh] 127.0.0.1:12005: using w0 from configuration as hostname
[ssh] 127.0.0.1:12007: using w2 from configuration as hostname
[ssh] 127.0.0.1:12002: using c0 from configuration as hostname
[ssh] 127.0.0.1:12003: using c1 from configuration as hostname
[ssh] 127.0.0.1:12006: using w1 from configuration as hostname
[ssh] 127.0.0.1:12004: using c2 from configuration as hostname
[ssh] 127.0.0.1:12002: discovered eth0 as private interface
[ssh] 127.0.0.1:12003: discovered eth0 as private interface
[ssh] 127.0.0.1:12004: discovered eth0 as private interface
[ssh] 127.0.0.1:12006: discovered eth0 as private interface
[ssh] 127.0.0.1:12007: discovered eth0 as private interface
[ssh] 127.0.0.1:12003: discovered 172.18.0.5 as private address
[ssh] 127.0.0.1:12004: discovered 172.18.0.6 as private address
[ssh] 127.0.0.1:12002: discovered 172.18.0.4 as private address
[ssh] 127.0.0.1:12006: discovered 172.18.0.8 as private address
[ssh] 127.0.0.1:12007: discovered 172.18.0.9 as private address
==> Running phase: Validate hosts
==> Running phase: Gather k0s facts
[ssh] 127.0.0.1:12004: found existing configuration
[ssh] 127.0.0.1:12003: found existing configuration
[ssh] 127.0.0.1:12003: is running k0s controller version v1.30.1+k0s.0
[ssh] 127.0.0.1:12004: is running k0s controller version v1.30.1+k0s.0
[ssh] 127.0.0.1:12005: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w0 has joined
[ssh] 127.0.0.1:12007: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w2 has joined
[ssh] 127.0.0.1:12006: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w1 has joined
==> Running phase: Validate facts
[ssh] 127.0.0.1:12003: validating configuration
[ssh] 127.0.0.1:12004: validating configuration
==> Running phase: Reset controllers
[ssh] 127.0.0.1:12002: reset
==> Running phase: Release exclusive host lock
==> Running phase: Disconnect from hosts
==> Finished in 3s
There were nodes that got uninstalled during the apply phase. Please remove them from your k0sctl config file
k0s cluster version v1.30.1+k0s.0 is now installed
Tip: To access the cluster you can now fetch the admin kubeconfig using:
     k0sctl kubeconfig

That log still mentions only removal from the k0sctl config and not https://docs.k0sproject.io/stable/remove_controller, so the operator could assume everything is fine. However, c0 is NOT removed from the etcd cluster that runs on c1+c2:

c0 bash: line 1: k0s: command not found
c1 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}
c2 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}

After reverting the reset flag, k0sctl apply fails:

==> Running phase: Connect to hosts
[ssh] 127.0.0.1:12003: connected
[ssh] 127.0.0.1:12002: connected
[ssh] 127.0.0.1:12004: connected
[ssh] 127.0.0.1:12006: connected
[ssh] 127.0.0.1:12005: connected
[ssh] 127.0.0.1:12007: connected
==> Running phase: Detect host operating systems
[ssh] 127.0.0.1:12004: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12007: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12005: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12002: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12006: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12003: is running Ubuntu 22.04.3 LTS
==> Running phase: Acquire exclusive host lock
==> Running phase: Prepare hosts
[ssh] 127.0.0.1:12004: updating environment
[ssh] 127.0.0.1:12002: updating environment
[ssh] 127.0.0.1:12003: updating environment
[ssh] 127.0.0.1:12007: is a container, applying a fix
[ssh] 127.0.0.1:12006: is a container, applying a fix
[ssh] 127.0.0.1:12005: is a container, applying a fix
[ssh] 127.0.0.1:12004: reconnecting to apply new environment
[ssh] 127.0.0.1:12003: reconnecting to apply new environment
[ssh] 127.0.0.1:12002: reconnecting to apply new environment
[ssh] 127.0.0.1:12004: is a container, applying a fix
[ssh] 127.0.0.1:12003: is a container, applying a fix
[ssh] 127.0.0.1:12002: is a container, applying a fix
==> Running phase: Gather host facts
[ssh] 127.0.0.1:12004: using c2 from configuration as hostname
[ssh] 127.0.0.1:12003: using c1 from configuration as hostname
[ssh] 127.0.0.1:12002: using c0 from configuration as hostname
[ssh] 127.0.0.1:12007: using w2 from configuration as hostname
[ssh] 127.0.0.1:12005: using w0 from configuration as hostname
[ssh] 127.0.0.1:12006: using w1 from configuration as hostname
[ssh] 127.0.0.1:12002: discovered eth0 as private interface
[ssh] 127.0.0.1:12004: discovered eth0 as private interface
[ssh] 127.0.0.1:12007: discovered eth0 as private interface
[ssh] 127.0.0.1:12003: discovered eth0 as private interface
[ssh] 127.0.0.1:12006: discovered eth0 as private interface
[ssh] 127.0.0.1:12002: discovered 172.18.0.4 as private address
[ssh] 127.0.0.1:12003: discovered 172.18.0.5 as private address
[ssh] 127.0.0.1:12007: discovered 172.18.0.9 as private address
[ssh] 127.0.0.1:12004: discovered 172.18.0.6 as private address
[ssh] 127.0.0.1:12006: discovered 172.18.0.8 as private address
==> Running phase: Validate hosts
==> Running phase: Gather k0s facts
[ssh] 127.0.0.1:12004: found existing configuration
[ssh] 127.0.0.1:12003: found existing configuration
[ssh] 127.0.0.1:12004: is running k0s controller version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: is running k0s controller version v1.30.1+k0s.0
[ssh] 127.0.0.1:12005: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w0 has joined
[ssh] 127.0.0.1:12007: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w2 has joined
[ssh] 127.0.0.1:12006: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w1 has joined
==> Running phase: Validate facts
==> Running phase: Download k0s on hosts
[ssh] 127.0.0.1:12002: downloading k0s v1.30.1+k0s.0
==> Running phase: Install k0s binaries on hosts
[ssh] 127.0.0.1:12002: validating configuration
[ssh] 127.0.0.1:12003: validating configuration
[ssh] 127.0.0.1:12004: validating configuration
==> Running phase: Configure k0s
[ssh] 127.0.0.1:12002: installing new configuration
* Running clean-up for phase: Acquire exclusive host lock
* Running clean-up for phase: Install k0s binaries on hosts
[ssh] 127.0.0.1:12002: cleaning up k0s binary tempfile
==> Apply failed
apply failed - log file saved to /Users/.../Library/Caches/k0sctl/k0sctl.log: failed on 2 hosts:
 - [ssh] 127.0.0.1:12004: command failed: empty content for write file /tmp/tmp.3cDp2qCBy3
 - [ssh] 127.0.0.1:12003: command failed: empty content for write file /tmp/tmp.Eum6GBeL8J

Cluster status is unchanged:

c0 Error: can't list etcd cluster members: open /var/lib/k0s/pki/apiserver-etcd-client.crt: no such file or directory
c1 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}
c2 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}

An attempt to re-run k0sctl apply now leads to the same results as scenario 1:

==> Running phase: Connect to hosts
[ssh] 127.0.0.1:12003: connected
[ssh] 127.0.0.1:12002: connected
[ssh] 127.0.0.1:12004: connected
[ssh] 127.0.0.1:12005: connected
[ssh] 127.0.0.1:12007: connected
[ssh] 127.0.0.1:12006: connected
==> Running phase: Detect host operating systems
[ssh] 127.0.0.1:12007: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12004: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12006: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12005: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12003: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12002: is running Ubuntu 22.04.3 LTS
==> Running phase: Acquire exclusive host lock
==> Running phase: Prepare hosts
[ssh] 127.0.0.1:12002: updating environment
[ssh] 127.0.0.1:12003: updating environment
[ssh] 127.0.0.1:12004: updating environment
[ssh] 127.0.0.1:12006: is a container, applying a fix
[ssh] 127.0.0.1:12007: is a container, applying a fix
[ssh] 127.0.0.1:12005: is a container, applying a fix
[ssh] 127.0.0.1:12004: reconnecting to apply new environment
[ssh] 127.0.0.1:12002: reconnecting to apply new environment
[ssh] 127.0.0.1:12003: reconnecting to apply new environment
[ssh] 127.0.0.1:12003: is a container, applying a fix
[ssh] 127.0.0.1:12002: is a container, applying a fix
[ssh] 127.0.0.1:12004: is a container, applying a fix
==> Running phase: Gather host facts
[ssh] 127.0.0.1:12006: using w1 from configuration as hostname
[ssh] 127.0.0.1:12007: using w2 from configuration as hostname
[ssh] 127.0.0.1:12003: using c1 from configuration as hostname
[ssh] 127.0.0.1:12005: using w0 from configuration as hostname
[ssh] 127.0.0.1:12002: using c0 from configuration as hostname
[ssh] 127.0.0.1:12004: using c2 from configuration as hostname
[ssh] 127.0.0.1:12004: discovered eth0 as private interface
[ssh] 127.0.0.1:12002: discovered eth0 as private interface
[ssh] 127.0.0.1:12003: discovered eth0 as private interface
[ssh] 127.0.0.1:12006: discovered eth0 as private interface
[ssh] 127.0.0.1:12007: discovered eth0 as private interface
[ssh] 127.0.0.1:12003: discovered 172.18.0.5 as private address
[ssh] 127.0.0.1:12002: discovered 172.18.0.4 as private address
[ssh] 127.0.0.1:12006: discovered 172.18.0.8 as private address
[ssh] 127.0.0.1:12004: discovered 172.18.0.6 as private address
[ssh] 127.0.0.1:12007: discovered 172.18.0.9 as private address
==> Running phase: Validate hosts
==> Running phase: Gather k0s facts
[ssh] 127.0.0.1:12002: found existing configuration
[ssh] 127.0.0.1:12003: found existing configuration
[ssh] 127.0.0.1:12004: found existing configuration
[ssh] 127.0.0.1:12003: is running k0s controller version v1.30.1+k0s.0
[ssh] 127.0.0.1:12004: is running k0s controller version v1.30.1+k0s.0
[ssh] 127.0.0.1:12007: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w2 has joined
[ssh] 127.0.0.1:12005: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w0 has joined
[ssh] 127.0.0.1:12006: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w1 has joined
==> Running phase: Validate facts
[ssh] 127.0.0.1:12002: validating configuration
[ssh] 127.0.0.1:12003: validating configuration
[ssh] 127.0.0.1:12004: validating configuration
==> Running phase: Install controllers
[ssh] 127.0.0.1:12002: validating api connection to https://k0s-lb0:6443
[ssh] 127.0.0.1:12003: generating token
[ssh] 127.0.0.1:12002: writing join token
[ssh] 127.0.0.1:12002: installing k0s controller
[ssh] 127.0.0.1:12002: updating service environment
[ssh] 127.0.0.1:12002: starting service
[ssh] 127.0.0.1:12002: waiting for the k0s service to start
[ssh] 127.0.0.1:12002: waiting for kubernetes api to respond
==> Running phase: Release exclusive host lock
==> Running phase: Disconnect from hosts
==> Finished in 27s
k0s cluster version v1.30.1+k0s.0 is now installed
Tip: To access the cluster you can now fetch the admin kubeconfig using:
     k0sctl kubeconfig
c0 {"members":{"c0":"https://172.18.0.4:2380"}}
c1 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}
c2 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}

@kke (Contributor) commented Jun 4, 2024

#714 is an attempt to fix some of that.

@danielskowronski (Author)

v0.18.0 solves the issue when --force is passed to k0sctl apply.

Additionally, without --force, a reasonable message is emitted:

FATA apply failed - log file saved to .../k0sctl.log: controller [ssh] 127.0.0.1:12002 is listed as an existing etcd member but k0s is not found installed on it, the host may have been replaced. check the host and use `k0s etcd leave --peer-address 172.18.0.4 on a controller or re-run apply with --force 
