
k0sctl is not handling node removal correctly #603

Closed

danielskowronski opened this issue Dec 4, 2023 · 10 comments

Versions

  • k0s: v1.28.3+k0s.0
  • k0sctl: v0.16.0 (7e8c272)

References

Summary

k0sctl silently does not support the situation where a controller entry disappears from spec.hosts in k0sctl.yaml. Additionally, k0sctl treats k0sctl.yaml as the source of truth and does not check every controller. In the worst case this leads to split-brain: two clusters exist where the control plane is supposed to be HA. It is especially visible in workflows like Terraform that expect controller removal to be a single run.

Details

All scenarios assume the starting point is a set of 3 controller VMs (c01, c02, c03) and 4 worker VMs (w01, ...) with static IPs assigned. All VMs are fresh EC2 instances started from the latest Ubuntu AMIs before every scenario. Static IPs ensure that etcd problems are immediately visible. The first controller is always targeted so that any leadership issues are immediately visible.

Cluster health and membership can be verified by running k0s etcd member-list on each VM, or by checking whether each controller can use k0s kubectl to obtain information about the cluster.
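
For reference, a minimal verification sketch along these lines (assuming root SSH access to each controller and k0s in the PATH; c01-c03 are the example hostnames from above):

# Every controller should report the same etcd member list and see all workers.
for host in c01 c02 c03; do
  echo "=== $host ==="
  ssh "root@$host" "k0s etcd member-list"
  ssh "root@$host" "k0s kubectl get nodes -o wide"
done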

Scenario 1: procedure to leave ETCD is followed -> works

A cluster is created using k0sctl apply and verified to be working. If you follow the procedure to k0s etcd leave targeting the leader and execute k0sctl apply right away, it'll fail, but the cluster is left intact.

After that, if you complete the full controller removal by executing k0s stop; k0s reset; reboot on the leader and then run k0sctl apply, it works as expected by adding c01 back to the cluster. All operations are verified to be OK.
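
Condensed, the manual removal steps used in this scenario look roughly like this (a sketch, assuming the commands are run as root on the controller being removed, c01):

k0s etcd leave   # remove this controller from the etcd cluster first
k0s stop         # stop the k0s service
k0s reset        # wipe k0s state from the host
reboot           # reboot as part of the reset procedure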

Scenario 2: controller is removed from k0sctl.yaml without any additional operations -> fails, but not catastrophically

A cluster is created using k0sctl apply and verified to be working. If you remove the leader c01 from the YAML file and run k0sctl apply, it looks as if it was removed from the cluster (e.g. the log says "2 controllers in total").

However, all 3 controllers remain in the cluster and there's no outage.

If you re-add the c01 entry to the hosts list in the same form as before and run k0sctl apply, it is "added back" to the cluster (the log says "3 controllers in total"). Nothing has changed: the cluster still has 3 controllers and works fine.

Scenario 3: controller is externally wiped and k0sctl runs on unchanged file -> breaks cluster

A cluster is created using k0sctl apply and verified to be working. The leader VM is destroyed by external means; the user may not be aware of that. A fresh VM now exists with the same IP address (the hostname may be different). k0sctl apply is executed on a YAML file that either wasn't changed, or was changed but with the ssh section intact (i.e. hostname and environment could have changed, for example when Terraform rebuilds an EC2 instance). The effect is the following:

  • c01, which is the VM that was destroyed previously, is still considered by k0sctl to be the leader; since it's empty, k0sctl installs a new cluster there and attempts to join c02+c03, which fails; the etcd installed there only recognizes c01 as a member; kubectl shows no workers
  • c02 and c03 still form a cluster that has the old c01 membership registered; this is easy to diagnose if you set a globally unique hostname when the OS is installed (e.g. EC2 resource ID or VM birthdate); the cluster on c02+c03 can't talk to c01 because the etcd clusters are different; kubectl shows all workers
  • for the whole setup, following https://docs.k0sproject.io/v1.28.4+k0s.0/high-availability/, there is now a load balancer pointing to 3 VMs: c01, c02 and c03 - all "healthy" - so traffic directed at the k8s control plane (both end-user traffic and traffic originating from k0s controllers) randomly hits either the fresh cluster on c01 or the old degraded cluster on c02+c03 -> this is effectively split-brain

This seems to be solved in the current main branch (39674d59b2f9546f83c74127dd64fb9dd553fad5), but that only lowers the severity to "fails, but not catastrophically". Re-runs of the command do not trigger the recently added etcd leave.

Actual problems in one list

  1. k0sctl does not handle a controller being removed from the spec file
    • it only works if you set the reset flag, but that does not match the apply command description and is completely incompatible with stateful systems like Terraform
  2. k0sctl does not handle changes from the outside world
    • in other words, it only works if it's the only thing that can manipulate any resource related to the k0s cluster
    • you have to manually detect drift and apply changes (e.g. k0s etcd leave) so that the real world matches k0sctl.yaml before running apply
    • the new unreleased version only stops the cluster from crashing; it does not solve the missing node-replacement capability
  3. k0sctl blindly assumes that whatever the spec says is the leader is always the leader
    • what's missing is the ability to ask all controllers what they think the cluster state is
    • with v0.16.0, only the host that was the leader when k0sctl was last run is validated
  4. k0s/k0sctl rely solely on the IP address to form the etcd cluster
    • maybe it should use Metadata.MachineID
    • it seems that if you set ssh.address to a hostname (which you can make globally unique), then etcd cannot start
@danielskowronski (Author)

Additionally, it seems that adding the reset flag, running apply, and then removing the controller from the YAML followed by another apply does not trigger a working node removal from etcd and the ControlNode object.

The attached zip has 3 phases: bootstrap, reset leader and remove leader, each with the input k0sctl.yaml, logs from k0sctl apply and kubectl get ControlNode -o yaml, plus the final state of etcd membership.

k0sctl_603_reset.zip

@kke (Contributor) commented Dec 13, 2023

k0sctl blindly assumes that whatever the spec says is the leader is always the leader

It just goes through all controllers in the config and picks the first one that has k0s running and isn't marked to be reset. If one can't be found, the first controller is used as the "leader". There shouldn't be any special treatment for the leader; it's just a "randomly" picked controller that is used for running commands that need to be run on a controller.

		// Pick the first controller that reports to be running and persist the choice
		for _, h := range controllers {
			if !h.Reset && h.Metadata.K0sBinaryVersion != nil && h.Metadata.K0sRunningVersion != nil {
				s.k0sLeader = h
				break
			}
		}

		// Still nil?  Fall back to first "controller" host, do not persist selection.
		if s.k0sLeader == nil {
			return controllers.First()
		}

with v0.16.0, only the host that used to be leader when k0sctl was last run is validated

Hmm, validated how?

adding the reset flag, running apply and then removing the controller from the YAML followed by another apply does not trigger a working node removal from etcd and the ControlNode object

The ControlNode objects are autopilot's, so it seems deleting a kubernetes node does not trigger a removal from autopilot; I don't know how autopilot manages removed nodes.

I think k0sctl should maybe do etcd leave before/after kubectl delete node or maybe k0s reset should do that on its own?
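
The manual equivalent an operator can run today is roughly this (a sketch, run on a surviving controller; <node-name> and <peer-address> are placeholders for the controller being removed):

k0s kubectl delete node <node-name>            # only applies if the controller is registered as a node (e.g. controller+worker)
k0s etcd leave --peer-address <peer-address>   # drop it from the etcd cluster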


This is btw automatically done when needed:

      environment:
        ETCD_UNSUPPORTED_ARCH: arm

Your arch seems to be arm64; 64-bit arm is supported in etcd 3.5.0+, which is included in k0s v1.22.1+k0s.0 and newer.

@pschichtel

I just removed a controller node by:

  1. adding reset: true
  2. running apply

result: it removed the node from k8s and reset the host, but it did not remove the etcd member.

I manually removed the member afterwards with etcdctl member remove <id>
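
Roughly, the cleanup was along these lines (a sketch, run on a surviving controller; the endpoint and certificate paths are assumptions based on k0s defaults and may differ per installation):

export ETCDCTL_API=3
# list members and note the hex ID of the stale one
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/k0s/pki/etcd/ca.crt \
  --cert=/var/lib/k0s/pki/apiserver-etcd-client.crt \
  --key=/var/lib/k0s/pki/apiserver-etcd-client.key \
  member list
# remove it by that ID
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/k0s/pki/etcd/ca.crt \
  --cert=/var/lib/k0s/pki/apiserver-etcd-client.crt \
  --key=/var/lib/k0s/pki/apiserver-etcd-client.key \
  member remove <id>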

@kke (Contributor) commented May 15, 2024

c01, which is the VM that was destroyed previously, is still considered by k0sctl to be the leader, since it's empty

It should pick a controller that is already running as the leader unless none of them are running, in which case the first is chosen:

// Pick the first controller that reports to be running and persist the choice
for _, h := range controllers {
    if !h.Reset && h.Metadata.K0sBinaryVersion != nil && h.Metadata.K0sRunningVersion != nil {
        s.k0sLeader = h
        break
    }
}
// Still nil? Fall back to first "controller" host, do not persist selection.
if s.k0sLeader == nil {
    return controllers.First()
}

you have to manually detect drift and apply changes (e.g. k0s etcd leave) so that the real world matches k0sctl.yaml before running apply

k0sctl reset is trying to perform a leave when resetting a non-leader controller:

if !p.NoLeave {
    log.Debugf("%s: leaving etcd...", h)
    etcdAddress := h.SSH.Address
    if h.PrivateAddress != "" {
        etcdAddress = h.PrivateAddress
    }
    if err := h.Exec(h.Configurer.K0sCmdf("etcd leave --peer-address %s --datadir %s", etcdAddress, h.K0sDataDir()), exec.Sudo(h)); err != nil {
        log.Warnf("%s: failed to leave etcd: %s", h, err.Error())
    }
    log.Debugf("%s: leaving etcd completed", h)
}
but that of course doesn't apply to a node that was just randomly wiped.

I just removed a controller node by:
adding reset: true
running apply
result: it removed the node from k8s, reset the host, but it did not remove the etcd member.

I manually removed the member afterwards with etcdctl member remove

This could be a bug: does the k0s etcd leave (from above) not work, or is it not done when performing a reset 🤔

k0s/k0sctl rely solely on the IP address to form the etcd cluster
maybe it should use Metadata.MachineID
it seems that if you set ssh.address to a hostname (which you can make globally unique) then etcd cannot start

k0s just removed usage of machine id: k0sproject/k0s#4230 and that is reflected in k0sctl: #697 - so no more MachineID.

For the root problem of performing apply when a fresh machine has replaced one that previously existed with the same IP, there needs to be some check for this. k0sctl should get a list of controllers known to exist (or maintain such a list on its own and distribute it to all of the controllers?) and, when apply encounters a host that has no k0s running where the controller list says there should be one, either error out and refuse to apply or try to somehow resolve the situation (just do a member remove?).
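
In the meantime, a rough manual drift check is possible (a sketch; the surviving controller and the controller addresses from k0sctl.yaml are placeholders, and it assumes root SSH access):

# For every address etcd still lists as a member, verify k0s is actually running there.
MEMBERS=$(ssh root@<surviving-controller> "k0s etcd member-list")
echo "$MEMBERS"
for addr in <controller-addresses-from-k0sctl.yaml>; do
  if echo "$MEMBERS" | grep -q "$addr"; then
    ssh "root@$addr" "k0s status" >/dev/null 2>&1 \
      || echo "drift: $addr is an etcd member but k0s is not running there"
  fi
done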

@twz123 (Member) commented May 16, 2024

I think k0sctl should maybe do etcd leave before/after kubectl delete node or maybe k0s reset should do that on its own?

Letting k0s reset do it seems tempting. The challenge here is that k0s reset doesn't have a way to reach out to the cluster it belonged to without starting its own etcd/apiserver yet again, which is probably not what you would expect. K0s configures the etcd client endpoints (port 2379) to listen on loopback interfaces; only the etcd peer endpoints (port 2380) are bound to all interfaces. To my knowledge, there's no way to send client requests to etcd via peer endpoints. We could, however, try to connect to the other API server endpoints and use the new EtcdMember custom resource. K0s would need to cache those API server endpoints on disk, though. We're already doing such a thing on the workers if NLLB is enabled. Anyhow, adding leave support to reset is not something that can be done casually. It's probably easier for cluster management automation (k0smotron, k0sctl and so on) to do this, since they know more about the cluster than a single k0s controller that's no longer running.
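
For illustration, the binding difference can be seen on any controller (a sketch; requires the ss utility):

# etcd client port 2379 should only appear bound to loopback,
# while peer port 2380 is bound to all interfaces.
ss -tlnp | grep -E ':2379|:2380'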

@kke (Contributor) commented May 17, 2024

k0sctl already does k0s etcd leave before running k0s reset, and there's a --no-leave option to skip that; this has been there for a long time. But it doesn't help with hosts that have been wiped from existence.

@kke (Contributor) commented May 17, 2024

Looks like there's another bug when a new controller is introduced; unsure if it happens always or only when it replaces a previously existing one. It seems the apply fails with "empty content on file write" when k0sctl writes k0s configs on the still-existing controllers.

@danielskowronski (Author)

The issue is still present on k0s v1.30.1+k0s.0 installed with k0sctl v0.17.8.

Install and prepare for tests

Local installation via bootloose (hence the unusual ssh addresses) with a haproxy LB for the control plane set up as before; k0sctl.yaml as follows:

---
apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: k0sbootloose
spec:
  hosts:
    - ssh:
        address: 127.0.0.1
        user: root
        port: 12002
        keyPath: ./cluster-key
      role: controller
      hostname: c0
      environment:
        ETCD_UNSUPPORTED_ARCH: arm
      reset: false
    - ssh:
        address: 127.0.0.1
        user: root
        port: 12003
        keyPath: ./cluster-key
      role: controller
      hostname: c1
      environment:
        ETCD_UNSUPPORTED_ARCH: arm
      reset: false
    - ssh:
        address: 127.0.0.1
        user: root
        port: 12004
        keyPath: ./cluster-key
      role: controller
      hostname: c2
      environment:
        ETCD_UNSUPPORTED_ARCH: arm
      reset: false
    - ssh:
        address: 127.0.0.1
        user: root
        port: 12005
        keyPath: ./cluster-key
      role: worker
      privateAddress: 192.168.67.211
      hostname: w0
      reset: false
    - ssh:
        address: 127.0.0.1
        user: root
        port: 12006
        keyPath: ./cluster-key
      role: worker
      hostname: w1
      reset: false
    - ssh:
        address: 127.0.0.1
        user: root
        port: 12007
        keyPath: ./cluster-key
      role: worker
      hostname: w2
      reset: false
  k0s:
    version: v1.30.1+k0s.0
    dynamicConfig: false
    config:
      apiVersion: k0s.k0sproject.io/v1beta1
      kind: ClusterConfig
      metadata:
        creationTimestamp: null
        name: k0sbootloose
      spec:
        api:
          externalAddress: k0s-lb0
          sans:
            - 127.0.0.1
            - k0s-lb0

For reference, this is the localhost port mapping and internal network IP list:

NAME    HOSTNAME  PORTS          IP
k0s-c0  c0        0->{22 12002}  172.18.0.4
k0s-c1  c1        0->{22 12003}  172.18.0.5
k0s-c2  c2        0->{22 12004}  172.18.0.6
k0s-w0  w0        0->{22 12005}  172.18.0.7
k0s-w1  w1        0->{22 12006}  172.18.0.8
k0s-w2  w2        0->{22 12007}  172.18.0.9 

Installation from scratch works fine, verified by re-running k0sctl apply with no actions taken.

Tested cluster health with:

for port in 12002 12003 12004; do ssh root@127.0.0.1 -p $port -i cluster-key -- "echo -n \$HOSTNAME' '; k0s etcd member-list"; done

Reported expected results:

c0 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}
c1 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}
c2 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}

Then the first controller, which is usually the leader (k0s-c0 in this instance), was destroyed (manually, with docker rm -f).

Scenario 1 - controller VM reappears, attempt to use k0sctl on the same yaml

Immediately after removal, the VM is re-created (using bootloose create, which only created the missing container; nothing else was touched).

After that, the first k0sctl apply with the config unchanged was run. It fails with:

==> Running phase: Connect to hosts
[ssh] 127.0.0.1:12007: connected
[ssh] 127.0.0.1:12003: connected
[ssh] 127.0.0.1:12005: connected
[ssh] 127.0.0.1:12002: connected
[ssh] 127.0.0.1:12004: connected
[ssh] 127.0.0.1:12006: connected
==> Running phase: Detect host operating systems
[ssh] 127.0.0.1:12004: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12006: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12005: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12002: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12003: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12007: is running Ubuntu 22.04.3 LTS
==> Running phase: Acquire exclusive host lock
==> Running phase: Prepare hosts
[ssh] 127.0.0.1:12002: updating environment
[ssh] 127.0.0.1:12003: updating environment
[ssh] 127.0.0.1:12004: updating environment
[ssh] 127.0.0.1:12006: is a container, applying a fix
[ssh] 127.0.0.1:12007: is a container, applying a fix
[ssh] 127.0.0.1:12005: is a container, applying a fix
[ssh] 127.0.0.1:12002: reconnecting to apply new environment
[ssh] 127.0.0.1:12003: reconnecting to apply new environment
[ssh] 127.0.0.1:12004: reconnecting to apply new environment
[ssh] 127.0.0.1:12002: is a container, applying a fix
[ssh] 127.0.0.1:12004: is a container, applying a fix
[ssh] 127.0.0.1:12003: is a container, applying a fix
==> Running phase: Gather host facts
[ssh] 127.0.0.1:12003: using c1 from configuration as hostname
[ssh] 127.0.0.1:12006: using w1 from configuration as hostname
[ssh] 127.0.0.1:12004: using c2 from configuration as hostname
[ssh] 127.0.0.1:12002: using c0 from configuration as hostname
[ssh] 127.0.0.1:12007: using w2 from configuration as hostname
[ssh] 127.0.0.1:12005: using w0 from configuration as hostname
[ssh] 127.0.0.1:12003: discovered eth0 as private interface
[ssh] 127.0.0.1:12002: discovered eth0 as private interface
[ssh] 127.0.0.1:12004: discovered eth0 as private interface
[ssh] 127.0.0.1:12007: discovered eth0 as private interface
[ssh] 127.0.0.1:12006: discovered eth0 as private interface
[ssh] 127.0.0.1:12002: discovered 172.18.0.4 as private address
[ssh] 127.0.0.1:12003: discovered 172.18.0.5 as private address
[ssh] 127.0.0.1:12004: discovered 172.18.0.6 as private address
[ssh] 127.0.0.1:12007: discovered 172.18.0.9 as private address
[ssh] 127.0.0.1:12006: discovered 172.18.0.8 as private address
==> Running phase: Validate hosts
==> Running phase: Gather k0s facts
[ssh] 127.0.0.1:12003: found existing configuration
[ssh] 127.0.0.1:12004: found existing configuration
[ssh] 127.0.0.1:12003: is running k0s controller version v1.30.1+k0s.0
[ssh] 127.0.0.1:12004: is running k0s controller version v1.30.1+k0s.0
[ssh] 127.0.0.1:12005: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w0 has joined
[ssh] 127.0.0.1:12007: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w2 has joined
[ssh] 127.0.0.1:12006: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w1 has joined
==> Running phase: Validate facts
==> Running phase: Download k0s on hosts
[ssh] 127.0.0.1:12002: downloading k0s v1.30.1+k0s.0
==> Running phase: Install k0s binaries on hosts
[ssh] 127.0.0.1:12002: validating configuration
[ssh] 127.0.0.1:12003: validating configuration
[ssh] 127.0.0.1:12004: validating configuration
==> Running phase: Configure k0s
[ssh] 127.0.0.1:12002: installing new configuration
* Running clean-up for phase: Acquire exclusive host lock
* Running clean-up for phase: Install k0s binaries on hosts
[ssh] 127.0.0.1:12002: cleaning up k0s binary tempfile
==> Apply failed
apply failed - log file saved to /Users/.../Library/Caches/k0sctl/k0sctl.log: failed on 2 hosts:
 - [ssh] 127.0.0.1:12003: command failed: empty content for write file /tmp/tmp.5o1UwzNPvx
 - [ssh] 127.0.0.1:12004: command failed: empty content for write file /tmp/tmp.isCY4kY55Z

The test command shows that c0 has no etcd running, but c1 and c2 still expect c0 as a cluster member:

c0 Error: can't list etcd cluster members: open /var/lib/k0s/pki/apiserver-etcd-client.crt: no such file or directory
c1 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}
c2 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}

After that, k0sctl apply is run again. This time it bootstraps a new cluster on c0, leaving c1+c2 as they were:

==> Running phase: Connect to hosts
[ssh] 127.0.0.1:12003: connected
[ssh] 127.0.0.1:12002: connected
[ssh] 127.0.0.1:12007: connected
[ssh] 127.0.0.1:12004: connected
[ssh] 127.0.0.1:12005: connected
[ssh] 127.0.0.1:12006: connected
==> Running phase: Detect host operating systems
[ssh] 127.0.0.1:12007: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12004: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12005: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12002: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12006: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12003: is running Ubuntu 22.04.3 LTS
==> Running phase: Acquire exclusive host lock
==> Running phase: Prepare hosts
[ssh] 127.0.0.1:12004: updating environment
[ssh] 127.0.0.1:12002: updating environment
[ssh] 127.0.0.1:12003: updating environment
[ssh] 127.0.0.1:12007: is a container, applying a fix
[ssh] 127.0.0.1:12006: is a container, applying a fix
[ssh] 127.0.0.1:12005: is a container, applying a fix
[ssh] 127.0.0.1:12004: reconnecting to apply new environment
[ssh] 127.0.0.1:12002: reconnecting to apply new environment
[ssh] 127.0.0.1:12003: reconnecting to apply new environment
[ssh] 127.0.0.1:12002: is a container, applying a fix
[ssh] 127.0.0.1:12003: is a container, applying a fix
[ssh] 127.0.0.1:12004: is a container, applying a fix
==> Running phase: Gather host facts
[ssh] 127.0.0.1:12005: using w0 from configuration as hostname
[ssh] 127.0.0.1:12004: using c2 from configuration as hostname
[ssh] 127.0.0.1:12007: using w2 from configuration as hostname
[ssh] 127.0.0.1:12006: using w1 from configuration as hostname
[ssh] 127.0.0.1:12002: using c0 from configuration as hostname
[ssh] 127.0.0.1:12003: using c1 from configuration as hostname
[ssh] 127.0.0.1:12003: discovered eth0 as private interface
[ssh] 127.0.0.1:12002: discovered eth0 as private interface
[ssh] 127.0.0.1:12004: discovered eth0 as private interface
[ssh] 127.0.0.1:12007: discovered eth0 as private interface
[ssh] 127.0.0.1:12006: discovered eth0 as private interface
[ssh] 127.0.0.1:12006: discovered 172.18.0.8 as private address
[ssh] 127.0.0.1:12003: discovered 172.18.0.5 as private address
[ssh] 127.0.0.1:12007: discovered 172.18.0.9 as private address
[ssh] 127.0.0.1:12002: discovered 172.18.0.4 as private address
[ssh] 127.0.0.1:12004: discovered 172.18.0.6 as private address
==> Running phase: Validate hosts
==> Running phase: Gather k0s facts
[ssh] 127.0.0.1:12002: found existing configuration
[ssh] 127.0.0.1:12004: found existing configuration
[ssh] 127.0.0.1:12003: found existing configuration
[ssh] 127.0.0.1:12003: is running k0s controller version v1.30.1+k0s.0
[ssh] 127.0.0.1:12004: is running k0s controller version v1.30.1+k0s.0
[ssh] 127.0.0.1:12005: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w0 has joined
[ssh] 127.0.0.1:12006: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w1 has joined
[ssh] 127.0.0.1:12007: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w2 has joined
==> Running phase: Validate facts
[ssh] 127.0.0.1:12002: validating configuration
[ssh] 127.0.0.1:12003: validating configuration
[ssh] 127.0.0.1:12004: validating configuration
==> Running phase: Install controllers
[ssh] 127.0.0.1:12002: validating api connection to https://k0s-lb0:6443
[ssh] 127.0.0.1:12003: generating token
[ssh] 127.0.0.1:12002: writing join token
[ssh] 127.0.0.1:12002: installing k0s controller
[ssh] 127.0.0.1:12002: updating service environment
[ssh] 127.0.0.1:12002: starting service
[ssh] 127.0.0.1:12002: waiting for the k0s service to start
[ssh] 127.0.0.1:12002: waiting for kubernetes api to respond
==> Running phase: Release exclusive host lock
==> Running phase: Disconnect from hosts
==> Finished in 27s
k0s cluster version v1.30.1+k0s.0 is now installed
Tip: To access the cluster you can now fetch the admin kubeconfig using:
     k0sctl kubeconfig

Split-brain is further confirmed by the etcd test command:

c0 {"members":{"c0":"https://172.18.0.4:2380"}}
c1 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}
c2 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}

Scenario 2 - controller VM reappears, attempt to use k0sctl with reset flag

Immediately after removal, the VM is re-created (using bootloose create, which only created the missing container; nothing else was touched).

After that, the config is changed to set the reset flag to true on the re-created controller. k0sctl apply claims success:

==> Running phase: Connect to hosts
[ssh] 127.0.0.1:12006: connected
[ssh] 127.0.0.1:12004: connected
[ssh] 127.0.0.1:12002: connected
[ssh] 127.0.0.1:12005: connected
[ssh] 127.0.0.1:12007: connected
[ssh] 127.0.0.1:12003: connected
==> Running phase: Detect host operating systems
[ssh] 127.0.0.1:12002: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12005: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12007: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12003: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12004: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12006: is running Ubuntu 22.04.3 LTS
==> Running phase: Acquire exclusive host lock
==> Running phase: Prepare hosts
[ssh] 127.0.0.1:12004: updating environment
[ssh] 127.0.0.1:12002: updating environment
[ssh] 127.0.0.1:12003: updating environment
[ssh] 127.0.0.1:12006: is a container, applying a fix
[ssh] 127.0.0.1:12005: is a container, applying a fix
[ssh] 127.0.0.1:12007: is a container, applying a fix
[ssh] 127.0.0.1:12002: reconnecting to apply new environment
[ssh] 127.0.0.1:12003: reconnecting to apply new environment
[ssh] 127.0.0.1:12004: reconnecting to apply new environment
[ssh] 127.0.0.1:12004: is a container, applying a fix
[ssh] 127.0.0.1:12002: is a container, applying a fix
[ssh] 127.0.0.1:12003: is a container, applying a fix
==> Running phase: Gather host facts
[ssh] 127.0.0.1:12005: using w0 from configuration as hostname
[ssh] 127.0.0.1:12007: using w2 from configuration as hostname
[ssh] 127.0.0.1:12002: using c0 from configuration as hostname
[ssh] 127.0.0.1:12003: using c1 from configuration as hostname
[ssh] 127.0.0.1:12006: using w1 from configuration as hostname
[ssh] 127.0.0.1:12004: using c2 from configuration as hostname
[ssh] 127.0.0.1:12002: discovered eth0 as private interface
[ssh] 127.0.0.1:12003: discovered eth0 as private interface
[ssh] 127.0.0.1:12004: discovered eth0 as private interface
[ssh] 127.0.0.1:12006: discovered eth0 as private interface
[ssh] 127.0.0.1:12007: discovered eth0 as private interface
[ssh] 127.0.0.1:12003: discovered 172.18.0.5 as private address
[ssh] 127.0.0.1:12004: discovered 172.18.0.6 as private address
[ssh] 127.0.0.1:12002: discovered 172.18.0.4 as private address
[ssh] 127.0.0.1:12006: discovered 172.18.0.8 as private address
[ssh] 127.0.0.1:12007: discovered 172.18.0.9 as private address
==> Running phase: Validate hosts
==> Running phase: Gather k0s facts
[ssh] 127.0.0.1:12004: found existing configuration
[ssh] 127.0.0.1:12003: found existing configuration
[ssh] 127.0.0.1:12003: is running k0s controller version v1.30.1+k0s.0
[ssh] 127.0.0.1:12004: is running k0s controller version v1.30.1+k0s.0
[ssh] 127.0.0.1:12005: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w0 has joined
[ssh] 127.0.0.1:12007: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w2 has joined
[ssh] 127.0.0.1:12006: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w1 has joined
==> Running phase: Validate facts
[ssh] 127.0.0.1:12003: validating configuration
[ssh] 127.0.0.1:12004: validating configuration
==> Running phase: Reset controllers
[ssh] 127.0.0.1:12002: reset
==> Running phase: Release exclusive host lock
==> Running phase: Disconnect from hosts
==> Finished in 3s
There were nodes that got uninstalled during the apply phase. Please remove them from your k0sctl config file
k0s cluster version v1.30.1+k0s.0 is now installed
Tip: To access the cluster you can now fetch the admin kubeconfig using:
     k0sctl kubeconfig

That log still mentions only removal from the k0sctl config and not https://docs.k0sproject.io/stable/remove_controller, so the operator could assume everything is fine. However, c0 is NOT removed from the etcd cluster that runs on c1+c2:

c0 bash: line 1: k0s: command not found
c1 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}
c2 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}

After reverting the reset flag, k0sctl apply fails:

==> Running phase: Connect to hosts
[ssh] 127.0.0.1:12003: connected
[ssh] 127.0.0.1:12002: connected
[ssh] 127.0.0.1:12004: connected
[ssh] 127.0.0.1:12006: connected
[ssh] 127.0.0.1:12005: connected
[ssh] 127.0.0.1:12007: connected
==> Running phase: Detect host operating systems
[ssh] 127.0.0.1:12004: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12007: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12005: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12002: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12006: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12003: is running Ubuntu 22.04.3 LTS
==> Running phase: Acquire exclusive host lock
==> Running phase: Prepare hosts
[ssh] 127.0.0.1:12004: updating environment
[ssh] 127.0.0.1:12002: updating environment
[ssh] 127.0.0.1:12003: updating environment
[ssh] 127.0.0.1:12007: is a container, applying a fix
[ssh] 127.0.0.1:12006: is a container, applying a fix
[ssh] 127.0.0.1:12005: is a container, applying a fix
[ssh] 127.0.0.1:12004: reconnecting to apply new environment
[ssh] 127.0.0.1:12003: reconnecting to apply new environment
[ssh] 127.0.0.1:12002: reconnecting to apply new environment
[ssh] 127.0.0.1:12004: is a container, applying a fix
[ssh] 127.0.0.1:12003: is a container, applying a fix
[ssh] 127.0.0.1:12002: is a container, applying a fix
==> Running phase: Gather host facts
[ssh] 127.0.0.1:12004: using c2 from configuration as hostname
[ssh] 127.0.0.1:12003: using c1 from configuration as hostname
[ssh] 127.0.0.1:12002: using c0 from configuration as hostname
[ssh] 127.0.0.1:12007: using w2 from configuration as hostname
[ssh] 127.0.0.1:12005: using w0 from configuration as hostname
[ssh] 127.0.0.1:12006: using w1 from configuration as hostname
[ssh] 127.0.0.1:12002: discovered eth0 as private interface
[ssh] 127.0.0.1:12004: discovered eth0 as private interface
[ssh] 127.0.0.1:12007: discovered eth0 as private interface
[ssh] 127.0.0.1:12003: discovered eth0 as private interface
[ssh] 127.0.0.1:12006: discovered eth0 as private interface
[ssh] 127.0.0.1:12002: discovered 172.18.0.4 as private address
[ssh] 127.0.0.1:12003: discovered 172.18.0.5 as private address
[ssh] 127.0.0.1:12007: discovered 172.18.0.9 as private address
[ssh] 127.0.0.1:12004: discovered 172.18.0.6 as private address
[ssh] 127.0.0.1:12006: discovered 172.18.0.8 as private address
==> Running phase: Validate hosts
==> Running phase: Gather k0s facts
[ssh] 127.0.0.1:12004: found existing configuration
[ssh] 127.0.0.1:12003: found existing configuration
[ssh] 127.0.0.1:12004: is running k0s controller version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: is running k0s controller version v1.30.1+k0s.0
[ssh] 127.0.0.1:12005: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w0 has joined
[ssh] 127.0.0.1:12007: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w2 has joined
[ssh] 127.0.0.1:12006: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w1 has joined
==> Running phase: Validate facts
==> Running phase: Download k0s on hosts
[ssh] 127.0.0.1:12002: downloading k0s v1.30.1+k0s.0
==> Running phase: Install k0s binaries on hosts
[ssh] 127.0.0.1:12002: validating configuration
[ssh] 127.0.0.1:12003: validating configuration
[ssh] 127.0.0.1:12004: validating configuration
==> Running phase: Configure k0s
[ssh] 127.0.0.1:12002: installing new configuration
* Running clean-up for phase: Acquire exclusive host lock
* Running clean-up for phase: Install k0s binaries on hosts
[ssh] 127.0.0.1:12002: cleaning up k0s binary tempfile
==> Apply failed
apply failed - log file saved to /Users/.../Library/Caches/k0sctl/k0sctl.log: failed on 2 hosts:
 - [ssh] 127.0.0.1:12004: command failed: empty content for write file /tmp/tmp.3cDp2qCBy3
 - [ssh] 127.0.0.1:12003: command failed: empty content for write file /tmp/tmp.Eum6GBeL8J

Cluster status is unchanged:

c0 Error: can't list etcd cluster members: open /var/lib/k0s/pki/apiserver-etcd-client.crt: no such file or directory
c1 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}
c2 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}

An attempt to re-run k0sctl apply now leads to the same results as scenario 1:

==> Running phase: Connect to hosts
[ssh] 127.0.0.1:12003: connected
[ssh] 127.0.0.1:12002: connected
[ssh] 127.0.0.1:12004: connected
[ssh] 127.0.0.1:12005: connected
[ssh] 127.0.0.1:12007: connected
[ssh] 127.0.0.1:12006: connected
==> Running phase: Detect host operating systems
[ssh] 127.0.0.1:12007: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12004: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12006: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12005: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12003: is running Ubuntu 22.04.3 LTS
[ssh] 127.0.0.1:12002: is running Ubuntu 22.04.3 LTS
==> Running phase: Acquire exclusive host lock
==> Running phase: Prepare hosts
[ssh] 127.0.0.1:12002: updating environment
[ssh] 127.0.0.1:12003: updating environment
[ssh] 127.0.0.1:12004: updating environment
[ssh] 127.0.0.1:12006: is a container, applying a fix
[ssh] 127.0.0.1:12007: is a container, applying a fix
[ssh] 127.0.0.1:12005: is a container, applying a fix
[ssh] 127.0.0.1:12004: reconnecting to apply new environment
[ssh] 127.0.0.1:12002: reconnecting to apply new environment
[ssh] 127.0.0.1:12003: reconnecting to apply new environment
[ssh] 127.0.0.1:12003: is a container, applying a fix
[ssh] 127.0.0.1:12002: is a container, applying a fix
[ssh] 127.0.0.1:12004: is a container, applying a fix
==> Running phase: Gather host facts
[ssh] 127.0.0.1:12006: using w1 from configuration as hostname
[ssh] 127.0.0.1:12007: using w2 from configuration as hostname
[ssh] 127.0.0.1:12003: using c1 from configuration as hostname
[ssh] 127.0.0.1:12005: using w0 from configuration as hostname
[ssh] 127.0.0.1:12002: using c0 from configuration as hostname
[ssh] 127.0.0.1:12004: using c2 from configuration as hostname
[ssh] 127.0.0.1:12004: discovered eth0 as private interface
[ssh] 127.0.0.1:12002: discovered eth0 as private interface
[ssh] 127.0.0.1:12003: discovered eth0 as private interface
[ssh] 127.0.0.1:12006: discovered eth0 as private interface
[ssh] 127.0.0.1:12007: discovered eth0 as private interface
[ssh] 127.0.0.1:12003: discovered 172.18.0.5 as private address
[ssh] 127.0.0.1:12002: discovered 172.18.0.4 as private address
[ssh] 127.0.0.1:12006: discovered 172.18.0.8 as private address
[ssh] 127.0.0.1:12004: discovered 172.18.0.6 as private address
[ssh] 127.0.0.1:12007: discovered 172.18.0.9 as private address
==> Running phase: Validate hosts
==> Running phase: Gather k0s facts
[ssh] 127.0.0.1:12002: found existing configuration
[ssh] 127.0.0.1:12003: found existing configuration
[ssh] 127.0.0.1:12004: found existing configuration
[ssh] 127.0.0.1:12003: is running k0s controller version v1.30.1+k0s.0
[ssh] 127.0.0.1:12004: is running k0s controller version v1.30.1+k0s.0
[ssh] 127.0.0.1:12007: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w2 has joined
[ssh] 127.0.0.1:12005: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w0 has joined
[ssh] 127.0.0.1:12006: is running k0s worker version v1.30.1+k0s.0
[ssh] 127.0.0.1:12003: checking if worker w1 has joined
==> Running phase: Validate facts
[ssh] 127.0.0.1:12002: validating configuration
[ssh] 127.0.0.1:12003: validating configuration
[ssh] 127.0.0.1:12004: validating configuration
==> Running phase: Install controllers
[ssh] 127.0.0.1:12002: validating api connection to https://k0s-lb0:6443
[ssh] 127.0.0.1:12003: generating token
[ssh] 127.0.0.1:12002: writing join token
[ssh] 127.0.0.1:12002: installing k0s controller
[ssh] 127.0.0.1:12002: updating service environment
[ssh] 127.0.0.1:12002: starting service
[ssh] 127.0.0.1:12002: waiting for the k0s service to start
[ssh] 127.0.0.1:12002: waiting for kubernetes api to respond
==> Running phase: Release exclusive host lock
==> Running phase: Disconnect from hosts
==> Finished in 27s
k0s cluster version v1.30.1+k0s.0 is now installed
Tip: To access the cluster you can now fetch the admin kubeconfig using:
     k0sctl kubeconfig
c0 {"members":{"c0":"https://172.18.0.4:2380"}}
c1 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}
c2 {"members":{"c0":"https://172.18.0.4:2380","c1":"https://172.18.0.5:2380","c2":"https://172.18.0.6:2380"}}

@kke (Contributor) commented Jun 4, 2024

#714 is an attempt to fix some of that.

@danielskowronski (Author)

v0.18.0 solves the issue when --force is passed to k0sctl apply.

Additionally, without --force, a reasonable message is emitted:

FATA apply failed - log file saved to .../k0sctl.log: controller [ssh] 127.0.0.1:12002 is listed as an existing etcd member but k0s is not found installed on it, the host may have been replaced. check the host and use `k0s etcd leave --peer-address 172.18.0.4 on a controller or re-run apply with --force 
