Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

broken plan status symmetry being reported by autopilot #4718

Closed
4 tasks done
ricardomaraschini opened this issue Jul 4, 2024 · 2 comments
Closed
4 tasks done

broken plan status symmetry being reported by autopilot #4718

ricardomaraschini opened this issue Jul 4, 2024 · 2 comments
Labels
bug Something isn't working component/autopilot

Comments

@ricardomaraschini
Copy link
Contributor

Before creating an issue, make sure you've checked the following:

  • You are running the latest released version of k0s
  • Make sure you've searched for existing issues, both open and closed
  • Make sure you've searched for PRs too, a fix might've been merged already
  • You're looking at docs for the released version, "main" branch docs are usually ahead of released versions.

Platform

Linux 6.8.0-35-generic #35-Ubuntu SMP PREEMPT_DYNAMIC Mon May 20 15:51:52 UTC 2024 x86_64 GNU/Linux
PRETTY_NAME="Ubuntu 24.04 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

Version

v1.28.10+k0s.0

Sysinfo

`k0s sysinfo`
Machine ID: "00e28c1ef61aa37947ea76c0529928521f9feebf778ff180313c99c09fa18a92" (from machine) (pass)
Total memory: 3.8 GiB (pass)
Disk space available for /var/lib/k0s: 64.1 GiB (pass)
Name resolution: localhost: [127.0.0.1] (pass)
Operating system: Linux (pass)
  Linux kernel release: 6.8.0-35-generic (pass)
  Max. file descriptors per process: current: 1048576 / max: 1048576 (pass)
  AppArmor: active (pass)
  Executable in PATH: modprobe: /usr/sbin/modprobe (pass)
  Executable in PATH: mount: /usr/bin/mount (pass)
  Executable in PATH: umount: /usr/bin/umount (pass)
  /proc file system: mounted (0x9fa0) (pass)
  Control Groups: version 2 (pass)
    cgroup controller "cpu": available (pass)
    cgroup controller "cpuacct": available (via cpu in version 2) (pass)
    cgroup controller "cpuset": available (pass)
    cgroup controller "memory": available (pass)
    cgroup controller "devices": available (assumed) (pass)
    cgroup controller "freezer": available (assumed) (pass)
    cgroup controller "pids": available (pass)
    cgroup controller "hugetlb": available (pass)
    cgroup controller "blkio": available (via io in version 2) (pass)
  CONFIG_CGROUPS: Control Group support: built-in (pass)
    CONFIG_CGROUP_FREEZER: Freezer cgroup subsystem: built-in (pass)
    CONFIG_CGROUP_PIDS: PIDs cgroup subsystem: built-in (pass)
    CONFIG_CGROUP_DEVICE: Device controller for cgroups: built-in (pass)
    CONFIG_CPUSETS: Cpuset support: built-in (pass)
    CONFIG_CGROUP_CPUACCT: Simple CPU accounting cgroup subsystem: built-in (pass)
    CONFIG_MEMCG: Memory Resource Controller for Control Groups: built-in (pass)
    CONFIG_CGROUP_HUGETLB: HugeTLB Resource Controller for Control Groups: built-in (pass)
    CONFIG_CGROUP_SCHED: Group CPU scheduler: built-in (pass)
      CONFIG_FAIR_GROUP_SCHED: Group scheduling for SCHED_OTHER: built-in (pass)
        CONFIG_CFS_BANDWIDTH: CPU bandwidth provisioning for FAIR_GROUP_SCHED: built-in (pass)
    CONFIG_BLK_CGROUP: Block IO controller: built-in (pass)
  CONFIG_NAMESPACES: Namespaces support: built-in (pass)
    CONFIG_UTS_NS: UTS namespace: built-in (pass)
    CONFIG_IPC_NS: IPC namespace: built-in (pass)
    CONFIG_PID_NS: PID namespace: built-in (pass)
    CONFIG_NET_NS: Network namespace: built-in (pass)
  CONFIG_NET: Networking support: built-in (pass)
    CONFIG_INET: TCP/IP networking: built-in (pass)
      CONFIG_IPV6: The IPv6 protocol: built-in (pass)
    CONFIG_NETFILTER: Network packet filtering framework (Netfilter): built-in (pass)
      CONFIG_NETFILTER_ADVANCED: Advanced netfilter configuration: built-in (pass)
      CONFIG_NF_CONNTRACK: Netfilter connection tracking support: module (pass)
      CONFIG_NETFILTER_XTABLES: Netfilter Xtables support: module (pass)
        CONFIG_NETFILTER_XT_TARGET_REDIRECT: REDIRECT target support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_COMMENT: "comment" match support: module (pass)
        CONFIG_NETFILTER_XT_MARK: nfmark target and match support: module (pass)
        CONFIG_NETFILTER_XT_SET: set target and match support: module (pass)
        CONFIG_NETFILTER_XT_TARGET_MASQUERADE: MASQUERADE target support: module (pass)
        CONFIG_NETFILTER_XT_NAT: "SNAT and DNAT" targets support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_ADDRTYPE: "addrtype" address type match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_CONNTRACK: "conntrack" connection tracking match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_MULTIPORT: "multiport" Multiple port match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_RECENT: "recent" match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_STATISTIC: "statistic" match support: module (pass)
      CONFIG_NETFILTER_NETLINK: module (pass)
      CONFIG_NF_NAT: module (pass)
      CONFIG_IP_SET: IP set support: module (pass)
        CONFIG_IP_SET_HASH_IP: hash:ip set support: module (pass)
        CONFIG_IP_SET_HASH_NET: hash:net set support: module (pass)
      CONFIG_IP_VS: IP virtual server support: module (pass)
        CONFIG_IP_VS_NFCT: Netfilter connection tracking: built-in (pass)
        CONFIG_IP_VS_SH: Source hashing scheduling: module (pass)
        CONFIG_IP_VS_RR: Round-robin scheduling: module (pass)
        CONFIG_IP_VS_WRR: Weighted round-robin scheduling: module (pass)
      CONFIG_NF_CONNTRACK_IPV4: IPv4 connetion tracking support (required for NAT): unknown (warning)
      CONFIG_NF_REJECT_IPV4: IPv4 packet rejection: module (pass)
      CONFIG_NF_NAT_IPV4: IPv4 NAT: unknown (warning)
      CONFIG_IP_NF_IPTABLES: IP tables support: module (pass)
        CONFIG_IP_NF_FILTER: Packet filtering: module (pass)
          CONFIG_IP_NF_TARGET_REJECT: REJECT target support: module (pass)
        CONFIG_IP_NF_NAT: iptables NAT support: module (pass)
        CONFIG_IP_NF_MANGLE: Packet mangling: module (pass)
      CONFIG_NF_DEFRAG_IPV4: module (pass)
      CONFIG_NF_CONNTRACK_IPV6: IPv6 connetion tracking support (required for NAT): unknown (warning)
      CONFIG_NF_NAT_IPV6: IPv6 NAT: unknown (warning)
      CONFIG_IP6_NF_IPTABLES: IP6 tables support: module (pass)
        CONFIG_IP6_NF_FILTER: Packet filtering: module (pass)
        CONFIG_IP6_NF_MANGLE: Packet mangling: module (pass)
        CONFIG_IP6_NF_NAT: ip6tables NAT support: module (pass)
      CONFIG_NF_DEFRAG_IPV6: module (pass)
    CONFIG_BRIDGE: 802.1d Ethernet Bridging: module (pass)
      CONFIG_LLC: module (pass)
      CONFIG_STP: module (pass)
  CONFIG_EXT4_FS: The Extended 4 (ext4) filesystem: built-in (pass)
  CONFIG_PROC_FS: /proc file system support: built-in (pass)

What happened?

I have started an upgrade from v1.28.10 to v1.29.5 using the following autopilot plan:

- apiVersion: autopilot.k0sproject.io/v1beta2
  kind: Plan
  metadata:
    annotations:
      embedded-cluster.replicated.com/installation-name: "20240704151416"
    creationTimestamp: "2024-07-04T15:14:16Z"
    generation: 1
    name: autopilot
    resourceVersion: "5528"
    uid: 2be1bb16-3c89-4d95-bffb-0c844fb13daa
  spec:
    commands:
    - k0supdate:
        platforms:
          linux-amd64:
            sha256: 88c8342b715b38d64fb219ad0d208a8848f123e9a310fadfc7e54aaa627a4360
            url: https://staging.replicated.app/embedded-cluster-public-files/k0s-binaries/v1.29.5+k0s.0-ec.0
        targets:
          controllers:
            discovery:
              static:
                nodes:
                - node-0
                - node-1
                - node-2
            limits:
              concurrent: 1
          workers:
            discovery:
              static: {}
            limits:
              concurrent: 1
        version: v1.29.5+k0s.0-ec.0
    id: b2768f74-70ad-4b25-89ea-11b50edc862e
    timestamp: now

The first node seems to have been upgraded successfully:

NAME     STATUS   ROLES           AGE   VERSION
node-0   Ready    control-plane   44m   v1.29.5+k0s
node-1   Ready    control-plane   38m   v1.28.10+k0s
node-2   Ready    control-plane   36m   v1.28.10+k0s

After that nothing else happened, the plan is as follow for over 30 minutes:

- apiVersion: autopilot.k0sproject.io/v1beta2
  kind: Plan
  metadata:
    annotations:
      embedded-cluster.replicated.com/installation-name: "20240704151416"
    creationTimestamp: "2024-07-04T15:14:16Z"
    generation: 1
    name: autopilot
    resourceVersion: "5528"
    uid: 2be1bb16-3c89-4d95-bffb-0c844fb13daa
  spec:
    commands:
    - k0supdate:
        platforms:
          linux-amd64:
            sha256: 88c8342b715b38d64fb219ad0d208a8848f123e9a310fadfc7e54aaa627a4360
            url: https://staging.replicated.app/embedded-cluster-public-files/k0s-binaries/v1.29.5+k0s.0-ec.0
        targets:
          controllers:
            discovery:
              static:
                nodes:
                - node-0
                - node-1
                - node-2
            limits:
              concurrent: 1
          workers:
            discovery:
              static: {}
            limits:
              concurrent: 1
        version: v1.29.5+k0s.0-ec.0
    id: b2768f74-70ad-4b25-89ea-11b50edc862e
    timestamp: now
  status:
    commands:
    - id: 0
      k0supdate:
        controllers:
        - lastUpdatedTimestamp: "2024-07-04T15:14:16Z"
          name: node-0
          state: SignalPending
        - lastUpdatedTimestamp: "2024-07-04T15:14:16Z"
          name: node-1
          state: SignalPending
        - lastUpdatedTimestamp: "2024-07-04T15:14:16Z"
          name: node-2
          state: SignalPending
      state: Schedulable
    - id: 0
      k0supdate:
        controllers:
        - lastUpdatedTimestamp: "2024-07-04T15:14:16Z"
          name: node-0
          state: SignalPending
        - lastUpdatedTimestamp: "2024-07-04T15:14:16Z"
          name: node-1
          state: SignalPending
        - lastUpdatedTimestamp: "2024-07-04T15:14:16Z"
          name: node-2
          state: SignalPending
      state: SchedulableWait
    - id: 0
      k0supdate:
        controllers:
        - lastUpdatedTimestamp: "2024-07-04T15:14:17Z"
          name: node-0
          state: SignalPending
        - lastUpdatedTimestamp: "2024-07-04T15:14:17Z"
          name: node-1
          state: SignalPending
        - lastUpdatedTimestamp: "2024-07-04T15:14:17Z"
          name: node-2
          state: SignalPending
      state: SchedulableWait
    state: SchedulableWait

What is strange here is the fact that we have three plan statuses but only one command. We could see the following in one of the nodes journal for k0scontroller:

Jul 04 15:14:17 node-2 k0s[1052]: time="2024-07-04 15:14:17" level=error msg="Unable to process state controller handler: broken plan status symmetry [#cmd=1, #status=2]" component=autopilot controller=schedulablewait leadermode=true

Steps to reproduce

I think this isn't deterministic otherwise this would have been caught earlier but this is how I reproduced it.

  1. Install v1.28.10.
  2. Create a plan similar to the one mentioned above.
  3. Wait.

Expected behavior

Upgrade succeeds.

Actual behavior

The first node is upgraded but the second and third aren't and we can see that "symmetry error" in one of the node logs. After 30 minutes on the same state I decided to open this ticket.

Screenshots and logs

No response

Additional context

No response

@ricardomaraschini ricardomaraschini added the bug Something isn't working label Jul 4, 2024
@ricardomaraschini
Copy link
Contributor Author

ricardomaraschini commented Jul 4, 2024

Maybe the solution would be to just check if a status for a given command id already exists before adding a new one here. Inspecting the logs I can see that all three nodes added, at some point, statuses:

root@node-0:~# journalctl -u k0scontroller | grep "Adding new status for plan"
Jul 04 15:14:16 node-0 k0s[1062]: time="2024-07-04 15:14:16" level=info msg="Adding new status for plan 'K0sUpdate' (index=0)" component=inithandler controller=plans leadermode=true
Jul 04 15:14:16 node-0 k0s[1062]: time="2024-07-04 15:14:16" level=info msg="Adding new status for plan 'K0sUpdate' (index=0)" component=inithandler controller=plans leadermode=true
Jul 04 15:14:16 node-0 k0s[1062]: time="2024-07-04 15:14:16" level=info msg="Adding new status for plan 'K0sUpdate' (index=0)" component=inithandler controller=plans leadermode=true
Jul 04 15:14:17 node-0 k0s[1062]: time="2024-07-04 15:14:17" level=info msg="Adding new status for plan 'K0sUpdate' (index=0)" component=inithandler controller=plans leadermode=true
root@node-0:~#
root@node-1:~# journalctl -u k0scontroller | grep "Adding new status for plan"
Jul 04 15:14:16 node-1 k0s[1230]: time="2024-07-04 15:14:16" level=info msg="Adding new status for plan 'K0sUpdate' (index=0)" component=inithandler controller=plans leadermode=true
Jul 04 15:14:16 node-1 k0s[1230]: time="2024-07-04 15:14:16" level=info msg="Adding new status for plan 'K0sUpdate' (index=0)" component=inithandler controller=plans leadermode=true
Jul 04 15:14:16 node-1 k0s[1230]: time="2024-07-04 15:14:16" level=info msg="Adding new status for plan 'K0sUpdate' (index=0)" component=inithandler controller=plans leadermode=true
root@node-1:~#
root@node-2:~# journalctl -u k0scontroller | grep "Adding new status for plan"
Jul 04 15:14:16 node-2 k0s[1052]: time="2024-07-04 15:14:16" level=info msg="Adding new status for plan 'K0sUpdate' (index=0)" component=inithandler controller=plans leadermode=true
root@node-2:~#

I am not sure all nodes should be processing this very same object, all around the same time. Giving that we have only 3 statuses in the Plan but much more "Adding new status" log lines I suspected that some of these additions were dropped due to conflict and indeed I could find these lines in the log:

Jul 04 15:14:16 node-2 k0s[1052]: time="2024-07-04 15:14:16" level=error msg="Reconciler error" Plan="{\"name\":\"autopilot\"}" component=controller-runtime controller=plan controllerGroup=autopilot.k0sproject.io controllerKind=Plan error="unable to update plan '/autopilot' with status: Operation cannot be fulfilled on plans.autopilot.k0sproject.io \"autopilot\": the object has been modified; please apply your changes to the latest version and try again" name=autopilot namespace= reconcileID="\"40368d4a-b127-406a-8380-8745ff2dc7f5\""

Clearly smells like a concurrency issue. Was this operation of adding the status supposed to happen only in one node ? I guess this is what we expect.

@ricardomaraschini
Copy link
Contributor Author

Oh, I think it has to do with the machine-id. Those three VMs were cloned out of the same image so on v1.28 they are all having their holderIdentity set to the same value:

apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2024-07-04T15:01:27Z"
  name: k0s-autopilot-controller
  namespace: k0s-autopilot
  resourceVersion: "446357"
  uid: 9fa9df27-c8ed-48c3-b776-865ad93ebc25
spec:
  acquireTime: "2024-07-05T12:33:21.082083Z"
  holderIdentity: 00e28c1ef61aa37947ea76c0529928521f9feebf778ff180313c99c09fa18a92
  leaseDurationSeconds: 60
  leaseTransitions: 11
  renewTime: "2024-07-05T12:36:39.599608Z"

I have seen this this has already been addressed on #4230, for v1.30 IIUC.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working component/autopilot
Projects
None yet
Development

No branches or pull requests

2 participants