OCPBUGS-20418: Introduce kubelet-dependencies.target and firstboot-osupdate.target #3967

Conversation

cgwalters
Member

The primary motivation here is to stop pulling container images `Before=network-online.target` because it creates complicated dependency loops.

This is aiming to fix
https://issues.redhat.com/browse/OCPBUGS-15087

A lot of our services are "explicitly coupled" with ordering relationships; e.g. some had `Before=kubelet.service` but not `Before=crio.service`.

systemd .target units are explicitly designed for this situation.

We introduce a new `kubelet-dependencies.target` - both `crio.service` and `kubelet.service` are `After+Requires=kubelet-dependencies.target`. And units which are needed for kubelet should now be both `Before + RequiredBy=kubelet-dependencies.target`.
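As a sketch of how that wiring looks in unit-file terms (the drop-in path and the example dependency unit are illustrative, not the exact MCO files):

```ini
# Illustrative drop-in for a consumer of the target, e.g.
# /etc/systemd/system/kubelet.service.d/10-dependencies.conf
[Unit]
Requires=kubelet-dependencies.target
After=kubelet-dependencies.target

# Illustrative provider: a hypothetical unit that kubelet needs,
# e.g. /etc/systemd/system/example-pull-images.service
[Unit]
Description=Example dependency needed before kubelet
Before=kubelet-dependencies.target

[Service]
Type=oneshot
ExecStart=/usr/bin/true

[Install]
RequiredBy=kubelet-dependencies.target
```

With this shape, adding or removing a kubelet dependency touches only the provider unit; neither `crio.service` nor `kubelet.service` needs to know about it.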

Similarly, we had a lot of entangling of the "node services" and the firstboot OS updates, with things explicitly ordering against `machine-config-daemon-pull.service` or poking into the implementation details of the firstboot process with `ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json`.

Create a new `firstboot-osupdate.target` that succeeds after the `machine-config-daemon-firstboot.service` today. Then most of the "infrastructure workload" that must run only on the second boot (such as `gcp-hostname.service`, `openshift-azure-routes.path`, etc.) can cleanly order after that.
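In unit-file terms, the ordering could look roughly like this (a sketch; the exact directives in the MCO's real units may differ):

```ini
# firstboot-osupdate.target: reached once
# machine-config-daemon-firstboot.service has completed
[Unit]
Description=First Boot OS Update
After=machine-config-daemon-firstboot.service
Requires=machine-config-daemon-firstboot.service

# A second-boot-only unit (e.g. gcp-hostname.service) then simply
# orders after the target instead of probing firstboot internals:
[Unit]
After=firstboot-osupdate.target
```

The consumer units no longer need to know how the firstboot process decides it is done; they depend only on the target.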

This also aids the coming work for bare metal installs to do OS updates at install time, because then we will "finalize" the OS update and continue booting.

(cherry picked from commit 2141f4b)

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Oct 11, 2023
@openshift-ci-robot
Contributor

@cgwalters: This pull request references Jira Issue OCPBUGS-20418, which is valid. The bug has been moved to the POST state.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
  • dependent bug Jira Issue OCPBUGS-15087 is in the state ON_QA, which is one of the valid states (MODIFIED, ON_QA, VERIFIED)
  • dependent Jira Issue OCPBUGS-15087 targets the "4.15.0" version, which is one of the valid target versions: 4.15.0
  • bug has dependents

Requesting review from QA contact:
/cc @mike-nguyen

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 11, 2023
@sdodson
Member

sdodson commented Oct 11, 2023

/payload 4.14 nightly blocking

@openshift-ci
Contributor

openshift-ci bot commented Oct 11, 2023

@sdodson: trigger 8 job(s) of type blocking for the nightly release of OCP 4.14

  • periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade
  • periodic-ci-openshift-release-master-ci-4.14-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.14-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn-bm

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c16d1f40-685f-11ee-9a34-c857487b6e1b-0

@cybertron
Member

/test e2e-metal-ipi
/test e2e-vsphere-upi

@cgwalters
Member Author

Not totally sure what to make of the payload run - the failures offhand look like flakes/failures on the Prow hosting cluster or something?

@cgwalters
Member Author

/retest

@openshift-ci
Contributor

openshift-ci bot commented Oct 12, 2023

@cgwalters: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-vsphere-upi-zones | f7b0772 | link | false | /test e2e-vsphere-upi-zones |
| ci/prow/okd-scos-e2e-aws-ovn | f7b0772 | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/e2e-openstack | f7b0772 | link | false | /test e2e-openstack |
| ci/prow/okd-scos-e2e-gcp-op | f7b0772 | link | false | /test okd-scos-e2e-gcp-op |
| ci/prow/e2e-gcp-rt-op | f7b0772 | link | false | /test e2e-gcp-rt-op |

Full PR test history. Your PR dashboard.


@cybertron
Member

/test e2e-hypershift

The failures don't seem to have anything to do with deployment for the nodes, so they're past the point where this change would affect anything. The on-prem jobs passed so from my perspective:
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 13, 2023
@openshift-ci
Contributor

openshift-ci bot commented Oct 13, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, cybertron

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rioliu-rh

/hold for QE pre-merge testing
/cc @sergiordlr

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 16, 2023
@sergiordlr

sergiordlr commented Oct 16, 2023

In order to verify this PR we executed the following steps:

  1. Install using IPI and UPI on Azure
  2. Install using IPI on OSP
     • Create a machineset using a 4.6 cloud image
     • Create 20 workers using this 4.6 cloud image
  3. Install using IPI on vSphere
     • Create a machineset using a 4.6 cloud image
     • Create 20 workers using this 4.6 cloud image
  4. Install using UPI on vSphere
     • Create 20 new workers using a 4.6 cloud image

No issues were found.

We can safely assume that the rest of the platforms are tested by the prow jobs required to merge the PR.

Because of the way the CI images are stored, we cannot execute the "scale" e2e test cases pre-merge. Hence, even though this PR has the qe-approved label, we will have to execute those "scale" e2e test cases post-merge before fully verifying the Jira ticket.

We can add the qe-approved label

/label qe-approved

Thank you very much for this fix!!

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Oct 16, 2023
@openshift-ci-robot
Contributor

@cgwalters: This pull request references Jira Issue OCPBUGS-20418, which is valid.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
  • dependent bug Jira Issue OCPBUGS-15087 is in the state Verified, which is one of the valid states (MODIFIED, ON_QA, VERIFIED)
  • dependent Jira Issue OCPBUGS-15087 targets the "4.15.0" version, which is one of the valid target versions: 4.15.0
  • bug has dependents

Requesting review from QA contact:
/cc @sergiordlr


@sergiordlr

/label cherry-pick-approved
/unhold

@openshift-ci openshift-ci bot added cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Oct 16, 2023
@cgwalters
Member Author

Just to repeat, I think we should give this a bit more time before we add the backport-risk-assessed label... so far I am not aware of any fallout in 4.15, but it's still early.

@sinnykumari
Contributor

This should be good to get merged.
/hold cancel
I think it is ok to merge #4001 separately after this PR and we will take care of reverting both when needed.
As per offline conversation with Colin, #3979 fixes OKD issue.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 2, 2023
@sinnykumari
Contributor

/label backport-risk-assessed

@openshift-ci openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Nov 2, 2023
@openshift-ci openshift-ci bot merged commit 5cadd58 into openshift:release-4.14 Nov 2, 2023
18 of 24 checks passed
@openshift-ci-robot
Contributor

@cgwalters: Jira Issue OCPBUGS-20418: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-20418 has been moved to the MODIFIED state.


@openshift-merge-robot
Contributor

Fix included in accepted release 4.14.0-0.nightly-2023-11-03-193211

@sinnykumari
Contributor

This is needed in 4.13 as well.
/cherry-pick release-4.13

@openshift-cherrypick-robot

@sinnykumari: new pull request created: #4043


jlebon added a commit to jlebon/machine-config-operator that referenced this pull request Jun 21, 2024
When overwriting a systemd unit with new content, we need to account for
the case where the new unit content has a different `[Install]` section.
If it does, then simply overwriting will leak the previous enablement
symlinks and become node state. That's OK most of the time, but this can
cause real issues as we've seen with the combination of openshift#3967 which does
exactly that (changing `[Install]` sections) and openshift#4213 which assumed
that those symlinks were cleaned up. More details on that cocktail in:

https://issues.redhat.com/browse/OCPBUGS-33694?focusedId=24917003#comment-24917003

Fix this by always checking if the unit is currently enabled, and if so,
running `systemctl disable` *before* overwriting its contents. The unit
will then be re-enabled (or not) based on the MachineConfig.
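A minimal shell sketch of that disable-before-overwrite logic (the actual fix lives in the MCO's Go daemon code; the function name and the `UNIT_DIR` override here are hypothetical, for illustration only):

```shell
#!/bin/sh
# Sketch: disable a systemd unit before overwriting its file, so that
# enablement symlinks created by the OLD [Install] section don't leak.
# UNIT_DIR is parameterized for illustration; the real path is
# /etc/systemd/system.
UNIT_DIR="${UNIT_DIR:-/etc/systemd/system}"

overwrite_unit() {
  name=$1
  new_content=$2
  # If the unit is currently enabled, disable it first: this removes
  # the symlinks that the old [Install] section created.
  if command -v systemctl >/dev/null 2>&1 &&
     [ "$(systemctl is-enabled "$name" 2>/dev/null)" = "enabled" ]; then
    systemctl disable "$name"
  fi
  # Only now write the new content in place of the old unit.
  printf '%s\n' "$new_content" > "$UNIT_DIR/$name"
  # The unit is then re-enabled (or not) based on the MachineConfig.
}
```

The key ordering point is that `systemctl disable` runs against the old file, while it still carries the old `[Install]` section, before the new content replaces it.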
jlebon added a commit to jlebon/machine-config-operator that referenced this pull request Jun 21, 2024
jlebon added a commit to jlebon/machine-config-operator that referenced this pull request Jun 21, 2024
jlebon added a commit to jlebon/machine-config-operator that referenced this pull request Jun 25, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/machine-config-operator that referenced this pull request Jun 26, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/machine-config-operator that referenced this pull request Jun 27, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/machine-config-operator that referenced this pull request Jun 30, 2024
jlebon added a commit to openshift-cherrypick-robot/machine-config-operator that referenced this pull request Jul 2, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/machine-config-operator that referenced this pull request Aug 9, 2024
yuqi-zhang added a commit to yuqi-zhang/machine-config-operator that referenced this pull request Aug 12, 2024
Original description:

daemon/update: disable systemd unit before overwriting
