OCPBUGS-15087: templates: Introduce kubelet-dependencies.target #3865
e7fb1cf
to
e7ad4e3
Compare
Skipping CI for Draft Pull Request. |
Totally untested pre-PR, so /test e2e-aws-ovn |
/test all |
Need to verify this works with upgrades too, this will be the first time we've added a .target unit into Ignition configs etc. |
Another way to say this: I don't think e.g. |
Hmm, I don't understand the build controller unit failures?
Not immediately reproducing this locally... |
Some thoughts on this, but the TLDR is that I think we should try it. I have concerns with reaching network-online before ovs-configuration has run since that will cause a network blip when it runs and reconfigures things. If something does rely on network-online to know when networking is ready, it may start before ovs-configuration runs and then be broken when the bridge changes are made. However, I think this is mitigated by a couple of things:
|
I did some testing with this locally and it seems to do what we want. I updated the ovs-configuration and node-valid-hostname dependencies to use this target and then injected an artificial delay into nodeip-configuration. It no longer blocked network-online:
|
The other factor is that (It's really just this "pre-kubelet" stuff where we haven't really ingrained that mindset!) |
e7ad4e3
to
1ca451f
Compare
OK rebased 🏄 on master, and I added another commit to actually do the binding of ovs-configuration and node-valid-hostname. |
@cgwalters: This pull request references Jira Issue OCPBUGS-15087, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/jira refresh |
I've mirrored to quay.io/cgwalters/release:4.14.0-0.ci.test-2023-10-04-145907-ci-ln-4illsfk-latest-x86_64 and kicked off a cluster upgrade |
Ohhhh right; actually |
We installed the fix using IPI on OSP and were able to create 10 new workers using a 4.6 cloud image. We also installed the fix using UPI on vSphere and were able to create 10 new workers using a 4.6 cloud image.
The only problem we found occurs with 4.1 cloud images: the podman version in those images is too old to pull the pre-merge images correctly. This should not happen with the nightly/official builds. Since we cannot test it pre-merge, we will test it post-merge using the nightly builds. ~~We can add the qe-approved label. /label qe-approved~~ PS: Finally the azure problem in the installation seems related to our fix. |
/remove-label qe-approved While investigating the azure error I found that maybe it is related to our fix. Please, could you have a look at this journal log? |
The primary motivation here is to stop pulling container images `Before=network-online.target` because it creates complicated dependency loops. This aims to fix https://issues.redhat.com/browse/OCPBUGS-15087

A lot of our services are "explicitly coupled" with ordering relationships; e.g. some had `Before=kubelet.service` but not `Before=crio.service`. systemd .target units are explicitly designed for this situation. We introduce a new `kubelet-dependencies.target` - both `crio.service` and `kubelet.service` are `After+Requires=kubelet-dependencies.target`. And units which are needed for kubelet should now be both `Before + RequiredBy=kubelet-dependencies.target`.

Similarly, we had a lot of entangling of the "node services" and the firstboot OS updates, with things explicitly ordering against `machine-config-daemon-pull.service` or poking into the implementation details of the firstboot process with `ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json`. Create a new `firstboot-osupdate.target` that succeeds after `machine-config-daemon-firstboot.service` today. Then most of the "infrastructure workload" that must run only on the second boot (such as `gcp-hostname.service`, `openshift-azure-routes.path` etc) can cleanly order after that.

This also aids with the coming work for bare metal installs to do OS updates at install time, because then we will "finalize" the OS update and continue booting.
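To make the target wiring in the description concrete, here is a minimal sketch. It is not the actual set of templates from this PR: the `Description=` text, the drop-in path, and `example-prekubelet.service` are hypothetical names chosen for illustration. Only the `After=`/`Requires=` and `Before=`/`RequiredBy=` relationships are taken from the description.

```ini
# --- kubelet-dependencies.target (sketch) ---
# A synchronization point; it runs nothing itself.
[Unit]
Description=Dependencies necessary to run kubelet

# --- kubelet.service.d/10-kubelet-dependencies.conf (hypothetical drop-in path) ---
# Consumer side: the same two lines would apply to crio.service.
[Unit]
Requires=kubelet-dependencies.target
After=kubelet-dependencies.target

# --- example-prekubelet.service (hypothetical unit) ---
# Producer side: something that must finish before kubelet and crio start.
[Unit]
Description=Hypothetical pre-kubelet setup step
Before=kubelet-dependencies.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/true

[Install]
# Takes effect when the unit is enabled: the target then requires this unit.
RequiredBy=kubelet-dependencies.target
```

With this shape, a pre-kubelet unit no longer has to list `kubelet.service` and `crio.service` individually, which is the coupling problem the description calls out.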
Good catch! Latest commit should fix it. |
(hmm we should add a test case to the e2e that verifies there are no ordering cycles across the board) |
7c35b05
to
2141f4b
Compare
/test e2e-azure |
The e2e-azure run has no ordering cycles in the journal. |
We installed it with UPI on GCP, IPI on GCP, UPI on Azure, IPI on Azure, IPI on OSP, and UPI on vSphere.
No problems were found, apart from the 4.1 image in test case "63894-Scaleup using 4.1 cloud image", which will have to be tested post-merge. We can add the qe-approved label. /label qe-approved |
I think the current presubmit failures were transient flakes |
@cgwalters: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
I think we're good for someone to lgtm this? |
Let's get this in. We have good signal from payload testing and our QE testing. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cybertron, sinnykumari, yuqi-zhang The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@rioliu-rh @sergiordlr Is there still anything we need before removing the hold on the PR? |
@sinnykumari we can safely remove the /hold, no problem. /unhold |
@cgwalters: Jira Issue OCPBUGS-15087: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-15087 has been moved to the MODIFIED state. In response to this:
@cgwalters: #3865 failed to apply on top of branch "release-4.14":
In response to this:
Cherry pick is in #3967 - just trivial conflicts. |
The primary motivation here is to eventually stop pulling container images `Before=network-online.target` because it creates complicated dependency loops.

We introduce a new `kubelet-dependencies.target` - both `crio.service` and `kubelet.service` are `After+Requires=kubelet-dependencies.target`.

This also makes it cleaner for pre-kubelet services, which can just order themselves `Before=kubelet-dependencies.target`.

This is just a small preparatory PR to introduce the target unit. The real bigger change will come when we move services like `ovs-configuration.service` and `node-valid-hostname.service` to actually be `After=network-online.target` + `Before=kubelet-dependencies.target`.

Crucially, this will unblock services like `machine-config-daemon-pull.service` that want to fetch containers before `kubelet.service` but want to be `After=network-online.target`.
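As an illustration of the ordering this commit message is working toward, the following is a sketch of a service that waits for the network and still completes before kubelet. The unit name `example-image-pull.service` and the `ExecStart=` path are hypothetical and do not correspond to any real unit in the repository. The `Wants=network-online.target` line is the usual systemd convention for units that order themselves after that target and is an addition here, not something the commit message specifies.

```ini
# --- example-image-pull.service (hypothetical unit) ---
[Unit]
Description=Hypothetical container image pre-pull
# Pull the target into the transaction and wait for it.
Wants=network-online.target
After=network-online.target
# Finish before anything gated on the kubelet dependencies target.
Before=kubelet-dependencies.target

[Service]
Type=oneshot
RemainAfterExit=yes
# Placeholder command; stands in for whatever fetches the images.
ExecStart=/usr/local/bin/example-pull-images

[Install]
RequiredBy=kubelet-dependencies.target
```

Because the unit orders against the target rather than against `kubelet.service` directly, it stays clear of `Before=network-online.target` and of the dependency loops mentioned in the first paragraph.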