MCO-731: Move services machine-config-daemon-pull.service and machine-config-daemon-firstboot.service before ovs-configuration.service #3858
Conversation
@ori-amizur: This pull request references MCO-731 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @sinnykumari
@@ -3,8 +3,6 @@ enabled: {{if eq .NetworkType "OVNKubernetes" "OpenShiftSDN"}}true{{else}}false{
   contents: |
     [Unit]
     Description=Configures OVS with proper host networking configuration
-    # Removal of this file signals firstboot completion
-    ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json
I think we can keep the condition here, but instead just add After=machine-config-daemon-firstboot.service.
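For reference, a minimal sketch of the suggestion as I read it (keeping the existing condition and only adding the ordering; the surrounding unit content is abbreviated):

```
[Unit]
Description=Configures OVS with proper host networking configuration
# Removal of this file signals firstboot completion
ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json
# Suggested addition: conditions are evaluated when the unit is about to start,
# so ordering after the firstboot service means the file should already be gone
# by the time this check runs (assuming firstboot actually ran and removed it).
After=machine-config-daemon-firstboot.service
```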
Doesn't seem to work
Tried to implement. Got the following errors:
Aug 15 13:55:53 test-infra-cluster-087a552f-master-2 systemd[1]: machine-config-daemon-firstboot.service: Found ordering cycle on machine-config-daemon-pull.service/start
Aug 15 13:55:53 test-infra-cluster-087a552f-master-2 systemd[1]: machine-config-daemon-firstboot.service: Found dependency on network-online.target/start
Aug 15 13:55:53 test-infra-cluster-087a552f-master-2 systemd[1]: machine-config-daemon-firstboot.service: Found dependency on ovs-configuration.service/start
Aug 15 13:55:53 test-infra-cluster-087a552f-master-2 systemd[1]: machine-config-daemon-firstboot.service: Found dependency on machine-config-daemon-firstboot.service/start
Aug 15 13:55:53 test-infra-cluster-087a552f-master-2 systemd[1]: machine-config-daemon-firstboot.service: Job machine-config-daemon-pull.service/start deleted to break ordering cycle starting with machine-config-daemon-firstboot.service/start
Aug 15 13:56:04 test-infra-cluster-087a552f-master-2 systemd[1]: Configures OVS with proper host networking configuration was skipped because of an unmet condition check (ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json).
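For what it's worth, my reading of the cycle those messages describe (reconstructed only from the units named in the log, so treat the exact directives as an interpretation):

```
# ovs-configuration.service               After=machine-config-daemon-firstboot.service   (the attempted change)
# machine-config-daemon-firstboot.service After=machine-config-daemon-pull.service
# machine-config-daemon-pull.service      After=network-online.target
# network-online.target                   After=ovs-configuration.service   (ovs-configuration is ordered before network-online)
# => ordering cycle; systemd dropped machine-config-daemon-pull.service/start to break it,
#    so firstboot never removed /etc/ignition-machine-config-encapsulated.json and the
#    ConditionPathExists on ovs-configuration stayed unmet.
```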
Any idea how to address these errors? What is the right ordering for the services?
Ugh. Right. Our systemd units are just a huge mess in this respect; I hit similar network ordering issues in #3788
I think to fix it properly we need to get anything that depends on pulling containers to After=network-online.target; we need to introduce a new kubelet-network-online.target and make just kubelet pull that in.
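Roughly, the idea as I understand it (only the kubelet-network-online.target name comes from the comment above; the unit contents are hypothetical and untested):

```
# kubelet-network-online.target (hypothetical): reached only once the host networking
# kubelet needs (e.g. ovs-configuration.service) is in place
[Unit]
Description=Network is ready for kubelet
Requires=network-online.target
After=network-online.target ovs-configuration.service

# kubelet.service drop-in (hypothetical): only kubelet waits on the new target, so units
# that merely pull container images can keep ordering against plain network-online.target
[Unit]
Wants=kubelet-network-online.target
After=kubelet-network-online.target
```

The implied other half would be dropping ovs-configuration.service's ordering against the generic network-online.target, so that image pulls no longer transitively wait on it.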
ExecStart=/usr/local/bin/configure-ovs.sh {{.NetworkType}}
StandardOutput=journal+console
StandardError=journal+console
Restart=on-failure
RestartSec=5
Polling is not great as a general rule, and I think we can avoid it here.
OK
I think what I'd say is that instead of polling in ExecStartPre, we change configure-ovs.sh to do polling itself.
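Something along these lines inside configure-ovs.sh, for illustration (the function name, retry count, and sleep are placeholders, not the script's actual contents):

```
# Illustrative retry loop: keep the retries inside the script so the unit
# does not need Restart=/RestartSec= polling at the systemd level.
retries=0
max_retries=5
until configure_ovs_bridges; do   # placeholder for the real configuration step
  retries=$((retries + 1))
  if [ "$retries" -ge "$max_retries" ]; then
    echo "configure-ovs: giving up after ${max_retries} attempts" >&2
    exit 1
  fi
  echo "configure-ovs: attempt ${retries} failed, retrying in 5s" >&2
  sleep 5
done
```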
Note that configure-ovs already has a retry loop:
while true; do
It's intentionally slow though to allow time to debug in between failed attempts, so it may not be a great fit for this.
OK.
I will modify the script.
The firstboot unit depends on this unit. If this unit never ends, then the firstboot never starts.
Only the original implementation solves the problem (retry through systemd), unless we would like to change the dependencies a lot more.
/hold
FWIW I'm ok with your original PR, but for everyone's sanity it would really help us all to untangle the systemd unit ordering. I can't say we should block on that though.
/retest
/unhold
/test e2e-hypershift
I started on that in
ca08e5c to bb94431
/test okd-scos-e2e-aws-ovn
/test e2e-aws-ovn-upgrade
Does it handle the problem? Currently ovs-configuration.service runs before machine-config-daemon-firstboot.service. It should run after it.
It doesn't yet, I was trying to start with a simple low-risk change and then make it bigger.
@ori-amizur: This pull request references MCO-371 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/refresh
NetworkManager-clean-initrd-state.service runs as well when /etc/ignition-machine-config-encapsulated.json is not present. Should we move this earlier as well? There are a few more services that have this condition.
It seems to me, if I am not mistaken, that this service cleans up leftovers from a previous run of NM. But since it never ran before, there are no leftovers.
It says "Cleans NetworkManager state generated by dracut" and this should be the case during initial node bootstrap as well, shouldn't it?
I think so.
/cc @jcaamano thoughts on above?
/cc @jcaamano
I don't think there should be a problem running this unconditionally. It would be nice to have a test though with custom network config in the ignition payload. I am curious though, is skipping this reboot optional? If not, are we not going to support kargs from the ignition payload?
There is another PR handling kargs from /proc/cmdline: #3856
Oh, there is one possible improvement. On firstboot, coreos-teardown-initramfs.sh copies non-default NM profiles from /run/NetworkManager/system-connections to /etc/NetworkManager/system-connections. So if this other NM state cleanup service runs on first boot as well, it might not realize that the profile a device is activated with is the same one existing in both places, so not ephemeral, and might de-activate the device unnecessarily. This might not be an issue, as it would just be activated again when NM runs. Also because we only do that if the profiles were generated by nm-initrd-generator, and in the case I described above these profiles would be user generated. But we should be extra cautious and check if the profile a device is activated with is actually present in /etc/NetworkManager/system-connections and is the only one targeting that device, and avoid de-activating it in that case.
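To make that last check concrete, a sketch of the kind of guard meant here (the device name is an example, and this skips the "only profile targeting that device" refinement):

```
# Sketch: before de-activating a device during initrd-state cleanup, check whether a
# persistent profile for it already exists under /etc/NetworkManager/system-connections.
dev=ens3   # example device name
if grep -rlq "interface-name=${dev}" /etc/NetworkManager/system-connections/ 2>/dev/null; then
  echo "persistent profile exists for ${dev}; leave it activated"
else
  echo "no persistent profile for ${dev}; safe to clean up initrd-generated state"
fi
```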
/test okd-e2e-aws-ovn
@vrutkovs: The specified target(s) for /test okd-e2e-aws-ovn were not found among the commands available to trigger optional jobs.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@ori-amizur I did not realize this was already on the merge queue. Another comment I had is that we might need to rely on
I think we will need to take these concerns into account while testing this feature.
Yeah, we'll probably just need to remove all conditionals that trigger on the firstboot ignition config.
This was originally reordered such that ovs-configuration was able to run when the reboot was skipped. See: openshift#3858 This broke ARO since they require the network to be ready for the pull to happen (and generally, probably best for the network to be ready before attempting to pull the new OS image). Since the services have changed since then, ovs-configuration no longer depends on the existence of the firstboot file, so we should be able to untangle this dependency.
This was originally reordered such that ovs-configuration was able to run when the reboot was skipped. See: openshift#3858 This broke ARO since they require the network to be ready for the pull to happen (and generally, probably best for the network to be ready before attempting to pull the new OS image). This fix aims to apply just to Azure, which is not ideal since it drifts the service definition, but if don't-reboot-on-metal and ARO have different dependency chains, this is the easiest middle ground.
The systemd service ovs-configuration.service is skipped if the file /etc/ignition-machine-config-encapsulated.json exists.
The service machine-config-daemon-firstboot.service removes the file after processing.
When we want to skip the reboot, we need to make sure ovs-configuration.service is not skipped. Therefore, these two services are moved before ovs-configuration.service.
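In unit terms, the change amounts to roughly the following ordering (a sketch of the relationship, not the exact template contents):

```
# machine-config-daemon-firstboot.service (sketch)
[Unit]
After=machine-config-daemon-pull.service
# Run before ovs-configuration.service so that /etc/ignition-machine-config-encapsulated.json
# is already removed by the time its ConditionPathExists is evaluated
Before=ovs-configuration.service
```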
- What I did
Moved machine-config-daemon-pull.service and machine-config-daemon-firstboot.service before ovs-configuration.service
- How to verify it
- Description for the changelog
Move services machine-config-daemon-pull.service and machine-config-daemon-firstboot.service before ovs-configuration.service