
MCO-731: Move services machine-config-daemon-pull.service and machine-config-daemon-firstboot.service before ovs-configuration.service #3858

Merged (1 commit, Sep 18, 2023)

Conversation

ori-amizur
Contributor

@ori-amizur ori-amizur commented Aug 15, 2023

The systemd service ovs-configuration.service is skipped if the file /etc/ignition-machine-config-encapsulated.json exists.
The service machine-config-daemon-firstboot.service removes the file after processing.
When we want to skip reboot, we need to verify that the service is not skipped. Therefore, these two services are moved before ovs-configuration.service.

- What I did
Moved machine-config-daemon-pull.service and machine-config-daemon-firstboot.service before ovs-configuration.service
- How to verify it

  1. Regression
  2. Verify with Assisted Installer that the installation succeeds with reboot skipped and network type is OVN
    - Description for the changelog

Move services machine-config-daemon-pull.service and machine-config-daemon-firstboot.service before ovs-configuration.service
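One plausible reading of the intended ordering, sketched as a systemd unit fragment. The unit names and the ConditionPathExists line are from this PR; the exact stanza layout below is an assumption, not the literal template:

```ini
[Unit]
Description=Configures OVS with proper host networking configuration
# machine-config-daemon-firstboot.service deletes
# /etc/ignition-machine-config-encapsulated.json when it finishes, so
# ordering after it lets the condition below pass when reboot is skipped.
After=machine-config-daemon-pull.service machine-config-daemon-firstboot.service
ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json
```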

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 15, 2023
@openshift-ci-robot
Contributor

openshift-ci-robot commented Aug 15, 2023

@ori-amizur: This pull request references MCO-731 which is a valid jira issue.

In response to this:

The systemd service ovs-configuration.service is skipped if the file /etc/ignition-machine-config-encapsulated.json exists. The reason is that there is an assumption that reboot will be done if the file exists. When we want to skip reboot, we need to verify that the service is not skipped. Therefore, the service will retry to configure until the file does not exist.

- What I did
Changed unit file to retry when the file exists
- How to verify it

  1. Regression
  2. Verify with Assisted Installer that the installation succeeds with reboot skipped and network type is OVN
    - Description for the changelog

Retry ovs-configuration.service if file /etc/ignition-machine-config-encapsulated.json exists

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ori-amizur
Contributor Author

/assign @sinnykumari

@@ -3,8 +3,6 @@ enabled: {{if eq .NetworkType "OVNKubernetes" "OpenShiftSDN"}}true{{else}}false{
contents: |
[Unit]
Description=Configures OVS with proper host networking configuration
# Removal of this file signals firstboot completion
ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json
Member

I think we can keep the condition here, but instead just add After=machine-config-daemon-firstboot.service.

Contributor Author

@ori-amizur ori-amizur Aug 15, 2023

Doesn't seem to work

Contributor Author

Tried to implement. Got the following errors:

Aug 15 13:55:53 test-infra-cluster-087a552f-master-2 systemd[1]: machine-config-daemon-firstboot.service: Found ordering cycle on machine-config-daemon-pull.service/start
Aug 15 13:55:53 test-infra-cluster-087a552f-master-2 systemd[1]: machine-config-daemon-firstboot.service: Found dependency on network-online.target/start
Aug 15 13:55:53 test-infra-cluster-087a552f-master-2 systemd[1]: machine-config-daemon-firstboot.service: Found dependency on ovs-configuration.service/start
Aug 15 13:55:53 test-infra-cluster-087a552f-master-2 systemd[1]: machine-config-daemon-firstboot.service: Found dependency on machine-config-daemon-firstboot.service/start
Aug 15 13:55:53 test-infra-cluster-087a552f-master-2 systemd[1]: machine-config-daemon-firstboot.service: Job machine-config-daemon-pull.service/start deleted to break ordering cycle starting with machine-config-daemon-firstboot.service/start
Aug 15 13:56:04 test-infra-cluster-087a552f-master-2 systemd[1]: Configures OVS with proper host networking configuration was skipped because of an unmet condition check (ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json).

Any idea how to address these errors? What is the right ordering for the services?

Member

Ugh. Right. Our systemd units are just a huge mess in this respect; I hit similar network ordering issues in #3788

I think to fix it properly we need to get anything that depends on pulling containers to After=network-online.target - we need to introduce a new kubelet-network-online.target and make just kubelet pull that in.
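The kubelet-network-online.target idea mentioned above might look roughly like the fragment below. The target name comes from the comment; all of the wiring is my assumption, not the actual fix:

```ini
# Hypothetical /etc/systemd/system/kubelet-network-online.target:
# only kubelet waits on full network-online, so firstboot units that
# merely pull containers no longer join the network-online ordering cycle.
[Unit]
Description=Network online, gated for kubelet only
Requires=network-online.target
After=network-online.target

# A dropin for kubelet.service would then add (shown as comments since
# it is a separate file):
#   [Unit]
#   Wants=kubelet-network-online.target
#   After=kubelet-network-online.target
```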

ExecStart=/usr/local/bin/configure-ovs.sh {{.NetworkType}}
StandardOutput=journal+console
StandardError=journal+console
Restart=on-failure
RestartSec=5
Member

Polling is not great as a general rule, and I think we can avoid it here.

Contributor Author

OK

Member

I think what I'd say is that instead of polling in ExecStartPre, we change configure-ovs.sh to do polling itself.

Member

Note that configure-ovs already has a retry loop:

It's intentionally slow though to allow time to debug in between failed attempts, so it may not be a great fit for this.
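For comparison, polling inside the script, as suggested, would look roughly like the sketch below. This is illustrative only, not the actual configure-ovs.sh code; the function name, retry counts, and intervals are made up:

```shell
#!/bin/bash
# Illustrative sketch only -- not the real configure-ovs.sh. Waits for
# the firstboot marker file to disappear, instead of relying on systemd
# restarting the whole unit via Restart=on-failure.
wait_for_marker_removal() {
  local marker="$1" max_retries="${2:-60}" interval="${3:-5}" retries=0
  while [ -e "$marker" ] && [ "$retries" -lt "$max_retries" ]; do
    echo "waiting for firstboot to remove $marker (attempt $retries)"
    sleep "$interval"
    retries=$((retries + 1))
  done
  # succeed only if the marker is actually gone
  [ ! -e "$marker" ]
}

# On the real system this would be called as:
#   wait_for_marker_removal /etc/ignition-machine-config-encapsulated.json
```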

Contributor Author

OK.
I will modify the script.

Contributor Author

The firstboot unit depends on this unit. If this unit never ends, then the firstboot never starts.
Only the original implementation solves the problem (retry through systemd), unless we would like to
change the dependencies a lot more.

@ori-amizur
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 15, 2023
@cgwalters
Member

FWIW I'm ok with your original PR, but for everyone's sanity it would really help us all to untangle the systemd unit ordering. I can't say we should block on that though.

@ori-amizur
Contributor Author

/retest

@ori-amizur
Contributor Author

/unhold

@ori-amizur
Contributor Author

/test e2e-hypershift

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 15, 2023
@cgwalters
Member

FWIW I'm ok with your original PR, but for everyone's sanity it would really help us all to untangle the systemd unit ordering. I can't say we should block on that though.

I started on that in

#3865

@ori-amizur
Contributor Author

/test okd-scos-e2e-aws-ovn

@ori-amizur
Contributor Author

/test e2e-aws-ovn-upgrade

@ori-amizur
Contributor Author

FWIW I'm ok with your original PR, but for everyone's sanity it would really help us all to untangle the systemd unit ordering. I can't say we should block on that though.

I started on that in

#3865

Does it handle the problem? Currently ovs-configuration.service runs before machine-config-daemon-firstboot.service. It should run after it.

@cgwalters
Member

Does it handle the problem? Currently ovs-configuration.service runs before machine-config-daemon-firstboot.service. It should run after it.

It doesn't yet; I was trying to start with a simple, low-risk change and then make it bigger.

@openshift-ci-robot
Contributor

openshift-ci-robot commented Aug 22, 2023

@ori-amizur: This pull request references MCO-371 which is a valid jira issue.

In response to this:

The systemd service ovs-configuration.service is skipped if the file /etc/ignition-machine-config-encapsulated.json exists.
The service machine-config-daemon-firstboot.service removes the file after processing.
When we want to skip reboot, we need to verify that the service is not skipped. Therefore, these two services are moved before ovs-configuration.service.

- What I did
Moved machine-config-daemon-pull.service and machine-config-daemon-firstboot.service before ovs-configuration.service
- How to verify it

  1. Regression
  2. Verify with Assisted Installer that the installation succeeds with reboot skipped and network type is OVN
    - Description for the changelog

Move services machine-config-daemon-pull.service and machine-config-daemon-firstboot.service before ovs-configuration.service


@ori-amizur ori-amizur changed the title MCO-371: Move services machine-config-daemon-pull.service and machine-config-daemon-firstboot.service before ovs-configuration.service MCO-731: Move services machine-config-daemon-pull.service and machine-config-daemon-firstboot.service before ovs-configuration.service Aug 22, 2023
@openshift-ci-robot
Contributor

openshift-ci-robot commented Aug 22, 2023

@ori-amizur: This pull request references MCO-731 which is a valid jira issue.

In response to this:

The systemd service ovs-configuration.service is skipped if the file /etc/ignition-machine-config-encapsulated.json exists.
The service machine-config-daemon-firstboot.service removes the file after processing.
When we want to skip reboot, we need to verify that the service is not skipped. Therefore, these two services are moved before ovs-configuration.service.

- What I did
Moved machine-config-daemon-pull.service and machine-config-daemon-firstboot.service before ovs-configuration.service
- How to verify it

  1. Regression
  2. Verify with Assisted Installer that the installation succeeds with reboot skipped and network type is OVN
    - Description for the changelog

Move services machine-config-daemon-pull.service and machine-config-daemon-firstboot.service before ovs-configuration.service


@openshift-ci-robot
Contributor

openshift-ci-robot commented Aug 22, 2023

@ori-amizur: This pull request references MCO-731 which is a valid jira issue.

In response to this:

/jira refresh



@ori-amizur
Contributor Author

/refresh

@sinnykumari
Contributor

NetworkManager-clean-initrd-state.service also runs when /etc/ignition-machine-config-encapsulated.json is not present. Should we move this earlier as well?

There are a few more services that have the condition !/etc/ignition-machine-config-encapsulated.json, but they are cloud specific, and since skipping reboot will only be a case on the baremetal platform, they should be fine for now.

@ori-amizur
Contributor Author

NetworkManager-clean-initrd-state.service also runs when /etc/ignition-machine-config-encapsulated.json is not present. Should we move this earlier as well?

There are a few more services that have the condition !/etc/ignition-machine-config-encapsulated.json, but they are cloud specific, and since skipping reboot will only be a case on the baremetal platform, they should be fine for now.

It seems to me, if I am not mistaken, that this service cleans up leftovers from a previous run of NetworkManager. But since it never ran before, there are no leftovers.

@sinnykumari
Contributor

It seems to me, if I am not mistaken, that this service cleans up leftovers from a previous run of NetworkManager. But since it never ran before, there are no leftovers.

It says "Cleans NetworkManager state generated by dracut", and this should be the case during initial node bootstrap as well, shouldn't it?

@ori-amizur
Contributor Author

It says "Cleans NetworkManager state generated by dracut", and this should be the case during initial node bootstrap as well, shouldn't it?

I think so.
Anyway, if it performs cleanup before NetworkManager is invoked, maybe it can be invoked unconditionally?

@sinnykumari
Contributor

/cc @jcaamano thoughts on above?

@sinnykumari
Contributor

/cc @jcaamano

@openshift-ci openshift-ci bot requested a review from jcaamano September 7, 2023 12:27
@jcaamano
Contributor

jcaamano commented Sep 7, 2023

I don't think there should be a problem running this unconditionally. It would be nice to have a test though with custom network config in the ignition payload.

I am curious though, is skipping this reboot optional? If not, are we not going to support kargs from the ignition payload?

@ori-amizur
Contributor Author

I am curious though, is skipping this reboot optional? If not, are we not going to support kargs from the ignition payload?

There is another PR handling kargs from /proc/cmdline. #3856

@jcaamano
Contributor

jcaamano commented Sep 7, 2023

I don't think there should be a problem running this unconditionally. It would be nice to have a test though with custom network config in the ignition payload.

I am curious though, is skipping this reboot optional? If not, are we not going to support kargs from the ignition payload?

Oh there is one possible improvement.

On firstboot, coreos-teardown-initramfs.sh copies non-default NM profiles from /run/NetworkManager/system-connections to /etc/NetworkManager/system-connections. So if this other NM state cleanup service also runs on first boot, it might not realize that the profile a device is activated with exists in both places, and so is not ephemeral, and might de-activate the device unnecessarily.

This might not be an issue, as the device would just be activated again when NM runs, and because we only do that if the profiles were generated by nm-initrd-generator, while in the case I described above these profiles would be user generated.

But we should be extra cautious and check whether the profile a device is activated with is actually present in /etc/NetworkManager/system-connections and is the only one targeting that device, and avoid de-activating it in that case.
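A defensive check along the lines suggested here might look like the sketch below. The function name and layout are hypothetical, not from the MCO or NetworkManager-clean-initrd-state code:

```shell
#!/bin/bash
# Hypothetical helper: before de-activating a device, check whether the
# profile it is activated with is also persisted on disk (i.e. was copied
# to /etc/NetworkManager/system-connections and is therefore not an
# ephemeral initrd-generated profile).
profile_is_persisted() {
  local profile="$1"
  local dir="${2:-/etc/NetworkManager/system-connections}"
  # NM keyfiles normally carry a .nmconnection suffix; accept both forms
  [ -e "$dir/$profile.nmconnection" ] || [ -e "$dir/$profile" ]
}
```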

@vrutkovs
Member

/test okd-e2e-aws-ovn

@openshift-ci
Contributor

openshift-ci bot commented Sep 11, 2023

@vrutkovs: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test 4.12-upgrade-from-stable-4.11-images
  • /test cluster-bootimages
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-upgrade
  • /test e2e-gcp-op
  • /test e2e-gcp-op-single-node
  • /test e2e-hypershift
  • /test images
  • /test okd-scos-images
  • /test unit
  • /test verify

The following commands are available to trigger optional jobs:

  • /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade
  • /test bootstrap-unit
  • /test e2e-alibabacloud-ovn
  • /test e2e-aws-disruptive
  • /test e2e-aws-ovn-fips
  • /test e2e-aws-ovn-fips-op
  • /test e2e-aws-ovn-workers-rhel8
  • /test e2e-aws-proxy
  • /test e2e-aws-serial
  • /test e2e-aws-single-node
  • /test e2e-aws-upgrade-single-node
  • /test e2e-aws-workers-rhel8
  • /test e2e-azure
  • /test e2e-azure-ovn-upgrade
  • /test e2e-azure-upgrade
  • /test e2e-gcp-ovn-rt-upgrade
  • /test e2e-gcp-rt
  • /test e2e-gcp-rt-op
  • /test e2e-gcp-single-node
  • /test e2e-gcp-upgrade
  • /test e2e-metal-assisted
  • /test e2e-metal-ipi
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-openstack
  • /test e2e-openstack-dualstack-techpreview
  • /test e2e-openstack-externallb
  • /test e2e-openstack-parallel
  • /test e2e-ovirt
  • /test e2e-ovirt-upgrade
  • /test e2e-ovn-step-registry
  • /test e2e-vsphere
  • /test e2e-vsphere-upgrade
  • /test e2e-vsphere-upi
  • /test e2e-vsphere-upi-zones
  • /test e2e-vsphere-zones
  • /test okd-e2e-aws
  • /test okd-e2e-gcp-op
  • /test okd-e2e-upgrade
  • /test okd-e2e-vsphere
  • /test okd-images
  • /test okd-scos-e2e-aws-ovn
  • /test okd-scos-e2e-gcp-op
  • /test okd-scos-e2e-gcp-ovn-upgrade
  • /test okd-scos-e2e-vsphere

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-machine-config-operator-master-bootstrap-unit
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-single-node
  • pull-ci-openshift-machine-config-operator-master-e2e-hypershift
  • pull-ci-openshift-machine-config-operator-master-images
  • pull-ci-openshift-machine-config-operator-master-okd-images
  • pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-okd-scos-images
  • pull-ci-openshift-machine-config-operator-master-unit
  • pull-ci-openshift-machine-config-operator-master-verify

In response to this:

/test okd-e2e-aws-ovn


@openshift-merge-robot openshift-merge-robot merged commit 1b54cbc into openshift:master Sep 18, 2023
@jcaamano
Contributor

I don't think there should be a problem running this unconditionally. It would be nice to have a test though with custom network config in the ignition payload.
I am curious though, is skipping this reboot optional? If not, are we not going to support kargs from the ignition payload?

Oh there is one possible improvement.

On firstboot, coreos-teardown-initramfs.sh copies non-default NM profiles from /run/NetworkManager/system-connections to /etc/NetworkManager/system-connections. So if this other NM state cleanup service also runs on first boot, it might not realize that the profile a device is activated with exists in both places, and so is not ephemeral, and might de-activate the device unnecessarily.

This might not be an issue, as the device would just be activated again when NM runs, and because we only do that if the profiles were generated by nm-initrd-generator, while in the case I described above these profiles would be user generated.

But we should be extra cautious and check whether the profile a device is activated with is actually present in /etc/NetworkManager/system-connections and is the only one targeting that device, and avoid de-activating it in that case.

@ori-amizur I did not realize this was already on the merge queue.

Another comment I had is that we might need to rely on NetworkManager-clean-initrd-state.service for all platforms if the ignition payload is going to carry network customizations while reboot is skipped. If that's the case, do you guys plan on any testing of such a scenario?

@ori-amizur
Contributor Author

Another comment I had is that we might need to rely on NetworkManager-clean-initrd-state.service for all platforms if the ignition payload is going to carry network customizations while reboot is skipped. If that's the case, do you guys plan on any testing of such a scenario?

I think we will need to take these concerns into account while testing this feature.

@cgwalters
Member

Yeah we'll probably just need to remove all conditionals that trigger on the firstboot ignition config.

yuqi-zhang added a commit to yuqi-zhang/machine-config-operator that referenced this pull request Jun 21, 2024
This was originally reordered such that ovs-configuration was able to
run when the reboot was skipped. See: openshift#3858

This broke ARO since they require the network to be ready for the pull
to happen (and generally, probably best for the network to be ready
before attempting to pull the new OS image).

Since the services have changed since then, ovs-configuration no longer
depends on the existence of the firstboot file, so we should be able to
untangle this dependency.
yuqi-zhang added a commit to yuqi-zhang/machine-config-operator that referenced this pull request Jun 21, 2024
This was originally reordered such that ovs-configuration was able to
run when the reboot was skipped. See: openshift#3858

This broke ARO since they require the network to be ready for the pull
to happen (and generally, probably best for the network to be ready
before attempting to pull the new OS image).

This fix aims to apply just to Azure, which is not ideal since it drifts
the service definition, but if don't-reboot-on-metal and ARO have
different dependency chains, this is the easiest middle ground.
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/machine-config-operator that referenced this pull request Jul 4, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/machine-config-operator that referenced this pull request Jul 4, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/machine-config-operator that referenced this pull request Jul 4, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/machine-config-operator that referenced this pull request Jul 16, 2024
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
8 participants