
MCO-731: Move services machine-config-daemon-pull.service and machine-config-daemon-firstboot.service before ovs-configuration.service #3858

Merged (1 commit, Sep 18, 2023)

Conversation

ori-amizur
Contributor

@ori-amizur ori-amizur commented Aug 15, 2023

The systemd service ovs-configuration.service is skipped if the file /etc/ignition-machine-config-encapsulated.json exists.
The service machine-config-daemon-firstboot.service removes the file after processing.
When we want to skip reboot, we need to verify that the service is not skipped. Therefore, these two services are moved before ovs-configuration.service.

- What I did
Moved machine-config-daemon-pull.service and machine-config-daemon-firstboot.service before ovs-configuration.service
- How to verify it

  1. Regression
  2. Verify with Assisted Installer that the installation succeeds with reboot skipped and network type is OVN
    - Description for the changelog

Move services machine-config-daemon-pull.service and machine-config-daemon-firstboot.service before ovs-configuration.service
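One plausible reading of the intended ordering, sketched as a systemd unit fragment. The unit names and the ConditionPathExists line are from this PR; the exact stanza layout below is an assumption, not the literal template:

```ini
[Unit]
Description=Configures OVS with proper host networking configuration
# machine-config-daemon-firstboot.service deletes
# /etc/ignition-machine-config-encapsulated.json when it finishes, so
# ordering after it lets the condition below pass when reboot is skipped.
After=machine-config-daemon-pull.service machine-config-daemon-firstboot.service
ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json
```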

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 15, 2023
@openshift-ci-robot
Contributor

openshift-ci-robot commented Aug 15, 2023

@ori-amizur: This pull request references MCO-731 which is a valid jira issue.

In response to this:

The systemd service ovs-configuration.service is skipped if the file /etc/ignition-machine-config-encapsulated.json exists. The reason is that there is an assumption that reboot will be done if the file exists. When we want to skip reboot, we need to verify that the service is not skipped. Therefore, the service will retry to configure until the file does not exist.

- What I did
Changed unit file to retry when the file exists
- How to verify it

  1. Regression
  2. Verify with Assisted Installer that the installation succeeds with reboot skipped and network type is OVN
    - Description for the changelog

Retry ovs-configuration.service if file /etc/ignition-machine-config-encapsulated.json exists

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ori-amizur
Contributor Author

/assign @sinnykumari

@@ -3,8 +3,6 @@ enabled: {{if eq .NetworkType "OVNKubernetes" "OpenShiftSDN"}}true{{else}}false{
contents: |
[Unit]
Description=Configures OVS with proper host networking configuration
# Removal of this file signals firstboot completion
ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json
Member

I think we can keep the condition here, but instead just add After=machine-config-daemon-firstboot.service.

Contributor Author

@ori-amizur ori-amizur Aug 15, 2023

Doesn't seem to work

Contributor Author

Tried to implement. Got the following errors:

Aug 15 13:55:53 test-infra-cluster-087a552f-master-2 systemd[1]: machine-config-daemon-firstboot.service: Found ordering cycle on machine-config-daemon-pull.service/start
Aug 15 13:55:53 test-infra-cluster-087a552f-master-2 systemd[1]: machine-config-daemon-firstboot.service: Found dependency on network-online.target/start
Aug 15 13:55:53 test-infra-cluster-087a552f-master-2 systemd[1]: machine-config-daemon-firstboot.service: Found dependency on ovs-configuration.service/start
Aug 15 13:55:53 test-infra-cluster-087a552f-master-2 systemd[1]: machine-config-daemon-firstboot.service: Found dependency on machine-config-daemon-firstboot.service/start
Aug 15 13:55:53 test-infra-cluster-087a552f-master-2 systemd[1]: machine-config-daemon-firstboot.service: Job machine-config-daemon-pull.service/start deleted to break ordering cycle starting with machine-config-daemon-firstboot.service/start
Aug 15 13:56:04 test-infra-cluster-087a552f-master-2 systemd[1]: Configures OVS with proper host networking configuration was skipped because of an unmet condition check (ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json).

Any idea how to address these errors? What is the right ordering for the services?

Member

Ugh. Right. Our systemd units are just a huge mess in this respect; I hit similar network ordering issues in #3788

I think to fix it properly we need to get anything that depends on pulling containers to After=network-online.target - we need to introduce a new kubelet-network-online.target and make just kubelet pull that in.
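The kubelet-network-online.target idea mentioned above might look roughly like the fragment below. The target name comes from the comment; all of the wiring is my assumption, not the actual fix:

```ini
# Hypothetical /etc/systemd/system/kubelet-network-online.target:
# only kubelet waits on full network-online, so firstboot units that
# merely pull containers no longer join the network-online ordering cycle.
[Unit]
Description=Network online, gated for kubelet only
Requires=network-online.target
After=network-online.target

# A dropin for kubelet.service would then add (shown as comments since
# it is a separate file):
#   [Unit]
#   Wants=kubelet-network-online.target
#   After=kubelet-network-online.target
```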

ExecStart=/usr/local/bin/configure-ovs.sh {{.NetworkType}}
StandardOutput=journal+console
StandardError=journal+console
Restart=on-failure
RestartSec=5
Member

Polling is not great as a general rule, and I think we can avoid it here.

Contributor Author

OK

Member

I think what I'd say is that instead of polling in ExecStartPre, we change configure-ovs.sh to do polling itself.

Member

Note that configure-ovs already has a retry loop:

It's intentionally slow though to allow time to debug in between failed attempts, so it may not be a great fit for this.
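For comparison, polling inside the script, as suggested, would look roughly like the sketch below. This is illustrative only, not the actual configure-ovs.sh code; the function name, retry counts, and intervals are made up:

```shell
#!/bin/bash
# Illustrative sketch only -- not the real configure-ovs.sh. Waits for
# the firstboot marker file to disappear, instead of relying on systemd
# restarting the whole unit via Restart=on-failure.
wait_for_marker_removal() {
  local marker="$1" max_retries="${2:-60}" interval="${3:-5}" retries=0
  while [ -e "$marker" ] && [ "$retries" -lt "$max_retries" ]; do
    echo "waiting for firstboot to remove $marker (attempt $retries)"
    sleep "$interval"
    retries=$((retries + 1))
  done
  # succeed only if the marker is actually gone
  [ ! -e "$marker" ]
}

# On the real system this would be called as:
#   wait_for_marker_removal /etc/ignition-machine-config-encapsulated.json
```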

Contributor Author

OK.
I will modify the script.

Contributor Author

The firstboot unit depends on this unit. If this unit never ends, then the firstboot never starts.
Only the original implementation solves the problem (retry through systemd), unless we would like to
change the dependencies a lot more.

@ori-amizur
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 15, 2023
@cgwalters
Member

FWIW I'm ok with your original PR, but for everyone's sanity it would really help us all to untangle the systemd unit ordering. I can't say we should block on that though.

@ori-amizur
Contributor Author

/retest

@ori-amizur
Contributor Author

/unhold

@ori-amizur
Contributor Author

/test e2e-hypershift

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 15, 2023
@cgwalters
Member

FWIW I'm ok with your original PR, but for everyone's sanity it would really help us all to untangle the systemd unit ordering. I can't say we should block on that though.

I started on that in

#3865

@ori-amizur
Contributor Author

/test okd-scos-e2e-aws-ovn

@ori-amizur
Contributor Author

/test e2e-aws-ovn-upgrade

@ori-amizur
Contributor Author

FWIW I'm ok with your original PR, but for everyone's sanity it would really help us all to untangle the systemd unit ordering. I can't say we should block on that though.

I started on that in

#3865

Does it handle the problem? Currently ovs-configuration.service runs before machine-config-daemon-firstboot.service. It should run after it.

@cgwalters
Member

Does it handle the problem? Currently ovs-configuration.service runs before machine-config-daemon-firstboot.service. It should run after it.

It doesn't yet; I was trying to start with a simple, low-risk change and then make it bigger.

@openshift-ci-robot
Contributor

openshift-ci-robot commented Aug 22, 2023

@ori-amizur: This pull request references MCO-371 which is a valid jira issue.

In response to this:

The systemd service ovs-configuration.service is skipped if the file /etc/ignition-machine-config-encapsulated.json exists.
The service machine-config-daemon-firstboot.service removes the file after processing.
When we want to skip reboot, we need to verify that the service is not skipped. Therefore, these two services are moved before ovs-configuration.service.

- What I did
Moved machine-config-daemon-pull.service and machine-config-daemon-firstboot.service before ovs-configuration.service
- How to verify it

  1. Regression
  2. Verify with Assisted Installer that the installation succeeds with reboot skipped and network type is OVN
    - Description for the changelog

Move services machine-config-daemon-pull.service and machine-config-daemon-firstboot.service before ovs-configuration.service


@ori-amizur ori-amizur changed the title MCO-371: Move services machine-config-daemon-pull.service and machine-config-daemon-firstboot.service before ovs-configuration.service MCO-731: Move services machine-config-daemon-pull.service and machine-config-daemon-firstboot.service before ovs-configuration.service Aug 22, 2023
@openshift-ci-robot
Contributor

openshift-ci-robot commented Aug 22, 2023

@ori-amizur: This pull request references MCO-731 which is a valid jira issue.

In response to this:

The systemd service ovs-configuration.service is skipped if the file /etc/ignition-machine-config-encapsulated.json exists.
The service machine-config-daemon-firstboot.service removes the file after processing.
When we want to skip reboot, we need to verify that the service is not skipped. Therefore, these two services are moved before ovs-configuration.service.

- What I did
Moved machine-config-daemon-pull.service and machine-config-daemon-firstboot.service before ovs-configuration.service
- How to verify it

  1. Regression
  2. Verify with Assisted Installer that the installation succeeds with reboot skipped and network type is OVN
    - Description for the changelog

Move services machine-config-daemon-pull.service and machine-config-daemon-firstboot.service before ovs-configuration.service


@openshift-ci-robot
Contributor

openshift-ci-robot commented Aug 22, 2023

@ori-amizur: This pull request references MCO-731 which is a valid jira issue.

In response to this:

/jira refresh



@ori-amizur
Contributor Author

/refresh

@sinnykumari
Contributor

NetworkManager-clean-initrd-state.service also runs when /etc/ignition-machine-config-encapsulated.json is not present. Should we move this earlier as well?

There are a few more services that have the condition !/etc/ignition-machine-config-encapsulated.json, but they are cloud specific, and since skipping reboot will only be a case on the baremetal platform, they should be fine for now.

@ori-amizur
Contributor Author

NetworkManager-clean-initrd-state.service also runs when /etc/ignition-machine-config-encapsulated.json is not present. Should we move this earlier as well?

There are a few more services that have the condition !/etc/ignition-machine-config-encapsulated.json, but they are cloud specific, and since skipping reboot will only be a case on the baremetal platform, they should be fine for now.

It seems to me, if I am not mistaken, that this service cleans up leftovers from a previous run of NetworkManager. But since it never ran before, there are no leftovers.

@sinnykumari
Contributor

It seems to me, if I am not mistaken, that this service cleans up leftovers from a previous run of NetworkManager. But since it never ran before, there are no leftovers.

It says "Cleans NetworkManager state generated by dracut", and this should be the case during initial node bootstrap as well, shouldn't it?

@ori-amizur
Contributor Author

It says "Cleans NetworkManager state generated by dracut", and this should be the case during initial node bootstrap as well, shouldn't it?

I think so.
Anyway, if it performs cleanup before NetworkManager is invoked, maybe it can be invoked unconditionally?

@sinnykumari
Contributor

/cc @jcaamano thoughts on above?

@sinnykumari
Contributor

/cc @jcaamano

@openshift-ci openshift-ci bot requested a review from jcaamano September 7, 2023 12:27
@jcaamano
Contributor

jcaamano commented Sep 7, 2023

I don't think there should be a problem running this unconditionally. It would be nice to have a test though with custom network config in the ignition payload.

I am curious though, is skipping this reboot optional? If not, are we not going to support kargs from the ignition payload?

@ori-amizur
Contributor Author

I am curious though, is skipping this reboot optional? If not, are we not going to support kargs from the ignition payload?

There is another PR handling kargs from /proc/cmdline. #3856

@jcaamano
Contributor

jcaamano commented Sep 7, 2023

I don't think there should be a problem running this unconditionally. It would be nice to have a test though with custom network config in the ignition payload.

I am curious though, is skipping this reboot optional? If not, are we not going to support kargs from the ignition payload?

Oh there is one possible improvement.

On firstboot, coreos-teardown-initramfs.sh copies non-default NM profiles from /run/NetworkManager/system-connections to /etc/NetworkManager/system-connections. So if this other NM state cleanup service also runs on first boot, it might not realize that the profile a device is activated with exists in both places, and so is not ephemeral, and might de-activate the device unnecessarily.

This might not be an issue, as the device would just be activated again when NM runs, and because we only do that if the profiles were generated by nm-initrd-generator, while in the case I described above these profiles would be user generated.

But we should be extra cautious and check whether the profile a device is activated with is actually present in /etc/NetworkManager/system-connections and is the only one targeting that device, and avoid de-activating it in that case.
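A defensive check along the lines suggested here might look like the sketch below. The function name and layout are hypothetical, not from the MCO or NetworkManager-clean-initrd-state code:

```shell
#!/bin/bash
# Hypothetical helper: before de-activating a device, check whether the
# profile it is activated with is also persisted on disk (i.e. was copied
# to /etc/NetworkManager/system-connections and is therefore not an
# ephemeral initrd-generated profile).
profile_is_persisted() {
  local profile="$1"
  local dir="${2:-/etc/NetworkManager/system-connections}"
  # NM keyfiles normally carry a .nmconnection suffix; accept both forms
  [ -e "$dir/$profile.nmconnection" ] || [ -e "$dir/$profile" ]
}
```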

@vrutkovs
Member

/test okd-e2e-aws-ovn

@openshift-ci
Contributor

openshift-ci bot commented Sep 11, 2023

@vrutkovs: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test 4.12-upgrade-from-stable-4.11-images
  • /test cluster-bootimages
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-upgrade
  • /test e2e-gcp-op
  • /test e2e-gcp-op-single-node
  • /test e2e-hypershift
  • /test images
  • /test okd-scos-images
  • /test unit
  • /test verify

The following commands are available to trigger optional jobs:

  • /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade
  • /test bootstrap-unit
  • /test e2e-alibabacloud-ovn
  • /test e2e-aws-disruptive
  • /test e2e-aws-ovn-fips
  • /test e2e-aws-ovn-fips-op
  • /test e2e-aws-ovn-workers-rhel8
  • /test e2e-aws-proxy
  • /test e2e-aws-serial
  • /test e2e-aws-single-node
  • /test e2e-aws-upgrade-single-node
  • /test e2e-aws-workers-rhel8
  • /test e2e-azure
  • /test e2e-azure-ovn-upgrade
  • /test e2e-azure-upgrade
  • /test e2e-gcp-ovn-rt-upgrade
  • /test e2e-gcp-rt
  • /test e2e-gcp-rt-op
  • /test e2e-gcp-single-node
  • /test e2e-gcp-upgrade
  • /test e2e-metal-assisted
  • /test e2e-metal-ipi
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-openstack
  • /test e2e-openstack-dualstack-techpreview
  • /test e2e-openstack-externallb
  • /test e2e-openstack-parallel
  • /test e2e-ovirt
  • /test e2e-ovirt-upgrade
  • /test e2e-ovn-step-registry
  • /test e2e-vsphere
  • /test e2e-vsphere-upgrade
  • /test e2e-vsphere-upi
  • /test e2e-vsphere-upi-zones
  • /test e2e-vsphere-zones
  • /test okd-e2e-aws
  • /test okd-e2e-gcp-op
  • /test okd-e2e-upgrade
  • /test okd-e2e-vsphere
  • /test okd-images
  • /test okd-scos-e2e-aws-ovn
  • /test okd-scos-e2e-gcp-op
  • /test okd-scos-e2e-gcp-ovn-upgrade
  • /test okd-scos-e2e-vsphere

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-machine-config-operator-master-bootstrap-unit
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-single-node
  • pull-ci-openshift-machine-config-operator-master-e2e-hypershift
  • pull-ci-openshift-machine-config-operator-master-images
  • pull-ci-openshift-machine-config-operator-master-okd-images
  • pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-okd-scos-images
  • pull-ci-openshift-machine-config-operator-master-unit
  • pull-ci-openshift-machine-config-operator-master-verify

In response to this:

/test okd-e2e-aws-ovn


@openshift-merge-robot openshift-merge-robot merged commit 1b54cbc into openshift:master Sep 18, 2023
@jcaamano
Contributor

I don't think there should be a problem running this unconditionally. It would be nice to have a test though with custom network config in the ignition payload.
I am curious though, is skipping this reboot optional? If not, are we not going to support kargs from the ignition payload?

Oh there is one possible improvement.

On firstboot, coreos-teardown-initramfs.sh copies non-default NM profiles from /run/NetworkManager/system-connections to /etc/NetworkManager/system-connections. So if this other NM state cleanup service also runs on first boot, it might not realize that the profile a device is activated with exists in both places, and so is not ephemeral, and might de-activate the device unnecessarily.

This might not be an issue, as the device would just be activated again when NM runs, and because we only do that if the profiles were generated by nm-initrd-generator, while in the case I described above these profiles would be user generated.

But we should be extra cautious and check whether the profile a device is activated with is actually present in /etc/NetworkManager/system-connections and is the only one targeting that device, and avoid de-activating it in that case.

@ori-amizur I did not realize this was already on the merge queue.

Another comment I had is that we might need to rely on NetworkManager-clean-initrd-state.service for all platforms if the ignition payload is going to carry network customizations while reboot is skipped. If that's the case, do you guys plan on any testing of such a scenario?

@ori-amizur
Contributor Author

Another comment I had is that we might need to rely on NetworkManager-clean-initrd-state.service for all platforms if the ignition payload is going to carry network customizations while reboot is skipped. If that's the case, do you guys plan on any testing of such a scenario?

I think we will need to take these concerns into account while testing this feature.

@cgwalters
Member

Yeah we'll probably just need to remove all conditionals that trigger on the firstboot ignition config.

yuqi-zhang added a commit to yuqi-zhang/machine-config-operator that referenced this pull request Jun 21, 2024
This was originally reordered such that ovs-configuration was able to
run when the reboot was skipped. See: openshift#3858

This broke ARO since they require the network to be ready for the pull
to happen (and generally, probably best for the network to be ready
before attempting to pull the new OS image).

Since the services have changed since then, ovs-configuration no longer
depends on the existence of the firstboot file, so we should be able to
untangle this dependency.
yuqi-zhang added a commit to yuqi-zhang/machine-config-operator that referenced this pull request Jun 21, 2024
This was originally reordered such that ovs-configuration was able to
run when the reboot was skipped. See: openshift#3858

This broke ARO since they require the network to be ready for the pull
to happen (and generally, probably best for the network to be ready
before attempting to pull the new OS image).

This fix aims to apply just to Azure, which is not ideal since it drifts
the service definition, but if don't-reboot-on-metal and ARO have
different dependency chains, this is the easiest middle ground.
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/machine-config-operator that referenced this pull request Jul 4, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/machine-config-operator that referenced this pull request Jul 4, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/machine-config-operator that referenced this pull request Jul 4, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/machine-config-operator that referenced this pull request Jul 16, 2024
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
8 participants