Skip to content
This repository has been archived by the owner on Aug 25, 2021. It is now read-only.

ignition-udev-service: add support for root on multipath #183

Closed
wants to merge 2 commits into from
Closed

ignition-udev-service: add support for root on multipath #183

wants to merge 2 commits into from

Conversation

darkmuggle
Copy link
Contributor

This provides support for multipath without having to re-order
everything or provide a new target. We need to trigger udev to update
the device symlinks in /dev/disk/by-{id,label,uuid} for multipath
devices after multipathd has run.

This also reorders some services to run in "RO" before the multipathd target
is setup and then moves the "RW" after device-mapper targets have been
found.

Signed-off-by: Ben Howard [email protected]

@darkmuggle darkmuggle requested a review from jlebon May 8, 2020 22:35
@darkmuggle
Copy link
Contributor Author

@jlebon This more or less is what you suggested and seems to work. It breaks the Fedora CoreOS Config Growpart code, but I can look at fixing that a bit later.

This removes the need for #179

Copy link
Member

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No objections to merging mostly as is but some comments:

After=ignition-setup-base.service

# Wait for multipathd
Wants=multipathd.service
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing related to this...it'd be nice if we only started multipathd.service if it was actually in use; today we start it unconditionally in the initramfs and it just spews an error and then nothing happens. (And the same with iscsi)

IOW I think this should only be started if there's a dm.multipath=1 karg or something like that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah but I guess this whole service is only started if something like multipath is in use?

If that's the case, bikeshed naming: ignition-diskful-complex-blockdev.service ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the latest push is a bit better in this regard.

Type=oneshot
RemainAfterExit=yes
# Have to shell out since we need shell expansion of the *
ExecStart=-/bin/sh -c '/usr/bin/udevadm trigger -w -v /dev/dm*'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what I find confusing here is I'd really expect there to be an explicit command for this somewhere else. Is there not similar code in dracut or multipathd?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's my confusion as well. I had hoped that we'd get this for free and all we really had to do was just make sure that the units we have which need ro access to /boot run early enough that they can't interfere with multipath rewiring the symlinks. Poking around on this to see how it happens on traditional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The multipath systemd unit triggers systemd-udev-trigger. The problem we have is we need to wait until the symlinks are updated (which won't happen if the the disk is mounted) and during testing we had a racing condition. That's why I have this ugly workaround.

dracut/30ignition/module-setup.sh Outdated Show resolved Hide resolved
@cgwalters
Copy link
Member

My understanding is that on most traditional systems that run dracut on the client side, what ends up happening often is that the multipath code in the initramfs is omitted because multipathd isn't installed. But since we have it installed on the server side it's always enabled - hence we have a need to change "enabled because the binary exists" to "enabled via kernel argument" or so.

@jlebon
Copy link
Member

jlebon commented May 11, 2020

How are you testing this BTW? Are you just using e.g. kola spawn --qemu-multipath?

@jlebon
Copy link
Member

jlebon commented May 11, 2020

Hmm, so I added rd.multipath=default to the kargs list in image.yaml for testing and built an FCOS with this PR locally, then spawned with cosa run --qemu-multipath -c, but I still don't see the root filesystem on multipath. Am I missing another key bit here for testing this?

Ben Howard added 2 commits May 11, 2020 16:33
This provides support for multipath without having to re-order
everything or provide a new target. We need to trigger udev to update
the device symlinks in `/dev/disk/by-{id,label,uuid}` for multipath
devices after multipathd has run.

This reorders some services to run in "RO" before the multipathd target
is setup and then moves the "RW" after device-mapper targets have been
found.

Signed-off-by: Ben Howard <[email protected]>
Rename coreos-teardown-initramfs-network.* to
coreos-teardown-initramfs.* and add function to propigating automatic
configuration through to boot.
@darkmuggle
Copy link
Contributor Author

darkmuggle commented May 11, 2020

Hmm, so I added rd.multipath=default to the kargs list in image.yaml for testing and built an FCOS with this PR locally, then spawned with cosa run --qemu-multipath -c, but I still don't see the root filesystem on multipath. Am I missing another key bit here for testing this?

You'll need to test with upsteam dracut. I'm waiting on cranking the upstream dracut in FCOS for this to land first.

@darkmuggle
Copy link
Contributor Author

darkmuggle commented May 11, 2020

The latest push works but fails on a few things ONLY when Multipath is enabled; it does not affect single pathed systems.

[systemd]
Failed Units: 3
  rpm-ostreed.service
  systemd-fsck@dev-disk-by\x2dlabel-boot.service
  systemd-fsck@dev-disk-by\x2dlabel-EFI\x2dSYSTEM.service

(which will be addressed in another PR)

@jlebon jlebon self-assigned this May 12, 2020
@jlebon
Copy link
Member

jlebon commented May 12, 2020

OK, been playing with the multipath patches a lot now.

So to recap and reword, there are two issues:

  1. We need to better order things that read /boot read-only wrt to multipathd.
  2. We need to wait until multipathd did its pathing before trying to mount /sysroot.

Fixing 1 is pretty simple. This works:

diff --git a/dracut/30ignition/ignition-setup-user.service b/dracut/30ignition/ignition-setup-user.service
index 17ec3c4..85fb519 100644
--- a/dracut/30ignition/ignition-setup-user.service
+++ b/dracut/30ignition/ignition-setup-user.service
@@ -9,6 +9,10 @@ Before=local-fs-pre.target
 Before=ignition-disks.service
 Before=ignition-files.service

+# We want to make sure we're not racing with multipath taking ownership of the
+# boot device.
+Before=multipathd.service
+
 # On diskful boots, ignition-generator adds Requires/After on
 # dev-disk-by\x2dlabel-boot.device.
diff --git a/overlay.d/05core/usr/lib/dracut/modules.d/15coreos-firstboot-network/coreos-copy-firstboot-network.service b/overlay.d/05cor
e/usr/lib/dracut/modules.d/15coreos-firstboot-network/coreos-copy-firstboot-network.service
index 6e2bf37..4f05ca3 100644
--- a/overlay.d/05core/usr/lib/dracut/modules.d/15coreos-firstboot-network/coreos-copy-firstboot-network.service
+++ b/overlay.d/05core/usr/lib/dracut/modules.d/15coreos-firstboot-network/coreos-copy-firstboot-network.service
@@ -36,6 +36,10 @@ After=dracut-cmdline.service
 Requires=dev-disk-by\x2dlabel-boot.device
 After=dev-disk-by\x2dlabel-boot.device

+# We want to make sure we're not racing with multipath taking ownership of the
+# boot device.
+Before=multipathd.service
+
 [Service]
 Type=oneshot
 RemainAfterExit=yes

Fixing 2 is more tricky. When ignition-ostree-mount-firstboot-sysroot.service runs, it's able to race with multipathd playing with the symlinks. It's easy to see this with the following patch:

diff --git a/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-mount-firstboot-sysroot.service b/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-mount-firstboot-sysroot.service
index 3ba677d..e7c05ae 100644
--- a/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-mount-firstboot-sysroot.service
+++ b/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-mount-firstboot-sysroot.service
@@ -22,4 +22,5 @@ Before=ostree-prepare-root.service ignition-remount-sysroot.service
 [Service]
 Type=oneshot
 RemainAfterExit=yes
+ExecStart=/usr/bin/sh -xc "for i in {0..10}; do readlink -v /dev/disk/by-label/root; sleep 0.5; done"
 ExecStart=/usr/sbin/ignition-ostree-mount-sysroot

Then, looking at the output:

[    5.837722] systemd[1]: Starting Ignition OSTree: Mount (firstboot) /sysroot...
[    5.844059] sh[619]: + for i in {0..10}
[    5.854042] sh[619]: + echo 0
[    5.865356] sh[619]: 0
[    5.921826] sh[619]: + readlink -v /dev/disk/by-label/root
[    5.924428] sh[619]: ../../sda4
[    5.927071] sh[619]: + sleep 0.5
[    6.390959] sh[619]: + for i in {0..10}
[    6.397104] sh[619]: + echo 1
[    6.402602] sh[619]: 1
[    6.440865] sh[619]: + readlink -v /dev/disk/by-label/root
[    6.443159] sh[619]: ../../dm-4
[    6.444700] sh[619]: + sleep 0.5
...

The surface issue is that multipathd.service is considered "started" by systemd well before pathing has been set up. So a simple After=multipathd.service doesn't work. This is by design though, since the whole point of multipath is to react to changing availability conditions.

The suggested service here (ignition-diskful-complex-mpath.service) doesn't close that race for me (and IIUC, I think the general issue isn't just that we need to wait for udev events to be processed, but that multipathing itself may take time to emit the events in the first place).

There's basically no easy race-free way to determine if /dev/disk/by-label/root is ready or if we need to wait for multipathing to occur (I played with multipath -c $(realpath /dev/disk/by-label/root) a lot, but in the end, that races with multipathd getting set up).

I also played with this:

diff --git a/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-mount-sysroot.sh b/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-mount-sysroot.sh
index 047ba2d..7ddce38 100755
--- a/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-mount-sysroot.sh
+++ b/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-mount-sysroot.sh
@@ -12,6 +12,16 @@ if ! [ -b "${rootpath}" ]; then
   echo "ignition-ostree-mount-sysroot: Failed to find ${rootpath}" 1>&2
   exit 1
 fi
+
+if [ -f /etc/multipath.conf ]; then
+  if ! multipathd show status | grep -qE '^paths: 0$'; then
+    while ! ls /dev/mapper/mpath? &>/dev/null; do
+      echo "waiting for multipathed devices to finish set up"
+      sleep 1
+    done
+  fi
+fi
+
 eval $(blkid -o export ${rootpath})
 mountflags=
 if [ "${TYPE}" == "xfs" ]; then

It works, but it's not great. Notably, it means we wait for multipathd even if it's to set up a mount point like /var/lib/containers or whatever and not the rootfs. I think it's also theoretically possible for it to race with multipathd initializing.

My vote now is to simply use an explicit root= karg for this. I.e. users have to pass both e.g. rd.multipath=default and root=/dev/mapper/mpath.... Then there's no ambiguity, we just need to wait for the devmapper device to show up. And actually doing this already Just Works right now because RHCOS used this too, so our sysroot mount units know to get out of the way and let systemd do the mount when the multipathed device shows up.

Using kargs in general of course is tricky because we don't support firstboot kargs yet in the cloud case. I think RAID is more relevant to the cloud than multipath, and we'll hit up against the same issues there, so we'd definitely have to address it eventually if we go the kargs path for that too (and LUKS too once we move that to Ignition).

Thoughts?

@jlebon
Copy link
Member

jlebon commented May 12, 2020

To clarify, all these together give me working multipath (modulo the growpart service + those in #183 (comment) which will need to be adapted):

@darkmuggle
Copy link
Contributor Author

darkmuggle commented May 13, 2020

@jlebon thanks you for playing with this -- and I'll dig in on your comments tomorrow.

coreos/fedora-coreos-config#392 is the second-half of the this which will be needed to complete the boot

@jlebon
Copy link
Member

jlebon commented May 13, 2020

Had a chat with @darkmuggle about this.

While we're agreed the root argument is nicer from an implementation point of view (because it's explicit and doesn't require any guessing), there were some concerns about the UX. Ben brought up that the /dev/mapper/mpatha path is not actually stable, and the more stable one with the WWN, e.g. /dev/disk/by-id/dm-uuid-part4-mpath-0xd99c6a329fd4a0fb, is unwieldy.

Ben suggested adding a udev rule which creates a nicer symlink for devicemapper devices if the underlying filesystem label is root. E.g. /dev/disk/by-label/dm-root. This would help with root on RAID1 too.

@jlebon
Copy link
Member

jlebon commented May 13, 2020

I guess the way I would frame this is that we expand our API from "rootfs is /dev/disk/by-label/root" to "rootfs defaults to /dev/disk/by-label/root but is overridable by the root= karg". The udev rule is simply to make it nicer to use.

@cgwalters
Copy link
Member

I guess the way I would frame this is that we expand our API from "rootfs is /dev/disk/by-label/root" to "rootfs defaults to /dev/disk/by-label/root but is overridable by the root= karg". The udev rule is simply to make it nicer to use.

Yep exactly. This is also what coreos/fedora-coreos-tracker#465 was about.

@jlebon
Copy link
Member

jlebon commented May 13, 2020

Yep exactly. This is also what coreos/fedora-coreos-tracker#465 was about.

Right, yeah. Though I think I'm proposing a middle ground, where we only require a root arg if needed.

@cgwalters
Copy link
Member

Right, yeah. Though I think I'm proposing a middle ground, where we only require a root arg if needed.

I think a guiding principle here for us is "keep the cloud and metal cases as symmetric as possible", while avoiding paying too much cost in cloud for metal things - something like that?

So we ship a single initramfs which has multipath, but it doesn't make sense to start it by default - enable via karg in coreos-installer seems sane to me.

But IMO that leaves us neutral on "root karg by default"; it's a completely free thing to do in cloud, and if we end up with it on a nontrivial percentage of metal installs, then we're making things more symmetric.

Although you had a comment here coreos/fedora-coreos-tracker#465 (comment) around integrating with the bootloader that I want to chase down a bit more.

cgwalters added a commit to cgwalters/coreos-assembler that referenced this pull request May 13, 2020
Right now we allocate a random UUID on each build,
but everyone starting from a particular build will have
the same one...so they're not unique.

This will pair with a new version of
coreos/ignition-dracut#183
so that we can regenerate them on firstboot and ensure
that they are actually unique (as much as possible)
across running systems.
openshift-merge-robot pushed a commit to coreos/coreos-assembler that referenced this pull request May 13, 2020
Right now we allocate a random UUID on each build,
but everyone starting from a particular build will have
the same one...so they're not unique.

This will pair with a new version of
coreos/ignition-dracut#183
so that we can regenerate them on firstboot and ensure
that they are actually unique (as much as possible)
across running systems.
jlebon added a commit to jlebon/ignition-dracut that referenced this pull request May 13, 2020
This is required because the unit needs read-only access to `/boot` in
order to extract any baked in user-provided config and multipathd might
claim ownership.

See: coreos#183
@jlebon
Copy link
Member

jlebon commented May 13, 2020

I split out the patch needed for ignition-setup-user.service into #185. I think we can close this PR now, right?

@darkmuggle
Copy link
Contributor Author

@cgwalters something to note: /dev/disk/by-[label,uuid,id} are not stable. They can and do change; it just depends on when (both device-mapper and physical will use it) you hit the end point.

@darkmuggle
Copy link
Contributor Author

@cgwalters @jlebon thank you most kindly for the good discussion and direction

I concur that its time to close this since the race condition still exists.

@darkmuggle darkmuggle closed this May 13, 2020
jlebon added a commit to jlebon/fedora-coreos-config that referenced this pull request May 14, 2020
This is required because the unit needs read-only access to `/boot` in
order to extract any baked in network config and multipathd might claim
ownership.

See similar patch for `ignition-setup-user.service`:
coreos/ignition-dracut#185

For more information, see:
coreos/ignition-dracut#183 (comment)
@jlebon
Copy link
Member

jlebon commented May 14, 2020

And I've split out from #183 (comment) the bit for coreos-copy-firstboot-network.service into coreos/fedora-coreos-config#398.

jlebon added a commit that referenced this pull request May 19, 2020
This is required because the unit needs read-only access to `/boot` in
order to extract any baked in user-provided config and multipathd might
claim ownership.

See: #183
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants