-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rootfs: remove --no-mount-fallback and finally fix MS_REMOUNT #3967
rootfs: remove --no-mount-fallback and finally fix MS_REMOUNT #3967
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, left some minor comments :-)
@rpluem-vf |
@AkihiroSuda I think it should, because the issue was caused by containerd requesting only But that test is removed, so added a comment to keep it :) |
I haven't tested it and hence I am not sure. In general I agree with the imposed logic to fail if some mount flags are explicitly requested that are not possible with a rootless bind mount due to the underlying mount flags of the source filesystem are the opposite, but it changes the original test case (before #3805). I am not sure if the direct users of runc might not at least explicitly request |
There is a test issue with this I'm still debugging.
Imho this behaviour is a bug. We could special-case it if it breaks something in 1.2-rc1 but silently overwriting an option the user explicitly passed seems wrong. If the user doesn't care if the flag is set, they should omit it. I suspect it's actually a spec violation to not enforce explicitly specified flags (then again, if this breaks users we need to figure out a workaround). |
@cyphar +1. Plus I don't know of any real world example of that use case. |
@cyphar it seems cirrus is failing due to a timeout. Wanna re-push to re-trigger the CI? |
The timeout is the test issue I'm debugging 😉 |
The timeout for cirrus jobs is 30 minutes, and normally they are finished in 5 to 20 minutes, depending on a job (see e.g. https://cirrus-ci.com/build/5948000845955072). Reproduced the issue locally: [kir@kir-rhat runc]$ sudo make localintegration
....
ok 204 userns with inaccessible mount + exec
ok 205 userns with bind mount before a cgroupfs mount # skip test requires cgroups_v1
ok 206 runc version
# HANGS HERE Or, for a narrower scope, we can use sudo bats tests/integration/mounts_sshfs.bats Let's see what's going on once it hangs: $ ps axf
...
11841 pts/0 S+ 0:00 | | \_ sudo make localintegration RUNC_USE_SYSTEMD=yes
11849 pts/0 S+ 0:00 | | \_ make localintegration RUNC_USE_SYSTEMD=yes
14138 pts/0 S+ 0:00 | | \_ /usr/bin/bash /usr/libexec/bats-core/bats -t tests/integration
14146 pts/0 S+ 0:00 | | \_ /usr/bin/bash /usr/libexec/bats-core/bats -t tests/integration
14147 pts/0 S+ 0:00 | | \_ /usr/bin/bash /usr/libexec/bats-core/bats-format-cat --dummy-flag
14148 pts/0 S+ 0:00 | | \_ cat
... This is bats waiting for something to finish/close. I've seen it before when there is some leftover process that inherits an extra fd from bats. This time the process is recvtty; in fact, two:
Executing Usually, recvtty is killed from It happens because with the newly added cases, due to Here's the fix: diff --git a/tests/integration/mounts_sshfs.bats b/tests/integration/mounts_sshfs.bats
index 09fcced2..921bc063 100644
--- a/tests/integration/mounts_sshfs.bats
+++ b/tests/integration/mounts_sshfs.bats
@@ -8,9 +8,12 @@ function setup() {
}
function teardown() {
- # Some distros do not have fusermount installed
- # as a dependency of fuse-sshfs, and good ol' umount works.
- fusermount -u "$DIR" || umount "$DIR"
+ if [ -v DIR ]; then
+ # Some distros do not have fusermount installed
+ # as a dependency of fuse-sshfs, and good ol' umount works.
+ fusermount -u "$DIR" || umount "$DIR"
+ unset DIR
+ fi
teardown_bundle
} PS this dark corner of bash/bats is yet another example of why rewriting everything all shell tests in Go or Python would be a good idea. As much as I love bash, when things are getting too complicated, it becomes unbearable. |
Thanks @kolyshkin! It looked like a teardown issue, but I was busy with other things so I didn't look into it too deeply. Looks pretty nasty to track down. Also, looking at the tests again, I think
The problem is that writing integration tests becomes more difficult, and there might be some slight differences between the testing environment (being spawned from a That being said, as someone who rewrote large parts of the test running code in bats, that codebase is pretty gnarly and I am surprised it works as well as it does. |
After doing some more digging, we actually have had pretty serious bugs in this code for a long time -- it seems that nobody has actually noticed that requesting "rw" specifically would be ignored and it was a question of random chance whether bind-mounts would clear flags not specified in the options of the mount, and a bunch of other issues. We basically need to revert 97f5ee4 (which, funnily enough, was trying to fix the same issue as #3805) as well. It seems clear to me the correct behaviour should be "give the user the exact set of options they specified, and if not possible then allow the original options to be set if they didn't explicitly request the opposite". At the moment, neither condition is consistently met by runc. |
@opencontainers/runc-maintainers this should be ready to review. The atime behaviour is quite subtle so I've added some (possibly even too many) comments to explain the reasoning. This will need a prominent changelog entry too... |
libcontainer/rootfs_linux.go
Outdated
// other flags, we default to MS_STRICTATIME since that's the only | ||
// setting that makes sense here. Critically, we *always* call mount(2) | ||
// with some MS_*ATIME flag set to ensure we have consistent results. | ||
if m.Flags&mntAtimeFlags == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we should also handle "nostrictatime" here too. Or maybe "norelatime,nostrictatime" should be rejected as a configuration because it makes no sense (what atime is the user actually requesting?)? Or maybe we should just copy the host flags?
While copying the host flags seems "reasonable", this is at odds with every other flag (which we clear if not explicitly requested). On the other hand, flags like "atime" and "diratime" make very little sense without being based on the original mount's settings. On the other other hand, the same thing could be said about the "dev", "exec", etc options -- they only have an effect in the fallback case.
I suspect that we might want to add an "inherit" flag to allow users to explicitly request inheriting the original mount's settings. Only inheriting some flags will just lead to more confusion.
@cyphar can you label this as impact/changelog and add a changelog entry? (We can update it afterwards if some changes are done during review) |
This comment was marked as resolved.
This comment was marked as resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. The atime flags logic is somewhat hard to follow, but (1) it seems to do what the comments describe and (2) I couldn't see a way to make it more readable.
I think what's missing from this PR is a bunch of test cases. From the top of my head, I don't remember any tests covering atime flags.
@kolyshkin The last integration test added has a bunch of arime flag tests, what other kind of tests would you prefer? |
My bad, something mixed up in my mind. Left a couple of nits for tests, otherwise LGTM |
Marked this as a draft because there are ambiguities in runtime-spec as to the right behaviour here. A straight-forward reading is that we need to copy the
From where I'm standing, Also, the above behaviour from I think we need a spec issue for this... |
I have moved the |
Function teardown assumes that every test case will call setup_sshfs. Currently, this assumption is true, but once a test case that won't call setup_sshfs is added (say because it has some extra "requires" or "skip"), it will break bats, so any bats invocation involving such a test case will end up hanging after the last test case is run. The reason is, we have set -u in helpers.bash to help catching the use of undefined variables. In the above scenario, such a variable is DIR, which is referenced in teardown but is only defined after a call to setup_sshfs. As a result, bash that is running the teardown function will exit upon seeing the first $DIR, and thus teardown_bundle won't be run. This, in turn, results in a stale recvtty process, which inherits bats' fd 3. Until that fd is closed, bats waits for test logs. Long story short, the fix is to - check if DIR is set before referencing it; - unset it after unmount. PS it is still not clear why there is no diagnostics about the failed teardown. Signed-off-by: Kir Kolyshkin <[email protected]> Signed-off-by: Aleksa Sarai <[email protected]>
The original reasoning for this option was to avoid having mount options be overwritten by runc. However, adding command-line arguments has historically been a bad idea because it forces strict-runc-compatible OCI runtimes to copy out-of-spec features directly from runc and these flags are usually quite difficult to enable by users when using runc through several layers of engines and orchestrators. A far more preferable solution is to have a heuristic which detects whether copying the original mount's mount options would override an explicit mount option specified by the user. In this case, we should return an error. You only end up in this path in the userns case, if you have a bind-mount source with locked flags. During the course of writing this patch, I discovered that several aspects of our handling of flags for bind-mounts left much to be desired. We have completely botched the handling of explicitly cleared flags since commit 97f5ee4 ("Only remount if requested flags differ from current"), with our behaviour only becoming increasingly more weird with 50105de ("Fix failure with rw bind mount of a ro fuse") and da780e4 ("Fix bind mounts of filesystems with certain options set"). In short, we would only clear flags explicitly request by the user purely by chance, in ways that it really should've been reported to us by now. The most egregious is that mounts explicitly marked "rw" were actually mounted "ro" if the bind-mount source was "ro" and no other special flags were included. In addition, our handling of atime was completely broken -- mostly due to how subtle the semantics of atime are on Linux. Unfortunately, while the runtime-spec requires us to implement mount(8)'s behaviour, several aspects of the util-linux mount(8)'s behaviour are broken and thus copying them makes little sense. Since the runtime-spec behaviour for this case (should mount options for a "bind" mount use the "mount --bind -o ..." or "mount --bind -o remount,..." semantics? Is the fallback code we have for userns actually spec-compliant?) and the mount(8) behaviour (see [1]) are not well-defined, this commit simply fixes the most obvious aspects of the behaviour that are broken while keeping the current spirit of the implementation. NOTE: The handling of atime in the base case is left for a future PR to deal with. This means that the atime of the source mount will be silently left alone unless the fallback path needs to be taken, and any flags not explicitly set will be cleared in the base case. Whether we should always be operating as "mount --bind -o remount,..." (where we default to the original mount source flags) is a topic for a separate PR and (probably) associated runtime-spec PR. So, to resolve this: * We store which flags were explicitly requested to be cleared by the user, so that we can detect whether the userns fallback path would end up setting a flag the user explicitly wished to clear. If so, we return an error because we couldn't fulfil the configuration settings. * Revert 97f5ee4 ("Only remount if requested flags differ from current"), as missing flags do not mean we can skip MS_REMOUNT (in fact, missing flags are how you indicate a flag needs to be cleared with mount(2)). The original purpose of the patch was to fix the userns issue, but as mentioned above the correct mechanism is to do a fallback mount that copies the lockable flags from statfs(2). * Improve handling of atime in the fallback case by: - Correctly handling the returned flags in statfs(2). - Implement the MNT_LOCK_ATIME checks in our code to ensure we produce errors rather than silently producing incorrect atime mounts. * Improve the tests so we correctly detect all of these contingencies, including a general "bind-mount atime handling" test to ensure that the behaviour described here is accurate. This change also inlines the remount() function -- it was only ever used for the bind-mount remount case, and its behaviour is very bind-mount specific. [1]: util-linux/util-linux#2433 Reverts: 97f5ee4 ("Only remount if requested flags differ from current") Fixes: 50105de ("Fix failure with rw bind mount of a ro fuse") Fixes: da780e4 ("Fix bind mounts of filesystems with certain options set") Signed-off-by: Aleksa Sarai <[email protected]>
@cyphar @lifubang this fails in debian testing, with or without sshfs. I was trying to debug while you merge this. The error is not obvious for me when running this as root:
(this run is uninstalling sshfs. With ssfs there was an error on Do you have an idea on what could fix this? |
Opened this issue to tackle this down: #4093 |
The original reasoning for this option was to avoid having mount options
be overwritten by runc. However, adding command-line arguments has
historically been a bad idea because it forces strict-runc-compatible
OCI runtimes to copy out-of-spec features directly from runc and these
flags are usually quite difficult to enable by users when using runc
through several layers of engines and orchestrators.
A far more preferable solution is to have a heuristic which detects
whether copying the original mount's mount options would override an
explicit mount option specified by the user. In this case, we should
return an error. You only end up in this path in the userns case, if you
have a bind-mount source with locked flags.
During the course of writing this patch, I discovered that several
aspects of our handling of flags for bind-mounts left much to be
desired. We have completely botched the handling of explicitly cleared
flags since commit 97f5ee4 ("Only remount if requested flags differ
from current"), with our behaviour only becoming increasingly more weird
with 50105de ("Fix failure with rw bind mount of a ro fuse") and
da780e4 ("Fix bind mounts of filesystems with certain options
set"). In short, we would only clear flags explicitly request by the
user purely by chance, in ways that it really should've been reported to
us by now. The most egregious is that mounts explicitly marked "rw" were
actually mounted "ro" if the bind-mount source was "ro" and no other
special flags were included. In addition, our handling of atime was
completely broken -- mostly due to how subtle the semantics of atime are
on Linux.
Unfortunately, while the runtime-spec requires us to implement
mount(8)'s behaviour, several aspects of the util-linux mount(8)'s
behaviour are broken and thus copying them makes little sense. Since the
runtime-spec behaviour for this case (should mount options for a "bind"
mount use the "mount --bind -o ..." or "mount --bind -o remount,..."
semantics? Is the fallback code we have for userns actually
spec-compliant?) and the mount(8) behaviour (see 1) are not
well-defined, this commit simply fixes the most obvious aspects of the
behaviour that are broken while keeping the current spirit of the
implementation.
NOTE: The handling of atime in the base case is left for a future PR to
deal with. This means that the atime of the source mount will be
silently left alone unless the fallback path needs to be taken, and any
flags not explicitly set will be cleared in the base case. Whether we
should always be operating as "mount --bind -o remount,..." (where we
default to the original mount source flags) is a topic for a separate PR
and (probably) associated runtime-spec PR.
So, to resolve this:
We store which flags were explicitly requested to be cleared by the
user, so that we can detect whether the userns fallback path would end
up setting a flag the user explicitly wished to clear. If so, we
return an error because we couldn't fulfil the configuration settings.
Revert 97f5ee4 ("Only remount if requested flags differ from
current"), as missing flags do not mean we can skip MS_REMOUNT (in
fact, missing flags are how you indicate a flag needs to be cleared
with mount(2)). The original purpose of the patch was to fix the
userns issue, but as mentioned above the correct mechanism is to do a
fallback mount that copies the lockable flags from statfs(2).
Improve handling of atime in the fallback case by:
produce errors rather than silently producing incorrect atime
mounts.
Improve the tests so we correctly detect all of these contingencies,
including a general "bind-mount atime handling" test to ensure that
the behaviour described here is accurate.
This change also inlines the remount() function -- it was only ever used
for the bind-mount remount case, and its behaviour is very bind-mount
specific.
Reverts: 97f5ee4 ("Only remount if requested flags differ from current")
Fixes: 50105de ("Fix failure with rw bind mount of a ro fuse")
Fixes: da780e4 ("Fix bind mounts of filesystems with certain options set")
Signed-off-by: Aleksa Sarai [email protected]