Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

devmapper regression #233

Closed
runcom opened this issue Nov 9, 2018 · 18 comments · Fixed by #240
Closed

devmapper regression #233

runcom opened this issue Nov 9, 2018 · 18 comments · Fixed by #240

Comments

@runcom
Copy link
Member

runcom commented Nov 9, 2018

See cri-o/cri-o#1883 for more information, the range of commits which probably broke devmapper in CRI-O is:

# this is the range changed with https://github.com/kubernetes-sigs/cri-o/pull/1814 wihch broke devmapper
git log --oneline --ancestry-path 88d80428f9b146f8f9fe7e2e8cc8688a5aae1a4e..68332c059156eae970a03245cfcd4d717fb66ecd

68332c0 Merge pull request #216 from giuseppe/userns-restore-setuid-setgid
2df72f3 chown: restore SUID and SGID bits
1d49427 Merge pull request #215 from nalind/create-error
2805a43 layerStore.Put(): always check for Create() errors
02db7cb Merge pull request #214 from nalind/hold-locks
3df3c9f Hold the layer store lock while diffing
243c4cd Merge pull request #213 from nalind/rolayer-lock
c8670ef Fix a lock inversion
d0cb010 Merge pull request #212 from zmedico/locker-locked-stub-to-lockfile_windows
587b6cc Implement Locker.Locked() for windows
9fcbb57 Merge pull request #210 from zmedico/lock-sanity-checks-for-save-methods
c7ba574 Add lock sanity checks to Save() methods
17c7d1f Merge pull request #205 from rhatdan/subuid
ad12a70 Add more information in errors about missing uidimappings or gidmappings
956a197 Merge pull request #195 from giuseppe/fuse-overlayfs-shifting
3f55e5a shifting: raise an error if the container needs shifting
06cea37 tests: add test for shifting support
1897396 drivers: inform Mount of the mappings used by the container
140e0b6 Makefile: install ffjson
883ee15 store: use the original image if the driver supports shifting
1e60d8a store: include layerID in the error message
9bb1031 drivers: add new method SupportsShifting() to LayerIDMapUpdater
6480245 overlay: enable support for shifting when FUSE is used
8c814e0 overlay: use full paths to the mount program
9f0144d storage.conf: move ostree_repo and skip_mount_home to the correct section
afdedba Merge pull request #201 from rhatdan/mountopt
2569af9 Fix overlay to handle mountopt properly
7098fdc Merge pull request #176 from rhatdan/mount-options
8b1a0f8 Add default mount options to pass to drivers
cb8d712 Merge pull request #200 from rhatdan/mounted
1538971 Change Mounted to return the number of times mounted
46a8a1c Merge pull request #198 from rhatdan/umount
1075a73 Modify storage to allow callers to determine if a mount point is mounted
124de68 Merge pull request #199 from giuseppe/allow-to-override-conf-file
e933db5 storage: rename fuse_program to mount_program
759aab1 storage: allow to override .fuse_program from the conf file
c4aa7ad store: move configuration file parsing to new function
0ab7541 Merge pull request #197 from TomasTomecek/master
404315a log errors from ApplyUncompressedLayer
90d0a58 Merge pull request #194 from rhatdan/contributing
3a5d1ec Potential contributors need to find the CONTRIBUTOR information.
9cbb6cb Merge pull request #137 from giuseppe/ostree-storage
13af8ce tests: add test for ostree deduplication
d33931a tests: set mtime to the epoch
c7fdad4 ci: install ostree
85bf5a4 Vagrantfile: update to Fedora 28
6c122a2 vagrant: install ostree
f3e7ee3 configuration: new option skip_mount_home
52a3781 vfs: add support for ostree deduplication
0c7cb60 overlay: new option to support ostree deduplication
ffd808a ostree: new package
eca8a17 storage: add ostree files
58f557c Merge pull request #191 from giuseppe/overlay-fuse-program
fc4b862 containers-storage: add new option .fuse_program
51f1f85 Merge pull request #190 from marcov/better-failmsg
d990e2b Improve duplicate name error message on container create

runcom referenced this issue Nov 9, 2018
Add force to umount to force the umount of a container image
Add an interface to indicate whether or not the layer is mounted
Add a boolean return from unmount to indicate when the layer is really unmounted

Signed-off-by: Daniel J Walsh <[email protected]>
@rhatdan
Copy link
Member

rhatdan commented Nov 11, 2018

How quickly can kata recreate the issue? Can we start bisecting these to figure out exactly what PR broke Kata?

What is the exact symptom that kata is failing on ?

@runcom
Copy link
Member Author

runcom commented Nov 11, 2018

How quickly can kata recreate the issue? Can we start bisecting these to figure out exactly what PR broke Kata?

I've asked them to do so and they're on it last time we spoke

What is the exact symptom that kata is failing on ?

It's an invalid argument thrown when unmounting the devmapper mount

@rhatdan
Copy link
Member

rhatdan commented Nov 12, 2018

The only argument we ever give to unmount under devicemapper is MNT_DETATCH.

	if err := unix.Unmount(mountPath, unix.MNT_DETACH); err != nil {
		return err
	}

@rhvgoyal Could yo ulook in the kernel when does unmount return EINVAL?

@runcom
Copy link
Member Author

runcom commented Nov 12, 2018

@chavafg @sboeuf this issue is tracking the regression from cri-o/cri-o#1883
Could you assist Dan?

@ghost
Copy link

ghost commented Nov 12, 2018

How quickly can kata recreate the issue? Can we start bisecting these to figure out exactly what PR broke Kata?

I've asked them to do so and they're on it last time we spoke

What is the exact symptom that kata is failing on ?

It's an invalid argument thrown when unmounting the devmapper mount

Looks like same issues can be reproduced by podman then.
https://bugzilla.redhat.com/show_bug.cgi?id=1623944
https://bugzilla.redhat.com/show_bug.cgi?id=1625394

@chavafg
Copy link

chavafg commented Nov 12, 2018

Hi
This can be reproduced very quickly, you just need to setup devicemapper for cri-o and then create and remove a pod.

I use this storage config on /etc/crio/crio.conf:

# Storage driver used to manage the storage of images and containers. Please
# refer to containers-storage.conf(5) to see all available storage drivers.
storage_driver = "devicemapper"

# List to pass options to the storage driver. Please refer to
# containers-storage.conf(5) to see all available storage options.
storage_option = [
        "dm.directlvm_device=/dev/sdb",
        "dm.directlvm_device_force=true",
        "dm.thinp_percent=95",
        "dm.thinp_metapercent=1",
        "dm.thinp_autoextend_threshold=80",
        "dm.thinp_autoextend_percent=20"
]

And then use these steps:

$ sudo crictl runp "$TESTDATA"/sandbox_config.json
$ sudo crictl stopp <pod_id>
$ sudo crictl rmp <pod_id>

I'll try to check which of the commits here is the one that broke devicemapper.

@rhatdan
Copy link
Member

rhatdan commented Nov 12, 2018

Actually if we can get this to happen with podman, should make it easy to diagnose.

@chavafg
Copy link

chavafg commented Nov 13, 2018

So I revendored github.com/containers/storage to commit 124de6810ea8ab6b69e382af281eff2c214bf3e2 in cri-o and everything worked correctly.
The next change is 46a8a1cddbd8486f12bbebc9ffcdd7f65e81b48f from #198 and using this commit the issue can be reproduced.

BTW, I could also use podman to reproduce the issue:

fuentess@crio-dm:~/go/src/github.com/containers/libpod$ sudo podman run -dt -e HTTPD_VAR_RUN=/var/run/httpd -e HTTPD_MAIN_CONF_D_PATH=/etc/httpd/conf.d                   -e HTTPD_MAIN_CONF_PATH=/etc/httpd/conf
                 -e HTTPD_CONTAINER_SCRIPTS_PATH=/usr/share/container-scripts/httpd/                   registry.fedoraproject.org/f27/httpd /usr/bin/run-httpd
ERRO[0000] File descriptor 6 (/dev/mapper/control) leaked on pvdisplay invocation. Parent PID 3619: podman
  Failed to find physical volume "/dev/sdb".
  error="exit status 5"
Trying to pull registry.fedoraproject.org/f27/httpd...Getting image source signatures
Copying blob sha256:ff3dab903f926d26db009e0dcd575b8ccea7635b6b225835c35df765662707ec
 80.73 MB / 80.73 MB [=====================================================] 27s
Copying blob sha256:9347d6e9d8644ec22930b7dd78ba0cfcab8fd9a09d57545d4d74649ca929df0c
 7.30 MB / 7.30 MB [========================================================] 0s
Copying blob sha256:2fc5c44251d42d4e8bdc547a79e7ad1efffdf11f74ceea6a83e2fb6743aeff7d
 44.82 MB / 44.82 MB [=====================================================] 14s
Copying config sha256:18f01f6f77ef941f5b31dd007d113620cbb1939ac33d1922edce7d9d2c9ebad7
 6.55 KB / 6.55 KB [========================================================] 0s
Writing manifest to image destination
Storing signatures
3128d42ada831a56614c35dceefbddacf1128d6101e3a62b9b0dd974bb3c1a9e
fuentess@crio-dm:~/go/src/github.com/containers/libpod$ sudo podman ps
CONTAINER ID  IMAGE                                        COMMAND               CREATED         STATUS             PORTS  NAMES
3128d42ada83  registry.fedoraproject.org/f27/httpd:latest  container-entrypo...  26 seconds ago  Up 24 seconds ago         loving_volhard
fuentess@crio-dm:~/go/src/github.com/containers/libpod$ sudo podman stop 3128d42ada83
3128d42ada831a56614c35dceefbddacf1128d6101e3a62b9b0dd974bb3c1a9e
fuentess@crio-dm:~/go/src/github.com/containers/libpod$ sudo podman rm 3128d42ada83
ERRO[0000] devmapper: Error unmounting device 12dad3ad01aa2823cce01e78e35ef02ba6745bb850cbd46c978c3438873bffaa: invalid argument
error removing container 3128d42ada831a56614c35dceefbddacf1128d6101e3a62b9b0dd974bb3c1a9e root filesystem: invalid argument

I used this storage configuration for podman:

fuentess@crio-dm:~/go/src/github.com/containers/libpod$ cat /etc/containers/storage.conf
[storage]
driver = "devicemapper"


[storage.options]

[storage.options.thinpool]

directlvm_device="/dev/sdb"
directlvm_device_force="True"
thinp_percent=95
thinp_metapercent=1
thinp_autoextend_threshold=80
thinp_autoextend_percent=20

@runcom
Copy link
Member Author

runcom commented Nov 13, 2018

So I revendored github.com/containers/storage to commit 124de6810ea8ab6b69e382af281eff2c214bf3e2 in cri-o and everything worked correctly.
The next change is 46a8a1cddbd8486f12bbebc9ffcdd7f65e81b48f from #198 and using this commit the issue can be reproduced.

so, reverting #198 altogether fixes your issue?

@runcom
Copy link
Member Author

runcom commented Nov 14, 2018

This the only suspicious line I've found in #198 which is still unmounting w/o force (so leaving mounts behind) https://github.com/containers/storage/pull/198/files#diff-a281f22627802db5c62d316f5eac7a96R932

@sboeuf
Copy link

sboeuf commented Nov 14, 2018

@runcom which line? The link you provided does not specify any particular line or range.
I'm asking to make sure we're on the same page, and potentially @chavafg could confirm if this is the culprit.

@runcom
Copy link
Member Author

runcom commented Nov 14, 2018

Here it is https://github.com/rhatdan/storage/blob/1075a73cac3054422c5cda2712ec5412eedce573/layers.go#L932

I doubt that can be the issue but it's the only place I see Unmount(..., false) and if you guys can reproduce, the better. I'll take another look at #198 to spot some flaw which could lead to this, always assuming #198 is the culprit.

@sboeuf @chavafg could you guys try reverting that PR altogether to actually narrow down if that's the real issue?

@sboeuf
Copy link

sboeuf commented Nov 14, 2018

@runcom

I doubt that can be the issue but it's the only place I see Unmount(..., false) and if you guys can reproduce, the better.

Yes of course. @chavafg could you please try to set this line to _, err := s.r.Unmount(s.id, true), recompile, and see if you can still reproduce?

@sboeuf @chavafg could you guys try reverting that PR altogether to actually narrow down if that's the real issue?

I don't follow, @chavafg already confirmed that this PR was the problem. He did bisect and found that it was the one causing the failure.

@runcom
Copy link
Member Author

runcom commented Nov 14, 2018

@sboeuf sorry I misread his comment, my bad, then cool yeah, if we know that's the issue, let's narrow it down and fix it, thanks à lot guys for the help

@sboeuf
Copy link

sboeuf commented Nov 14, 2018

@runcom no problem ;)
And yeah thanks @chavafg for spending time bisecting!

@runcom
Copy link
Member Author

runcom commented Nov 15, 2018

alright I may have found the issue, I'm running kata ci in crio to see if that's the issue, I'll open a PR here if that fixes the issue for Kata (cri-o/cri-o#1910)

@runcom
Copy link
Member Author

runcom commented Nov 15, 2018

see #240 and cri-o/cri-o#1910 which are fixing this

@runcom
Copy link
Member Author

runcom commented Nov 15, 2018

ok devmapper is now fixed, but something is failing on setting ulimits in container when kata is used, maybe it's the custom kernel used?

runcom added a commit to runcom/storage that referenced this issue Nov 15, 2018
This check has been wrongly removed with containers#198
The check must stay as it's now part of the Stop/Delete API so reintroduce it back
This fixes containers#233 and the associated CRI-O issues
This PR + cri-o/cri-o#1910 fully fix the issue
I'm going to revendor c/storage in CRI-O to full fix crio after this is merged

Signed-off-by: Antonio Murdaca <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants