CI Test Failure under updated kernel 5.x #419

Closed
cevich opened this issue Aug 13, 2019 · 16 comments · Fixed by #439

Comments

@cevich
Member

cevich commented Aug 13, 2019

Steps to reproduce (using same VM images as master):

$ hack/get_ci_vm.sh fedora-cloud-base-30-1-2-1556821664
...cut...

Everything works fine, including the package install/update.

[root@cevich-fedora-cloud-base-30-1-2-1556821664 storage]# contrib/cirrus/build_and_test.sh

The tests pass as normal, then I execute reboot. After waiting a minute, reconnect to the VM and execute tests a second time:

[root@cevich-fedora-cloud-base-30-1-2-1556821664 storage]# reboot
Connection to 35.226.8.25 closed by remote host.
Connection to 35.226.8.25 closed.
ERROR: (gcloud.compute.ssh) [/bin/ssh] exited with return code [255].

Offering to Delete cevich-fedora-cloud-base-30-1-2-1556821664 (Might take a minute or two)

Note: It's safe to answer N, then re-run script again later.
+  sudo podman run -it --rm -e AS_ID=4179 -e AS_USER=cevich --security-opt label=disable -v /tmp/get_ci_vm.sh_tmpdir_NsUYzF:/home/cevich -v /home/cevich/.config/gcloud:/home/cevich/.config/gcloud -v /home/cevich/.config/gcloud/ssh:/home/cevich/.ssh -v /home/cevich/devel/storage:/home/cevich/devel/storage quay.io/cevich/gcloud_centos:latest --configuration=storage --project=storage-240716 compute instances delete --zone us-central1-b --delete-disks=all cevich-fedora-cloud-base-30-1-2-1556821664
The following instances will be deleted. Any attached disks configured
 to be auto-deleted will be deleted unless they are attached to any 
other instances or the `--keep-disks` flag is given and specifies them
 for keeping. Deleting a disk is irreversible and any data on the disk
 will be lost.
 - [cevich-fedora-cloud-base-30-1-2-1556821664] in [us-central1-b]

Do you want to continue (Y/n)?  n

ERROR: (gcloud.compute.instances.delete) Deletion aborted by user.
[cevich@cevich storage]$ hack/get_ci_vm.sh fedora-cloud-base-30-1-2-1556821664
...cut...
[root@cevich-fedora-cloud-base-30-1-2-1556821664 storage]# contrib/cirrus/build_and_test.sh

This time the tests fail in the same/similar manner as in #408

Master currently is: 1a0442e

@cevich
Member Author

cevich commented Aug 13, 2019

[root@cevich-fedora-cloud-base-30-1-2-1556821664 storage]# grubby --default-kernel
/boot/vmlinuz-5.2.7-200.fc30.x86_64
[root@cevich-fedora-cloud-base-30-1-2-1556821664 storage]# grubby --default-index
1
[root@cevich-fedora-cloud-base-30-1-2-1556821664 storage]# grubby --set-default-index=0
The default is /boot/loader/entries/f241772f3e32496c92975269b5794615-5.0.9-301.fc30.x86_64.conf with index 0 and kernel /boot/vmlinuz-5.0.9-301.fc30.x86_64
[root@cevich-fedora-cloud-base-30-1-2-1556821664 storage]# reboot
...cut...
[root@cevich-fedora-cloud-base-30-1-2-1556821664 storage]# contrib/cirrus/build_and_test.sh
...cut...
Tests pass!

Conclusion: Kernel /boot/vmlinuz-5.2.7-200.fc30.x86_64 is incompatible with c/storage and/or tests.
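
As a rough aid for anyone reproducing this, the data points so far can be turned into a small check on the running kernel. This is only a sketch: the function name `kernel_status` and its three labels are made up here, and the cut-offs are assumptions. The thread only shows 4.18.17 and 5.0.9 passing and 5.2.7 failing; nothing in between (or newer) was tested.

```shell
#!/usr/bin/env bash
# Classify a kernel release string (as printed by `uname -r`) against the
# results reported in this thread.
# ASSUMPTION: the 5.1 / 5.2 boundaries are guesses; only 4.18.17, 5.0.9
# (passing) and 5.2.7 (failing) were actually observed.
kernel_status() {
    local release="$1" major minor
    major="${release%%.*}"      # text before the first dot, e.g. "5"
    minor="${release#*.}"       # drop "<major>."
    minor="${minor%%.*}"        # keep text up to the next dot, e.g. "2"
    if (( major > 5 || (major == 5 && minor >= 2) )); then
        echo "reported-failing"
    elif (( major == 5 && minor == 1 )); then
        echo "untested"
    else
        echo "reported-passing"
    fi
}
```

Running `kernel_status "$(uname -r)"` before `contrib/cirrus/build_and_test.sh` would at least flag whether the VM booted into a kernel from the failing range.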

@cevich
Member Author

cevich commented Aug 13, 2019

On F29:

[root@cevich-fedora-cloud-base-29-1-2-1541789245 storage]# uname -r
4.18.17-300.fc29.x86_64
[root@cevich-fedora-cloud-base-29-1-2-1541789245 storage]# contrib/cirrus/build_and_test.sh 
...tests pass...
[root@cevich-fedora-cloud-base-29-1-2-1541789245 storage]# reboot
...reconnect...
[root@cevich-fedora-cloud-base-29-1-2-1541789245 storage]# contrib/cirrus/build_and_test.sh
...fails...
[root@cevich-fedora-cloud-base-29-1-2-1541789245 storage]# uname -r
5.2.7-100.fc29.x86_64
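
To make the before/after-reboot comparison less manual, a wrapper along these lines could log the kernel release next to each run's exit status. It is a minimal sketch: `run_and_record` and the log path are hypothetical names introduced for illustration, not part of the repository.

```shell
#!/usr/bin/env bash
# Run a command, then append one line to a log file recording the UTC time,
# the running kernel release, the command's exit status and the command itself.
# The wrapper returns the same exit status as the command.
run_and_record() {
    local log="$1" rc
    shift
    "$@"
    rc=$?
    printf '%s kernel=%s exit=%s cmd=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(uname -r)" "$rc" "$*" >> "$log"
    return "$rc"
}
```

For example, `run_and_record /var/tmp/ci_runs.log contrib/cirrus/build_and_test.sh` could be run once before the reboot and once after, and the two log lines compared. The log is placed under /var/tmp on the assumption that it survives a reboot on these images.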

@cevich
Member Author

cevich commented Aug 14, 2019

@nalind @rhatdan PTAL, this is also blocking #418 and basically any future testing (if/when kernel is updated).

@rhatdan
Member

rhatdan commented Aug 25, 2019

Does this work on later Fedora? Can we drop F29?

@cevich
Member Author

cevich commented Aug 26, 2019

@rhatdan nope it's broken in F30 as well (same kernel).

@cevich
Member Author

cevich commented Aug 26, 2019

oh geeze, my paste-bins of the errors are gone 😠

Here's an example under F30: https://cirrus-ci.com/task/5174065982078976
and one under F29: https://cirrus-ci.com/task/6299965888921600

@cevich
Member Author

cevich commented Aug 26, 2019

To be clear, I don't know for sure that this is a kernel problem. It could easily be a race in the tests that is simply triggered by the different kernel. Probably the best way forward is for someone who knows these tests to go in with hack/get_ci_vm.sh and reproduce it (it reproduced 100% of the time when I opened the issue).

@cevich
Member Author

cevich commented Aug 28, 2019

Thinking about this more over the last few days, the next step here seems to be getting more data on what is failing and why. I don't think the errors in the test output are detailed enough to pinpoint it. Maybe running the test commands manually would help?

@cevich
Member Author

cevich commented Aug 29, 2019

@rhatdan @nalind I remember someone said something like "this is failing because some layer isn't unmounting". Any chance this is at all related to containers/podman#3870 (where I'm also dealing with an updated system encountering test-breakage)?

@cevich
Member Author

cevich commented Sep 3, 2019

update: this cannot be related to 3870 since we're not using podman-remote 😖

cevich changed the title from "CI Test Failure after rebooting" to "CI Test Failure under updated kernel 5.x" on Sep 3, 2019
@cevich
Member Author

cevich commented Oct 4, 2019

Got frustrated and did some debugging. Something is seriously broken with the BATS tests when using newer VM images. For example, using hack/get_ci_vm.sh fedora-30-libpod-5816955207942144:

  1. If I run them via contrib/cirrus/build_and_test.sh they die here:
+  make STORAGE_DRIVER=overlay local-test-integration
++ time bats --tap .
1..47
ok 1 absolute-paths
ok 2 applydiff
ok 3 image-data
ok 4 container-data
not ok 5 changes
# (in test file ./changes.bats, line 14)
#   `[ "$status" -eq 0 ]' failed
...cut...
  2. If I run just tests/test_runner.bash (what the Makefile does), they all run fine.
  3. If I run STORAGE_DRIVER=overlay tests/test_runner.bash I get the changes test failure

😖

Hmmm....so something about STORAGE_DRIVER=overlay...

@cevich
Member Author

cevich commented Oct 4, 2019

...more manually running tests/test_runner.bash with various options.

Note: I moved /etc/containers/storage.conf out of the way (not sure it made a difference)

This seems to be the pattern on F30 (fully updated):

  • STORAGE_DRIVER=overlay
    Many / Most tests fail
  • STORAGE_DRIVER=overlay STORAGE_OPTION=overlay.mount_program=/usr/bin/fuse-overlayfs
    Only import-layer-ostree breaks
  • STORAGE_DRIVER=vfs
    Only import-layer-ostree breaks

@rhatdan @nalind any idea why it seems like /usr/bin/fuse-overlayfs is required?

What about the import-layer-ostree test failures? That strikes me as a missing package or something equally simple.

(All of these problems are easily reproducible using the hack script)
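
For anyone repeating the comparison, the three configurations above could be looped over rather than typed by hand. This is a sketch only: `run_matrix` and the `RUNNER` override are names invented here, and it assumes the runner script honors `STORAGE_DRIVER` and `STORAGE_OPTION` from the environment, as the manual runs above suggest.

```shell
#!/usr/bin/env bash
# Run the test runner once per storage configuration and print one summary
# line ("<config>: exit <status>") per run.
# RUNNER is a hypothetical override added for illustration; it defaults to
# the runner script named in this thread.
RUNNER="${RUNNER:-tests/test_runner.bash}"

run_matrix() {
    local cfg rc
    # Each entry is a space-separated list of VAR=value assignments.
    local configs=(
        "STORAGE_DRIVER=overlay"
        "STORAGE_DRIVER=overlay STORAGE_OPTION=overlay.mount_program=/usr/bin/fuse-overlayfs"
        "STORAGE_DRIVER=vfs"
    )
    for cfg in "${configs[@]}"; do
        # env applies the assignments to this single invocation only.
        # Word splitting of $cfg is intentional, so it is left unquoted.
        env $cfg "$RUNNER" >/dev/null 2>&1
        rc=$?
        echo "${cfg}: exit ${rc}"
    done
}
```

Calling `run_matrix` from the top of the repository prints three lines. The runner's own output is discarded here, so a failing configuration would still need to be re-run by hand to see which tests broke.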

@rhatdan
Member

rhatdan commented Oct 5, 2019

fuse-overlay would indicate that overlayfs is broken. Are there any storageopts involved? metacopyup?

The import-layer-ostree I have no idea. @giuseppe Any ideas?

@giuseppe
Member

giuseppe commented Oct 5, 2019

  • STORAGE_DRIVER=overlay STORAGE_OPTION=overlay.mount_program=/usr/bin/fuse-overlayfs
    Only import-layer-ostree breaks

this seems like the correct configuration.

About the import-layer-ostree, that doesn't depend on overlay (you can see it also fails with vfs), could you point me to the test failure?

giuseppe added a commit to giuseppe/storage that referenced this issue Oct 5, 2019
it was an attempt to use OSTree to deduplicate files, at the time we
already had a dependency on OSTree for system containers in
containers/image.  Since the feature never really took off, let's just
drop it.

Closes: containers#419

Signed-off-by: Giuseppe Scrivano <[email protected]>
@giuseppe
Member

giuseppe commented Oct 5, 2019

I think we should just drop ostree deduplication. It made sense when we were carrying the dependency from containers/image as it was used for system containers, but at this point the cost of having it is much higher than any benefit. The feature never really took off, requires a specific configuration and has a lot of side effects.

Also, once OCI v2 is a thing, deduplication will surely be done in a different way.

@cevich
Member Author

cevich commented Oct 7, 2019

fuse-overlay would indicate that overlayfs is broken. Are there any storageopts involved? metacopyup?

None at all, I even moved /etc/containers/storage.conf out of the way. In CI (using the new image) the failure comes from this line. Remember, this VM image has a much newer kernel than what's currently used in CI. Is it possible that some kernel option default changed relative to what the tests expect?
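
One way to look into the question of changed defaults might be to dump the overlay module's parameters on both the old and new kernels and diff the results. The sketch below assumes the overlay module is loaded and exposes its parameters under /sys/module/overlay/parameters; `dump_params` is a made-up helper name.

```shell
#!/usr/bin/env bash
# Print "name=value" for every regular file in a module parameter directory.
# ASSUMPTION: the default path exists only when the overlay module is loaded.
dump_params() {
    local dir="${1:-/sys/module/overlay/parameters}" f
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        printf '%s=%s\n' "$(basename "$f")" "$(cat "$f")"
    done
}
```

Saving the output of `dump_params` under each kernel and comparing the two files would show whether any overlay default differs between them. It would not rule out other kernel changes.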
