Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI flake: podman-remote: no output from container (and "does not exist in database"?) #7195

Closed
edsantiago opened this issue Aug 3, 2020 · 14 comments · Fixed by #7451
Closed
Labels
flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. remote Problem is in podman-remote

Comments

@edsantiago
Copy link
Member

This is a bad report. I have no reproducer nor any real sense for what's going on.

I'm seeing consistent flakes in #7111 . The failing test is always "podman run : user namespace preserved root ownership" which is simply a quick loop of podman run commands. The last set of failures all looked like:

[+0136s] # # podman-remote --url ... run --rm --user=100 --userns=keep-id quay.io/libpod/alpine_labels:latest stat -c %u:%g:%n /etc
[+0136s] # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
[+0136s] # #|     FAIL: run  --user=100 --userns=keep-id (/etc)   <<<--- these flags are not always the same
[+0136s] # #| expected: '0:0:/etc'
[+0136s] # #|   actual: ''
[+0136s] # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In two of the three most recent failures, in teardown, there's a podman rm -a -f that barfs with:

[+0136s] # Error: container d0ea34aaeffe07cc3e3f7f79933372a2bab825ef97e17c580eb1e1e94b2ac7e7 does not exist in database: no such container

Logs: fedora 32, fedora 31, special testing rootless.

In an even earlier run, in special_testing_rootless, there was a different error in a different test:

[+0257s] not ok 74 podman volume with --userns=keep-id
[+0257s] # $ /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman.fpYZ0P run --rm -v /tmp/podman_bats.LsZK3v/volume_O8zRoMsGmt:/vol:z quay.io/libpod/alpine_labels:latest stat -c %u:%s /vol/myfile
[+0257s] # read unixpacket @->/run/user/23298/libpod/tmp/socket/1bee53f8e19e2b03ff773e504fead7f5994b8be5545b315a73a3d2cef290f567/attach: read: connection reset by peer
[+0257s] # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
[+0257s] # #|     FAIL: w/o keep-id: stat(file in container) == root
[+0257s] # #| expected: '0:0'
[+0257s] # #|   actual: 'read unixpacket @->/run/user/23298/libpod/tmp/socket/1bee53f8e19e2b03ff773e504fead7f5994b8be5545b315a73a3d2cef290f567/attach: read: connection reset by peer'
[+0257s] # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The socket above is a conmon one. I don't know if this is a conmon problem.

The common factor seems to be --userns=keep-id.

@edsantiago edsantiago added flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. remote Problem is in podman-remote labels Aug 3, 2020
@edsantiago
Copy link
Member Author

Here's another one, completely different test, but I have a feeling it's related:

# $ podman-remote --url unix:/tmp/podman.n9RjqF start --attach --interactive c5cb2c8d5b2f3e9e04e3bbdab278bd2826fb60e9906d86c18b9e2ef829c7c848
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #|     FAIL: output from podman-start on created ctr
# #| expected: '0OeSILw3HySH9MVHmIn5N0aeZpDkC6w1lF5LvQqC'
# #|   actual: ''
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This time there's no --userns=keep-id but the symptom is the same: test expected output, no output was received.

@lsm5 lsm5 self-assigned this Aug 6, 2020
@edsantiago
Copy link
Member Author

I have a hunch, with zero supporting evidence, that this is another instance:

# #|     FAIL: SELinux role should always be system_r
# #| expected: '.*_u:system_r:.*'
# #|   actual: ''

If my hunch is correct, the common factor might be multiple podman run commands in quick succession. After a while the server gets tired and cranky, refusing to return any more stdout until it gets a nap.

@edsantiago
Copy link
Member Author

Here's a sort of reproducer:

# while :;do echo -n .;x=$(podman-remote run alpine echo hi);if [ "$x" != "hi" ]; then echo FAILED: "'$x'";fi;done
...can only attach to created or running containers: container state improper
FAILED: ''
...............can only attach to created or running containers: container state improper
FAILED: ''
.can only attach to created or running containers: container state improper
FAILED: ''
......................................can only attach to created or running containers: container state improper
FAILED: ''
............................can only attach to created or running containers: container state improper
FAILED: ''
....can only attach to created or running containers: container state improper
FAILED: ''
.can only attach to created or running containers: container state improper
FAILED: ''
.can only attach to created or running containers: container state improper
FAILED: ''
.can only attach to created or running containers: container state improper
FAILED: ''
.^C

I say "sort of" because I haven't before seen the "container state improper" error message. But the symptom sure is the same: I expected output, got empty string.

This is podman @ master with #7312 applied, running server on the systemd socket. I can test on master itself if necessary.

@edsantiago
Copy link
Member Author

Wow! I was going to add a sleep 1 to the BATS tests, in hopes of minimizing the flake, because it's hitting almost every CI run... but that made it worse, and gave me an even better reproducer:

# while :;do x=$(podman-remote run alpine echo hi);if [ "$x" = "hi" ]; then echo -n .;else echo FAILED: "'$x'";fi;sleep 1;done
can only attach to created or running containers: container state improper
FAILED: ''
read unixpacket @->/var/run/libpod/socket/c5f9b16912deabef1bd471289a4afadeb4b0afe839b8c2670f3103cedc1f6419/attach: read: connection reset by peer
FAILED: ''
..can only attach to created or running containers: container state improper
FAILED: ''
can only attach to created or running containers: container state improper
FAILED: ''
.......can only attach to created or running containers: container state improper
FAILED: ''
.read unixpacket @->/var/run/libpod/socket/564f09a28aa8fe025e162c96d9f53b04fbb155c5133b163eb610a3f0390f7d23/attach: read: connection reset by peer
FAILED: ''
.can only attach to created or running containers: container state improper
FAILED: ''
.can only attach to created or running containers: container state improper
FAILED: ''
read unixpacket @->/var/run/libpod/socket/cbc2d70b3e45cf221f5707ccc38e788c617cbcdf07e71fe6e8341bd5a9745a9d/attach: read: connection reset by peer
FAILED: ''
..can only attach to created or running containers: container state improper
FAILED: ''
......can only attach to created or running containers: container state improper
FAILED: ''

Note that this one easily reproduces the read unixpacket error. HTH.

@edsantiago
Copy link
Member Author

I think this (from #7314) might be another instance of the bug: basically the common factor is "podman-remote, was expecting output, got nothing".

@edsantiago
Copy link
Member Author

cirrus-flake-summarize just reported this:

# # podman-remote --url unix:/tmp/podman.k2CiCr run --rm --pod mypod quay.io/libpod/alpine_labels:latest hostname
# can only attach to created or running containers: container state improper
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #|     FAIL: --hostname set the hostname
# #| expected: '6sjaufqrbk.sxjb7jypwe.net'
# #|   actual: 'can only attach to created or running containers: container state improper'
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I can't reproduce on my laptop. I have a hunch that this is the same problem. What I find most interesting is that the podman command did not fail. run_podman would've barfed before the string check. So we're seeing "container state improper" output (no way to know if it's stdout or stderr), none of the expected output, and exit status zero.

@edsantiago
Copy link
Member Author

Here's a ginkgo flake that appears to be the same underlying problem: all three failures are in the same test (run user capabilities), but each one fails in a different grep:

  1. third "podman run" is empty
  2. first ...
  3. fourth ...

I'm posting this because I've been feeling a little sensitive about it being only the system tests flaking. I think the e2e tests are also seeing this, but their one-two-three-retry thing maybe masks it.

@edsantiago
Copy link
Member Author

There are now three instances of podman run --pod mypod ... flaking with "container state improper", as reported above. The instances are:

80 podman pod create - hashtag AllTheOptions

In all cases, exit status is zero.

@edsantiago
Copy link
Member Author

There's something bothering me about this. In reviewing my flake logs I see various instances of flakes in the ginkgo tests that could be this problem -- but I'm not seeing any evidence of the scope we've seen the last two weeks. It has gotten so it's impossible to pass CI, even with me spending my day pressing the Re-run button. What has changed?

@rhatdan
Copy link
Member

rhatdan commented Aug 27, 2020

I am thinking if @mheon does not fix this soon, we need to disable this remote tests, in order to get the back log of PRs merged.

@edsantiago
Copy link
Member Author

Agreed. I nearly suggested that in Planning yesterday; the commitment to fixing flakes dissuaded me.

@mheon
Copy link
Member

mheon commented Aug 27, 2020 via email

@mheon
Copy link
Member

mheon commented Aug 27, 2020

Re-pushed. The exec exit codes issue should theoretically be fixed.

@mheon
Copy link
Member

mheon commented Aug 27, 2020

If it is, this can finally be closed...

mheon added a commit to mheon/libpod that referenced this issue Aug 27, 2020
Our previous flow was to perform a hijack before passing a
connection into Libpod, and then Libpod would attach to the
container's attach socket and begin forwarding traffic.

A problem emerges: we write the attach header as soon as the
attach complete. As soon as we write the header, the client
assumes that all is ready, and sends a Start request. This Start
may be processed *before* we successfully finish attaching,
causing us to lose output.

The solution is to handle hijacking inside Libpod. Unfortunately,
this requires a downright extensive refactor of the Attach and
HTTP Exec StartAndAttach code. I think the result is an
improvement in some places (a lot more errors will be handled
with a proper HTTP error code, before the hijack occurs) but
other parts, like the relocation of printing container logs, are
just *bad*. Still, we need this fixed now to get CI back into
good shape...

Fixes containers#7195

Signed-off-by: Matthew Heon <[email protected]>
edsantiago added a commit to edsantiago/libpod that referenced this issue Sep 28, 2020
 - pause test: enable when rootless + cgroups v2
   (was previously disabled for all rootless)

 - run --pull: now works with podman-remote
   (in containers#7647, thank you @jwhonce)

 - various other run/volumes tests: try reenabling
   It looks like containers#7195 was fixed (by containers#7451? I'm not
   sure if I'm reading the conversation correctly).
   Anyway, remove all the skip()s on 7195. Only time
   will tell if it's really fixed)

Also:

 - new test for podman image tree --whatrequires
   (because TIL). Doesn't work with podman-remote.

Signed-off-by: Ed Santiago <[email protected]>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. remote Problem is in podman-remote
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants