Podman-remote run should wait for exit code #3934

rhatdan · 2019-09-04T14:01:48Z

This change matches what is happening on the podman local side
and should eliminate a race condition.

Signed-off-by: Daniel J Walsh [email protected]

rhatdan · 2019-09-04T14:02:03Z

Hopefully fixes: #3870

rhatdan · 2019-09-04T14:02:20Z

@mheon @baude @edsantiago PTAL

vrothberg

LGTM, any chance we can test that?

mheon · 2019-09-04T14:07:07Z

Code LGTM

openshift-ci-robot · 2019-09-04T14:07:27Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rhatdan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [rhatdan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

rhatdan · 2019-09-04T14:11:07Z

@edsantiago has a tests that was revealing this error.

edsantiago · 2019-09-04T14:15:18Z

@cevich can you try running this in your setup? (I've still never actually seen the problem)

@cevich can you also look at CI failures?

ERROR: (gcloud) The project property must be set to a valid project ID, not the project name

cevich · 2019-09-04T14:49:55Z

LGTM, any chance we can test that?

Testing negatives/error-cases of a race is very difficult to do reliably in automation. I can manually run the change through the reproducer I used before, however the results cannot be conclusive - since code-change nearly always causes a timing change 😕

cevich · 2019-09-04T14:54:35Z

can you also look at CI failures?

@edsantiago that problem is not happening on master. Cirrus-CI (internally) us suppose to decrypt those [ENCRYPTED[... strings, and pass the result as an env. var. value. Since the value is making it through like that, and there are no changes to the value in this PR, this must be a problem on the Cirrus-CI end of things.

If re-running doesn't clear up the problem, then I'll inform their support.

edsantiago · 2019-09-04T15:01:21Z

Cirrus-CI (internally) us suppose to decrypt those [ENCRYPTED[... strings,

I think that's a red herring. My suspicion is that the problem is elsewhere, but Cirrus is showing ENCRYPTED because, um, it probably shouldn't show the real value in a public log

cevich · 2019-09-04T17:32:28Z

ENCRYPTED because, um, it probably shouldn't show the real value in a public log

It mangles the output with SECRET for encrypted values.

cevich · 2019-09-04T17:36:03Z

@rhatdan There's some problem here, when I run Ed's reproducer using code from this PR on Fedora 30:

...cut...
STEP 4: COMMIT build_test
549471035c7a6927e8dc2c52e2a87456327bc0d1b561869c90ea5d0c1bd79289
+ pr run --rm build_test cat /myfile
+ podman-remote run --rm build_test cat /myfile
hi
Error: Virtual Read failed, 0: EOF
[root@cevich-fedora-29-libpod-5420341990522880 libpod]#

On Fedora 30, it simply hangs after printing 'hi' and never returns (I waited 5-minutes). I'll look at the Cirrus-CI logs next...

cevich · 2019-09-04T17:42:15Z

...ya, similar story in the integration-tests: Timed out after 90.000s. across the board, for all remote-client tests except cgroupv2.

cevich · 2019-09-04T17:47:57Z

Ref: VM Images I'm using are:

fedora-30-libpod-5420341990522880
fedora-29-libpod-5420341990522880
(the same that were failing in #3870) using reproducer that Ed wrote:

#!/bin/bash

set -xe

tmpdir=$(mktemp -d --tmpdir podman-remote-test.XXXXXX)
cat >$tmpdir/Dockerfile <<EOF
FROM quay.io/libpod/alpine_labels:latest
RUN apk add nginx
RUN echo hi >/myfile
EOF

pr() { podman-remote "$@"; }

while :;do
    pr build -t build_test --format=docker $tmpdir
    pr run --rm build_test cat /myfile
    pr rmi -f build_test
done

like this from the repository root:

make podman podman-remote
make install PREFIX=/usr
systemctl enable io.podman.socket
systemctl enable io.podman.service
systemctl restart io.podman.socket
systemctl restart io.podman.service
chmod +x repro.sh
./repro.sh
...cut...

pkg/adapter/containers_remote.go

rhatdan · 2019-09-05T16:12:20Z

@cevich @edsantiago Total rewrite of original patch, turns out the error, I believe, was on the server side. We were not waiting for a full exit, so I think the client side could exit before the container was cleaned up.

While in this code, I figured out why container exit codes were not being propagated.

rhatdan · 2019-09-05T16:17:54Z

@baude @mheon @giuseppe @vrothberg @TomSweeneyRedHat PTAL

TomSweeneyRedHat · 2019-09-05T16:25:34Z

LGTM assuming happy tests.

mheon · 2019-09-05T16:28:58Z

LGTM but I'd like a nod from @baude pre-merge

mheon · 2019-09-05T17:47:32Z

System tests are failing for most remote client jobs. CGroups v2 remote succeeded, though?

cevich · 2019-09-10T18:55:14Z

The major difference between reproducing above, and previously is the dnf update -y && reboot I ran. I'll get a fresh VM and try again, once Matt's done looking at this one.

cevich · 2019-09-11T14:29:47Z

Rebased onto #3985 just to be safe, spun up a VM w/o updating it's packages. Running the reproducer above I'm still getting Error: Virtual Read failed, 0: EOF when the script does podman-remote run --rm build_test cat /myfile.

Perhaps interestingly, this is also some kind of race, because when I change to podman-remote run -it --rm build_test cat /myfile it sometimes works fine for 3-5 iterations before bombing. I also tried rebasing against master (093013b) but the behavior remains 😖

...cut...
STEP 3: RUN echo hi >/myfile
STEP 4: COMMIT build_test
6e2776951ef699a20d1fa766f2de116b88bada4d1dd44eafd6e682c4db83c73f
+ pr run -it --rm build_test cat /myfile
+ podman-remote run -it --rm build_test cat /myfile
hi
Error: Virtual Read failed, 0: EOF

cevich · 2019-09-11T14:39:06Z

@baude Matt took a look at a VM behaving this way, and concluded:

i have no clue what's going on there. early varlink hangup probably

would you mind giving it a go (rebase this PR against master to be safe) when you have a chance?

(note: current CI results here are not helpful until rhatdan rebases the PR)

cevich · 2019-09-11T20:54:32Z

Okay, the EOF issue is partially addressed with #3998 and I was successfully able to confirm the fix in this PR does appear to fix the #3870 problem. So let's get this in.

LGTM

We have leaked the exit number codess all over the code, this patch removes the numbers to constants. Signed-off-by: Daniel J Walsh <[email protected]>

This change matches what is happening on the podman local side and should eliminate a race condition. Also exit commands on the server side should start to return to client. Signed-off-by: Daniel J Walsh <[email protected]>

cevich · 2019-09-13T13:33:47Z

This is a new error in a start test I haven't seen before though it's Ubuntu, so maybe a flake?

cevich · 2019-09-13T14:29:39Z

...digging a bit, the unable to find varlink pid message seems harmless, as it's part of the test cleaning up (test/e2e/libpod_suite_remoteclient_test.go:96). The ID of the test container is in fact 8f7a7fb5aa4f, so the podman ps -aq is def. seeing the container after the expected non-zero podman-remote exit from test/e2e/start_test.go:110. I'm betting this is a simple race, where the remote has not yet gotten to do it's container removal yet.

@mheon If the above is true, this race is in many places in our integration tests. There are even some which assert the presence of remaining containers, which could race with an erroneous removal to produce false-positive results.

Is there a way to detect whether or not the removal process has completed or not for a given container ID (like some file that should not exist or something like that)? This would be the ideal solution in both (positive and negative) test scenarios, to prevent racing on an existence-check. For example, test/utils/utils.go could delay until completion before calling ps := p.PodmanBase([]string{"ps", "-aq"}, true, false).

cevich · 2019-09-13T14:31:31Z

uggg, sorry, I shouldn't tie up Dan's PR with more problems, let me open an issue for that race...

cevich · 2019-09-13T14:55:07Z

...opened #4021

cevich · 2019-09-13T14:55:37Z

(re-ran flaked test)

mheon · 2019-09-13T15:07:57Z

After this PR, we should have a guarantee that the container is gone when the original Podman process exits.

mheon · 2019-09-13T15:10:45Z

Nevermind. We don't have an explicit remove in https://github.com/containers/libpod/blob/82ac0d8925dbb5aa738f1494ecb002eb6daca992/pkg/adapter/containers_remote.go#L462

We probably need to add one

rhatdan · 2019-09-13T15:43:58Z

@mheon the remove happens on the server side in the adapter code.
https://github.com/containers/libpod/pull/3934/files#diff-d6a6befd6e8af99104d67db5ea6786faR106

rhatdan · 2019-09-13T15:44:34Z

I think this is ready to merge.

@mheon @giuseppe @vrothberg @TomSweeneyRedHat @cevich @edsantiago PTAL

mheon · 2019-09-13T15:53:29Z

Are we guaranteed that the client doesn't exit until the server is done removing, though?
LGTM regardless

cevich · 2019-09-13T16:01:15Z

/lgtm

cevich · 2019-09-13T16:02:10Z

Are we guaranteed that the client doesn't exit until the server is done removing, though?

Possibly not, that's why I opened #4021

rhatdan · 2019-09-13T16:17:22Z

I believe that we are guaranteed for the removal to happen before the front end gets signaled.

openshift-ci-robot added the size/S label Sep 4, 2019

vrothberg reviewed Sep 4, 2019

View reviewed changes

openshift-ci-robot requested review from mrunalp and vrothberg September 4, 2019 14:05

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 4, 2019

cevich mentioned this pull request Sep 4, 2019

Add support for testing with the latest Ubuntu release #3754

Merged

cevich reviewed Sep 5, 2019

View reviewed changes

pkg/adapter/containers_remote.go Outdated Show resolved Hide resolved

cevich reviewed Sep 5, 2019

View reviewed changes

pkg/adapter/containers_remote.go Outdated Show resolved Hide resolved

rhatdan force-pushed the wait branch from 5816230 to 159f4fb Compare September 5, 2019 16:03

openshift-ci-robot added size/M and removed size/S labels Sep 5, 2019

rhatdan force-pushed the wait branch from 159f4fb to 17ba7e7 Compare September 5, 2019 16:15

cevich mentioned this pull request Sep 11, 2019

Prevent podman varlink socket fight #3998

Merged

rhatdan added 2 commits September 12, 2019 16:20

Use exit code constants

535111b

We have leaked the exit number codess all over the code, this patch removes the numbers to constants. Signed-off-by: Daniel J Walsh <[email protected]>

Podman-remote run should wait for exit code

82ac0d8

This change matches what is happening on the podman local side and should eliminate a race condition. Also exit commands on the server side should start to return to client. Signed-off-by: Daniel J Walsh <[email protected]>

rhatdan force-pushed the wait branch from 26761f7 to 82ac0d8 Compare September 12, 2019 20:20

cevich mentioned this pull request Sep 13, 2019

Suspected podman-remote container-removal completion race in integration tests. #4021

Closed

openshift-ci-robot assigned cevich Sep 13, 2019

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 13, 2019

openshift-merge-robot merged commit 7875e00 into containers:master Sep 13, 2019

mheon mentioned this pull request Sep 13, 2019

podman-remote rmi: device or resource busy #3870

Closed

edsantiago mentioned this pull request Sep 16, 2019

podman-remote run alpine true -> exit status 126 on f29 #4044

Closed

edsantiago mentioned this pull request Nov 4, 2019

podman-remote: incorrect exit status of 'run --rm' #3808

Closed

github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 26, 2023

github-actions bot locked as resolved and limited conversation to collaborators Sep 26, 2023

Podman-remote run should wait for exit code #3934

Podman-remote run should wait for exit code #3934

Conversation

rhatdan commented Sep 4, 2019

rhatdan commented Sep 4, 2019

rhatdan commented Sep 4, 2019

vrothberg left a comment

Choose a reason for hiding this comment

mheon commented Sep 4, 2019

openshift-ci-robot commented Sep 4, 2019

rhatdan commented Sep 4, 2019

edsantiago commented Sep 4, 2019

cevich commented Sep 4, 2019

cevich commented Sep 4, 2019

edsantiago commented Sep 4, 2019

cevich commented Sep 4, 2019

cevich commented Sep 4, 2019

cevich commented Sep 4, 2019

cevich commented Sep 4, 2019 • edited Loading

rhatdan commented Sep 5, 2019

rhatdan commented Sep 5, 2019

TomSweeneyRedHat commented Sep 5, 2019

mheon commented Sep 5, 2019

mheon commented Sep 5, 2019

cevich commented Sep 10, 2019

cevich commented Sep 11, 2019 • edited Loading

cevich commented Sep 11, 2019

cevich commented Sep 11, 2019

cevich commented Sep 13, 2019

cevich commented Sep 13, 2019

cevich commented Sep 13, 2019

cevich commented Sep 13, 2019

cevich commented Sep 13, 2019

mheon commented Sep 13, 2019

mheon commented Sep 13, 2019

rhatdan commented Sep 13, 2019

rhatdan commented Sep 13, 2019

mheon commented Sep 13, 2019

cevich commented Sep 13, 2019

cevich commented Sep 13, 2019

rhatdan commented Sep 13, 2019

cevich commented Sep 4, 2019 •

edited

Loading

cevich commented Sep 11, 2019 •

edited

Loading