Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

e2e authenticated push test: multiple failures #18355

Closed
edsantiago opened this issue Apr 26, 2023 · 14 comments
Closed

e2e authenticated push test: multiple failures #18355

edsantiago opened this issue Apr 26, 2023 · 14 comments
Labels
flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@edsantiago
Copy link
Member

Just one instance, but right now I'm filing issues for anything unusual on rawhide:

  authenticated push
...
# podman-remote [options] manifest push --tls-verify=false --creds=userLgX5:nrgYTOskoMZzvyh foo localhost:5524/credstest
Error: adding to manifest list foo:
      initializing source containers-storage:[vfs@/tmp/podman_test3191751000/root+/tmp/podman_test3191751000/runroot:vfs.imagestore=/tmp/imagecachedir]@d4f17859f7e20871a2a5fb4598033eb95614573e4b34e24e663fe93ffed9a369:
        opening "docker://localhost:5524/alpine:latest" as image source:
          Get "https://localhost:5524/v2/alpine/manifests/latest": net/http: TLS handshake timeout

Unfortunately, https://github.com/containers/podman/blob/main/hack/podman-registry-go/registry.go does not log anything. (Maybe a good idea to instrument it?) Here is the journal log, test begins at 19:27:43. I don't see anything odd, but I only skimmed and I don't really know what to look for.

@edsantiago edsantiago added the flakes Flakes from Continuous Integration label Apr 26, 2023
@edsantiago
Copy link
Member Author

Oh, I wonder if this is #18286 ?

@edsantiago
Copy link
Member Author

This ([f38 root remote]](https://api.cirrus-ci.com/v1/artifact/task/5271682227634176/html/int-remote-fedora-38-root-host-boltdb.log.html#t--authenticated-push--1)) is in the same authenticated push test, different symptom, but it's too similar to be unrelated:

[It] authenticated push
...
# podman-remote [options] manifest add --tls-verify=false --creds=userXBmv:MYgefSJpjVqXVFf foo localhost:5378/alpine:latest
Error: adding to manifest list foo: 

That's it. No more output or error, just that. Test fails with exit status 125.

@edsantiago edsantiago changed the title rawhide: e2e push test: TLS handshake timeout e2e push test: TLS handshake timeout May 10, 2023
@edsantiago
Copy link
Member Author

Here's the list so far:

  • fedora-38 : int podman fedora-38 root container sqlite
    • 05-01 13:14 in Podman login and logout [It] podman login and logout with repository with fallback
  • fedora-38 : int podman fedora-38 root host sqlite
  • fedora-38 : int remote fedora-38 root host boltdb [remote]
  • fedora-38 : sys podman fedora-38 rootless host sqlite
  • rawhide : int podman rawhide root host sqlite
  • rawhide : int remote rawhide root host sqlite [remote]
    • 04-25 20:49 in Podman manifest [It] authenticated push

Edited issue title, because this is not rawhide-only (also seeing in f38). podman and podman-remote. root and rootless.

@edsantiago
Copy link
Member Author

Keeps happening, and here's a weird variation (f38 remote root):

  [It] authenticated push
...
# podman-remote [options] manifest push --rm --tls-verify=false --creds=userP3VI:1WvPCVgbsEquhAb foo localhost:5130/rmtest
Getting image list signatures
Copying 1 of 1 images in list
Copying image sha256:aeaa685034b386fe90ac983ff14af965366c8bd3e79cb047fc1a01efb087c054 (1/1)
Getting image source signatures
Copying blob sha256:bb6b46d16e58dbf20dacaeae92aa0d7418ef71a3f80de2b812b3ebf5318b7071
Error: pushing manifest list foo:
       copying image 1/1 from manifest list: trying to reuse blob sha256:bb6b46d16e58dbf20dacaeae92aa0d7418ef71a3f80de2b812b3ebf5318b7071 at destination:
       pinging container registry localhost:5130: StatusCode: 400, Client sent an HTTP request to an HTTPS server.

@edsantiago
Copy link
Member Author

Another weird variation, I'm not even sure if this is the same bug (rawhide rootless):

  [It] push --rm to local directory
...
$ podman [options] manifest add bar quay.io/libpod/testdigest_v2s2
149c9bfb260f9ec1b3e87d4d5225ec920333c505d40694b0e363cb1405673c11
$ podman [options] manifest push --rm bar dir:/tmp/podman_test3533422088/pushed
Getting image list signatures
Copying 1 of 1 images in list
Copying image sha256:755f4d90b3716e2bf57060d249e2cd61c9ac089b1233465c5c2cb2d7ee550fdb (1/1)
Getting image source signatures
Copying blob sha256:51125d395dda71b3b65ef7be50fceb192f9a9d649ccca8ee85e5162e43c9ca2c
Copying blob sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
Error: copying image 1/1 from manifest list: reading blob sha256:51125d395dda71b3b65ef7be50fceb192f9a9d649ccca8ee85e5162e43c9ca2c:
        Get "https://cdn03.quay.io/sha256/51/51125d395dda71b3b65ef7be50fceb192f9a9d649ccca8ee85e5162e43c9ca2c?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAI5LUAQGPZRPNKSJA%2F20230510%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230510T222316Z&X-Amz-Expires=600&X-Amz-SignedHeaders=host&X-Amz-Signature=1c80e416db530a9fe483d55b2ee30ec29c78cff4b19a33668a174da842b24376&cf_sign=kidZ9H3WDIPZZBgzaLjgkRsswCGR6HTH80CQHbo%2BsSJIo9Xkkqlg8%2Br08b35XH%2BmY7xRFuswrR5Qhl0G8I1%2B8zw75bcncI%2BsXEJzMEyDXMTke%2FY2lY9MSNHb3Il1mLgsn58vaBjvhFO4NXDUcU1SCV42aHcAfu3M0Av9X%2FdwlOzLXBMCkpIFLNzUZnutdLzB83t5xqK3j7h9tj18W7%2BEce4fID6jd56fQ6eUx9DPXuEWMjJ69lI7kQLv5LyJg%2BqEh2SZmIuJYinin1LgE1obAsJO%2FxvcCiEHRaAXjCJe%2F79hHSx0a7zQkTEMzS4HGQX57okF3DbUYjoKfYS8ui8%2FIA%3D%3D&cf_expiry=1683757996&region=us-east-1":
        remote error: tls: handshake failure

Here the tls handshake is against quay.io (and, sigh, our nemesis cdn03), not localhost. I'm inclined to flag this one as a quay flake, but will leave it uncategorized for now.

@edsantiago
Copy link
Member Author

Another one of the http/https flakes, f38 root:

# podman-remote [options] manifest push --tls-verify=false --creds=userupXh:Y9RaC0hKWXUbwnC foo localhost:5175/credstest
Error: pushing manifest list foo:
      initializing source containers-storage:[vfs@/tmp/podman_test563436422/root+/tmp/podman_test563436422/runroot:vfs.imagestore=/tmp/imagecachedir]@0fe2ee05c30eea9a5a0b93fb61c8d26b541614aaf0974ddf6a262c0cadb61807:
      opening "docker://localhost:5175/alpine:latest" as image source:
      pinging container registry localhost:5175:
      StatusCode: 400, Client sent an HTTP request to an HTTPS server.

Here's the journal log from the registry container:

May 17 21:36:04 cirrus-task-6710381951844352 registry[137050]: time="2023-05-18T02:36:04.99156017Z" level=info msg="listening on [::]:5000, tls" go.version=go1.16.15 instance.id=2877123b-e3e0-48b4-ae31-807ce949f455 service=registry version="v2.8.1+unknown" 
...
May 17 21:36:07 cirrus-task-6710381951844352 registry[137050]: time="2023-05-18T02:36:07.216505665Z" level=warning msg="error authorizing context: basic authentication challenge for realm "Registry Realm": invalid authorization credential" go.version=go1.16.15 http.request.host="localhost:5175" http.request.id=8e4d184e-f8c2-40f4-a7e6-b6a7887ceec8 http.request.method=GET http.request.remoteaddr="10.88.0.1:47988" http.request.uri="/v2/" http.request.useragent="containers/5.25.1-dev (github.com/containers/image)" 
May 17 21:36:07 cirrus-task-6710381951844352 romantic_ritchie[137295]:     1     0 root     R     1560   0%   0   0% top
May 17 21:36:07 cirrus-task-6710381951844352 systemd[1]: libpod-02d7e5bfb79c85d4974ca3bb61244e0b8307af296d055f25228c5c34c9a3c2d6.scope: Deactivated successfully.
May 17 21:36:07 cirrus-task-6710381951844352 registry[137050]: time="2023-05-18T02:36:07.256297762Z" level=info msg="authorized request" go.version=go1.16.15 http.request.host="localhost:5175" http.request.id=4677232b-8191-4faa-997b-681ea2dc6f83 http.request.method=HEAD http.request.remoteaddr="10.88.0.1:47990" http.request.uri="/v2/alpine/blobs/sha256:03901b4a2ea88eeaad62dbe59b072b28b6efa00491962b8741081c5df50c65e0" http.request.useragent="containers/5.25.1-dev (github.com/containers/image)" vars.digest="sha256:03901b4a2ea88eeaad62dbe59b072b28b6efa00491962b8741081c5df50c65e0" vars.name=alpine 
May 17 21:36:07 cirrus-task-6710381951844352 registry[137050]: time="2023-05-18T02:36:07.256997024Z" level=error msg="response completed with error" auth.user.name=userupXh err.code="blob unknown" err.detail=sha256:03901b4a2ea88eeaad62dbe59b072b28b6efa00491962b8741081c5df50c65e0 err.message="blob unknown to registry" go.version=go1.16.15 http.request.host="localhost:5175" http.request.id=4677232b-8191-4faa-997b-681ea2dc6f83 http.request.method=HEAD http.request.remoteaddr="10.88.0.1:47990" http.request.uri="/v2/alpine/blobs/sha256:03901b4a2ea88eeaad62dbe59b072b28b6efa00491962b8741081c5df50c65e0" http.request.useragent="containers/5.25.1-dev (github.com/containers/image)" http.response.contenttype="application/json; charset=utf-8" http.response.duration=6.164763ms http.response.status=404 http.response.written=157 vars.digest="sha256:03901b4a2ea88eeaad62dbe59b072b28b6efa00491962b8741081c5df50c65e0" vars.name=alpine 
May 17 21:36:07 cirrus-task-6710381951844352 registry[137050]: 10.88.0.1 - - [18/May/2023:02:36:07 +0000] "HEAD /v2/alpine/blobs/sha256:03901b4a2ea88eeaad62dbe59b072b28b6efa00491962b8741081c5df50c65e0 HTTP/1.1" 404 157 "" "containers/5.25.1-dev (github.com/containers/image)"

Last few days' worth:

  • fedora-38 : int remote fedora-38 root host boltdb [remote]
  • fedora-38 : int remote fedora-38 root host sqlite [remote]

@edsantiago edsantiago changed the title e2e push test: TLS handshake timeout e2e authenticated push test: multiple failures Jun 5, 2023
@edsantiago
Copy link
Member Author

Still seeing "TLS handshake timeout", but the most common failure I'm seeing these days is:


Error: pushing manifest list foo: copying image 1/1 from manifest list: \
    trying to reuse blob sha256:xxx at destination: \
    pinging container registry localhost:5607: \
    StatusCode: 400, Client sent an HTTP request to an HTTPS server.

edsantiago added a commit to edsantiago/libpod that referenced this issue Jun 7, 2023
First: fix podman-registry script so it preserves the initial $PODMAN,
so all subsequent invocations of ps, logs, and stop will use the
same binary and arguments. Until now we've handled this by requiring
that our caller manage $PODMAN (and keep it the same), but that's
just wrong.

Next, simplify the golang interface: move the $PODMAN setting into
registry.go, instead of requiring e2e callers to set it. (This
could use some work: the local/remote conditional is icky).

IMPORTANT: To prevent registry.go from using the wrong podman binary,
the Start() call is gone. Only StartWithOptions() is valid now.

And, minor cleanup: comments, and add an actual error-message check

Reason for this PR is a recurring flake, containers#18355, whose multiple
failure modes I truly can't understand. I don't think this PR
is going to fix it, but this is still necessary work.

Signed-off-by: Ed Santiago <[email protected]>
@edsantiago
Copy link
Member Author

This is a nasty flake, triggering often in my no-retries PR. There are different failure modes, but my conclusion is that this is a test conflict. Some other test is clobbering the network while this test is running, and that makes the test fail. The common sequence is:

  1. registry gets started
  2. one or more tests pass
  3. journal shows hundreds and hundreds of "kernel podman0 veth NETFILTER port blah blah" lines
  4. the next test fails. Journal shows "http: TLS handshake error from 10.88.0.1:SRCPORT: EOF"

For example, this test log. Look at the associated journal. Search in-page for 176867. You'll see lots of matches, then a bunch of network-blah-blah lines, keep scrolling down, then the TLS handshake error. Nothing more.

I haven't looked at every single test failure... just, enough of them to convince me that this is the cause of the flake.

Obvious fix is to make this test Serial, but I don't like that: I would much rather understand what is going on with the network, why it's getting destroyed, who is destroying it, and make that test Serial. Who can identify the culprit?

@github-actions
Copy link

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Copy link
Member Author

Still happening; reduced rate is because I'm not running #17831 as frequently. This is almost certainly happening in real CI, but passing on the ginkgo retry.

I spent a good part of a day a few weeks ago trying to get a reproducer. No luck. This one is tricky.

  • fedora-38 : int remote fedora-38 root host boltdb [remote]
  • fedora-38 : int remote fedora-38 root host sqlite [remote]
  • rawhide : int remote rawhide root host sqlite [remote]

@vrothberg
Copy link
Member

vrothberg commented Aug 14, 2023

@edsantiago shall we try updating quay.io/libpod/registry:2.8 to 2.8.2? This could be a server-side fart.

@edsantiago
Copy link
Member Author

Why not? I'll try that in #17831 for a few weeks, then report back. Thanks for the suggestion.

edsantiago added a commit to edsantiago/libpod that referenced this issue Aug 24, 2023
...in hopes of addressing flake containers#18355

Signed-off-by: Ed Santiago <[email protected]>
@edsantiago
Copy link
Member Author

Have not seen this since bumping registry from 2.8 to 2.8.2. I say this in the expectation that the gods will punish me by making it trigger again. But the gods, realizing this is my expectation, will leave me hanging instead, so I can clearly not choose the wine in front of me.

@vrothberg
Copy link
Member

Oh nice!

I happily take the blame and declare this flake fixed 😈

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Dec 19, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 19, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

No branches or pull requests

2 participants