e2e authenticated push test: multiple failures #18355

edsantiago · 2023-04-26T13:24:04Z

Just one instance, but right now I'm filing issues for anything unusual on rawhide:

  authenticated push
...
# podman-remote [options] manifest push --tls-verify=false --creds=userLgX5:nrgYTOskoMZzvyh foo localhost:5524/credstest
Error: adding to manifest list foo:
      initializing source containers-storage:[vfs@/tmp/podman_test3191751000/root+/tmp/podman_test3191751000/runroot:vfs.imagestore=/tmp/imagecachedir]@d4f17859f7e20871a2a5fb4598033eb95614573e4b34e24e663fe93ffed9a369:
        opening "docker://localhost:5524/alpine:latest" as image source:
          Get "https://localhost:5524/v2/alpine/manifests/latest": net/http: TLS handshake timeout

Unfortunately, https://github.com/containers/podman/blob/main/hack/podman-registry-go/registry.go does not log anything. (Maybe a good idea to instrument it?) Here is the journal log, test begins at 19:27:43. I don't see anything odd, but I only skimmed and I don't really know what to look for.

The text was updated successfully, but these errors were encountered:

edsantiago · 2023-04-26T14:54:25Z

Oh, I wonder if this is #18286 ?

edsantiago · 2023-05-10T11:57:14Z

This ([f38 root remote]](https://api.cirrus-ci.com/v1/artifact/task/5271682227634176/html/int-remote-fedora-38-root-host-boltdb.log.html#t--authenticated-push--1)) is in the same authenticated push test, different symptom, but it's too similar to be unrelated:

[It] authenticated push
...
# podman-remote [options] manifest add --tls-verify=false --creds=userXBmv:MYgefSJpjVqXVFf foo localhost:5378/alpine:latest
Error: adding to manifest list foo:

That's it. No more output or error, just that. Test fails with exit status 125.

edsantiago · 2023-05-10T12:01:02Z

Here's the list so far:

fedora-38 : int podman fedora-38 root container sqlite
- 05-01 13:14 in Podman login and logout [It] podman login and logout with repository with fallback
fedora-38 : int podman fedora-38 root host sqlite
- 05-05 08:47 in authenticated push
fedora-38 : int remote fedora-38 root host boltdb [remote]
- 05-07 17:10 in authenticated push
fedora-38 : sys podman fedora-38 rootless host sqlite
- PR fix(deps): update github.com/digitalocean/go-qemu digest to f035778 #18393
  - 05-03 06:12 in [sys] podman auto-update with multiple services
rawhide : int podman rawhide root host sqlite
- 05-10 07:17 in authenticated push
rawhide : int remote rawhide root host sqlite [remote]
- 04-25 20:49 in Podman manifest [It] authenticated push

Edited issue title, because this is not rawhide-only (also seeing in f38). podman and podman-remote. root and rootless.

edsantiago · 2023-05-11T12:05:54Z

Keeps happening, and here's a weird variation (f38 remote root):

  [It] authenticated push
...
# podman-remote [options] manifest push --rm --tls-verify=false --creds=userP3VI:1WvPCVgbsEquhAb foo localhost:5130/rmtest
Getting image list signatures
Copying 1 of 1 images in list
Copying image sha256:aeaa685034b386fe90ac983ff14af965366c8bd3e79cb047fc1a01efb087c054 (1/1)
Getting image source signatures
Copying blob sha256:bb6b46d16e58dbf20dacaeae92aa0d7418ef71a3f80de2b812b3ebf5318b7071
Error: pushing manifest list foo:
       copying image 1/1 from manifest list: trying to reuse blob sha256:bb6b46d16e58dbf20dacaeae92aa0d7418ef71a3f80de2b812b3ebf5318b7071 at destination:
       pinging container registry localhost:5130: StatusCode: 400, Client sent an HTTP request to an HTTPS server.

edsantiago · 2023-05-11T12:38:39Z

Another weird variation, I'm not even sure if this is the same bug (rawhide rootless):

  [It] push --rm to local directory
...
$ podman [options] manifest add bar quay.io/libpod/testdigest_v2s2
149c9bfb260f9ec1b3e87d4d5225ec920333c505d40694b0e363cb1405673c11
$ podman [options] manifest push --rm bar dir:/tmp/podman_test3533422088/pushed
Getting image list signatures
Copying 1 of 1 images in list
Copying image sha256:755f4d90b3716e2bf57060d249e2cd61c9ac089b1233465c5c2cb2d7ee550fdb (1/1)
Getting image source signatures
Copying blob sha256:51125d395dda71b3b65ef7be50fceb192f9a9d649ccca8ee85e5162e43c9ca2c
Copying blob sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
Error: copying image 1/1 from manifest list: reading blob sha256:51125d395dda71b3b65ef7be50fceb192f9a9d649ccca8ee85e5162e43c9ca2c:
        Get "https://cdn03.quay.io/sha256/51/51125d395dda71b3b65ef7be50fceb192f9a9d649ccca8ee85e5162e43c9ca2c?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAI5LUAQGPZRPNKSJA%2F20230510%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230510T222316Z&X-Amz-Expires=600&X-Amz-SignedHeaders=host&X-Amz-Signature=1c80e416db530a9fe483d55b2ee30ec29c78cff4b19a33668a174da842b24376&cf_sign=kidZ9H3WDIPZZBgzaLjgkRsswCGR6HTH80CQHbo%2BsSJIo9Xkkqlg8%2Br08b35XH%2BmY7xRFuswrR5Qhl0G8I1%2B8zw75bcncI%2BsXEJzMEyDXMTke%2FY2lY9MSNHb3Il1mLgsn58vaBjvhFO4NXDUcU1SCV42aHcAfu3M0Av9X%2FdwlOzLXBMCkpIFLNzUZnutdLzB83t5xqK3j7h9tj18W7%2BEce4fID6jd56fQ6eUx9DPXuEWMjJ69lI7kQLv5LyJg%2BqEh2SZmIuJYinin1LgE1obAsJO%2FxvcCiEHRaAXjCJe%2F79hHSx0a7zQkTEMzS4HGQX57okF3DbUYjoKfYS8ui8%2FIA%3D%3D&cf_expiry=1683757996&region=us-east-1":
        remote error: tls: handshake failure

Here the tls handshake is against quay.io (and, sigh, our nemesis cdn03), not localhost. I'm inclined to flag this one as a quay flake, but will leave it uncategorized for now.

edsantiago · 2023-05-18T20:34:42Z

Another one of the http/https flakes, f38 root:

# podman-remote [options] manifest push --tls-verify=false --creds=userupXh:Y9RaC0hKWXUbwnC foo localhost:5175/credstest
Error: pushing manifest list foo:
      initializing source containers-storage:[vfs@/tmp/podman_test563436422/root+/tmp/podman_test563436422/runroot:vfs.imagestore=/tmp/imagecachedir]@0fe2ee05c30eea9a5a0b93fb61c8d26b541614aaf0974ddf6a262c0cadb61807:
      opening "docker://localhost:5175/alpine:latest" as image source:
      pinging container registry localhost:5175:
      StatusCode: 400, Client sent an HTTP request to an HTTPS server.

Here's the journal log from the registry container:

May 17 21:36:04 cirrus-task-6710381951844352 registry[137050]: time="2023-05-18T02:36:04.99156017Z" level=info msg="listening on [::]:5000, tls" go.version=go1.16.15 instance.id=2877123b-e3e0-48b4-ae31-807ce949f455 service=registry version="v2.8.1+unknown" 
...
May 17 21:36:07 cirrus-task-6710381951844352 registry[137050]: time="2023-05-18T02:36:07.216505665Z" level=warning msg="error authorizing context: basic authentication challenge for realm "Registry Realm": invalid authorization credential" go.version=go1.16.15 http.request.host="localhost:5175" http.request.id=8e4d184e-f8c2-40f4-a7e6-b6a7887ceec8 http.request.method=GET http.request.remoteaddr="10.88.0.1:47988" http.request.uri="/v2/" http.request.useragent="containers/5.25.1-dev (github.com/containers/image)" 
May 17 21:36:07 cirrus-task-6710381951844352 romantic_ritchie[137295]:     1     0 root     R     1560   0%   0   0% top
May 17 21:36:07 cirrus-task-6710381951844352 systemd[1]: libpod-02d7e5bfb79c85d4974ca3bb61244e0b8307af296d055f25228c5c34c9a3c2d6.scope: Deactivated successfully.
May 17 21:36:07 cirrus-task-6710381951844352 registry[137050]: time="2023-05-18T02:36:07.256297762Z" level=info msg="authorized request" go.version=go1.16.15 http.request.host="localhost:5175" http.request.id=4677232b-8191-4faa-997b-681ea2dc6f83 http.request.method=HEAD http.request.remoteaddr="10.88.0.1:47990" http.request.uri="/v2/alpine/blobs/sha256:03901b4a2ea88eeaad62dbe59b072b28b6efa00491962b8741081c5df50c65e0" http.request.useragent="containers/5.25.1-dev (github.com/containers/image)" vars.digest="sha256:03901b4a2ea88eeaad62dbe59b072b28b6efa00491962b8741081c5df50c65e0" vars.name=alpine 
May 17 21:36:07 cirrus-task-6710381951844352 registry[137050]: time="2023-05-18T02:36:07.256997024Z" level=error msg="response completed with error" auth.user.name=userupXh err.code="blob unknown" err.detail=sha256:03901b4a2ea88eeaad62dbe59b072b28b6efa00491962b8741081c5df50c65e0 err.message="blob unknown to registry" go.version=go1.16.15 http.request.host="localhost:5175" http.request.id=4677232b-8191-4faa-997b-681ea2dc6f83 http.request.method=HEAD http.request.remoteaddr="10.88.0.1:47990" http.request.uri="/v2/alpine/blobs/sha256:03901b4a2ea88eeaad62dbe59b072b28b6efa00491962b8741081c5df50c65e0" http.request.useragent="containers/5.25.1-dev (github.com/containers/image)" http.response.contenttype="application/json; charset=utf-8" http.response.duration=6.164763ms http.response.status=404 http.response.written=157 vars.digest="sha256:03901b4a2ea88eeaad62dbe59b072b28b6efa00491962b8741081c5df50c65e0" vars.name=alpine 
May 17 21:36:07 cirrus-task-6710381951844352 registry[137050]: 10.88.0.1 - - [18/May/2023:02:36:07 +0000] "HEAD /v2/alpine/blobs/sha256:03901b4a2ea88eeaad62dbe59b072b28b6efa00491962b8741081c5df50c65e0 HTTP/1.1" 404 157 "" "containers/5.25.1-dev (github.com/containers/image)"

Last few days' worth:

fedora-38 : int remote fedora-38 root host boltdb [remote]
- 05-17 22:50 in authenticated push
fedora-38 : int remote fedora-38 root host sqlite [remote]
- 05-16 14:30 in authenticated push

edsantiago · 2023-06-05T16:23:47Z

Still seeing "TLS handshake timeout", but the most common failure I'm seeing these days is:


Error: pushing manifest list foo: copying image 1/1 from manifest list: \
    trying to reuse blob sha256:xxx at destination: \
    pinging container registry localhost:5607: \
    StatusCode: 400, Client sent an HTTP request to an HTTPS server.

fedora-38 : int podman fedora-38 root host sqlite
- 06-01 12:13 in authenticated push
- 05-22 11:05 in authenticated push
- 05-22 09:48 in authenticated push
- 05-19 21:12 in authenticated push
fedora-38 : int remote fedora-38 root host boltdb [remote]
- 06-03 07:43 in authenticated push
- 05-21 08:42 in authenticated push
- 05-17 22:50 in authenticated push
fedora-38 : int remote fedora-38 root host sqlite [remote]
- 06-02 15:57 in authenticated push
- 06-01 19:10 in authenticated push
- 05-19 16:21 in authenticated push
- 05-16 14:30 in authenticated push
rawhide : int remote rawhide root host sqlite [remote]
- 06-01 22:11 in authenticated push
- 05-25 17:21 in authenticated push
- 05-19 18:55 in authenticated push

First: fix podman-registry script so it preserves the initial $PODMAN, so all subsequent invocations of ps, logs, and stop will use the same binary and arguments. Until now we've handled this by requiring that our caller manage $PODMAN (and keep it the same), but that's just wrong. Next, simplify the golang interface: move the $PODMAN setting into registry.go, instead of requiring e2e callers to set it. (This could use some work: the local/remote conditional is icky). IMPORTANT: To prevent registry.go from using the wrong podman binary, the Start() call is gone. Only StartWithOptions() is valid now. And, minor cleanup: comments, and add an actual error-message check Reason for this PR is a recurring flake, containers#18355, whose multiple failure modes I truly can't understand. I don't think this PR is going to fix it, but this is still necessary work. Signed-off-by: Ed Santiago <[email protected]>

edsantiago · 2023-06-26T17:03:35Z

This is a nasty flake, triggering often in my no-retries PR. There are different failure modes, but my conclusion is that this is a test conflict. Some other test is clobbering the network while this test is running, and that makes the test fail. The common sequence is:

registry gets started
one or more tests pass
journal shows hundreds and hundreds of "kernel podman0 veth NETFILTER port blah blah" lines
the next test fails. Journal shows "http: TLS handshake error from 10.88.0.1:SRCPORT: EOF"

For example, this test log. Look at the associated journal. Search in-page for 176867. You'll see lots of matches, then a bunch of network-blah-blah lines, keep scrolling down, then the TLS handshake error. Nothing more.

I haven't looked at every single test failure... just, enough of them to convince me that this is the cause of the flake.

Obvious fix is to make this test Serial, but I don't like that: I would much rather understand what is going on with the network, why it's getting destroyed, who is destroying it, and make that test Serial. Who can identify the culprit?

fedora-38 : int podman fedora-38 root host boltdb
- PR Ed's pet PR with no flake retries #17831
  - 06-08 17:27 in Podman manifest authenticated push
fedora-38 : int podman fedora-38 root host sqlite
- PR Ed's pet PR with no flake retries #17831
  - 06-20 13:28 in Podman manifest authenticated push
  - 06-12 07:58 in Podman manifest authenticated push
  - 06-01 12:13 in authenticated push
  - 05-22-2023 11:05 in authenticated push
  - 05-22-2023 09:48 in authenticated push
  - 05-19-2023 21:12 in authenticated push
fedora-38 : int remote fedora-38 root host boltdb [remote]
- PR Ed's pet PR with no flake retries #17831
  - 06-26 08:38 in Podman manifest authenticated push
  - 06-03 07:43 in authenticated push
  - 05-21-2023 08:42 in authenticated push
  - 05-17-2023 22:50 in authenticated push
fedora-38 : int remote fedora-38 root host sqlite [remote]
- PR pasta: use code from c/common #18900
  - 06-15 11:25 in Podman manifest authenticated push
- PR Ed's pet PR with no flake retries #17831
  - 06-20 13:36 in Podman manifest authenticated push
  - 06-10 10:19 in Podman manifest authenticated push
  - 06-07 15:23 in Podman manifest authenticated push
  - 06-02 15:57 in authenticated push
  - 06-01 19:10 in authenticated push
  - 05-19-2023 16:21 in authenticated push
rawhide : int podman rawhide root host sqlite
- PR Ed's pet PR with no flake retries #17831
  - 06-21 09:17 in Podman manifest authenticated push
rawhide : int remote rawhide root host sqlite [remote]
- PR Ed's pet PR with no flake retries #17831
  - 06-15 16:52 in Podman manifest authenticated push
  - 06-05 22:31 in authenticated push
  - 06-01 22:11 in authenticated push
  - 05-25-2023 17:21 in authenticated push
  - 05-19-2023 18:55 in authenticated push

github-actions · 2023-08-11T00:05:59Z

A friendly reminder that this issue had no activity for 30 days.

edsantiago · 2023-08-14T11:48:43Z

Still happening; reduced rate is because I'm not running #17831 as frequently. This is almost certainly happening in real CI, but passing on the ginkgo retry.

I spent a good part of a day a few weeks ago trying to get a reproducer. No luck. This one is tricky.

fedora-38 : int remote fedora-38 root host boltdb [remote]
- 07-11-2023 11:10 in Podman manifest authenticated push
- 07-10-2023 13:57 in Podman manifest authenticated push
fedora-38 : int remote fedora-38 root host sqlite [remote]
- 07-11-2023 17:05 in Podman manifest authenticated push
- 07-11-2023 10:06 in Podman manifest authenticated push
rawhide : int remote rawhide root host sqlite [remote]
- 07-31 16:45 in Podman manifest authenticated push
- 07-12-2023 21:09 in Podman manifest authenticated push

vrothberg · 2023-08-14T13:19:06Z

@edsantiago shall we try updating quay.io/libpod/registry:2.8 to 2.8.2? This could be a server-side fart.

edsantiago · 2023-08-14T13:22:48Z

Why not? I'll try that in #17831 for a few weeks, then report back. Thanks for the suggestion.

...in hopes of addressing flake containers#18355 Signed-off-by: Ed Santiago <[email protected]>

edsantiago · 2023-09-18T15:51:55Z

Have not seen this since bumping registry from 2.8 to 2.8.2. I say this in the expectation that the gods will punish me by making it trigger again. But the gods, realizing this is my expectation, will leave me hanging instead, so I can clearly not choose the wine in front of me.

vrothberg · 2023-09-19T07:44:52Z

Oh nice!

I happily take the blame and declare this flake fixed 😈

edsantiago added the flakes Flakes from Continuous Integration label Apr 26, 2023

edsantiago changed the title ~~rawhide: e2e push test: TLS handshake timeout~~ e2e push test: TLS handshake timeout May 10, 2023

edsantiago mentioned this issue Jun 5, 2023

podman-registry: simpler, safer invocations #18791

Merged

edsantiago changed the title ~~e2e push test: TLS handshake timeout~~ e2e authenticated push test: multiple failures Jun 5, 2023

edsantiago mentioned this issue Jun 20, 2023

Epic: ginkgo: remove -flakeAttempts 3 #17967

Closed

github-actions bot added the stale-issue label Aug 11, 2023

edsantiago removed the stale-issue label Aug 14, 2023

edsantiago added a commit to edsantiago/libpod that referenced this issue Aug 24, 2023

e2e tests: use registry:2.8.2 (was 2.8)

61dd7f8

...in hopes of addressing flake containers#18355 Signed-off-by: Ed Santiago <[email protected]>

edsantiago mentioned this issue Aug 24, 2023

e2e tests: use registry:2.8.2 (was 2.8) #19733

Merged

vrothberg closed this as completed Sep 19, 2023

github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Dec 19, 2023

github-actions bot locked as resolved and limited conversation to collaborators Dec 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

e2e authenticated push test: multiple failures #18355

e2e authenticated push test: multiple failures #18355

edsantiago commented Apr 26, 2023

edsantiago commented Apr 26, 2023

edsantiago commented May 10, 2023

edsantiago commented May 10, 2023

edsantiago commented May 11, 2023

edsantiago commented May 11, 2023

edsantiago commented May 18, 2023

edsantiago commented Jun 5, 2023

edsantiago commented Jun 26, 2023

github-actions bot commented Aug 11, 2023

edsantiago commented Aug 14, 2023

vrothberg commented Aug 14, 2023 •

edited

Loading

edsantiago commented Aug 14, 2023

edsantiago commented Sep 18, 2023

vrothberg commented Sep 19, 2023

e2e authenticated push test: multiple failures #18355

e2e authenticated push test: multiple failures #18355

Comments

edsantiago commented Apr 26, 2023

edsantiago commented Apr 26, 2023

edsantiago commented May 10, 2023

edsantiago commented May 10, 2023

edsantiago commented May 11, 2023

edsantiago commented May 11, 2023

edsantiago commented May 18, 2023

edsantiago commented Jun 5, 2023

edsantiago commented Jun 26, 2023

github-actions bot commented Aug 11, 2023

edsantiago commented Aug 14, 2023

vrothberg commented Aug 14, 2023 • edited Loading

edsantiago commented Aug 14, 2023

edsantiago commented Sep 18, 2023

vrothberg commented Sep 19, 2023

vrothberg commented Aug 14, 2023 •

edited

Loading