
e2e create same-IP: try to fix flake #18329

Merged: 1 commit merged into containers:main on Apr 25, 2023

Conversation

edsantiago (Member)

Our friend #7096 is still not fixed: it continues to flake,
singletons only, and only in the "create" test (not "run").

My guess: maybe there's a race somewhere in IP assignment,
such that container1 can have an IP, but not yet be running,
and a container2 can sneak in and start with that IP, and
container1 is the one that fails?

Solution: tighten the logic so we wait for container1 to
truly be running before we start container2. And, when we
start container2, do so with -a so we get to see stdout.
(Am not expecting it to be helpful, but who knows).

Also very minor cleanup

Signed-off-by: Ed Santiago [email protected]

Release note: None
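
(Not the actual patch, but a minimal sketch of the tightened flow described above, driving the podman CLI directly from Go. The container names, image, and static IP are placeholders, and the real e2e test uses the suite's own helpers rather than os/exec.)

    package main

    import (
        "fmt"
        "os/exec"
        "strings"
        "time"
    )

    // podman runs the CLI and returns trimmed combined output.
    func podman(args ...string) (string, error) {
        out, err := exec.Command("podman", args...).CombinedOutput()
        return strings.TrimSpace(string(out)), err
    }

    func main() {
        ip := "10.88.64.128" // placeholder static IP

        // container1: create with the static IP, then start it.
        podman("create", "--name", "test1", "--ip", ip, "alpine", "top")
        podman("start", "test1")

        // Tightened logic: don't just check that an IP has been assigned;
        // poll until the container state is truly "running".
        for i := 0; i < 10; i++ {
            status, _ := podman("inspect", "--format", "{{.State.Status}}", "test1")
            if status == "running" {
                break
            }
            time.Sleep(time.Second)
        }

        // container2: same IP, started with -a so any failure output is visible.
        podman("create", "--name", "test2", "--ip", ip, "alpine", "top")
        out, err := podman("start", "-a", "test2")
        fmt.Println("start test2:", out, err) // expected to fail: duplicate IP
    }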

openshift-ci bot (Contributor) commented Apr 24, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label on Apr 24, 2023
rhatdan (Member) commented Apr 24, 2023

LGTM

edsantiago (Member, Author)

Recent failures:

  • fedora-36 : int podman fedora-36 root container sqlite
    • 03-28 22:06 in Podman create with --ip flag [It] Podman create two containers with the same IP
    • 03-28 11:11 in Podman create with --ip flag [It] Podman create two containers with the same IP
  • fedora-36 : int remote fedora-36 root host sqlite [remote]
    • 04-24 12:44 in Podman create with --ip flag [It] Podman create two containers with the same IP

All of them in my no-flake-retries PR, which means this is almost certainly happening in production CI, but my flake logger is not seeing those.

vrothberg (Member) left a comment

Nice idea!

/lgtm

openshift-ci bot added the lgtm label on Apr 25, 2023
openshift-merge-robot merged commit 242d63a into containers:main on Apr 25, 2023
Luap99 (Member) commented Apr 25, 2023

I don't think this is possible; start should block until the container is started, which means it also has to wait for the complete network setup. The whole loop here that waits for the IP address to be assigned makes no sense to me. If this is needed, podman start is seriously broken.
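
(A hedged illustration of the claim above, using placeholder names and plain os/exec rather than the e2e helpers: if podman start really blocks until network setup is complete, the state should already be "running" the moment it returns, which would make any wait-for-IP loop redundant.)

    package main

    import (
        "fmt"
        "os/exec"
        "strings"
    )

    func main() {
        // Start blocks; by the time it returns, network setup should be done.
        exec.Command("podman", "start", "test1").Run()

        // If the claim holds, this prints "running <ip>" with no polling needed.
        out, _ := exec.Command("podman", "inspect", "--format",
            "{{.State.Status}} {{.NetworkSettings.IPAddress}}", "test1").CombinedOutput()
        fmt.Println(strings.TrimSpace(string(out)))
    }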

edsantiago deleted the create_2_sameip branch on April 25, 2023 11:20
edsantiago (Member, Author)

@Luap99 if what you say is correct, the test will continue to flake. Here's a question I've long wondered about: does ginkgo have a mechanism for running code on failure? Like:

    Expect("this").ButIfItFailsThen("podman exec container1 ip a or podman logs or something")

Luap99 (Member) commented Apr 25, 2023

> @Luap99 if what you say is correct, the test will continue to flake. Here's a question I've long wondered about: does ginkgo have a mechanism for running code on failure? Like:
>
>     Expect("this").ButIfItFailsThen("podman exec container1 ip a or podman logs or something")

So I assume you want to execute further commands to debug in this case? I don't think this is directly supported. However, what could work is using the extra annotations and passing a function that executes the commands:
https://onsi.github.io/gomega/#annotating-assertions (last example)
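
(A rough sketch of what that could look like, assuming the lazy func() string annotation behaves as described in the Gomega docs linked above; the container names and debug commands are only illustrative, not the suite's actual helpers.)

    package e2e_sketch_test

    import (
        "fmt"
        "os/exec"

        . "github.com/onsi/ginkgo/v2"
        . "github.com/onsi/gomega"
    )

    var _ = It("starting a second container with a duplicate IP fails", func() {
        // "test1"/"test2" are placeholders for containers created elsewhere.
        out, err := exec.Command("podman", "start", "-a", "test2").CombinedOutput()

        // The func() string annotation is evaluated lazily, only if the
        // assertion fails, so the extra podman calls cost nothing on success.
        Expect(err).To(HaveOccurred(), func() string {
            ipA, _ := exec.Command("podman", "exec", "test1", "ip", "a").CombinedOutput()
            logs, _ := exec.Command("podman", "logs", "test1").CombinedOutput()
            return fmt.Sprintf("start test2 output:\n%s\ncontainer1 ip a:\n%s\ncontainer1 logs:\n%s",
                out, ipA, logs)
        })
    })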

edsantiago (Member, Author)

Well, phooey

Luap99 (Member) commented Apr 27, 2023

> Well, phooey

Not sure if I should be happy that I was right or hate the fact that something super strange is going on; maybe both?
In any case, the network code is 100% locked, I have no doubt there. Also, based on the links above, the failure is happening with both netavark and CNI, so the error is not inside the IP allocation logic (unless both CNI and netavark have the same bug despite using completely different implementations).

One thing that could cause this behavior is the first container dying after the inspect but before the start test2 call. Not that I believe that this is the case, but you are right, we need to instrument the tests to gather more data after the failure.

Please reopen the issue; I can take a look tomorrow.
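
(One possible way to do that instrumentation, assuming Ginkgo v2's CurrentSpecReport is available in the suite; the podman commands and container name are placeholders for whatever data turns out to be useful.)

    package e2e_sketch_test

    import (
        "os/exec"

        . "github.com/onsi/ginkgo/v2"
    )

    // Runs after each spec body; gathers debug data only when the spec failed.
    var _ = JustAfterEach(func() {
        if !CurrentSpecReport().Failed() {
            return
        }
        for _, args := range [][]string{
            {"ps", "-a"},
            {"inspect", "--format", "{{.State.Status}} {{.NetworkSettings.IPAddress}}", "test1"},
            {"logs", "test1"},
        } {
            out, _ := exec.Command("podman", args...).CombinedOutput()
            GinkgoWriter.Printf("podman %v:\n%s\n", args, out)
        }
    })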

github-actions bot added the locked - please file new issue/PR label on Aug 26, 2023
github-actions bot locked as resolved and limited conversation to collaborators on Aug 26, 2023