
e2e create same-IP: try to fix flake #18329

Merged: 1 commit merged into containers:main on Apr 25, 2023

Conversation

edsantiago (Member)

Our friend #7096 is still not fixed: it continues to flake,
singletons only, and only in the "create" test (not "run").

My guess: maybe there's a race somewhere in IP assignment,
such that container1 can have an IP, but not yet be running,
and a container2 can sneak in and start with that IP, and
container1 is the one that fails?

Solution: tighten the logic so we wait for container1 to
truly be running before we start container2. And, when we
start container2, do so with -a so we get to see stdout.
(Am not expecting it to be helpful, but who knows).

Also very minor cleanup

Signed-off-by: Ed Santiago [email protected]

Release note: None
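
(Not the actual patch, but a minimal sketch of the tightened flow described above, driving the podman CLI directly from Go. The container names, image, and static IP are placeholders, and the real e2e test uses the suite's own helpers rather than os/exec.)

    package main

    import (
        "fmt"
        "os/exec"
        "strings"
        "time"
    )

    // podman runs the CLI and returns trimmed combined output.
    func podman(args ...string) (string, error) {
        out, err := exec.Command("podman", args...).CombinedOutput()
        return strings.TrimSpace(string(out)), err
    }

    func main() {
        ip := "10.88.64.128" // placeholder static IP

        // container1: create with the static IP, then start it.
        podman("create", "--name", "test1", "--ip", ip, "alpine", "top")
        podman("start", "test1")

        // Tightened logic: don't just check that an IP has been assigned;
        // poll until the container state is truly "running".
        for i := 0; i < 10; i++ {
            status, _ := podman("inspect", "--format", "{{.State.Status}}", "test1")
            if status == "running" {
                break
            }
            time.Sleep(time.Second)
        }

        // container2: same IP, started with -a so any failure output is visible.
        podman("create", "--name", "test2", "--ip", ip, "alpine", "top")
        out, err := podman("start", "-a", "test2")
        fmt.Println("start test2:", out, err) // expected to fail: duplicate IP
    }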

openshift-ci bot (Contributor) commented Apr 24, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label on Apr 24, 2023
rhatdan (Member) commented Apr 24, 2023

LGTM

edsantiago (Member, Author)

Recent failures:

  • fedora-36 : int podman fedora-36 root container sqlite
    • 03-28 22:06 in Podman create with --ip flag [It] Podman create two containers with the same IP
    • 03-28 11:11 in Podman create with --ip flag [It] Podman create two containers with the same IP
  • fedora-36 : int remote fedora-36 root host sqlite [remote]
    • 04-24 12:44 in Podman create with --ip flag [It] Podman create two containers with the same IP

All of them in my no-flake-retries PR, which means this is almost certainly happening in production CI, but my flake logger is not seeing those.

vrothberg (Member) left a comment

Nice idea!

/lgtm

openshift-ci bot added the lgtm label on Apr 25, 2023
openshift-merge-robot merged commit 242d63a into containers:main on Apr 25, 2023
Luap99 (Member) commented Apr 25, 2023

I don't think this is possible; start should block until the container is started, which means it also has to wait for the complete network setup. The whole loop here that waits for the IP address to be assigned makes no sense to me. If this is needed, podman start is seriously broken.
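
(A hedged illustration of the claim above, using placeholder names and plain os/exec rather than the e2e helpers: if podman start really blocks until network setup is complete, the state should already be "running" the moment it returns, which would make any wait-for-IP loop redundant.)

    package main

    import (
        "fmt"
        "os/exec"
        "strings"
    )

    func main() {
        // Start blocks; by the time it returns, network setup should be done.
        exec.Command("podman", "start", "test1").Run()

        // If the claim holds, this prints "running <ip>" with no polling needed.
        out, _ := exec.Command("podman", "inspect", "--format",
            "{{.State.Status}} {{.NetworkSettings.IPAddress}}", "test1").CombinedOutput()
        fmt.Println(strings.TrimSpace(string(out)))
    }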

edsantiago deleted the create_2_sameip branch on April 25, 2023 11:20
edsantiago (Member, Author)

@Luap99 if what you say is correct, the test will continue to flake. Here's a question I've long wondered about: does ginkgo have a mechanism for running code on failure? Like:

    Expect("this").ButIfItFailsThen("podman exec container1 ip a or podman logs or something")

Luap99 (Member) commented Apr 25, 2023

> @Luap99 if what you say is correct, the test will continue to flake. Here's a question I've long wondered about: does ginkgo have a mechanism for running code on failure? Like:
>
>     Expect("this").ButIfItFailsThen("podman exec container1 ip a or podman logs or something")

So I assume you want to execute further commands to debug in this case? I don't think this is directly supported. However, what could work is using the extra annotations and passing a function that executes the commands:
https://onsi.github.io/gomega/#annotating-assertions (last example)
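
(A rough sketch of what that could look like, assuming the lazy func() string annotation behaves as described in the Gomega docs linked above; the container names and debug commands are only illustrative, not the suite's actual helpers.)

    package e2e_sketch_test

    import (
        "fmt"
        "os/exec"

        . "github.com/onsi/ginkgo/v2"
        . "github.com/onsi/gomega"
    )

    var _ = It("starting a second container with a duplicate IP fails", func() {
        // "test1"/"test2" are placeholders for containers created elsewhere.
        out, err := exec.Command("podman", "start", "-a", "test2").CombinedOutput()

        // The func() string annotation is evaluated lazily, only if the
        // assertion fails, so the extra podman calls cost nothing on success.
        Expect(err).To(HaveOccurred(), func() string {
            ipA, _ := exec.Command("podman", "exec", "test1", "ip", "a").CombinedOutput()
            logs, _ := exec.Command("podman", "logs", "test1").CombinedOutput()
            return fmt.Sprintf("start test2 output:\n%s\ncontainer1 ip a:\n%s\ncontainer1 logs:\n%s",
                out, ipA, logs)
        })
    })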

edsantiago (Member, Author)

Well, phooey

Luap99 (Member) commented Apr 27, 2023

> Well, phooey

Not sure if I should be happy that I was right or hate the fact that something super strange is going on; maybe both?
In any case, the network code is 100% locked, I have no doubt there. Also, based on the links above, the failure is happening with both netavark and CNI, so the error is not inside the IP allocation logic (unless both CNI and netavark have the same bug despite using completely different implementations).

One thing that could cause this behavior is the first container dying after the inspect but before the start test2 call. Not that I believe that this is the case, but you are right, we need to instrument the tests to gather more data after the failure.

Please reopen the issue; I can take a look tomorrow.
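
(One possible way to do that instrumentation, assuming Ginkgo v2's CurrentSpecReport is available in the suite; the podman commands and container name are placeholders for whatever data turns out to be useful.)

    package e2e_sketch_test

    import (
        "os/exec"

        . "github.com/onsi/ginkgo/v2"
    )

    // Runs after each spec body; gathers debug data only when the spec failed.
    var _ = JustAfterEach(func() {
        if !CurrentSpecReport().Failed() {
            return
        }
        for _, args := range [][]string{
            {"ps", "-a"},
            {"inspect", "--format", "{{.State.Status}} {{.NetworkSettings.IPAddress}}", "test1"},
            {"logs", "test1"},
        } {
            out, _ := exec.Command("podman", args...).CombinedOutput()
            GinkgoWriter.Printf("podman %v:\n%s\n", args, out)
        }
    })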

github-actions bot added the locked - please file new issue/PR label on Aug 26, 2023
github-actions bot locked as resolved and limited conversation to collaborators on Aug 26, 2023