pkg/rootless: correctly handle proxy signals on reexec #18681

Merged
merged 1 commit into from
May 27, 2023

Member

@Luap99 Luap99 commented May 25, 2023

There are quite a lot of places in podman where we have signal
handlers, most notably libpod/shutdown/handler.go.

However, when we re-exec we do not want any of that; we just want to send
all signals we get down to the child. So before we install our
signal handler we must first reset all others with signal.Reset().

Also, while at it, fix a problem where the joinUserAndMountNS() code path
would not forward signals at all. This code path is used when you have
running containers but the pause process was killed.

Fixes #16091
Given that signal handlers run in parallel in different goroutines, it would
explain why it flakes sometimes in CI. However, to my understanding this
flake can only happen when the pause process is dead before we run the
podman command. So the question still is: what kills the pause process?
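To illustrate why several handlers can race, here is a small sketch (not taken from podman; `received` and `fanOutThenReset` are hypothetical names). It shows that every channel registered with signal.Notify gets its own copy of a signal, so independent goroutines may react to it in any order, and that after signal.Reset() only a newly registered channel is notified.

```go
package main

import (
	"os"
	"os/signal"
	"syscall"
	"time"
)

// received reports whether a signal arrives on ch within d.
func received(ch <-chan os.Signal, d time.Duration) bool {
	select {
	case <-ch:
		return true
	case <-time.After(d):
		return false
	}
}

// fanOutThenReset sends SIGUSR1 to this process twice and reports:
//   bothBefore: both old channels saw the first signal
//   oldAfter:   an old channel still saw the second signal after Reset
//   newAfter:   the channel registered after Reset saw the second signal
func fanOutThenReset() (bothBefore, oldAfter, newAfter bool) {
	// Two unrelated registrations, standing in for e.g. a shutdown
	// handler and a signal forwarder living in different packages.
	a := make(chan os.Signal, 1)
	b := make(chan os.Signal, 1)
	signal.Notify(a, syscall.SIGUSR1)
	signal.Notify(b, syscall.SIGUSR1)

	_ = syscall.Kill(os.Getpid(), syscall.SIGUSR1)
	gotA := received(a, time.Second)
	gotB := received(b, time.Second)
	bothBefore = gotA && gotB

	// Drop every existing registration, then install the only one we want.
	signal.Reset()
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGUSR1)
	defer signal.Stop(c)

	_ = syscall.Kill(os.Getpid(), syscall.SIGUSR1)
	newAfter = received(c, time.Second)
	gotA = received(a, 100*time.Millisecond)
	gotB = received(b, 100*time.Millisecond)
	oldAfter = gotA || gotB
	return bothBefore, oldAfter, newAfter
}
```

In the sketch the two old channels are just read in sequence; in a real program each would typically be drained by its own goroutine, which is where the ordering becomes nondeterministic.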

Does this PR introduce a user-facing change?

Fixed a bug in the rootless podman reexec logic where signals were not forwarded correctly.

@openshift-ci openshift-ci bot added release-note approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 25, 2023
@Luap99
Member Author

Luap99 commented May 25, 2023

@giuseppe @edsantiago PTAL

Member

@giuseppe giuseppe left a comment

LGTM

@openshift-ci
Contributor

openshift-ci bot commented May 25, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, Luap99

@edsantiago
Member

I haven't reviewed the code, but I tested it on a 1mt VM and it passes \o/.

Failing because of missing tests. Maybe you could tweak 032.bats to kill the rootless pause process, if running?

@Luap99
Member Author

Luap99 commented May 25, 2023

> Failing because of missing tests. Maybe you could tweak 032.bats to kill the rootless pause process, if running?

Yeah I am thinking about something like that. I would prefer to not change the existing test but I think I can add something like this to 550-pause-process.

@edsantiago
Member

> So the question still is what kills the pause process?

New pause-related flake filed, #18685

@Luap99
Member Author

Luap99 commented May 25, 2023

@edsantiago updated with tests for both code paths I changed. I confirmed that both fail on main right now and pass with this patch.

@edsantiago
Member

Might want to skip_if_remote:

Error: cannot use command "podman-remote unshare" with the remote podman client

(Also, much lower pri, might want to fix the duplication in that message; I'll file an issue for it later)

Other failures are flakes, including unlinkat-ebusy. I restarted them before realizing that the third flake (this one) was a hard failure.

@Luap99
Member Author

Luap99 commented May 25, 2023

> Might want to skip_if_remote:

Yes, I also saw that; I just waited so you could take a look at the tests before I repush.

Member

@vrothberg vrothberg left a comment

LGTM, nice catch!

There are quite a lot of places in podman where we have signal
handlers, most notably libpod/shutdown/handler.go.

However, when we re-exec we do not want any of that; we just want to send
all signals we get down to the child. So before we install our
signal handler we must first reset all others with signal.Reset().

Also, while at it, fix a problem where the joinUserAndMountNS() code path
would not forward signals at all. This code path is used when you have
running containers but the pause process was killed.

Fixes containers#16091
Given that signal handlers run in parallel in different goroutines, it would
explain why it flakes sometimes in CI. However, to my understanding this
flake can only happen when the pause process is dead before we run the
podman command. So the question still is: what kills the pause process?

Signed-off-by: Paul Holzinger <[email protected]>
@edsantiago
Member

Am looking at tests, and confused, but nothing that should block this.

(Confusion: I'm having trouble understanding the difference between the two new tests, and how the 2nd new test differs from the single pause process test. But again, that is my lack of comprehension, and not a blocker nor a request to change anything)

Member

@TomSweeneyRedHat TomSweeneyRedHat left a comment

LGTM

@TomSweeneyRedHat
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 27, 2023
@openshift-merge-robot openshift-merge-robot merged commit e7dc507 into containers:main May 27, 2023
@Luap99 Luap99 deleted the reexec-signals branch May 28, 2023 16:18
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Aug 27, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 27, 2023
Development

Successfully merging this pull request may close these issues.

signal flake: Timed out waiting for BYE
6 participants