-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CI: system tests: netns leak #24044
Comments
I saw the flake on #24024 which means my netns leak fixes from c/common did not fix it. The good news I see runs before that PR so they did not cause it as well which it is likely a preexisting issue we just discovered with the new leak check. I think I really need #23487 to get a clue here to at least know which container/test leaked it. It kinda sucks that we use a random id and not the container id for the netns files and I am not sure if we can just change it like that. |
The flake reproduces pretty regularly (one out of ten CI runs, maybe). Can you suggest a way to instrument #23275 ? |
It would be good to get my ebpf PR merged or at leat picked into your PR but this will not help to much until I further instrument the ebpf code to trace the netns umount paths I guess. The easiest way would be to log |
Well this is convenient, it failed on #23487. https://api.cirrus-ci.com/v1/artifact/task/6670456856510464/cleanup_tracer/podman-cleanup-tracer.log
So we do setup the netns but there is never a corresponding teardown call which means podman is not failing to remove/unmount the netns it is never calling the netns teardown to begin with which sounds like a rather bad problem. |
And following the parent pid:
Which clearly shows this is a container started by systemd (parent pid 1) and it uses |
netavark setup got killed by sigterm but the podman parent somehow still exited 0? Something really strange is going on. |
I added |
I am pretty sure it prints the teardown output before the test name |
So I am super confused by ebpf logs here as the podman exited 0, yet netavark setup was killed by SIGTERM. That alone should cause the podman network setup to error out. And I run a patched netavark that does a 10s sleep and if I kill it with SIGTERM podman reports the error properly. What I noticed though is that in the log case podman exited before netavark. This should not happen.
I do see that it is possible for podman to exit 0 on sigterm (for likely not very good reasons) and I think this is what happens here. I can only reproduce with run -d, with a attached run it doesn't seem to work. |
Was seeing this often in my pet-PR runs. Like, half my runs. Made a small change. Haven't seen it yet, in about five runs. commit 5eb9a63f112c75bac928a90a8e9f29f22262c8f7
Author: Ed Santiago <[email protected]>
Date: 2024-09-25 07:54:25 -0600
fixme-temporary-skip
Signed-off-by: Ed Santiago <[email protected]>
diff --git a/test/system/255-auto-update.bats b/test/system/255-auto-update.bats
index 359ac5ff32..6bd1025fbc 100644
--- a/test/system/255-auto-update.bats
+++ b/test/system/255-auto-update.bats
@@ -175,2 +175,3 @@ function _confirm_update() {
@test "podman auto-update - label io.containers.autoupdate=image" {
+ skip "FIXME TEMPORARY, just to see if other tests leak netns"
since=$(date --iso-8601=seconds) This doesn't prove that it's a bug in the test itself, but it raises an eyebrow. Unfortunately the test is rather hard to understand. |
I have a pretty reliable reproducer on my laptop but I can't pretend to understand it. Something is happening asynchronously. Overview: diff --git a/test/system/255-auto-update.bats b/test/system/255-auto-update.bats
index 8bde211055..3d64a2d1fe 100644
--- a/test/system/255-auto-update.bats
+++ b/test/system/255-auto-update.bats
@@ -18,6 +18,7 @@ function setup() {
}
function teardown() {
+ ls -l /run/netns | sed -e "s/^/# teardown /" >&3
if [[ -e $SNAME_FILE ]]; then
while read line; do
if [[ "$line" =~ "podman-auto-update" ]]; then
@@ -25,7 +26,9 @@ function teardown() {
systemctl stop $line.timer
systemctl disable $line.timer
else
+ ls -l /run/netns | sed -e "s/^/# before stop $line /" >&3
systemctl stop $line
+ ls -l /run/netns | sed -e "s/^/# after stop $line /" >&3
fi
rm -f $UNIT_DIR/$line.{service,timer}
done < $SNAME_FILE
@@ -66,12 +69,12 @@ function generate_service() {
# Unless specified, set a default command.
if [[ -z "$command" ]]; then
- command="top -d 120"
+ command="top -d $((100 + BATS_SUITE_TEST_NUMBER))"
fi
# Container name. Include the autoupdate type, to make debugging easier.
# IMPORTANT: variable 'cname' is passed (out of scope) up to caller!
- cname=c_${autoupdate//\'/}_$(random_string)
+ cname="c-$(safename)-${autoupdate//\'/}-$(random_string)"
target_img="quay.io/libpod/$target_img_basename:latest"
if [[ -n "$7" ]]; then
target_img="$7"
@@ -172,7 +175,7 @@ function _confirm_update() {
# This test can fail in dev. environment because of SELinux.
# quick fix: chcon -t container_runtime_exec_t ./bin/podman
-@test "podman auto-update - label io.containers.autoupdate=image" {
+@test "podman auto-update - label io.containers.autoupdate=imagexxxxxxx" {
since=$(date --iso-8601=seconds)
run_podman auto-update
is "$output" ""
@@ -214,6 +217,11 @@ function _confirm_update() {
run_podman container inspect --format "{{.ID}}" $ctr_child
run_podman container inspect --format "{{.State.Status}}" $ctr_child
is "$output" "running" "child container is in running state"
+
+ ls -l /run/netns | sed -e 's/^/# before container rm /' >&3
+ run_podman container rm -f -t0 $ctr_child
+ run_podman container rm -f -t0 $ctr_parent
+ ls -l /run/netns | sed -e 's/^/# after container rm /' >&3
}
@test "podman auto-update - label io.containers.autoupdate=image with rollback" { (Not all diffs are relevant for this exercise. And I'm really tired, banking hours for my long lunch tomorrow, so I may be making stupid mistakes in my copy-paste-edits. I think the key is the Anyhow then $ while :;do time hack/bats --root -T 255:xxxxx || break;done (the HTH |
thanks I am able to reproduce with that. At least in my ebpf log it didn't seem to have to do anything with container rm. It rather looks like start is exiting without actually doing anything |
diff --git a/libpod/container_internal_linux.go b/libpod/container_internal_linux.go
index aaeb75d3cc..38119a604c 100644
--- a/libpod/container_internal_linux.go
+++ b/libpod/container_internal_linux.go
@@ -19,6 +19,7 @@ import (
"github.com/containers/common/pkg/cgroups"
"github.com/containers/common/pkg/config"
"github.com/containers/podman/v5/libpod/define"
+ "github.com/containers/podman/v5/libpod/shutdown"
"github.com/containers/podman/v5/pkg/rootless"
spec "github.com/opencontainers/runtime-spec/specs-go"
"github.com/opencontainers/runtime-tools/generate"
@@ -67,6 +68,9 @@ func (c *Container) prepare() error {
tmpStateLock sync.Mutex
)
+ shutdown.Inhibit()
+ defer shutdown.Uninhibit()
+
wg.Add(2)
go func() { @edsantiago can you give this is try on your system? It seems to be working for me with your reproducer and without it it failed rather quickly and I am now at 10mins so I think it is working |
Well, it flaked after a long time, but in a new way:
|
Well I do not understand the flow of the test exactly to see if this is a problem in the test or podman, regardless I would call the netns thing fixed, opened #24082 which some other cleanup |
Placeholder for a known flake wherein network namespaces are being leaked at the end of a system test run, even when all tests pass:
All
podman network create
commands use safename, AFAICT, so this is a difficult one to track down.The text was updated successfully, but these errors were encountered: