Multiple e2e tests are flaky because of error container is not running #4405

Closed
sbueringer opened this issue Mar 30, 2021 · 18 comments · Fixed by #4414
Labels
kind/failing-test, kind/flake, lifecycle/active, priority/important-soon

Comments

@sbueringer (Member) commented Mar 30, 2021

What steps did you take and what happened:
[A clear and concise description on how to REPRODUCE the bug.]

According to testgrid, "capi-e2e.When following the Cluster API quick-start [PR-Blocking] Should create a workload cluster" and a few other tests are failing from time to time.

I looked at the last two occurrences in the capi-quickstart test. In both cases a machine did not come up: mkdir -p /etc/kubernetes/pki failed because the respective container was not running. The command was retried for a while but the container never came up. I tried to find other logs but couldn't find anything useful. Logs from the controllers, aggregated and sorted for the affected node of this test: https://gist.github.com/sbueringer/e007c989c158d66dd6d3078f8c904f30 (ProwJob)

I think right now we don't have the necessary data/logs to find out why this happens. I would propose gathering the logs of the Docker service used in those tests (the dind in the ProwJob); maybe there's something interesting there. Are there any other Docker / kind / ... logs we could retrieve?

What I found in the kubekins image we're using:

  • /var/log/docker.log

What did you expect to happen:

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

I assume the following test failures are related:

So tl;dr: apart from the MachineRemediation test, most of our other flaky tests are probably caused by this issue. They are usually failing in the following lines of code:

Even if they don't all have the same root cause, fixing this error should fix most of them.

Environment:

  • Cluster-api version:
  • Minikube/KIND version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

/kind bug
[One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels]

@k8s-ci-robot added the kind/bug label Mar 30, 2021
@sbueringer changed the title from "Quick-start e2e tests are failing because of container is not running" to "Multiple e2e tests are flaky because of error container is not running" Mar 30, 2021
@fabriziopandini (Member) commented:

@sbueringer Thanks for the detailed analysis

The key point here is

.... was failing because the respective container was not running.

We should get to the root of this and understand why the docker run performed while creating a CAPD machine/container did not return an error here.

I would start by trying to get /var/log/docker.log; a way to achieve this is to copy the file to the artifact folder at the end of our ci-e2e.sh script (we could also collect other info there, e.g. docker system df), as sketched below.
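A minimal sketch of what that collection step could look like, assuming the usual Prow setup where ARTIFACTS points at the uploaded artifact folder and /var/log/docker.log is the daemon log found in the kubekins image:

    # Hypothetical helper for the end of ci-e2e.sh: copy the Docker daemon log and
    # some engine state into the artifact folder so they are uploaded with the job.
    collect_docker_info() {
      local dir="${ARTIFACTS:-/tmp/artifacts}/docker-info"
      mkdir -p "${dir}"
      cp /var/log/docker.log "${dir}/docker.log" || true
      docker system df > "${dir}/docker-system-df.txt" 2>&1 || true
      docker info > "${dir}/docker-info.txt" 2>&1 || true
      docker ps -a > "${dir}/docker-ps-a.txt" 2>&1 || true
    }
    collect_docker_info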

Another thing we could do is add an additional step in CAPD that validates that a machine/container actually exists after being created and, if not, dumps some info useful for debugging.

From my experience, docker run failing silently can happen when the Docker engine fills up the local disk cache; if that is the case, given that AFAIK we are already using the biggest machines available on Prow, the only option I see is to split the current E2E job into a few jobs.
One easy win to reduce noise could also be to stop testing conformance here as soon as we have the per-Kubernetes-version jobs running (those include conformance).

@fabriziopandini added the kind/failing-test and kind/flake labels and removed the kind/bug label Mar 30, 2021
@sbueringer (Member Author) commented:

@fabriziopandini Agree, I'll open a first PR to gather some more info.

/assign

@sbueringer (Member Author) commented Mar 31, 2021

@fabriziopandini PR for better visibility is open: #4414

I didn't make any changes to CAPD yet:

@sbueringer (Member Author) commented Mar 31, 2021

@fabriziopandini

Got some new data already from #4414

Test "capi-e2e: When testing KCP upgrade Should successfully upgrade Kubernetes, DNS, kube-proxy, and etcd in a HA cluster using scale in rollout" failed

  • job log

    • kcp_upgrade.go:146 > controlplane_helpers.go:146 (WaitForOneKubeadmControlPlaneMachineToExist)
      • => timeout because control plane DockerMachine kcp-upgrade-vs9awl-control-plane-5lql4 (Machine kcp-upgrade-vs9awl-control-plane-xrbnp) did not come up
  • capd manager log:

    • E0331 10:05:49.849720 failed to run cloud config: command \"docker exec --privileged kcp-upgrade-vs9awl-control-plane-xrbnp mkdir -p /etc/kubernetes/pki\" [...] Container 0efe8d9c20b600de451e2a584ce83291152c112535120a53fe47f462a1bea29b is not running
    • retries until about 10:15:51.373943
  • no further information in the Machine log folder

  • containerd events

    2021-03-31 10:05:43.07502905 +0000 UTC moby /containers/create {"id":"0efe8d9c20b600de451e2a584ce83291152c112535120a53fe47f462a1bea29b","runtime":{"name":"io.containerd.runtime.v1.linux","options":{"type_url":"containerd.linux.runc.RuncOptions","value":"CgRydW5jEhwvdmFyL3J1bi9kb2NrZXIvcnVudGltZS1ydW5j"}}}
    2021-03-31 10:05:44.385562266 +0000 UTC moby /tasks/create {"container_id":"0efe8d9c20b600de451e2a584ce83291152c112535120a53fe47f462a1bea29b","bundle":"/var/run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/0efe8d9c20b600de451e2a584ce83291152c112535120a53fe47f462a1bea29b","io":{"stdin":"/var/run/docker/containerd/0efe8d9c20b600de451e2a584ce83291152c112535120a53fe47f462a1bea29b/init-stdin","stdout":"/var/run/docker/containerd/0efe8d9c20b600de451e2a584ce83291152c112535120a53fe47f462a1bea29b/init-stdout","terminal":true},"pid":71881}
    2021-03-31 10:05:44.442658332 +0000 UTC moby /tasks/start {"container_id":"0efe8d9c20b600de451e2a584ce83291152c112535120a53fe47f462a1bea29b","pid":71881}
    2021-03-31 10:05:44.582080203 +0000 UTC moby /tasks/exit {"container_id":"0efe8d9c20b600de451e2a584ce83291152c112535120a53fe47f462a1bea29b","id":"0efe8d9c20b600de451e2a584ce83291152c112535120a53fe47f462a1bea29b","pid":71881,"exit_status":141,"exited_at":"2021-03-31T10:05:44.511218578Z"}
    2021-03-31 10:05:44.641894009 +0000 UTC moby /tasks/delete {"container_id":"0efe8d9c20b600de451e2a584ce83291152c112535120a53fe47f462a1bea29b","pid":71881,"exit_status":141,"exited_at":"2021-03-31T10:05:44.511218578Z"}
    2021-03-31 10:05:45.330197141 +0000 UTC moby /containers/delete {"id":"0efe8d9c20b600de451e2a584ce83291152c112535120a53fe47f462a1bea29b"}
    • => tl;dr container start => container exit => exit status 141
  • docker events

    • more or less the same information: 2021-03-31T10:05:44.642662959Z container die 0efe8d9c20b600de451e2a584ce83291152c112535120a53fe47f462a1bea29b (exitCode=141, image=kindest/node:v1.18.2, io.x-k8s.kind.cluster=kcp-upgrade-vs9awl, io.x-k8s.kind.role=control-plane, name=kcp-upgrade-vs9awl-control-plane-xrbnp)

I think we should implement a docker inspect in the following error branch:

    err := cmd.Run(ctx)
    if err != nil {
        log.Info("Failed running command", "command", command, "stdout", outStd.String(), "stderr", outErr.String(), "bootstrap data", data)
        return errors.Wrap(errors.WithStack(err), "failed to run cloud config")
    }
}

Just tried it locally and I hope it gets us some more info, e.g.:

        "State": {
            "Status": "created",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 127,
            "Error": "OCI runtime create failed: container_linux.go:370: starting container process caused: exec: \"/bin/bac\": stat /bin/bac: no such file or directory: unknown",
            "StartedAt": "0001-01-01T00:00:00Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        },
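The State block above is what docker inspect reports for the dead container; a CLI equivalent that dumps just that section would be something like the following (the container name variable is a placeholder):

    # Print only the State section of a container as JSON; the name is illustrative.
    docker inspect --format '{{json .State}}' "${CONTAINER_NAME}"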

What do you think?

@fabriziopandini (Member) commented:

+1 for implementing a docker inspect.
TBH, I'm not sure we are entering that error branch right after create, because it seems that machine creation continues up to the bootstrap phase/cloud-init commands, but we can always inspect all the containers after creation.

@sbueringer (Member Author) commented Mar 31, 2021

@fabriziopandini from the log I suspect we're first failing here (and then at the location I linked above during the retries):

    err = wait.PollImmediate(500*time.Millisecond, 4*time.Second, func() (bool, error) {
        ps := m.container.Commander.Command("crictl", "ps")
        return ps.Run(ctx) == nil, nil
    })
    if err != nil {
        return errors.WithStack(err)
    }

I0331 10:05:37.077010       1 machine.go:190] controller-runtime/manager/controller/dockermachine "msg"="Creating control plane machine container" "name"="kcp-upgrade-vs9awl-control-plane-5lql4" "namespace"="kcp-upgrade-xv6wt7" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" 
I0331 10:05:47.019084       1 dockermachine_controller.go:73] controller-runtime/manager/controller/dockermachine "msg"="Waiting for Machine Controller to set OwnerRef on DockerMachine" "name"="kcp-upgrade-b3dk85-control-plane-kwgrx" "namespace"="kcp-upgrade-k135el" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" 
I0331 10:05:47.043525       1 dockermachine_controller.go:73] controller-runtime/manager/controller/dockermachine "msg"="Waiting for Machine Controller to set OwnerRef on DockerMachine" "name"="kcp-upgrade-b3dk85-control-plane-kwgrx" "namespace"="kcp-upgrade-k135el" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" 
I0331 10:05:47.064611       1 dockermachine_controller.go:73] controller-runtime/manager/controller/dockermachine "msg"="Waiting for Machine Controller to set OwnerRef on DockerMachine" "name"="kcp-upgrade-b3dk85-control-plane-kwgrx" "namespace"="kcp-upgrade-k135el" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" 
I0331 10:05:47.078423       1 dockermachine_controller.go:73] controller-runtime/manager/controller/dockermachine "msg"="Waiting for Machine Controller to set OwnerRef on DockerMachine" "name"="kcp-upgrade-b3dk85-control-plane-kwgrx" "namespace"="kcp-upgrade-k135el" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" 
I0331 10:05:47.844159       1 dockermachine_controller.go:200] controller-runtime/manager/controller/dockermachine "msg"="Waiting for the Bootstrap provider controller to set bootstrap data" "name"="kcp-upgrade-b3dk85-control-plane-kwgrx" "namespace"="kcp-upgrade-k135el" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" 
I0331 10:05:48.003814       1 dockermachine_controller.go:200] controller-runtime/manager/controller/dockermachine "msg"="Waiting for the Bootstrap provider controller to set bootstrap data" "name"="kcp-upgrade-b3dk85-control-plane-kwgrx" "namespace"="kcp-upgrade-k135el" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" 
E0331 10:05:48.736814       1 controller.go:302] controller-runtime/manager/controller/dockermachine "msg"="Reconciler error" "error"="failed to create worker DockerMachine: timed out waiting for the condition" "name"="kcp-upgrade-vs9awl-control-plane-5lql4" "namespace"="kcp-upgrade-xv6wt7" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" 
I0331 10:05:48.988211       1 loadbalancer.go:126] controller-runtime/manager/controller/dockermachine "msg"="Updating load balancer configuration" "name"="kcp-upgrade-vs9awl-control-plane-5lql4" "namespace"="kcp-upgrade-xv6wt7" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" 
I0331 10:05:49.817654       1 machine.go:312] controller-runtime/manager/controller/dockermachine "msg"="Failed running command" "name"="kcp-upgrade-vs9awl-control-plane-5lql4" "namespace"="kcp-upgrade-xv6wt7" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" "bootstrap 

I think a docker inspect makes sense at both locations.

@sbueringer (Member Author) commented Apr 1, 2021

Got some more information from test runs on #4416
In addition to what we already know (exit code 141), the container which doesn't come up prints these logs:

"INFO: ensuring we can execute mount/umount even with userns-remap"
"INFO: remounting /sys read-only"
"INFO: making mounts shared"
"INFO: fix cgroup mounts for all subsystems"

This suggests the entrypoint script fails somewhere after:
https://github.com/kubernetes-sigs/kind/blob/c17ca16700df5910126e317d108e65a2e485d081/images/base/files/usr/local/bin/entrypoint#L196

I'll build a custom kind image with set -x to find out exactly where it fails (and also make a few more adjustments to the e2e tests on this PR to make the issue easier to reproduce); a rough sketch of that workflow is below.
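Roughly, building such a debug image could look like this; the exact build steps and flags are assumptions about the kind v0.9.0 tree and its tooling, and the image tags are placeholders:

    # Add `set -x` to the entrypoint right after the shebang, then build a custom
    # base image and a node image on top of it (kind build node-image builds
    # Kubernetes from a local source checkout by default).
    git clone https://github.com/kubernetes-sigs/kind && cd kind && git checkout v0.9.0
    sed -i '2i set -x' images/base/files/usr/local/bin/entrypoint
    docker build -t example.com/kind-base:debug images/base
    kind build node-image --base-image example.com/kind-base:debug --image example.com/kind-node:debug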

@sbueringer (Member Author) commented Apr 2, 2021

Some new data now with set -x

Script fails here: https://github.com/kubernetes-sigs/kind/blob/v0.9.0/images/base/files/usr/local/bin/entrypoint#L82

Output from docker logs:

+ local docker_cgroup cgroup_subsystems subsystem
++ head -n 1
++ cut '-d ' -f 4
++ echo '6217 6216 0:29 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:923 - cgroup cgroup rw,xattr,name=systemd
6287 6216 0:32 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:924 - cgroup cgroup rw,cpu,cpuacct
6288 6216 0:33 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:925 - cgroup cgroup rw,net_cls,net_prio
6289 6216 0:34 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:926 - cgroup cgroup rw,memory
6290 6216 0:35 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:927 - cgroup cgroup rw,freezer
6291 6216 0:36 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:928 - cgroup cgroup rw,hugetlb
6320 6216 0:37 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:929 - cgroup cgroup rw,pids
6322 6216 0:39 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:931 - cgroup cgroup rw,blkio
6323 6216 0:40 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:932 - cgroup cgroup rw,cpuset
6324 6216 0:41 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:933 - cgroup cgroup rw,perf_event
6325 6216 0:42 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:934 - cgroup cgroup rw,devices'
+ docker_cgroup=/docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556

So there seems to be a PIPEFAIL (a pipeline failure under set -o pipefail) in docker_cgroup=$(echo "${docker_cgroup_mounts}" | head -n 1 | cut -d' ' -f 4).

I assume that's because echo writes into the pipe while head stops reading after the first line (which is the intended behavior of head); the still-writing echo then receives SIGPIPE, and exit status 141 is exactly 128 + SIGPIPE (13), which matches the containerd/docker events above. The /bin/echo binary seems to behave a bit differently than the echo bash built-in, so I'll check whether the error still occurs when using /bin/echo instead of echo. I can only assume this happens when echo somehow writes into the pipe "slowly" (a CPU context switch, maybe?), because it happens rarely. A small repro sketch is below.
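A minimal sketch of the suspected mechanism, not the actual kind patch: the first variant keeps echo as the writer in the pipeline, the second feeds head directly from the variable; the oversized stand-in data only serves to make the race deterministic.

    #!/usr/bin/env bash
    # Assumes pipefail is enabled, as the PIPEFAIL above suggests it is in the
    # entrypoint. No errexit here so that both variants get to run.
    set -o pipefail

    # Stand-in data, large enough that the writer is still writing when head has
    # already consumed its first line and exited.
    data="$(printf 'a b c d\n%.0s' {1..20000})"

    # Variant 1: echo feeds the pipe. When head exits early, the still-writing echo
    # is killed by SIGPIPE, so with pipefail the pipeline reports 141 (128 + 13).
    docker_cgroup="$(echo "${data}" | head -n 1 | cut -d' ' -f 4)"
    echo "echo | head | cut exit status: $?"

    # Variant 2: head reads from a herestring, so no process in the pipeline is
    # still writing when head exits and nothing receives SIGPIPE.
    docker_cgroup="$(head -n 1 <<< "${data}" | cut -d' ' -f 4)"
    echo "head <<< | cut exit status: $?"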

Some links:

In case this doesn't work, here's an alternative solution:

@sbueringer (Member Author) commented:

I implemented the alternative solution now, as it stands to reason that we can avoid pipe failures with head if echo is not part of the pipe at all.

I think I'm at a point where it's hard to prove that the issue is gone, but it should be fixed with my custom kind images. Most of the failures in testgrid are caused by this issue. So I assume if I can run the tests maybe 10 times in a row without hitting this issue, the probability is fairly high that it's gone.
I'll test it a few more times on #4416 and, if it doesn't occur again, I'll open an issue in the kind repo and ask them what they think.

@sbueringer (Member Author) commented:

/reopen
(shouldn't have written "Gathers more logs and data from Docker and containerd, to be able to fix #4405 mid-term." in the PR)

@k8s-ci-robot reopened this Apr 2, 2021
@k8s-ci-robot (Contributor) commented:

@sbueringer: Reopened this issue.

In response to this:

/reopen
(shouldn't have written "Gathers more logs and data from Docker and containerd, to be able to fix #4405 mid-term." in the PR)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer (Member Author) commented Apr 6, 2021

The issue will be fixed soon in kind (kubernetes-sigs/kind#2179). After the next kind release, which should be in the next 1-2 weeks, we can upgrade to newer images with the new entrypoint script which should fix our flakes.

Not sure how the new images are tagged, but I assume we can just update to the newest v1.19.* and v1.18.* images then.

Can we also update to the latest kind version and to 1.20.* and 1.19.* images or do we want to explicitly test the update from 1.18 to 1.19?

@sbueringer (Member Author) commented May 24, 2021

Status update: #4469 has been merged, so I'll take a look over the next few days to see whether the "container is not running" issue is gone as expected.

@sbueringer (Member Author) commented May 25, 2021

Still occurs (link), but that makes sense as in some tests we're using kindest/node images published by old kind versions (docker images). These images still contain the old entrypoint script.

I'll open a PR (#4663) to upgrade to the latest images. I think right now we're only using old/pinned images in our "regular tests" on the default branch. Our (periodic) upgrade tests should either use an image published by the latest kind or build the image locally with kind v0.11.0. One quick way to verify which entrypoint a given image ships is sketched below.
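For reference, a quick way to check whether a given kindest/node image already contains the fixed entrypoint (the tag is only an example):

    # Print the beginning of the entrypoint baked into a node image to see which
    # version of the script it contains.
    docker run --rm --entrypoint cat kindest/node:v1.21.1 /usr/local/bin/entrypoint | head -n 40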

@vincepri (Member) commented:

/priority important-soon
/milestone v0.4
/assign @sbueringer
/lifecycle active

@k8s-ci-robot added the lifecycle/active label May 25, 2021
@k8s-ci-robot added this to the v0.4 milestone May 25, 2021
@k8s-ci-robot added the priority/important-soon label May 25, 2021
@sbueringer (Member Author) commented:

@vincepri @fabriziopandini

I took a rough look at the last 12 failed e2e tests and they didn't show the "is not running" issue, so I would suggest closing this issue.

I wouldn't say that the e2e tests are completely stable. I think this should (depending on priority) be addressed by a new issue.
[image attached]

P.S. I won't be able to further investigate the current flakes before July.

@fabriziopandini (Member) commented:

@sbueringer we are definitely in a better shape than before. Thanks for this work!

/close

@k8s-ci-robot (Contributor) commented:

@fabriziopandini: Closing this issue.

In response to this:

@sbueringer we are definitely in a better shape than before. Thanks for this work!

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
