Multiple e2e tests are flaky because of error container is not running #4405
@sbueringer Thanks for the detailed analysis. The key point here is that we should get to the root of this and understand why it happens.

I would start by trying to get /var/log/docker.log; a way to achieve this is to copy this file to the artefact folder at the end of our E2E jobs. Another thing we can do is to add an additional step in CAPD that validates that a machine actually exists after being created and, if not, dumps some info useful for debugging.

From my experience, docker run silently failing can happen when the docker engine fills up the local disk cache; if this is the case, given that AFAIK we are already using the biggest machines available on prow, the only option I see is to split the current E2E job into a few jobs.
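A minimal sketch of the log-collection idea, assuming the dind sidecar writes its daemon log to /var/log/docker.log and the prow job exposes an ARTIFACTS directory (both are assumptions about the CI environment, not verified here):

```bash
#!/usr/bin/env bash
# Hypothetical cleanup hook for the e2e script: keep the Docker daemon log
# with the rest of the job output so it can be inspected after a flake.
set -o errexit -o nounset -o pipefail

collect_docker_log() {
  # ARTIFACTS is the directory prow uploads; fall back to /tmp for local runs.
  local artifacts_dir="${ARTIFACTS:-/tmp/artifacts}"
  mkdir -p "${artifacts_dir}"
  if [[ -f /var/log/docker.log ]]; then
    cp /var/log/docker.log "${artifacts_dir}/docker.log"
  fi
}

# Run the collection even when the tests fail.
trap collect_docker_log EXIT
```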
@fabriziopandini Agreed, I'll open a first PR to gather some more info. /assign
@fabriziopandini The PR for better visibility is open: #4414. I didn't make any changes to CAPD yet:
Got some new data already from #4414.

I think we should implement a docker inspect here: cluster-api/test/infrastructure/docker/docker/machine.go (lines 310 to 315 in 7478817)
Just tried it locally and I hope it gets us some more info, e.g.:

"State": {
"Status": "created",
"Running": false,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 0,
"ExitCode": 127,
"Error": "OCI runtime create failed: container_linux.go:370: starting container process caused: exec: \"/bin/bac\": stat /bin/bac: no such file or directory: unknown",
"StartedAt": "0001-01-01T00:00:00Z",
"FinishedAt": "0001-01-01T00:00:00Z"
},

What do you think?
+1 for implementing a docker inspect.
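For reference, the same state can be pulled from the command line; the container name below is a placeholder:

```bash
# Print only the State block of a node container that failed to start.
docker inspect --format '{{json .State}}' my-failed-node-container

# Or just the fields that matter when "docker run" fails silently.
docker inspect \
  --format 'status={{.State.Status}} exitCode={{.State.ExitCode}} error={{.State.Error}}' \
  my-failed-node-container
```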
@fabriziopandini from the log I suspect we're first failing here (and then at the location I linked above during the retries): cluster-api/test/infrastructure/docker/docker/machine.go (lines 222 to 228 in 7478817)

Log excerpt:
I0331 10:05:37.077010 1 machine.go:190] controller-runtime/manager/controller/dockermachine "msg"="Creating control plane machine container" "name"="kcp-upgrade-vs9awl-control-plane-5lql4" "namespace"="kcp-upgrade-xv6wt7" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine"
I0331 10:05:47.019084 1 dockermachine_controller.go:73] controller-runtime/manager/controller/dockermachine "msg"="Waiting for Machine Controller to set OwnerRef on DockerMachine" "name"="kcp-upgrade-b3dk85-control-plane-kwgrx" "namespace"="kcp-upgrade-k135el" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine"
I0331 10:05:47.043525 1 dockermachine_controller.go:73] controller-runtime/manager/controller/dockermachine "msg"="Waiting for Machine Controller to set OwnerRef on DockerMachine" "name"="kcp-upgrade-b3dk85-control-plane-kwgrx" "namespace"="kcp-upgrade-k135el" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine"
I0331 10:05:47.064611 1 dockermachine_controller.go:73] controller-runtime/manager/controller/dockermachine "msg"="Waiting for Machine Controller to set OwnerRef on DockerMachine" "name"="kcp-upgrade-b3dk85-control-plane-kwgrx" "namespace"="kcp-upgrade-k135el" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine"
I0331 10:05:47.078423 1 dockermachine_controller.go:73] controller-runtime/manager/controller/dockermachine "msg"="Waiting for Machine Controller to set OwnerRef on DockerMachine" "name"="kcp-upgrade-b3dk85-control-plane-kwgrx" "namespace"="kcp-upgrade-k135el" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine"
I0331 10:05:47.844159 1 dockermachine_controller.go:200] controller-runtime/manager/controller/dockermachine "msg"="Waiting for the Bootstrap provider controller to set bootstrap data" "name"="kcp-upgrade-b3dk85-control-plane-kwgrx" "namespace"="kcp-upgrade-k135el" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine"
I0331 10:05:48.003814 1 dockermachine_controller.go:200] controller-runtime/manager/controller/dockermachine "msg"="Waiting for the Bootstrap provider controller to set bootstrap data" "name"="kcp-upgrade-b3dk85-control-plane-kwgrx" "namespace"="kcp-upgrade-k135el" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine"
E0331 10:05:48.736814 1 controller.go:302] controller-runtime/manager/controller/dockermachine "msg"="Reconciler error" "error"="failed to create worker DockerMachine: timed out waiting for the condition" "name"="kcp-upgrade-vs9awl-control-plane-5lql4" "namespace"="kcp-upgrade-xv6wt7" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine"
I0331 10:05:48.988211 1 loadbalancer.go:126] controller-runtime/manager/controller/dockermachine "msg"="Updating load balancer configuration" "name"="kcp-upgrade-vs9awl-control-plane-5lql4" "namespace"="kcp-upgrade-xv6wt7" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine"
I0331 10:05:49.817654 1 machine.go:312] controller-runtime/manager/controller/dockermachine "msg"="Failed running command" "name"="kcp-upgrade-vs9awl-control-plane-5lql4" "namespace"="kcp-upgrade-xv6wt7" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="DockerMachine" "bootstrap

I think a
Got some more information from test runs on #4416:

"INFO: ensuring we can execute mount/umount even with userns-remap"
"INFO: remounting /sys read-only"
"INFO: making mounts shared"
"INFO: fix cgroup mounts for all subsystems" This suggests the entrypoint script fails somewhere after: I'll build a custom kind image with |
Some new data now with the custom image. The script fails here: https://github.com/kubernetes-sigs/kind/blob/v0.9.0/images/base/files/usr/local/bin/entrypoint#L82

Output from docker logs:

+ local docker_cgroup cgroup_subsystems subsystem
++ head -n 1
++ cut '-d ' -f 4
++ echo '6217 6216 0:29 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:923 - cgroup cgroup rw,xattr,name=systemd
6287 6216 0:32 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:924 - cgroup cgroup rw,cpu,cpuacct
6288 6216 0:33 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:925 - cgroup cgroup rw,net_cls,net_prio
6289 6216 0:34 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:926 - cgroup cgroup rw,memory
6290 6216 0:35 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:927 - cgroup cgroup rw,freezer
6291 6216 0:36 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:928 - cgroup cgroup rw,hugetlb
6320 6216 0:37 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:929 - cgroup cgroup rw,pids
6322 6216 0:39 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:931 - cgroup cgroup rw,blkio
6323 6216 0:40 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:932 - cgroup cgroup rw,cpuset
6324 6216 0:41 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:933 - cgroup cgroup rw,perf_event
6325 6216 0:42 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:934 - cgroup cgroup rw,devices'
+ docker_cgroup=/docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556

So there seems to be a problem with this pipe, even though docker_cgroup is actually assigned. I assume that's because one of the commands in the pipeline fails and the script aborts. Some links:
In case this doesn't work, here's an alternative solution:
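A minimal sketch of the suspected failure mode and one pipe-free alternative; this assumes the root cause really is the pipe in the entrypoint interacting with pipefail/errexit, which the thread only hypothesizes:

```bash
#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

# Stand-in for the cgroup mount lines shown in the trace above.
cgroup_mounts="$(grep -E ' - cgroup ' /proc/self/mountinfo || true)"

# Pattern similar to the entrypoint: if head(1) exits after the first line
# before echo has finished writing, echo dies with SIGPIPE (exit code 141);
# with pipefail the whole pipeline is reported as failed, and errexit then
# aborts the script even though docker_cgroup was in fact assigned.
docker_cgroup=$(echo "${cgroup_mounts}" | head -n 1 | cut -d' ' -f 4)

# Pipe-free alternative: take the first line with parameter expansion and
# split it with read, so no command in a pipeline can be killed by SIGPIPE.
first_line="${cgroup_mounts%%$'\n'*}"
read -r _ _ _ docker_cgroup _ <<< "${first_line}"
echo "docker_cgroup=${docker_cgroup}"
```

The fix that actually landed in kind may differ; this only illustrates why avoiding the pipe sidesteps the intermittent failure.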
I implemented the alternative solution now, as it seems the more reliable way to avoid the pipe failed errors.

I think I'm at a point where it's hard to prove that the issue is gone, but it should be fixed with my custom kind images. Most of the failures in testgrid are caused by this issue, so I assume that if I can run the tests maybe 10 times in a row without hitting it, the probability is fairly high that it's gone.
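Something like the following loop is one way to get that kind of repeated signal locally; the make target and the GINKGO_FOCUS variable are assumptions about the e2e tooling rather than a documented interface:

```bash
# Re-run the quick-start e2e spec ten times and stop at the first failure.
for i in $(seq 1 10); do
  echo "=== e2e run ${i} ==="
  GINKGO_FOCUS="quick-start" make test-e2e || { echo "failed on run ${i}"; exit 1; }
done
echo "10 consecutive green runs"
```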
/reopen
@sbueringer: Reopened this issue.
The issue will be fixed soon in kind (kubernetes-sigs/kind#2179). After the next kind release, which should be in the next 1-2 weeks, we can upgrade to newer images with the new entrypoint script, which should fix our flakes. Not sure how the new images are tagged, but I assume we can just update to the newest v1.19.* and v1.18.* images then. Can we also update to the latest kind version and to 1.20.* and 1.19.* images, or do we want to explicitly test the upgrade from 1.18 to 1.19?
Status update: #4469 has been merged, so I'll take a look over the next few days to check whether the "container is not running" issue is gone as expected.
Still occurs (link), but that makes sense, as in some tests we're still using the old images. I'll open a PR (#4663) to upgrade to the latest images. I think right now we're only using old/pinned images in our "regular tests" on the default branch. Our (periodic) upgrade tests should either use an image published by the latest kind release or build the image locally with kind v0.11.0.
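Roughly what the two options look like with kind v0.11.0; the tags and the Kubernetes checkout location are placeholders, not the values used by the actual jobs:

```bash
# Option 1: use an image published for the latest kind release
# (the exact tag/digest should come from the kind release notes).
docker pull kindest/node:v1.21.1

# Option 2: build a node image locally from a Kubernetes checkout with
# kind v0.11.0; tag and checkout location are illustrative.
cd "${GOPATH}/src/k8s.io/kubernetes"
kind build node-image --image kindest/node:v1.19-local
```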
/priority important-soon
@sbueringer We are definitely in better shape than before. Thanks for this work! /close
@fabriziopandini: Closing this issue.
What steps did you take and what happened:
According to testgrid,

capi-e2e.When following the Cluster API quick-start [PR-Blocking] Should create a workload cluster

and a few others are failing from time to time. I looked at the last two occurrences in the capi-quickstart test. In both cases a machine did not come up because

mkdir -p /etc/kubernetes/pki

was failing because the respective container was not running. It was retried for a while, but the container didn't come up. I tried to find any other logs but couldn't find anything. Logs from the controllers, aggregated and sorted for the affected node of this test: https://gist.github.com/sbueringer/e007c989c158d66dd6d3078f8c904f30 (ProwJob)

I think right now we don't have the necessary data/logs to find out why this happens. I would propose to gather the logs of the Docker service which is used in those tests (the dind used in the ProwJob). Maybe there's something interesting there. Are there any other Docker / kind / ... logs which we could retrieve?
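To make the symptom concrete, the exec-side error can be reproduced in isolation against any container that is not running (names and image are placeholders; this only mimics what CAPD sees, not how the node container got into that state):

```bash
# Create a container that exits immediately, then exec into it the way the
# bootstrap step does; docker rejects the exec because the container is not
# running, which is the error seen in the flaky tests.
docker run -d --name repro-node busybox true
sleep 1
docker exec repro-node mkdir -p /etc/kubernetes/pki \
  || echo "exec failed: container is not running"
docker rm -f repro-node
```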
What I found in the kubekins image we're using:
What did you expect to happen:
Anything else you would like to add:
I assume the following test failures are related:
So tl;dr: apart from the MachineRemediation test, most of our other flaky tests are probably caused by this issue. They are usually failing in the following lines of code:
cluster-api/test/framework/controlplane_helpers.go (line 109 in 7478817)
cluster-api/test/framework/controlplane_helpers.go (line 146 in 7478817)
cluster-api/test/framework/machinedeployment_helpers.go (line 120 in 7478817)
cluster-api/test/framework/machinepool_helpers.go (line 85 in 7478817)
Even if they don't all have the same root cause, fixing this error should fix most of them.
Environment:
Kubernetes version (use kubectl version):
OS (e.g. from /etc/os-release):

/kind bug