Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kindest/node entrypoint sometimes fails with exit code 141 (SIGPIPE) #2178

Closed
sbueringer opened this issue Apr 2, 2021 · 5 comments · Fixed by #2179
Closed

kindest/node entrypoint sometimes fails with exit code 141 (SIGPIPE) #2178

sbueringer opened this issue Apr 2, 2021 · 5 comments · Fixed by #2179
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@sbueringer
Copy link
Member

sbueringer commented Apr 2, 2021

What happened:

Some context:

  • we're using kind for e2e tests in cluster-api
  • tests are using CAPD (ClusterAPI Provider Docker) to spin up other Docker-based clusters
  • tests are executed in Prow
  • tests are running in parallel
  • we're using the kindest/node image to spin up our own Docker-based clusters

Most of it shouldn't matter, but it might be helpful context. From time to time we're seeing that the kindest/node image fails on startup. According to docker inspect it fails with exit code 141 (logs). docker logs shows the following:

  • regular log:
"INFO: ensuring we can execute mount/umount even with userns-remap"
"INFO: remounting /sys read-only"
"INFO: making mounts shared"
"INFO: fix cgroup mounts for all subsystems"
  • log with set -x:
+ local docker_cgroup cgroup_subsystems subsystem
++ head -n 1
++ cut '-d ' -f 4
++ echo '6217 6216 0:29 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:923 - cgroup cgroup rw,xattr,name=systemd
6287 6216 0:32 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:924 - cgroup cgroup rw,cpu,cpuacct
6288 6216 0:33 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:925 - cgroup cgroup rw,net_cls,net_prio
6289 6216 0:34 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:926 - cgroup cgroup rw,memory
6290 6216 0:35 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:927 - cgroup cgroup rw,freezer
6291 6216 0:36 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:928 - cgroup cgroup rw,hugetlb
6320 6216 0:37 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:929 - cgroup cgroup rw,pids
6322 6216 0:39 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:931 - cgroup cgroup rw,blkio
6323 6216 0:40 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:932 - cgroup cgroup rw,cpuset
6324 6216 0:41 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:933 - cgroup cgroup rw,perf_event
6325 6216 0:42 /docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556 /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:934 - cgroup cgroup rw,devices'
+ docker_cgroup=/docker/61633ea3cf604679d3cc4292524d1d37dfbcdcce1a35b0ae6f487bd32a86e556

To me it looks like it fails here: https://github.com/kubernetes-sigs/kind/blob/v0.9.0/images/base/files/usr/local/bin/entrypoint#L82

The only reasonable explanation I could come up with is:

  • echo sends the docker_cgroup_mounts into the pipe (maybe line by line?)
  • head -n 1 exits after the first line
  • the pipe (and thus the whole script) fails because echo cannot send the other lines

I suspect it happens rarely because this might be caused by a CPU context switch (or something like this?) at the wrong moment, because of the high concurrency we have during our tests.

What you expected to happen:
kindest/node should not fail on startup.

How to reproduce it (as minimally and precisely as possible):
Really hard to reproduce. In our case we can be sure it's gone after about 10-20 successful test runs in a row.

Anything else we need to know?:

I changed the following and this seems to fix the issue in my case:

  • old code:
docker_cgroup=$(echo "${docker_cgroup_mounts}" | head -n 1 | cut -d' ' -f 4)
  • "fixed" code:
docker_cgroup=$( < <(echo "${docker_cgroup_mounts}") head -n 1 | cut -d' ' -f 4)

The idea is to move the echo out of the pipe so it should not be possible anymore to get a SIGPIPE/ pipefail.

The code in kind on the main branch changed a bit since the version 0.9 we're currently using, but there's one line left which should have the same issue:
https://github.com/kubernetes-sigs/kind/blob/main/images/base/files/usr/local/bin/entrypoint#L232

For reference, the respective cluster-api issue: kubernetes-sigs/cluster-api#4405

Environment:

  • kind version: (use kind version): 0.9
  • Kubernetes version: (use kubectl version): 1.19.1
  • Docker version: (use docker info):
Client: Docker Engine - Community
 Version: 20.10.3
 API version: 1.40
 Go version: go1.13.15
 Git commit: 48d30b5
 Built: Fri Jan 29 14:33:25 2021
 OS / Arch: linux / amd64
 Context: default
 Experimental: true

Server: Docker Engine - Community
 Engine:
  Version: 19.03.15
  API version: 1.40 (minimum version 1.12)
  Go version: go1.13.15
  Git commit: 99e3ed8919
  Built: Sat Jan 30 03:15:34 2021
  OS / Arch: linux / amd64
  Experimental: false
 containerd:
  Version: 1.4.3
  GitCommit: 269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version: 1.0.0-rc92
  GitCommit: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version: 0.18.0
  GitCommit: fec3683
  • OS (e.g. from /etc/os-release): not sure but the image we're using in the ProwJob is: gcr.io/k8s-testimages/kubekins-e2e:v20210330-fadf59c-go-canary
@sbueringer sbueringer added the kind/bug Categorizes issue or PR as related to a bug. label Apr 2, 2021
@BenTheElder
Copy link
Member

we've seen something like this before in kubernetes kubernetes/kubernetes#57805, I'm guessing piping to head has the same issue.

we should use the heredoc approach here as well, it's a bit cheaper / simpler

@BenTheElder
Copy link
Member

thanks for tracking this down and filing the very detailed report :-)
/assign

@sbueringer
Copy link
Member Author

sbueringer commented Apr 2, 2021

thanks for tracking this down and filing the very detailed report :-)
/assign

You're welcome :)

How/when is the new version of the script included in the images?

I assume something like:

  • after the merge on main on new images or
  • after the next release in new images

Thx for the fast fix

@BenTheElder
Copy link
Member

How/when is the new version of the script included in the images?

we push images as needed, the base image is updated in the fix PR, I'd push a node image but we're already working on revamped node images in #2176 which will hopefully be ready to go soon. If it doesn't we'll push one anyhow, we keep one fairly up to date for HEAD.

A full set of images is pushed on release, we were hoping for a release next week, but with people out it may slip another week (I am out all week, it looks like others are out some as well).

@sbueringer
Copy link
Member Author

Ah perfect. I think we'll wait for the full set, but a few weeks is what I hoped for so that's great.

We're currently using 1.19 and for upgrade tests also a 1.18 image (I'll have to check why not 1.20 and 1.19).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants