kind create cluster fails with "ERROR: failed to create cluster" (slow disk operations?) #2416

Closed
shivam-51 opened this issue Aug 14, 2021 · 12 comments
Assignees
Labels
kind/support Categorizes issue or PR as a support question.

Comments

@shivam-51

Simply running kind create cluster fails with the error:
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged kind-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1

Complete logs: https://pastebin.ubuntu.com/p/BDwBJqkwcH/

What happened:
The cluster was not created.
What you expected to happen:
A cluster should be created
How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • kind version: (use kind version):
    kind v0.11.1 go1.16.4 linux/amd64

  • Kubernetes version: (use kubectl version):
    Kubernetes is not installed, as it's not a requirement according to https://kind.sigs.k8s.io/docs/user/quick-start/#creating-a-cluster

  • Docker version: (use docker info):
    Client: Docker Engine - Community
     Version: 20.10.8
     API version: 1.41
     Go version: go1.16.6
     Git commit: 3967b7d
     Built: Fri Jul 30 19:54:27 2021
     OS/Arch: linux/amd64
     Context: default
     Experimental: true

    Server: Docker Engine - Community
     Engine:
      Version: 20.10.8
      API version: 1.41 (minimum version 1.12)
      Go version: go1.16.6
      Git commit: 75249d8
      Built: Fri Jul 30 19:52:33 2021
      OS/Arch: linux/amd64
      Experimental: false
     containerd:
      Version: 1.4.9
      GitCommit: e25210fe30a0a703442421b0f60afac609f950a3
     runc:
      Version: 1.0.1
      GitCommit: v1.0.1-0-g4144b63
     docker-init:
      Version: 0.19.0
      GitCommit: de40ad0

@shivam-51 shivam-51 added the kind/bug Categorizes issue or PR as related to a bug. label Aug 14, 2021
@shivam-51
Author

shivam-51 commented Aug 14, 2021

@aojea @BenTheElder
journal.log file
journal.log

@aojea
Contributor

aojea commented Aug 16, 2021

That docker info output is missing some important information: drivers, storage backend, ...
Also, the journal.log is not enough; you should attach the whole kind export logs folder as a tarball.

@shivam-51
Author

That docker info output is missing some important information: drivers, storage backend, ...

docker info:

 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.6.1-docker)
  scan: Docker Scan (Docker Inc., v0.8.0)

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 5
 Server Version: 20.10.8
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: e25210fe30a0a703442421b0f60afac609f950a3
 runc version: v1.0.1-0-g4144b63
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.11.0-25-generic
 Operating System: Ubuntu 20.04.2 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 7.672GiB
 Name: shivam-dell
 ID: ZCSJ:TZEP:6M3E:7IYB:IKCW:DGUU:TMEE:KW3T:LHG5:7XQC:IYCI:R4YC
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Also, the journal.log is not enough; you should attach the whole kind export logs folder as a tarball.

kind export logs:
904057590.zip

@aojea
Contributor

aojea commented Aug 16, 2021

It seems your system doesn't have enough resources: most operations are taking a long time to execute and then failing.

2021-08-16T09:55:48.881995154Z stderr F 2021-08-16 09:55:48.881566 W | etcdserver: read-only range request "key:"/registry/namespaces/kube-system" " with result "range_response_count:1 size:351" took too long (1.888310872s) to execute

2021-08-16T09:56:21.086708931Z stderr F 2021-08-16 09:56:21.086625 W | etcdserver: read-only range request "key:"/registry/services/endpoints/default/kubernetes" " with result "range_response_count:1 size:418" took too long (1.947086569s) to execute

Most of the components are not able to work in that environment:

2021-08-16T09:56:12.990614822Z stderr F E0816 09:56:12.990412 1 leaderelection.go:361] Failed to update lock: Put "https://172.18.0.2:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": context deadline exceeded
2021-08-16T09:56:12.990668296Z stderr F I0816 09:56:12.990503 1 leaderelection.go:278] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
2021-08-16T09:56:12.990677743Z stderr F F0816 09:56:12.990604 1 controllermanager.go:284] leaderelection lost
2021-08-16T09:56:14.643490731Z stderr F goroutine 150 [running]:
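
To gauge how widespread the slow requests are across the exported logs, the etcd warnings quoted above can be tallied with a short script. This is only a minimal sketch and not part of kind: the function name slow_request_seconds is hypothetical, and the pattern assumes durations are printed in seconds or milliseconds, as in the "took too long" lines shown here.

```python
import re
from typing import Iterable, List

# Matches the etcd warning text seen in this thread, e.g.
# "took too long (1.888310872s) to execute".
# Assumption: only "s" and "ms" units are handled.
SLOW_REQUEST = re.compile(r"took too long \((\d+(?:\.\d+)?)(ms|s)\) to execute")


def slow_request_seconds(lines: Iterable[str]) -> List[float]:
    """Return the duration, in seconds, of every slow etcd request found in `lines`."""
    durations: List[float] = []
    for line in lines:
        match = SLOW_REQUEST.search(line)
        if match is None:
            continue
        value = float(match.group(1))
        # Normalize milliseconds to seconds so all results share one unit.
        durations.append(value / 1000.0 if match.group(2) == "ms" else value)
    return durations
```

Feeding it the lines of the etcd container log from the kind export logs folder gives a list whose length and maximum show whether the slowness is occasional or constant.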

@shivam-51
Author

It seems your system doesn't have enough resources: most operations are taking a long time to execute and then failing.

Does kind need a lot of resources? I have 8 GB of RAM on a dual-boot machine (Linux + Windows). Is that not enough?

sudo fdisk -l:

Disk model: ST2000LM007-1R81
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: A063354F-F25C-4029-9BBF-98342ED01F0D

Device          Start        End    Sectors   Size Type
/dev/sda1        2048    1333247    1331200   650M EFI System
/dev/sda2     1333248    1595391     262144   128M Microsoft reserved
/dev/sda3     1595392 1358851884 1357256493 647.2G Microsoft basic data
/dev/sda4  1358852096 1359947775    1095680   535M Windows recovery environment
/dev/sda5  1359949824 1361768447    1818624   888M Windows recovery environment
/dev/sda6  1361770496 2035150847  673380352 321.1G Microsoft basic data
/dev/sda7  3878352896 3879297023     944128   461M Windows recovery environment
/dev/sda8  3879297024 3904761855   25464832  12.1G Windows recovery environment
/dev/sda9  3904763904 3907004415    2240512   1.1G Windows recovery environment
/dev/sda10 2035150848 2218194943  183044096  87.3G Linux filesystem
/dev/sda11 2218194944 2237728767   19533824   9.3G Linux swap

Partition table entries are not in disk order.

and free -h

              total        used        free      shared  buff/cache   available
Mem:          7.7Gi       4.7Gi       334Mi       429Mi       2.6Gi       2.3Gi
Swap:            0B          0B          0B

@aojea
Contributor

aojea commented Aug 16, 2021

Does kind need a lot of resources? I have 8 GB of RAM on a dual-boot machine (Linux + Windows). Is that not enough?

It is enough, but your logs show very slow queries. Check with top and iotop whether it is a CPU or a storage IOPS problem. I can't say what the problem is, but it has to be something specific to your environment; that setup seems normal.
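
As a complement to top and iotop, a quick probe of synchronous write latency can hint at whether storage is the bottleneck, since etcd syncs its write-ahead log to disk before acknowledging writes. The sketch below is an illustration under stated assumptions, not a kind or etcd tool: fsync_latencies and summarize are hypothetical helpers, and the write count and payload size are arbitrary choices.

```python
import os
import statistics
import tempfile
import time
from typing import List


def fsync_latencies(directory: str, writes: int = 50, size: int = 2048) -> List[float]:
    """Return per-write fsync latencies in seconds for a temp file in `directory`."""
    payload = b"\0" * size
    latencies: List[float] = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # force the data to stable storage, as a database would
            latencies.append(time.perf_counter() - start)
    finally:
        # Always clean up the probe file, even if a write fails.
        os.close(fd)
        os.unlink(path)
    return latencies


def summarize(latencies: List[float]) -> str:
    """Format median and maximum latency in milliseconds."""
    ms = sorted(x * 1000.0 for x in latencies)
    return "median %.2f ms, max %.2f ms over %d writes" % (
        statistics.median(ms), ms[-1], len(ms))
```

For a meaningful reading, point it at a directory on the same filesystem as the Docker Root Dir (/var/lib/docker in the docker info output above). The etcd documentation suggests that the 99th percentile of disk sync latency should stay under roughly 10 ms; results far above that would be consistent with the slow requests in the logs.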

@shivam-51
Author

Watching iotop during kind create cluster, the speed of disk operations drops drastically during the Starting control-plane phase, but it's pretty fast in the other phases.
Is there absolutely no way to fix this? (apart from changing some hardware)

@aojea
Contributor

aojea commented Aug 17, 2021

Is there absolutely no way to fix this? (apart from changing some hardware)

Kind works for thousands of people and CI jobs, including ones with fewer resources and a similar setup (Ubuntu 20.04), for example GitHub Actions.

It looks like something is different on your machine, though I don't know what. The only symptom I can see is that something is very slow, but that is a matter of debugging performance on your system.

Try running etcd in memory (#845 (comment)). If that works, it means something in your setup is causing slow disk operations.
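
The linked comment is not reproduced in this thread, so the following is only a sketch of what such a setup might look like. It assumes that /tmp inside the kind node container is memory-backed and that pointing etcd's dataDir there through a kubeadm ClusterConfiguration patch is the intended approach. The helper names and the file name are hypothetical.

```python
import os
import tempfile
from typing import List, Optional

# Assumed kind config: moves etcd's data directory to /tmp/etcd inside the
# node, on the assumption that /tmp there is a tmpfs (memory-backed) mount.
ETCD_IN_MEMORY_CONFIG = """\
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |
  kind: ClusterConfiguration
  etcd:
    local:
      dataDir: /tmp/etcd
"""


def write_config(directory: Optional[str] = None) -> str:
    """Write the config into `directory` (a fresh temp dir by default); return its path."""
    directory = directory or tempfile.mkdtemp(prefix="kind-etcd-")
    path = os.path.join(directory, "kind-etcd-in-memory.yaml")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(ETCD_IN_MEMORY_CONFIG)
    return path


def create_command(config_path: str) -> List[str]:
    """Build the kind invocation that uses the generated config."""
    return ["kind", "create", "cluster", "--config", config_path]
```

Passing the result of create_command(write_config()) to subprocess.run would attempt the cluster creation. Data kept this way is lost when the node restarts, so this is a diagnostic step rather than a permanent setup.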

@carlosrecuero

Hi, I have faced the same error on different machines too.

After some tests, we realized it was caused by the length of the cluster name. Could it be your problem?

@shivam-51
Author

After some tests, we realized it was caused by the length of the cluster name. Could it be your problem?

That does not seem to be the issue, as I am running just kind create cluster.

@BenTheElder
Member

After some tests, we realized it was caused by the length of the cluster name. Could it be your problem?

This should also produce a warning if you try to use a name that is unlikely to work:

logger.Warnf("cluster name %q is probably too long, this might not work properly on some systems", opts.Config.Name)

If it's not, we could use a separate report with the length etc. to get that fixed. It's more difficult to track problems like this when they're interleaved into a single issue 😅
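
Why a long cluster name can break things is hinted at by the error at the top of this issue: the node container is named kind-control-plane, which is the cluster name plus a -control-plane suffix, and Linux limits hostnames to 64 characters. A sketch of that arithmetic follows, assuming the node name is also used as the container hostname. The function name and the exact rule are assumptions for illustration, not kind's actual check.

```python
# Linux hostname length limit (HOST_NAME_MAX).
HOST_NAME_MAX = 64

# Suffix seen in this issue's error output: "kind" + "-control-plane".
CONTROL_PLANE_SUFFIX = "-control-plane"


def node_name_too_long(cluster_name: str) -> bool:
    """Return True if the derived control-plane node name exceeds the hostname limit.

    Assumption: the node name is `<cluster name>-control-plane` and is used
    as the container hostname; kind's real threshold may differ.
    """
    return len(cluster_name + CONTROL_PLANE_SUFFIX) > HOST_NAME_MAX


def max_cluster_name_length() -> int:
    """Longest cluster name that still fits under the hostname limit."""
    return HOST_NAME_MAX - len(CONTROL_PLANE_SUFFIX)
```

Under these assumptions the default name kind is far below the limit, which matches the original poster's observation that name length is not the cause here.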


Regarding OP's issue, I suspect the disk speed: upstream Kubernetes / etcd has some timeout assumptions and can be pretty disk heavy. Trying etcd in memory per #2416 (comment) might be worth a shot.

Unfortunately there is not a ton we can do to reduce that need further. We do tweak things to reduce usage slightly on a few dimensions, but mostly usage is just down to Kubernetes. The things kind adds are very lightweight, and the customizations we make compared to a standard kubeadm call / build only make it lighter (e.g. dockerless + providerless build).

@BenTheElder BenTheElder changed the title kind create cluster fails with "ERROR: failed to create cluster" kind create cluster fails with "ERROR: failed to create cluster" (slow disk operations?) Nov 10, 2021
@BenTheElder BenTheElder added kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels Nov 10, 2021
@BenTheElder
Member

ST2000LM007-1R81

This disk looks fast enough to me, but it seems pretty clear something about this host setup is causing etcd to have very slow reads. Other than perhaps #845 (comment), I don't think we can do much about this in kind. etcd's needs come primarily from Kubernetes upstream.
