fix rootless e2e tests #2750

Closed
neolit123 opened this issue Aug 26, 2022 · 9 comments
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test.

Comments

@neolit123
Member

neolit123 commented Aug 26, 2022

slack thread in #release-ci-signal
https://kubernetes.slack.com/archives/CN0K3TE2C/p1661502210909039

started failing yesterday

kinder-rootless-control-plane-1:$ kubectl --request-timeout=2 --kubeconfig=/etc/kubernetes/admin.conf exec -n=kube-system etcd-kinder-rootless-control-plane-1 -- etcd --version
time="15:39:40" level=debug msg="Running: docker exec kinder-rootless-control-plane-1 kubectl --request-timeout=2 --kubeconfig=/etc/kubernetes/admin.conf exec -n=kube-system etcd-kinder-rootless-control-plane-1 -- etcd --version"

kinder-rootless-control-plane-1:$ Could not execute 'etcd --version' inside "kinder-rootless-control-plane-1" (attempt 8/10): Error from server: error dialing backend: dial tcp 172.17.0.4:10250: connect: connection refused: exit status 1


kinder-rootless-control-plane-1:$ kubectl --request-timeout=2 --kubeconfig=/etc/kubernetes/admin.conf exec -n=kube-system etcd-kinder-rootless-control-plane-1 -- etcd --version
time="15:39:40" level=debug msg="Running: docker exec kinder-rootless-control-plane-1 kubectl --request-timeout=2 --kubeconfig=/etc/kubernetes/admin.conf exec -n=kube-system etcd-kinder-rootless-control-plane-1 -- etcd --version"

kinder-rootless-control-plane-1:$ Could not execute 'etcd --version' inside "kinder-rootless-control-plane-1" (attempt 9/10): Error from server: error dialing backend: dial tcp 172.17.0.4:10250: connect: connection refused: exit status 1


kinder-rootless-control-plane-1:$ kubectl --request-timeout=2 --kubeconfig=/etc/kubernetes/admin.conf exec -n=kube-system etcd-kinder-rootless-control-plane-1 -- etcd --version
time="15:39:40" level=debug msg="Running: docker exec kinder-rootless-control-plane-1 kubectl --request-timeout=2 --kubeconfig=/etc/kubernetes/admin.conf exec -n=kube-system etcd-kinder-rootless-control-plane-1 -- etcd --version"
Error: failed to exec action cluster-info: exit status 1

kinder-rootless-control-plane-1:$ Could not execute 'etcd --version' inside "kinder-rootless-control-plane-1" (attempt 10/10): Error from server: error dialing backend: dial tcp 172.17.0.4:10250: connect: connection refused: exit status 1

 exit status 1

it seems the cluster-info task cannot exec 'etcd --version' in the etcd-kinder-rootless-control-plane-1 pod, for some reason.
the apiserver and etcd pods seem to be running.
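
for reference, the failing check boils down to retrying the same exec against the node; a rough, hypothetical reproduction (not kinder's actual code) of what the cluster-info task is doing here would be:

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// Hypothetical reproduction of the failing check: exec `etcd --version`
	// inside the etcd static pod through the node's admin kubeconfig,
	// retrying up to 10 times (the retry interval is an assumption).
	node := "kinder-rootless-control-plane-1"
	pod := "etcd-" + node
	for attempt := 1; attempt <= 10; attempt++ {
		out, err := exec.Command(
			"docker", "exec", node,
			"kubectl", "--request-timeout=2", "--kubeconfig=/etc/kubernetes/admin.conf",
			"exec", "-n=kube-system", pod, "--", "etcd", "--version",
		).CombinedOutput()
		if err == nil {
			fmt.Printf("%s", out)
			return
		}
		fmt.Printf("attempt %d/10 failed: %v\n%s", attempt, err, out)
		time.Sleep(5 * time.Second)
	}
}
```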

https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-rootless-latest
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kinder-rootless-latest/1562823830211137536

i don't see any potentially related changes in the k/k commit range between the last passing and first failing runs:
kubernetes/kubernetes@b87a436...7627791

@neolit123 neolit123 added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Aug 26, 2022
@neolit123 neolit123 added this to the v1.26 milestone Aug 26, 2022
@neolit123
Member Author

neolit123 commented Aug 26, 2022

the same check is not failing in this "latest" job:
https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-upgrade-1-25-latest

kinder-upgrade-control-plane-1:$ kubectl --request-timeout=2 --kubeconfig=/etc/kubernetes/admin.conf exec -n=kube-system etcd-kinder-upgrade-control-plane-1 -- etcd --version
time="06:59:58" level=debug msg="Running: docker exec kinder-upgrade-control-plane-1 kubectl --request-timeout=2 --kubeconfig=/etc/kubernetes/admin.conf exec -n=kube-system etcd-kinder-upgrade-control-plane-1 -- etcd --version"

kinder-upgrade-control-plane-1:$ Using etcdctl version: 3.5.4

very odd...i wonder how the "rootless" mode is related here.

@pacoxu

This comment was marked as abuse.

@neolit123
Member Author

Sorry, I misunderstood the errors.Wrap()

yeah the contributor confirmed it will be nil
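
for anyone following along, the errors.Wrap behavior referred to above can be shown with a tiny standalone snippet (using github.com/pkg/errors): Wrap returns nil when the error it wraps is nil.

```go
package main

import (
	"fmt"

	"github.com/pkg/errors"
)

func main() {
	var err error // nil
	// errors.Wrap returns nil if the error it is given is nil,
	// so there is nothing to propagate in that case.
	wrapped := errors.Wrap(err, "some context")
	fmt.Println(wrapped == nil) // prints: true
}
```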

@pacoxu
Member

pacoxu commented Aug 26, 2022

There are some error messages in the etcd log like the ones below.

The log is from 172.17.0.4 (control-plane-1); 172.17.0.5 is control-plane-3.

2022-08-25T15:39:08.582412869Z stderr F {"level":"warn","ts":"2022-08-25T15:39:08.581Z","caller":"etcdserver/cluster_util.go:288","msg":"failed to reach the peer URL","address":"https://172.17.0.5:2380/version","remote-member-id":"19d35eba7c9d3dbf","error":"Get \"https://172.17.0.5:2380/version\": dial tcp 172.17.0.5:2380: connect: connection refused"}
2022-08-25T15:39:08.582469488Z stderr F {"level":"warn","ts":"2022-08-25T15:39:08.582Z","caller":"etcdserver/cluster_util.go:155","msg":"failed to get version","remote-member-id":"19d35eba7c9d3dbf","error":"Get \"https://172.17.0.5:2380/version\": dial tcp 172.17.0.5:2380: connect: connection refused"}
2022-08-25T15:39:09.32215246Z stderr F {"level":"warn","ts":"2022-08-25T15:39:09.321Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"19d35eba7c9d3dbf","rtt":"9.916407ms","error":"dial tcp 172.17.0.5:2380: connect: connection refused"}
2022-08-25T15:39:09.329435331Z stderr F {"level":"warn","ts":"2022-08-25T15:39:09.329Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"19d35eba7c9d3dbf","rtt":"1.141885ms","error":"dial tcp 172.17.0.5:2380: connect: connection refused"}
2022-08-25T15:39:12.366943809Z stderr F {"level":"info","ts":"2022-08-25T15:39:12.366Z","caller":"rafthttp/stream.go:249","msg":"set message encoder","from":"40fd14fa28910cab","to":"19d35eba7c9d3dbf","stream-type":"stream MsgApp v2"}
2022-08-25T15:39:12.367010606Z stderr F {"level":"info","ts":"2022-08-25T15:39:12.366Z","caller":"rafthttp/peer_status.go:53","msg":"peer became active","peer-id":"19d35eba7c9d3dbf"}
2022-08-25T15:39:12.367021511Z stderr F {"level":"info","ts":"2022-08-25T15:39:12.366Z","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","local-member-id":"40fd14fa28910cab","remote-peer-id":"19d35eba7c9d3dbf"}
2022-08-25T15:39:12.375789688Z stderr F {"level":"info","ts":"2022-08-25T15:39:12.375Z","caller":"rafthttp/stream.go:249","msg":"set message encoder","from":"40fd14fa28910cab","to":"19d35eba7c9d3dbf","stream-type":"stream Message"}
2022-08-25T15:39:12.375851336Z stderr F {"level":"info","ts":"2022-08-25T15:39:12.375Z","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream Message","local-member-id":"40fd14fa28910cab","remote-peer-id":"19d35eba7c9d3dbf"}
2022-08-25T15:39:12.380608129Z stderr F {"level":"info","ts":"2022-08-25T15:39:12.380Z","caller":"rafthttp/stream.go:412","msg":"established TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"40fd14fa28910cab","remote-peer-id":"19d35eba7c9d3dbf"}
2022-08-25T15:39:12.381931346Z stderr F {"level":"info","ts":"2022-08-25T15:39:12.381Z","caller":"rafthttp/stream.go:412","msg":"established TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"40fd14fa28910cab","remote-peer-id":"19d35eba7c9d3dbf"}
2022-08-25T15:39:13.715406586Z stderr F {"level":"warn","ts":"2022-08-25T15:39:13.715Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"172.17.0.5:53302","server-name":"","error":"EOF"}

kubelet exit log

Aug 25 15:39:33 kinder-rootless-control-plane-3 kubelet[2214]: I0825 15:39:33.455967    2214 container_manager_linux.go:302] "Creating device plugin manager" devicePluginEnabled=true
Aug 25 15:39:33 kinder-rootless-control-plane-3 kubelet[2214]: I0825 15:39:33.456031    2214 state_mem.go:36] "Initialized new in-memory state store"
Aug 25 15:39:33 kinder-rootless-control-plane-3 kubelet[2214]: I0825 15:39:33.456113    2214 util_unix.go:104] "Using this endpoint is deprecated, please consider using full URL format" endpoint="" URL="unix://"
Aug 25 15:39:33 kinder-rootless-control-plane-3 kubelet[2214]: W0825 15:39:33.456870    2214 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {
Aug 25 15:39:33 kinder-rootless-control-plane-3 kubelet[2214]:   "Addr": "",
Aug 25 15:39:33 kinder-rootless-control-plane-3 kubelet[2214]:   "ServerName": "",
Aug 25 15:39:33 kinder-rootless-control-plane-3 kubelet[2214]:   "Attributes": null,
Aug 25 15:39:33 kinder-rootless-control-plane-3 kubelet[2214]:   "BalancerAttributes": null,
Aug 25 15:39:33 kinder-rootless-control-plane-3 kubelet[2214]:   "Type": 0,
Aug 25 15:39:33 kinder-rootless-control-plane-3 kubelet[2214]:   "Metadata": null
Aug 25 15:39:33 kinder-rootless-control-plane-3 kubelet[2214]: }. Err: connection error: desc = "transport: Error while dialing dial unix: missing address"
Aug 25 15:39:33 kinder-rootless-control-plane-3 kubelet[2214]: E0825 15:39:33.457525    2214 run.go:74] "command failed" err="failed to run Kubelet: unable to determine runtime API version: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix: missing address\""
Aug 25 15:39:33 kinder-rootless-control-plane-3 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Aug 25 15:39:33 kinder-rootless-control-plane-3 systemd[1]: kubelet.service: Failed with result 'exit-code'.
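
the endpoint="" URL="unix://" line above suggests the kubelet was started with an empty container runtime endpoint, which lines up with the "missing address" dial error; a minimal sketch (an assumption about the mechanism, not the kubelet's actual code) of why a bare "unix://" leaves nothing to dial:

```go
package main

import (
	"fmt"
	"net/url"
)

func main() {
	// A bare "unix://" endpoint parses to an empty host and path,
	// so there is no socket address left to dial, hence "missing address".
	u, err := url.Parse("unix://")
	if err != nil {
		panic(err)
	}
	fmt.Printf("scheme=%q host=%q path=%q\n", u.Scheme, u.Host, u.Path)
	// Output: scheme="unix" host="" path=""
}
```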

@neolit123
Member Author

Error while dialing dial unix: missing address

that's odd, is it failing to dial the CRI socket for some reason?

might be worth reporting to SIG Node once we understand better what's happening...

@pacoxu
Member

pacoxu commented Aug 26, 2022

I found the problem. I will send a PR.

@pacoxu
Member

pacoxu commented Aug 26, 2022

During the kubeadm upgrade, the kubelet config gets corrupted by kubernetes/kubernetes#112000.

kubernetes/kubernetes#112062 will fix it.
The existing unit test doesn't cover the general case.

@neolit123
Member Author

neolit123 commented Aug 26, 2022

makes sense, yet the main upgrade job works, as mentioned here:
#2750 (comment)

@neolit123
Member Author

During the upgrade of kubeadm, the kubelet config

looks like that was the problem
