
after upgrade k3s continuous "1 controller.go:135] error syncing 'system-upgrade/k3s-agent' messages" #72

Closed
hlugt opened this issue May 10, 2020 · 9 comments

Comments

@hlugt

hlugt commented May 10, 2020

Version
rancher/system-upgrade-controller:v0.5.0
rancher/kubectl:v1.18.2
on
rancher/k3s: v1.18.2-k3s1

Platform/Architecture
arm64 (pine64: rock64pro master, rock64 nodes)

Describe the bug
After a successful upgrade run, the upgrade agent pods are deleted, but the upgrade controller running on the master keeps complaining with error syncing messages:
"1 controller.go:135] error syncing 'system-upgrade/k3s-agent': handler system-upgrade-controller: Operation cannot be fulfilled on "system-upgrade-controller": delaying object set, requeuing"
"1 controller.go:135] error syncing 'system-upgrade/k3s-server': handler system-upgrade-controller: Get https://update.k3s.io/v1-release/channels/latest: dial tcp: lookup update.k3s.io on 10.43.0.10:53: server misbehaving, requeuing" (NB: 10.43.0.10 is actual kube-dns pod address)
"1 controller.go:135] error syncing 'system-upgrade/k3s-agent': handler system-upgrade-controller: Get https://update.k3s.io/v1-release/channels/latest: dial tcp 10.0.0.1:443: connect: network is unreachable, requeuing"

To Reproduce
Have not tried to reproduce, as this would most likely mean downgrading the k3s cluster and redoing the upgrade.

Expected behavior
Would expect only messages about polling the channel for upgrades or plan changes.

Actual behavior
The upgrade controller started the upgrade with concurrency 1 on the agents. I saw what appears to be the expected behaviour: containers downloading and executing, the master/server upgrading, then switching to the agents, draining and upgrading. All jobs restarted successfully. All seems well, except for this continuous flooding of error syncing messages in the log.

Additional context
Attached the controller deployment and plan YAMLs:
1.system-upgrade-controller.txt
2.k3s_upgrade-plan.txt
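For reference, a k3s agent upgrade Plan of this kind typically looks roughly like the sketch below (the field values here are assumed defaults; the actual attached 2.k3s_upgrade-plan.txt may differ):

kubectl apply -f - <<'EOF'
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-agent
  namespace: system-upgrade
spec:
  concurrency: 1                       # upgrade one agent node at a time
  channel: https://update.k3s.io/v1-release/channels/latest
  serviceAccountName: system-upgrade
  nodeSelector:                        # only nodes explicitly labeled for upgrade
    matchExpressions:
      - {key: k3s-upgrade, operator: Exists}
  drain:
    force: true                        # drain each node before upgrading it
  upgrade:
    image: rancher/k3s-upgrade
EOF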

@dweomer
Contributor

dweomer commented May 10, 2020

@hlugt is it always those same 3 errors? It looks like there might be an underlying networking issue from the perspective of the SUC.

@hlugt
Author

hlugt commented May 11, 2020

Yes, it is the same 3 errors.

Maybe the (original) install/startup parameters of K3S are relevant to note?
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=<***> INSTALL_K3S_EXEC="--no-deploy traefik --no-deploy=servicelb" sh -s - --kubelet-arg="authentication-token-webhook=true" --kubelet-arg="authorization-mode=Webhook" --kubelet-arg="address=0.0.0.0"

Any advice on what and how to check for these possible networking issues?

(Considering deleting the SUC pod... but then I will not be able to troubleshoot any further, I guess.)
(By the way, maybe also good to know: I have Traefik 1.7.24 installed separately, plus MetalLB 0.9.3.)

@dweomer
Contributor

dweomer commented May 11, 2020

@hlugt these two (related?) errors look like the underlying cause to me: the failed DNS lookup of update.k3s.io on 10.43.0.10:53 ("server misbehaving") and the "dial tcp 10.0.0.1:443: connect: network is unreachable" error.
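A quick way to probe DNS and the channel lookup from inside the cluster could be something like the following sketch (the busybox tag and the throwaway pod name are just placeholders):

# resolve via the pod's default resolver
kubectl run nettest --rm -it --restart=Never --image=busybox:1.31 -- nslookup update.k3s.io
# resolve explicitly via the kube-dns service IP
kubectl run nettest --rm -it --restart=Never --image=busybox:1.31 -- nslookup update.k3s.io 10.43.0.10
# check what the controller itself is logging
kubectl -n system-upgrade logs deploy/system-upgrade-controller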

@dweomer
Contributor

dweomer commented May 11, 2020

Shoot, I wonder if you are running into k3s-io/k3s#1719? Is the coredns pod running on a different host than the SUC?
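One way to check that, as a sketch assuming the default labels and namespaces:

# node placement of coredns vs. the controller (compare the NODE column)
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
kubectl -n system-upgrade get pods -o wide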

@hlugt
Author

hlugt commented May 11, 2020

Shoot, I wonder if you are running into rancher/k3s#1719 ? Is the coredns pod running on a different host than the SUC?

No, they are both on the master.

@hlugt
Author

hlugt commented May 11, 2020

Yes, it would have been best to get into the SUC pod to test the network access. From another pod (that had curl installed) I was able to access the URL...
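Roughly something like the following, with <pod-with-curl> standing in for any pod that happens to have curl available:

# fetch the channel URL from inside the cluster network
kubectl exec -it <pod-with-curl> -- curl -sv https://update.k3s.io/v1-release/channels/latest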

Afraid though that I have rebuilt my cluster (and forgot to stick to 1.17.4 to be able to upgrade again) due to a messed-up Rancher UI install that kept waiting on the API when trying to import the cluster.
This also seems to point to connection issues, so I have skipped the extra kubelet arguments for now and will look further...
When I have resolved that issue I will return to the SUC...

For now I can imagine you want to put this on hold or even have me close it?

@dweomer
Contributor

dweomer commented May 11, 2020

For now I can imagine you want to put this on hold or even have me close it?

We can leave it open for now.

@hlugt
Author

hlugt commented May 11, 2020

OK, reverted to 1.17.4 and minimized the number of pods. Installed the SUC, and now the errors seem to stay away.
So either I had issues caused by too much I/O on the NFS due to too many concurrently restarting pods, or removing the kubelet args may have helped.

Will check in a few hours to see whether the errors stay away.
I would have expected info messages mentioning checks of the plan and the channel for available updates, but these are not shown?
(Right now the last message is from during the upgrade, saying it is unable to handle the requests get secrets, get nodes, get jobs.batch, get plans.upgrade.cattle.io -> that is as it should be, I presume?)
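One way to keep an eye on this is just to follow the controller log (a sketch, assuming the deployment name from the attached manifest):

kubectl -n system-upgrade logs -f deploy/system-upgrade-controller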

Regards

(btw: I think the above-mentioned Rancher UI API error is due to self-signed certs. It gave no issues when trying the secure k3s cluster import step, but curl indicates that I probably do need the insecure one...)

@hlugt
Author

hlugt commented May 12, 2020

Yep: all is well, no more log flooding.
I will close this; sorry for the bother.

@hlugt hlugt closed this as completed May 12, 2020