
after upgrade k3s continuous "1 controller.go:135] error syncing 'system-upgrade/k3s-agent' messages" #72

Closed
hlugt opened this issue May 10, 2020 · 9 comments

Comments

@hlugt

hlugt commented May 10, 2020

Version
rancher/system-upgrade-controller:v0.5.0
rancher/kubectl:v1.18.2
on
rancher/k3s: v1.18.2-k3s1

Platform/Architecture
arm64 (pine64: rock64pro master, rock64 nodes)

Describe the bug
After a successful upgrade run, the upgrade agent pods are deleted, but the upgrade controller running on the master keeps complaining with error syncing messages:
"1 controller.go:135] error syncing 'system-upgrade/k3s-agent': handler system-upgrade-controller: Operation cannot be fulfilled on "system-upgrade-controller": delaying object set, requeuing"
"1 controller.go:135] error syncing 'system-upgrade/k3s-server': handler system-upgrade-controller: Get https://update.k3s.io/v1-release/channels/latest: dial tcp: lookup update.k3s.io on 10.43.0.10:53: server misbehaving, requeuing" (NB: 10.43.0.10 is actual kube-dns pod address)
"1 controller.go:135] error syncing 'system-upgrade/k3s-agent': handler system-upgrade-controller: Get https://update.k3s.io/v1-release/channels/latest: dial tcp 10.0.0.1:443: connect: network is unreachable, requeuing"

To Reproduce
Have not tried to reproduce, as this would most likely mean downgrading the k3s cluster and redoing the upgrade.

Expected behavior
Would expect only messages about polling the channel for upgrades or plan changes.

Actual behavior
The upgrade controller started the upgrade with concurrency 1 on the agents. I saw what appears to be the expected behaviour: containers downloading and executing, the master/server upgrading, then switching to the agents, draining and upgrading. All jobs restarted successfully. All seems well, except for this continuous flooding of error syncing messages in the log.

Additional context
Attached the controller deployment and plan YAMLs:
1.system-upgrade-controller.txt
2.k3s_upgrade-plan.txt
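For reference, a k3s agent upgrade Plan of this kind typically looks roughly like the sketch below (the field values here are assumed defaults; the actual attached 2.k3s_upgrade-plan.txt may differ):

kubectl apply -f - <<'EOF'
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-agent
  namespace: system-upgrade
spec:
  concurrency: 1                       # upgrade one agent node at a time
  channel: https://update.k3s.io/v1-release/channels/latest
  serviceAccountName: system-upgrade
  nodeSelector:                        # only nodes explicitly labeled for upgrade
    matchExpressions:
      - {key: k3s-upgrade, operator: Exists}
  drain:
    force: true                        # drain each node before upgrading it
  upgrade:
    image: rancher/k3s-upgrade
EOF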

@dweomer
Contributor

dweomer commented May 10, 2020

@hlugt is it always those same 3 errors? It looks like there might be an underlying networking issue from the perspective of the SUC.

@hlugt
Author

hlugt commented May 11, 2020

Yes, it is the same 3 errors.

Maybe the (original) install/startup parameters of K3S are relevant to note?
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=<***> INSTALL_K3S_EXEC="--no-deploy traefik --no-deploy=servicelb" sh -s - --kubelet-arg="authentication-token-webhook=true" --kubelet-arg="authorization-mode=Webhook" --kubelet-arg="address=0.0.0.0"

Any advice on what and how to check for these possible networking issues?

(Considering deleting the SUC pod... but then I will not be able to troubleshoot any further, I guess.)
(By the way, maybe also good to know: I have Traefik 1.7.24 installed separately, plus MetalLB 0.9.3.)

@dweomer
Contributor

dweomer commented May 11, 2020

@hlugt these two (related?) errors look like the underlying cause to me: the failed DNS lookup of update.k3s.io on 10.43.0.10:53 ("server misbehaving") and the "dial tcp 10.0.0.1:443: connect: network is unreachable" error.
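A quick way to probe DNS and the channel lookup from inside the cluster could be something like the following sketch (the busybox tag and the throwaway pod name are just placeholders):

# resolve via the pod's default resolver
kubectl run nettest --rm -it --restart=Never --image=busybox:1.31 -- nslookup update.k3s.io
# resolve explicitly via the kube-dns service IP
kubectl run nettest --rm -it --restart=Never --image=busybox:1.31 -- nslookup update.k3s.io 10.43.0.10
# check what the controller itself is logging
kubectl -n system-upgrade logs deploy/system-upgrade-controller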

@dweomer
Contributor

dweomer commented May 11, 2020

Shoot, I wonder if you are running into k3s-io/k3s#1719? Is the coredns pod running on a different host than the SUC?
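One way to check that, as a sketch assuming the default labels and namespaces:

# node placement of coredns vs. the controller (compare the NODE column)
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
kubectl -n system-upgrade get pods -o wide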

@hlugt
Author

hlugt commented May 11, 2020

Shoot, I wonder if you are running into rancher/k3s#1719 ? Is the coredns pod running on a different host than the SUC?

No, they are both on the master.

@hlugt
Author

hlugt commented May 11, 2020

Yes, it would have been best to get into the SUC pod to test the network access. From another pod (that had curl installed) I was able to access the URL...
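Roughly something like the following, with <pod-with-curl> standing in for any pod that happens to have curl available:

# fetch the channel URL from inside the cluster network
kubectl exec -it <pod-with-curl> -- curl -sv https://update.k3s.io/v1-release/channels/latest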

Afraid though that I have rebuilt my cluster (and forgot to stick to 1.17.4 to be able to upgrade again) due to a messed-up Rancher UI install that kept waiting on the API when trying to import the cluster.
This also seems to point to connection issues, so I have skipped the extra kubelet arguments for now and will look further...
When I have resolved that issue I will return to the SUC...

For now I can imagine you want to put this on hold or even have me close it?

@dweomer
Contributor

dweomer commented May 11, 2020

For now I can imagine you want to put this on hold or even have me close it?

We can leave it open for now.

@hlugt
Author

hlugt commented May 11, 2020

OK, reverted to 1.17.4 and minimized the number of pods. Installed the SUC, and now the errors seem to stay away.
So either I had issues caused by too much I/O on the NFS due to too many concurrently restarting pods, or removing the kubelet args may have helped.

Will check in a few hours to see whether the errors stay away.
I would have expected info messages mentioning checks of the plan and the channel for available updates, but these are not shown?
(Right now the last message is from during the upgrade, saying it is unable to handle the requests get secrets, get nodes, get jobs.batch, get plans.upgrade.cattle.io -> that is as it should be, I presume?)
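One way to keep an eye on this is just to follow the controller log (a sketch, assuming the deployment name from the attached manifest):

kubectl -n system-upgrade logs -f deploy/system-upgrade-controller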

Regards

(btw: I think the above-mentioned Rancher UI API error is due to self-signed certs. It gave no issues when trying the secure k3s cluster import step, but curl indicates that I probably do need the insecure one...)

@hlugt
Author

hlugt commented May 12, 2020

Yep: all is well, no more log flooding.
I will close this; sorry for the bother.

@hlugt hlugt closed this as completed May 12, 2020