This repository has been archived by the owner on Feb 9, 2024. It is now read-only.

[BUG] Node degraded due to kube-apiserver after upgrade #1209

Closed
r0mant opened this issue Mar 5, 2020 · 5 comments
Assignees
Labels
kind/bug Something isn't working port/5.5 Requires port to version/5.5.x port/6.1 Requires port to version/6.1.x port/7.0 Requires port to version/7.0.x priority/1 Medium priority

Comments

@r0mant
Contributor

r0mant commented Mar 5, 2020

Describe the bug

This was observed in robotest. The upgrade operation completes successfully, but the cluster status shows unhealthy because the kube-apiserver unit is in a failed state on one of the nodes:

"nodes": [
  {"hostname":"robotest-660fd05b-node-1","advertise_ip":"10.142.15.236","role":"master","profile":"node","status":"healthy"},
  {"hostname":"robotest-660fd05b-node-2","advertise_ip":"10.142.15.238","role":"master","profile":"node","status":"degraded","failed_probes":["kube-apiserver.service (failed)"]},
  {"hostname":"robotest-660fd05b-node-0","advertise_ip":"10.142.15.237","role":"master","profile":"node","status":"healthy"}
]

Looking into the logs, apiserver shut down with an error on that node:

Mar 05 04:52:23 robotest-660fd05b-node-2 systemd[1]: kube-apiserver.service: Main process exited, code=exited, status=255/EXCEPTION
Mar 05 04:52:23 robotest-660fd05b-node-2 systemd[1]: kube-apiserver.service: Failed with result 'exit-code'.
Mar 05 04:52:23 robotest-660fd05b-node-2 systemd[1]: Stopped Kubernetes API Server.

On another node, kube-apiserver was running fine. Another interesting data point: according to the logs, two apiservers appeared to be running simultaneously (which shouldn't happen) until one of them shut down with the error.

To Reproduce

Unclear; it happened during a 3-node robotest upgrade.

Expected behavior

Logs

10.142.15.236-planet-journal-export.log.gz
10.142.15.238-planet-journal-export.log.gz

To view logs, decompress and import into a local journal:

gunzip ./planet-journal-export.log.gz
cat ./planet-journal-export.log | /lib/systemd/systemd-journal-remote -o ./journal/system.journal -
journalctl -D ./journal ...
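Once the journal is imported, one quick way to scan for the failure signature is to filter for the kube-apiserver unit's exit lines. A minimal sketch (the grep pattern is an assumption derived from the log excerpt above; for illustration it runs against an inline sample rather than live journalctl output):

```shell
# Filter journal lines for the kube-apiserver failure signature.
# In practice, pipe `journalctl -D ./journal` output into grep instead of
# the inline sample below (copied from the excerpt in this report).
matches=$(grep -E 'kube-apiserver\.service: (Main process exited|Failed with result)' <<'EOF'
Mar 05 04:52:23 robotest-660fd05b-node-2 systemd[1]: kube-apiserver.service: Main process exited, code=exited, status=255/EXCEPTION
Mar 05 04:52:23 robotest-660fd05b-node-2 systemd[1]: kube-apiserver.service: Failed with result 'exit-code'.
Mar 05 04:52:23 robotest-660fd05b-node-2 systemd[1]: Stopped Kubernetes API Server.
EOF
)
echo "$matches"
```

The pattern intentionally excludes the benign "Stopped Kubernetes API Server." line so only the abnormal-exit entries surface.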

Environment (please complete the following information):

  • OS [e.g. Redhat 7.4]:
  • Gravity [e.g. 5.5.4]:
  • Platform [e.g. Vmware, AWS]:

Additional context

@r0mant r0mant added kind/bug Something isn't working priority/1 Medium priority labels Mar 5, 2020
@wadells
Contributor

wadells commented Mar 5, 2020

I hit this on version/6.1.x in 5 different upgrade scenarios, all upgrading from 5.5.x versions to 6.1.latest. More info here:

#1203 (comment)

This appears to be flaky: two of the failed scenarios succeeded on a later retry.

@wadells
Contributor

wadells commented Mar 19, 2020

I believe your work in gravitational/planet#575 is meant to mitigate this issue, @a-palchikov. If I'm mistaken, feel free to un-assign.

@a-palchikov
Contributor

@walt - yes, this is my wip.

@wadells
Contributor

wadells commented Mar 24, 2020

Triaging robotest runs from 03-11 to 03-24 showed 95 (probable) occurrences of this: link

@a-palchikov a-palchikov added port/5.5 Requires port to version/5.5.x port/6.1 Requires port to version/6.1.x port/7.0 Requires port to version/7.0.x labels Jun 15, 2020
a-palchikov added a commit to gravitational/planet that referenced this issue Jun 16, 2020
a-palchikov added a commit to gravitational/planet that referenced this issue Jun 17, 2020
…ane units that are active on the leader, need to shut down (i.e. during fail-over or when the elections are explicitly paused) and (one of the) units fail to stop properly entering failed state which degrades cluster state as a result.

Updates gravitational/gravity#1209.
a-palchikov added a commit to gravitational/planet that referenced this issue Jun 17, 2020
…ane units that are active on the leader, need to shut down (i.e. during fail-over or when the elections are explicitly paused) and (one of the) units fail to stop properly entering failed state which degrades cluster state as a result. (#686)

Updates gravitational/gravity#1209.
@r0mant r0mant mentioned this issue Jun 18, 2020
14 tasks
a-palchikov added a commit to gravitational/planet that referenced this issue Jun 19, 2020
* Reset any control plane units that failed to stop.
* Address review comments

Updates github.com/gravitational/gravity/issues/1209.
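The reset step described in these commits can be pictured with a shell sketch. This is an illustration only: the actual fix lives in planet's codebase (gravitational/planet#575), and the unit list below is a hypothetical set of control-plane units, not taken from the source:

```shell
# Sketch of the mitigation: after a fail-over (or while elections are paused),
# any control-plane unit left in systemd's "failed" state is reset so it no
# longer degrades the node's status. Unit names here are illustrative.
for unit in kube-apiserver kube-controller-manager kube-scheduler; do
  state=$(systemctl is-failed "$unit.service" 2>/dev/null)
  if [ "$state" = "failed" ]; then
    systemctl reset-failed "$unit.service"
  fi
done
```

`systemctl reset-failed` only clears the failed marker; it does not restart the unit, which is the desired behavior for units that were supposed to stop on a non-leader node.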
a-palchikov added a commit to gravitational/planet that referenced this issue Jun 23, 2020
* Reset any control plane units that failed to stop.
* Address review comments

Updates github.com/gravitational/gravity/issues/1209.
@r0mant
Contributor Author

r0mant commented Jun 23, 2020

Done in 5.5, tracking forward-ports in #1740.

@r0mant r0mant closed this as completed Jun 23, 2020
a-palchikov added a commit to gravitational/planet that referenced this issue Jun 26, 2020
* Reset any control plane units that failed to stop.
* Address review comments

Updates github.com/gravitational/gravity/issues/1209.
Projects
None yet
Development

No branches or pull requests

3 participants