This repository has been archived by the owner on Feb 9, 2024. It is now read-only.

[BUG] Node degraded due to kube-apiserver after upgrade #1209

Closed
r0mant opened this issue Mar 5, 2020 · 5 comments
Assignees
Labels
kind/bug Something isn't working port/5.5 Requires port to version/5.5.x port/6.1 Requires port to version/6.1.x port/7.0 Requires port to version/7.0.x priority/1 Medium priority

Comments

@r0mant
Contributor

r0mant commented Mar 5, 2020

Describe the bug

This was observed in robotest. The upgrade operation completes successfully, but the cluster status shows unhealthy because the kube-apiserver unit is in a failed state on one of the nodes:

"nodes": [
  {"hostname":"robotest-660fd05b-node-1","advertise_ip":"10.142.15.236","role":"master","profile":"node","status":"healthy"},
  {"hostname":"robotest-660fd05b-node-2","advertise_ip":"10.142.15.238","role":"master","profile":"node","status":"degraded","failed_probes":["kube-apiserver.service (failed)"]},
  {"hostname":"robotest-660fd05b-node-0","advertise_ip":"10.142.15.237","role":"master","profile":"node","status":"healthy"}
]

Looking into the logs, apiserver shut down with an error on that node:

Mar 05 04:52:23 robotest-660fd05b-node-2 systemd[1]: kube-apiserver.service: Main process exited, code=exited, status=255/EXCEPTION
Mar 05 04:52:23 robotest-660fd05b-node-2 systemd[1]: kube-apiserver.service: Failed with result 'exit-code'.
Mar 05 04:52:23 robotest-660fd05b-node-2 systemd[1]: Stopped Kubernetes API Server.

On another node, kube-apiserver was running fine. Another interesting data point: according to the logs, two apiservers appeared to be running simultaneously (which shouldn't happen) until one of them shut down with the error.

To Reproduce

Unclear; it happened during a 3-node robotest upgrade.

Expected behavior

Logs

10.142.15.236-planet-journal-export.log.gz
10.142.15.238-planet-journal-export.log.gz

To view logs, decompress and import into a local journal:

gunzip ./planet-journal-export.log.gz
cat ./planet-journal-export.log | /lib/systemd/systemd-journal-remote -o ./journal/system.journal -
journalctl -D ./journal ...
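Once the journal is imported, one quick way to scan for the failure signature is to filter for the kube-apiserver unit's exit lines. A minimal sketch (the grep pattern is an assumption derived from the log excerpt above; for illustration it runs against an inline sample rather than live journalctl output):

```shell
# Filter journal lines for the kube-apiserver failure signature.
# In practice, pipe `journalctl -D ./journal` output into grep instead of
# the inline sample below (copied from the excerpt in this report).
matches=$(grep -E 'kube-apiserver\.service: (Main process exited|Failed with result)' <<'EOF'
Mar 05 04:52:23 robotest-660fd05b-node-2 systemd[1]: kube-apiserver.service: Main process exited, code=exited, status=255/EXCEPTION
Mar 05 04:52:23 robotest-660fd05b-node-2 systemd[1]: kube-apiserver.service: Failed with result 'exit-code'.
Mar 05 04:52:23 robotest-660fd05b-node-2 systemd[1]: Stopped Kubernetes API Server.
EOF
)
echo "$matches"
```

The pattern intentionally excludes the benign "Stopped Kubernetes API Server." line so only the abnormal-exit entries surface.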

Environment (please complete the following information):

  • OS [e.g. Redhat 7.4]:
  • Gravity [e.g. 5.5.4]:
  • Platform [e.g. Vmware, AWS]:

Additional context

@r0mant r0mant added kind/bug Something isn't working priority/1 Medium priority labels Mar 5, 2020
@wadells
Contributor

wadells commented Mar 5, 2020

I hit this on version/6.1.x in 5 different upgrade scenarios, all upgrading from 5.5.x versions to 6.1.latest. More info here:

#1203 (comment)

This appears to be flaky: two of the failed scenarios succeeded on a later retry.

@wadells
Contributor

wadells commented Mar 19, 2020

I believe your work in gravitational/planet#575 is meant to mitigate this issue, @a-palchikov. If I'm mistaken, feel free to un-assign.

@a-palchikov
Contributor

@walt - yes, this is my wip.

@wadells
Contributor

wadells commented Mar 24, 2020

Triaging robotest runs from 03-11 to 03-24 showed 95 (probable) occurrences of this: link

@a-palchikov a-palchikov added port/5.5 Requires port to version/5.5.x port/6.1 Requires port to version/6.1.x port/7.0 Requires port to version/7.0.x labels Jun 15, 2020
a-palchikov added a commit to gravitational/planet that referenced this issue Jun 16, 2020
a-palchikov added a commit to gravitational/planet that referenced this issue Jun 17, 2020
…ane units that are active on the leader, need to shut down (i.e. during fail-over or when the elections are explicitly paused) and (one of the) units fail to stop properly entering failed state which degrades cluster state as a result.

Updates gravitational/gravity#1209.
a-palchikov added a commit to gravitational/planet that referenced this issue Jun 17, 2020
…ane units that are active on the leader, need to shut down (i.e. during fail-over or when the elections are explicitly paused) and (one of the) units fail to stop properly entering failed state which degrades cluster state as a result. (#686)

Updates gravitational/gravity#1209.
@r0mant r0mant mentioned this issue Jun 18, 2020
14 tasks
a-palchikov added a commit to gravitational/planet that referenced this issue Jun 19, 2020
* Reset any control plane units that failed to stop.
* Address review comments

Updates github.com/gravitational/gravity/issues/1209.
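The reset step described in these commits can be pictured with a shell sketch. This is an illustration only: the actual fix lives in planet's codebase (gravitational/planet#575), and the unit list below is a hypothetical set of control-plane units, not taken from the source:

```shell
# Sketch of the mitigation: after a fail-over (or while elections are paused),
# any control-plane unit left in systemd's "failed" state is reset so it no
# longer degrades the node's status. Unit names here are illustrative.
for unit in kube-apiserver kube-controller-manager kube-scheduler; do
  state=$(systemctl is-failed "$unit.service" 2>/dev/null)
  if [ "$state" = "failed" ]; then
    systemctl reset-failed "$unit.service"
  fi
done
```

`systemctl reset-failed` only clears the failed marker; it does not restart the unit, which is the desired behavior for units that were supposed to stop on a non-leader node.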
a-palchikov added a commit to gravitational/planet that referenced this issue Jun 23, 2020
* Reset any control plane units that failed to stop.
* Address review comments

Updates github.com/gravitational/gravity/issues/1209.
@r0mant
Contributor Author

r0mant commented Jun 23, 2020

Done in 5.5, tracking forward-ports in #1740.

@r0mant r0mant closed this as completed Jun 23, 2020
a-palchikov added a commit to gravitational/planet that referenced this issue Jun 26, 2020
* Reset any control plane units that failed to stop.
* Address review comments

Updates github.com/gravitational/gravity/issues/1209.
Projects
None yet
Development

No branches or pull requests

3 participants