[BUG] Node degraded due to kube-apiserver after upgrade #1209
Labels
kind/bug (Something isn't working)
port/5.5 (Requires port to version/5.5.x)
port/6.1 (Requires port to version/6.1.x)
port/7.0 (Requires port to version/7.0.x)
priority/1 (Medium priority)
Comments
I hit this as well; it appears to be flaky, as two of the failed scenarios were successful upon a later retry.
I believe your work in gravitational/planet#575 is meant to mitigate this issue, @a-palchikov. If I'm mistaken, you're welcome to un-assign.
@walt - yes, this is my WIP.
Triaging robotest runs from 03-11 to 03-24 showed 95 (probable) occurrences of this: link
This was referenced Jun 9, 2020
a-palchikov added the port/5.5, port/6.1, and port/7.0 labels on Jun 15, 2020.
a-palchikov added a commit to gravitational/planet that referenced this issue on Jun 16, 2020.
a-palchikov added a commit to gravitational/planet that referenced this issue on Jun 17, 2020:
…ane units that are active on the leader, need to shut down (i.e. during fail-over or when the elections are explicitly paused) and (one of the) units fail to stop properly entering failed state which degrades cluster state as a result. Updates gravitational/gravity#1209.
a-palchikov added a commit to gravitational/planet that referenced this issue on Jun 17, 2020:
…ane units that are active on the leader, need to shut down (i.e. during fail-over or when the elections are explicitly paused) and (one of the) units fail to stop properly entering failed state which degrades cluster state as a result. Updates gravitational/gravity#1209.
a-palchikov added a commit to gravitational/planet that referenced this issue on Jun 17, 2020:
…ane units that are active on the leader, need to shut down (i.e. during fail-over or when the elections are explicitly paused) and (one of the) units fail to stop properly entering failed state which degrades cluster state as a result. Updates gravitational/gravity#1209.
a-palchikov added a commit to gravitational/planet that referenced this issue on Jun 17, 2020:
…ane units that are active on the leader, need to shut down (i.e. during fail-over or when the elections are explicitly paused) and (one of the) units fail to stop properly entering failed state which degrades cluster state as a result. (#686) Updates gravitational/gravity#1209.
a-palchikov added a commit to gravitational/planet that referenced this issue on Jun 19, 2020:
* Reset any control plane units that failed to stop. * Address review comments. Updates github.com/gravitational/gravity/issues/1209.
a-palchikov added a commit to gravitational/planet that referenced this issue on Jun 19, 2020:
* Reset any control plane units that failed to stop. * Address review comments. Updates github.com/gravitational/gravity/issues/1209.
a-palchikov added a commit to gravitational/planet that referenced this issue on Jun 19, 2020:
* Reset any control plane units that failed to stop. * Address review comments. Updates github.com/gravitational/gravity/issues/1209.
This was referenced Jun 19, 2020
a-palchikov added a commit to gravitational/planet that referenced this issue on Jun 23, 2020:
* Reset any control plane units that failed to stop. * Address review comments. Updates github.com/gravitational/gravity/issues/1209.
Done in 5.5, tracking forward-ports in #1740.
a-palchikov added a commit to gravitational/planet that referenced this issue on Jun 26, 2020:
* Reset any control plane units that failed to stop. * Address review comments. Updates github.com/gravitational/gravity/issues/1209.
a-palchikov added a commit to gravitational/planet that referenced this issue on Jun 26, 2020:
* Reset any control plane units that failed to stop. * Address review comments. Updates github.com/gravitational/gravity/issues/1209.
Describe the bug
This was observed in robotest. The upgrade operation completes successfully, but the cluster status shows unhealthy because the kube-apiserver unit is in a failed state on one of the nodes:
"nodes":[{"hostname":"robotest-660fd05b-node-1","advertise_ip":"10.142.15.236","role":"master","profile":"node","status":"healthy"},{"hostname":"robotest-660fd05b-node-2","advertise_ip":"10.142.15.238","role":"master","profile":"node","status":"degraded","failed_probes":["kube-apiserver.service (failed)"]},{"hostname":"robotest-660fd05b-node-0","advertise_ip":"10.142.15.237","role":"master","profile":"node","status":"healthy"}]
Looking into the logs, kube-apiserver shut down with an error on that node.
On another node, kube-apiserver was running fine. Another interesting data point is that, according to the logs, two apiservers appeared to be running simultaneously (which shouldn't happen) until one of them shut down with the error.
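The planet commits referenced above reset control plane units that fail to stop; until a fixed build is in place, a manual workaround along these lines (plain systemd commands run inside the planet container on the degraded node; this sequence is an assumption, not something recorded in this issue) should clear the failed probe:

# Confirm the unit is in the "failed" state reported by the status probe
systemctl status kube-apiserver.service

# Inspect why the unit failed to stop cleanly during the fail-over
journalctl -u kube-apiserver.service --no-pager -n 100

# Clear the failed state; the node health probe should recover on the next check
systemctl reset-failed kube-apiserver.service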
To Reproduce
Unclear, it happened during a 3-node robotest upgrade.
Expected behavior
Logs
10.142.15.236-planet-journal-export.log.gz
10.142.15.238-planet-journal-export.log.gz
To view the logs, unzip them and load them with the systemd journal tooling (a sketch follows).
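A minimal sketch for inspecting the exported journals, assuming systemd-journal-remote is available on the host (the exact command intended by the reporter is not preserved here, and the output file name is arbitrary):

# Decompress one of the exported journal files
gunzip 10.142.15.238-planet-journal-export.log.gz

# Convert the journal-export stream back into a binary journal file
# (the binary may live under /usr/lib/systemd/ depending on the distribution)
/lib/systemd/systemd-journal-remote -o node-2.journal 10.142.15.238-planet-journal-export.log

# Browse it with journalctl, e.g. filtering for the failed unit
journalctl --file=node-2.journal -u kube-apiserver.service --no-pager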
Environment (please complete the following information):
Additional context