
some k8s components report unhealthy status after cluster bootstrap #64

Closed
sym3tri opened this issue Jun 23, 2016 · 9 comments
Labels
dependency/external · kind/bug · lifecycle/rotten · priority/Pmaybe · reviewed/won't fix

Comments

@sym3tri

sym3tri commented Jun 23, 2016

curl 127.0.0.1:8080/api/v1/componentstatuses

{
  "kind": "ComponentStatusList",
  "apiVersion": "v1",
  "metadata": {
    "selfLink": "/api/v1/componentstatuses"
  },
  "items": [
    {
      "metadata": {
        "name": "scheduler",
        "selfLink": "/api/v1/componentstatuses/scheduler",
        "creationTimestamp": null
      },
      "conditions": [
        {
          "type": "Healthy",
          "status": "False",
          "message": "Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: getsockopt: connection refused"
        }
      ]
    },
    {
      "metadata": {
        "name": "controller-manager",
        "selfLink": "/api/v1/componentstatuses/controller-manager",
        "creationTimestamp": null
      },
      "conditions": [
        {
          "type": "Healthy",
          "status": "False",
          "message": "Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: getsockopt: connection refused"
        }
      ]
    },
    {
      "metadata": {
        "name": "etcd-0",
        "selfLink": "/api/v1/componentstatuses/etcd-0",
        "creationTimestamp": null
      },
      "conditions": [
        {
          "type": "Healthy",
          "status": "True",
          "message": "{\"health\": \"true\"}"
        }
      ]
    }
  ]
}
@aaronlevy aaronlevy self-assigned this Jun 23, 2016
@aaronlevy
Contributor

Looks like it's hard-coded to expect that the scheduler + controller-manager are on the same host as the api-server: https://github.com/kubernetes/kubernetes/blob/04ce042ff9cfb32b2c776f755cc7abc886b8a441/pkg/master/master.go#L620-L623

We do not adhere to this assumption because the scheduler + controller-manager are deployments which could be on different hosts (and do not use host-networking).

@sym3tri would you be able to inspect this information from another api-endpoint? Maybe by inspecting pods in kube-system, or a specific set of pods via a label query?
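
For illustration, a label-selector query against the same local API endpoint could surface those pods directly; the k8s-app label values below are assumptions about how the self-hosted manifests label them, not something this issue confirms:

# list the self-hosted control-plane pods by label, then read status.phase / status.conditions
curl '127.0.0.1:8080/api/v1/namespaces/kube-system/pods?labelSelector=k8s-app%3Dkube-scheduler'
curl '127.0.0.1:8080/api/v1/namespaces/kube-system/pods?labelSelector=k8s-app%3Dkube-controller-manager'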

It seems like this componentstatus endpoint is somewhat contentious as it stands:
kubernetes/kubernetes#18610
kubernetes/kubernetes#19570
kubernetes/kubernetes#13216

@aaronlevy aaronlevy added dependency/external, kind/bug, priority/Pmaybe labels Jun 23, 2016
@bgrant0607

I have no love for the current componentstatuses endpoint.

I don't remember whether it was all captured in the proposal, but I think we iterated towards a consensus on Karl's component registration proposal, which you cited:
kubernetes/kubernetes#13216

Someone would need to work on it.

@aaronlevy
Contributor

I skimmed the proposal & I more or less agree that it's not exactly a pressing issue to have a single /componentstatuses api-endpoint.

I like the idea of fronting healthcheck endpoints with a service (e.g. "scheduler-health.kube-system.cluster.local"). Then if we wanted to drill down into how many of those pods are healthy, it's just a matter of querying the service itself.
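
As a rough sketch of that idea from the caller's point of view (the Service name, DNS form, and port here are illustrative assumptions, not an existing endpoint):

# probe the fronting Service; it would load-balance across ready scheduler pods
curl http://scheduler-health.kube-system.cluster.local:10251/healthz
# drill down to per-pod health by reading the Service's endpoints from the API
curl 127.0.0.1:8080/api/v1/namespaces/kube-system/endpoints/scheduler-health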

@sym3tri is this still blocking you for any reason? Would the health-check service endpoint be a reasonable end-goal? Or is directly querying the pods sufficient?

@sym3tri
Author

sym3tri commented Jul 19, 2016

@aaronlevy Directly querying the pods puts a lot of burden on the caller. If we can have a fronting service, that would be ideal.

Directly querying the pods is an OK workaround for the time being, but not a good long-term solution. We'd just be shifting the hardcoded services into our code, and there is no other way to query etcd health via the API.
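
A sketch of that interim workaround, which illustrates the burden on the caller (assuming jq is available, the scheduler pod carries a k8s-app=kube-scheduler label, and it serves /healthz on its pod IP at 10251, per the output above):

# look up the scheduler pod's IP via a label query, then probe its healthz endpoint directly
SCHEDULER_IP=$(curl -s '127.0.0.1:8080/api/v1/namespaces/kube-system/pods?labelSelector=k8s-app%3Dkube-scheduler' | jq -r '.items[0].status.podIP')
curl "http://${SCHEDULER_IP}:10251/healthz"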

@aaronlevy
Contributor

Opened #85 to track that feature specifically.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Apr 21, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels May 21, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
