Migrating to 1.8 with RBAC is incompatible #4163

Closed
naveensrinivasan opened this issue Dec 28, 2017 · 31 comments

@naveensrinivasan
Contributor

Thanks for submitting an issue! Please fill in as much of the template below as
you can.

------------- BUG REPORT TEMPLATE --------------------

  1. What kops version are you running? The command kops version, will display
    this information.
    Version 1.8.0 (git-4876009bd)

  2. What Kubernetes version are you running? kubectl version will print the
    version if a cluster is running or provide the Kubernetes version specified as
    a kops flag.
    v1.7.7

  3. What cloud provider are you using?
    aws

  4. What commands did you run? What is the simplest way to reproduce this issue?
    kops update cluster

  5. What happened after the commands executed?

  6. What did you expect to happen?
    Upgrade the cluster to v1.8.6

  7. Please provide your cluster manifest. Execute
    kops get --name my.example.com -oyaml to display your cluster manifest.
    You may want to remove your cluster name and other sensitive information.

  8. Please run the commands with most verbose logging by adding the -v 10 flag.
    Paste the logs into this report, or in a gist and provide the gist link here.

  9. Anything else do we need to know?

  • We are trying to upgrade the cluster from v1.7.7 to v1.8.6 with RBAC turned on.
  • We used kops built from the master branch to do the upgrade; kops version reports Version 1.8.0 (git-4876009bd)
I1227 16:17:34.682684       7 rbac.go:116] RBAC DENY: user "kubelet" groups ["system:nodes" "system:authenticated"] cannot "create" resource "pods" in namespace "kube-system"
I1227 16:17:34.682827       7 wrap.go:42] POST /api/v1/namespaces/kube-system/pods: (352.225µs) 403 [[kubelet/v1.8.6 (linux/amd64) kubernetes/6260bb0] 127.0.0.1:32806]
I1227 16:17:34.683112       7 rbac.go:116] RBAC DENY: user "kubelet" groups ["system:nodes" "system:authenticated"] cannot "create" resource "events" in namespace "default"
I1227 16:17:34.683175       7 wrap.go:42] POST /api/v1/namespaces/default/events: (204.479µs) 403 [[kubelet/v1.8.6 (linux/amd64) kubernetes/6260bb0] 127.0.0.1:32806]
I1227 16:17:34.684278       7 rbac.go:116] RBAC DENY: user "kubelet" groups ["system:nodes" "system:authenticated"] cannot "create" resource "events" in namespace "default"
I1227 16:17:34.684381       7 wrap.go:42] POST /api/v1/namespaces/default/events: (272.221µs) 403 [[kubelet/v1.8.6 (linux/amd64) kubernetes/6260bb0] 127.0.0.1:32806]
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2017-12-26T20:42:03Z
  name: k8s.playground.REDACTED.io
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://k8s.playground.REDACTED.io/k8s.playground.REDACTED.io
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    authorizationRbacSuperUser: admin
    storageBackend: etcd3
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.8.6
  masterInternalName: api.internal.k8s.playground.REDACTED.io
  masterPublicName: api.k8s.playground.REDACTED.io
  networkCIDR: 172.20.0.0/16
  networking:
    kubenet: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.20.32.0/19
    name: us-east-1a
    type: Public
    zone: us-east-1a
  - cidr: 172.20.64.0/19
    name: us-east-1b
    type: Public
    zone: us-east-1b
  - cidr: 172.20.96.0/19
    name: us-east-1c
    type: Public
    zone: us-east-1c
  topology:
    dns:
      type: Public
    masters: public
    nodes: public

We did run this yaml before migrating and it still didn't help.

kubectl get  clusterrolebinding system:node -o yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  creationTimestamp: 2017-12-26T20:53:27Z
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: system:node
  resourceVersion: "850"
  selfLink: /apis/rbac.authorization.k8s.io/v1beta1/clusterrolebindings/system%3Anode
  uid: d24fbe68-ea7e-11e7-a9e1-0201c744720e
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:nodes

@liggitt
Member

liggitt commented Dec 28, 2017

Do the authorization errors persist in the log after the API server has completed startup and /healthz returns a 200? Some denials during server startup are normal while the authorization cache fills.
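
A minimal way to check that (a sketch; assumes the kops defaults of the insecure apiserver port on 127.0.0.1:8080 and control-plane logs under /var/log on the master):

# On the master: confirm the apiserver reports healthy, then watch for further denials.
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/healthz
tail -f /var/log/kube-apiserver.log | grep 'RBAC DENY'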

@naveensrinivasan
Contributor Author

It continues and the cluster is inoperable.

@liggitt
Member

liggitt commented Dec 28, 2017

After upgrading, what does this show?

kubectl get clusterrolebinding system:node -o yaml
kubectl get clusterrole system:node -o yaml

@liggitt
Member

liggitt commented Dec 28, 2017

I also see this: https://github.com/kubernetes/kops/blob/1ff42edfac77df99ffa617113e51dad209ae0ce8/upup/models/cloudup/resources/addons/rbac.addons.k8s.io/k8s-1.8.yaml

I'm not familiar with what kops does on upgrade with the add-on bindings.
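
For anyone retracing this, the addon manifest can be applied by hand (a sketch; assumes the file has been saved locally as k8s-1.8.yaml and that you have cluster-admin credentials):

# Apply the kops RBAC addon bindings manually.
kubectl apply -f k8s-1.8.yaml
# Then check which cluster role bindings cover the system:nodes group.
kubectl get clusterrolebindings | grep -i node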

@naveensrinivasan
Contributor Author

@liggitt I manually ran the above yaml and it didn't help.

The API server is unavailable after the upgrade, so all kubectl commands fail.

@liggitt
Member

liggitt commented Dec 28, 2017

kubelet permissions should not affect api server availability. I'm not sure how to debug further if the api server is unreachable. Do you have more apiserver logs that might be illuminating? @chrislovecnm any ideas of what else might be at play here?
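
On a kops-provisioned master the control-plane components log to files under /var/log, so the relevant logs can be collected with something like this (a sketch; paths assume the kops defaults):

sudo tail -n 200 /var/log/kube-apiserver.log
sudo tail -n 200 /var/log/kube-controller-manager.log
sudo tail -n 200 /var/log/kube-scheduler.log
sudo tail -n 200 /var/log/kube-proxy.log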

@KashifSaadat
Contributor

KashifSaadat commented Dec 29, 2017

@naveensrinivasan was RBAC already configured and working when the Cluster was on v1.7.7, or did you change it in the spec as part of the upgrade?

@liggitt not sure about the addons behaviour, but if performing an upgrade from v1.7 the necessary RoleBinding will already exist, so I suspect that isn't the issue.

@naveensrinivasan
Contributor Author

@KashifSaadat RBAC was already configured and working when the cluster was v1.7.7

@naveensrinivasan
Contributor Author

Here are the log files. https://gist.github.com/naveensrinivasan/80eb10aa3bd2259139b48a6a78100357

I don't know exactly when I grabbed them. These are from the master, and I grabbed all the logs:

  • api
  • controller
  • proxy
  • scheduler

@mqasimsarfraz

mqasimsarfraz commented Jan 2, 2018

I am hitting the same issue after the upgrade, using a different installation method, and I am sure the system:nodes group has the system:node role. Interestingly, it isn't just system:nodes; I see other groups, e.g. system:authenticated, affected as well.

RBAC DENY: user "system:kube-proxy" groups ["system:authenticated"] cannot "list" resource "services" cluster-wide

Following this, the API server never comes up and the Kubernetes control plane is down.

@liggitt
Member

liggitt commented Jan 2, 2018

@naveensrinivasan what does apiserver /healthz show while the API server is crashlooping in that state? do you have the full apiserver manifest used, including all flags?

seeing this, which makes me suspect issues writing to etcd:

I1226 17:20:58.368013       8 trace.go:76] Trace[2144299595]: "Create /api/v1/namespaces" (started: 2017-12-26 17:20:53.848730671 +0000 UTC) (total time: 4.5192501s):
Trace[2144299595]: [4.284563321s] [4.284499666s] About to store object in database
Trace[2144299595]: [4.5192501s] [234.686779ms] END
I1226 17:20:58.368361       8 wrap.go:42] POST /api/v1/namespaces: (4.519639312s) 500

@liggitt
Member

liggitt commented Jan 2, 2018

@mqasimsarfraz what is the output of a superuser in the system:masters group calling /healthz on the apiserver? RBAC denials could prevent other components from talking to the API server, but would not keep the API server from coming up. I suspect issues reading from and/or writing to etcd
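
A sketch of how to check that, either with admin (system:masters) credentials through kubectl or directly on the master if the insecure port is enabled:

kubectl get --raw /healthz
# or, on the master itself:
curl -i http://127.0.0.1:8080/healthz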

@mqasimsarfraz

@liggitt Where can I find that output? Also, the following is what I can find related to /healthz in the API server logs:

 /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:108 +0x1ca
logging error output: "[+]ping ok\n[+]etcd ok\n[+]poststarthook/generic-apiserver-start-informers ok\n[+]poststarthook/start-apiextensions-informers ok\n[+]poststarthook/start-apiextensions-controllers ok\n[-]poststarthook/bootstrap-controller failed: reason withheld\n[-]poststarthook/rbac/bootstrap-roles failed: reason withheld\n[-]poststarthook/ca-registration failed: reason withheld\n[+]poststarthook/start-kube-apiserver-informers ok\n[+]poststarthook/start-kube-aggregator-informers ok\n[+]poststarthook/apiservice-registration-controller ok\n[+]poststarthook/apiservice-status-available-controller ok\n[+]poststarthook/apiservice-openapi-controller ok\n[+]poststarthook/kube-apiserver-autoregistration ok\n[-]autoregister-completion failed: reason withheld\nhealthz check failed\n"
 [[kube-probe/1.8] 127.0.0.1:55014]

@liggitt
Member

liggitt commented Jan 2, 2018

formatted better, that shows:

[+]ping ok
[+]etcd ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[-]poststarthook/bootstrap-controller failed: reason withheld
[-]poststarthook/rbac/bootstrap-roles failed: reason withheld
[-]poststarthook/ca-registration failed: reason withheld
[+]poststarthook/start-kube-apiserver-informers ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[-]autoregister-completion failed: reason withheld
healthz check failed

the details for the failed hooks are available at these URLs:

/healthz/poststarthook/bootstrap-controller
/healthz/poststarthook/rbac/bootstrap-roles
/healthz/poststarthook/ca-registration
/healthz/autoregister-completion

@mqasimsarfraz

Can't find anything useful from those URLs:

[qasim.sarfraz@kube-master-03 ~]$ curl -i 127.0.0.1:8080/healthz/poststarthook/bootstrap-controller
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 02 Jan 2018 20:08:29 GMT
Content-Length: 36

internal server error: not finished
[qasim.sarfraz@kube-master-03 ~]$ curl -i 127.0.0.1:8080/healthz/poststarthook/rbac/bootstrap-roles
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 02 Jan 2018 20:08:42 GMT
Content-Length: 36

internal server error: not finished
[qasim.sarfraz@kube-master-03 ~]$ curl -i 127.0.0.1:8080/healthz/poststarthook/ca-registration
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 02 Jan 2018 20:08:51 GMT
Content-Length: 36

internal server error: not finished
[qasim.sarfraz@kube-master-03 ~]$ curl -i 127.0.0.1:8080/healthz/autoregister-completion
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 02 Jan 2018 20:09:02 GMT
Content-Length: 495

internal server error: missing APIService: [v1. v1.authentication.k8s.io v1.authorization.k8s.io v1.autoscaling v1.batch v1.networking.k8s.io v1.rbac.authorization.k8s.io v1.storage.k8s.io v1alpha1.admissionregistration.k8s.io v1beta1.apiextensions.k8s.io v1beta1.apps v1beta1.authentication.k8s.io v1beta1.authorization.k8s.io v1beta1.batch v1beta1.certificates.k8s.io v1beta1.extensions v1beta1.policy v1beta1.rbac.authorization.k8s.io v1beta1.storage.k8s.io v1beta2.apps v2beta1.autoscaling]

@liggitt
Member

liggitt commented Jan 2, 2018

All of those point to etcd write errors/hangs. Did etcd setup change during the upgrade? What are the flags passed to the apiserver?
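
One way to capture the running flags (a sketch; the manifest path is the static-pod convention kops uses and varies by installer):

# Show the flags the running apiserver was started with.
ps aux | grep [k]ube-apiserver
# Or read them from the static pod manifest on the master.
sudo cat /etc/kubernetes/manifests/kube-apiserver.manifest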

@mqasimsarfraz

Ah, interesting. No, I haven't changed it, but let me try to check the etcd dumps. Also, the following are the flags passed to the apiserver:

    - --advertise-address=10.1.165.137
    - --etcd-servers=https://10.1.165.214:2379,https://10.1.165.66:2379,https://10.1.165.240:2379
    - --etcd-quorum-read=true
    - --etcd-cafile=/etc/ssl/etcd/ssl/ca.pem
    - --etcd-certfile=/etc/ssl/etcd/ssl/node-kube-master-03.example.com.pem
    - --etcd-keyfile=/etc/ssl/etcd/ssl/node-kube-master-03.example.com-key.pem
    - --insecure-bind-address=0.0.0.0
    - --apiserver-count=3
    - --admission-control=Initializers,NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,GenericAdmissionWebhook,ResourceQuota
    - --service-cluster-ip-range=10.234.0.0/18
    - --service-node-port-range=30000-32767
    - --client-ca-file=/etc/kubernetes/ssl/ca.pem
    - --profiling=false
    - --repair-malformed-updates=false
    - --kubelet-client-certificate=/etc/kubernetes/ssl/node-kube-master-03.example.com.pem
    - --kubelet-client-key=/etc/kubernetes/ssl/node-kube-master-03.example.com-key.pem
    - --service-account-lookup=true
    - --tls-cert-file=/etc/kubernetes/ssl/apiserver.pem
    - --tls-private-key-file=/etc/kubernetes/ssl/apiserver-key.pem
    - --proxy-client-cert-file=/etc/kubernetes/ssl/apiserver.pem
    - --proxy-client-key-file=/etc/kubernetes/ssl/apiserver-key.pem
    - --service-account-key-file=/etc/kubernetes/ssl/apiserver-key.pem
    - --secure-port=6443
    - --insecure-port=8080
    - --storage-backend=etcd3
    - --runtime-config=admissionregistration.k8s.io/v1alpha1
    - --v=2
    - --allow-privileged=true
    - --anonymous-auth=False
    - --authorization-mode=RBAC
    - --feature-gates=Initializers=true

@chrislovecnm
Contributor

I noticed that etcd is not set up for etcd3, by the way. Check, but I think you are still running etcd2.

@chrislovecnm
Contributor

You have

storageBackend: etcd3

But you are not setting the etcd version in the manifest, which is required.
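
For reference, a sketch of what that looks like in the kops cluster spec (the version value is only an example; use whichever etcd 3.x release your kops version supports):

etcdClusters:
- etcdMembers:
  - instanceGroup: master-us-east-1a
    name: a
  name: main
  version: 3.0.17  # example; the version must be set explicitly when switching to etcd3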

@mqasimsarfraz

@liggitt thanks for the pointer; for me it was etcd. The etcd cluster was misbehaving for some reason, and everything is back to normal now that I've fixed it. I wonder why etcd was marked ok in the health check, and why there wasn't any logging for the etcd failure.

[+]etcd ok

Thanks again!

@naveensrinivasan
Contributor Author

naveensrinivasan commented Jan 2, 2018

I have it running as etcd3

kubeAPIServer:
    authorizationRbacSuperUser: admin
    storageBackend: etcd3

@liggitt
Member

liggitt commented Jan 2, 2018

@naveensrinivasan and is your etcd cluster an etcd3 cluster? What version is it running?
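
A quick way to check that on a kops master (a sketch; 4001 is the kops default client port for the main etcd cluster, adjust if yours differs):

curl -s http://127.0.0.1:4001/version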

@naveensrinivasan
Contributor Author

@liggitt It was running etcd2, and as part of the upgrade I had to change it to etcd3.

@liggitt
Member

liggitt commented Jan 2, 2018

Did you migrate the etcd data from the etcd2 store to the etcd3 store? You cannot simply upgrade the etcd binary and switch to etcd3 mode. If you didn't do a migration, you should continue to run Kubernetes in etcd2 mode as long as you have v2 data (even against an etcd3 server).
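
For completeness, the offline etcd2-to-etcd3 data migration looks roughly like this (a sketch only; stop etcd and the apiserver first, back up the data directory, and treat the path as a placeholder for your own etcd data dir):

ETCDCTL_API=3 etcdctl migrate --data-dir=/path/to/etcd/data
# afterwards, restart etcd and start the apiserver with --storage-backend=etcd3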

@naveensrinivasan
Contributor Author

Nope, I didn't migrate. I was trying to use etcd2 in kops for 1.8 and I was running into issues which made me change to etcd3.

@chrislovecnm Would kops upgrade to v1.8 without moving to etcd3?

@liggitt
Member

liggitt commented Jan 2, 2018

You can continue to use etcd2 (or etcd3 in etcd2 mode) against 1.8 and 1.9

@naveensrinivasan
Contributor Author

How do you use etcd2 mode with etcd3?

@liggitt
Member

liggitt commented Jan 2, 2018

Run etcd3 binaries and start the kube apiserver with --storage-backend=etcd2

Kubernetes will continue to use the v2 API (which etcd3 still supports) and will have access to your old v2 data via it.
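
In flag terms, that is roughly (a sketch):

# etcd3 server binaries, apiserver still speaking the etcd v2 API:
kube-apiserver --storage-backend=etcd2 --etcd-servers=http://127.0.0.1:4001 ...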

@naveensrinivasan
Contributor Author

Thanks. I don't know if kops is doing this, or whether it is possible to do this in kops.

@chrislovecnm
Contributor

Yes, remove the etcd3 line in your manifest, or edit your cluster.
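
In kops terms, that is something like (a sketch; substitute your own cluster name):

kops edit cluster k8s.playground.REDACTED.io
# then remove (or change) this line under kubeAPIServer:
#   storageBackend: etcd3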

@naveensrinivasan
Contributor Author

I think the issue was that I was using kops from the master branch (or another version), which messed up the whole migration. I pulled the release version of kops 1.8 and it is working. Thanks!
