[kube-prometheus-stack] After successful deployment the prometheus-operator pod fails #742

Closed · Anonymous-Coward opened this issue Mar 8, 2021 · 1 comment · May be fixed by #4751
Labels: bug Something isn't working

Comments

Anonymous-Coward commented Mar 8, 2021

Describe the bug
The deployment runs fine, but the prometheus-operator pod fails almost immediately after it starts.

Helm Version:

version.BuildInfo{Version:"v3.5.2", GitCommit:"167aac70832d3a384f65f9745335e9fb40169dc2", GitTreeState:"dirty", GoVersion:"go1.15.7"}

Kubernetes Version:

Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:13:54Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.17+IKS", GitCommit:"efb199999989c04fd340dfcfbac6b54737d18f30", GitTreeState:"clean", BuildDate:"2021-02-09T00:24:29Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

Which chart:
prometheus-community/kube-prometheus-stack

Which version of the chart:
kube-prometheus-stack-14.0.0

What happened:
The helm setup went fine and all pods started. However, shortly after startup, the prometheus-operator pod failed, with a very short log:

ts=2021-03-08T19:08:17.263567905Z caller=main.go:99 msg="Staring insecure server on :8080"
level=info ts=2021-03-08T19:08:17.282302674Z caller=operator.go:452 component=prometheusoperator msg="connection established" cluster-version=v1.17.17+IKS
level=info ts=2021-03-08T19:08:17.282386097Z caller=operator.go:294 component=thanosoperator msg="connection established" cluster-version=v1.17.17+IKS
level=info ts=2021-03-08T19:08:17.282473789Z caller=operator.go:214 component=alertmanageroperator msg="connection established" cluster-version=v1.17.17+IKS
ts=2021-03-08T19:08:17.338940504Z caller=main.go:305 msg="Unhandled error received. Exiting..." err="getting CRD: Alertmanager: customresourcedefinitions.apiextensions.k8s.io \"alertmanagers.monitoring.coreos.com\" is forbidden: User \"system:serviceaccount:monitoring:prometheus-kube-prometheus-operator\" cannot get resource \"customresourcedefinitions\" in API group \"apiextensions.k8s.io\" at the cluster scope"
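
For anyone hitting the same error, a quick way to confirm that this is an RBAC problem is to ask the API server whether the operator's service account is allowed to read CRDs (the service account and namespace names below are taken from the error message above):

# Should answer "yes" once the required ClusterRoleBinding exists; "no" reproduces the failure.
kubectl auth can-i get customresourcedefinitions.apiextensions.k8s.io \
  --as=system:serviceaccount:monitoring:prometheus-kube-prometheus-operator

# List cluster role bindings and check whether any of them reference that service account.
kubectl get clusterrolebindings -o wide | grep prometheus-kube-prometheus-operator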

What you expected to happen:
All pods should keep running after deployment.

How to reproduce it (as minimally and precisely as possible):
Just do a similar deployment, i.e. with admission webhooks and TLS disabled (a rough example command is sketched after the values snippet below). The root cause, as far as I could investigate it, is described at the end of this ticket.

Changed values of values.yaml (only put values which differ from the defaults):

  tls:
    enabled: false
  admissionWebhooks:
    enabled: false
    patch:
      enabled: false
    tlsProxy:
      enabled: false
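
For illustration, an install with these overrides saved to a values file would look roughly like the following; the release name, namespace, and file name are placeholders, not the exact command that was run:

# Placeholder invocation -- release name, namespace and values file name are examples only.
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --version 14.0.0 \
  -f custom-values.yaml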

The helm command that you execute and failing/misfunctioning:

Helm values set after installation/upgrade:

grafana:
  admin:
    existingSecret: prometheus-grafana-credentials
    passwordKey: adminPassword
    userKey: adminLogin
  grafana.ini:
    auth.proxy:
      auto_sign_up: false
      enabled: true
      header_name: x-user-name
    dataproxy:
      logging: true
    log:
      level: debug
    server:
      domain: ...
      root_url: '%(protocol)s://%(domain)s/grafana'
    smtp:
      enabled: true
      from_address: [email protected]
      from_name: Grafana
      host: ...
      password: ...
      user: ...
    snapshots:
      external_enabled: false
  image:
    pullSecrets:
    - ...
  notifiers:
    notifiers.yaml:
      notifiers:
      - frequency: 1h
        is_default: true
        name: mattermost-notifier
        org_id: 1
        send_reminder: true
        settings:
          url: ...
        type: slack
        uid: mattermost
  service:
    targetPort: 8080
  sidecar:
    dashboards:
      enabled: true
      label: grafana_dashboard
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
    - ...
    logLevel: debug
    securityContext:
      fsGroup: 0
      runAsNonRoot: false
      runAsUser: 0
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes:
          - ReadWriteMany
          resources:
            requests:
              storage: 50Gi
          storageClassName: ...
prometheusOperator:
  admissionWebhooks:
    enabled: false
    patch:
      enabled: false
    tlsProxy:
      enabled: false
  image:
    pullPolicy: IfNotPresent
    repository: quay.io/coreos/prometheus-operator
    tag: v0.38.1
  tls:
    enabled: false

Anything else we need to know:
The prometheus-operator pod runs with the service account prometheus-kube-prometheus-operator. There is neither a ClusterRoleBinding nor a RoleBinding that grants that service account access to CustomResourceDefinitions.
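
As far as I understand it, cluster-scoped read access to CRDs would normally come from a ClusterRole plus ClusterRoleBinding along the following lines (a minimal sketch with illustrative names, not the chart's actual generated manifests; the chart is supposed to create its own RBAC objects when RBAC creation is enabled in the values):

# Minimal sketch only -- object names are illustrative, not what the chart generates.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-operator-crd-read
rules:
- apiGroups: ["apiextensions.k8s.io"]
  resources: ["customresourcedefinitions"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-operator-crd-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-operator-crd-read
subjects:
- kind: ServiceAccount
  name: prometheus-kube-prometheus-operator
  namespace: monitoring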

Anonymous-Coward added the bug label on Mar 8, 2021
Anonymous-Coward (Author) commented:

Duplicate. For some reason, the GitHub web UI created a double of this issue.
