Grafana-agent-operator: integration pod failing (grafana-agent-metrics does not exist) #3282

Closed
samlaf opened this issue Mar 14, 2023 · 11 comments · Fixed by #5099
Comments


samlaf commented Mar 14, 2023

I'm installing the Grafana Agent Operator on our AWS EKS cluster (I tried on a local kind cluster and don't get this error there).

The grafana-agent-integrations-ds pod lands in this state:

pod/grafana-agent-integrations-ds-7nr7f                  1/2     CrashLoopBackOff   3 (22s ago)   71s

and the relevant logs show (the complete logs are at the bottom of this issue):

...
ts=2023-03-14T23:37:58.554159683Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter 
ts=2023-03-14T23:37:58.554735044Z caller=autoscrape.go:127 level=error component=autoscraper msg="cannot autoscrape integration" name=node_exporter/ip-10-0-2-49.ec2.internal:8080 err="instance monitoring/grafana-agent-metrics does not exist"
ts=2023-03-14T23:37:58.554766659Z caller=main.go:57 level=error msg="error creating the agent server entrypoint" err="configuring autoscraper failed: instance monitoring/grafana-agent-metrics does not exist"

Fix

I realized that this is fixed simply by restarting the integrations daemonset:
kubectl -n monitoring rollout restart daemonset grafana-agent-integrations-ds

Not sure why this is happening, but it would be nice not to have to restart the daemonset every time.
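
For anyone scripting the workaround, here is a minimal sketch that first checks whether the MetricsInstance the autoscraper references actually exists, then bounces the daemonset (resource names are taken from the manifest below):

# Confirm the MetricsInstance the autoscraper references exists
kubectl -n monitoring get metricsinstance grafana-agent-metrics
# Restart the integrations daemonset and wait for the rollout to settle
kubectl -n monitoring rollout restart daemonset grafana-agent-integrations-ds
kubectl -n monitoring rollout status daemonset grafana-agent-integrations-ds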

Steps to reproduce

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install grafana-agent-operator grafana/grafana-agent-operator -n monitoring
kubectl apply -f grafana-agent-operator-CRs-manifest.yaml -n monitoring

where the manifest is

apiVersion: v1
kind: ServiceAccount
metadata:
  name: grafana-agent
  namespace: monitoring
---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 2.5.0
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: v1
data: {}
kind: Secret
metadata:
  name: logs-secret
  namespace: monitoring
stringData:
  password: PASSWORD
  username: USERNAME
type: Opaque
---
apiVersion: v1
data: {}
kind: Secret
metadata:
  name: metrics-secret
  namespace: monitoring
stringData:
  password: PASSWORD
  username: USERNAME
type: Opaque
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: agent-eventhandler
  namespace: monitoring
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: grafana-agent
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/proxy
  - nodes/metrics
  - services
  - endpoints
  - pods
  - events
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs:
  - get
  - list
  - watch
- nonResourceURLs:
  - /metrics
  - /metrics/cadvisor
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 2.5.0
  name: kube-state-metrics
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs:
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - create
- apiGroups:
  - authorization.k8s.io
  resources:
  - subjectaccessreviews
  verbs:
  - create
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - list
  - watch
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - list
  - watch
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  - volumeattachments
  verbs:
  - list
  - watch
- apiGroups:
  - admissionregistration.k8s.io
  resources:
  - mutatingwebhookconfigurations
  - validatingwebhookconfigurations
  verbs:
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  - ingresses
  verbs:
  - list
  - watch
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: grafana-agent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: grafana-agent
subjects:
- kind: ServiceAccount
  name: grafana-agent
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 2.5.0
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 2.5.0
  name: kube-state-metrics
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
  - name: telemetry
    port: 8081
    targetPort: telemetry
  selector:
    app.kubernetes.io/name: kube-state-metrics
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 2.5.0
  name: kube-state-metrics
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: 2.5.0
    spec:
      automountServiceAccountToken: true
      containers:
      - image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.5.0
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http-metrics
        - containerPort: 8081
          name: telemetry
        readinessProbe:
          httpGet:
            path: /
            port: 8081
          initialDelaySeconds: 5
          timeoutSeconds: 5
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsUser: 65534
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics
---
apiVersion: monitoring.grafana.com/v1alpha1
kind: GrafanaAgent
metadata:
  name: grafana-agent
  namespace: monitoring
spec:
  image: grafana/agent:v0.26.1
  integrations:
    selector:
      matchLabels:
        agent: grafana-agent
  logs:
    instanceSelector:
      matchLabels:
        agent: grafana-agent
  metrics:
    externalLabels:
      cluster: eks_dev_stakewise
    instanceSelector:
      matchLabels:
        agent: grafana-agent
    scrapeInterval: 60s
  serviceAccountName: grafana-agent
---
apiVersion: monitoring.grafana.com/v1alpha1
kind: Integration
metadata:
  labels:
    agent: grafana-agent
  name: agent-eventhandler
  namespace: monitoring
spec:
  config:
    cache_path: /etc/eventhandler/eventhandler.cache
    logs_instance: monitoring/grafana-agent-logs
  name: eventhandler
  type:
    unique: true
  volumeMounts:
  - mountPath: /etc/eventhandler
    name: agent-eventhandler
  volumes:
  - name: agent-eventhandler
    persistentVolumeClaim:
      claimName: agent-eventhandler
---
apiVersion: monitoring.grafana.com/v1alpha1
kind: Integration
metadata:
  labels:
    agent: grafana-agent
  name: node-exporter
  namespace: monitoring
spec:
  config:
    autoscrape:
      enable: true
      metrics_instance: monitoring/grafana-agent-metrics
    procfs_path: /host/proc
    rootfs_path: /host/root
    sysfs_path: /host/sys
  name: node_exporter
  type:
    allNodes: true
    unique: true
  volumeMounts:
  - mountPath: /host/root
    name: rootfs
  - mountPath: /host/sys
    name: sysfs
  - mountPath: /host/proc
    name: procfs
  volumes:
  - hostPath:
      path: /
    name: rootfs
  - hostPath:
      path: /sys
    name: sysfs
  - hostPath:
      path: /proc
    name: procfs
---
apiVersion: monitoring.grafana.com/v1alpha1
kind: LogsInstance
metadata:
  labels:
    agent: grafana-agent
  name: grafana-agent-logs
  namespace: monitoring
spec:
  clients:
  - basicAuth:
      password:
        key: password
        name: logs-secret
      username:
        key: username
        name: logs-secret
    externalLabels:
      cluster: eks_dev_stakewise
    url: https://logs-prod-017.grafana.net/loki/api/v1/push
  podLogsNamespaceSelector: {}
  podLogsSelector:
    matchLabels:
      instance: primary
---
apiVersion: monitoring.grafana.com/v1alpha1
kind: MetricsInstance
metadata:
  labels:
    agent: grafana-agent
  name: grafana-agent-metrics
  namespace: monitoring
spec:
  podMonitorNamespaceSelector: {}
  podMonitorSelector:
    matchLabels:
      instance: primary
  remoteWrite:
  - basicAuth:
      password:
        key: password
        name: metrics-secret
      username:
        key: username
        name: metrics-secret
    url: https://prometheus-us-central1.grafana.net/api/prom/push
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchLabels:
      instance: primary
---
apiVersion: monitoring.grafana.com/v1alpha1
kind: PodLogs
metadata:
  labels:
    instance: primary
  name: kubernetes-logs
  namespace: monitoring
spec:
  namespaceSelector:
    any: true
  pipelineStages:
  - cri: {}
  relabelings:
  - sourceLabels:
    - __meta_kubernetes_pod_node_name
    targetLabel: __host__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - action: replace
    sourceLabels:
    - __meta_kubernetes_namespace
    targetLabel: namespace
  - action: replace
    sourceLabels:
    - __meta_kubernetes_pod_name
    targetLabel: pod
  - action: replace
    sourceLabels:
    - __meta_kubernetes_container_name
    targetLabel: container
  - replacement: /var/log/pods/*$1/*.log
    separator: /
    sourceLabels:
    - __meta_kubernetes_pod_uid
    - __meta_kubernetes_pod_container_name
    targetLabel: __path__
  selector:
    matchLabels: {}
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    instance: primary
  name: cadvisor-monitor
  namespace: monitoring
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    honorLabels: true
    interval: 60s
    metricRelabelings:
    - action: keep
      regex: kubelet_pod_worker_duration_seconds_count|kubelet_pod_start_duration_seconds_count|kubernetes_build_info|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|container_cpu_cfs_periods_total|kubelet_cgroup_manager_duration_seconds_bucket|kube_horizontalpodautoscaler_status_desired_replicas|kubelet_server_expiration_renew_errors|kube_pod_status_reason|rest_client_requests_total|kube_statefulset_status_replicas|cluster:namespace:pod_memory:active:kube_pod_container_resource_limits|kubelet_volume_stats_inodes_used|cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits|machine_memory_bytes|volume_manager_total_volumes|kube_deployment_spec_replicas|namespace_memory:kube_pod_container_resource_requests:sum|process_resident_memory_bytes|kube_statefulset_status_observed_generation|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|kube_deployment_status_observed_generation|kube_node_info|kubelet_runtime_operations_errors_total|kube_deployment_metadata_generation|namespace_workload_pod:kube_pod_owner:relabel|kube_node_status_allocatable|kubelet_certificate_manager_server_ttl_seconds|container_fs_writes_total|kubelet_pleg_relist_interval_seconds_bucket|node_namespace_pod_container:container_memory_working_set_bytes|container_network_receive_bytes_total|kubelet_node_name|kube_statefulset_status_current_revision|kube_job_status_start_time|kube_horizontalpodautoscaler_spec_max_replicas|kube_node_spec_taint|kubelet_running_pods|kubelet_running_containers|kubelet_pleg_relist_duration_seconds_bucket|container_cpu_usage_seconds_total|container_network_transmit_packets_total|container_cpu_cfs_throttled_periods_total|node_filesystem_avail_bytes|kubelet_certificate_manager_client_ttl_seconds|container_memory_rss|container_fs_reads_bytes_total|kubelet_volume_stats_available_bytes|kube_daemonset_status_number_available|kube_pod_owner|go_goroutines|kube_daemonset_status_updated_number_scheduled|kube_statefulset_metadata_generation|container_network_transmit_bytes_total|node_filesystem_size_bytes|kubelet_running_pod_count|kube_statefulset_status_replicas_updated|kubelet_node_config_error|kube_deployment_status_replicas_available|kube_daemonset_status_number_misscheduled|container_memory_cache|kubelet_volume_stats_inodes|kube_statefulset_status_replicas_ready|kube_replicaset_owner|kubelet_cgroup_manager_duration_seconds_count|kube_statefulset_replicas|kube_horizontalpodautoscaler_status_current_replicas|cluster:namespace:pod_memory:active:kube_pod_container_resource_requests|container_network_receive_packets_total|kube_node_status_capacity|node_namespace_pod_container:container_memory_swap|storage_operation_duration_seconds_count|storage_operation_errors_total|kube_horizontalpodautoscaler_spec_min_replicas|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|namespace_workload_pod|kube_pod_info|kube_deployment_status_replicas_updated|container_memory_swap|kube_pod_status_phase|kube_resourcequota|kubelet_pod_worker_duration_seconds_bucket|kube_job_failed|kube_daemonset_status_desired_number_scheduled|container_memory_working_set_bytes|kube_pod_container_resource_limits|namespace_memory:kube_pod_container_resource_limits:sum|namespace_cpu:kube_pod_container_resource_limits:sum|namespace_cpu:kube_pod_container_resource_requests:sum|kube_job_status_active|kube_daemonset_status_current_number_scheduled|container_network_transmit_packets_dropped_total|kube_persistentvolumeclaim_resource_requests_storage_bytes|kubelet_pod_start_duration_seconds_bucket|kube_node_status_condition|kubelet_runtime_operations_total|node_namespace_pod_container:container_memory_cache|kubelet_running_container_count|kube_pod_container_status_waiting_reason|container_fs_reads_total|node_namespace_pod_container:container_memory_rss|kubelet_pleg_relist_duration_seconds_count|kube_namespace_status_phase|container_fs_writes_bytes_total|kube_statefulset_status_update_revision|kubelet_volume_stats_capacity_bytes|process_cpu_seconds_total|kubelet_certificate_manager_client_expiration_renew_errors|container_network_receive_packets_dropped_total|kube_pod_container_resource_requests|kube_namespace_status_phase|container_cpu_usage_seconds_total|kube_pod_status_phase|kube_pod_start_time|kube_pod_container_status_restarts_total|kube_pod_container_info|kube_pod_container_status_waiting_reason|kube_daemonset.*|kube_replicaset.*|kube_statefulset.*|kube_job.*|kube_node.*|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|namespace_cpu:kube_pod_container_resource_requests:sum|node_cpu.*|node_memory.*|node_filesystem.*
      sourceLabels:
      - __name__
    path: /metrics/cadvisor
    port: https-metrics
    relabelings:
    - sourceLabels:
      - __metrics_path__
      targetLabel: metrics_path
    - action: replace
      replacement: integrations/kubernetes/cadvisor
      targetLabel: job
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  namespaceSelector:
    any: true
  selector:
    matchLabels:
      app.kubernetes.io/name: kubelet
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    instance: primary
  name: ksm-monitor
  namespace: monitoring
spec:
  endpoints:
  - honorLabels: true
    interval: 60s
    metricRelabelings:
    - action: keep
      regex: kubelet_pod_worker_duration_seconds_count|kubelet_pod_start_duration_seconds_count|kubernetes_build_info|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|container_cpu_cfs_periods_total|kubelet_cgroup_manager_duration_seconds_bucket|kube_horizontalpodautoscaler_status_desired_replicas|kubelet_server_expiration_renew_errors|kube_pod_status_reason|rest_client_requests_total|kube_statefulset_status_replicas|cluster:namespace:pod_memory:active:kube_pod_container_resource_limits|kubelet_volume_stats_inodes_used|cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits|machine_memory_bytes|volume_manager_total_volumes|kube_deployment_spec_replicas|namespace_memory:kube_pod_container_resource_requests:sum|process_resident_memory_bytes|kube_statefulset_status_observed_generation|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|kube_deployment_status_observed_generation|kube_node_info|kubelet_runtime_operations_errors_total|kube_deployment_metadata_generation|namespace_workload_pod:kube_pod_owner:relabel|kube_node_status_allocatable|kubelet_certificate_manager_server_ttl_seconds|container_fs_writes_total|kubelet_pleg_relist_interval_seconds_bucket|node_namespace_pod_container:container_memory_working_set_bytes|container_network_receive_bytes_total|kubelet_node_name|kube_statefulset_status_current_revision|kube_job_status_start_time|kube_horizontalpodautoscaler_spec_max_replicas|kube_node_spec_taint|kubelet_running_pods|kubelet_running_containers|kubelet_pleg_relist_duration_seconds_bucket|container_cpu_usage_seconds_total|container_network_transmit_packets_total|container_cpu_cfs_throttled_periods_total|node_filesystem_avail_bytes|kubelet_certificate_manager_client_ttl_seconds|container_memory_rss|container_fs_reads_bytes_total|kubelet_volume_stats_available_bytes|kube_daemonset_status_number_available|kube_pod_owner|go_goroutines|kube_daemonset_status_updated_number_scheduled|kube_statefulset_metadata_generation|container_network_transmit_bytes_total|node_filesystem_size_bytes|kubelet_running_pod_count|kube_statefulset_status_replicas_updated|kubelet_node_config_error|kube_deployment_status_replicas_available|kube_daemonset_status_number_misscheduled|container_memory_cache|kubelet_volume_stats_inodes|kube_statefulset_status_replicas_ready|kube_replicaset_owner|kubelet_cgroup_manager_duration_seconds_count|kube_statefulset_replicas|kube_horizontalpodautoscaler_status_current_replicas|cluster:namespace:pod_memory:active:kube_pod_container_resource_requests|container_network_receive_packets_total|kube_node_status_capacity|node_namespace_pod_container:container_memory_swap|storage_operation_duration_seconds_count|storage_operation_errors_total|kube_horizontalpodautoscaler_spec_min_replicas|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|namespace_workload_pod|kube_pod_info|kube_deployment_status_replicas_updated|container_memory_swap|kube_pod_status_phase|kube_resourcequota|kubelet_pod_worker_duration_seconds_bucket|kube_job_failed|kube_daemonset_status_desired_number_scheduled|container_memory_working_set_bytes|kube_pod_container_resource_limits|namespace_memory:kube_pod_container_resource_limits:sum|namespace_cpu:kube_pod_container_resource_limits:sum|namespace_cpu:kube_pod_container_resource_requests:sum|kube_job_status_active|kube_daemonset_status_current_number_scheduled|container_network_transmit_packets_dropped_total|kube_persistentvolumeclaim_resource_requests_storage_bytes|kubelet_pod_start_duration_seconds_bucket|kube_node_status_condition|kubelet_runtime_operations_total|node_namespace_pod_container:container_memory_cache|kubelet_running_container_count|kube_pod_container_status_waiting_reason|container_fs_reads_total|node_namespace_pod_container:container_memory_rss|kubelet_pleg_relist_duration_seconds_count|kube_namespace_status_phase|container_fs_writes_bytes_total|kube_statefulset_status_update_revision|kubelet_volume_stats_capacity_bytes|process_cpu_seconds_total|kubelet_certificate_manager_client_expiration_renew_errors|container_network_receive_packets_dropped_total|kube_pod_container_resource_requests|kube_namespace_status_phase|container_cpu_usage_seconds_total|kube_pod_status_phase|kube_pod_start_time|kube_pod_container_status_restarts_total|kube_pod_container_info|kube_pod_container_status_waiting_reason|kube_daemonset.*|kube_replicaset.*|kube_statefulset.*|kube_job.*|kube_node.*|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|namespace_cpu:kube_pod_container_resource_requests:sum|node_cpu.*|node_memory.*|node_filesystem.*
      sourceLabels:
      - __name__
    path: /metrics
    port: http-metrics
    relabelings:
    - action: replace
      replacement: integrations/kubernetes/kube-state-metrics
      targetLabel: job
  namespaceSelector:
    matchNames:
    - monitoring
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    instance: primary
  name: kubelet-monitor
  namespace: monitoring
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    honorLabels: true
    interval: 60s
    metricRelabelings:
    - action: keep
      regex: kubelet_pod_worker_duration_seconds_count|kubelet_pod_start_duration_seconds_count|kubernetes_build_info|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|container_cpu_cfs_periods_total|kubelet_cgroup_manager_duration_seconds_bucket|kube_horizontalpodautoscaler_status_desired_replicas|kubelet_server_expiration_renew_errors|kube_pod_status_reason|rest_client_requests_total|kube_statefulset_status_replicas|cluster:namespace:pod_memory:active:kube_pod_container_resource_limits|kubelet_volume_stats_inodes_used|cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits|machine_memory_bytes|volume_manager_total_volumes|kube_deployment_spec_replicas|namespace_memory:kube_pod_container_resource_requests:sum|process_resident_memory_bytes|kube_statefulset_status_observed_generation|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|kube_deployment_status_observed_generation|kube_node_info|kubelet_runtime_operations_errors_total|kube_deployment_metadata_generation|namespace_workload_pod:kube_pod_owner:relabel|kube_node_status_allocatable|kubelet_certificate_manager_server_ttl_seconds|container_fs_writes_total|kubelet_pleg_relist_interval_seconds_bucket|node_namespace_pod_container:container_memory_working_set_bytes|container_network_receive_bytes_total|kubelet_node_name|kube_statefulset_status_current_revision|kube_job_status_start_time|kube_horizontalpodautoscaler_spec_max_replicas|kube_node_spec_taint|kubelet_running_pods|kubelet_running_containers|kubelet_pleg_relist_duration_seconds_bucket|container_cpu_usage_seconds_total|container_network_transmit_packets_total|container_cpu_cfs_throttled_periods_total|node_filesystem_avail_bytes|kubelet_certificate_manager_client_ttl_seconds|container_memory_rss|container_fs_reads_bytes_total|kubelet_volume_stats_available_bytes|kube_daemonset_status_number_available|kube_pod_owner|go_goroutines|kube_daemonset_status_updated_number_scheduled|kube_statefulset_metadata_generation|container_network_transmit_bytes_total|node_filesystem_size_bytes|kubelet_running_pod_count|kube_statefulset_status_replicas_updated|kubelet_node_config_error|kube_deployment_status_replicas_available|kube_daemonset_status_number_misscheduled|container_memory_cache|kubelet_volume_stats_inodes|kube_statefulset_status_replicas_ready|kube_replicaset_owner|kubelet_cgroup_manager_duration_seconds_count|kube_statefulset_replicas|kube_horizontalpodautoscaler_status_current_replicas|cluster:namespace:pod_memory:active:kube_pod_container_resource_requests|container_network_receive_packets_total|kube_node_status_capacity|node_namespace_pod_container:container_memory_swap|storage_operation_duration_seconds_count|storage_operation_errors_total|kube_horizontalpodautoscaler_spec_min_replicas|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|namespace_workload_pod|kube_pod_info|kube_deployment_status_replicas_updated|container_memory_swap|kube_pod_status_phase|kube_resourcequota|kubelet_pod_worker_duration_seconds_bucket|kube_job_failed|kube_daemonset_status_desired_number_scheduled|container_memory_working_set_bytes|kube_pod_container_resource_limits|namespace_memory:kube_pod_container_resource_limits:sum|namespace_cpu:kube_pod_container_resource_limits:sum|namespace_cpu:kube_pod_container_resource_requests:sum|kube_job_status_active|kube_daemonset_status_current_number_scheduled|container_network_transmit_packets_dropped_total|kube_persistentvolumeclaim_resource_requests_storage_bytes|kubelet_pod_start_duration_seconds_bucket|kube_node_status_condition|kubelet_runtime_operations_total|node_namespace_pod_container:container_memory_cache|kubelet_running_container_count|kube_pod_container_status_waiting_reason|container_fs_reads_total|node_namespace_pod_container:container_memory_rss|kubelet_pleg_relist_duration_seconds_count|kube_namespace_status_phase|container_fs_writes_bytes_total|kube_statefulset_status_update_revision|kubelet_volume_stats_capacity_bytes|process_cpu_seconds_total|kubelet_certificate_manager_client_expiration_renew_errors|container_network_receive_packets_dropped_total|kube_pod_container_resource_requests|kube_namespace_status_phase|container_cpu_usage_seconds_total|kube_pod_status_phase|kube_pod_start_time|kube_pod_container_status_restarts_total|kube_pod_container_info|kube_pod_container_status_waiting_reason|kube_daemonset.*|kube_replicaset.*|kube_statefulset.*|kube_job.*|kube_node.*|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|namespace_cpu:kube_pod_container_resource_requests:sum|node_cpu.*|node_memory.*|node_filesystem.*
      sourceLabels:
      - __name__
    path: /metrics
    port: https-metrics
    relabelings:
    - sourceLabels:
      - __metrics_path__
      targetLabel: metrics_path
    - action: replace
      replacement: integrations/kubernetes/kubelet
      targetLabel: job
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  namespaceSelector:
    any: true
  selector:
    matchLabels:
      app.kubernetes.io/name: kubelet
---

The complete logs of the restarting pod are:

ts=2023-03-14T23:37:58.548237036Z caller=server.go:191 level=info msg="server listening on addresses" http=[::]:8080 grpc=127.0.0.1:12346 http_tls_enabled=false grpc_tls_enabled=false
ts=2023-03-14T23:37:58.549132798Z caller=node.go:85 level=info agent=prometheus component=cluster msg="applying config"
ts=2023-03-14T23:37:58.54928561Z caller=remote.go:180 level=info agent=prometheus component=cluster msg="not watching the KV, none set"
ts=2023-03-14T23:37:58Z level=info caller=traces/traces.go:143 msg="Traces Logger Initialized" component=traces
ts=2023-03-14T23:37:58.551620368Z caller=integrations.go:138 level=warn msg="integrations-next is enabled. integrations-next is subject to change"
ts=2023-03-14T23:37:58.552092357Z caller=filesystem_common.go:111 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=filesystem msg="Parsed flag --collector.filesystem.mount-points-exclude" flag=^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+)($|/)
ts=2023-03-14T23:37:58.552207301Z caller=filesystem_common.go:113 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=filesystem msg="Parsed flag --collector.filesystem.fs-types-exclude" flag=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
ts=2023-03-14T23:37:58.553425099Z caller=node_exporter.go:53 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 msg="Enabled node_exporter collectors"
ts=2023-03-14T23:37:58.553457811Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=arp
ts=2023-03-14T23:37:58.55354026Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=bcache
ts=2023-03-14T23:37:58.553553664Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=bonding
ts=2023-03-14T23:37:58.553610139Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=btrfs
ts=2023-03-14T23:37:58.553626482Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=conntrack
ts=2023-03-14T23:37:58.553655327Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=cpu
ts=2023-03-14T23:37:58.553665967Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=cpufreq
ts=2023-03-14T23:37:58.553678711Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=diskstats
ts=2023-03-14T23:37:58.553747377Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=dmi
ts=2023-03-14T23:37:58.553759495Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=edac
ts=2023-03-14T23:37:58.553808943Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=entropy
ts=2023-03-14T23:37:58.55381907Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=fibrechannel
ts=2023-03-14T23:37:58.553827883Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=filefd
ts=2023-03-14T23:37:58.553836442Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=filesystem
ts=2023-03-14T23:37:58.553844832Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=hwmon
ts=2023-03-14T23:37:58.553852258Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=infiniband
ts=2023-03-14T23:37:58.553859687Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=ipvs
ts=2023-03-14T23:37:58.553867093Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=loadavg
ts=2023-03-14T23:37:58.553874711Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=mdadm
ts=2023-03-14T23:37:58.553888459Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=meminfo
ts=2023-03-14T23:37:58.553901129Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=netclass
ts=2023-03-14T23:37:58.553908845Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=netdev
ts=2023-03-14T23:37:58.553921796Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=netstat
ts=2023-03-14T23:37:58.553930407Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=nfs
ts=2023-03-14T23:37:58.553938739Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=nfsd
ts=2023-03-14T23:37:58.55394962Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=nvme
ts=2023-03-14T23:37:58.553958234Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=os
ts=2023-03-14T23:37:58.55396803Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=powersupplyclass
ts=2023-03-14T23:37:58.553976699Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=pressure
ts=2023-03-14T23:37:58.55398493Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=rapl
ts=2023-03-14T23:37:58.553993329Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=schedstat
ts=2023-03-14T23:37:58.554002782Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=sockstat
ts=2023-03-14T23:37:58.55401322Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=softnet
ts=2023-03-14T23:37:58.554022103Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=stat
ts=2023-03-14T23:37:58.55403019Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=tapestats
ts=2023-03-14T23:37:58.55404158Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=textfile
ts=2023-03-14T23:37:58.554051868Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=thermal_zone
ts=2023-03-14T23:37:58.554061775Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=time
ts=2023-03-14T23:37:58.554075301Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=timex
ts=2023-03-14T23:37:58.554086096Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=udp_queues
ts=2023-03-14T23:37:58.554108334Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=uname
ts=2023-03-14T23:37:58.554119369Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=vmstat
ts=2023-03-14T23:37:58.554149485Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=xfs
ts=2023-03-14T23:37:58.554159683Z caller=node_exporter.go:60 level=info component=integrations integration=node_exporter instance=ip-10-0-2-49.ec2.internal:8080 collector=zfs
ts=2023-03-14T23:37:58.554735044Z caller=autoscrape.go:127 level=error component=autoscraper msg="cannot autoscrape integration" name=node_exporter/ip-10-0-2-49.ec2.internal:8080 err="instance monitoring/grafana-agent-metrics does not exist"
ts=2023-03-14T23:37:58.554766659Z caller=main.go:57 level=error msg="error creating the agent server entrypoint" err="configuring autoscraper failed: instance monitoring/grafana-agent-metrics does not exist"
@ThisDevDane

I'm experiencing the same problem.

@anmolnagpal

I'm getting the same issue :(


meezaan commented Mar 17, 2023

Same here.

@rfratto added the bug label Mar 18, 2023
@robertvandervoort (Contributor)

I also ran into this. The fix worked.


icron commented Mar 26, 2023

Same error here.

@allnightlong

Thanks for the workaround!

@tiredpixel

I, too, experienced this on at least 2 clusters (Kubernetes 1.26.3, Grafana Agent 0.33.1), deployed via the Grafana Cloud instructions using the operator method installed via Helm, and configured to collect metrics, logs, and events. Restarting the daemonset worked in each case. Thanks for the workaround!

@captncraig (Contributor)

I've had some difficulty reproducing, but I think I have now. I can only reproduce it if I:

  • Create all resources except the MetricsInstance
  • Wait for the daemonset to deploy
  • Add the MetricsInstance to the cluster.

The integrations daemonset continues crash-looping, even though the secret is correct. Ordering seems to matter here: a reconcile where the Integration exists but the MetricsInstance does not seems to cause a problem. I'd expect the reconcile to fail in that case, but instead it puts things in a bad state.

The second problem is why the daemonset does not recover once the secret is correct. The volumes all look right. Perhaps something is off with the reloader there.

I am continuing to dig on this.
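
A rough kubectl sketch of that ordering, assuming the manifest from this issue is split into two hypothetical files (everything except the MetricsInstance, and the MetricsInstance alone):

# 1. Create all resources except the MetricsInstance
kubectl -n monitoring apply -f all-but-metricsinstance.yaml
# 2. Wait for the integrations daemonset to deploy
kubectl -n monitoring rollout status daemonset grafana-agent-integrations-ds
# 3. Only then add the MetricsInstance; the daemonset starts crash-looping
kubectl -n monitoring apply -f metricsinstance.yaml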


captncraig commented Jul 6, 2023

Ah, the reload fails from the sidecar container. That explains why it does not self-resolve:

level=error ts=2023-07-06T22:14:01.512739968Z caller=runutil.go:100 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://127.0.0.1:8080/-/reload\": dial tcp 127.0.0.1:8080: connect: connection refused"

But I'm still not sure why a crash-looping agent container would not get the updates if the reloader is otherwise working. Need to dig more.
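
To see the failed reload attempts on your own cluster, you can tail the sidecar's logs; the container name config-reloader is an assumption about the operator-generated pod spec, so verify it with kubectl describe first:

kubectl -n monitoring logs daemonset/grafana-agent-integrations-ds -c config-reloader --tail=20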

@captncraig self-assigned this Jul 7, 2023
@captncraig (Contributor)

The fix to the reloader is in thanos-io/thanos#6519. If that is accepted, I will try to get it into prometheus-operator, which owns the reloader image, and then into the operator defaults.

@captncraig (Contributor)

The fix for this has been merged upstream, and the main build of prometheus-config-reloader includes a fix for this deadlock. It can be used by setting a field on your GrafanaAgent resources if you are using agent-operator v0.35.0 or later. I have not updated the default yet, but I will when prometheus-operator makes a new release; I'm unsure of the timing on that.

apiVersion: monitoring.grafana.com/v1alpha1
kind: GrafanaAgent
metadata:
  ...
spec:
  configReloaderVersion: main
  ...

If that field is not available, you may need to update your CRD definitions from the latest in the repo.

I will leave this issue open until I can update the default image the operator uses to a stable release version.
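
If you would rather patch a live resource than edit manifests, something like this should work (a sketch; the resource name grafana-agent and the monitoring namespace match the manifest earlier in this issue):

kubectl -n monitoring patch grafanaagent grafana-agent --type merge -p '{"spec":{"configReloaderVersion":"main"}}'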

@jcreixell added this to the v0.36.0 milestone Aug 22, 2023
@jcreixell modified the milestones: v0.36.0, v0.37.0 Sep 5, 2023
@github-project-automation bot moved this from Todo to Done in Grafana Agent (Public) Sep 5, 2023
@github-actions bot added the frozen-due-to-age label Feb 21, 2024
@github-actions bot locked as resolved and limited conversation to collaborators Feb 21, 2024