PromQL queries fail, but MQL works #325

Closed
willchan opened this issue Sep 21, 2022 · 7 comments

@willchan

As a workaround for #223, I set up self-deployed collection running in my single-node kubeadm cluster so I could scrape the kubelet cadvisor, with a config like:

global:
  scrape_interval: 60s
scrape_configs:
  - job_name: "kubernetes-cadvisor"
    kubernetes_sd_configs:
      - role: node

    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

    metric_relabel_configs:
      - source_labels: [container]
        regex: "^$|POD"
        action: drop
      - regex: "id"
        action: labeldrop

At first I thought the metrics weren't getting ingested, because I could not query container_cpu_usage_seconds_total from my self-deployed Prometheus frontend or from the Managed Service for Prometheus page. Both returned an empty query result, although both UIs would auto-complete the metric name as I typed. That implied they somehow knew about the metric, even though I couldn't query it.

When I tried the Cloud Monitoring Metrics Explorer, I was able to see the metrics. I'm not sure if it's something about this metric, or the number of label sets (53), or what, but AFAICT MQL can query it while PromQL returns an empty result.
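
One way to confirm the samples really are reaching Cloud Monitoring is to list the time series headers for the metric type directly; a minimal sketch, assuming the google-cloud-monitoring Python client and using YOUR_PROJECT_ID as a placeholder:

# Sketch: list recent time series for the managed-Prometheus metric type to
# confirm the data exists even though PromQL returns an empty result.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    start_time={"seconds": now - 600},
    end_time={"seconds": now},
)
results = client.list_time_series(
    request={
        "name": "projects/YOUR_PROJECT_ID",
        "filter": 'metric.type = "prometheus.googleapis.com/container_cpu_usage_seconds_total/counter"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.HEADERS,
    }
)
for ts in results:
    print(ts.metric.labels)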

@lyanco
Collaborator

lyanco commented Sep 22, 2022

Can you please email me your project ID so we can investigate further? (my GH user name) at (the company I work for) .com.

@willchan
Author

Done

@lyanco
Collaborator

lyanco commented Sep 23, 2022

Closing - there's a hard limit of 100 labels per metric, which is unlikely to be hit except in a case like this, where there are 50+ label keys and every one of them changes while the metric name stays the same. Deleting the metric descriptor fixed the problem.
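
If you want to check whether a metric is close to that limit, here's a minimal sketch that counts the label keys on the descriptor, assuming the google-cloud-monitoring Python client and the descriptor name format used later in this thread (YOUR_PROJECT_ID is a placeholder):

# Sketch: print the label keys currently recorded on the metric descriptor.
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
name = "projects/YOUR_PROJECT_ID/metricDescriptors/prometheus.googleapis.com/container_cpu_usage_seconds_total/counter"
descriptor = client.get_metric_descriptor(name=name)
print(f"{descriptor.type} has {len(descriptor.labels)} label keys")
for label in descriptor.labels:
    print(label.key)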

@lyanco lyanco closed this as completed Sep 23, 2022
@andrew-helm

Deleting the metric descriptor fixed the problem.

@lyanco could you please elaborate on this fix? Where and how do I delete the metric descriptor?

I'm running into the exact same issue, same metric even. I count 57 labels on container_cpu_usage_seconds_total coming out of cadvisor. If I understand correctly, once I delete the metric descriptor, I should be able to fix this by adding a relabel policy that drops all of the superfluous labels. Does that sound correct?

@willchan
Author

There's an API for it. I wrote some throwaway Python code to do it.

@andrew-helm

Thanks for the pointer!

For anyone else that ends up here, here's my throwaway code and relabel config:

# Delete the metric descriptor that accumulated too many label keys.
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
name = "projects/YOUR_PROJECT_ID/metricDescriptors/prometheus.googleapis.com/container_cpu_usage_seconds_total/counter"
client.delete_metric_descriptor(name=name)

    - regex: "container_label_Description|container_label_INCLUDES_NOTICES|container_label_INCLUDES_SOURCE|container_label_LICENSE|container_label_LICENSE_CATEGORY|container_label_LICENSE_SCAN_TAG|container_label_NOTICES_PATH|container_label_SOURCES_INCLUDED|container_label_Vendor|container_label_Version|container_label_app|container_label_app_kubernetes_io_component|container_label_app_kubernetes_io_instance|container_label_app_kubernetes_io_managed_by|container_label_app_kubernetes_io_name|container_label_app_kubernetes_io_part_of|container_label_app_kubernetes_io_version|container_label_build_id|container_label_component|container_label_components_gke_io_component_name|container_label_controller_revision_hash|container_label_description|container_label_helm_sh_chart|container_label_io_cri_containerd_kind|container_label_k8s_app|container_label_kubernetes_io_cluster_service|container_label_maintainer|container_label_maintainers|container_label_name|container_label_org_opencontainers_image_base_name|container_label_org_opencontainers_image_created|container_label_org_opencontainers_image_description|container_label_org_opencontainers_image_documentation|container_label_org_opencontainers_image_licenses|container_label_org_opencontainers_image_ref_name|container_label_org_opencontainers_image_revision|container_label_org_opencontainers_image_source|container_label_org_opencontainers_image_title|container_label_org_opencontainers_image_url|container_label_org_opencontainers_image_vendor|container_label_org_opencontainers_image_version|container_label_pod_template_generation|container_label_pod_template_hash|container_label_release|container_label_statefulset_kubernetes_io_pod_name|container_label_suite|container_label_tier|container_label_vendor|container_label_version"
      action: "labeldrop"
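
One placement note: the labeldrop rule above presumably sits under metric_relabel_configs of the cadvisor scrape job, as in the config at the top of this issue. After the old descriptor is deleted, a new one should be created automatically from the next samples written, this time without the dropped labels.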

@lyanco
Collaborator

lyanco commented Mar 28, 2023

Two updates here:

We recently increased the limit to 360 labels across all projects with the metric defined, so this should be less of an issue going forward.

Here's a short Go script we've been sending folks who have run into this issue. It iterates through metric descriptors, looks for a regex match of 8 or more digits in a row in the metric name (an indication of a common situation where someone accidentally sends statsd metrics, which can include a timestamp in the metric name), and deletes them. Please feel free to use it if needed:

package main

import (
	"context"
	"flag"
	"fmt"
	"log"

	"regexp"
	"time"

	monitoring "cloud.google.com/go/monitoring/apiv3/v2"
	"cloud.google.com/go/monitoring/apiv3/v2/monitoringpb"
	"google.golang.org/api/iterator"

	"google.golang.org/api/option"
)

var (
	cloudMonitoringEndpoint = flag.String("address", "monitoring.googleapis.com:443", "address of monitoring API")
	resourceContainer       = flag.String("resource_container", "", "target resource container, ex. projects/test-project")
	dryRun                  = flag.Bool("dry_run", false, "whether to dry run or not")
)

/*
* To acquire Application Default Credentials, run:

gcloud auth application-default login

* One way to run this file is to initialize a go module.
* For example, move this file into a new directory and run the following:

go mod init example.com/m
go mod tidy
go run delete_metric_descriptors_timestamps.go -resource_container=projects/test-project

*/

func main() {
	flag.Parse()
	ctx := context.Background()

	client, err := monitoring.NewMetricClient(ctx, option.WithEndpoint(*cloudMonitoringEndpoint))

	if err != nil {
		log.Fatalf("failed to build NewMetricClient for %s", *cloudMonitoringEndpoint)
	}

	it := client.ListMetricDescriptors(
		ctx,
		&monitoringpb.ListMetricDescriptorsRequest{
			Name: *resourceContainer,
		})

	var deleted = 0
	var numTotal = 0

	for {
		resp, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatalf("Failed ListMetricDescriptors request: %v", err)
		}
		var metricType = resp.Type
		match, err := regexp.MatchString("\\d{8,}", metricType)
		if err == nil && match {
			if *dryRun {
				numTotal++
				fmt.Printf("%s has timestamp values\n", metricType)
			} else {
				err := client.DeleteMetricDescriptor(ctx, &monitoringpb.DeleteMetricDescriptorRequest{
					Name: fmt.Sprintf("%s/metricDescriptors/%s", *resourceContainer, metricType),
				})
				if err != nil {
					log.Fatalf("Failed DeleteMetricDescriptors: %v", err)
				}
				numTotal++
				deleted++
				fmt.Printf("%s deleted\n", metricType)
				// Delete metric descriptors in batches of 1,000 and sleep in between
				// batches to avoid overwhelming configuration servers.
				if deleted == 1000 {
					time.Sleep(5 * time.Minute)
					deleted = 0
				}
			}
		}
	}
	fmt.Printf("%d deleted in total.\n", numTotal)

	if err := client.Close(); err != nil {
		log.Fatalf("Failed to close client: %v", err)
	}
}
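
If you're not sure what the regex will match in your project, run the script with -dry_run=true first; in that mode it only prints the metric types that contain a timestamp-like digit run and deletes nothing.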
