PromQL queries fail, but MQL works #325

Closed
willchan opened this issue Sep 21, 2022 · 7 comments

@willchan

As a workaround for #223, I set up self-deployed collection running in my single-node kubeadm cluster so I could scrape the kubelet cadvisor, with a config like:

global:
  scrape_interval: 60s
scrape_configs:
  - job_name: "kubernetes-cadvisor"
    kubernetes_sd_configs:
      - role: node

    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

    metric_relabel_configs:
      - source_labels: [container]
        regex: "^$|POD"
        action: drop
      - regex: "id"
        action: labeldrop

At first I thought the metrics weren't getting ingested, because I could not query container_cpu_usage_seconds_total from my self-deployed Prometheus frontend or from the Managed Service for Prometheus page. Both returned an empty query result, although both UIs would auto-complete the metric name as I typed. That implied they somehow knew about the metric, even though I couldn't query it.

When I tried the Cloud Monitoring Metrics Explorer, I was able to see the metrics. I'm not sure if it's something about this metric, or the number of label sets (53), or what, but AFAICT MQL can query it while PromQL returns an empty result.
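
One way to confirm the samples really are reaching Cloud Monitoring is to list the time series headers for the metric type directly; a minimal sketch, assuming the google-cloud-monitoring Python client and using YOUR_PROJECT_ID as a placeholder:

# Sketch: list recent time series for the managed-Prometheus metric type to
# confirm the data exists even though PromQL returns an empty result.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    start_time={"seconds": now - 600},
    end_time={"seconds": now},
)
results = client.list_time_series(
    request={
        "name": "projects/YOUR_PROJECT_ID",
        "filter": 'metric.type = "prometheus.googleapis.com/container_cpu_usage_seconds_total/counter"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.HEADERS,
    }
)
for ts in results:
    print(ts.metric.labels)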

@lyanco
Collaborator

lyanco commented Sep 22, 2022

Can you please email me your project ID so we can investigate further? (my GH user name) at (the company I work for) .com.

@willchan
Author

Done

@lyanco
Collaborator

lyanco commented Sep 23, 2022

Closing - there's a hard limit of 100 labels per metric, which is unlikely to be hit except in a case like this, where there are 50+ label keys and every one of them changes while the metric name stays the same. Deleting the metric descriptor fixed the problem.
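
If you want to check whether a metric is close to that limit, here's a minimal sketch that counts the label keys on the descriptor, assuming the google-cloud-monitoring Python client and the descriptor name format used later in this thread (YOUR_PROJECT_ID is a placeholder):

# Sketch: print the label keys currently recorded on the metric descriptor.
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
name = "projects/YOUR_PROJECT_ID/metricDescriptors/prometheus.googleapis.com/container_cpu_usage_seconds_total/counter"
descriptor = client.get_metric_descriptor(name=name)
print(f"{descriptor.type} has {len(descriptor.labels)} label keys")
for label in descriptor.labels:
    print(label.key)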

@lyanco lyanco closed this as completed Sep 23, 2022
@andrew-helm

Deleting the metric descriptor fixed the problem.

@lyanco could you please elaborate on this fix? Where and how do I delete the metric descriptor?

I'm running into the exact same issue, same metric even. I count 57 labels on container_cpu_usage_seconds_total coming out of cadvisor. If I understand correctly, once I delete the metric descriptor, I should be able to fix this by adding a relabel policy that drops all of the superfluous labels. Does that sound correct?

@willchan
Author

There's an API for it. I wrote some throwaway Python code to do it.

@andrew-helm

Thanks for the pointer!

For anyone else that ends up here, here's my throwaway code and relabel config:

# Delete the metric descriptor that accumulated too many label keys.
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
name = "projects/YOUR_PROJECT_ID/metricDescriptors/prometheus.googleapis.com/container_cpu_usage_seconds_total/counter"
client.delete_metric_descriptor(name=name)

    - regex: "container_label_Description|container_label_INCLUDES_NOTICES|container_label_INCLUDES_SOURCE|container_label_LICENSE|container_label_LICENSE_CATEGORY|container_label_LICENSE_SCAN_TAG|container_label_NOTICES_PATH|container_label_SOURCES_INCLUDED|container_label_Vendor|container_label_Version|container_label_app|container_label_app_kubernetes_io_component|container_label_app_kubernetes_io_instance|container_label_app_kubernetes_io_managed_by|container_label_app_kubernetes_io_name|container_label_app_kubernetes_io_part_of|container_label_app_kubernetes_io_version|container_label_build_id|container_label_component|container_label_components_gke_io_component_name|container_label_controller_revision_hash|container_label_description|container_label_helm_sh_chart|container_label_io_cri_containerd_kind|container_label_k8s_app|container_label_kubernetes_io_cluster_service|container_label_maintainer|container_label_maintainers|container_label_name|container_label_org_opencontainers_image_base_name|container_label_org_opencontainers_image_created|container_label_org_opencontainers_image_description|container_label_org_opencontainers_image_documentation|container_label_org_opencontainers_image_licenses|container_label_org_opencontainers_image_ref_name|container_label_org_opencontainers_image_revision|container_label_org_opencontainers_image_source|container_label_org_opencontainers_image_title|container_label_org_opencontainers_image_url|container_label_org_opencontainers_image_vendor|container_label_org_opencontainers_image_version|container_label_pod_template_generation|container_label_pod_template_hash|container_label_release|container_label_statefulset_kubernetes_io_pod_name|container_label_suite|container_label_tier|container_label_vendor|container_label_version"
      action: "labeldrop"
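
One placement note: the labeldrop rule above presumably sits under metric_relabel_configs of the cadvisor scrape job, as in the config at the top of this issue. After the old descriptor is deleted, a new one should be created automatically from the next samples written, this time without the dropped labels.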

@lyanco
Collaborator

lyanco commented Mar 28, 2023

Two updates here:

We recently increased the limit to 360 labels across all projects with the metric defined, so this should be less of an issue going forward.

Here's a short Go script we've been sending folks who have run into this issue. It iterates through metric descriptors, looks for a regex match of 8 or more digits in a row in the metric name (an indication of a common situation where someone accidentally sends statsd metrics, which can include a timestamp in the metric name), and deletes them. Please feel free to use it if needed:

package main

import (
	"context"
	"flag"
	"fmt"
	"log"

	"regexp"
	"time"

	monitoring "cloud.google.com/go/monitoring/apiv3/v2"
	"cloud.google.com/go/monitoring/apiv3/v2/monitoringpb"
	"google.golang.org/api/iterator"

	"google.golang.org/api/option"
)

var (
	cloudMonitoringEndpoint = flag.String("address", "monitoring.googleapis.com:443", "address of monitoring API")
	resourceContainer       = flag.String("resource_container", "", "target resource container, ex. projects/test-project")
	dryRun                  = flag.Bool("dry_run", false, "whether to dry run or not")
)

/*
* To acquire Application Default Credentials, run:

gcloud auth application-default login

* One way to run this file is to initialize a go module.
* For example, move this file into a new directory and run the following:

go mod init example.com/m
go mod tidy
go run delete_metric_descriptors_timestamps.go -resource_container=projects/test-project

*/

func main() {
	flag.Parse()
	ctx := context.Background()

	client, err := monitoring.NewMetricClient(ctx, option.WithEndpoint(*cloudMonitoringEndpoint))

	if err != nil {
		log.Fatalf("failed to build NewMetricClient for %s", *cloudMonitoringEndpoint)
	}

	it := client.ListMetricDescriptors(
		ctx,
		&monitoringpb.ListMetricDescriptorsRequest{
			Name: *resourceContainer,
		})

	var deleted = 0
	var numTotal = 0

	for {
		resp, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatalf("Failed ListMetricDescriptors request: %v", err)
		}
		var metricType = resp.Type
		match, err := regexp.MatchString("\\d{8,}", metricType)
		if err == nil && match {
			if *dryRun {
				numTotal++
				fmt.Printf("%s has timestamp values\n", metricType)
			} else {
				err := client.DeleteMetricDescriptor(ctx, &monitoringpb.DeleteMetricDescriptorRequest{
					Name: fmt.Sprintf("%s/metricDescriptors/%s", *resourceContainer, metricType),
				})
				if err != nil {
					log.Fatalf("Failed DeleteMetricDescriptors: %v", err)
				}
				numTotal++
				deleted++
				fmt.Printf("%s deleted\n", metricType)
				// Delete metric descriptors in batches of 1,000 and sleep in between
				// batches to avoid overwhelming configuration servers.
				if deleted == 1000 {
					time.Sleep(5 * time.Minute)
					deleted = 0
				}
			}
		}
	}
	fmt.Printf("%d deleted in total.\n", numTotal)

	if err := client.Close(); err != nil {
		log.Fatalf("Failed to close client: %v", err)
	}
}
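
If you're not sure what the regex will match in your project, run the script with -dry_run=true first; in that mode it only prints the metric types that contain a timestamp-like digit run and deletes nothing.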
