Prometheus parser incorrectly classifies counters as histograms #39705

gpop63 · 2024-05-23T20:45:19Z

Overview

While attempting to ingest nginx stream metrics using the prometheus module, we've noticed that certain metrics such as nginx_sts_upstream_connects_total and nginx_sts_server_connects_total are not being ingested.

This issue is not present in metricbeat version 7.17.x where all metrics are ingested correctly. However, in versions 8.7 and later, these metrics are being skipped.

The issue seems to be rooted in the text parser logic in metricbeat/helper/prometheus/textparse.go, specifically within the ParseMetricFamilies function. When processing a group of metrics, the metric type variable is set to the latest metric type in the group, even if their types might differ.

# HELP nginx_sts_server_bytes_total The request/response bytes
# TYPE nginx_sts_server_bytes_total counter
# HELP nginx_sts_server_connects_total The connects counter
# TYPE nginx_sts_server_connects_total counter
# HELP nginx_sts_server_session_seconds_total The session duration time
# TYPE nginx_sts_server_session_seconds_total counter
# HELP nginx_sts_server_session_seconds The average of session duration time in seconds
# TYPE nginx_sts_server_session_seconds gauge
# HELP nginx_sts_server_session_duration_seconds The histogram of session duration in seconds
# TYPE nginx_sts_server_session_duration_seconds histogram
nginx_sts_server_bytes_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",direction="in"} 0
nginx_sts_server_bytes_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",direction="out"} 0
nginx_sts_server_connects_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",code="1xx"} 0
nginx_sts_server_connects_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",code="2xx"} 0
nginx_sts_server_connects_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",code="3xx"} 0
nginx_sts_server_connects_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",code="4xx"} 0
nginx_sts_server_connects_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",code="5xx"} 171
nginx_sts_server_connects_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",code="total"} 171
nginx_sts_server_session_seconds_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP"} 0.016
nginx_sts_server_session_seconds{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP"} 0.000

After the loop completes for a group of metrics, mt is set to the last metric type in the group, which in this case is histogram. This is problematic because for example nginx_sts_server_connects_total is actually a counter. This causes an error later in the code when we validate if the metric name contains a suffix specific to a histogram - the check fails and these metrics are skipped.

How to replicate

(We had a recent issue with nginx, that's why the setup - it could be replicated way easier)

Prerequisites:

nginx
nginx-module-sts
nginx-module-stream-sts
nginx-module-vts (not sure if this one is needed)

# nginx modules
git clone https://github.com/vozlt/nginx-module-sts.git
git clone https://github.com/vozlt/nginx-module-stream-sts.git
git clone https://github.com/vozlt/nginx-module-vts.git

# nginx
wget http://nginx.org/download/nginx-1.18.0.tar.gz
tar -xzvf nginx-1.18.0.tar.gz
cd nginx-1.18.0

# configure & build nginx
./configure --with-stream --add-module=/path/to/nginx-module-sts --add-module=/path/to/nginx-module-stream-sts --add-module=/path/to/nginx-module-vts --with-http_stub_status_module
make
make install

Make sure to use the below nginx.conf which will be in /usr/local/nginx/conf/nginx.conf before starting nginx.

nginx.conf

events {
    worker_connections  1024;
}

http {
    stream_server_traffic_status_zone;
    server {
        listen       80;
        server_name  localhost;

        location / {
            root   html;
            index  index.html index.htm;
        }

        location /nginx_status {
            stub_status;
        }

        location /http_traffic_status {
            vhost_traffic_status_bypass_limit on;
            vhost_traffic_status_bypass_stats on;
            vhost_traffic_status_display;
            vhost_traffic_status_display_format html;
        }

        location /status {
            stream_server_traffic_status_display;
            stream_server_traffic_status_display_format html;
        }

        location = /50x.html {
            root   html;
        }
    }
}

stream {
    server_traffic_status_zone;

    server {
        listen 8091;
        proxy_pass localhost:8092;
    }
}

Then start nginx and make some dummy requests.

# start nginx
sudo /usr/local/nginx/sbin/nginx

# GET request to create some stream metrics
curl http://localhost:8091

Metricbeat Prometheus config

- module: prometheus
  period: 10s
  metricsets: ["collector"]
  hosts: ["localhost:80"]
  metrics_path: /status/format/prometheus

Possible solution

Did a quick test to see if it works and it does but not sure what the impact would be. Needs more testing.

diff --git a/metricbeat/helper/prometheus/textparse.go b/metricbeat/helper/prometheus/textparse.go
index 4dca85c3aa..e05a3783f0 100644
--- a/metricbeat/helper/prometheus/textparse.go
+++ b/metricbeat/helper/prometheus/textparse.go
@@ -489,6 +489,7 @@ func ParseMetricFamilies(b []byte, contentType string, ts time.Time, logger *log
 		histogramsByName     = map[string]map[string]*OpenMetric{}
 		fam                  *MetricFamily
 		mt                   = textparse.MetricTypeUnknown
+		metricTypes          = make(map[string]textparse.MetricType)
 	)
 	var err error
 
@@ -530,7 +531,7 @@ func ParseMetricFamilies(b []byte, contentType string, ts time.Time, logger *log
 			} else {
 				fam.Type = t
 			}
-			mt = t
+			metricTypes[s] = t
 			continue
 		case textparse.EntryHelp:
 			buf, t := parser.Help()
@@ -611,6 +612,8 @@ func ParseMetricFamilies(b []byte, contentType string, ts time.Time, logger *log
 		lookupMetricName := metricName
 		var exm *exemplar.Exemplar
 
+		mt = metricTypes[metricName]
+
 		// Suffixes - https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#suffixes
 		switch mt {
 		case textparse.MetricTypeCounter:

cc @agithomas

The text was updated successfully, but these errors were encountered:

agithomas · 2024-05-24T04:25:30Z

Thanks @gpop63 for looking into the issue.

Engaging code owners, @elastic/obs-cloud-monitoring , for validation.

zmoog · 2024-05-24T07:40:34Z

Engaging code owners, @elastic/obs-cloud-monitoring , for validation.

Hey @agithomas, thanks for the heads up.

We are sunsetting elastic/obs-cloud-monitoring, so the reference handle for our team is now @elastic/obs-ds-hosted-services.

constanca-m · 2024-05-24T08:05:26Z

I have worked a bit with how to parse metrics. I believe this should be the PR that made the latest big change there: #36669.

I think this is happening because we expect the metrics to be like this:

#HELP metric1
#TYPE metric1
metric1
#HELP metric2
#TYPE metric2
metric2

And according to the metrics you put in the description, it is like this:

#HELP metric1
#TYPE metric1
#HELP metric2
#TYPE metric2
metric1
metric2

I believe it should not be a very hard change. Right now, we have unit tests for all possibilities, so we should create some for multiple metrics like you specified to test your change.

shoulian-zhao · 2024-07-03T07:40:20Z

@gpop63 This has been backported to 8.14.
May I know if it is possible to backport to 8.13?

pierrehilbert · 2024-07-03T07:52:29Z

As we are not planning to release a new 8.13 version, I'm not sure it would be useful.

shoulian-zhao · 2024-07-03T08:21:38Z

@pierrehilbert The customer who met this issue and raised https://github.com/elastic/sdh-beats/issues/4742 is planning the upgrade.
The issue is that they tested 8.13.x on their test environment, so they will upgrade to 8.13, not 8.14.
It will be helpful for them to have this backported to 8.13.

pierrehilbert · 2024-07-03T08:28:48Z

Backporting to 8.13 won't imply having a new release for 8.13: as I mentioned in previous comment, we are not planning to release a new 8.13 version.
The only thing that could be done with a backport to 8.13 would be to create a customer build (so not an official release) that you will be able to send them but this is not something we are recommending for production as this is not an official release.

shoulian-zhao · 2024-07-03T08:35:19Z

Thanks @pierrehilbert for the explanation.

botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label May 23, 2024

agithomas added the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team label May 24, 2024

botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label May 24, 2024

agithomas added the needs_team Indicates that the issue/PR needs a Team:* label label May 24, 2024

botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label May 24, 2024

gpop63 mentioned this issue Jun 12, 2024

Fix Prometheus metric types parser #39743

Merged

6 tasks

pierrehilbert closed this as completed in #39743 Jul 2, 2024

mergify bot mentioned this issue Jul 2, 2024

[8.14](backport #39743) Fix Prometheus metric types parser #40078

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prometheus parser incorrectly classifies counters as histograms #39705

Prometheus parser incorrectly classifies counters as histograms #39705

gpop63 commented May 23, 2024 •

edited

Loading

How to replicate

Prerequisites:

Metricbeat Prometheus config

agithomas commented May 24, 2024 •

edited

Loading

zmoog commented May 24, 2024

constanca-m commented May 24, 2024

shoulian-zhao commented Jul 3, 2024

pierrehilbert commented Jul 3, 2024

shoulian-zhao commented Jul 3, 2024

pierrehilbert commented Jul 3, 2024

shoulian-zhao commented Jul 3, 2024

Prometheus parser incorrectly classifies counters as histograms #39705

Prometheus parser incorrectly classifies counters as histograms #39705

Comments

gpop63 commented May 23, 2024 • edited Loading

Overview

How to replicate

Prerequisites:

Metricbeat Prometheus config

Possible solution

agithomas commented May 24, 2024 • edited Loading

zmoog commented May 24, 2024

constanca-m commented May 24, 2024

shoulian-zhao commented Jul 3, 2024

pierrehilbert commented Jul 3, 2024

shoulian-zhao commented Jul 3, 2024

pierrehilbert commented Jul 3, 2024

shoulian-zhao commented Jul 3, 2024

gpop63 commented May 23, 2024 •

edited

Loading

agithomas commented May 24, 2024 •

edited

Loading