Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prometheus parser incorrectly classifies counters as histograms #39705

Closed
gpop63 opened this issue May 23, 2024 · 8 comments · Fixed by #39743
Closed

Prometheus parser incorrectly classifies counters as histograms #39705

gpop63 opened this issue May 23, 2024 · 8 comments · Fixed by #39743
Labels
Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team

Comments

@gpop63
Copy link
Contributor

gpop63 commented May 23, 2024

Overview

While attempting to ingest nginx stream metrics using the prometheus module, we've noticed that certain metrics such as nginx_sts_upstream_connects_total and nginx_sts_server_connects_total are not being ingested.

This issue is not present in metricbeat version 7.17.x where all metrics are ingested correctly. However, in versions 8.7 and later, these metrics are being skipped.

The issue seems to be rooted in the text parser logic in metricbeat/helper/prometheus/textparse.go, specifically within the ParseMetricFamilies function. When processing a group of metrics, the metric type variable is set to the latest metric type in the group, even if their types might differ.

# HELP nginx_sts_server_bytes_total The request/response bytes
# TYPE nginx_sts_server_bytes_total counter
# HELP nginx_sts_server_connects_total The connects counter
# TYPE nginx_sts_server_connects_total counter
# HELP nginx_sts_server_session_seconds_total The session duration time
# TYPE nginx_sts_server_session_seconds_total counter
# HELP nginx_sts_server_session_seconds The average of session duration time in seconds
# TYPE nginx_sts_server_session_seconds gauge
# HELP nginx_sts_server_session_duration_seconds The histogram of session duration in seconds
# TYPE nginx_sts_server_session_duration_seconds histogram
nginx_sts_server_bytes_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",direction="in"} 0
nginx_sts_server_bytes_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",direction="out"} 0
nginx_sts_server_connects_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",code="1xx"} 0
nginx_sts_server_connects_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",code="2xx"} 0
nginx_sts_server_connects_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",code="3xx"} 0
nginx_sts_server_connects_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",code="4xx"} 0
nginx_sts_server_connects_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",code="5xx"} 171
nginx_sts_server_connects_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",code="total"} 171
nginx_sts_server_session_seconds_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP"} 0.016
nginx_sts_server_session_seconds{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP"} 0.000

After the loop completes for a group of metrics, mt is set to the last metric type in the group, which in this case is histogram. This is problematic because for example nginx_sts_server_connects_total is actually a counter. This causes an error later in the code when we validate if the metric name contains a suffix specific to a histogram - the check fails and these metrics are skipped.

How to replicate

(We had a recent issue with nginx, that's why the setup - it could be replicated way easier)

Prerequisites:

  • nginx
  • nginx-module-sts
  • nginx-module-stream-sts
  • nginx-module-vts (not sure if this one is needed)
# nginx modules
git clone https://github.com/vozlt/nginx-module-sts.git
git clone https://github.com/vozlt/nginx-module-stream-sts.git
git clone https://github.com/vozlt/nginx-module-vts.git

# nginx
wget http://nginx.org/download/nginx-1.18.0.tar.gz
tar -xzvf nginx-1.18.0.tar.gz
cd nginx-1.18.0

# configure & build nginx
./configure --with-stream --add-module=/path/to/nginx-module-sts --add-module=/path/to/nginx-module-stream-sts --add-module=/path/to/nginx-module-vts --with-http_stub_status_module
make
make install

Make sure to use the below nginx.conf which will be in /usr/local/nginx/conf/nginx.conf before starting nginx.

nginx.conf

events {
    worker_connections  1024;
}

http {
    stream_server_traffic_status_zone;
    server {
        listen       80;
        server_name  localhost;

        location / {
            root   html;
            index  index.html index.htm;
        }

        location /nginx_status {
            stub_status;
        }

        location /http_traffic_status {
            vhost_traffic_status_bypass_limit on;
            vhost_traffic_status_bypass_stats on;
            vhost_traffic_status_display;
            vhost_traffic_status_display_format html;
        }

        location /status {
            stream_server_traffic_status_display;
            stream_server_traffic_status_display_format html;
        }

        location = /50x.html {
            root   html;
        }
    }
}

stream {
    server_traffic_status_zone;

    server {
        listen 8091;
        proxy_pass localhost:8092;
    }
}

Then start nginx and make some dummy requests.

# start nginx
sudo /usr/local/nginx/sbin/nginx

# GET request to create some stream metrics
curl http://localhost:8091

Metricbeat Prometheus config

- module: prometheus
  period: 10s
  metricsets: ["collector"]
  hosts: ["localhost:80"]
  metrics_path: /status/format/prometheus

Possible solution

Did a quick test to see if it works and it does but not sure what the impact would be. Needs more testing.

diff --git a/metricbeat/helper/prometheus/textparse.go b/metricbeat/helper/prometheus/textparse.go
index 4dca85c3aa..e05a3783f0 100644
--- a/metricbeat/helper/prometheus/textparse.go
+++ b/metricbeat/helper/prometheus/textparse.go
@@ -489,6 +489,7 @@ func ParseMetricFamilies(b []byte, contentType string, ts time.Time, logger *log
 		histogramsByName     = map[string]map[string]*OpenMetric{}
 		fam                  *MetricFamily
 		mt                   = textparse.MetricTypeUnknown
+		metricTypes          = make(map[string]textparse.MetricType)
 	)
 	var err error
 
@@ -530,7 +531,7 @@ func ParseMetricFamilies(b []byte, contentType string, ts time.Time, logger *log
 			} else {
 				fam.Type = t
 			}
-			mt = t
+			metricTypes[s] = t
 			continue
 		case textparse.EntryHelp:
 			buf, t := parser.Help()
@@ -611,6 +612,8 @@ func ParseMetricFamilies(b []byte, contentType string, ts time.Time, logger *log
 		lookupMetricName := metricName
 		var exm *exemplar.Exemplar
 
+		mt = metricTypes[metricName]
+
 		// Suffixes - https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#suffixes
 		switch mt {
 		case textparse.MetricTypeCounter:

cc @agithomas

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label May 23, 2024
@agithomas
Copy link
Contributor

agithomas commented May 24, 2024

Thanks @gpop63 for looking into the issue.

Engaging code owners, @elastic/obs-cloud-monitoring , for validation.

@agithomas agithomas added the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team label May 24, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label May 24, 2024
@agithomas agithomas added the needs_team Indicates that the issue/PR needs a Team:* label label May 24, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label May 24, 2024
@zmoog
Copy link
Contributor

zmoog commented May 24, 2024

Engaging code owners, @elastic/obs-cloud-monitoring , for validation.

Hey @agithomas, thanks for the heads up.

We are sunsetting elastic/obs-cloud-monitoring, so the reference handle for our team is now @elastic/obs-ds-hosted-services.

@constanca-m
Copy link
Contributor

I have worked a bit with how to parse metrics. I believe this should be the PR that made the latest big change there: #36669.

I think this is happening because we expect the metrics to be like this:

#HELP metric1
#TYPE metric1
metric1
#HELP metric2
#TYPE metric2
metric2

And according to the metrics you put in the description, it is like this:

#HELP metric1
#TYPE metric1
#HELP metric2
#TYPE metric2
metric1
metric2

I believe it should not be a very hard change. Right now, we have unit tests for all possibilities, so we should create some for multiple metrics like you specified to test your change.

@shoulian-zhao
Copy link

@gpop63 This has been backported to 8.14.
May I know if it is possible to backport to 8.13?

@pierrehilbert
Copy link
Collaborator

As we are not planning to release a new 8.13 version, I'm not sure it would be useful.

@shoulian-zhao
Copy link

@pierrehilbert The customer who met this issue and raised https://github.com/elastic/sdh-beats/issues/4742 is planning the upgrade.
The issue is that they tested 8.13.x on their test environment, so they will upgrade to 8.13, not 8.14.
It will be helpful for them to have this backported to 8.13.

@pierrehilbert
Copy link
Collaborator

Backporting to 8.13 won't imply having a new release for 8.13: as I mentioned in previous comment, we are not planning to release a new 8.13 version.
The only thing that could be done with a backport to 8.13 would be to create a customer build (so not an official release) that you will be able to send them but this is not something we are recommending for production as this is not an official release.

@shoulian-zhao
Copy link

Thanks @pierrehilbert for the explanation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants