Prometheus endpoint sending invalid type for Histograms #740

amoldavsky · 2019-08-01T17:53:43Z

For records of type Histogram, the Prometheus endpoint sends quantile samples, which are not valid for the histogram type. The Prometheus server appears to discard these records, but our DataDog integration tries to interpret them, which results in broken reports.

Here is a snippet of the endpoint output:

# HELP seldon_api_engine_client_requests_seconds Timer of RestTemplate operation
# TYPE seldon_api_engine_client_requests_seconds histogram

seldon_api_engine_client_requests_seconds{clientName="localhost",deployment_name="mnist-deployment",method="POST",model_image="seldonio/sk-example-mnist",model_name="sk-mnist-classifier",model_version="0.2",predictor_name="sk-mnist-predictor",predictor_version="v1",status="200",uri="/predict",quantile="0.5",} 0.005767168

seldon_api_engine_client_requests_seconds{clientName="localhost",deployment_name="mnist-deployment",method="POST",model_image="seldonio/sk-example-mnist",model_name="sk-mnist-classifier",model_version="0.2",predictor_name="sk-mnist-predictor",predictor_version="v1",status="200",uri="/predict",quantile="0.75",} 0.006029312

seldon_api_engine_client_requests_seconds{clientName="localhost",deployment_name="mnist-deployment",method="POST",model_image="seldonio/sk-example-mnist",model_name="sk-mnist-classifier",model_version="0.2",predictor_name="sk-mnist-predictor",predictor_version="v1",status="200",uri="/predict",quantile="0.95",} 0.0065536

seldon_api_engine_client_requests_seconds{clientName="localhost",deployment_name="mnist-deployment",method="POST",model_image="seldonio/sk-example-mnist",model_name="sk-mnist-classifier",model_version="0.2",predictor_name="sk-mnist-predictor",predictor_version="v1",status="200",uri="/predict",quantile="0.98",} 0.00786432

seldon_api_engine_client_requests_seconds{clientName="localhost",deployment_name="mnist-deployment",method="POST",model_image="seldonio/sk-example-mnist",model_name="sk-mnist-classifier",model_version="0.2",predictor_name="sk-mnist-predictor",predictor_version="v1",status="200",uri="/predict",quantile="0.99",} 0.00917504

seldon_api_engine_client_requests_seconds_bucket{clientName="localhost",deployment_name="mnist-deployment",method="POST",model_image="seldonio/sk-example-mnist",model_name="sk-mnist-classifier",model_version="0.2",predictor_name="sk-mnist-predictor",predictor_version="v1",status="200",uri="/predict",le="0.001",} 0.0

seldon_api_engine_client_requests_seconds_bucket{clientName="localhost",deployment_name="mnist-deployment",method="POST",model_image="seldonio/sk-example-mnist",model_name="sk-mnist-classifier",model_version="0.2",predictor_name="sk-mnist-

The first five samples belong to a metric declared as a histogram, yet they carry a quantile label.

According to the documentation, a histogram exposes only three kinds of series (_bucket, _sum and _count), and quantiles are not among them:
https://prometheus.io/docs/concepts/metric_types/#histogram
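
For anyone who wants to check their own endpoint output for this mismatch, here is a minimal Python sketch. It is not part of Seldon, Micrometer or Prometheus; the function name find_quantiles_in_histograms is hypothetical and the parsing is deliberately simplified. It looks for metrics declared with "# TYPE <name> histogram" and reports any sample of that exact name that carries a quantile label:

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: name="value".
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def find_quantiles_in_histograms(exposition):
    """Return sample lines that carry a quantile label under a histogram TYPE."""
    histogram_names = set()
    offending = []
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            parts = line.split()
            # Expected shape: "# TYPE <name> <type>"
            if len(parts) >= 4 and parts[1] == "TYPE" and parts[3] == "histogram":
                histogram_names.add(parts[2])
            continue
        match = SAMPLE_RE.match(line)
        if not match:
            continue  # truncated or malformed line, skipped in this sketch
        name, labels, _value = match.groups()
        label_names = {key for key, _ in LABEL_RE.findall(labels or "")}
        if name in histogram_names and "quantile" in label_names:
            offending.append(line)
    return offending


if __name__ == "__main__":
    # Labels shortened for readability; the metric name is the one from the snippet above.
    text = (
        "# TYPE seldon_api_engine_client_requests_seconds histogram\n"
        'seldon_api_engine_client_requests_seconds{uri="/predict",quantile="0.5",} 0.005767168\n'
        'seldon_api_engine_client_requests_seconds_bucket{uri="/predict",le="0.001",} 0.0\n'
    )
    for bad in find_quantiles_in_histograms(text):
        print("quantile sample under histogram type:", bad)
```

The _bucket, _sum and _count series have different metric names from the declared histogram, so only the bare-name samples with a quantile label are reported.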


ukclivecox · 2019-08-24T08:03:56Z

We are using Micrometer with the following configuration:

seldon-core/engine/src/main/resources/application.properties

Lines 6 to 7 in 559f897

    
    management.metrics.web.client.requests-metric-name=seldon.api.engine.client.requests
    management.metrics.distribution.percentiles.all=0.5, 0.75, 0.95, 0.98, 0.99

What version of Prometheus are you using? We have had no issues with these metrics in our analytics helm chart using Prometheus and Grafana.

Can you post the error and say which component is reporting it?

markusgay · 2019-08-27T12:57:31Z

Hello Clive,

Thank you for your response.
You have configured Micrometer to report histogram buckets and percentiles in the same metric.
The configuration settings management.metrics.distribution.percentiles and management.metrics.distribution.percentiles-histogram should not be set together
if the output is to comply with the Prometheus specification.
The setting management.metrics.distribution.percentiles adds quantile records to the Prometheus histogram type that management.metrics.distribution.percentiles-histogram produces.
Quantile records are reserved for the Prometheus summary type.
The Prometheus server implementation from prometheus.io ignores any records that cannot belong to the declared metric type.
Other implementations, such as Datadog and New Relic, treat invalid records as an error and do not process the received metric.

markusgay · 2019-08-27T19:37:47Z

Hello,

The earlier Micrometer issue micrometer-metrics/micrometer#562 explains the difference between the Micrometer histogram and the Prometheus histogram and summary metric types. If an application wants to comply with the Prometheus metric type definitions, it has to define one timer with the method 'publishPercentileHistogram(true)' for the histogram metric, and a separate timer with the method 'publishPercentiles(0.5, 0.75, 0.95, 0.98, 0.99)' for the summary metric.

ukclivecox · 2019-08-27T21:55:06Z

This is our setup at present:

seldon-core/engine/src/main/resources/application.properties

Lines 4 to 8 in 58927fa

    
    management.metrics.web.server.auto-time-requests=false
    management.metrics.web.server.requests-metric-name=seldon.api.engine.server.requests
    management.metrics.web.client.requests-metric-name=seldon.api.engine.client.requests
    management.metrics.distribution.percentiles.all=0.5, 0.75, 0.95, 0.98, 0.99
    management.metrics.distribution.percentiles-histogram.all=true

Happy to discuss how this can be changed.

markusgay · 2019-08-29T15:04:54Z

Your Grafana dashboard file predictions-analytics-dashboard.json uses the histogram_quantile function (example below) to calculate quantiles from the buckets on the server side. The application setting 'management.metrics.distribution.percentiles.all=0.5, 0.75, 0.95, 0.98, 0.99', which computes quantiles on the application side, is therefore not needed.


       "expr": "histogram_quantile(0.99, sum(rate(seldon_api_engine_client_requests_seconds_bucket{uri=\"/predict\",model_image=~\"$model_image\",predictor_name=~\"$predictor\",predictor_version=~\"$version\",model_name=~\"$model_name\",model_version=~\"$model_version\"}[20s])) by (predictor_name,predictor_version,model_name,model_image,model_version,le))",

If you want to use both the Prometheus histogram and summary metric types in your dashboards, you would have to create two separate timed metrics in your application source code. At the moment, this cannot be achieved through application property settings alone.
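
To illustrate why the buckets alone are enough, here is a simplified Python sketch of the linear interpolation that histogram_quantile applies to cumulative bucket counts. It is an approximation written for this thread, not the PromQL implementation: the function name estimate_histogram_quantile is hypothetical, the bucket values in the example are made up, and it assumes positive upper bounds with a final +Inf bucket.

```python
import math


def estimate_histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: assumes positive bounds and that the last bucket
    has an upper bound of +Inf, as in the Prometheus exposition format.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last with le=+Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, nothing to estimate
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if index == len(buckets) - 1:
        # The rank falls into the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper = buckets[index][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0.0
    in_bucket = buckets[index][1] - below
    if in_bucket == 0:
        return lower  # only reachable for q == 0 with an empty first bucket
    # Interpolate linearly inside the selected bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


if __name__ == "__main__":
    # Illustrative cumulative counts, not taken from a real deployment.
    example = [(0.005, 10), (0.01, 30), (0.025, 40), (math.inf, 40)]
    print(estimate_histogram_quantile(0.99, example))
```

The result is an estimate whose accuracy depends on how the bucket boundaries are chosen, which is the trade-off between server-side quantiles from a histogram and client-side quantiles from a summary.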

ukclivecox · 2019-09-05T10:16:43Z

Are you saying Micrometer does not allow for what you need? Can you suggest the correct config for Spring/Micrometer?

markusgay · 2019-09-06T17:08:56Z

The configuration without the line 'management.metrics.distribution.percentiles.all=0.5, 0.75, 0.95, 0.98, 0.99' works for us. I also looked at your Grafana dashboard code, and it uses the histogram_quantile function, so removing the setting 'management.metrics.distribution.percentiles.all=0.5, 0.75, 0.95, 0.98, 0.99' should not break your dashboards.

ukclivecox · 2020-01-16T11:14:17Z

We are moving to a Go replacement, which will be available in 1.1.

ukclivecox · 2020-01-16T11:15:01Z

@markusgay Please reopen if this is still an issue in the Go executor.

zyxue · 2020-08-05T16:22:14Z

I also find this a bit confusing. Do I understand correctly that in principle, those quantile lines (with quantile="something") shouldn't appear in the histogram metric type?

ukclivecox added this to the 1.0.x milestone Aug 24, 2019

ukclivecox assigned gsunner Sep 5, 2019

ukclivecox modified the milestones: 1.0.x, 0.5.x Sep 18, 2019

ukclivecox modified the milestones: 0.5.x, 1.0.x Oct 31, 2019

ukclivecox assigned JoelH96 Nov 7, 2019

ukclivecox modified the milestones: 1.0, 1.1 Nov 7, 2019

ukclivecox unassigned JoelH96 Jan 16, 2020

ukclivecox closed this as completed Jan 16, 2020
