
Basic prometheus metrics #745

Closed
jmprusi opened this issue Jun 5, 2018 · 23 comments

@jmprusi
Contributor

jmprusi commented Jun 5, 2018

APIcast ships with Prometheus support, but only exposes the nginx_metric_errors_total metric.

I would like to propose some basic metrics to be added to the APIcast base policy:

Counters

  • Request:
    • Total
    • Request 2xx
    • Request 4xx
    • Request 5xx
  • Connections:
    • Read
    • Write
    • Wait
    • Open
  • Nginx Error Log
  • Free Dictionary Space
  • Threescale (fetching config):
    • Update
    • Reload

Histogram

  • Request latency

Some of them were already added to apicast-cloud-hosted: https://github.com/3scale/apicast-cloud-hosted/pull/5/files#diff-047b1780b0ffeb4eba7b3d05beb76d5e
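
For illustration only, here is a minimal sketch of how such counters and the latency histogram could be registered with nginx-lua-prometheus, the library APIcast already uses (it is where nginx_metric_errors_total comes from). The metric names below are placeholders, not a proposal for the final names:

-- Sketch only: assumed metric names, using the nginx-lua-prometheus API.
local prometheus = require("prometheus").init("prometheus_metrics")

local requests_total = prometheus:counter(
  "apicast_requests_total", "Requests by status class", {"status"})
local request_latency = prometheus:histogram(
  "apicast_request_latency_seconds", "Request latency in seconds")

-- per request, e.g. in the log phase:
local status_class = string.format("%dxx", math.floor(ngx.status / 100))
requests_total:inc(1, {status_class})
request_latency:observe(tonumber(ngx.var.request_time))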

What do you think? Any other metrics to add?

@mikz
Contributor

mikz commented Jun 7, 2018

The APIcast policy does several things:

  • extracts credentials from the request and possibly terminates the request with an error
  • maps mapping rules to metrics
  • calls 3scale for authorization
  • sends the request to upstream
  • uses a round-robin load balancer

I think the APIcast policy metrics should be focused on those operations (and maybe some others I missed). IMO that does not include the nginx error log, free shdict space (unless it is monitoring the shdicts APIcast uses), or nginx connections.

If we want those metrics, then they should be in some other policy (possibly active by default).

Error log and shdict space monitoring are very important metrics that should be available somehow.

One metric we could add is how many requests (and with what status) were terminated by a policy rather than coming from upstream.

@andrewdavidmackenzie
Member

Reading between the lines, it sounds like there's a slight difference of opinion on where the metrics should be implemented.

Joaquim proposes a list of metrics and suggests implementing them in the APIcast policy.

Michal lists a set that is related to APIcast policy operations and would make sense to implement there.

But it's not clear how to handle the ones that are not related to the APIcast policy?

BTW: I think I saw people asking for "#BytesTransferred" also in sme-apis.

@mikz
Contributor

mikz commented Jun 7, 2018

@andrewdavidmackenzie metrics is a phase implemented by policies. Each policy can expose metrics about its own operation, and in the end they are all merged together. So the APIcast policy would expose metrics about itself, and other policies would expose metrics about their functionality. Of course, there could also be policies that just expose metrics.
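
To make that concrete, here is a minimal sketch of a policy that only exposes metrics, assuming the apicast.policy / apicast.prometheus module layout and a metrics phase that runs when the Prometheus endpoint is scraped (the module paths, helper signatures, and phase name here are assumptions, not a verified API):

-- Illustrative sketch only; the callable prometheus helper and the
-- metrics() phase are assumptions about the policy framework.
local policy = require('apicast.policy')
local prometheus = require('apicast.prometheus')

local _M = policy.new('Example metrics policy', '0.1')

-- register a counter for something this policy itself does
local operations = prometheus('counter', 'example_operations_total',
                              'Operations performed by this policy')

function _M:log()
  if operations then operations:inc(1) end
end

-- called when the metrics endpoint is scraped; output from every policy
-- in the chain ends up merged into a single exposition
function _M:metrics()
  return prometheus.collect()
end

return _M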

@andrewdavidmackenzie
Member

OK, cool.

My main point was that it sounded to me like Joaquim was asking for some metrics that are not related to any specific policy, but to the underlying NGINX?
(Free dictionary space etc....)

@mikz
Contributor

mikz commented Jun 7, 2018

Yes. And we can expose them in some other policy or make every policy responsible for monitoring its own free space. But we will need some global non-APIcast/3scale metrics anyway, so it's probably better to put them in some nginx metrics policy.

@andrewdavidmackenzie
Member

👍

@jmprusi
Contributor Author

jmprusi commented Jun 12, 2018

@mikz Yes, it makes sense to have specific APIcast policy metrics (related mostly to 3scale and the operation mode) and then have another policy for basic metrics.

@davidor
Contributor

davidor commented Jul 26, 2018

Ping @3scale/product
Can you provide your input for this feature and decide whether it should be part of the next release?
Thanks.

@mikz
Contributor

mikz commented Jul 31, 2018

@davidor I think this is not necessary for the next release, but it is possible it will be done by the Ostia team.

@andrewdavidmackenzie
Member

This was raised last week by Product as a last-minute request.
If @MarkCheshire can get us a simple list, ASAP, then we said we'd consider it.
But I'd say it's at the bottom of the priority list and we shouldn't delay the release for this.

@mikz mikz added this to the 3.3 milestone Aug 22, 2018
@MarkCheshire
Contributor

I recommend this base set of metrics to start:

  • 3scale-auth status codes: Total, 2xx, 4xx, 5xx
  • Upstream status codes: Total, 2xx, 4xx, 5xx
  • Request time:
    • must: full end-to-end latency
    • must: upstream latency
    • optional: breakdown of latency in APIcast pre- and post-request
  • Connections (per Joaquim): Read, Write, Wait, Open

@davidor davidor self-assigned this Aug 28, 2018
@davidor davidor mentioned this issue Aug 29, 2018
@davidor
Contributor

davidor commented Aug 31, 2018

This is what is going to be included in 3.3: #860
I'll keep the issue open so we can discuss what to include in future versions.

@davidor davidor removed this from the 3.3 milestone Aug 31, 2018
@andrewdavidmackenzie
Member

Or close it and open a new enhancement issue to discuss what to add?

(It's nice to see issues get closed.....)

@gnunn1

gnunn1 commented Sep 5, 2018

Are Prometheus metrics available in the current master version of APIcast? I'm curling the /metrics endpoint, where you would normally find Prometheus metrics, and not seeing anything:

sh-4.2$ curl -i http://127.0.0.1:8080/metrics
HTTP/1.1 404 Not Found
Server: openresty/1.13.6.2
Date: Wed, 05 Sep 2018 19:41:18 GMT
Content-Type: text/plain
Transfer-Encoding: chunked
Connection: keep-alive

sh-4.2$ curl -i http://127.0.0.1:8090/metrics
HTTP/1.1 404 Not Found
Server: openresty/1.13.6.2
Date: Wed, 05 Sep 2018 19:41:23 GMT
Content-Type: text/plain
Transfer-Encoding: chunked
Connection: keep-alive

Could not resolve GET /metrics - nil

Is there an environment variable that needs to be enabled?

@mikz
Contributor

mikz commented Sep 6, 2018

@gnunn1 Metrics are exposed on port 9421.

$ curl localhost:9421/metrics                                                                                                                     
# HELP nginx_http_connections Number of HTTP connections
# TYPE nginx_http_connections gauge
nginx_http_connections{state="accepted"} 1
nginx_http_connections{state="active"} 1
nginx_http_connections{state="handled"} 1
nginx_http_connections{state="reading"} 0
nginx_http_connections{state="total"} 1
nginx_http_connections{state="waiting"} 0
nginx_http_connections{state="writing"} 1
# HELP nginx_metric_errors_total Number of nginx-lua-prometheus errors
# TYPE nginx_metric_errors_total counter
nginx_metric_errors_total 0
# HELP openresty_shdict_capacity OpenResty shared dictionary capacity
# TYPE openresty_shdict_capacity gauge
openresty_shdict_capacity{dict="api_keys"} 10485760
openresty_shdict_capacity{dict="batched_reports"} 1048576
openresty_shdict_capacity{dict="batched_reports_locks"} 1048576
openresty_shdict_capacity{dict="cached_auths"} 1048576
openresty_shdict_capacity{dict="configuration"} 10485760
openresty_shdict_capacity{dict="init"} 16384
openresty_shdict_capacity{dict="limiter"} 1048576
openresty_shdict_capacity{dict="locks"} 1048576
openresty_shdict_capacity{dict="prometheus_metrics"} 16777216
# HELP openresty_shdict_free_space OpenResty shared dictionary free space
# TYPE openresty_shdict_free_space gauge
openresty_shdict_free_space{dict="api_keys"} 10412032
openresty_shdict_free_space{dict="batched_reports"} 1032192
openresty_shdict_free_space{dict="batched_reports_locks"} 1032192
openresty_shdict_free_space{dict="cached_auths"} 1032192
openresty_shdict_free_space{dict="configuration"} 10412032
openresty_shdict_free_space{dict="init"} 4096
openresty_shdict_free_space{dict="limiter"} 1032192
openresty_shdict_free_space{dict="locks"} 1032192
openresty_shdict_free_space{dict="prometheus_metrics"} 16662528

@gnunn1

gnunn1 commented Sep 6, 2018

Thanks @mikz, that works fine. Are 4xx and 5xx response codes supposed to be captured in the Prometheus metrics like the 2xx response codes? If I execute a request in Postman that generates a 404 response, i.e. requesting a REST entity that doesn't exist or one for which a mapping rule hasn't been set in 3scale, I don't see the 4xx response status codes reported in backend_response{status="4xx"}.

I do have APICAST_RESPONSE_CODES set to true and do see the 4xx response codes in 3scale analytics.

Here's the output of metrics after making a few 404 requests:

sh-4.2$ curl localhost:9421/metrics
# HELP backend_response Response status codes from 3scale's backend
# TYPE backend_response counter
backend_response{status="2xx"} 13
# HELP nginx_http_connections Number of HTTP connections
# TYPE nginx_http_connections gauge
nginx_http_connections{state="accepted"} 180
nginx_http_connections{state="active"} 2
nginx_http_connections{state="handled"} 180
nginx_http_connections{state="reading"} 0
nginx_http_connections{state="total"} 195
nginx_http_connections{state="waiting"} 1
nginx_http_connections{state="writing"} 1
# HELP nginx_metric_errors_total Number of nginx-lua-prometheus errors
# TYPE nginx_metric_errors_total counter
nginx_metric_errors_total 0
# HELP openresty_shdict_capacity OpenResty shared dictionary capacity
# TYPE openresty_shdict_capacity gauge
openresty_shdict_capacity{dict="api_keys"} 10485760
openresty_shdict_capacity{dict="batched_reports"} 1048576
openresty_shdict_capacity{dict="batched_reports_locks"} 1048576
openresty_shdict_capacity{dict="cached_auths"} 1048576
openresty_shdict_capacity{dict="configuration"} 10485760
openresty_shdict_capacity{dict="init"} 16384
openresty_shdict_capacity{dict="limiter"} 1048576
openresty_shdict_capacity{dict="locks"} 1048576
openresty_shdict_capacity{dict="prometheus_metrics"} 16777216
# HELP openresty_shdict_free_space OpenResty shared dictionary free space
# TYPE openresty_shdict_free_space gauge
openresty_shdict_free_space{dict="api_keys"} 10407936
openresty_shdict_free_space{dict="batched_reports"} 1032192
openresty_shdict_free_space{dict="batched_reports_locks"} 1032192
openresty_shdict_free_space{dict="cached_auths"} 1032192
openresty_shdict_free_space{dict="configuration"} 10412032
openresty_shdict_free_space{dict="init"} 4096
openresty_shdict_free_space{dict="limiter"} 1032192
openresty_shdict_free_space{dict="locks"} 1032192
openresty_shdict_free_space{dict="prometheus_metrics"} 16662528

@davidor
Contributor

davidor commented Sep 6, 2018

@gnunn1 , I'm using the version in the master branch and it works for me. I made a request with a valid user_key and another with an invalid one and this is what I get:

# HELP backend_response Response status codes from 3scale's backend
# TYPE backend_response counter
backend_response{status="2xx"} 1
backend_response{status="4xx"} 1

Keep in mind that the backend_response counter only shows status codes received from the 3scale backend. It does not show the status codes returned to whoever is calling APIcast. I think that is what is causing the confusion here.

When a request does not match any mapping rules, APIcast does not need to contact the 3scale backend, because mapping rules are stored in the APIcast configuration. APIcast only needs to call the backend to validate credentials (user_key, app_key, etc.) and to report metrics.

@gnunn1

gnunn1 commented Sep 6, 2018

@davidor The URL I am using to hit the service is:

http://apicast-rhoar.apps.ocplab.com/orders/3

This returns a 200 since order 3 is an available item. However, if I change the 3 to a 5 as follows:

http://apicast-rhoar.apps.ocplab.com/orders/5

The backend service returns a 404 since order 5 doesn't exist. Interestingly, the Prometheus metrics increment the 2xx counter as a result, despite Postman showing that a 404 is returned. Is this a case where a 404 is considered "successful" since in REST calls it can be a valid response? It doesn't feel intuitive if that's the case, and maybe it deserves its own category?

Validating this with curl against APIcast:

curl -i -H "user-key:xxxx" http://apicast-rhoar.apps.ocplab.com/orders/5
HTTP/1.1 404 
Server: openresty/1.13.6.2
Date: Thu, 06 Sep 2018 22:29:59 GMT
Content-Type: application/json
Content-Length: 35
X-Application-Context: gateway:kubernetes
Set-Cookie: 49d08fc35ccdc462e0e3e881ac73eef6=8b49e0c49ad92d0f89a37ec8b6bca4a4; path=/; HttpOnly
Cache-control: private

404 - Requested order doesn't exist

And then directly against the backend:

curl -i -H "user-key:xxxx" http://gateway-rhoar.apps.ocplab.com/orders/5
HTTP/1.1 404 
X-Application-Context: gateway:kubernetes
Date: Thu, 06 Sep 2018 22:31:56 GMT
Content-Type: application/json
Content-Length: 35
Set-Cookie: 565785e74d5ae03867cd44d8c8709949=e9cb74129f1f17d3a9280501f4c6d9cf; path=/; HttpOnly
Cache-control: private

404 - Requested order doesn't exist

If I change the user-key to an invalid entry, then the 4xx counter is incremented in response to the 403 Forbidden, as per your findings.

With regard to your explanation of why the non-matching mapping rule scenario doesn't increment the counter, I'm curious why a bad user-key increments the 4xx counter on a 403, since presumably APIcast never calls the backend in this scenario either?

@mikz
Contributor

mikz commented Sep 7, 2018

@gnunn1 the backend_response metric counts responses from the 3scale backend, not from your upstream. You can see it increment for a 403 when you use a wrong user key, for example.

But it is a good point that we should rename the metric, as backend_response does not indicate that it is 3scale-specific.

@andrewdavidmackenzie
Member

"backend" is an internal term we try to avoid using "externally" (customer visible).

(3scale) "Service Management API" is the official name of the API that is being called and that returns this response.

Either a generic "3scale" or "authrep request" or something is needed to clarify this.

@davidor
Contributor

davidor commented Oct 16, 2018

We've added several metrics in different PRs. All of them are linked in this issue.
You can check the full list of metrics exported in this document: https://github.com/3scale/apicast/blob/master/doc/prometheus-metrics.md

@davidor davidor closed this as completed Oct 16, 2018
@gnunn1

gnunn1 commented Oct 16, 2018

@davidor I installed APIcast from master and can see the response times. However, if I am reading them correctly, they are global to the gateway rather than service- or mapping/metric-specific. Are there any plans to make these more granular so we could build more specific dashboards in something like Grafana?

# HELP total_response_time_seconds Time needed to sent a response to the client (in seconds).
# TYPE total_response_time_seconds histogram
total_response_time_seconds_bucket{le="00.005"} 5
total_response_time_seconds_bucket{le="00.010"} 5
total_response_time_seconds_bucket{le="00.020"} 5
total_response_time_seconds_bucket{le="00.030"} 5
total_response_time_seconds_bucket{le="00.050"} 5
total_response_time_seconds_bucket{le="00.075"} 5
total_response_time_seconds_bucket{le="00.100"} 5
total_response_time_seconds_bucket{le="00.200"} 5
total_response_time_seconds_bucket{le="00.300"} 6
total_response_time_seconds_bucket{le="00.400"} 6
total_response_time_seconds_bucket{le="00.500"} 6
total_response_time_seconds_bucket{le="00.750"} 6
total_response_time_seconds_bucket{le="01.000"} 6
total_response_time_seconds_bucket{le="01.500"} 6
total_response_time_seconds_bucket{le="02.000"} 6
total_response_time_seconds_bucket{le="03.000"} 6
total_response_time_seconds_bucket{le="04.000"} 6
total_response_time_seconds_bucket{le="05.000"} 6
total_response_time_seconds_bucket{le="10.000"} 6
total_response_time_seconds_bucket{le="+Inf"} 6
total_response_time_seconds_count 6
total_response_time_seconds_sum 0.3
# HELP upstream_response_time_seconds Response times from upstream servers
# TYPE upstream_response_time_seconds histogram
upstream_response_time_seconds_bucket{le="00.005"} 5
upstream_response_time_seconds_bucket{le="00.010"} 5
upstream_response_time_seconds_bucket{le="00.020"} 5
upstream_response_time_seconds_bucket{le="00.030"} 5
upstream_response_time_seconds_bucket{le="00.050"} 5
upstream_response_time_seconds_bucket{le="00.075"} 5
upstream_response_time_seconds_bucket{le="00.100"} 6
upstream_response_time_seconds_bucket{le="00.200"} 6
upstream_response_time_seconds_bucket{le="00.300"} 6
upstream_response_time_seconds_bucket{le="00.400"} 6
upstream_response_time_seconds_bucket{le="00.500"} 6
upstream_response_time_seconds_bucket{le="00.750"} 6
upstream_response_time_seconds_bucket{le="01.000"} 6
upstream_response_time_seconds_bucket{le="01.500"} 6
upstream_response_time_seconds_bucket{le="02.000"} 6
upstream_response_time_seconds_bucket{le="03.000"} 6
upstream_response_time_seconds_bucket{le="04.000"} 6
upstream_response_time_seconds_bucket{le="05.000"} 6
upstream_response_time_seconds_bucket{le="10.000"} 6
upstream_response_time_seconds_bucket{le="+Inf"} 6
upstream_response_time_seconds_count 6
upstream_response_time_seconds_sum 0.1
# HELP upstream_status HTTP status from upstream servers
# TYPE upstream_status counter
upstream_status{status="200"} 6

@davidor
Contributor

davidor commented Oct 17, 2018

@gnunn1 Including services, upstreams, metrics, etc. in the Prometheus labels is something we'll evaluate in the future. Prometheus might not be the right tool to store that kind of information.

According to the Prometheus guidelines it is not recommended to use labels for dimensions that can have a large number of values, and in some deployments, the number of services, upstreams, and 3scale metrics can be very high.
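
As an illustration of that cardinality concern, a hypothetical labelled counter like the one sketched below creates a separate time series for every (service_id, status) combination, so thousands of services multiply the series count very quickly:

-- Hypothetical sketch using the nginx-lua-prometheus API; service_id is a
-- placeholder for however the gateway would identify the service.
local prometheus = require("prometheus").init("prometheus_metrics")

local upstream_status = prometheus:counter(
  "upstream_status", "HTTP status from upstream servers",
  {"service_id", "status"})

-- per request: every distinct label combination becomes its own series
local service_id = "42"
upstream_status:inc(1, { service_id, tostring(ngx.status) })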
