[nginx] VTS metrics breaks prometheus endpoint :10254/metrics (0.9.0-beta.3) #448

Closed
MaikuMori opened this issue Mar 15, 2017 · 15 comments · Fixed by #456

@MaikuMori

After deploying 0.9.0-beta.3 and enabling VTS, the metrics endpoint is broken.

Going to :10254/metrics throws a 500 with the following text:

An error has occurred during metrics gathering:

1120 error(s) occurred:
* collected metric nginx_nginx_all_upstream_responses_total label:<name:"server" value:"10.0.10.62:4000" > label:<name:"status_code" value:"1xx" > label:<name:"upstream" value:"redacted-redacted-staging-redacted" > counter:<value:0 >  was collected before with the same name and label values
* collected metric nginx_nginx_all_upstream_responses_total label:<name:"server" value:"10.0.10.62:4000" > label:<name:"status_code" value:"2xx" > label:<name:"upstream" value:"redacted-redacted-staging-redacted" > counter:<value:0 >  was collected before with the same name and label values
* collected metric nginx_nginx_all_upstream_responses_total label:<name:"server" value:"10.0.10.62:4000" > label:<name:"status_code" value:"3xx" > label:<name:"upstream" value:"redacted-redacted-staging-redacted" > counter:<value:0 >  was collected before with the same name and label values
* collected metric nginx_nginx_all_upstream_responses_total label:<name:"server" value:"10.0.10.62:4000" > label:<name:"status_code" value:"4xx" > label:<name:"upstream" value:"redacted-redacted-staging-redacted" > counter:<value:0 >  was collected before with the same name and label values
* collected metric nginx_nginx_all_upstream_responses_total label:<name:"server" value:"10.0.10.62:4000" > label:<name:"status_code" value:"5xx" > label:<name:"upstream" value:"redacted-redacted-staging-redacted" > counter:<value:0 >  was collected before with the same name and label values

... snip ... (they're all the same error, just different values)
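
For context (not part of the original report): the Prometheus Go client refuses to serve a scrape in which the same metric name and label set is collected more than once, which is exactly what the error text above describes. Below is a minimal, self-contained sketch of that behavior; the collector is hypothetical and only mimics an exporter walking duplicated upstream server entries, it is not the ingress controller's actual code.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// dupCollector emits the same sample (metric name + label values) twice per
// scrape, mimicking an exporter that walks duplicated upstream server entries.
type dupCollector struct {
	desc *prometheus.Desc
}

func (c dupCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c dupCollector) Collect(ch chan<- prometheus.Metric) {
	for i := 0; i < 2; i++ { // two identical samples in a single scrape
		ch <- prometheus.MustNewConstMetric(
			c.desc, prometheus.CounterValue, 0,
			"172.17.0.8:8080", "2xx", "default-default-http-backend-port-1",
		)
	}
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(dupCollector{
		desc: prometheus.NewDesc(
			"nginx_nginx_all_upstream_responses_total",
			"upstream responses per server and status code",
			[]string{"server", "status_code", "upstream"}, nil,
		),
	})
	// Scraping /metrics here fails with HTTP 500 and
	// "was collected before with the same name and label values".
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":10254", nil))
}
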
@MaikuMori
Author

Possibly related people: @gianrubio @aledbf

Could it be that this particular Ingress has a somewhat complex setup?

Heavily redacted Ingress resource in question:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: redacted-staging
  namespace: redacted
  labels:
    environment: staging
    project: redacted
  annotations:
    kubernetes.io/tls-acme: "true"
    kubernetes.io/ingress.class: "nginx"
spec:
  tls:
    - hosts:
        - host1.redacted.com
        - host2.redacted.com
        - host3.redacted.com
        - host4.redacted.com
        - host5.redacted.com
        - host6.redacted.com
      secretName: redacted-tls
  rules:
    - host: host1.redacted.com
      http:
        paths:
          - path: /
            backend:
              serviceName: redacted-staging
              servicePort: port-1
          - path: /admin
            backend:
              serviceName: redacted-staging
              servicePort: port-1-admin
    - host: host2.redacted.com
      http:
        paths:
          - path: /
            backend:
              serviceName: redacted-staging
              servicePort: port-1
          - path: /admin
            backend:
              serviceName: redacted-staging
              servicePort: port-1-admin
    - host: host3.redacted.com
      http:
        paths:
          - path: /
            backend:
              serviceName: redacted-staging
              servicePort: port-2
    - host: host4.redacted.com
      http:
        paths:
          - path: /
            backend:
              serviceName: redacted-staging
              servicePort: port-3
    - host: host5.redacted.com
      http:
        paths:
          - path: /
            backend:
              serviceName: redacted-staging
              servicePort: port-4
    - host: host6.redacted.com
      http:
        paths:
          - path: /
            backend:
              serviceName: redacted-staging
              servicePort: port-1

All 1120 errors are from 2 different versions of this same Ingress resource.

MaikuMori changed the title from "[nginx] VTS metrics break prometheus endpoint :10254/metrics (0.9.0-beta.3)" to "[nginx] VTS metrics breaks prometheus endpoint :10254/metrics (0.9.0-beta.3)" Mar 15, 2017
@aledbf
Member

aledbf commented Mar 15, 2017

@MaikuMori please check the VTS output at :18080/nginx_status and :18080/nginx_status/format/json (if you can, please redact the upstream names and hosts and send the JSON)

@MaikuMori
Author

The status endpoint works; that's the first thing I checked. I'm sending you an email with the JSON. It's somewhat big because we have more Ingresses besides this one.

@gianrubio
Contributor

gianrubio commented Mar 15, 2017 via email

@MaikuMori
Author

I don't use a custom template.

I'm also working via email with @aledbf, and I've provided him some additional debug info.

@aledbf
Member

aledbf commented Mar 15, 2017

@gianrubio I think the issue is related to the reuse of the upstreams inside the same zone (multiple Ingresses pointing to the same service).
I will try to reproduce this using the echoheaders service.

@gianrubio
Contributor

I just reproduced the error: the controller is duplicating the server upstream. I'm looking into how to fix this.

nginx.conf

    upstream default-default-http-backend-port-1 {
        least_conn;
        server 172.17.0.8:8080 max_fails=0 fail_timeout=0;
        server 172.17.0.8:8080 max_fails=0 fail_timeout=0;
    }
    upstream default-default-http-backend-port-1-admin {
        least_conn;
        server 172.17.0.8:8080 max_fails=0 fail_timeout=0;
        server 172.17.0.8:8080 max_fails=0 fail_timeout=0;
    }

Error

$ curl 192.168.99.100:10254/metrics
An error has occurred during metrics gathering:

28 error(s) occurred:
* collected metric nginx_nginx_all_upstream_responses_total label:<name:"server" value:"172.17.0.8:8080" > label:<name:"status_code" value:"1xx" > label:<name:"upstream" value:"default-default-http-backend-port-1" > counter:<value:0 >  was collected before with the same name and label values
* collected metric nginx_nginx_all_upstream_responses_total label:<name:"server" value:"172.17.0.8:8080" > label:<name:"status_code" value:"2xx" > label:<name:"upstream" value:"default-default-http-backend-port-1" > counter:<value:0 >  was collected before with the same name and label values
* collected metric nginx_nginx_all_upstream_responses_total label:<name:"server" value:"172.17.0.8:8080" > label:<name:"status_code" value:"3xx" > label:<name:"upstream" value:"default-default-http-backend-port-1" > counter:<value:0 >  was collected before with the same name and label values

@MaikuMori
Author

I concur this is probably the cause, since we have multiple upstream servers with the same and/or different ports.

@gianrubio
Contributor

@MaikuMori just to confirm, could you share your upstream for redacted-redacted-staging-redacted?

@aledbf
Member

aledbf commented Mar 15, 2017

I just reproduced the error: the controller is duplicating the server upstream. I'm looking into how to fix this.

It's not duplicating the upstream; the ports (names) are different: https://github.com/kubernetes/ingress/blob/master/core/pkg/ingress/controller/controller.go#L739

@gianrubio
Contributor

It's not duplicating the upstream; the ports (names) are different

Sorry, it's duplicating the server:

server 172.17.0.8:8080 max_fails=0 fail_timeout=0;
server 172.17.0.8:8080 max_fails=0 fail_timeout=0;

@aledbf
Member

aledbf commented Mar 15, 2017

@gianrubio right, but that is OK. The current implementation allows a different configuration for the same service if it is used by different Ingress rules (for example, sticky sessions).
Maybe we need to preprocess the stats in order to avoid this issue?
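
Not part of the thread, but to make the "preprocess the stats" idea concrete: one way would be to merge duplicate (upstream, server) entries by summing their counters before building Prometheus metrics, so each label set is emitted only once. The upstreamServer type and dedupUpstreams helper below are hypothetical, simplified stand-ins, not the controller's real VTS structs or the fix that eventually landed via #456.

package collector

// upstreamServer is a simplified, hypothetical view of one VTS upstream
// server entry; it is not the controller's real data structure.
type upstreamServer struct {
	Upstream  string
	Server    string
	Responses map[string]float64 // "1xx" ... "5xx" -> count
}

// dedupUpstreams collapses entries that share the same upstream name and
// server address, summing their response counters, so the exporter emits a
// single sample per (upstream, server, status_code) label set.
func dedupUpstreams(servers []upstreamServer) []upstreamServer {
	type key struct{ upstream, server string }
	merged := map[key]*upstreamServer{}
	var order []key

	for _, s := range servers {
		k := key{s.Upstream, s.Server}
		if existing, ok := merged[k]; ok {
			// Duplicate server line: add the counters together instead of
			// producing a second identical label set.
			for code, v := range s.Responses {
				existing.Responses[code] += v
			}
			continue
		}
		cp := upstreamServer{
			Upstream:  s.Upstream,
			Server:    s.Server,
			Responses: map[string]float64{},
		}
		for code, v := range s.Responses {
			cp.Responses[code] = v
		}
		merged[k] = &cp
		order = append(order, k)
	}

	out := make([]upstreamServer, 0, len(order))
	for _, k := range order {
		out = append(out, *merged[k])
	}
	return out
}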

@aledbf
Member

aledbf commented Mar 16, 2017

@MaikuMori please test the image quay.io/aledbf/nginx-ingress-controller:0.78

@aledbf
Member

aledbf commented Mar 16, 2017

@MaikuMori @gianrubio this issue is related to #455

@MaikuMori
Author

Yep, this indeed fixes the problem.
