Add long jobs in exponential backoff providers #626

yvespp · 2016-08-18T11:46:34Z

Traefik dies if it can't reach the Kubernetes API server for some reason (network outage, API server down). This is very unfortunate because it causes an outage even though the backends in the cluster are still up and reachable.

It would be nice if Traefik could be more resilient and not depend on the API server being available all the time.

Tested with 1.0.0 and 1.0.1.

Log:

time="2016-08-18T07:18:56+02:00" level=debug msg="Skipping event from kubernetes map[type:MODIFIED object:map[kind:Endpoints apiVersion:v1 metadata:map[name:vvn-baustein namespace:poz-uat selfLink:/api/v1/namespaces/poz-uat/endpoints/vvn-baustein uid:2f76c98c-483b-11e6-a5a8-005056b207ba resourceVersion:14087443 creationTimestamp:2016-07-12T14:16:11Z labels:map[appId:252696]] subsets:[map[notReadyAddresses:[map[ip:172.30.192.25 targetRef:map[name:aps-vvn-baustein-126968-uulp8 uid:2fa62cf5-604d-11e6-8cfc-005056b207ba resourceVersion:14087340 kind:Pod namespace:poz-uat]] map[ip:172.31.32.16 targetRef:map[namespace:poz-uat name:aps-vvn-baustein-126970-58j3o uid:861e2fe9-5f8f-11e6-8cfc-005056b207ba resourceVersion:14087442 kind:Pod]]] ports:[map[protocol:TCP name:http port:8080]]]]]]"
time="2016-08-18T07:19:51+02:00" level=error msg="Error watching kubernetes events: failed to create watch: failed to do version request: GET \"https://172.23.0.1:443/apis/extensions/v1beta1/ingresses\" : failed to create request: GET \"https://172.23.0.1:443/apis/extensions/v1beta1/ingresses\" : [Get https://172.23.0.1:443/apis/extensions/v1beta1/ingresses: dial tcp 172.23.0.1:443: getsockopt: connection refused]"
time="2016-08-18T07:19:52+02:00" level=fatal msg="Cannot connect to Kubernetes server failed to create watch: failed to do version request: GET \"https://172.23.0.1:443/apis/extensions/v1beta1/ingresses\" : failed to create request: GET \"https://172.23.0.1:443/apis/extensions/v1beta1/ingresses\" : [Get https://172.23.0.1:443/apis/extensions/v1beta1/ingresses: dial tcp 172.23.0.1:443: getsockopt: connection refused]"


emilevauge · 2016-08-18T11:59:45Z

@yvespp I may know where it comes from. Can you confirm that you got a lot of "Error watching kubernetes events" logs before it died?

yvespp · 2016-08-18T14:00:43Z

No, I only see what I posted above.
The log.Fatalf in kubernetes.go#L157 causes Traefik to exit (see logger.go#L160). So it's by design.

emilevauge · 2016-08-18T14:04:21Z

@yvespp indeed, it should not be a Fatalf, but it should only end up there after having retried multiple times using exponential backoff. Can you give us all your logs?

yvespp · 2016-08-18T14:46:59Z

Ah, ok. Here is the log: https://gist.githubusercontent.com/yvespp/461d4fed9decc697f6c5e502d63b8042/raw/ddb85d4c274d5c68902b08c4d7b7b27b5a878421/traefik.log
It only covers the last few hours; I'll have to see if I can get the whole log...

Here is the config:

    logLevel = "INFO"

    defaultEntryPoints = ["http", "https"]
    accessLogsFile = "/proc/1/fd/1"

    [entryPoints]
      [entryPoints.http]
      address = ":80"
        [entryPoints.http.redirect]
          entryPoint = "https"
      [entryPoints.https]
      address = ":443"
        [entryPoints.https.tls]
          [[entryPoints.https.tls.certificates]]
          CertFile = "/etc/ssl/tls.crt"
          KeyFile = "/etc/ssl/tls.key"

    [web]
    address = ":8080"
    ReadOnly = true

    [kubernetes]

yvespp · 2016-08-18T16:14:09Z

Here is the whole log from a test run. I stopped the API Server at 18:05:58 and Traefik died: https://gist.githubusercontent.com/yvespp/3b4278c7c99e2e47711659a50bd26ff4/raw/da3e0798a3e181d60ad6b3ce84e3b4e0c395a89e/traefik.log2

yvespp · 2016-08-18T16:24:39Z

I noticed that a freshly started instance of Traefik doesn't die when I stop the API Server: https://gist.githubusercontent.com/yvespp/9fb879812f1b815574975c8ba71b5982/raw/919087d47ab99b787d4f2e0e2412eb21ca7db648/traefik.log3

In this log you can see Traefik surviving several API Server restarts. I then waited an hour and restarted the API Server again, and Traefik died immediately: https://gist.githubusercontent.com/yvespp/3e7aac2990d1124a0bd631326eaa2ff7/raw/a10c28cd0639434ab9b30dec84a7583525d6dc64/traefik.log4
There were two instances (Pods) of Traefik running and both died at the same time.

emilevauge · 2016-08-18T18:14:37Z

OK, thanks. To be clear, if you restart Traefik, it works again, right?

yvespp · 2016-08-18T18:16:28Z

Yes, if the API Server is up again.
As long as the API Server is down Kubernetes can't restart the Pod.

emilevauge added the investigation-needed label Aug 18, 2016

emilevauge self-assigned this Aug 18, 2016

emilevauge changed the title ~~Kubernets: Traefik dies when it can't connect to the api server~~ Add long jobs in exponential backoff providers Aug 18, 2016

emilevauge added bug and removed investigation-needed labels Aug 18, 2016

emilevauge mentioned this issue Aug 19, 2016

Add long job exponential backoff #627

Merged

vdemeester closed this as completed in #627 Aug 19, 2016

This was referenced Aug 19, 2016

Add JobBackOff cenkalti/backoff#27

Closed

Migrate to JobBackOff #628

Merged

ldez added the kind/bug/confirmed label Apr 29, 2017

traefik locked and limited conversation to collaborators Sep 1, 2019

traefiker added the status/5-frozen-due-to-age label Sep 1, 2019
