traefik hangs - stops handling requests #662

r0bj · 2016-09-02T13:46:36Z

traefik version: 1.0.2

Set open files limit to 1000000:
ulimit -n 1000000

traefik config:

traefikLogsFile = "/var/log/traefik.log"
accessLogsFile = "/var/log/traefik-access.log"
logLevel = "DEBUG"

defaultEntryPoints = ["http"]

[entryPoints]
  [entryPoints.http]
  address = ":8000"

[retry]

[web]
address = ":8080"

[file]

[backends]
  [backends.backend1]
    [backends.backend1.servers.server1]
    url = "http://localhost:80"

[frontends]
  [frontends.frontend1]
  backend = "backend1"
    [frontends.frontend1.routes.test_1]
    rule = "Host:test-nginx.example.net"

# curl -s http://localhost:8080/api | jq .
{
  "file": {
    "frontends": {
      "frontend1": {
        "priority": 0,
        "routes": {
          "test_1": {
            "rule": "Host:test-nginx.example.net"
          }
        },
        "backend": "backend1",
        "entryPoints": [
          "http"
        ]
      }
    },
    "backends": {
      "backend1": {
        "loadBalancer": {
          "method": "wrr"
        },
        "servers": {
          "server1": {
            "weight": 0,
            "url": "http://localhost:80"
          }
        }
      }
    }
  }
}

After start everything works:

# curl -svo /dev/null -H "Host:test-nginx.example.net" http://localhost:8000
* Rebuilt URL to: http://localhost:8000/
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.35.0
> Accept: */*
> Host:test-nginx.example.net
>
< HTTP/1.1 200 OK
< Content-Length: 612
< Content-Type: text/html
< Date: Fri, 02 Sep 2016 11:56:33 GMT
< Last-Modified: Tue, 04 Mar 2014 11:46:45 GMT
* Server nginx/1.4.6 (Ubuntu) is not blacklisted
< Server: nginx/1.4.6 (Ubuntu)
<
{ [data not shown]
* Connection #0 to host localhost left intact

Then lets test it with wrk:
wrk -t30 -c400 -d30s -H "Host: test-nginx.example.net" http://localhost:8000

After some time (one or few attempts) traefik is unresponsive on port 8000:

# curl -svo /dev/null -H "Host:test-nginx.example.net" http://localhost:8000
* Rebuilt URL to: http://localhost:8000/
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.35.0
> Accept: */*
> Host:test-nginx.example.net

So traffic is no longer processed.

Health API is also unresponsive:

# curl -svo /dev/null http://localhost:8080/health
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /health HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8080
> Accept: */*

Whats interesting dashboard is responsive (but without data):

# curl -svo /dev/null http://localhost:8080/dashboard/
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /dashboard/ HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8080
> Accept: */*
>
< HTTP/1.1 200 OK
< Accept-Ranges: bytes
< Content-Length: 1388
< Content-Type: text/html; charset=utf-8
< Last-Modified: Tue, 02 Aug 2016 17:29:47 GMT
< Date: Fri, 02 Sep 2016 12:03:29 GMT
<
{ [data not shown]
* Connection #0 to host localhost left intact

Traefik access logs just stop writing:

127.0.0.1 - - [02/Sep/2016:17:01:44 +0200] "GET / HTTP/1.1" 200 612 "" "" 23814 "frontend1" "http://localhost:80" 218.853406ms
127.0.0.1 - - [02/Sep/2016:17:01:44 +0200] "GET / HTTP/1.1" 200 612 "" "" 23813 "frontend1" "http://localhost:80" 219.370314ms
127.0.0.1 - - [02/Sep/2016:17:01:44 +0200] "GET / HTTP/1.1" 200 612 "" "" 23812 "frontend1" "http://localhost:80" 219.933964ms

Logs (severity DEBUG) shows in log file all the time even for requests without reply:

time="2016-09-02T17:02:20+02:00" level=debug msg="Round trip: http://localhost:80, code: 200, duration: 9.730448ms"
time="2016-09-02T17:02:20+02:00" level=debug msg="Round trip: http://localhost:80, code: 200, duration: 10.423115ms"
time="2016-09-02T17:02:20+02:00" level=debug msg="Round trip: http://localhost:80, code: 200, duration: 10.664671ms"

strace during attempt to send request to traefik (curl -svo /dev/null -H "Host:test-nginx.example.net" http://localhost:8000):
https://gist.github.com/r0bj/b618c74b1bc0db5c11f78db08c34fc15

So it seems that request hits backend but response isn't sent to original sender.

There are many connections in CLOSE_WAIT status:
https://gist.github.com/r0bj/c647c76fe65a562ffd2e024e11a260cd

Restart treafik daemon fixes this issue.

It's easy to replicate this issue:

start vagrant host with ubuntu from official ubuntu/trusty64 image with default settings
ulimit -n 1000000
install nginx
start traefik with above config
hit it with wrk benchmark one or more times

One can also replicate this issue with sending wrk requests to non existing backend (resulting 404):
wrk -t30 -c400 -d30 http://localhost:8000/

The text was updated successfully, but these errors were encountered:

emilevauge · 2016-09-14T09:59:52Z

@r0bj for high performance, I recommend to set a higher value for MaxIdleConnsPerHost:

# If non-zero, controls the maximum idle (keep-alive) to keep per-host.  If zero, DefaultMaxIdleConnsPerHost is used.
# If you encounter 'too many open files' errors, you can either change this value, or change `ulimit` value.
#
# Optional
# Default: http.DefaultMaxIdleConnsPerHost
#
# MaxIdleConnsPerHost = 200

Can you try with this change (for example 1000)?

r0bj · 2016-09-14T10:25:47Z

Changing MaxIdleConnsPerHost value to 1000 didn't fix the issue. Daemon stop handling traffic.

r0bj · 2016-09-14T10:34:15Z

Just to be clear, traefik stays in freeze state even after wrk benchmark is over.

emilevauge · 2016-09-14T10:36:12Z

Hum, this is really weird... There must be a regression here. I will investigate on this.

emilevauge · 2016-09-14T11:46:02Z

I cannot reproduce this issue. I used your toml file and launched several times wrk -t30 -c400 -d30 -H "Host:test-nginx.example.net" http://localhost:8000/ but I still don't have any error... But Iused traefik master version. I will try with v.1.0.2 later. Maybe you could try disabling retry in the toml file?

r0bj · 2016-09-14T13:09:11Z

I've disabled retry but result was the same.

emilevauge · 2016-09-15T11:08:49Z

@r0bj I still cannot reproduce this issue on my laptop, even using v1.0.2 😕 Traefik continues to respond after benchmarks. Which architecture are you using? amd64? Are you on bare metal or on a VM?
@containous/traefik can anyone try to reproduce this at home?

r0bj · 2016-09-15T11:22:37Z

I hit this issue on bare metal amd64 host. Then I try to reproduce this on vagrant/virtual box VM (vanilla ubuntu/trusty64) with success.

abhishekamralkar · 2016-09-16T08:52:39Z

I am having the same issue, I end up writing a script which check Traefik log files and if the log file is older than 3 min it reloads Traefik.

emilevauge · 2016-09-16T10:02:02Z

@abhishekamralkar Which version are you using? Is it something new? As I still cannot reproduce this behavior, I need as much information as possible to investigate :)

SvenAbels · 2016-09-19T21:14:39Z

I can confirm this with the latest release of traefik using
wrk -t40 -c400 -d30 http://...

using e.g.
wrk -t10 -c40 -d30 http://...
works still fine.

I was able to reproduce this on 2 physical machines with Ubuntu 16.04 running Traefik in Docker 1.12.1

As soon as the error orrurs Traefik stops reacting - even after wrk is finised. In the logs of traefik the following appears:

...
time="2016-09-19T20:50:14Z" level=warning msg="Error forwarding to http://httpd:80, err: EOF"
time="2016-09-19T20:50:14Z" level=warning msg="Error forwarding to http://httpd:80, err: EOF"
time="2016-09-19T20:50:14Z" level=warning msg="Error forwarding to http://httpd:80, err: EOF"
time="2016-09-19T20:50:14Z" level=warning msg="Error forwarding to http://httpd:80, err: EOF"
time="2016-09-19T20:50:14Z" level=warning msg="Error forwarding to http://httpd:80, err: EOF"

http://httpd:80 is the backend that I have configured. Despite the error message, the backend is still there and reacts fine if I reach it manually:

# curl httpd
<html><body><h1>It works!</h1></body></html>

When I shutdown the traefik container I see:

time="2016-09-19T20:51:06Z" level=info msg="I have to go... terminated"
time="2016-09-19T20:51:06Z" level=info msg="Stopping server"

When I restart traefik everything workd fine again (I don't need to restart the httpd backend server)

Do you need any additional information?

abhishekamralkar · 2016-09-20T17:38:31Z

traefik version
Version: v1.0.2
Codename: reblochon
Go version: go1.6.2
Built: 2016-08-02_05:29:50PM
OS/Arch: linux/amd64

This is the version I am using. Not sure whats going on but it just hanged and nothing works

emilevauge · 2016-09-21T15:53:30Z

I think I found the issue. This seems due to a race condition in https://github.com/thoas/stats. It produces when accessing to the /health endpoint of traefik's web ui, and at the same time, making requests to traefik reverse proxy.

Could you confirm that you are accessing /health during your tests (with a healthcheck or if you have the web ui opened) ?

A workaround is to avoid accessing webui during tests and change your healthcheck to /api endpoint.

I'm investigating if the issue is still present in the master branch.

SvenAbels · 2016-09-21T16:01:09Z

@emilevauge Yes, indeed that has been the case for me. I had the health UI check opened during the tests all the time.

r0bj · 2016-09-21T16:09:13Z

@emilevauge Yes, It seems that that was the case. I was accessing /health during the wrk test - this was actually a way of determinig if traefik is still working. And also I pointed marathon healtcheck to traefik /health.

ryanleary · 2016-09-21T18:59:02Z

We are experiencing similar behavior. We see this despite our healthcheck using the /api endpoint.

emilevauge · 2016-09-21T19:05:46Z

@ryanleary then could you give as much details as possible?

SvenAbels · 2016-09-21T21:54:21Z

@emilevauge I did some load tests and I can confirm that the health UI was indeed the problem - at least in my case. I can reproduce the problem many times when the health UI is opened but when it's closed, I cannot reproduce the problem anymore - no matter what I try.

SvenAbels · 2016-09-22T13:51:23Z

@emilevauge Thanks a lot. Do you know when the next traefik release will be out, which has this fix included?

emilevauge · 2016-09-22T13:53:29Z

Already released: https://github.com/containous/traefik/releases/tag/v1.0.3 :)
Just waiting for docker to merge: docker-library/official-images#2169.
You can use containous/traefik:v1.0.3 in the meantime.

SvenAbels · 2016-09-22T14:27:34Z

Perfect, thanks a lot. :-)

r0bj · 2016-09-22T14:29:41Z

Thanks. I can confirm that this issue is no longer valid with v1.0.3.

errm added the investigation-needed label Sep 12, 2016

emilevauge added the bug label Sep 21, 2016

emilevauge mentioned this issue Sep 22, 2016

Fix health race #693

Merged

emilevauge closed this as completed Sep 22, 2016

ldez added the kind/bug/confirmed a confirmed bug (reproducible). label Apr 29, 2017

ldez removed the investigation-needed label May 12, 2017

traefik locked and limited conversation to collaborators Sep 1, 2019

traefiker added the status/5-frozen-due-to-age label Sep 1, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

traefik hangs - stops handling requests #662

traefik hangs - stops handling requests #662

r0bj commented Sep 2, 2016 •

edited

Loading

emilevauge commented Sep 14, 2016 •

edited

Loading

r0bj commented Sep 14, 2016

r0bj commented Sep 14, 2016

emilevauge commented Sep 14, 2016

emilevauge commented Sep 14, 2016

r0bj commented Sep 14, 2016

emilevauge commented Sep 15, 2016

r0bj commented Sep 15, 2016 •

edited

Loading

abhishekamralkar commented Sep 16, 2016

emilevauge commented Sep 16, 2016

SvenAbels commented Sep 19, 2016 •

edited

Loading

abhishekamralkar commented Sep 20, 2016 •

edited

Loading

emilevauge commented Sep 21, 2016 •

edited

Loading

SvenAbels commented Sep 21, 2016

r0bj commented Sep 21, 2016

ryanleary commented Sep 21, 2016

emilevauge commented Sep 21, 2016

SvenAbels commented Sep 21, 2016

SvenAbels commented Sep 22, 2016

emilevauge commented Sep 22, 2016

SvenAbels commented Sep 22, 2016

r0bj commented Sep 22, 2016

traefik hangs - stops handling requests #662

traefik hangs - stops handling requests #662

Comments

r0bj commented Sep 2, 2016 • edited Loading

emilevauge commented Sep 14, 2016 • edited Loading

r0bj commented Sep 14, 2016

r0bj commented Sep 14, 2016

emilevauge commented Sep 14, 2016

emilevauge commented Sep 14, 2016

r0bj commented Sep 14, 2016

emilevauge commented Sep 15, 2016

r0bj commented Sep 15, 2016 • edited Loading

abhishekamralkar commented Sep 16, 2016

emilevauge commented Sep 16, 2016

SvenAbels commented Sep 19, 2016 • edited Loading

abhishekamralkar commented Sep 20, 2016 • edited Loading

emilevauge commented Sep 21, 2016 • edited Loading

SvenAbels commented Sep 21, 2016

r0bj commented Sep 21, 2016

ryanleary commented Sep 21, 2016

emilevauge commented Sep 21, 2016

SvenAbels commented Sep 21, 2016

SvenAbels commented Sep 22, 2016

emilevauge commented Sep 22, 2016

SvenAbels commented Sep 22, 2016

r0bj commented Sep 22, 2016

r0bj commented Sep 2, 2016 •

edited

Loading

emilevauge commented Sep 14, 2016 •

edited

Loading

r0bj commented Sep 15, 2016 •

edited

Loading

SvenAbels commented Sep 19, 2016 •

edited

Loading

abhishekamralkar commented Sep 20, 2016 •

edited

Loading

emilevauge commented Sep 21, 2016 •

edited

Loading