Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

traefik hangs - stops handling requests #662

Closed
r0bj opened this issue Sep 2, 2016 · 22 comments
Closed

traefik hangs - stops handling requests #662

r0bj opened this issue Sep 2, 2016 · 22 comments
Labels

Comments

@r0bj
Copy link

r0bj commented Sep 2, 2016

traefik version: 1.0.2

Set open files limit to 1000000:
ulimit -n 1000000

traefik config:

traefikLogsFile = "/var/log/traefik.log"
accessLogsFile = "/var/log/traefik-access.log"
logLevel = "DEBUG"

defaultEntryPoints = ["http"]

[entryPoints]
  [entryPoints.http]
  address = ":8000"

[retry]

[web]
address = ":8080"

[file]

[backends]
  [backends.backend1]
    [backends.backend1.servers.server1]
    url = "http://localhost:80"

[frontends]
  [frontends.frontend1]
  backend = "backend1"
    [frontends.frontend1.routes.test_1]
    rule = "Host:test-nginx.example.net"
# curl -s http://localhost:8080/api | jq .
{
  "file": {
    "frontends": {
      "frontend1": {
        "priority": 0,
        "routes": {
          "test_1": {
            "rule": "Host:test-nginx.example.net"
          }
        },
        "backend": "backend1",
        "entryPoints": [
          "http"
        ]
      }
    },
    "backends": {
      "backend1": {
        "loadBalancer": {
          "method": "wrr"
        },
        "servers": {
          "server1": {
            "weight": 0,
            "url": "http://localhost:80"
          }
        }
      }
    }
  }
}

After start everything works:

# curl -svo /dev/null -H "Host:test-nginx.example.net" http://localhost:8000
* Rebuilt URL to: http://localhost:8000/
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.35.0
> Accept: */*
> Host:test-nginx.example.net
>
< HTTP/1.1 200 OK
< Content-Length: 612
< Content-Type: text/html
< Date: Fri, 02 Sep 2016 11:56:33 GMT
< Last-Modified: Tue, 04 Mar 2014 11:46:45 GMT
* Server nginx/1.4.6 (Ubuntu) is not blacklisted
< Server: nginx/1.4.6 (Ubuntu)
<
{ [data not shown]
* Connection #0 to host localhost left intact

Then lets test it with wrk:
wrk -t30 -c400 -d30s -H "Host: test-nginx.example.net" http://localhost:8000

After some time (one or few attempts) traefik is unresponsive on port 8000:

# curl -svo /dev/null -H "Host:test-nginx.example.net" http://localhost:8000
* Rebuilt URL to: http://localhost:8000/
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.35.0
> Accept: */*
> Host:test-nginx.example.net

So traffic is no longer processed.

Health API is also unresponsive:

# curl -svo /dev/null http://localhost:8080/health
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /health HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8080
> Accept: */*

Whats interesting dashboard is responsive (but without data):

# curl -svo /dev/null http://localhost:8080/dashboard/
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /dashboard/ HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8080
> Accept: */*
>
< HTTP/1.1 200 OK
< Accept-Ranges: bytes
< Content-Length: 1388
< Content-Type: text/html; charset=utf-8
< Last-Modified: Tue, 02 Aug 2016 17:29:47 GMT
< Date: Fri, 02 Sep 2016 12:03:29 GMT
<
{ [data not shown]
* Connection #0 to host localhost left intact

screen shot 2016-09-02 at 16 04 59

Traefik access logs just stop writing:

127.0.0.1 - - [02/Sep/2016:17:01:44 +0200] "GET / HTTP/1.1" 200 612 "" "" 23814 "frontend1" "http://localhost:80" 218.853406ms
127.0.0.1 - - [02/Sep/2016:17:01:44 +0200] "GET / HTTP/1.1" 200 612 "" "" 23813 "frontend1" "http://localhost:80" 219.370314ms
127.0.0.1 - - [02/Sep/2016:17:01:44 +0200] "GET / HTTP/1.1" 200 612 "" "" 23812 "frontend1" "http://localhost:80" 219.933964ms

Logs (severity DEBUG) shows in log file all the time even for requests without reply:

time="2016-09-02T17:02:20+02:00" level=debug msg="Round trip: http://localhost:80, code: 200, duration: 9.730448ms"
time="2016-09-02T17:02:20+02:00" level=debug msg="Round trip: http://localhost:80, code: 200, duration: 10.423115ms"
time="2016-09-02T17:02:20+02:00" level=debug msg="Round trip: http://localhost:80, code: 200, duration: 10.664671ms"

strace during attempt to send request to traefik (curl -svo /dev/null -H "Host:test-nginx.example.net" http://localhost:8000):
https://gist.github.com/r0bj/b618c74b1bc0db5c11f78db08c34fc15

So it seems that request hits backend but response isn't sent to original sender.

There are many connections in CLOSE_WAIT status:
https://gist.github.com/r0bj/c647c76fe65a562ffd2e024e11a260cd

Restart treafik daemon fixes this issue.

It's easy to replicate this issue:

  1. start vagrant host with ubuntu from official ubuntu/trusty64 image with default settings
  2. ulimit -n 1000000
  3. install nginx
  4. start traefik with above config
  5. hit it with wrk benchmark one or more times

One can also replicate this issue with sending wrk requests to non existing backend (resulting 404):
wrk -t30 -c400 -d30 http://localhost:8000/

@emilevauge
Copy link
Member

emilevauge commented Sep 14, 2016

@r0bj for high performance, I recommend to set a higher value for MaxIdleConnsPerHost:

# If non-zero, controls the maximum idle (keep-alive) to keep per-host.  If zero, DefaultMaxIdleConnsPerHost is used.
# If you encounter 'too many open files' errors, you can either change this value, or change `ulimit` value.
#
# Optional
# Default: http.DefaultMaxIdleConnsPerHost
#
# MaxIdleConnsPerHost = 200

Can you try with this change (for example 1000)?

@r0bj
Copy link
Author

r0bj commented Sep 14, 2016

Changing MaxIdleConnsPerHost value to 1000 didn't fix the issue. Daemon stop handling traffic.

@r0bj
Copy link
Author

r0bj commented Sep 14, 2016

Just to be clear, traefik stays in freeze state even after wrk benchmark is over.

@emilevauge
Copy link
Member

Hum, this is really weird... There must be a regression here. I will investigate on this.

@emilevauge
Copy link
Member

I cannot reproduce this issue. I used your toml file and launched several times wrk -t30 -c400 -d30 -H "Host:test-nginx.example.net" http://localhost:8000/ but I still don't have any error... But Iused traefik master version. I will try with v.1.0.2 later. Maybe you could try disabling retry in the toml file?

@r0bj
Copy link
Author

r0bj commented Sep 14, 2016

I've disabled retry but result was the same.

@emilevauge
Copy link
Member

@r0bj I still cannot reproduce this issue on my laptop, even using v1.0.2 😕 Traefik continues to respond after benchmarks. Which architecture are you using? amd64? Are you on bare metal or on a VM?
@containous/traefik can anyone try to reproduce this at home?

@r0bj
Copy link
Author

r0bj commented Sep 15, 2016

I hit this issue on bare metal amd64 host. Then I try to reproduce this on vagrant/virtual box VM (vanilla ubuntu/trusty64) with success.

@abhishekamralkar
Copy link

I am having the same issue, I end up writing a script which check Traefik log files and if the log file is older than 3 min it reloads Traefik.

@emilevauge
Copy link
Member

@abhishekamralkar Which version are you using? Is it something new? As I still cannot reproduce this behavior, I need as much information as possible to investigate :)

@SvenAbels
Copy link

SvenAbels commented Sep 19, 2016

I can confirm this with the latest release of traefik using
wrk -t40 -c400 -d30 http://...

using e.g.
wrk -t10 -c40 -d30 http://...
works still fine.

I was able to reproduce this on 2 physical machines with Ubuntu 16.04 running Traefik in Docker 1.12.1

As soon as the error orrurs Traefik stops reacting - even after wrk is finised. In the logs of traefik the following appears:

...
time="2016-09-19T20:50:14Z" level=warning msg="Error forwarding to http://httpd:80, err: EOF"
time="2016-09-19T20:50:14Z" level=warning msg="Error forwarding to http://httpd:80, err: EOF"
time="2016-09-19T20:50:14Z" level=warning msg="Error forwarding to http://httpd:80, err: EOF"
time="2016-09-19T20:50:14Z" level=warning msg="Error forwarding to http://httpd:80, err: EOF"
time="2016-09-19T20:50:14Z" level=warning msg="Error forwarding to http://httpd:80, err: EOF"

http://httpd:80 is the backend that I have configured. Despite the error message, the backend is still there and reacts fine if I reach it manually:

# curl httpd
<html><body><h1>It works!</h1></body></html>

When I shutdown the traefik container I see:

time="2016-09-19T20:51:06Z" level=info msg="I have to go... terminated"
time="2016-09-19T20:51:06Z" level=info msg="Stopping server"

When I restart traefik everything workd fine again (I don't need to restart the httpd backend server)

Do you need any additional information?

@abhishekamralkar
Copy link

abhishekamralkar commented Sep 20, 2016

traefik version
Version: v1.0.2
Codename: reblochon
Go version: go1.6.2
Built: 2016-08-02_05:29:50PM
OS/Arch: linux/amd64

This is the version I am using. Not sure whats going on but it just hanged and nothing works

@emilevauge
Copy link
Member

emilevauge commented Sep 21, 2016

I think I found the issue. This seems due to a race condition in https://github.com/thoas/stats. It produces when accessing to the /health endpoint of traefik's web ui, and at the same time, making requests to traefik reverse proxy.

Could you confirm that you are accessing /health during your tests (with a healthcheck or if you have the web ui opened) ?

A workaround is to avoid accessing webui during tests and change your healthcheck to /api endpoint.

I'm investigating if the issue is still present in the master branch.

@emilevauge emilevauge added the bug label Sep 21, 2016
@SvenAbels
Copy link

@emilevauge Yes, indeed that has been the case for me. I had the health UI check opened during the tests all the time.

@r0bj
Copy link
Author

r0bj commented Sep 21, 2016

@emilevauge Yes, It seems that that was the case. I was accessing /health during the wrk test - this was actually a way of determinig if traefik is still working. And also I pointed marathon healtcheck to traefik /health.

@ryanleary
Copy link
Contributor

We are experiencing similar behavior. We see this despite our healthcheck using the /api endpoint.

@emilevauge
Copy link
Member

@ryanleary then could you give as much details as possible?

@SvenAbels
Copy link

@emilevauge I did some load tests and I can confirm that the health UI was indeed the problem - at least in my case. I can reproduce the problem many times when the health UI is opened but when it's closed, I cannot reproduce the problem anymore - no matter what I try.

@SvenAbels
Copy link

@emilevauge Thanks a lot. Do you know when the next traefik release will be out, which has this fix included?

@emilevauge
Copy link
Member

Already released: https://github.com/containous/traefik/releases/tag/v1.0.3 :)
Just waiting for docker to merge: docker-library/official-images#2169.
You can use containous/traefik:v1.0.3 in the meantime.

@SvenAbels
Copy link

Perfect, thanks a lot. :-)

@r0bj
Copy link
Author

r0bj commented Sep 22, 2016

Thanks. I can confirm that this issue is no longer valid with v1.0.3.

@ldez ldez added the kind/bug/confirmed a confirmed bug (reproducible). label Apr 29, 2017
@traefik traefik locked and limited conversation to collaborators Sep 1, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

8 participants