packetbeat stops logging data with too many open files for pids #226

Closed
portante opened this issue Sep 4, 2015 · 23 comments

@portante commented Sep 4, 2015

I saw this on my install of packetbeat, "Packetbeat version 1.0.0-beta2 (amd64)" (the amd64 is wrong; it is an x86_64 CPU):

[root@perf34 ~]# systemctl status packetbeat
packetbeat.service - LSB: Packetbeat agent
   Loaded: loaded (/etc/rc.d/init.d/packetbeat)
   Active: active (running) since Thu 2015-08-06 11:04:29 UTC; 4 weeks 0 days ago
  Process: 53011 ExecStart=/etc/rc.d/init.d/packetbeat start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/packetbeat.service
           ├─53023 packetbeat-god -r / -n -p /var/run/packetbeat.pid -- /usr/bin/packetbeat -c /etc/packetbeat/packetbeat.yml
           └─53024 /usr/bin/packetbeat -c /etc/packetbeat/packetbeat.yml

Sep 03 10:55:37 perf34 /usr/bin/packetbeat[53024]: procs.go:288: FindSocketsOfPid: Open: open /proc/2088/fd: too many open files
Sep 03 10:55:37 perf34 /usr/bin/packetbeat[53024]: procs.go:288: FindSocketsOfPid: Open: open /proc/2156/fd: too many open files
Sep 03 10:55:37 perf34 /usr/bin/packetbeat[53024]: procs.go:288: FindSocketsOfPid: Open: open /proc/2344/fd: too many open files
Sep 03 10:55:37 perf34 /usr/bin/packetbeat[53024]: procs.go:288: FindSocketsOfPid: Open: open /proc/2345/fd: too many open files
Sep 03 10:55:37 perf34 /usr/bin/packetbeat[53024]: procs.go:288: FindSocketsOfPid: Open: open /proc/2346/fd: too many open files
Sep 03 10:55:37 perf34 /usr/bin/packetbeat[53024]: procs.go:288: FindSocketsOfPid: Open: open /proc/2347/fd: too many open files
Sep 03 10:55:37 perf34 /usr/bin/packetbeat[53024]: procs.go:288: FindSocketsOfPid: Open: open /proc/2348/fd: too many open files
Sep 03 10:55:37 perf34 /usr/bin/packetbeat[53024]: procs.go:153: Error finding PID files for httpd: Open /proc: open /proc: too many open files
Sep 03 10:55:37 perf34 /usr/bin/packetbeat[53024]: procs.go:153: Error finding PID files for app: Open /proc: open /proc: too many open files
Sep 03 10:55:37 perf34 /usr/bin/packetbeat[53024]: procs.go:153: Error finding PID files for glusterfs: Open /proc: open /proc: too many open files
@urso commented Sep 4, 2015

I see a missing close in FindSocketsOfPid.

Which publisher have you configured? Can you check with lsof whether your packetbeat instance holds a large number of file descriptors open on /proc?
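
For illustration (an editorial sketch, not the actual packetbeat source): enumerating /proc/&lt;pid&gt;/fd leaks one directory handle per call if the os.File returned by os.Open is never closed, which would eventually produce exactly the "too many open files" errors shown above. The helper below is hypothetical and only mirrors the FindSocketsOfPid name from the log messages.

```go
package main

import (
	"fmt"
	"os"
)

// findSocketsOfPid is a hypothetical sketch of enumerating the file
// descriptors of a process via /proc/<pid>/fd. If the directory handle
// returned by os.Open is never closed, every call leaks one descriptor
// and the process eventually hits "too many open files".
func findSocketsOfPid(pid int) ([]string, error) {
	dirname := fmt.Sprintf("/proc/%d/fd", pid)
	dir, err := os.Open(dirname)
	if err != nil {
		return nil, err
	}
	defer dir.Close() // the fix: release the handle on every return path

	names, err := dir.Readdirnames(0)
	if err != nil {
		return nil, err
	}
	return names, nil
}

func main() {
	names, err := findSocketsOfPid(os.Getpid())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("open fds:", len(names))
}
```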

@portante (Author) commented Sep 4, 2015

@urso, checking... I just restarted it about 25 minutes ago and I don't see a lot of open files on /proc. See the YAML config below; I am not sure which publisher I am using. This is a RHEL 7.1 box.

[root@perf34 ~]# lsof -p 8272
COMMAND    PID USER   FD      TYPE             DEVICE SIZE/OFF      NODE NAME
packetbea 8272 root  cwd       DIR               8,33     4096         2 /
packetbea 8272 root  rtd       DIR               8,33     4096         2 /
packetbea 8272 root  txt       REG               8,33  7819784   1716812 /usr/bin/packetbeat
packetbea 8272 root  mem       REG                0,6          285184391 socket:[285184391] (stat: No such file or directory)
packetbea 8272 root  mem       REG               8,33  2107760   1706259 /usr/lib64/libc-2.17.so
packetbea 8272 root  mem       REG               8,33   141616   1709732 /usr/lib64/libpthread-2.17.so
packetbea 8272 root  mem       REG               8,33   266432   1720747 /usr/lib64/libpcap.so.1.5.3
packetbea 8272 root  mem       REG               8,33   164336   1704521 /usr/lib64/ld-2.17.so
packetbea 8272 root    0u     unix 0xffff88125adb6cc0      0t0 285178300 socket
packetbea 8272 root    1w     FIFO                0,8      0t0 285183233 pipe
packetbea 8272 root    2w     FIFO                0,8      0t0 285183233 pipe
packetbea 8272 root    3u  a_inode                0,9        0      6287 [eventpoll]
packetbea 8272 root    4w     FIFO                0,8      0t0 285183233 pipe
packetbea 8272 root    5u     unix 0xffff88125adb0000      0t0 285178301 socket
packetbea 8272 root    6u     unix 0xffff88125adb4b00      0t0 285178302 socket
packetbea 8272 root    7u     unix 0xffff88125adb30c0      0t0 285178303 socket
packetbea 8272 root    8u     unix 0xffff88125adb5280      0t0 285178304 socket
packetbea 8272 root    9u     unix 0xffff88125adb5dc0      0t0 285178305 socket
packetbea 8272 root   10u     unix 0xffff88125adb4740      0t0 285178306 socket
packetbea 8272 root   11u     unix 0xffff88125adb7080      0t0 285178307 socket
packetbea 8272 root   12u     IPv4          285178308      0t0       TCP perf34:63103->172.18.40.3:wap-wsp (ESTABLISHED)
packetbea 8272 root   13u     IPv4          285071110      0t0       TCP perf34:33352->172.18.40.4:wap-wsp (ESTABLISHED)
packetbea 8272 root   14u     IPv4          285173205      0t0       TCP perf34:64389->172.18.40.5:wap-wsp (ESTABLISHED)
packetbea 8272 root   15u     IPv4          285162055      0t0       TCP perf34:65457->172.18.40.6:wap-wsp (ESTABLISHED)
packetbea 8272 root   16u     pack          285184391      0t0       ALL type=SOCK_DGRAM
packetbea 8272 root   17r      DIR                0,3        0         1 /proc
packetbea 8272 root   18r      DIR                0,3        0         1 /proc
packetbea 8272 root   19r      DIR                0,3        0         1 /proc
packetbea 8272 root   20u     IPv4          285119695      0t0       TCP perf34:63229->172.18.40.7:wap-wsp (ESTABLISHED)
packetbea 8272 root   21u     IPv4          285159709      0t0       TCP perf34:50465->172.18.40.8:wap-wsp (ESTABLISHED)
packetbea 8272 root   34u     IPv4          285183245      0t0       TCP perf34:63113->172.18.40.3:wap-wsp (ESTABLISHED)
packetbea 8272 root   40r      DIR                0,3        0         1 /proc

Here is the .yml file without comments:

shipper:
 name:
 tags: ["vos-apache"]
interfaces:
 device: any
protocols:
  http:
    ports: [80, 8080, 8081]
  mysql:
    ports: [3306]
  pgsql:
    ports: [5432]
  redis:
    ports: [6379]
  thrift:
    ports: [9090]
  mongodb:
    ports: [27017]
output:
  elasticsearch:
    enabled: true
    hosts: ["172.18.40.3:9200","172.18.40.4:9200","172.18.40.5:9200","172.18.40.6:9200","172.18.40.7:9200","172.18.40.8:9200"]
    save_topology: true
procs:
  enabled: true
  monitored:
    - process: httpd
      cmdline_grep: httpd
    - process: app
      cmdline_grep: gunicorn
    - process: glusterfs
      cmdline_grep: glusterfs

@portante (Author) commented Sep 4, 2015

BTW, I am using a 6-node ES cluster, but I see 7 connections; the first node has two connections to it.

@portante (Author) commented Sep 4, 2015

@urso, on one of my other nodes I see close to 100 open FDs to the ES instances. We have had to reboot some of the ES nodes, so perhaps the connections are not being closed down properly?

@urso commented Sep 4, 2015

For sending requests we use the golang net/http package. The publisher tries to put data round-robin onto the configured nodes. If one node becomes unavailable, round-robin load balancing continues on the remaining nodes.

I am not sure whether existing open sockets are reused or a new connection is made when one insert takes too long.

If one packetbeat has 100+ parallel connections, it might be an indicator that packetbeat is producing more data than ES can consume right now.

This makes me wonder about the overall balance in your system. Are these 100 open FDs mostly to ES? Are they roughly evenly distributed among your ES instances, or is one getting slow?

Can you also check your ES instances for the number of open sockets, memory, and CPU usage?

Have you considered this kind of setup: https://www.elastic.co/guide/en/beats/packetbeat/current/packetbeat-logstash.html?
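
(Editorial aside, not part of the original thread: a rough Go sketch of the connection-reuse point above. A shared net/http transport only reuses idle connections; it does not by itself cap the number of open connections, so slow responses plus more in-flight requests mean more dials. The hosts below are placeholders taken from the config above.)

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Hypothetical ES hosts, mirroring the hosts list in packetbeat.yml above.
	hosts := []string{"http://172.18.40.3:9200", "http://172.18.40.4:9200"}

	// A shared transport reuses idle connections, but it does not by itself
	// bound how many connections are open: a slow host plus more in-flight
	// requests simply means more dials. Concurrency has to be limited by the
	// caller (or, in newer Go versions, via Transport.MaxConnsPerHost).
	client := &http.Client{
		Transport: &http.Transport{
			MaxIdleConnsPerHost: 1,
			IdleConnTimeout:     30 * time.Second,
		},
		Timeout: 10 * time.Second,
	}

	for i := 0; i < 4; i++ {
		host := hosts[i%len(hosts)] // naive round robin over the configured hosts
		resp, err := client.Get(host)
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		resp.Body.Close() // draining/closing the body is required for connection reuse
	}
}
```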

@portante (Author) commented Sep 4, 2015

For the six nodes, here are the totals; not quite 100, but currently 77:

      1 172.18.40.3
     17 172.18.40.4
     12 172.18.40.5
     18 172.18.40.6
     14 172.18.40.7
     15 172.18.40.8

We just rebooted 172.18.40.3 yesterday, so that may be why there is only one connection to that node.

The ES instances are mostly idle; there are occasional spurts of indexing, but nothing too heavy. Here is the distribution of open sockets from clients on the ES instances:

--- gprfs003 ---  12
--- gprfs004 ---  30
--- gprfs005 ---  23
--- gprfs006 ---  28
--- gprfs007 ---  27
--- gprfs008 ---  29

Here is the memory and CPU usage, very low:

--- gprfs003 --- 
              total        used        free      shared  buff/cache   available
Mem:       49281020    20632604      808792       25448    27839624    28226716
Swap:      24772604           0    24772604
 01:47:35 up 2 days,  7:34,  3 users,  load average: 0.23, 0.25, 0.23
--- gprfs004 --- 
              total        used        free      shared  buff/cache   available
Mem:       49281020    27854424      258896      165424    21167700    20863932
Swap:      24772604      252108    24520496
 01:47:37 up 29 days,  5:02,  3 users,  load average: 0.30, 0.26, 0.24
--- gprfs005 --- 
              total        used        free      shared  buff/cache   available
Mem:       49281020    27782592      703080      190536    20795348    20909700
Swap:      24772604      338824    24433780
 01:47:38 up 29 days,  4:52,  2 users,  load average: 0.15, 0.20, 0.22
--- gprfs006 --- 
              total        used        free      shared  buff/cache   available
Mem:       49281020    27880616      570600      179724    20829804    20823388
Swap:      24772604      203504    24569100
 01:47:40 up 29 days,  4:47,  3 users,  load average: 0.18, 0.26, 0.24
--- gprfs007 --- 
              total        used        free      shared  buff/cache   available
Mem:       49281020    27836304     6964628      192196    14480088    20855508
Swap:      24772604      341468    24431136
 01:47:41 up 29 days,  4:26,  1 user,  load average: 0.11, 0.19, 0.22
--- gprfs008 --- 
              total        used        free      shared  buff/cache   available
Mem:       49281020    27667072     1019140      187636    20594808    21028176
Swap:      24772604      162820    24609784
 01:47:43 up 29 days,  4:26,  2 users,  load average: 0.27, 0.23, 0.23

We have considered that kind of setup (redis --> logstash --> ES), though at this point we are probably going to use Kafka instead of redis.

@tsg added the bug label Sep 4, 2015
@urso commented Sep 7, 2015

That's a lot of connections. It might be that the beat is creating new HTTP connections all the time to push data concurrently due to network latency (a TCP connection per request can become quite expensive).

@portante (Author) commented Sep 7, 2015

@urso, we had to restart our ES cluster due to hardware maintenance, so I no longer have it running in this state. I'll keep an eye out for this state in the future, and post updates if you think it will be helpful.

Thanks!

@urso commented Sep 8, 2015

Thanks.

If possible, check for excessive HTTP connections (TCP SYN and FIN packets between packetbeat and the elasticsearch servers).

@portante (Author) commented Sep 8, 2015

@urso, roger that.

@portante (Author) commented Oct 9, 2015

@urso, an update: I have not seen problems with packetbeat for the past month. I suspect that the original problems were due to losing connections to the Elasticsearch instances. Still watching; you can close this issue if you'd like.

@ruflin (Contributor) commented Oct 9, 2015

@portante Good to hear. I have closed the issue. Please reopen it if the problems occur again.

@ruflin closed this as completed Oct 9, 2015
@urso commented Oct 10, 2015

I'd like to keep this open. packetbeat seems to be misbehaving by opening an unbounded number of connections to elasticsearch, basically bypassing TCP congestion control by generating more traffic. After some time packetbeat is killed by the OS for resource exhaustion (too many file descriptors). That is, a bad network state or a failing elasticsearch instance can bring down packetbeat. Recent additions to libbeat for lumberjack can be leveraged to implement better load-balancing behavior, with exactly one connection per configured elasticsearch host.
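
(Editorial sketch, not the actual libbeat implementation: one way to get "exactly one connection per configured host" is one worker per host pulling batches from a shared bounded channel, so backpressure blocks the producer instead of opening new connections. All names below are hypothetical.)

```go
package main

import (
	"fmt"
	"sync"
)

// batch stands in for a set of events to publish; purely illustrative.
type batch []string

// publish is a placeholder for sending one batch over the worker's single
// connection to its host.
func publish(host string, b batch) {
	fmt.Printf("sent %d events to %s\n", len(b), host)
}

func main() {
	hosts := []string{"172.18.40.3:9200", "172.18.40.4:9200"} // hypothetical

	// A bounded channel provides backpressure: when every worker is busy,
	// the producer blocks here instead of opening extra connections.
	work := make(chan batch, len(hosts))

	var wg sync.WaitGroup
	for _, h := range hosts {
		wg.Add(1)
		go func(host string) { // exactly one worker (one connection) per host
			defer wg.Done()
			for b := range work {
				publish(host, b)
			}
		}(h)
	}

	for i := 0; i < 6; i++ {
		work <- batch{fmt.Sprintf("event-%d", i)} // blocks when all workers lag
	}
	close(work)
	wg.Wait()
}
```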

@urso reopened this Oct 10, 2015
@tsg added the 1.0.0-GA label Oct 12, 2015
@portante (Author)

@urso, I'll recheck my packetbeat instances today to see what their current open connections are like.

@portante (Author)

I have fifteen nodes running packetbeat to a 6-node ES cluster, where I configured each packetbeat host to talk to all 6 ES nodes directly. Well, I am also running packetbeat on the 6 ES nodes and they are only talking to their localhost instance.

I would expect 1 connection to each of the 6 nodes from each non-local packetbeat instance, and only 1 connection from each local packetbeat instance.

However, I am finding that there are 7 connections established on all the non-local hosts, where the first node in the configuration list is connected twice. And on the local hosts, I am seeing three established connections.

There is one exception to the above: one host has 13 connections, two to each ES node, with the first ES node in the configuration list connected three times.

@urso commented Oct 14, 2015

@portante I kind of expected this. So whenever ES or the network generates some backpressure, packetbeat opens a new connection to ES (without limiting the number of connections).

@portante (Author)

@urso, and it appears to do this right from the beginning with the first connection.

@urso commented Oct 14, 2015

maybe due to TCP slow-start?

@portante (Author)

I just updated to -beta4 and I am now seeing errors like:

procs.go:271: Open: open /proc/net/tcp: too many open files

I see 500+ files open for /proc/<pid>/fd and /proc/<pid>/net/tcp from one or two of the deployed packetbeat instances.

@portante (Author)

@urso, I'll open a new issue for the "too many open files" problem. As for this issue, I have not seen any of the extra connections.

@urso commented Oct 26, 2015

ok, thanks.

@urso commented Oct 26, 2015

@portante can you link the new issue here?

@andrewkroh (Member)

The issue is #335, and the associated PR to fix it is #337.

@ruflin closed this as completed Oct 28, 2015
ruflin added a commit that referenced this issue Dec 2, 2015
tsg pushed a commit to tsg/beats that referenced this issue Jan 20, 2016