packetbeat stops logging data with too many open files for pids #226
I see a missing close in FindSocketsOfPid. Which publisher have you configured? Can you check with lsof whether your packetbeat instance holds loads of file descriptors into /proc?
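For context, a descriptor leak of the shape described above looks like the following Go sketch. The function name and details are hypothetical, not the actual packetbeat source; the point is the deferred `Close`, without which every call leaks one file descriptor until the process hits "too many open files":

```go
// Hypothetical sketch of the kind of leak described above, not the
// actual packetbeat code: a file under /proc is opened on every call,
// so a missing Close leaks one descriptor per call.
package main

import (
	"bufio"
	"fmt"
	"os"
)

// countSocketLines opens /proc/<pid>/net/tcp and counts its lines.
// The deferred Close is the important part: without it, each call
// would leak the descriptor.
func countSocketLines(pid int) (int, error) {
	f, err := os.Open(fmt.Sprintf("/proc/%d/net/tcp", pid))
	if err != nil {
		return 0, err
	}
	defer f.Close() // always release the descriptor

	n := 0
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		n++
	}
	return n, scanner.Err()
}

func main() {
	n, err := countSocketLines(os.Getpid())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("socket table lines:", n)
}
```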
@urso, checking ... well, I just restarted it about 25 minutes ago and I don't see a lot of open files to /proc. Here is the lsof output:
BTW, I am using a 6-node ES cluster, but I see 7 connections; the first node has two connections to it.
@urso, on one of my other nodes I see close to 100 open FDs to the ES instances. We have had to reboot some of the ES nodes; perhaps the connections are not being closed down properly?
For sending requests we use the golang net/http package. The publisher tries to put data round robin on the configured nodes. If one node becomes unavailable, round-robin load balancing happens on the remaining nodes. I'm not sure whether actual open sockets are reused or a new connection is made if one insert takes much too long. If one packetbeat has 100+ parallel running connections, it might be an indicator of packetbeat producing more data than ES can consume right now. Makes me wonder about the overall balance in your system. Are these 100 open FDs mostly to ES? Are they about evenly distributed between your ES instances, or is one getting slow? Can you also check your ES instances for number of open sockets, memory, and CPU usage? Have you considered this kind of setup: https://www.elastic.co/guide/en/beats/packetbeat/current/packetbeat-logstash.html?
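As an illustration of the round-robin behavior described above, host selection could look like the minimal sketch below. This uses an atomic counter and a caller-supplied set of down hosts; it is not the actual libbeat publisher, just the general shape:

```go
// Minimal round-robin host selection sketch, not the actual libbeat
// implementation. Hosts reported as down are skipped until every host
// has been tried once.
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

type roundRobin struct {
	hosts []string
	next  uint64
}

// pick returns the next host in rotation, skipping any listed in down.
func (r *roundRobin) pick(down map[string]bool) (string, error) {
	for i := 0; i < len(r.hosts); i++ {
		h := r.hosts[atomic.AddUint64(&r.next, 1)%uint64(len(r.hosts))]
		if !down[h] {
			return h, nil
		}
	}
	return "", errors.New("no hosts available")
}

func main() {
	rr := &roundRobin{hosts: []string{"es1:9200", "es2:9200", "es3:9200"}}
	down := map[string]bool{"es2:9200": true} // es2 is unavailable
	for i := 0; i < 4; i++ {
		if h, err := rr.pick(down); err == nil {
			fmt.Println(h) // rotates over es1 and es3 only
		}
	}
}
```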
For the six nodes, here are the totals, not quite 100, but currently 77:
We just rebooted 172.18.40.3 yesterday, so that may be why there is only one connection to that node. The ES instances are mostly idle, though at times there are spurts of indexing going on, but it is not too heavy. Here is the open-socket distribution on the ES instances to clients:
Here is the memory and CPU usage, very low:
We have considered that kind of setup, the redis --> logstash --> ES pipeline; we are probably going to use Kafka instead of redis at this point.
That's a load of connections. It might be that the beat is creating new HTTP connections all the time to push data concurrently due to network latencies (a TCP connection per request can become quite expensive).
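Since the publisher uses Go's net/http, whether connections are reused comes down to the Transport configuration and to response bodies being fully drained. A hedged sketch of capping connection growth follows; the limits and URL are illustrative values, not packetbeat's actual settings:

```go
// Sketch of capping and reusing HTTP connections with Go's net/http
// Transport. Values are illustrative, not packetbeat's configuration.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	transport := &http.Transport{
		MaxConnsPerHost:     1,                // hard cap on connections per host (Go 1.11+)
		MaxIdleConnsPerHost: 1,                // keep that one connection pooled for reuse
		IdleConnTimeout:     90 * time.Second, // recycle idle connections eventually
	}
	client := &http.Client{Transport: transport, Timeout: 30 * time.Second}

	resp, err := client.Get("http://localhost:9200/")
	if err != nil {
		fmt.Println(err)
		return
	}
	// Draining and closing the body is what lets net/http return the
	// TCP connection to the pool instead of dialing a new one next time.
	io.Copy(io.Discard, resp.Body)
	resp.Body.Close()
}
```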
@urso, we had to restart our ES cluster due to hardware maintenance, so I no longer have it running in this state. I'll keep an eye out for this state in the future, and post updates if you think it will be helpful. Thanks!
Thanks. If possible, check for excessive HTTP connections (TCP SYN and FIN packets between packetbeat and the elasticsearch servers).
@urso, roger that.
@urso, an update: I have not seen problems with packetbeat for the past month. I suspect that the original problems were due to losing connections to the Elasticsearch instances. Still watching; you can close this issue if you'd like.
@portante Good to hear. I closed the issue. Please reopen it if the problems occur again.
I'd like to keep this open. packetbeat seems to be misbehaving by opening an unbounded number of connections to elasticsearch, basically bypassing TCP congestion control by generating ever more traffic. After some time packetbeat is killed by the OS for resource exhaustion (number of file descriptors). That is, a bad network state or a failing elasticsearch instance can bring down packetbeat. Recent additions to libbeat for lumberjack can be leveraged to implement better load-balancing behavior, with exactly one connection per configured elasticsearch host.
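The shape proposed here — exactly one connection per configured host, with backpressure blocking the producer instead of spawning new sockets — can be sketched as one goroutine per host draining a shared bounded channel. This is an illustration of the idea, not the libbeat implementation:

```go
// One worker (and hence one connection) per configured host, all fed
// from a single bounded queue. When every host is slow, senders block
// on the channel instead of opening more sockets. Illustrative only.
package main

import (
	"fmt"
	"sync"
)

func main() {
	hosts := []string{"es1:9200", "es2:9200", "es3:9200"}
	events := make(chan string, 100) // bounded: producers block when full

	var wg sync.WaitGroup
	for _, h := range hosts {
		wg.Add(1)
		go func(host string) {
			defer wg.Done()
			// A real publisher would own exactly one connection to
			// `host` here and push bulk requests over it.
			for ev := range events {
				fmt.Printf("%s <- %s\n", host, ev)
			}
		}(h)
	}

	for i := 0; i < 10; i++ {
		events <- fmt.Sprintf("event-%d", i)
	}
	close(events)
	wg.Wait()
}
```

With a bounded queue, a slow or failing ES node slows the pipeline down rather than exhausting file descriptors.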
@urso, I'll recheck my packetbeat instances today to see what their current open connections are like.
I have fifteen nodes running packetbeat to a 6-node ES cluster, where I configured each packetbeat host to talk to all 6 ES nodes directly. I am also running packetbeat on the 6 ES nodes themselves, and those talk only to their localhost instance. I would expect 1 connection to each of the 6 nodes from each non-local packetbeat instance, and only 1 connection from each local packetbeat instance. However, I am finding that there are 7 connections established on all the non-local hosts, where the first node in the configuration list is connected twice. And on the local hosts, I am seeing three established connections. There is one exception to the above: one host has 13 connections, two to each ES node, with the first ES node in the configuration list connected three times.
@portante I kinda expected this. So whenever ES or the network generates some backpressure, packetbeat opens a new connection to ES (without limiting the number of connections).
@urso, and it appears to do this right from the beginning, with the first connection.
maybe due to TCP slow-start?
I just updated to -beta4 and I am now seeing errors like:
I see 500+ files open for /proc/<pid>/fd and /proc/<pid>/net/tcp from one or two of the instances of packetbeat deployed.
@urso, I'll open a new issue for the "too many open files" problem. As for this issue, I have not seen any of the extra connections.
ok, thanks.
@portante can you link the new issue here? |
Referenced commit: respect * debug selector in IsDebug
Saw this on my install of packetbeat: "Packetbeat version 1.0.0-beta2 (amd64)" (the amd64 is wrong, it is an x86_64 CPU).