
Read error looking ack & i/o timeout (since 0.4) #438

Closed
knalli opened this issue Apr 15, 2015 · 45 comments

@knalli

knalli commented Apr 15, 2015

I'm faced with both Read error looking for ack: read tcp $host:$port: i/o timeout and Failed to tls handshake with $host read tcp $host:$ip: i/o timeout for a couple of days now.

[The effective result is that I'm losing events (i.e. no output in logstash.stdout with debug enabled): no more events on the LS side after some seconds/minutes. Restarting LS does not help any more, and stopping LS does not even finish in time (possibly a LS issue itself?). I also notice occasional high CPU usage of the logstash daemon (100%, full load on a tiny VM). It also looks like there is a huge memory issue.]

Running LS 1.5 RC2, the last changes since last week were:

  • updated logstash-forwarder from 0.3 to the latest 0.4
  • activated (i.e. started using) two filters (geoip, useragent)

Neither disabling the new filters nor reducing to only one active logstash-forwarder agent helped at all... so I re-installed the older logstash-forwarder 0.3: the flow of events is processed successfully again. If I additionally start a logstash-forwarder 0.4 (on another "app host"), all logstash-forwarders (incl. the 0.3 ones) show the same issue. If I restart all services again except the logstash-forwarder 0.4, everything works again without any issues.

So tl;dr: a single forwarder agent (0.4) seems to break all other agents (regardless of whether they are 0.3 or 0.4).

Side notes:

  • All machines are running the latest:
    • Amazon AMI with Ruby 2p864, Oracle Java 8u40
    • Elasticsearch 1.4.4, Logstash 1.5 RC2
  • the list of installed plugins (/opt/logstash/bin/plugin --installed --verbose) is exactly the same

My config is identical in both versions:

{
  "network": {
    "servers": [ "$host:$port" ],
    "timeout": 15,
    "ssl ca": "/etc/pki/tls/certs/$cert"
  },
  "files": [ {
...    list of about 10 path-type-items with exact file patterns (no wildcards)
  } ]
}

There are already related issues, but I'm not aware of their relevance to logstash-forwarder 0.4: #421, #429, #431, #91

Let me know which additional information you need. Because the forwarder was causing the trouble, I've created the issue here. I'm not quite sure which component is responsible, though.

@knalli changed the title from "Read error looking ack & i/o timeout" to "Read error looking ack & i/o timeout (since 0.4)" on Apr 15, 2015
@torrancew

@knalli it sounds like your Logstash is under-performing. You mentioned EC2 and tiny VMs - what size instance are you running LS on, and how many forwarders are reporting to it? Did the # of forwarders increase in the last week? Did you change the timeout setting in LSF in the last week?

@knalli
Author

knalli commented Apr 15, 2015

My dev node running ELK (Elasticsearch, Logstash, Kibana, all with defaults) is/was a very tiny instance (actually not EC2, but a real AMI on a local Xen server). It had 3 GB RAM, but I already increased that to 6 GB yesterday. All local dev/staging hosts are forwarding; that is 5-6 hosts, each with a forwarder pushing 2-6 logfiles. In sum, that is less than 50 events per minute. However, since pushing had been stopped for some time, after re-enabling everything again there was a peak of 6000 messages within a timeframe of 10 minutes (actually less).

[Screenshot: events over time ("Bildschirmfoto 2015-04-15 um 22 01 11")]

You can see clearly when I made the rollback. :)

And again: even with 6 GB RAM and only one forwarder, the connections were dropped/timed out immediately (within seconds).

And I forgot one additional piece of information: the LS agent produces memory overcommitment issues (in my case about 8-9 GB) with the default setup under these environment circumstances.

The test/prod setups are both running on an m3.medium (3.75 GB, one core). That's definitely tiny. However, while the prod node is running >90% idle and has only about 1 GB memory committed (fair enough?), the test node has a base load of 20-30% (entirely the logstash agent) and is already overcommitting again (and there is no output on logstash.stdout while new events are being produced/pushed).

If you say I need more memory or cores: Yeah, that sounds okay. But I cannot make a reasonable upgrade unless I know what the setup really requires.

  • My already upgraded local dev host (formerly 3 GB, now 6 GB RAM) shows the same problem with a minimal setup (and one forwarder 0.4).
    • Running forwarder 0.3, everything is okay.
  • My test and prod hosts are configured equally (specs, config), except there are 3 more forwarders which currently do not produce a notable amount of events (nor does the app itself).
    • The test host is running out of control with 0.4.
    • The prod host is running smoothly (0.3).

Did the # of forwarders increase in the last week?

Do you mean effective instances of forwarders aka forwarding hosts? Nope, the "cluster size" has not changed for weeks.

Did you change the timeout setting in LSF in the last week?

I had left the default of 15; however, I have changed it for debugging. Increasing the timeout does not seem to change anything. The connection drops... I would say regardless of the timeout?

@driskell
Contributor

Have you tried deleting the state file .logstash-forwarder on test to reset everything so it only tails logs? (You WILL lose logs here though!)

It might be that test is so far behind by now that it sends too many logs at once to Logstash and explodes memory, whereas prod is trickling along fine. Deleting the state file will reset things so it tails the logs from the end rather than trying to resume from potentially a while back.
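
Roughly, the reset would look like this (paths here are assumptions - the state file lives in whatever working directory your init script starts the forwarder from, so check that first):

service logstash-forwarder stop
rm -f /var/lib/logstash-forwarder/.logstash-forwarder    # location is an assumption; check the daemon's working directory
service logstash-forwarder start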

Maybe also reduce spool size.

Generally, scale your spool size based on Logstash memory, average event size, and number of forwarders. Scale your timeout based on the time Logstash needs to process a spool-size number of events. There's no exact math, and it's different for everyone.

The problem is that the forwarder is not graceful performance-wise. Because it resends on timeout and Logstash doesn't notice, Logstash keeps processing the previous send but now ALSO processes the second send... then a third... and so on. Thus memory grows and the timeouts get worse.
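
As a sketch of where those two knobs live (the values are placeholders, not recommendations; -spool-size is the forwarder's command-line flag, the timeout sits under "network" in the forwarder config):

./logstash-forwarder -config /etc/logstash-forwarder.conf -spool-size 100

  "network": {
    "servers": [ "$host:$port" ],
    "timeout": 60,
    "ssl ca": "/etc/pki/tls/certs/$cert"
  }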

@knalli
Author

knalli commented Apr 15, 2015

Maybe also reduce spool size.

Oops, when replacing my own shim rpm with the new 0.4 one, I lost my config with the pre-defined spool size of 100. What is the default? Without resetting the forwarders this works only for 2-3 minutes until the i/o error appears. But when the agents are not started in parallel, it works (a performance issue on the Logstash side?).

Have you tried deleting the state file .logstash-forwarder on test to reset everything so it only tails logs? (You WILL lose logs here though!)

I can try this; losing events is not a problem at this moment.
Are you sure deleting the file should make it tail? Because after deleting that file, the restarted forwarder is still trying to recover everything.

If a spool size of 100 is not the default, the many Registrar: processing 100 events messages would be explained by your last paragraph.

Generally, scale your spool size based on Logstash memory, average event size, and number of forwarders. Scale your timeout based on the time Logstash needs to process a spool-size number of events. There's no exact math, and it's different for everyone.

Okay, I understand your "it depends" remarks. But at least some example would be helpful. If I understand you correctly, the huge memory/CPU consumption (on the LS side) is due to a performance issue (memory/file descriptors, incoming events/logs?). And this results in logstash itself not being able to process anything anymore while eating memory and CPU?

I only want to understand whether I'm facing a real issue or only a performance issue. Since I can compare both 0.3 and 0.4 setups on the same cluster (dev) as well as side-by-side setups (test and prod), I was (am?) in doubt that throwing more memory at it would help.

@knalli
Author

knalli commented Apr 16, 2015

Update: After all, the forwarder (0.4) did not recover the logs correctly and/or the logstash daemon itself was unresponsive (and did not come back, even after a night).

I have done a rollback to 0.3 on my test environment, and now everything looks fine. Will have to monitor this for some hours.

Have you tried deleting the state file .logstash-forwarder on test to reset everything so it only tails logs? (You WILL lose logs here though!)

Indeed, I've noticed forwarder 0.3 tails logs when no .logstash-forwarder file is found, but 0.4 does not do this (last night, I saw Registrar: processing 100 events for minutes with only one forwarder).

Additionally, the logstash daemon is now at the same CPU/memory level as my still-working production cluster.

@driskell
Contributor

Ah. I think this is the tailOnRotate / from-beginning problem. The previous 0.3 would always start a file from the end, even if it had just rotated, so it could lose logs. 0.4 now always starts from the beginning, even if the file has not rotated. So without .logstash-forwarder it will start from the beginning unless tailOnRotate is given on the command line. It's still wrong, but somewhat better. To make it behave the same you'll need to modify the init script to specify tailOnRotate=true. Really this is a bug: it should start from the end if .logstash-forwarder does not exist, and from then on resume, or start from the beginning for new files.

The default spool-size is 1024, so it looks like you've already reduced it. Regarding memory: if you have an average line size of 1 KB, 50 forwarders, and a spool size of 1000, that means 1 KB × 50 × 1000 = 50 MB before compression. Since you'll need to account for the decompression it'll be approx. 1.5 × 50 MB = 75 MB. So you can see: the more forwarders, the more memory, and the bigger the log lines, the more memory. Difficult to calculate sometimes.
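
Written out as a rule of thumb (the 1 KB average line size is just an assumption - measure your own logs):

  memory per burst ≈ avg_line_size × forwarders × spool_size × 1.5 (decompression overhead)
                   ≈ 1 KB × 50 × 1000 × 1.5 ≈ 75 MB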

As mentioned previously, the main issue you'll have is the timeout problem. If the forwarder times out, Logstash keeps processing, and the forwarder resends. This doubles the load. Then if that times out it sends again, and that triples it. So it's important to get the timeout setting right and increase it as much as necessary, so that the timeout value is bigger than the time it takes Logstash to process a spool size worth of events at full load (in your case, 100). The easiest thing to do is to start really, really high, put everything under load, and then lower it gradually until you hit timeouts, then bring it back up.

@knalli
Author

knalli commented Apr 16, 2015

Okay, so this means: how can I determine a good timeout? Is there any information/metric exposed by logstash about the current load/queue? Perhaps via JMX?

@driskell
Contributor

My experience has been trial and error, since it depends entirely on CPU / memory (GC runs) / filters in use / outputs in use / CPU on the output server / memory on the output server. So many factors are involved in throughput, and it's all based on throughput.

In the end I implemented partial ACK so the timeout became irrelevant: Logstash would tell the forwarder to "wait" as a form of flow control. This was in #180, but it was contentious, a very big patch, and came during a time of low maintenance of the project - it won't apply anymore as it's very old. I maintain it in my fork though, Log Courier.

@knalli
Author

knalli commented Apr 16, 2015

@driskell whoa, another log publisher. Looks interesting, thanks.

@afzal273

I'm getting these errors and logstash stops processing logs from logstash-forwarder -

2015/04/21 22:46:56.557989 Failed to tls handshake with 10.10.10.10 read tcp 10.10.10.10:5000: i/o timeout
2015/04/21 22:46:57.559003 Connecting to [10.10.10.10]:5000 (name.domain.com)
2015/04/21 22:47:12.589579 Failed to tls handshake with 10.10.10.10 read tcp 10.10.10.10:5000: i/o timeout
2015/04/21 22:47:13.606307 Connecting to [10.10.10.10]:5000 (name.domain.com)
2015/04/21 22:47:28.636868 Failed to tls handshake with 10.10.10.10 read tcp 10.10.10.10:5000: i/o timeout

It seems like logstash is not keeping up with the logstash forwarders. The Logstash node is in GCC and the forwarders are running on nodes in AWS; there are about 40k events/sec sent to logstash. There are no errors in /var/log/logstash showing that logstash is having issues, and the CPU/memory and disk I/O on the node are not very high. It is an 8-CPU machine with 8 GB RAM.

How many events/sec can logstash handle? What are the recommendations for me? Should I be creating more logstash instances and putting all the instances in the logstash forwarder config? Can other logstash instances use the same cert/key as the first logstash instance?

@torrancew

@afzal273 it may help to tweak the network timeout (and possibly spool_size) of your forwarders. The default timeout is 15, which may be too aggressive if you have particularly high-latency networks, somewhat sluggish hosts, etc.

@afzal273

I could try that, but I'm running logstash forwarders on several nodes and they all show the same error after a few days of running at the same time. Only when I restart logstash on the logstash node and also stop the logstash forwarder on several nodes does it start working again. If I only restart logstash and keep all the logstash forwarders running, I still see the same error on the logstash forwarder side.

@torrancew

@afzal273 That's pretty consistent with similar issues I've had. Effectively, if things do start timing out, based on what I've seen, a mild socket leak breaks out on the Logstash server, which causes the timeout to be almost unresolvable in the face of the oncoming swarm of connections. In my case, greatly increasing the timeout (60s for my EC2, multi-AZ architecture) seems to have resolved the connectivity issues (though as you mentioned, I had to help things along initially, by stopping the existing forwarders).

@afzal273

Sweet, I did both - increased the timeout and also added another logstash instance. In the forwarder config I put both logstash node names. I also had to create a wildcard cert and replace the certs everywhere, as in the earlier case the CN was the hostname. Will see how things go now!
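
For reference, the forwarder side of that change looks roughly like this (hostnames, port and cert path are placeholders for the real values):

  "network": {
    "servers": [ "logstash-1.domain.com:5000", "logstash-2.domain.com:5000" ],
    "timeout": 60,
    "ssl ca": "/etc/pki/tls/certs/wildcard.domain.com.crt"
  }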

@afzal273

Also, what are the indications that logstash is having a problem when it's giving an i/o timeout? Is there a way to tell from CPU/memory or any logs?

@torrancew

In theory, it could be any of the above, plus a few other things :( - open socket count (particularly things in some type of WAIT state) can contribute, as can iowait.
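
A couple of quick checks on the Logstash host can make those visible (plain Linux tools, nothing Logstash-specific):

netstat -atn | awk '{print $6}' | sort | uniq -c | sort -rn    # connection counts by state; watch for *_WAIT piling up
iostat -x 5                                                    # per-device utilization and %iowait over 5-second intervals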

@easescript

Upgrading Logstash from 1.4.2 to 1.5.0-rc3 seems to fix the "read error" problem.

2015/04/27 16:36:34.457570 Read error looking for ack: read tcp LOG-SERVER:5043: i/o timeout
2015/04/27 16:36:34.457828 Setting trusted CA from file: /etc/pki/tls/certs/LOG-SERVER.crt
2015/04/27 16:36:34.459485 Connecting to [LOG-SERVER]:5043 (LOG-SERVER)
2015/04/27 16:36:34.534151 Connected to LOG-SERVER
2015/04/27 16:37:04.534516 Read error looking for ack: read tcp LOG-SERVER:5043: i/o timeout
2015/04/27 16:37:04.534716 Setting trusted CA from file: /etc/pki/tls/certs/LOG-SERVER.crt
2015/04/27 16:37:04.541027 Connecting to [LOG-SERVER]:5043 (LOG-SERVER)
2015/04/27 16:37:04.761202 Connected to LOG-SERVER
2015/04/27 16:37:13.168328 Registrar: processing 574 events
2015/04/27 16:37:13.180002 Registrar: processing 2 events
2015/04/27 16:37:13.182860 Registrar: processing 1 events

@knalli
Author

knalli commented Apr 28, 2015

@easescript Hm.. but the error (at 16:37:04) is still being logged?


Meanwhile, I'm facing the issue with both forwarder 0.3 and 0.4. I've found a few particular issues so far, probably more on the logstash (agent) side:

  • It looks like problems in filters produce such communication errors, e.g.
    • trying to anonymize an IP where actually no IP is available (checked: restarted with and without)
    • trying to parse a specific log with multiline (checked: restarted with and without; other multiline configs are not affected)
  • different versions of plugins (first install vs. ./plugin update)
  • I've tried forwarder 0.3, 0.4 and even log-courier. All work fine until some specific data/events are transmitted.

Currently I'm not sure what the issue is exactly. But so far, my machines can (if everything works) process 8k events per minute with the same setup as above (while recovering some days of logs). IMHO in my case the issues seem to be configuration and/or plugin issues. Unfortunately, there are no (error) messages except the standard output of processed events. It is still not ideal that both components blow each other up. However, from my point of view, it does not look like a specific 0.4 issue.

@driskell
Contributor

@knalli Is your problem that Logstash crashes, or that it just stops processing?

If you can start it and run kill -QUIT 1234 (where 1234 is the Logstash PID) while things are working fine, it will write a trace of the threads to the Logstash log file. Then when it hangs, do the same again. If you put those in a gist it can help isolate why things are hanging (comparing the first and second sets to see if any threads have died, and using the second set to detect hung threads).
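
A minimal sketch of that capture (the PID-file path is an assumption based on typical RPM packaging - adjust it to wherever your init script writes the PID):

PID=$(cat /var/run/logstash.pid)
kill -QUIT "$PID"    # first dump, taken while processing still works
# ... wait for the hang, then:
kill -QUIT "$PID"    # second dump, to compare thread-by-thread against the first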

@knalli
Author

knalli commented Apr 28, 2015

@driskell Ah, thank you. I'll try anything ;)

Ehm... and logstash seems to just stop. No more processing (no std output, and actually no events in the database), it keeps gaining memory and CPU, and the standard service stop script does not succeed any more (=> kill -9).

@knalli
Author

knalli commented Apr 28, 2015

@driskell : https://gist.github.com/knalli/2fa29e1080dadbb6da0b

Logstash 1.5 RC2 (I've downgraded again) -- upgrading to RC3 itself did not solve this one.

@driskell
Contributor

@knalli Looks like the worker thread died without raising an error. Have you tried running Logstash on the command line and reproducing, in case it's sending the error to the wrong place? I remember some unit scripts losing stderr and only logging stdout.

I think it's:
./bin/logstash -f /etc/logstash/logstash.conf

@knalli
Author

knalli commented Apr 28, 2015

I'm using the rpm and all three logs (err, stdout, log) are used, i.e. a configuration problem (missing plugin) gets logged in err. Is that what you mean?

@driskell
Contributor

OK, so I would say it should be logging to err if there's a thread blow-out. Can you post a gist of your filters config? Maybe we can check each part of it. I know you mentioned anonymize earlier - looking at the plugin source, I think it would actually crash out if the field was not an IP address. Maybe wrap it in a regex conditional that only matches an IP? You can raise an issue in the plugin repository too: https://github.com/logstash-plugins/logstash-filter-anonymize
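
As a sketch of that guard (the field name and key are assumptions - substitute whatever your config actually anonymizes):

filter {
  if [clientip] =~ /^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/ {
    anonymize {
      fields    => ["clientip"]
      algorithm => "SHA1"
      key       => "$secret"
    }
  }
}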

@knalli
Author

knalli commented Apr 28, 2015

OK, so I would say it should be logging to err if there's a thread blow-out.

It always logs output, e.g.:

Apr 28, 2015 7:14:53 AM org.elasticsearch.plugins.PluginsService <init>
INFORMATION: [hostname.domain.tld-4013-15074] loaded [], sites []

Can you post a gist of your filters config?

Here is the relevant one: https://gist.github.com/knalli/a5a5f68cd755e2f33f66
I can provide everything if you think that is required.

The last event of stdout is this here: https://gist.github.com/knalli/851ec3721d9ef4f0a6be

Can raise an issue in the plugin repository too

Yeah, I've done this already.

@driskell
Contributor

I've commented on your gist: try using the date filter rather than using mutate to replace @timestamp with another field that is actually a string. The date filter will ensure the type etc. is correct and prevent any issue with type conversions in the pipeline, which could be what's happening.
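
For the record, a sketch of that replacement (the source field name and the pattern are assumptions - they have to match your actual log format):

filter {
  date {
    match  => [ "log_timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    target => "@timestamp"
  }
}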

Are you setting any fields on the events in logstash-forwarder/log-courier configs?

@driskell
Contributor

@knalli Also - it could be worth running Logstash on the command line and passing --debug as a command-line flag. This will spam standard output, but it should mean it dies when a thread crashes rather than limping along dead, and it MAY produce more output (worth a try, failing all else!)

@knalli
Author

knalli commented Apr 28, 2015

Are you setting any fields on the events in logstash-forwarder/log-courier configs?

I think you mean this here?

{
    "paths": [ "/var/log/tomcat/catalina.out" ],
    "fields": { "type": "tomcat" }
}

Regarding the other changes: I'll try this next.

@driskell
Contributor

Yep, that looks fine. Type is fine. I just know that setting some things Logstash uses internally can be problematic (like tags). If the date filter does not help, still try running with --debug.

@knalli
Author

knalli commented Apr 28, 2015

I've replaced my filter setup... so far it's still running and I can even restart logstash w/o problems. If this is still working tomorrow, I will retry with 0.4 to rule out the actual forwarder issue.

Regarding --debug: already checked, but the log files have not produced any output (stdout, log, err). But we will see; next time I'm better prepared. ;)

@driskell
Contributor

@knalli Good to see some progress. It is strange that it's not logged, though. Did you try running Logstash on the console directly, just to rule out buffering in the log files etc.?

@knalli
Author

knalli commented Apr 28, 2015

Not yet. Will work on this.

@knalli
Author

knalli commented Apr 28, 2015

Still figuring out how I can actually use the "directly" way. Tomorrow...

Meanwhile, I've tried LS 1.5 RC3 again. Nope, that is even worse. While the fix with the date filter is still working with RC2, the same setup "breaks" with RC3. Just like before, without any error. After a rollback to RC2 (fresh install, no update/downgrade) it has been running for hours.

Also, I've noticed it can be good to uninstall and re-install the complete package (rpm) instead of updating. I suspect the plugins aren't the same after an update RC2->RC3 vs. a fresh RC3 install. Anyway, without further information that is only guessing.

@driskell
Contributor

./bin/logstash --debug -f /path/to/logstash.conf

@driskell
Contributor

The rpm should have the same default plugins, but probably newer versions in the newer RPM. Though I can't see any reason for a failure in RC3.

Probably the best bet is to try to actually get the error to spit out with the old config. If we can run the above command on the console, we can rule out the error being trapped in a buffer. Failing that, I can offer patches, but I don't know how far you'll want to take this.

@easescript

Just found that RC3 has an issue on CentOS 6: "service logstash stop" won't stop logstash; I had to "kill -9 PID" to stop it.

@afzal273

Lots of times I see this -

hostname              %{syslog_hostname}
message               %{syslog_message}
syslog_facility       user-level
syslog_facility_code  1
syslog_severity       notice
syslog_severity_code  5
tags                  _grokparsefailure

Does that mean my logstash is overloaded? Because from the same host there are some logs that fail to parse and some that work.

@driskell
Contributor

Parse failure is unrelated to timeout or overload and means your grok isn't matching. Best to raise a new issue in the Logstash repo for help, or head to the mailing list.

@knalli
Author

knalli commented Apr 29, 2015

@driskell

Running RC3, I've found this one. Particularly for you ;)

Exception in filterworker {"exception"=>#<NameError: undefined local variable or method `event' for #<LogStash::Pipeline:0x722ecf12>>, "backtrace"=>["(eval):32:in `initialize'", "org/jruby/RubyProc.java:271:in `call'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.0.rc3-java/lib/logstash/pipeline.rb:309:in `flush_filters'", "org/jruby/RubyArray.java:1613:in `each'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.0.rc3-java/lib/logstash/pipeline.rb:308:in `flush_filters'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.0.rc3-java/lib/logstash/pipeline.rb:317:in `flush_filters_to_output!'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.0.rc3-java/lib/logstash/pipeline.rb:223:in `filterworker'", "org/jruby/ext/thread/Mutex.java:149:in `synchronize'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.0.rc3-java/lib/logstash/pipeline.rb:223:in `filterworker'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.0.rc3-java/lib/logstash/pipeline.rb:156:in `start_filters'"], :level=>:error, :file=>"logstash/pipeline.rb", :line=>"231", :method=>"filterworker", "plugin"=>"input/courier", "port"=>5001}
config LogStash::Codecs::Plain/@charset = "UTF-8" {:level=>:debug, :file=>"logstash/config/mixin.rb", :line=>"112", :method=>"config_init", "plugin"=>"input/courier", "port"=>5001}

That's actually a little bit strange, because log-courier is only available but actually not in use (5000 lumberjack, 5001 log-courier).

$ netstat -atnp | grep 5001
tcp        0      0 :::5001                     :::*                        LISTEN      25525/java

Anyway... now the daemon has stopped processing. Dump here: https://gist.github.com/knalli/98c751e7652dc2678156

@driskell
Contributor

OK. Did that appear in the log file? Or did you have to run it from the command line? I'll take a look, but that's not my code! And it's odd that the filterworker logger is reporting errors as if they were from an input!

@knalli
Author

knalli commented Apr 29, 2015

Ah, sorry. Via the command line and with debug enabled.

And I assumed that because of the plugin's name.

@driskell
Contributor

Ah, so when run as a service, the log file doesn't report the error, but when run on the command line, it does. Definitely a buffering problem in the logger if that's the case - I'll take a look later and raise it with Logstash.

Yes, the plugin name there is actually added by my plugin - but it looks like it gets added to EVERY log line, even those not belonging to the plugin! I can fix this in the next release and do the logging properly - sorry for the confusion.

Will look at the actual error more closely soon.

@knalli
Author

knalli commented Apr 29, 2015

Hm, not sure. I've added the --debug flag also. And with the colorized output it is easier to find the errors indeed.

@driskell
Contributor

@knalli The error in RC3 is known and fixed for the next release: elastic/logstash#3100

I'll work on courier logging and also try to reproduce the log buffering and see if it can be fixed upstream.

@knalli
Author

knalli commented Jun 3, 2015

So far, it is working much better in the 1.5 GA. I'll close this.
