Read error looking for ack & i/o timeout (since 0.4) #438
Comments
@knalli it sounds like your Logstash is under-performing. You mentioned EC2 and tiny VMs - what size instance are you running LS on, and how many forwarders are reporting to it? Did the number of forwarders increase in the last week? Did you change the timeout setting in LSF in the last week?
My dev node running ELK (Elasticsearch, Logstash, Kibana, all with defaults) is/was a very tiny instance (actually not EC2, but a real AMI on a local Xen server). It had 3GB RAM, but I already increased that to 6GB yesterday. All local dev/staging hosts are forwarding; those are 5-6 hosts, each with a forwarder pushing 2-6 logfiles. In sum, that is fewer than 50 events per minute. However, since the pushing had been stopped for some time, after re-enabling everything there was a peak of 6000 messages within a timeframe of 10 minutes (actually less). You can clearly see when I made the rollback. :) And again: even with the 6GB RAM and only one forwarder, the connections were dropped/timed out immediately (within seconds). One additional piece of information I had forgotten: the LS agent produces memory overcommitment issues (in my case about 8-9GB) with the default setup under these environment circumstances.

The test/prod setups are both running on an m3.medium (3.75 GB, one core). That's definitely tiny. However, while the prod node is running >90% idle and has only about 1GB of memory committed (fair enough?), the test node has a base load of 20-30% (entirely the logstash agent) and is already overcommitting again (and there is no output in logstash.stdout while new events are being produced/pushed). If you say I need more memory or cores: yeah, that sounds okay. But I cannot make a reasonable upgrade unless I know what the setup really requires.
Do you mean effective instances of forwarders aka forwarding hosts? Nope, the "cluster size" has not changed for weeks.
I had left the default of
Have you tried deleting the state file .logstash-forwarder on test to reset everything so it only tails logs? (You WILL lose logs here though!) It might be that test is so far behind now that it sends too many logs at once to Logstash and explodes memory, whereas prod is trickling along fine. Deleting the state file resets things so it tails the logs from the end rather than trying to resume from potentially a while back. Maybe also reduce the spool size.

Generally, scale your spool size based on Logstash memory, the average event size and the number of forwarders. Scale your timeout based on the time Logstash needs to process a spool-size worth of events. There's no exact math and it's different for everyone. The problem is that the forwarder is not graceful performance-wise. Because it resends on timeout and Logstash doesn't notice, Logstash keeps processing the previous send but now ALSO processes the second send... then a third... and so on. Thus memory grows and the timeouts get worse.
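For reference, the spool size is not part of the JSON config but a command-line flag on logstash-forwarder, typically passed via the init script or its defaults/sysconfig file; a minimal sketch (the install and config paths are assumptions, adjust to your packaging):

```sh
# Hypothetical invocation: cap each batch sent to Logstash at 100 events
# (paths below are placeholders for the packaged install)
/opt/logstash-forwarder/bin/logstash-forwarder \
    -config /etc/logstash-forwarder.conf \
    -spool-size=100
```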
Oops, when replacing my own shim rpm with the new 0.4 one, I'd lost my config with the predefined spool size of 100. What is the default? Without resetting the forwarders this works only for 2-3 minutes until the i/o error. But when starting the agents not in parallel, it works (a performance issue on the Logstash side?)
I can try this; losing events is not a problem at this moment. If a spool size of 100 is not the default, the many
Okay, I understand your remark that "it depends". But at least some example would be helpful. If I get you right, the huge memory/cpu consumption (on the LS side) is due to a performance issue (well, memory/file descriptors, incoming events/logs?). And this results in logstash itself not being able to process anything anymore while eating memory and cpu? I only want to understand whether I'm facing a real issue or only a performance issue. Since I can compare both 0.3 and 0.4 setups on the same cluster (dev) as well as side-by-side setups (test and prod), I was (am?) in doubt that throwing more memory at it would help.
Update: After all, the forwarder (0.4) did not recover the logs correctly and/or the logstash daemon itself was unresponsive (and did not come back, even after a night). I have done a rollback to 0.3 on my test environment, and now everything looks fine. I will have to monitor this for some hours.
Indeed, I've noticed that forwarder 0.3 tails logs when no .logstash-forwarder file is found, but 0.4 does not do this (I saw this last night). Additionally, the logstash daemon is now at the same cpu/mem level as my still-working production cluster.
Ah. I think this is the tailOnRotate / from-beginning problem. The previous 0.3 would always start a file from the end, even if it had just rotated, so it could lose logs. 0.4 now always starts from the beginning, even if the file has not rotated. So without .logstash-forwarder it will start from the beginning unless tailOnRotate is given on the command line. It's still wrong, but somewhat better. To make it behave the same you'll need to modify the init script to specify tailOnRotate=true. Really this is a bug: it should start from the end if .logstash-forwarder does not exist, and from then on resume, or start from the beginning for new files.

The default spool size is 1024, so it looks like you had already reduced it. Regarding memory: if you have an average line size of 1KB, 50 forwarders, and a spool size of 1000, that means before compression 1KB × 50 × 1000 = 50MB. Since you'll need to account for decompressing, it'll be approx 1.5 × 50 = 75MB. So you can see: the more forwarders, the more memory, and the bigger the log lines, the more memory. Difficult to calculate sometimes.

As mentioned previously, the main issue you'll have is the timeout problem. If the forwarder times out, Logstash keeps processing, and the forwarder resends. This doubles the load. Then if that times out, it sends again and the load triples. So it's important to get the timeout setting right and increase it as much as necessary, so that the timeout value is bigger than the time it takes Logstash to process spool-size events (in your case, 100) when at full load. The easiest thing to do is start really, really high, put everything under load, then lower it slightly until you hit timeouts, and bring it back up.
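To illustrate the "start high and tune down" approach, a logstash-forwarder config with a generous network timeout could look roughly like this (the server name, certificate path, file paths and the timeout value itself are placeholders, not taken from this setup):

```json
{
  "network": {
    "servers": [ "logstash.example.com:5000" ],
    "ssl ca": "/etc/pki/tls/certs/logstash-forwarder.crt",
    "timeout": 120
  },
  "files": [
    { "paths": [ "/var/log/app/*.log" ], "fields": { "type": "app" } }
  ]
}
```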
Okay, this means: how can I get a good timeout? Is there any information/metric exposed by logstash about the current load/queue? Perhaps via JMX?
My experience has been trial and error, since it depends entirely on CPU / memory (GC runs) / filters in use / outputs in use / CPU on the output server / memory on the output server. So many factors are involved in throughput, and it's all based on throughput. In the end I implemented partial ACK so the timeout was irrelevant; Logstash would tell the forwarder to "wait" as a form of flow control. This was in #180, but it was contentious and a very big patch during a time of low maintenance of the project - it won't apply anymore as it's very old. I maintain it in my fork though, Log Courier.
@driskell whoa, another log publisher. Looks interesting, thanks.
I'm getting these errors and logstash stops processing logs from logstash-forwarder:

2015/04/21 22:46:56.557989 Failed to tls handshake with 10.10.10.10 read tcp 10.10.10.10:5000: i/o timeout

It seems like logstash is not keeping up with the logstash forwarders. The Logstash node is in GCC and the forwarders are running on nodes in AWS; there are about 40k events/sec sent to logstash. There are no errors in /var/log/logstash showing that logstash is having issues, and the cpu/memory and disk I/O on the node are not very high. It is an 8-CPU machine with 8GB RAM. How many events/sec can logstash handle? What are your recommendations for me? Should I be creating more logstash instances and putting all the instances in the logstash-forwarder config? Can other logstash instances use the same cert/key as the first logstash instance?
@afzal273 it may help to tweak the network timeout (and possibly the spool_size) of your forwarders. The default timeout is 15 seconds, which may be too aggressive if you have particularly latent networks, somewhat sluggish hosts, etc.
I could try that, but I'm running logstash forwarders on several nodes and they all show the same error after a few days of running at the same time. Then, when I restart logstash on the logstash node and also stop logstash-forwarder on the other nodes, it starts working again. If I only restart logstash and keep all the logstash forwarders running, I still see the same error on the logstash-forwarder side.
@afzal273 That's pretty consistent with similar issues I've had. Effectively, if things do start timing out, based on what I've seen, a mild socket leak breaks out on the Logstash server, which makes the timeout almost unresolvable in the face of the oncoming swarm of connections. In my case, greatly increasing the timeout (60s for my EC2, multi-AZ architecture) seems to have resolved the connectivity issues (though as you mentioned, I had to help things along initially by stopping the existing forwarders).
Sweet, I did both - increased the timeout and also added another logstash instance. In the forwarder I put both logstash node names. I also had to create a wildcard cert and replace the certs everywhere, as in the earlier case the CN was the hostname. Will see how things go now!
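For anyone following along, pointing a forwarder at several Logstash nodes is a matter of listing them in the network section (the hostnames, port and cert path below are placeholders); as far as I know the forwarder connects to one of the listed servers at a time and fails over, rather than truly load-balancing:

```json
{
  "network": {
    "servers": [ "logstash-a.example.com:5000", "logstash-b.example.com:5000" ],
    "ssl ca": "/etc/pki/tls/certs/wildcard.example.com.crt",
    "timeout": 60
  }
}
```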
Also, what are the indications that logstash is having a problem when it's giving an i/o timeout? Is there a way to tell from CPU/memory or any logs?
In theory, it could be any of the above, plus a few other things :( - the open socket count (particularly things in some type of WAIT state) can contribute, as can iowait.
upgrading Logstash from 1.4.2 to 1.5.0-rc3 seems to fix the "read error" problem.
@easescript Hm.. but the error (at 16:37:04) is still being logged? Meanwhile, I'm faced with the issue with both forwarder 0.3 and 0.4. I've found two particular issues so far, both probably more on the logstash (agent) side:
Currently I'm not sure what the issue is exactly. But so far, my machines can (if everything works) process 8k events per minute with the same setup as above (while recovering some days of logs). IMHO in my case the issues seem to be configuration and/or plugin issues. Unfortunately, there are no (error) messages except the standard output of processed events. It is still not ideal that both components are blowing each other up. However, from my point of view, it does not look like a specific 0.4 issue.
@knalli Is your problem that Logstash crashes, or that it just stops processing? If you can start it and run
@driskell Ah, thank you. I'll try anything ;) Ehm.. logstash seems to stop: no more processing (no std output, and actually no events in the database), growing memory and cpu, and the standard service stop script is not successful any more (=>
@driskell : https://gist.github.com/knalli/2fa29e1080dadbb6da0b Logstash 1.5 RC2 (I've downgraded again) -- upgrading to RC3 itself did not solve this one
@knalli Looks like the worker thread died without raising an error. Have you tried running Logstash on the command line and reproducing, in case it's sending the error to the wrong place? I remember some unit scripts losing stderr and only logging stdout. I think it's:
I'm using the rpm, and all three logs (err, stdout, log) are used, i.e. a configuration problem (a missing plugin) will be logged in err. Is that what you mean?
OK, so I would say it should be logging to err if there's a thread blow-out. Can you post a gist of your filters config? Maybe we can check each part of it. I know you mentioned anonymize earlier - looking at the plugin source, I think it would actually crash out if the field was not an IP address. Maybe wrap it in a regex conditional that only matches IPs, as sketched below? You can raise an issue in the plugin repository too: https://github.com/logstash-plugins/logstash-filter-anonymize
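A sketch of that suggestion (the field name, algorithm and key are placeholders, not taken from the actual config): guard the anonymize filter with a conditional so it only runs when the field looks like an IPv4 address.

```
filter {
  # Hypothetical field name; use whatever field your anonymize filter targets
  if [clientip] =~ /^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/ {
    anonymize {
      fields    => ["clientip"]
      algorithm => "SHA1"
      key       => "some-secret"
    }
  }
}
```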
It always logs out
Here is the relevant one: https://gist.github.com/knalli/a5a5f68cd755e2f33f66 The last event in stdout is this one: https://gist.github.com/knalli/851ec3721d9ef4f0a6be
Yeah, I've done this already.
I've commented on your gist with a suggestion to try. Are you setting any fields on the events in your logstash-forwarder/log-courier configs?
@knalli Also - it could be worth running Logstash on the command line and passing
I think you mean this here?
Regarding the other changes: I'll try this next. |
Yep, that looks fine. Type is fine. I just know that setting some things which Logstash uses internally can be problematic (like
I've replaced my filter setup.. so far it's still running and I can even restart logstash without problems. If this is still working tomorrow, I can retry with 0.4 to eliminate the actual forwarder issue. Regarding
@knalli Good to see some progress. It is strange how it's not being logged, though. Did you try running Logstash on the console directly, just to rule out buffering in the log files etc.?
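For reference, a foreground run with debug output (assuming the RPM's default install and config paths) would be something along these lines:

```sh
# Run Logstash 1.5 in the foreground so stdout/stderr are not swallowed
# by the init script's log redirection (paths are assumptions)
/opt/logstash/bin/logstash agent -f /etc/logstash/conf.d/ --debug
```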
Not yet. Will work on this.
Still figuring out how I can actually use the "directly" way. Tomorrow.. Meanwhile, I've tried LS 1.5RC3 again. Nope, that is even worse. While the fix with the date filter still works with RC2, the same setup "breaks" with RC3. Just like before, without any error. After a rollback to RC2 (fresh install, no update/downgrade) it has been running for hours again. I've also noticed it can be good to uninstall and re-install the complete package (rpm) instead of updating. I'd suspect the plugins aren't the same after an update from RC2 to RC3 versus a fresh install of RC3. Anyway, without further information that is only guessing.
The rpm should have the same default plugins, though probably newer versions in the newer RPM. I can't see any reason for failure in RC3, though. Probably the best bet is to try to actually get the error to spit out with the old config. If we can run the above command on the console we can rule out an error trapped in a buffer. Failing that, I can offer patches, but I don't know how far you'll want to take this.
I just found that RC3 has an issue on CentOS 6: "service logstash stop" won't stop logstash; I had to "kill -9 PID" to stop it.
A lot of the time I see this - thostname %{syslog_hostname}. Does that mean my logstash is overloaded? Because from the same host there are some logs that hit a parse failure and some that work.
A parse failure is unrelated to timeouts or overload; it means your grok isn't matching. Best to raise a new issue in the Logstash repo for help, or head to the mailing list.
Running RC3, I've found this one. Particularly for you ;)
That's actually a little bit strange, because log-courier is only available but actually not in use (5000 lumberjack, 5001 log-courier).

```
$ netstat -atnp | grep 5001
tcp        0      0 :::5001      :::*      LISTEN      25525/java
```

Anyway.. now the daemon has stopped processing. Dump here: https://gist.github.com/knalli/98c751e7652dc2678156
OK. Did that appear in the log file? Or did you have to run it from the command line? I'll take a look, but that's not my code! And it's odd that the filterworker logger is reporting errors as if they were from an input!
Ah, sorry. Via the command line and with debug enabled. And I thought that because of the plugin's name.
Ah, so when run as a service the log file doesn't report the error, but when run on the command line it does. Definitely a buffering problem in the logger if that's the case - I'll take a look later and raise it with Logstash. Yes, the plugin name there is actually added by my plugin - but it looks like it gets added to EVERY log line, even those not belonging to the plugin! I can fix this in the next release and do the logging properly - sorry for the confusion. Will look at the actual error more closely soon.
Hm, not sure. I've added the
@knalli The error in RC3 is known and fixed for the next release: elastic/logstash#3100 I'll work on the courier logging and also try to reproduce the log buffering and see if it can be fixed upstream.
So far, it is working much better in the 1.5 GA. I'll close this. |
I'm faced with both
Read error looking for ack: read tcp $host:$port: i/o timeout
and
Failed to tls handshake with $host read tcp $host:$ip: i/o timeout
since a couple of days. The effective result is that I'm losing events (read: no output in logstash.stdout despite debug): no more events on the LS side after some seconds/minutes. Restarting LS does not help anymore, and stopping LS does not even complete in time (possibly an LS issue itself?). I also sometimes notice high cpu usage by the logstash daemon (100%, full load on a tiny VM). It also looks like there is a huge memory issue.
Running LS 1.5 RC2, the last changes since last week were:
Neither disabling the new filters nor reducing to only one active logstash-forwarder agent helped at all... and then I re-installed the older logstash-forwarder 0.3: the flow of events is processed successfully again. If I start a logstash-forwarder 0.4 in addition (on another "app host"), all logstash-forwarders (incl. the 0.3 ones) show the same issue. If I restart all services again except the logstash-forwarder 0.4, it works again without any issues.
So tl;dr: a single forwarder agent (0.4) seems to break all other agents (regardless of whether they are 0.3 or 0.4).
Side notes:
* Amazon AMI with Ruby 2p864, Oracle Java 8u40
* Elasticsearch 1.4.4, Logstash 1.5 RC2
My config is identical in both versions.
There are already related issues, but I'm not sure whether they are relevant to logstash-forwarder 0.4: #421, #429, #431, #91
Let me know which additional information you need. Because the forwarder was causing the trouble, I've created the issue here. I'm not quite sure which component is responsible..