Logstash fails to restart with the persistent queuing enabled #7538
I can confirm this. If you try to adjust ...
@nicknameforever Thanks for the report. A few questions:
@shoggeh thanks for your input. Did you in fact experience this specific error by just changing the page capacity setting?
Specifically, the "Page file size 0" error?
this is very weird - reviewing the related code, it seems the only way a page size of 0 can exist is if an exception is thrown right after the page file creation and before or in logstash/logstash-core/src/main/java/org/logstash/ackedqueue/io/MmapPageIO.java, lines 71 to 73 (at b3b9e60).
@nicknameforever @shoggeh can you see any error or exception in the logs that would indicate such a condition?
This could be a separate issue, but it is most likely due to issue #6970.
@colinsurprenant yes, I think I'm running into a different issue than @nicknameforever, although it results in a similar error.
And the situation was as follows:
With an example change from 256 MB to 512 MB. PRIOR TO CHANGE:
<< actual change >>
However, the size of the page file remained at its former value:
END RESULT: Logstash resumed all of the operations properly.
PRIOR TO CHANGE:
<< actual change >>
END RESULT: Logstash does not start. If it's not possible to not enforce ...
@shoggeh ok, thanks for the details - so yes, this is a different issue than the zero page file size, which seems to be triggered by an exception that prevents the correct page file "initialization". But your situation is definitely a problem we also need to fix, either with better documentation, a better error message, or by revisiting this limitation. Would you mind creating a separate issue with your specific problem description? Basically, copy-paste your previous comment.
@nicknameforever thanks for the info. Ok, agree we have potentially 2 different issues here (one leading to the other):
We will address both separately and keep the focus on the PQ corruption for this issue here.
How can we get logstash to start again in case of an empty page file?
@martijnvermaat could you please run the following command from the logstash root dir and report the result?
$ vendor/jruby/bin/jruby -rpp -e 'Dir.glob("data/queue/main/checkpoint.*").sort_by { |x| x[/[0-9]+$/].to_i}.each { |checkpoint| data = File.read(checkpoint); version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount, crc32 = data.unpack("nNNQ>Q>NN"); puts File.basename(checkpoint); p(version: version, page: page, firstUnackedPage: firstUnackedPage, firstUnackedSeq: firstUnackedSeq, minSeq: minSeq, elementCount: elementCount, crc32: crc32) }'
@martijnvermaat sorry, I should have given more explanation: the above command will just dump the content of the checkpoint files and will help me see if we can come up with a manual way to repair the broken queue state. The checkpoint files do not contain any ingested data, only metadata about the queue state.
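For readability, here is the same decoding logic as a small standalone script. This is just a sketch that reformats the one-liner above; the field names and binary layout come directly from its unpack pattern "nNNQ>Q>NN", and the file name decode_checkpoints.rb is only illustrative.

# decode_checkpoints.rb - readable version of the checkpoint-dumping one-liner (a sketch, not an official tool).
# Run from the Logstash home dir: vendor/jruby/bin/jruby decode_checkpoints.rb [queue_dir]
queue_dir = ARGV[0] || "data/queue/main"

Dir.glob(File.join(queue_dir, "checkpoint.*")).sort_by { |path| path[/[0-9]+$/].to_i }.each do |checkpoint|
  data = File.read(checkpoint)
  # Big-endian fields, as in the one-liner: 16-bit version, 32-bit page number,
  # 32-bit firstUnackedPage, 64-bit firstUnackedSeq, 64-bit minSeq,
  # 32-bit elementCount, 32-bit crc32.
  version, page, first_unacked_page, first_unacked_seq, min_seq, element_count, crc32 =
    data.unpack("nNNQ>Q>NN")
  puts File.basename(checkpoint)
  p(version: version, page: page, firstUnackedPage: first_unacked_page,
    firstUnackedSeq: first_unacked_seq, minSeq: min_seq,
    elementCount: element_count, crc32: crc32)
end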
@colinsurprenant I have since moved the queue dir to some other location and restarted logstash (twice, because after ~15 min the same thing happened). Here are the contents of the checkpoint files I moved: ... At this point it's not a priority to recover the events, but it's a bit worrying that this happened twice.
@martijnvermaat Great, thanks. Also, have you seen any error logs prior to this happening? Was Logstash somehow abnormally terminated? Crashed? Interrupted? Thanks.
@colinsurprenant Hi! Colleague of @martijnvermaat here, jumping in since he's away for a few days. Here's the directory listing for the queue mentioned above: queue_ls.txt. The reason for restarting Logstash was that the process was "hanging", meaning the pipeline(s) stopped.
I suspect the actual hang of the pipeline is related to some JVM config (perhaps the issue raised above) or some overflow related to the high-volume throughput through a single TCP input. Not sure how to reproduce it yet.
@onnos @martijnvermaat Thanks, very good info, this is helpful.
@martijnvermaat @onnos Which version of LS are you running and on what platform? Thanks.
@colinsurprenant LS 2.3.2 on Ubuntu 14.04.
@martijnvermaat Are you sure? Persistent Queue was only introduced in LS 5.4!? (in beta since 5.1)
That's the retired setup, this is 5.5.0. :) Sorry for the confusion, we're transitioning between setups. Running on Ubuntu 14.04 with a Xenial 4.4.0 kernel. Again referring to the problem of the initial hang: I split the incoming stream into two separate TCP inputs (from rsyslog to logstash), and this has helped throughput under load tremendously. I suspect we won't be able to recreate the crash.
@martijnvermaat @onnos Here's what I can make of your queue state:
I see 2 problems here:
1- Page 3470 is stuck behind with a bunch of unacked events. The only explanation I see is that this batch at seq num 1669595327 got stuck in a stalled output up until LS crashed or was restarted. I suspect that if it wasn't for the 0 byte page size, upon restart these events in page 3470 would have been replayed and the queue would have been correctly purged up to the head page. It is definitely worth it for us to better evaluate such scenarios and think about how to mitigate them. At first thought I'd say that this is more a stuck-output problem than a queue problem per se. Will follow up on this separately.
2- Page 3471 is in fact fully acked and should be purged completely, but it is zero bytes. I suspect the delete operation was, for some reason, not completed, or something like that. I will investigate in that area and see if I can spot such a condition. The good news in this case is that simply deleting ... The other good news is that I will see if we could add a condition at queue initialization such that if a page file is zero bytes and the checkpoint says it is fully acked, we can just clean it up.
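To illustrate the cleanup idea in point 2, here is a rough sketch of what such a check could look like. This is not part of Logstash and only an assumption about the approach: it reuses the checkpoint layout from the command above and assumes a page can be treated as fully acked when elementCount is 0 or firstUnackedSeq >= minSeq + elementCount; the file name find_zero_byte_acked_pages.rb is only illustrative.

# find_zero_byte_acked_pages.rb - hypothetical helper; flags page files that are 0 bytes
# on disk while their checkpoint reports them as fully acked (the case described above).
queue_dir = ARGV[0] || "data/queue/main"

Dir.glob(File.join(queue_dir, "checkpoint.*")).each do |checkpoint|
  next if File.basename(checkpoint) == "checkpoint.head" # only inspect tail-page checkpoints
  data = File.read(checkpoint)
  _version, page_num, _first_unacked_page, first_unacked_seq, min_seq, element_count, _crc32 =
    data.unpack("nNNQ>Q>NN")
  page_file = File.join(queue_dir, "page.#{page_num}")
  fully_acked = element_count == 0 || first_unacked_seq >= min_seq + element_count
  zero_bytes  = File.exist?(page_file) && File.size(page_file) == 0
  if fully_acked && zero_bytes
    puts "#{page_file} is 0 bytes and its checkpoint says fully acked; candidate for cleanup"
  end
end

The actual removal of the flagged checkpoint and page file is intentionally left as a manual step here, since that is exactly the condition the queue initialization itself would need to handle safely.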
@nicknameforever do you still have your problematic queue dir intact? If so, would it also be possible for you to run the same command to list the content of the checkpoints (adjust the queue dir path if necessary) and see if we have the same situation here?
$ vendor/jruby/bin/jruby -rpp -e 'Dir.glob("data/queue/main/checkpoint.*").sort_by { |x| x[/[0-9]+$/].to_i}.each { |checkpoint| data = File.read(checkpoint); version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount, crc32 = data.unpack("nNNQ>Q>NN"); puts File.basename(checkpoint); p(version: version, page: page, firstUnackedPage: firstUnackedPage, firstUnackedSeq: firstUnackedSeq, minSeq: minSeq, elementCount: elementCount, crc32: crc32) }'
@colinsurprenant Unfortunately not.
Created issue #7809 specifically for the zero byte & fully acked page file case.
Above I mentioned a condition of a stalled plugin; issue #7796 was created to get more insight into these.
@colinsurprenant I'm running into the same problem as the original reporter. I didn't change the queue size, just restarted logstash, and now it's failing to start with the page file being 0 bytes. Running Logstash 5.6.4 from the elastic repos.
@andrewmiskell thanks for the report. How did LS shut down prior to the restart? Did it complete a clean shutdown, or was it force-stopped or did it crash? Do you have any logs related to that?
@colinsurprenant Not sure if I have the logs, but it was observed after issuing a simple service logstash restart command on both my logstash nodes. I still have the queues; I moved them to a temporary location for the time being so I could see if they could be fixed and imported into Elasticsearch later.
@andrewmiskell ah! Great that you preserved the queue dir. I am optimistic that we should be able to recover it, and it will help diagnose the problem. In the meantime, I'd really appreciate it if you could check the logs to see if LS completed a clean/normal shutdown or if the service stop ended up killing the LS process. Could you list the queue dir content and also run the following command from the LS home dir (adjust the queue dir path) to display the content of the checkpoints:
$ vendor/jruby/bin/jruby -rpp -e 'Dir.glob("data/queue/main/checkpoint.*").sort_by { |x| x[/[0-9]+$/].to_i}.each { |checkpoint| data = File.read(checkpoint); version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount, crc32 = data.unpack("nNNQ>Q>NN"); puts File.basename(checkpoint); p(version: version, page: page, firstUnackedPage: firstUnackedPage, firstUnackedSeq: firstUnackedSeq, minSeq: minSeq, elementCount: elementCount, crc32: crc32) }'
@colinsurprenant Attached to comment. It looks like there was a hung grok thread and it was killed forcefully by the service script.
@andrewmiskell could you also list the queue dir files showing the file sizes, please?
@andrewmiskell better yet, could you run this new command from the LS home dir, adjusting the queue dir path if necessary:
For some reason #7809, which originally fixed the zero byte & fully acked page file case, ... I will go ahead and close this issue; feel free to re-open if needed.
After a restart of Logstash, it fails to start with the message:
page.136 is indeed 0 bytes.