Logstash fails to restart with the persistent queuing enabled #7538

Closed
nicknameforever opened this issue Jun 26, 2017 · 33 comments

@nicknameforever

After a restart of Logstash, it fails to start with the following message:

[2017-06-26T16:24:04,802][ERROR][logstash.pipeline        ] Logstash failed to create queue {"exception"=>"Page file size 0 different to configured page capacity 262144000 for /path/to/data/queue/main/page.136", "backtrace"=>["org/logstash/ackedqueue/ext/JrubyAckedQueueExtLibrary.java:136:in `open'", "/path/to/install/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:41:in `with_queue'", "/path/to/install/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:30:in `create_file_based'", "/path/to/install/logstash/logstash-core/lib/logstash/queue_factory.rb:29:in `create'", "/path/to/install/logstash/logstash-core/lib/logstash/pipeline.rb:148:in `initialize'", "/path/to/install/logstash/logstash-core/lib/logstash/agent.rb:286:in `create_pipeline'", "/path/to/install/logstash/logstash-core/lib/logstash/agent.rb:95:in `register_pipeline'", "/path/to/install/logstash/logstash-core/lib/logstash/runner.rb:274:in `execute'", "/path/to/install/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:67:in `run'", "/path/to/install/logstash/logstash-core/lib/logstash/runner.rb:185:in `run'", "/path/to/install/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:132:in `run'", "/path/to/install/logstash/lib/bootstrap/environment.rb:71:in `(root)'"]}
[2017-06-26T16:24:04,810][ERROR][logstash.agent           ] Cannot create pipeline {:reason=>"Page file size 0 different to configured page capacity 262144000 for /path/to/data/queue/main/page.136"}
  • Logstash 5.4.0
  • SLES 11 (x86_64)
  • Config:
queue.type: persisted
queue.checkpoint.writes: 1024
queue.max_bytes: 1024mb
  • In the queue/main/ folder the file page.136 is indeed 0 bytes
@shoggeh

shoggeh commented Jun 26, 2017

I can confirm this. If you try to adjust queue.page_capacity while there are any events stored in the on-disk queue, this error will be triggered. The value can only be adjusted up or down if the disk queue is empty.

@colinsurprenant
Contributor

@nicknameforever Thanks for the report. A few questions:

  • Did you modify queue.page_capacity as @shoggeh mentioned?
  • Could you list the queue directory content? Is page.136 the last page?
  • Was Logstash somehow abnormally terminated, crashed, or interrupted?

@colinsurprenant
Contributor

@shoggeh thanks for your input. Did you in fact experience this specific error just by changing queue.page_capacity? I will try to reproduce that case.

"Page file size 0 different to configured page capacity xxx for ...

Specifically the "Page file size 0"?

@colinsurprenant
Contributor

This is very weird. Reviewing the related code, it seems the only way a page of size 0 can exist is if an exception is thrown right after the page file creation and before or during the map() call in:

RandomAccessFile raf = new RandomAccessFile(this.file, "rw");
this.channel = raf.getChannel();
this.buffer = this.channel.map(FileChannel.MapMode.READ_WRITE, 0, this.capacity);

The new RandomAccessFile(this.file, "rw") call creates the page file with size 0, and then this.channel.map(FileChannel.MapMode.READ_WRITE, 0, this.capacity) should extend it to the configured page capacity.

@nicknameforever @shoggeh can you see any error or exception in the logs that would indicate such a condition?
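
For illustration, here is a minimal, self-contained sketch of that failure window, assuming the same create-then-map sequence (the file name and capacity below are placeholders, not the actual Logstash source):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class PageCreateSketch {
    public static void main(String[] args) throws IOException {
        File file = new File("page.example");   // placeholder page file name
        long capacity = 262144000L;             // matches queue.page_capacity in the report
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            // The file now exists on disk with size 0. If the JVM dies here
            // (for example with an OutOfMemoryError), a 0-byte page is left behind.
            FileChannel channel = raf.getChannel();
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, capacity);
            // map() grows the file to `capacity`, so only after this call does the
            // page file size match the configured page capacity checked at startup.
            buffer.put(0, (byte) 0);
        }
    }
}

If the process dies between the file creation and the map() call, the leftover 0-byte page file is exactly what the startup size check later trips over.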

@nicknameforever
Author

@colinsurprenant

  • I did not modify queue.page_capacity
  • files in the queue:
250M page.135
34 checkpoint.135
34 checkpoint.136
0 page.136
34 checkpoint.head
250M page.137
  • Yes, Logstash crashed with an OutOfMemoryError:
[2017-06-24T02:18:30,480][ERROR][logstash.pipeline        ] Exception in pipelineworker, the pipeline stopped processing new events, please check your filter configuration and restart Logstash. {"exception"=>"Java heap space", "backtrace"=>[]}
...
[2017-06-24T02:18:28,756][ERROR][logstash.filters.grok    ] Error while attempting to check/cancel excessively long grok patterns {:message=>"Java heap space", :class=>"Java::JavaLang::OutOfMemoryError", :backtrace=>[]}
[2017-06-24T02:18:35,271][ERROR][logstash.pipeline        ] Exception in pipelineworker, the pipeline stopped processing new events, please check your filter configuration and restart Logstash. {"exception"=>"Java heap space", "backtrace"=>[]}
...

This could be a separate issue, but it is most likely due to issue #6970.

@shoggeh

shoggeh commented Jun 28, 2017

@colinsurprenant yes, I think I'm running into a different issue than @nicknameforever, although it results in a similar error.

  • I did modify the queue.page_capacity size on a Logstash instance processing data,
  • I used non-zero values (usually 256mb, 512mb, and 1024mb) to test it.

The situation was then as follows:

  • EMPTY QUEUE: If nothing was stored in the on-disk PQ at the moment of the page_capacity change, it was POSSIBLE to restart Logstash without any issue. By "nothing" I mean the _node/stats monitoring endpoint showing 0 under pipeline.queue.events.

With an example change from 256mb to 512mb:

PRIOR to change:

    "queue" : {
      "events" : 0,
      "type" : "persisted",
      "capacity" : {
        "page_capacity_in_bytes" : 262144000,

<< actual change >>

[2017-06-28T16:21:57,852][INFO ][logstash.agent           ] Successfully started Logstash API endpoint {:port=>9600}
    "queue" : {
      "events" : 0,
      "type" : "persisted",
      "capacity" : {
        "page_capacity_in_bytes" : 536870912,

However, the size of the page file remained at its former value:

-rw-r--r-- 1 logstash logstash 250M Jun 28 16:20 /var/lib/logstash/queue/main/page.505

END RESULT: Logstash resumed all of the operations properly.

  • SOMETHING IN QUEUE: However, if anything at all was stored in the PQ, then subsequent Logstash restarts errored and Logstash didn't start until I either:
    a) reverted queue.page_capacity to its former value
    b) manually removed the contents of /var/lib/logstash/queue and let Logstash recreate it

PRIOR TO CHANGE:

    "queue" : {
      "events" : 22237,
      "type" : "persisted",
      "capacity" : {
        "page_capacity_in_bytes" : 262144000,

<< actual change >>

[2017-06-28T16:44:16,848][ERROR][logstash.agent           ] Cannot create pipeline {:reason=>"Page file size 262144000 different to configured page capacity 536870912 for /var/lib/logstash/queue/main/page.505"}

END RESULT: Logstash not starting.

If it's not possible to avoid enforcing the page_capacity size on already created pages, then maybe it's worth mentioning in the docs that the queue needs to be drained first before the value can be changed?

@colinsurprenant
Contributor

@shoggeh OK, thanks for the details. So yes, this is a different issue from the zero page file size, which seems to be triggered by an exception that prevents the correct page file "initialization".

But your situation is definitely a problem we also need to fix, either with better documentation, a better error message, or by revisiting this limitation.

Would you mind creating a separate issue with your specific problem description? Basically, copy-paste your previous comment.

@colinsurprenant
Contributor

@nicknameforever thanks for the info. OK, I agree we potentially have 2 different issues here (one leading to the other):

We will address both separately and keep the focus on the PQ corruption for this issue here.

@martijnvermaat

martijnvermaat commented Jul 14, 2017

How can we get Logstash to start again in the case of an empty page file?

$ du -sh /run/shm/logstash/queue/main/page*
250M	/run/shm/logstash/queue/main/page.3470
0	/run/shm/logstash/queue/main/page.3471
250M	/run/shm/logstash/queue/main/page.3511
250M	/run/shm/logstash/queue/main/page.3525
250M	/run/shm/logstash/queue/main/page.3898
4.0K	/run/shm/logstash/queue/main/page.3899
$ tail -2 /var/log/logstash/logstash-plain.log
[2017-07-14T16:47:57,434][ERROR][logstash.pipeline        ] Logstash failed to create queue {"exception"=>"Page file size 0 different to configured page capacity 262144000 for /run/shm/logstash/queue/main/page.3471", "backtrace"=>["org/logstash/ackedqueue/ext/JrubyAckedQueueExtLibrary.java:133:in `open'", "/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:41:in `with_queue'", "/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:30:in `create_file_based'", "/usr/share/logstash/logstash-core/lib/logstash/queue_factory.rb:29:in `create'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:159:in `initialize'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:286:in `create_pipeline'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:95:in `register_pipeline'", "/usr/share/logstash/logstash-core/lib/logstash/runner.rb:314:in `execute'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:67:in `run'", "/usr/share/logstash/logstash-core/lib/logstash/runner.rb:209:in `run'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:132:in `run'", "/usr/share/logstash/lib/bootstrap/environment.rb:71:in `(root)'"]}
[2017-07-14T16:47:57,439][ERROR][logstash.agent           ] Cannot create pipeline {:reason=>"Page file size 0 different to configured page capacity 262144000 for /run/shm/logstash/queue/main/page.3471"}

@colinsurprenant
Contributor

@martijnvermaat could you please run the following command from the logstash root dir and report the result?

$ vendor/jruby/bin/jruby -rpp -e 'Dir.glob("data/queue/main/checkpoint.*").sort_by { |x| x[/[0-9]+$/].to_i}.each { |checkpoint| data = File.read(checkpoint); version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount, crc32 = data.unpack("nNNQ>Q>NN"); puts File.basename(checkpoint); p(version: version, page: page, firstUnackedPage: firstUnackedPage, firstUnackedSeq: firstUnackedSeq, minSeq: minSeq, elementCount: elementCount, crc32: crc32) }'

@colinsurprenant
Contributor

@martijnvermaat sorry, I should have given more explanation: the above command just dumps the content of the checkpoint files and will help me see whether we can come up with a manual way to repair the broken queue state. The checkpoint files do not contain any ingested data, only metadata about the queue state.
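
As a more readable equivalent of that one-liner, here is a rough Java sketch that decodes the same fields, assuming the 34-byte big-endian layout implied by the unpack("nNNQ>Q>NN") format string (an illustration, not the actual checkpoint reader in logstash-core):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CheckpointDump {
    public static void main(String[] args) throws IOException {
        // Usage: java CheckpointDump data/queue/main/checkpoint.head
        ByteBuffer b = ByteBuffer.wrap(Files.readAllBytes(Paths.get(args[0])));
        // ByteBuffer is big-endian by default, matching the n/N/Q> unpack directives.
        int version           = Short.toUnsignedInt(b.getShort());   // n  : 16-bit
        long page             = Integer.toUnsignedLong(b.getInt());  // N  : 32-bit
        long firstUnackedPage = Integer.toUnsignedLong(b.getInt());  // N
        long firstUnackedSeq  = b.getLong();                         // Q> : 64-bit
        long minSeq           = b.getLong();                         // Q>
        long elementCount     = Integer.toUnsignedLong(b.getInt());  // N
        long crc32            = Integer.toUnsignedLong(b.getInt());  // N
        System.out.printf(
            "version=%d page=%d firstUnackedPage=%d firstUnackedSeq=%d minSeq=%d elementCount=%d crc32=%d%n",
            version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount, crc32);
    }
}

The 2 + 4 + 4 + 8 + 8 + 4 + 4 = 34 bytes of this layout also match the 34-byte checkpoint files listed earlier in this thread.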

@martijnvermaat

@colinsurprenant I have since moved the queue dir to another location and restarted Logstash (twice, because after ~15 min the same thing happened). Here are the contents of the checkpoint files I moved:

checkpoints.txt

At this point it's not a priority to recover the events, but it's a bit worrying that this happened twice.

@colinsurprenant
Contributor

colinsurprenant commented Jul 15, 2017

@martijnvermaat Great, thanks. In this checkpoints.txt, which page is now 0 bytes? (Or you can also include an ls -la of data/queue/main/.)

Also, have you seen any error logs prior to this happening? Was Logstash somehow abnormally terminated, crashed, or interrupted?

Thanks.

@onnos

onnos commented Jul 15, 2017

@colinsurprenant Hi! Colleague of @martijnvermaat here, jumping in since he's away for a few days.

Here's the directory listing for the queue mentioned above: queue_ls.txt

The reason for restarting Logstash was that the process was "hanging", meaning the pipeline(s) stopped.

[2017-07-14T15:23:11,978][INFO ][logstash.outputs.file    ] Opening file {:path=>"/store/logstash/2017/07/14/message.log.gz"}
[2017-07-14T15:23:11,980][ERROR][logstash.pipeline        ] Exception in pipelineworker, the pipeline stopped processing new events, please check your filter configuration and restart Logstash. {"exception"=>"0", "backtrace"=>["org.logstash.dissect.DissectorErrorUtils.backtrace(DissectorErrorUtils.java:16)", "org.logstash.dissect.JavaDissectorLibrary$RubyDissect.logException(JavaDissectorLibrary.java:224)", "org.logstash.dissect.JavaDissectorLibrary$RubyDissect.dissect(JavaDissectorLibrary.java:130)", "org.logstash.dissect.JavaDissectorLibrary$RubyDissect.dissect_multi(JavaDissectorLibrary.java:140)", "org.logstash.dissect.JavaDissectorLibrary$RubyDissect$INVOKER$i$2$0$dissect_multi.call(JavaDissectorLibrary$RubyDissect$INVOKER$i$2$0$dissect_multi.gen)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:202)", "rubyjit.LogStash::Filters::Dissect$$multi_filter_1299793c6e70636e916ab55a5dfbe2aaaf57a1ae1808253012.__file__(/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dissect-1.0.11/lib/logstash/filters/dissect.rb:182)", "rubyjit.LogStash::Filters::Dissect$$multi_filter_1299793c6e70636e916ab55a5dfbe2aaaf57a1ae1808253012.__file__(/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dissect-1.0.11/lib/logstash/filters/dissect.rb)", "org.jruby.internal.runtime.methods.JittedMethod.call(JittedMethod.java:181)", "org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:326)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:170)", "rubyjit.LogStash::FilterDelegator$$multi_filter_43640ebf68de601b56cb618392ab9de0b4f8c58a1808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/filter_delegator.rb:43)", "rubyjit.LogStash::FilterDelegator$$multi_filter_43640ebf68de601b56cb618392ab9de0b4f8c58a1808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/filter_delegator.rb)", "org.jruby.internal.runtime.methods.JittedMethod.call(JittedMethod.java:181)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)", "org.jruby.ast.CallOneArgNode.interpret(CallOneArgNode.java:57)", "org.jruby.ast.DAsgnNode.interpret(DAsgnNode.java:110)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.IfNode.interpret(IfNode.java:118)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.evaluator.ASTInterpreter.INTERPRET_BLOCK(ASTInterpreter.java:112)", "org.jruby.runtime.Interpreted19Block.evalBlockBody(Interpreted19Block.java:206)", "org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:157)", "org.jruby.runtime.Block.yield(Block.java:142)", "org.jruby.RubyArray.eachCommon(RubyArray.java:1606)", "org.jruby.RubyArray.each(RubyArray.java:1613)", "org.jruby.RubyArray$INVOKER$i$0$0$each.call(RubyArray$INVOKER$i$0$0$each.gen)", "org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:143)", "org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:154)", "org.jruby.ast.CallNoArgBlockNode.interpret(CallNoArgBlockNode.java:64)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.evaluator.ASTInterpreter.INTERPRET_BLOCK(ASTInterpreter.java:112)", "org.jruby.runtime.Interpreted19Block.evalBlockBody(Interpreted19Block.java:206)", "org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:194)", "org.jruby.runtime.Interpreted19Block.call(Interpreted19Block.java:125)", 
"org.jruby.runtime.Block.call(Block.java:101)", "org.jruby.RubyProc.call(RubyProc.java:300)", "org.jruby.RubyProc.call19(RubyProc.java:281)", "org.jruby.RubyProc$INVOKER$i$0$0$call19.call(RubyProc$INVOKER$i$0$0$call19.gen)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:210)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:206)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)", "org.jruby.ast.CallOneArgNode.interpret(CallOneArgNode.java:57)", "org.jruby.ast.DAsgnNode.interpret(DAsgnNode.java:110)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.evaluator.ASTInterpreter.INTERPRET_BLOCK(ASTInterpreter.java:112)", "org.jruby.runtime.Interpreted19Block.evalBlockBody(Interpreted19Block.java:206)", "org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:194)", "org.jruby.runtime.Interpreted19Block.call(Interpreted19Block.java:125)", "org.jruby.runtime.Block.call(Block.java:101)", "org.jruby.RubyProc.call(RubyProc.java:300)", "org.jruby.internal.runtime.methods.ProcMethod.call(ProcMethod.java:64)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:210)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:206)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)", "rubyjit.LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012.block_0$RUBY$__file__(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:383)", "rubyjit$LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012$block_0$RUBY$__file__.call(rubyjit$LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012$block_0$RUBY$__file__)", "org.jruby.runtime.CompiledBlock19.yield(CompiledBlock19.java:159)", "org.jruby.runtime.CompiledBlock19.call(CompiledBlock19.java:87)", "org.jruby.runtime.Block.call(Block.java:101)", "org.jruby.RubyProc.call(RubyProc.java:300)", "org.jruby.RubyProc.call19(RubyProc.java:281)", "org.jruby.RubyProc$INVOKER$i$0$0$call19.call(RubyProc$INVOKER$i$0$0$call19.gen)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:210)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:206)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)", "rubyjit.LogStash::Util::WrappedAckedQueue::ReadBatch$$each_bb4ac8c0302164a124895743b2247f356865d5581808253012.block_0$RUBY$__file__(/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:312)", "rubyjit$LogStash::Util::WrappedAckedQueue::ReadBatch$$each_bb4ac8c0302164a124895743b2247f356865d5581808253012$block_0$RUBY$__file__.call(rubyjit$LogStash::Util::WrappedAckedQueue::ReadBatch$$each_bb4ac8c0302164a124895743b2247f356865d5581808253012$block_0$RUBY$__file__)", "org.jruby.runtime.CompiledBlock19.yield(CompiledBlock19.java:135)", "org.jruby.runtime.Block.yield(Block.java:142)", "org.jruby.RubyHash$13.visit(RubyHash.java:1355)", "org.jruby.RubyHash.visitLimited(RubyHash.java:648)", "org.jruby.RubyHash.visitAll(RubyHash.java:634)", "org.jruby.RubyHash.iteratorVisitAll(RubyHash.java:1306)", "org.jruby.RubyHash.each_pairCommon(RubyHash.java:1351)", "org.jruby.RubyHash.each19(RubyHash.java:1342)", "org.jruby.RubyHash$INVOKER$i$0$0$each19.call(RubyHash$INVOKER$i$0$0$each19.gen)", "org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:143)", 
"org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:154)", "rubyjit.LogStash::Util::WrappedAckedQueue::ReadBatch$$each_bb4ac8c0302164a124895743b2247f356865d5581808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:311)", "rubyjit.LogStash::Util::WrappedAckedQueue::ReadBatch$$each_bb4ac8c0302164a124895743b2247f356865d5581808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb)", "org.jruby.internal.runtime.methods.JittedMethod.call(JittedMethod.java:161)", "org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:143)", "org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:154)", "rubyjit.LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012.chained_0_rescue_1$RUBY$SYNTHETIC__file__(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:382)", "rubyjit.LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb)", "rubyjit.LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb)", "org.jruby.internal.runtime.methods.JittedMethod.call(JittedMethod.java:181)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)", "org.jruby.ast.FCallOneArgNode.interpret(FCallOneArgNode.java:36)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.ast.WhileNode.interpret(WhileNode.java:131)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.evaluator.ASTInterpreter.INTERPRET_METHOD(ASTInterpreter.java:74)", "org.jruby.internal.runtime.methods.InterpretedMethod.call(InterpretedMethod.java:225)", "org.jruby.internal.runtime.methods.DefaultMethod.call(DefaultMethod.java:219)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:202)", "org.jruby.ast.FCallTwoArgNode.interpret(FCallTwoArgNode.java:38)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.evaluator.ASTInterpreter.INTERPRET_BLOCK(ASTInterpreter.java:112)", "org.jruby.runtime.Interpreted19Block.evalBlockBody(Interpreted19Block.java:206)", "org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:194)", "org.jruby.runtime.Interpreted19Block.call(Interpreted19Block.java:125)", "org.jruby.runtime.Block.call(Block.java:101)", "org.jruby.RubyProc.call(RubyProc.java:300)", "org.jruby.RubyProc.call(RubyProc.java:230)", "org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:103)", "java.lang.Thread.run(Thread.java:748)"]}

I suspect the actual hang of the pipeline is related to some JVM config (perhaps the issue raised above) or some overflow related to the high-volume throughput through a single TCP input. Not sure how to reproduce it yet.

@colinsurprenant
Contributor

@onnos @martijnvermaat Thanks, very good info, this is helpful.

@colinsurprenant
Contributor

colinsurprenant commented Jul 19, 2017

@martijnvermaat @onnos Which version of LS are you running and on what platform? Thanks.

@martijnvermaat

@colinsurprenant LS 2.3.2 on Ubuntu 14.04.

@colinsurprenant
Contributor

@martijnvermaat Are you sure? Persistent Queue was only introduced in LS 5.4!? (It was in beta in 5.1.)

@onnos

onnos commented Jul 19, 2017

That's the retired setup; this one is 5.5.0. :) Sorry for the confusion, we're transitioning between setups. Running on Ubuntu 14.04 with the Xenial 4.4.0 kernel.

Again referring to the problem of the initial hang: I split the incoming stream into two separate TCP inputs (from rsyslog to logstash), and this has helped throughput under load tremendously. I suspect we won't be able to recreate the crash.

@colinsurprenant
Contributor

@martijnvermaat @onnos Here's what I can make of your queue state:

  • The oldest page is 3470, which contains the sequence number range [1669303167..1669780635]; the First Unacked Sequence Num (FUS) is 1669595327, so there are 185309 (+/- 1, unimportant here) unacked events in that page.

  • The next page, 3471, is the one with the 0-byte size; it contains the sequence number range [1669780636..1670260155] and its FUS is 1670260156, which means it is fully acked.

  • The next pages, 3472+, are all purged, and the checkpoints say they are all fully acked up to page 3899, which is the head page.

I see 2 problems here:

1- Page 3470 is stuck behind with a bunch of unacked events. The only explanation I see is that the batch at seq num 1669595327 got stuck in a stalled output up until LS crashed or was restarted. I suspect that if it weren't for the 0-byte page size, these events in page 3470 would have been replayed upon restart and the queue would have been correctly purged up to the head page. It is definitely worth evaluating such scenarios better and thinking about how to mitigate them. At first thought I'd say this is more a stuck-output problem than a queue problem per se. I will follow up on this separately.

2- Page 3471 is in fact fully acked and should have been purged completely, but it is zero bytes. I suspect the delete operation was, for some reason, not completed, or something like that. I will investigate in that area to see if I can spot such a condition.

The good news in this case is that simply deleting page.3471 should fix the problem.

The other good news is that I will see if we can add a condition at queue initialization so that if a page file has zero bytes and the checkpoint says it is fully acked, we just clean it up.
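
As a rough sketch of that cleanup condition (illustrative only, not the eventual fix), using the fully-acked test firstUnackedSeq >= minSeq + elementCount and, as placeholders, the values reported above for page 3471:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ZeroBytePageCleanupSketch {
    // A page is fully acked when the first unacked sequence number is past its last element.
    static boolean fullyAcked(long firstUnackedSeq, long minSeq, long elementCount) {
        return firstUnackedSeq >= minSeq + elementCount;
    }

    public static void main(String[] args) throws IOException {
        Path queueDir = Paths.get("data/queue/main");  // adjust to the actual queue dir
        long pageNum = 3471;                           // the 0-byte page from this report
        // These values would normally come from decoding checkpoint.3471 (see earlier command);
        // here they are the numbers reported above for page 3471.
        long minSeq = 1669780636L;
        long elementCount = 479520L;         // derived from the range [1669780636..1670260155]
        long firstUnackedSeq = 1670260156L;  // past the last element => fully acked
        Path page = queueDir.resolve("page." + pageNum);
        if (Files.exists(page) && Files.size(page) == 0
                && fullyAcked(firstUnackedSeq, minSeq, elementCount)) {
            // Safe to remove: the checkpoint says every event in this page was acked,
            // and the 0-byte page file itself holds no data.
            Files.delete(page);
            Files.deleteIfExists(queueDir.resolve("checkpoint." + pageNum));
        }
    }
}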

@colinsurprenant
Contributor

@nicknameforever do you still have your problematic queue dir intact? If so, would it also be possible for you to run the same command to list the content of the checkpoints (adjust the queue dir path if necessary) and see if we have the same situation here?

$ vendor/jruby/bin/jruby -rpp -e 'Dir.glob("data/queue/main/checkpoint.*").sort_by { |x| x[/[0-9]+$/].to_i}.each { |checkpoint| data = File.read(checkpoint); version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount, crc32 = data.unpack("nNNQ>Q>NN"); puts File.basename(checkpoint); p(version: version, page: page, firstUnackedPage: firstUnackedPage, firstUnackedSeq: firstUnackedSeq, minSeq: minSeq, elementCount: elementCount, crc32: crc32) }'

@nicknameforever
Author

@colinsurprenant Unfortunately not.

@colinsurprenant
Contributor

Created issue #7809 specifically for the zero-byte & fully-acked page file case.

@colinsurprenant
Contributor

PR for #7809 in #7810

@colinsurprenant
Contributor

Above I mentioned a stalled plugin condition; issue #7796 was created to get more insight into these.

@andrewmiskell

@colinsurprenant I'm running into the same problem as the original reporter. I didn't change the queue size, just restarted Logstash, and now it's failing to start because the page file is 0 bytes. I'm running Logstash 5.6.4 from the Elastic repos.

@colinsurprenant
Contributor

@andrewmiskell thanks for the report. How did LS shut down prior to the restart? Did it complete a clean shutdown, or was it force-stopped or did it crash? Do you have any logs related to that?

@andrewmiskell

@colinsurprenant Not sure if I have the logs, but it was observed after issuing a simple service logstash restart command on both my logstash nodes.

I still have the queues; I moved them to a temporary location for the time being so I can see if they can be fixed and imported into Elasticsearch later.

@colinsurprenant
Contributor

@andrewmiskell ah! Great that you preserved the queue dir. I am optimistic that we should be able to recover it, and it will help diagnose the problem. In the meantime, I'd really appreciate it if you could check the logs to see if LS completed a clean/normal shutdown or if the service stop ended up killing the LS process.

Could you list the queue dir content and also run the following command from the LS home dir (adjust the queue dir path) to display the content of the checkpoints:

$ vendor/jruby/bin/jruby -rpp -e 'Dir.glob("data/queue/main/checkpoint.*").sort_by { |x| x[/[0-9]+$/].to_i}.each { |checkpoint| data = File.read(checkpoint); version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount, crc32 = data.unpack("nNNQ>Q>NN"); puts File.basename(checkpoint); p(version: version, page: page, firstUnackedPage: firstUnackedPage, firstUnackedSeq: firstUnackedSeq, minSeq: minSeq, elementCount: elementCount, crc32: crc32) }'

@andrewmiskell

@colinsurprenant Attached to comment. It looks like there was a hung grok thread and it was killed forcefully by the service script.

logstash-queue.txt
logstash-log.txt

@colinsurprenant
Contributor

@andrewmiskell could you also list the queue dir files showing the file sizes please? (ls -la)
Thanks.

@colinsurprenant
Contributor

@andrewmiskell better yet, could you run this new command from the LS home dir? Adjust the dir = "data/queue/main" part to your actual queue dir path.

vendor/jruby/bin/jruby -rpp -e 'dir = "data/queue/main"; Dir.glob("#{dir}/checkpoint.*").sort_by { |x| x[/[0-9]+$/].to_i}.each { |checkpoint| data = File.read(checkpoint); version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount, crc32 = data.unpack("nNNQ>Q>NN"); fa = firstUnackedSeq >= (minSeq + elementCount); ps = File.exist?("#{dir}/page.#{page}") ? File.size("#{dir}/page.#{page}") : nil; print("#{File.basename(checkpoint)}, #{fa ? "FA" : "UN"}, size: #{ps ? ps : "NA"}, "); p(version: version, page: page, firstUnackedPage: firstUnackedPage, firstUnackedSeq: firstUnackedSeq, minSeq: minSeq, elementCount: elementCount, crc32: crc32) }'
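
For reference, a readable Java sketch of what that one-liner reports per checkpoint, under the same assumed 34-byte layout; the FA/UN flag is the firstUnackedSeq >= minSeq + elementCount test and the size comes from the matching page file, if present (illustration only, not Logstash code):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.*;
import java.util.stream.Stream;

public class QueueCheckpointReport {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : "data/queue/main"); // adjust as needed
        try (Stream<Path> files = Files.list(dir)) {
            files.filter(p -> p.getFileName().toString().startsWith("checkpoint."))
                 .sorted() // alphabetical, not numeric, ordering; good enough for a quick look
                 .forEach(QueueCheckpointReport::report);
        }
    }

    static void report(Path checkpoint) {
        try {
            ByteBuffer b = ByteBuffer.wrap(Files.readAllBytes(checkpoint)); // big-endian by default
            int version           = Short.toUnsignedInt(b.getShort());
            long page             = Integer.toUnsignedLong(b.getInt());
            long firstUnackedPage = Integer.toUnsignedLong(b.getInt());
            long firstUnackedSeq  = b.getLong();
            long minSeq           = b.getLong();
            long elementCount     = Integer.toUnsignedLong(b.getInt());
            boolean fullyAcked = firstUnackedSeq >= minSeq + elementCount;
            Path pageFile = checkpoint.resolveSibling("page." + page);
            String size = Files.exists(pageFile) ? String.valueOf(Files.size(pageFile)) : "NA";
            System.out.printf(
                "%s, %s, size: %s, version=%d page=%d firstUnackedPage=%d firstUnackedSeq=%d minSeq=%d elementCount=%d%n",
                checkpoint.getFileName(), fullyAcked ? "FA" : "UN", size,
                version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount);
        } catch (IOException e) {
            System.err.println("failed to read " + checkpoint + ": " + e);
        }
    }
}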

@colinsurprenant
Contributor

For some reason, #7809, which originally fixed the "Page file size 0 different to configured ..." startup exception, was incorrectly merged into the 5.6 branch, and this is why we see this problem again. I re-injected the fix and it will be part of the upcoming 5.6.5 release. This is really unfortunate.

I will go ahead and close this issue; feel free to re-open if needed.
