Logstash fails to restart with the persistent queuing enabled #7538

Closed
nicknameforever opened this issue Jun 26, 2017 · 33 comments

@nicknameforever

After a restart of Logstash, it fails to start with the following message:

[2017-06-26T16:24:04,802][ERROR][logstash.pipeline        ] Logstash failed to create queue {"exception"=>"Page file size 0 different to configured page capacity 262144000 for /path/to/data/queue/main/page.136", "backtrace"=>["org/logstash/ackedqueue/ext/JrubyAckedQueueExtLibrary.java:136:in `open'", "/path/to/install/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:41:in `with_queue'", "/path/to/install/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:30:in `create_file_based'", "/path/to/install/logstash/logstash-core/lib/logstash/queue_factory.rb:29:in `create'", "/path/to/install/logstash/logstash-core/lib/logstash/pipeline.rb:148:in `initialize'", "/path/to/install/logstash/logstash-core/lib/logstash/agent.rb:286:in `create_pipeline'", "/path/to/install/logstash/logstash-core/lib/logstash/agent.rb:95:in `register_pipeline'", "/path/to/install/logstash/logstash-core/lib/logstash/runner.rb:274:in `execute'", "/path/to/install/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:67:in `run'", "/path/to/install/logstash/logstash-core/lib/logstash/runner.rb:185:in `run'", "/path/to/install/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:132:in `run'", "/path/to/install/logstash/lib/bootstrap/environment.rb:71:in `(root)'"]}
[2017-06-26T16:24:04,810][ERROR][logstash.agent           ] Cannot create pipeline {:reason=>"Page file size 0 different to configured page capacity 262144000 for /path/to/data/queue/main/page.136"}
  • Logstash 5.4.0
  • SLES 11 (x86_64)
  • Config:
queue.type: persisted
queue.checkpoint.writes: 1024
queue.max_bytes: 1024mb
  • In the queue/main/ folder the file page.136 is indeed 0 bytes
@shoggeh

shoggeh commented Jun 26, 2017

I can confirm this. If you try to adjust queue.page_capacity while there are any events stored in the on-disk queue, this error will be triggered. The value can only be adjusted up or down if the disk queue is empty.

@colinsurprenant
Contributor

@nicknameforever Thanks for the report. A few questions:

  • Did you modify queue.page_capacity as @shoggeh mentioned?
  • Could you list the queue directory content? Is page.136 the last page?
  • Was Logstash somehow abnormally terminated, crashed, or interrupted?

@colinsurprenant
Contributor

@shoggeh thanks for your input. Did you in fact experience this specific error just by changing queue.page_capacity? I will try to reproduce that case.

"Page file size 0 different to configured page capacity xxx for ...

Specifically the "Page file size 0"?

@colinsurprenant
Contributor

This is very weird. Reviewing the related code, it seems the only way a page of size 0 can exist is if an exception is thrown right after the page file creation and before or during the map() call in:

RandomAccessFile raf = new RandomAccessFile(this.file, "rw");
this.channel = raf.getChannel();
this.buffer = this.channel.map(FileChannel.MapMode.READ_WRITE, 0, this.capacity);

The new RandomAccessFile(this.file, "rw") call creates the page file with size 0, and then this.channel.map(FileChannel.MapMode.READ_WRITE, 0, this.capacity) should extend it to the configured page capacity.

@nicknameforever @shoggeh can you see any error or exception in the logs that would indicate such a condition?
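
For illustration, here is a minimal, self-contained sketch of that failure window, assuming the same create-then-map sequence (the file name and capacity below are placeholders, not the actual Logstash source):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class PageCreateSketch {
    public static void main(String[] args) throws IOException {
        File file = new File("page.example");   // placeholder page file name
        long capacity = 262144000L;             // matches queue.page_capacity in the report
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            // The file now exists on disk with size 0. If the JVM dies here
            // (for example with an OutOfMemoryError), a 0-byte page is left behind.
            FileChannel channel = raf.getChannel();
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, capacity);
            // map() grows the file to `capacity`, so only after this call does the
            // page file size match the configured page capacity checked at startup.
            buffer.put(0, (byte) 0);
        }
    }
}

If the process dies between the file creation and the map() call, the leftover 0-byte page file is exactly what the startup size check later trips over.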

@nicknameforever
Author

@colinsurprenant

  • I did not modify queue.page_capacity
  • files in the queue:
250M page.135
34 checkpoint.135
34 checkpoint.136
0 page.136
34 checkpoint.head
250M page.137
  • Yes, Logstash crashed with an OutOfMemoryError:
[2017-06-24T02:18:30,480][ERROR][logstash.pipeline        ] Exception in pipelineworker, the pipeline stopped processing new events, please check your filter configuration and restart Logstash. {"exception"=>"Java heap space", "backtrace"=>[]}
...
[2017-06-24T02:18:28,756][ERROR][logstash.filters.grok    ] Error while attempting to check/cancel excessively long grok patterns {:message=>"Java heap space", :class=>"Java::JavaLang::OutOfMemoryError", :backtrace=>[]}
[2017-06-24T02:18:35,271][ERROR][logstash.pipeline        ] Exception in pipelineworker, the pipeline stopped processing new events, please check your filter configuration and restart Logstash. {"exception"=>"Java heap space", "backtrace"=>[]}
...

This could be a separate issue, but it is most likely due to issue #6970.

@shoggeh

shoggeh commented Jun 28, 2017

@colinsurprenant yes, I think I'm running into a different issue than @nicknameforever, although it results in a similar error.

  • I did modify the queue.page_capacity size on a Logstash instance processing data,
  • I used non-zero values (usually 256mb, 512mb, and 1024mb) to test it.

The situation was then as follows:

  • EMPTY QUEUE: If nothing was stored in the on-disk PQ at the moment of the page_capacity change, it was POSSIBLE to restart Logstash without any issue. By "nothing" I mean the _node/stats monitoring endpoint showing 0 under pipeline.queue.events.

With an example change from 256mb to 512mb:

PRIOR to change:

    "queue" : {
      "events" : 0,
      "type" : "persisted",
      "capacity" : {
        "page_capacity_in_bytes" : 262144000,

<< actual change >>

[2017-06-28T16:21:57,852][INFO ][logstash.agent           ] Successfully started Logstash API endpoint {:port=>9600}
    "queue" : {
      "events" : 0,
      "type" : "persisted",
      "capacity" : {
        "page_capacity_in_bytes" : 536870912,

However, the size of the page file remained at its former value:

-rw-r--r-- 1 logstash logstash 250M Jun 28 16:20 /var/lib/logstash/queue/main/page.505

END RESULT: Logstash resumed all of the operations properly.

  • SOMETHING IN QUEUE: However, if anything at all was stored in the PQ, then subsequent Logstash restarts errored and Logstash didn't start until I either:
    a) reverted queue.page_capacity to its former value
    b) manually removed the contents of /var/lib/logstash/queue and let Logstash recreate it

PRIOR TO CHANGE:

    "queue" : {
      "events" : 22237,
      "type" : "persisted",
      "capacity" : {
        "page_capacity_in_bytes" : 262144000,

<< actual change >>

[2017-06-28T16:44:16,848][ERROR][logstash.agent           ] Cannot create pipeline {:reason=>"Page file size 262144000 different to configured page capacity 536870912 for /var/lib/logstash/queue/main/page.505"}

END RESULT: Logstash not starting.

If it's not possible to avoid enforcing the page_capacity size on already created pages, then maybe it's worth mentioning in the docs that the queue needs to be drained first before the value can be changed?

@colinsurprenant
Contributor

@shoggeh OK, thanks for the details. So yes, this is a different issue from the zero page file size, which seems to be triggered by an exception that prevents the correct page file "initialization".

But your situation is definitely a problem we also need to fix, either with better documentation, a better error message, or by revisiting this limitation.

Would you mind creating a separate issue with your specific problem description? Basically, copy-paste your previous comment.

@colinsurprenant
Contributor

@nicknameforever thanks for the info. OK, I agree we potentially have 2 different issues here (one leading to the other):

We will address both separately and keep the focus on the PQ corruption for this issue here.

@martijnvermaat

martijnvermaat commented Jul 14, 2017

How can we get Logstash to start again in the case of an empty page file?

$ du -sh /run/shm/logstash/queue/main/page*
250M	/run/shm/logstash/queue/main/page.3470
0	/run/shm/logstash/queue/main/page.3471
250M	/run/shm/logstash/queue/main/page.3511
250M	/run/shm/logstash/queue/main/page.3525
250M	/run/shm/logstash/queue/main/page.3898
4.0K	/run/shm/logstash/queue/main/page.3899
$ tail -2 /var/log/logstash/logstash-plain.log
[2017-07-14T16:47:57,434][ERROR][logstash.pipeline        ] Logstash failed to create queue {"exception"=>"Page file size 0 different to configured page capacity 262144000 for /run/shm/logstash/queue/main/page.3471", "backtrace"=>["org/logstash/ackedqueue/ext/JrubyAckedQueueExtLibrary.java:133:in `open'", "/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:41:in `with_queue'", "/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:30:in `create_file_based'", "/usr/share/logstash/logstash-core/lib/logstash/queue_factory.rb:29:in `create'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:159:in `initialize'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:286:in `create_pipeline'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:95:in `register_pipeline'", "/usr/share/logstash/logstash-core/lib/logstash/runner.rb:314:in `execute'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:67:in `run'", "/usr/share/logstash/logstash-core/lib/logstash/runner.rb:209:in `run'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:132:in `run'", "/usr/share/logstash/lib/bootstrap/environment.rb:71:in `(root)'"]}
[2017-07-14T16:47:57,439][ERROR][logstash.agent           ] Cannot create pipeline {:reason=>"Page file size 0 different to configured page capacity 262144000 for /run/shm/logstash/queue/main/page.3471"}

@colinsurprenant
Contributor

@martijnvermaat could you please run the following command from the logstash root dir and report the result?

$ vendor/jruby/bin/jruby -rpp -e 'Dir.glob("data/queue/main/checkpoint.*").sort_by { |x| x[/[0-9]+$/].to_i}.each { |checkpoint| data = File.read(checkpoint); version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount, crc32 = data.unpack("nNNQ>Q>NN"); puts File.basename(checkpoint); p(version: version, page: page, firstUnackedPage: firstUnackedPage, firstUnackedSeq: firstUnackedSeq, minSeq: minSeq, elementCount: elementCount, crc32: crc32) }'

@colinsurprenant
Contributor

@martijnvermaat sorry, I should have given more explanation: the above command just dumps the content of the checkpoint files and will help me see whether we can come up with a manual way to repair the broken queue state. The checkpoint files do not contain any ingested data, only metadata about the queue state.
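
As a more readable equivalent of that one-liner, here is a rough Java sketch that decodes the same fields, assuming the 34-byte big-endian layout implied by the unpack("nNNQ>Q>NN") format string (an illustration, not the actual checkpoint reader in logstash-core):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CheckpointDump {
    public static void main(String[] args) throws IOException {
        // Usage: java CheckpointDump data/queue/main/checkpoint.head
        ByteBuffer b = ByteBuffer.wrap(Files.readAllBytes(Paths.get(args[0])));
        // ByteBuffer is big-endian by default, matching the n/N/Q> unpack directives.
        int version           = Short.toUnsignedInt(b.getShort());   // n  : 16-bit
        long page             = Integer.toUnsignedLong(b.getInt());  // N  : 32-bit
        long firstUnackedPage = Integer.toUnsignedLong(b.getInt());  // N
        long firstUnackedSeq  = b.getLong();                         // Q> : 64-bit
        long minSeq           = b.getLong();                         // Q>
        long elementCount     = Integer.toUnsignedLong(b.getInt());  // N
        long crc32            = Integer.toUnsignedLong(b.getInt());  // N
        System.out.printf(
            "version=%d page=%d firstUnackedPage=%d firstUnackedSeq=%d minSeq=%d elementCount=%d crc32=%d%n",
            version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount, crc32);
    }
}

The 2 + 4 + 4 + 8 + 8 + 4 + 4 = 34 bytes of this layout also match the 34-byte checkpoint files listed earlier in this thread.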

@martijnvermaat

@colinsurprenant I have since moved the queue dir to another location and restarted Logstash (twice, because after ~15 min the same thing happened). Here are the contents of the checkpoint files I moved:

checkpoints.txt

At this point it's not a priority to recover the events, but it's a bit worrying that this happened twice.

@colinsurprenant
Contributor

colinsurprenant commented Jul 15, 2017

@martijnvermaat Great, thanks. In this checkpoints.txt, which page is now 0 bytes? (Or you can also include an ls -la of data/queue/main/.)

Also, have you seen any error logs prior to this happening? Was Logstash somehow abnormally terminated, crashed, or interrupted?

Thanks.

@onnos

onnos commented Jul 15, 2017

@colinsurprenant Hi! Colleague of @martijnvermaat here, jumping in since he's away for a few days.

Here's the directory listing for the queue mentioned above: queue_ls.txt

The reason for restarting Logstash was that the process was "hanging", meaning the pipeline(s) stopped.

[2017-07-14T15:23:11,978][INFO ][logstash.outputs.file    ] Opening file {:path=>"/store/logstash/2017/07/14/message.log.gz"}
[2017-07-14T15:23:11,980][ERROR][logstash.pipeline        ] Exception in pipelineworker, the pipeline stopped processing new events, please check your filter configuration and restart Logstash. {"exception"=>"0", "backtrace"=>["org.logstash.dissect.DissectorErrorUtils.backtrace(DissectorErrorUtils.java:16)", "org.logstash.dissect.JavaDissectorLibrary$RubyDissect.logException(JavaDissectorLibrary.java:224)", "org.logstash.dissect.JavaDissectorLibrary$RubyDissect.dissect(JavaDissectorLibrary.java:130)", "org.logstash.dissect.JavaDissectorLibrary$RubyDissect.dissect_multi(JavaDissectorLibrary.java:140)", "org.logstash.dissect.JavaDissectorLibrary$RubyDissect$INVOKER$i$2$0$dissect_multi.call(JavaDissectorLibrary$RubyDissect$INVOKER$i$2$0$dissect_multi.gen)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:202)", "rubyjit.LogStash::Filters::Dissect$$multi_filter_1299793c6e70636e916ab55a5dfbe2aaaf57a1ae1808253012.__file__(/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dissect-1.0.11/lib/logstash/filters/dissect.rb:182)", "rubyjit.LogStash::Filters::Dissect$$multi_filter_1299793c6e70636e916ab55a5dfbe2aaaf57a1ae1808253012.__file__(/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-dissect-1.0.11/lib/logstash/filters/dissect.rb)", "org.jruby.internal.runtime.methods.JittedMethod.call(JittedMethod.java:181)", "org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:326)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:170)", "rubyjit.LogStash::FilterDelegator$$multi_filter_43640ebf68de601b56cb618392ab9de0b4f8c58a1808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/filter_delegator.rb:43)", "rubyjit.LogStash::FilterDelegator$$multi_filter_43640ebf68de601b56cb618392ab9de0b4f8c58a1808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/filter_delegator.rb)", "org.jruby.internal.runtime.methods.JittedMethod.call(JittedMethod.java:181)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)", "org.jruby.ast.CallOneArgNode.interpret(CallOneArgNode.java:57)", "org.jruby.ast.DAsgnNode.interpret(DAsgnNode.java:110)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.IfNode.interpret(IfNode.java:118)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.evaluator.ASTInterpreter.INTERPRET_BLOCK(ASTInterpreter.java:112)", "org.jruby.runtime.Interpreted19Block.evalBlockBody(Interpreted19Block.java:206)", "org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:157)", "org.jruby.runtime.Block.yield(Block.java:142)", "org.jruby.RubyArray.eachCommon(RubyArray.java:1606)", "org.jruby.RubyArray.each(RubyArray.java:1613)", "org.jruby.RubyArray$INVOKER$i$0$0$each.call(RubyArray$INVOKER$i$0$0$each.gen)", "org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:143)", "org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:154)", "org.jruby.ast.CallNoArgBlockNode.interpret(CallNoArgBlockNode.java:64)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.evaluator.ASTInterpreter.INTERPRET_BLOCK(ASTInterpreter.java:112)", "org.jruby.runtime.Interpreted19Block.evalBlockBody(Interpreted19Block.java:206)", "org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:194)", "org.jruby.runtime.Interpreted19Block.call(Interpreted19Block.java:125)", 
"org.jruby.runtime.Block.call(Block.java:101)", "org.jruby.RubyProc.call(RubyProc.java:300)", "org.jruby.RubyProc.call19(RubyProc.java:281)", "org.jruby.RubyProc$INVOKER$i$0$0$call19.call(RubyProc$INVOKER$i$0$0$call19.gen)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:210)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:206)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)", "org.jruby.ast.CallOneArgNode.interpret(CallOneArgNode.java:57)", "org.jruby.ast.DAsgnNode.interpret(DAsgnNode.java:110)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.evaluator.ASTInterpreter.INTERPRET_BLOCK(ASTInterpreter.java:112)", "org.jruby.runtime.Interpreted19Block.evalBlockBody(Interpreted19Block.java:206)", "org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:194)", "org.jruby.runtime.Interpreted19Block.call(Interpreted19Block.java:125)", "org.jruby.runtime.Block.call(Block.java:101)", "org.jruby.RubyProc.call(RubyProc.java:300)", "org.jruby.internal.runtime.methods.ProcMethod.call(ProcMethod.java:64)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:210)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:206)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)", "rubyjit.LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012.block_0$RUBY$__file__(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:383)", "rubyjit$LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012$block_0$RUBY$__file__.call(rubyjit$LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012$block_0$RUBY$__file__)", "org.jruby.runtime.CompiledBlock19.yield(CompiledBlock19.java:159)", "org.jruby.runtime.CompiledBlock19.call(CompiledBlock19.java:87)", "org.jruby.runtime.Block.call(Block.java:101)", "org.jruby.RubyProc.call(RubyProc.java:300)", "org.jruby.RubyProc.call19(RubyProc.java:281)", "org.jruby.RubyProc$INVOKER$i$0$0$call19.call(RubyProc$INVOKER$i$0$0$call19.gen)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:210)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:206)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)", "rubyjit.LogStash::Util::WrappedAckedQueue::ReadBatch$$each_bb4ac8c0302164a124895743b2247f356865d5581808253012.block_0$RUBY$__file__(/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:312)", "rubyjit$LogStash::Util::WrappedAckedQueue::ReadBatch$$each_bb4ac8c0302164a124895743b2247f356865d5581808253012$block_0$RUBY$__file__.call(rubyjit$LogStash::Util::WrappedAckedQueue::ReadBatch$$each_bb4ac8c0302164a124895743b2247f356865d5581808253012$block_0$RUBY$__file__)", "org.jruby.runtime.CompiledBlock19.yield(CompiledBlock19.java:135)", "org.jruby.runtime.Block.yield(Block.java:142)", "org.jruby.RubyHash$13.visit(RubyHash.java:1355)", "org.jruby.RubyHash.visitLimited(RubyHash.java:648)", "org.jruby.RubyHash.visitAll(RubyHash.java:634)", "org.jruby.RubyHash.iteratorVisitAll(RubyHash.java:1306)", "org.jruby.RubyHash.each_pairCommon(RubyHash.java:1351)", "org.jruby.RubyHash.each19(RubyHash.java:1342)", "org.jruby.RubyHash$INVOKER$i$0$0$each19.call(RubyHash$INVOKER$i$0$0$each19.gen)", "org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:143)", 
"org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:154)", "rubyjit.LogStash::Util::WrappedAckedQueue::ReadBatch$$each_bb4ac8c0302164a124895743b2247f356865d5581808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:311)", "rubyjit.LogStash::Util::WrappedAckedQueue::ReadBatch$$each_bb4ac8c0302164a124895743b2247f356865d5581808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb)", "org.jruby.internal.runtime.methods.JittedMethod.call(JittedMethod.java:161)", "org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:143)", "org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:154)", "rubyjit.LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012.chained_0_rescue_1$RUBY$SYNTHETIC__file__(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:382)", "rubyjit.LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb)", "rubyjit.LogStash::Pipeline$$filter_batch_479eb809eb20a7a0994a246c039a1bbdaffd05521808253012.__file__(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb)", "org.jruby.internal.runtime.methods.JittedMethod.call(JittedMethod.java:181)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)", "org.jruby.ast.FCallOneArgNode.interpret(FCallOneArgNode.java:36)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.ast.WhileNode.interpret(WhileNode.java:131)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.evaluator.ASTInterpreter.INTERPRET_METHOD(ASTInterpreter.java:74)", "org.jruby.internal.runtime.methods.InterpretedMethod.call(InterpretedMethod.java:225)", "org.jruby.internal.runtime.methods.DefaultMethod.call(DefaultMethod.java:219)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:202)", "org.jruby.ast.FCallTwoArgNode.interpret(FCallTwoArgNode.java:38)", "org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)", "org.jruby.ast.BlockNode.interpret(BlockNode.java:71)", "org.jruby.evaluator.ASTInterpreter.INTERPRET_BLOCK(ASTInterpreter.java:112)", "org.jruby.runtime.Interpreted19Block.evalBlockBody(Interpreted19Block.java:206)", "org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:194)", "org.jruby.runtime.Interpreted19Block.call(Interpreted19Block.java:125)", "org.jruby.runtime.Block.call(Block.java:101)", "org.jruby.RubyProc.call(RubyProc.java:300)", "org.jruby.RubyProc.call(RubyProc.java:230)", "org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:103)", "java.lang.Thread.run(Thread.java:748)"]}

I suspect the actual hang of the pipeline is related to some JVM config (perhaps the issue raised above) or some overflow related to the high-volume throughput through a single TCP input. Not sure how to reproduce it yet.

@colinsurprenant
Contributor

@onnos @martijnvermaat Thanks, very good info, this is helpful.

@colinsurprenant
Contributor

colinsurprenant commented Jul 19, 2017

@martijnvermaat @onnos Which version of LS are you running and on what platform? Thanks.

@martijnvermaat

@colinsurprenant LS 2.3.2 on Ubuntu 14.04.

@colinsurprenant
Contributor

@martijnvermaat Are you sure? Persistent Queue was only introduced in LS 5.4!? (It was in beta in 5.1.)

@onnos

onnos commented Jul 19, 2017

That's the retired setup; this one is 5.5.0. :) Sorry for the confusion, we're transitioning between setups. Running on Ubuntu 14.04 with the Xenial 4.4.0 kernel.

Again referring to the problem of the initial hang: I split the incoming stream into two separate TCP inputs (from rsyslog to logstash), and this has helped throughput under load tremendously. I suspect we won't be able to recreate the crash.

@colinsurprenant
Contributor

@martijnvermaat @onnos Here's what I can make of your queue state:

  • The oldest page is 3470, which contains the sequence number range [1669303167..1669780635]; the First Unacked Sequence Num (FUS) is 1669595327, so there are 185309 (+/- 1, unimportant here) unacked events in that page.

  • The next page, 3471, is the one with the 0-byte size; it contains the sequence number range [1669780636..1670260155] and its FUS is 1670260156, which means it is fully acked.

  • The next pages, 3472+, are all purged, and the checkpoints say they are all fully acked up to page 3899, which is the head page.

I see 2 problems here:

1- Page 3470 is stuck behind with a bunch of unacked events. The only explanation I see is that the batch at seq num 1669595327 got stuck in a stalled output up until LS crashed or was restarted. I suspect that if it weren't for the 0-byte page size, these events in page 3470 would have been replayed upon restart and the queue would have been correctly purged up to the head page. It is definitely worth evaluating such scenarios better and thinking about how to mitigate them. At first thought I'd say this is more a stuck-output problem than a queue problem per se. I will follow up on this separately.

2- Page 3471 is in fact fully acked and should have been purged completely, but it is zero bytes. I suspect the delete operation was, for some reason, not completed, or something like that. I will investigate in that area to see if I can spot such a condition.

The good news in this case is that simply deleting page.3471 should fix the problem.

The other good news is that I will see if we can add a condition at queue initialization so that if a page file has zero bytes and the checkpoint says it is fully acked, we just clean it up.
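
As a rough sketch of that cleanup condition (illustrative only, not the eventual fix), using the fully-acked test firstUnackedSeq >= minSeq + elementCount and, as placeholders, the values reported above for page 3471:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ZeroBytePageCleanupSketch {
    // A page is fully acked when the first unacked sequence number is past its last element.
    static boolean fullyAcked(long firstUnackedSeq, long minSeq, long elementCount) {
        return firstUnackedSeq >= minSeq + elementCount;
    }

    public static void main(String[] args) throws IOException {
        Path queueDir = Paths.get("data/queue/main");  // adjust to the actual queue dir
        long pageNum = 3471;                           // the 0-byte page from this report
        // These values would normally come from decoding checkpoint.3471 (see earlier command);
        // here they are the numbers reported above for page 3471.
        long minSeq = 1669780636L;
        long elementCount = 479520L;         // derived from the range [1669780636..1670260155]
        long firstUnackedSeq = 1670260156L;  // past the last element => fully acked
        Path page = queueDir.resolve("page." + pageNum);
        if (Files.exists(page) && Files.size(page) == 0
                && fullyAcked(firstUnackedSeq, minSeq, elementCount)) {
            // Safe to remove: the checkpoint says every event in this page was acked,
            // and the 0-byte page file itself holds no data.
            Files.delete(page);
            Files.deleteIfExists(queueDir.resolve("checkpoint." + pageNum));
        }
    }
}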

@colinsurprenant
Contributor

@nicknameforever do you still have your problematic queue dir intact? If so, would it also be possible for you to run the same command to list the content of the checkpoints (adjust the queue dir path if necessary) and see if we have the same situation here?

$ vendor/jruby/bin/jruby -rpp -e 'Dir.glob("data/queue/main/checkpoint.*").sort_by { |x| x[/[0-9]+$/].to_i}.each { |checkpoint| data = File.read(checkpoint); version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount, crc32 = data.unpack("nNNQ>Q>NN"); puts File.basename(checkpoint); p(version: version, page: page, firstUnackedPage: firstUnackedPage, firstUnackedSeq: firstUnackedSeq, minSeq: minSeq, elementCount: elementCount, crc32: crc32) }'

@nicknameforever
Author

@colinsurprenant Unfortunately not.

@colinsurprenant
Contributor

Created issue #7809 specifically for the zero-byte & fully-acked page file case.

@colinsurprenant
Contributor

PR for #7809 in #7810

@colinsurprenant
Contributor

Above I mentioned a stalled plugin condition; issue #7796 was created to get more insight into these.

@andrewmiskell

@colinsurprenant I'm running into the same problem as the original reporter. I didn't change the queue size, just restarted Logstash, and now it's failing to start because the page file is 0 bytes. I'm running Logstash 5.6.4 from the Elastic repos.

@colinsurprenant
Contributor

@andrewmiskell thanks for the report. How did LS shut down prior to the restart? Did it complete a clean shutdown, or was it force-stopped or did it crash? Do you have any logs related to that?

@andrewmiskell

@colinsurprenant Not sure if I have the logs, but it was observed after issuing a simple service logstash restart command on both my logstash nodes.

I still have the queues; I moved them to a temporary location for the time being so I can see if they can be fixed and imported into Elasticsearch later.

@colinsurprenant
Contributor

@andrewmiskell ah! Great that you preserved the queue dir. I am optimistic that we should be able to recover it, and it will help diagnose the problem. In the meantime, I'd really appreciate it if you could check the logs to see if LS completed a clean/normal shutdown or if the service stop ended up killing the LS process.

Could you list the queue dir content and also run the following command from the LS home dir (adjust the queue dir path) to display the content of the checkpoints:

$ vendor/jruby/bin/jruby -rpp -e 'Dir.glob("data/queue/main/checkpoint.*").sort_by { |x| x[/[0-9]+$/].to_i}.each { |checkpoint| data = File.read(checkpoint); version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount, crc32 = data.unpack("nNNQ>Q>NN"); puts File.basename(checkpoint); p(version: version, page: page, firstUnackedPage: firstUnackedPage, firstUnackedSeq: firstUnackedSeq, minSeq: minSeq, elementCount: elementCount, crc32: crc32) }'

@andrewmiskell

@colinsurprenant Attached to comment. It looks like there was a hung grok thread and it was killed forcefully by the service script.

logstash-queue.txt
logstash-log.txt

@colinsurprenant
Contributor

@andrewmiskell could you also list the queue dir files showing the file sizes please? (ls -la)
Thanks.

@colinsurprenant
Contributor

@andrewmiskell better yet, could you run this new command from the LS home dir? Adjust the dir = "data/queue/main" part to your actual queue dir path.

vendor/jruby/bin/jruby -rpp -e 'dir = "data/queue/main"; Dir.glob("#{dir}/checkpoint.*").sort_by { |x| x[/[0-9]+$/].to_i}.each { |checkpoint| data = File.read(checkpoint); version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount, crc32 = data.unpack("nNNQ>Q>NN"); fa = firstUnackedSeq >= (minSeq + elementCount); ps = File.exist?("#{dir}/page.#{page}") ? File.size("#{dir}/page.#{page}") : nil; print("#{File.basename(checkpoint)}, #{fa ? "FA" : "UN"}, size: #{ps ? ps : "NA"}, "); p(version: version, page: page, firstUnackedPage: firstUnackedPage, firstUnackedSeq: firstUnackedSeq, minSeq: minSeq, elementCount: elementCount, crc32: crc32) }'
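
For reference, a readable Java sketch of what that one-liner reports per checkpoint, under the same assumed 34-byte layout; the FA/UN flag is the firstUnackedSeq >= minSeq + elementCount test and the size comes from the matching page file, if present (illustration only, not Logstash code):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.*;
import java.util.stream.Stream;

public class QueueCheckpointReport {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : "data/queue/main"); // adjust as needed
        try (Stream<Path> files = Files.list(dir)) {
            files.filter(p -> p.getFileName().toString().startsWith("checkpoint."))
                 .sorted() // alphabetical, not numeric, ordering; good enough for a quick look
                 .forEach(QueueCheckpointReport::report);
        }
    }

    static void report(Path checkpoint) {
        try {
            ByteBuffer b = ByteBuffer.wrap(Files.readAllBytes(checkpoint)); // big-endian by default
            int version           = Short.toUnsignedInt(b.getShort());
            long page             = Integer.toUnsignedLong(b.getInt());
            long firstUnackedPage = Integer.toUnsignedLong(b.getInt());
            long firstUnackedSeq  = b.getLong();
            long minSeq           = b.getLong();
            long elementCount     = Integer.toUnsignedLong(b.getInt());
            boolean fullyAcked = firstUnackedSeq >= minSeq + elementCount;
            Path pageFile = checkpoint.resolveSibling("page." + page);
            String size = Files.exists(pageFile) ? String.valueOf(Files.size(pageFile)) : "NA";
            System.out.printf(
                "%s, %s, size: %s, version=%d page=%d firstUnackedPage=%d firstUnackedSeq=%d minSeq=%d elementCount=%d%n",
                checkpoint.getFileName(), fullyAcked ? "FA" : "UN", size,
                version, page, firstUnackedPage, firstUnackedSeq, minSeq, elementCount);
        } catch (IOException e) {
            System.err.println("failed to read " + checkpoint + ": " + e);
        }
    }
}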

@colinsurprenant
Contributor

For some reason, #7809, which originally fixed the "Page file size 0 different to configured ..." startup exception, was incorrectly merged into the 5.6 branch, and this is why we see this problem again. I re-injected the fix and it will be part of the upcoming 5.6.5 release. This is really unfortunate.

I will go ahead and close this issue; feel free to re-open if needed.
