Timeout/Error when processing text books #642

KnowledgeGarden opened this issue Sep 26, 2020 · 11 comments

@KnowledgeGarden

Latest master 0.6.2 running on a Xeon 32 GB 1U Ubuntu 18.04 server.
Fed it a 374 MB biology textbook from the Java client on a MacBook Pro. It finished once and produced a 1.7 MB XML file that looks clean.
Fed it 3 more (smaller) textbooks. Crashed.
Restarted and fed it all 4 books, one at a time. Crashed on each, including the biology textbook it had completed before.
Set the client to 1 thread. No change. Did not experiment with settings on the server.
Gist of the failure is here

Brief summary:
ERROR [2020-09-26 17:40:37,587] org.grobid.core.process.ProcessPdfToXml: pdftoxml process finished with error code: 143. [/home/chief/projects/grobid-installation/grobid-home/pdf2xml/lin-64/pdfalto_server, -noImageInline, -fullFontName, -noLineNumbers, -noImage, -annotation, -filesLimit, 2000, /home/chief/projects/grobid-installation/grobid-home/tmp/origin4205259084261240168.pdf, /home/chief/projects/grobid-installation/grobid-home/tmp/e4XfWM2BZM.lxml]
ERROR [2020-09-26 17:40:37,587] org.grobid.core.process.ProcessPdfToXml: pdftoxml return message:

ERROR [2020-09-26 17:40:37,590] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs.
! org.grobid.core.exceptions.GrobidException: [TIMEOUT] PDF to XML conversion timed out
! at org.grobid.core.document.DocumentSource.processPdfToXmlServerMode(DocumentSource.java:237)

@KnowledgeGarden
Author

I have some evidence that this is a memory issue.
I just ran 6 smaller PDF reports in batch mode and it worked fine.
The confusing aspect of this issue is that it did, in fact, process a 374 MB PDF just fine, once, and never again.

@kermitt2
Owner

Hello @KnowledgeGarden !

There is no crash; the PDF parsing part (done by an external process with pdfalto) is just timing out. That is actually a protection to avoid a crash and keep the system running.

Grobid is currently designed for independent articles, chapters, short reports and similar short documents, not books, full proceedings or PhD theses. There was an effort to support books with an additional model (to segment the book/proceedings), but it is not progressing at the moment.

You should be able to process the fat document by increasing the timeout in grobid/grobid-home/config/grobid.properties (and, if that is not enough, you can increase the memory limit for the PDF parsing too):

grobid.3rdparty.pdf2xml.memory.limit.mb=6096
grobid.3rdparty.pdf2xml.timeout.sec=60

But the result won't look good, because it's a textbook and the segmented areas will be messed up.
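For scripted deployments, the two properties above can be bumped programmatically. A minimal sketch, assuming a standard `key=value` properties file; the helper name `set_property` and the values (the ones tried later in this thread) are my own, not part of Grobid:

```python
# Hypothetical sketch: raise the pdfalto limits in grobid.properties.
# Only the two keys shown in this thread are touched; everything else
# in the file is preserved as-is.
def set_property(text: str, key: str, value: str) -> str:
    """Replace any 'key=...' line with 'key=value', leaving other lines intact."""
    out = []
    for line in text.splitlines():
        if line.split("=", 1)[0].strip() == key:
            line = f"{key}={value}"
        out.append(line)
    return "\n".join(out)

conf = ("grobid.3rdparty.pdf2xml.memory.limit.mb=6096\n"
        "grobid.3rdparty.pdf2xml.timeout.sec=60")
conf = set_property(conf, "grobid.3rdparty.pdf2xml.timeout.sec", "300")
conf = set_property(conf, "grobid.3rdparty.pdf2xml.memory.limit.mb", "16384")
print(conf)
```

The same edit can of course be done by hand in any text editor; this is only useful if you rebuild the config as part of an install script.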

@kermitt2 kermitt2 changed the title Unknown error-Fulltext model Timeout when processing text books Sep 26, 2020
@KnowledgeGarden
Author

Thanks very much! I raised the memory limit to 16384 MB and the timeout to 300 s. It failed again, but with no [TIMEOUT] anywhere to be found; the test was on the very same 374 MB PDF that it did successfully read once before.

The gist is here

Summary:
INFO [2020-09-26 18:40:33,278] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 1/10
ERROR [2020-09-26 18:40:59,174] org.grobid.core.process.ProcessPdfToXml: pdftoxml process finished with error code: 143. [/home/chief/projects/grobid-installation/grobid-home/pdf2xml/lin-64/pdfalto_server, -noImageInline, -fullFontName, -noLineNumbers, -noImage, -annotation, -filesLimit, 2000, /home/chief/projects/grobid-installation/grobid-home/tmp/origin229373444390319906.pdf, /home/chief/projects/grobid-installation/grobid-home/tmp/ZzevrbmVzB.lxml]
ERROR [2020-09-26 18:40:59,174] org.grobid.core.process.ProcessPdfToXml: pdftoxml return message:

! at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)

@KnowledgeGarden
Author

Would be happy to compress and upload the completed text book if interested.

@kermitt2
Owner

Thank you! Error code 143 for the external process can mean almost anything (it means the OS killed the process to avoid something bad :).

So yes, at this stage having the textbook PDF will help to understand the problem at the level of pdfalto.
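Exit code 143 is 128 + 15, the shell convention for a process terminated by SIGTERM, which is consistent with a watchdog killing pdfalto rather than pdfalto crashing on its own. A minimal sketch of that POSIX semantics (nothing Grobid-specific; assumes Linux/macOS with a `sleep` binary):

```python
import subprocess
import time

# Start a long-running child, then terminate it the way a timeout
# watchdog would: by delivering SIGTERM (signal 15).
child = subprocess.Popen(["sleep", "60"])
time.sleep(0.2)
child.terminate()        # sends SIGTERM
rc = child.wait()        # Popen reports -15 for "killed by signal 15"
shell_style = 128 - rc   # shells report 128 + signal number
print(shell_style)       # 143, matching the pdftoxml error code in the log
```

This is why raising the timeout alone may not help: any supervisor (timeout guard, memory limit enforcement, OS OOM handling) that sends SIGTERM produces the same 143.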


@kermitt2 kermitt2 changed the title Timeout when processing text books Timeout/Error when processing text books Sep 26, 2020
@KnowledgeGarden
Author

The original textbook is the PDF on this page

@KnowledgeGarden
Author

Side note: it read an 800 MB textbook, but not without this error message:
ERROR [2020-09-26 19:31:53,151] org.grobid.core.document.Document: Cannot parse
file: /home/chief/projects/grobid-installation/grobid-home/tmp/WQ356GGngN.lxml_annot.xml

@kermitt2
Owner

Thank you @KnowledgeGarden

With 1447 pages and a 374 MB PDF, this is clearly not the kind of document that Grobid can structure at the moment; Grobid is basically tuned to fail safely on this kind of PDF rather than to run it to completion.

You will have a better chance of getting it processed in batch mode, which is single-threaded, than with the server, because the server has additional safety mechanisms that stop the process (to keep the server safe and able to continue processing more documents).

Having said that, when running pdfalto directly, there's no error and it takes only 14.5 s:

time ~/grobid/grobid-home/pdf2xml/lin-64/pdfalto -noImageInline -fullFontName -noLineNumbers -noImage -annotation -filesLimit 2000 /home/lopez/Downloads/Biology2e-WEB_ICOFkGu.pdf ~/tmp/ZzevrbmVzB.lxml

real	0m14.568s

So it's really the Grobid calling process that stops/kills the PDF parsing out of safety/paranoia. I think it makes sense to revisit this part only when a book-level model is available and we start supporting full "monographs".

(By the way, what a really great text book!)

ERROR [2020-09-26 19:31:53,151] org.grobid.core.document.Document: Cannot parse
file: /home/chief/projects/grobid-installation/grobid-home/tmp/WQ356GGngN.lxml_annot.xml

There's some not-well-formed XML generated by pdfalto for the files capturing the outline and annotations in the PDF (there's all sorts of PDF dirt in there), but it won't affect the rest of the main content. It will be fixed in pdfalto at some point in the future :/

@KnowledgeGarden
Author

Thank you @kermitt2 !!!
I ran it in batch (command-line) mode. Raised the memory in the config to 12 GB, and ended up adding a 0 to max tokens (it ran out).
It produced a marvelously clean XML file, so clean that there was no content in the block. This error appeared in the console:
SEVERE: Cannot parse file: /Users/jackpark/Documents/gitprojects5/grobid/grobid-home/tmp/j8q2dxn7WB.lxml_annot.xml

This run is on my 16 GB MacBook Pro. I shall next try building pdfalto and see how that works.

My goal is to populate my OpenSherlock machine reading platform with textbooks before diving into publications. I hope that Grobid will rise to the occasion.

@KnowledgeGarden
Author

Sigh. pdfalto works, but it really feels like I'd be better off just exporting those PDFs to plain text and reading that. There are plenty of libraries to split out paragraphs and sentences; I don't need fonts, positions, and all that other data. pdfalto really shows off the power of the models in Grobid. I'll just have to wait for that to handle files > 100 MB; it seems to work on files smaller than that.
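The plain-text route mentioned above can be sketched with nothing but the standard library: extract the text first (e.g. with pdftotext, not shown here), then split it into paragraphs and sentences. The regexes and function names below are illustrative, not from any of the libraries discussed in this thread:

```python
import re

def paragraphs(text: str) -> list[str]:
    """Split on blank lines, collapsing internal line breaks and runs of spaces."""
    return [" ".join(p.split()) for p in re.split(r"\n\s*\n", text) if p.strip()]

def sentences(paragraph: str) -> list[str]:
    """Naive split after sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]

sample = "Cells divide.\nMitosis follows.\n\nA new paragraph."
for para in paragraphs(sample):
    print(sentences(para))
```

A real pipeline would want a proper sentence segmenter (abbreviations like "e.g." defeat the naive regex), but for feeding a machine-reading platform this is often a workable first pass.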
