Timeout/Error when processing text books #642

KnowledgeGarden opened this issue Sep 26, 2020 · 11 comments

@KnowledgeGarden

Latest master 0.6.2 running on a Xeon 32 GB 1U Ubuntu 18.04 server.
Fed it a 374 MB biology textbook from the Java client on a MacBook Pro. It finished once and produced a 1.7 MB XML file that looks clean.
Fed it 3 more (smaller) textbooks. Crashed.
Restarted and fed it all 4 books, one at a time. Crashed on each, including the biology textbook it had completed before.
Set the client to 1 thread. No change. Did not experiment with settings on the server.
Gist of the failure is here

Brief summary:
ERROR [2020-09-26 17:40:37,587] org.grobid.core.process.ProcessPdfToXml: pdftoxml process finished with error code: 143. [/home/chief/projects/grobid-installation/grobid-home/pdf2xml/lin-64/pdfalto_server, -noImageInline, -fullFontName, -noLineNumbers, -noImage, -annotation, -filesLimit, 2000, /home/chief/projects/grobid-installation/grobid-home/tmp/origin4205259084261240168.pdf, /home/chief/projects/grobid-installation/grobid-home/tmp/e4XfWM2BZM.lxml]
ERROR [2020-09-26 17:40:37,587] org.grobid.core.process.ProcessPdfToXml: pdftoxml return message:

ERROR [2020-09-26 17:40:37,590] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs.
! org.grobid.core.exceptions.GrobidException: [TIMEOUT] PDF to XML conversion timed out
! at org.grobid.core.document.DocumentSource.processPdfToXmlServerMode(DocumentSource.java:237)

@KnowledgeGarden
Author

I have some evidence that this is a memory issue.
I just ran 6 smaller PDF reports in batch mode and it worked fine.
The confusing aspect of this issue is that it did, in fact, process a 374 MB PDF just fine, once, and never again.

@kermitt2
Owner

Hello @KnowledgeGarden !

There is no crash; the PDF parsing part (done by an external process with pdfalto) is just timing out. That is actually a protection to avoid a crash and keep the system running.

Grobid is currently designed for independent articles, chapters, short reports and similar short documents, not books, full proceedings or PhD theses. There was an effort to support books with an additional model (to segment the book/proceedings), but it is not progressing at the moment.

You should be able to process the fat document by increasing the timeout in grobid/grobid-home/config/grobid.properties (and, if that is not enough, you can increase the memory limit for the PDF parsing too):

grobid.3rdparty.pdf2xml.memory.limit.mb=6096
grobid.3rdparty.pdf2xml.timeout.sec=60

But the result won't look good, because it's a textbook and the segmented areas will be messed up.
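For scripted deployments, the two properties above can be bumped programmatically. A minimal sketch, assuming a standard `key=value` properties file; the helper name `set_property` and the values (the ones tried later in this thread) are my own, not part of Grobid:

```python
# Hypothetical sketch: raise the pdfalto limits in grobid.properties.
# Only the two keys shown in this thread are touched; everything else
# in the file is preserved as-is.
def set_property(text: str, key: str, value: str) -> str:
    """Replace any 'key=...' line with 'key=value', leaving other lines intact."""
    out = []
    for line in text.splitlines():
        if line.split("=", 1)[0].strip() == key:
            line = f"{key}={value}"
        out.append(line)
    return "\n".join(out)

conf = ("grobid.3rdparty.pdf2xml.memory.limit.mb=6096\n"
        "grobid.3rdparty.pdf2xml.timeout.sec=60")
conf = set_property(conf, "grobid.3rdparty.pdf2xml.timeout.sec", "300")
conf = set_property(conf, "grobid.3rdparty.pdf2xml.memory.limit.mb", "16384")
print(conf)
```

The same edit can of course be done by hand in any text editor; this is only useful if you rebuild the config as part of an install script.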

@kermitt2 kermitt2 changed the title Unknown error-Fulltext model Timeout when processing text books Sep 26, 2020
@KnowledgeGarden
Author

Thanks very much! I raised the memory limit to 16384 MB and the timeout to 300 s. It failed again, but with no [TIMEOUT] anywhere to be found; the test was on the very same 374 MB PDF that it did successfully read once before.

The gist is here

Summary:
INFO [2020-09-26 18:40:33,278] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 1/10
ERROR [2020-09-26 18:40:59,174] org.grobid.core.process.ProcessPdfToXml: pdftoxml process finished with error code: 143. [/home/chief/projects/grobid-installation/grobid-home/pdf2xml/lin-64/pdfalto_server, -noImageInline, -fullFontName, -noLineNumbers, -noImage, -annotation, -filesLimit, 2000, /home/chief/projects/grobid-installation/grobid-home/tmp/origin229373444390319906.pdf, /home/chief/projects/grobid-installation/grobid-home/tmp/ZzevrbmVzB.lxml]
ERROR [2020-09-26 18:40:59,174] org.grobid.core.process.ProcessPdfToXml: pdftoxml return message:

! at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)

@KnowledgeGarden
Author

Would be happy to compress and upload the completed text book if interested.

@kermitt2
Owner

Thank you! Error code 143 for the external process can mean almost anything (it means the OS killed the process to avoid something bad :).

So yes, at this stage having the textbook PDF will help to understand the problem at the level of pdfalto.
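Exit code 143 is 128 + 15, the shell convention for a process terminated by SIGTERM, which is consistent with a watchdog killing pdfalto rather than pdfalto crashing on its own. A minimal sketch of that POSIX semantics (nothing Grobid-specific; assumes Linux/macOS with a `sleep` binary):

```python
import subprocess
import time

# Start a long-running child, then terminate it the way a timeout
# watchdog would: by delivering SIGTERM (signal 15).
child = subprocess.Popen(["sleep", "60"])
time.sleep(0.2)
child.terminate()        # sends SIGTERM
rc = child.wait()        # Popen reports -15 for "killed by signal 15"
shell_style = 128 - rc   # shells report 128 + signal number
print(shell_style)       # 143, matching the pdftoxml error code in the log
```

This is why raising the timeout alone may not help: any supervisor (timeout guard, memory limit enforcement, OS OOM handling) that sends SIGTERM produces the same 143.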


@kermitt2 kermitt2 changed the title Timeout when processing text books Timeout/Error when processing text books Sep 26, 2020
@KnowledgeGarden
Author

The original textbook is the PDF on this page

@KnowledgeGarden
Author

Side note: it read an 800 MB textbook, but not without this error message:
ERROR [2020-09-26 19:31:53,151] org.grobid.core.document.Document: Cannot parse
file: /home/chief/projects/grobid-installation/grobid-home/tmp/WQ356GGngN.lxml_annot.xml

@kermitt2
Owner

Thank you @KnowledgeGarden

With 1447 pages and a 374 MB PDF, this is clearly not the kind of document that Grobid can structure at the moment; Grobid is basically tuned to fail safely on this kind of PDF rather than to run it to completion.

You will have a better chance of getting it processed in batch mode, which is single-threaded, than with the server, because the server has additional safety mechanisms that stop the process (to keep the server safe and able to continue processing more documents).

Having said that, when running pdfalto directly, there's no error and it takes only 14.5 s:

time ~/grobid/grobid-home/pdf2xml/lin-64/pdfalto -noImageInline -fullFontName -noLineNumbers -noImage -annotation -filesLimit 2000 /home/lopez/Downloads/Biology2e-WEB_ICOFkGu.pdf ~/tmp/ZzevrbmVzB.lxml

real	0m14.568s

So it's really the Grobid calling process that stops/kills the PDF parsing out of safety/paranoia. I think it makes sense to revisit this part only when a book-level model is available and we start supporting full "monographs".

(By the way, what a really great text book!)

ERROR [2020-09-26 19:31:53,151] org.grobid.core.document.Document: Cannot parse
file: /home/chief/projects/grobid-installation/grobid-home/tmp/WQ356GGngN.lxml_annot.xml

There's some not-well-formed XML generated by pdfalto for the files capturing the outline and annotations in the PDF (there's all sorts of PDF dirt in there), but it won't affect the rest of the main content. It will be fixed in pdfalto at some point in the future :/

@KnowledgeGarden
Author

Thank you @kermitt2 !!!
I ran it in batch (command-line) mode. Raised the memory in the config to 12 GB, and ended up adding a 0 to max tokens (it ran out).
It produced a marvelously clean XML file, so clean that there was no content in the block. This error appeared in the console:
SEVERE: Cannot parse file: /Users/jackpark/Documents/gitprojects5/grobid/grobid-home/tmp/j8q2dxn7WB.lxml_annot.xml

This run is on my 16 GB MacBook Pro. I shall next try building pdfalto and see how that works.

My goal is to populate my OpenSherlock machine reading platform with textbooks before diving into publications. I hope that Grobid will rise to the occasion.

@KnowledgeGarden
Author

Sigh. pdfalto works, but it really feels like I'd be better off just exporting those PDFs to plain text and reading that. There are plenty of libraries to split out paragraphs and sentences; I don't need fonts, positions, and all that other data. pdfalto really shows off the power of the models in Grobid. I'll just have to wait for that to handle files > 100 MB; it seems to work on files smaller than that.
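The plain-text route mentioned above can be sketched with nothing but the standard library: extract the text first (e.g. with pdftotext, not shown here), then split it into paragraphs and sentences. The regexes and function names below are illustrative, not from any of the libraries discussed in this thread:

```python
import re

def paragraphs(text: str) -> list[str]:
    """Split on blank lines, collapsing internal line breaks and runs of spaces."""
    return [" ".join(p.split()) for p in re.split(r"\n\s*\n", text) if p.strip()]

def sentences(paragraph: str) -> list[str]:
    """Naive split after sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]

sample = "Cells divide.\nMitosis follows.\n\nA new paragraph."
for para in paragraphs(sample):
    print(sentences(para))
```

A real pipeline would want a proper sentence segmenter (abbreviations like "e.g." defeat the naive regex), but for feeding a machine-reading platform this is often a workable first pass.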
