Timeout/Error when processing text books #642
I have some evidence to support that this is a memory issue.
Hello @KnowledgeGarden! There is no crash; the PDF parsing part in Grobid (done by an external process) is currently designed for independent articles, chapters, short reports and that kind of short document, not books, full proceedings or PhD theses. There was an effort to support books with an additional model (to segment the book/proceedings), but it is not progressing at the moment. You should be able to process the fat document by increasing the timeout (see the configuration), but the result won't look good, because it's a text book and the areas will be messed up.
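A rough sketch of what raising those limits looks like in grobid-home/config/grobid.properties; the exact property names differ across Grobid versions, so treat the keys below as an assumption to check against your own file rather than as the definitive names:

```properties
# Hypothetical grobid.properties excerpt -- verify the key names in your own
# grobid-home/config/grobid.properties, they vary slightly between versions.
grobid.pdfalto.memory.limit.mb=16384   # max memory granted to the external pdfalto process
grobid.pdfalto.timeout.sec=300         # how long Grobid waits before killing pdfalto
```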
Thanks very much! I raised mb to 16384 and sec to 300. It failed again, but with no [TIMEOUT] anywhere to be found; the test was on the very same 374MB PDF that it did successfully read once before. The gist is here. Summary:
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
Would be happy to compress and upload the complete text book if interested.
Thank you! Error 143 for the external process can be a bit of anything (143 = 128 + SIGTERM, i.e. the OS has killed the process to avoid something bad :). So yes, at this stage having the text book PDF will help to understand the problem at the level of pdfalto.
The original text book is the PDF on this page.
Side note: it read an 800MB text book, but not without this error message.
Thank you @KnowledgeGarden! At 1447 pages and 374MB, this is clearly not the kind of document that Grobid can structure at the moment, and Grobid is basically tuned to fail in a safe manner on this kind of PDF rather than to run it entirely. You will have a better chance of getting it processed in batch mode, which is single-threaded, than with the server, because the server has additional safety mechanisms to stop the process (to keep the server safe and able to continue processing more documents). Having said that, when running directly with pdfalto, there is no error and it takes only 14.5s.
So it's really the Grobid calling process which stops/kills the PDF parsing out of safety/paranoia. I think it makes sense to review this part only when a model for book-level segmentation is available and we start supporting full "monographs". (By the way, what a really great text book!)
There's some not-well-formed XML generated by pdfalto for the files capturing the outline and annotations in the PDF (there's all sorts of PDF dirt there), but it won't impact the rest of the processing or the main content. It will be fixed in pdfalto at some point in the future :/
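For reference, the batch mode mentioned above is driven from the onejar, and pdfalto can be run directly with the same flags Grobid passes it (taken from the error log later in this thread); the jar version and paths below are placeholders:

```sh
# Batch mode (single-threaded, no server watchdog) -- jar name/version and paths are placeholders
java -Xmx4G -jar grobid-core-0.6.2-onejar.jar \
     -gH /path/to/grobid-home \
     -dIn /path/to/pdf/input -dOut /path/to/tei/output \
     -exe processFullText

# Direct pdfalto run, roughly the command Grobid spawns internally
./pdfalto -fullFontName -noLineNumbers -noImage -annotation textbook.pdf textbook.xml
```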
Thank you @kermitt2!!! This run is on my 16GB MacBook Pro. I shall next try building pdfalto and see how that works. My goal is to populate my OpenSherlock machine reading platform with textbooks before diving into publications. I hope that Grobid will rise to the occasion.
Sigh. pdfalto works, but it really feels like I'd be better off just exporting those PDFs to plain text and reading that. There are plenty of libraries to split out paragraphs and sentences; I don't need fonts, positions and all that other data. pdfalto really shows off the power of those models in Grobid. I'll just have to wait for that to handle files > 100MB. It seems to work on files smaller than that.
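For what it's worth, the plain-text fallback described above is only a few lines with any PDF text-extraction library; the sketch below uses Apache PDFBox (my choice for illustration, not something Grobid or this thread prescribes), and the paragraph splitting is deliberately naive:

```java
// Minimal sketch of a plain-text export, assuming Apache PDFBox 2.x on the classpath.
// Grobid and pdfalto are not involved here.
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PlainTextExport {
    public static void main(String[] args) throws IOException {
        File pdf = new File(args[0]);                 // path to the textbook PDF
        try (PDDocument doc = PDDocument.load(pdf)) {
            String text = new PDFTextStripper().getText(doc);
            // Naive paragraph split on blank lines; proper sentence splitting
            // would need an NLP library and is out of scope for this sketch.
            String[] paragraphs = text.split("\\R\\s*\\R");
            System.out.println("Extracted " + paragraphs.length + " rough paragraphs");
        }
    }
}
```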
Latest master 0.6.2 running on a Xeon 32GB 1U Ubuntu 18.04 server.
Fed it a 374MB biology text book from a Java client on a MacBook Pro. It finished one time and produced a 1.7MB XML file that looks clean.
Fed it 3 more (smaller) text books. Crashed.
Restarted and fed it all 4 books, one at a time. Crashed on each, including the biology textbook it had completed before.
Set the client to 1 thread. No change. Did not experiment with settings on the server.
Gist of the failure is here. Brief summary:
ERROR [2020-09-26 17:40:37,587] org.grobid.core.process.ProcessPdfToXml: pdftoxml process finished with error code: 143. [/home/chief/projects/grobid-installation/grobid-home/pdf2xml/lin-64/pdfalto_server, -noImageInline, -fullFontName, -noLineNumbers, -noImage, -annotation, -filesLimit, 2000, /home/chief/projects/grobid-installation/grobid-home/tmp/origin4205259084261240168.pdf, /home/chief/projects/grobid-installation/grobid-home/tmp/e4XfWM2BZM.lxml]
ERROR [2020-09-26 17:40:37,587] org.grobid.core.process.ProcessPdfToXml: pdftoxml return message:
ERROR [2020-09-26 17:40:37,590] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs.
! org.grobid.core.exceptions.GrobidException: [TIMEOUT] PDF to XML conversion timed out
! at org.grobid.core.document.DocumentSource.processPdfToXmlServerMode(DocumentSource.java:237)
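For anyone trying to reproduce this path without the Java client, the same server endpoint can be exercised with a plain multipart request (host, port and file name below are placeholders):

```sh
# Sends the PDF to the fulltext endpoint of a locally running Grobid server
curl -v --form input=@./biology-textbook.pdf localhost:8070/api/processFulltextDocument
```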