[BUG] Document stuck at "processing" #1278

zandadoum · 2022-07-25T07:27:19Z

Description

A certain PDF causes Paperless to get stuck at "processing".
I have attached the file.

I only recently started using Paperless and have consumed 20 documents with no issues so far, but Paperless refuses to consume this file.
It is stuck at "processing" for some minutes, and then if I refresh the page (F5) it is as if nothing happened.

Steps to reproduce

Put the attached PDF file in the consume folder.
That's it.

Webserver logs

[2022-07-25 09:21:19,890] [INFO] [paperless.consumer] Consuming B1fDMHqLFES.pdf

[2022-07-25 09:21:19,896] [DEBUG] [paperless.consumer] Detected mime type: application/pdf

[2022-07-25 09:21:19,910] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser

[2022-07-25 09:21:19,923] [DEBUG] [paperless.consumer] Parsing B1fDMHqLFES.pdf...

[2022-07-25 09:22:04,047] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/src/../consume/B1fDMHqLFES.pdf

[2022-07-25 09:22:05,264] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/usr/src/paperless/src/../consume/B1fDMHqLFES.pdf', 'output_file': '/tmp/paperless/paperless-6gm3biet/archive.pdf', 'use_threads': True, 'jobs': 1, 'language': 'spa', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-6gm3biet/sidecar.txt'}

[2022-07-25 09:22:44,856] [DEBUG] [paperless.parsing.tesseract] Incomplete sidecar file: discarding.

[2022-07-25 09:23:29,129] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-6gm3biet/archive.pdf

[2022-07-25 09:23:29,130] [WARNING] [paperless.parsing.tesseract] Encountered an error while running OCR: No text was found in the original document. Attempting force OCR to get the text.

[2022-07-25 09:23:29,131] [DEBUG] [paperless.parsing.tesseract] Fallback: Calling OCRmyPDF with args: {'input_file': '/usr/src/paperless/src/../consume/B1fDMHqLFES.pdf', 'output_file': '/tmp/paperless/paperless-6gm3biet/archive-fallback.pdf', 'use_threads': True, 'jobs': 1, 'language': 'spa', 'output_type': 'pdfa', 'progress_bar': False, 'force_ocr': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-6gm3biet/sidecar-fallback.txt'}

Paperless-ngx version

1.7.1

Host OS

Synology DSM 218+, Docker

Installation method

Docker - official image

Browser

Chrome

Configuration changes

No response

Other

No response

shamoon · 2022-07-25T14:18:34Z

I would guess your machine is just running out of resources to process this. It processed fine for me, but note the logs: the pages are images, etc.

[2022-07-25 07:15:38,046] [INFO] [paperless.consumer] Consuming B1fDMHqLFES.pdf
[2022-07-25 07:15:51,224] [WARNING] [ocrmypdf._pipeline] Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
[2022-07-25 07:16:07,035] [WARNING] [paperless.parsing.tesseract] Encountered an error while running OCR: No text was found in the original document. Attempting force OCR to get the text.
[2022-07-25 07:16:08,626] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,629] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,630] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,634] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,635] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,635] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,641] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,645] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,648] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,653] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,657] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,662] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:26,502] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:26,571] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:26,625] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:26,713] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:27,513] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:27,617] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:31,096] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2022-07-25 07:16:31,096] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:32,928] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2022-07-25 07:16:32,928] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:34,213] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2022-07-25 07:16:44,846] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2022-07-25 07:16:44,924] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2022-07-25 07:16:51,155] [WARNING] [ocrmypdf._pipeline] Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
[2022-07-25 07:16:56,500] [WARNING] [ocrmypdf._validation] The output file size is 2.23× larger than the input file.
Possible reasons for this include:
The argument --force-ocr was issued, causing transcoding.
The argument --deskew was issued, causing transcoding.
PDF/A conversion was enabled. (Try `--output-type pdf`.)
[2022-07-25 07:16:58,282] [INFO] [paperless.handlers] Assigning document type Report to 2020-07-11 B1fDMHqLFES
[2022-07-25 07:16:58,414] [INFO] [paperless.consumer] Document 2020-07-11 B1fDMHqLFES consumption finished
07:16:58 [Q] INFO Process-1:58 stopped doing work
07:16:58 [Q] INFO Processed [B1fDMHqLFES.pdf]
07:16:59 [Q] INFO recycled worker Process-1:58

stumpylog · 2022-07-25T17:31:11Z

I'm also able to successfully load the document.

wittd19 · 2022-07-29T21:42:34Z

I installed paperless-ngx a few weeks ago, and I too am running into problems processing documents, the same as the OP: they get stuck in "processing" and then just seem to fail silently. This has happened on multiple documents. Initially I thought it was size related, as all the documents I've seen fail have been > 5 MB; however, I also get errors when trying to process the OP's file.

I'm deployed on Unraid using the lsio Docker image.
I've tried all sorts of changes to the startup attributes.

Here are the logs when trying to process the OP's file by dropping it in the consume folder:

[2022-07-29 20:13:11,377] [INFO] [paperless.management.consumer] Adding /data/consume/B1fDMHqLFES-test4.pdf to the task queue.
[2022-07-29 20:13:11,381] [INFO] [paperless.management.consumer] Using inotify to watch directory for changes: /data/consume
[2022-07-29 20:13:11,648] [INFO] [paperless.consumer] Consuming B1fDMHqLFES-test4.pdf
[2022-07-29 20:13:11,658] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-07-29 20:13:11,662] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-07-29 20:13:11,664] [DEBUG] [paperless.consumer] Parsing B1fDMHqLFES-test4.pdf...
[2022-07-29 20:13:24,620] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /data/consume/B1fDMHqLFES-test4.pdf
[2022-07-29 20:13:25,029] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/data/consume/B1fDMHqLFES-test4.pdf', 'output_file': '/tmp/paperless/paperless-d5cnpdg0/archive.pdf', 'use_threads': True, 'jobs': 4, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-d5cnpdg0/sidecar.txt'}
[2022-07-29 20:13:35,822] [DEBUG] [paperless.parsing.tesseract] Incomplete sidecar file: discarding.
[2022-07-29 20:13:48,585] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-d5cnpdg0/archive.pdf
[2022-07-29 20:13:48,587] [WARNING] [paperless.parsing.tesseract] Encountered an error while running OCR: No text was found in the original document. Attempting force OCR to get the text.
[2022-07-29 20:13:48,587] [DEBUG] [paperless.parsing.tesseract] Fallback: Calling OCRmyPDF with args: {'input_file': '/data/consume/B1fDMHqLFES-test4.pdf', 'output_file': '/tmp/paperless/paperless-d5cnpdg0/archive-fallback.pdf', 'use_threads': True, 'jobs': 4, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'force_ocr': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-d5cnpdg0/sidecar-fallback.txt'}

Dropping it in the UI results in this:

and seemingly the same log output:

[2022-07-29 20:30:35,821] [INFO] [paperless.consumer] Consuming B1fDMHqLFES.pdf
[2022-07-29 20:30:35,824] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-07-29 20:30:35,826] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-07-29 20:30:35,828] [DEBUG] [paperless.consumer] Parsing B1fDMHqLFES.pdf...
[2022-07-29 20:30:48,672] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-upload-csbzt7_j
[2022-07-29 20:30:48,794] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/tmp/paperless/paperless-upload-csbzt7_j', 'output_file': '/tmp/paperless/paperless-t48pb7wb/archive.pdf', 'use_threads': True, 'jobs': 4, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-t48pb7wb/sidecar.txt'}
[2022-07-29 20:30:59,523] [DEBUG] [paperless.parsing.tesseract] Incomplete sidecar file: discarding.
[2022-07-29 20:31:11,832] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-t48pb7wb/archive.pdf
[2022-07-29 20:31:11,833] [WARNING] [paperless.parsing.tesseract] Encountered an error while running OCR: No text was found in the original document. Attempting force OCR to get the text.
[2022-07-29 20:31:11,834] [DEBUG] [paperless.parsing.tesseract] Fallback: Calling OCRmyPDF with args: {'input_file': '/tmp/paperless/paperless-upload-csbzt7_j', 'output_file': '/tmp/paperless/paperless-t48pb7wb/archive-fallback.pdf', 'use_threads': True, 'jobs': 4, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'force_ocr': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-t48pb7wb/sidecar-fallback.txt'}

You mentioned resources, and while I am running on Unraid with a lot of other Docker containers, I never see a spike in CPU or RAM usage during file processing, and I am running an i7-8700 with 8GB RAM.

Happy to test anything needed. Are there any performance related settings I can try to change?

Thanks..

stumpylog · 2022-07-29T21:51:52Z

I retested with 1.8.0, and the document still processes for me. For this file, there's no text content found, so it forces OCR, meaning it has to process a lot of images, which means a lot of time.

The default timeout for working on a file is 1800s, or 30 minutes. If the document doesn't complete by then, it will be marked as failed.
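To judge how close a run gets to that limit, the timestamps in the log excerpts can be subtracted from each other. The following is only a minimal sketch: the helper names are hypothetical, the timestamp layout is assumed from the log lines quoted in this thread, and the second sample line is shortened for readability.

```python
from datetime import datetime

# Timestamp layout assumed from the log lines in this thread, e.g.
# "[2022-07-25 09:21:19,890] [INFO] [paperless.consumer] ..."
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
DEFAULT_TIMEOUT_SECONDS = 1800  # default worker timeout mentioned above


def parse_log_time(line: str) -> datetime:
    """Return the timestamp held in the leading [...] field of a log line."""
    stamp = line[1:line.index("]")]
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


def elapsed_seconds(first_line: str, last_line: str) -> float:
    """Return the number of seconds between two log lines."""
    delta = parse_log_time(last_line) - parse_log_time(first_line)
    return delta.total_seconds()


# First and last lines of the original report (second one shortened here).
first = "[2022-07-25 09:21:19,890] [INFO] [paperless.consumer] Consuming B1fDMHqLFES.pdf"
last = "[2022-07-25 09:23:29,131] [DEBUG] [paperless.parsing.tesseract] Fallback: Calling OCRmyPDF"

seconds = elapsed_seconds(first, last)
print(f"{seconds:.3f}s logged, {DEFAULT_TIMEOUT_SECONDS - seconds:.3f}s left before the default timeout")
```

In the original report the last logged line appears roughly two minutes after consumption starts, so the log on its own does not show whether the forced OCR pass later hit the timeout or ended for another reason.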

wittd19 · 2022-07-29T22:08:35Z

I've also tried adding this to my Docker config to give it a full hour to process; same result.

wittd19 · 2022-07-29T23:45:48Z

You mentioned 'marked as failed'.
Should there be a log event at 30/60 minutes when it times out?
Also, under admin, failed tasks, there is no entry when this occurs.

stumpylog · 2022-07-30T02:44:50Z

Ok, I think I see what the issue is. I would bet you'll see in the log a single line like:

[Q] WARNING reincarnated worker Process-1:7 after timeout

That's not much, and it's certainly not helpful to see the WebUI seemingly still working away when the background worker has given up. I'll need to look into what a dependency does and see if it can be improved.
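For anyone checking their own logs, the timeout line quoted above can be picked out with a small filter. This is a rough sketch, not part of paperless-ngx: the function name is hypothetical and the pattern is assumed from the single example line in this comment.

```python
import re
from typing import Iterable, List

# Pattern assumed from the example line quoted in this thread:
# "[Q] WARNING reincarnated worker Process-1:7 after timeout"
TIMEOUT_PATTERN = re.compile(r"\[Q\] WARNING reincarnated worker (\S+) after timeout")


def find_timed_out_workers(lines: Iterable[str]) -> List[str]:
    """Return the worker names from every line that reports a timeout."""
    workers = []
    for line in lines:
        match = TIMEOUT_PATTERN.search(line)
        if match:
            workers.append(match.group(1))
    return workers


sample_log = [
    "07:16:58 [Q] INFO Processed [B1fDMHqLFES.pdf]",
    "[Q] WARNING reincarnated worker Process-1:7 after timeout",
]
print(find_timed_out_workers(sample_log))
```

An empty result only means no such line was found in the lines given; it does not rule out other causes for a stuck document.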

That doesn't help with the document still timing out, but I don't see anything that can be done for that besides increasing the timeout. It does complete; it's just a lot of processing. From within the container, you could run the following and see how long it actually takes:

time ocrmypdf --force-ocr --clean --deskew --rotate-pages B1fDMHqLFES.pdf output.pdf

You could also just use that output, since it will be OCRed.
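The same timing check could be scripted if several files need testing. The sketch below assumes ocrmypdf is on the PATH inside the container and uses only the flags from the command above; the function names are hypothetical.

```python
import subprocess
import sys
import time
from typing import List, Tuple


def ocr_command(input_pdf: str, output_pdf: str) -> List[str]:
    """Build the ocrmypdf call suggested above, using the same flags."""
    return ["ocrmypdf", "--force-ocr", "--clean", "--deskew", "--rotate-pages",
            input_pdf, output_pdf]


def run_timed(command: List[str]) -> Tuple[int, float]:
    """Run a command and return (return code, wall-clock seconds)."""
    start = time.monotonic()
    completed = subprocess.run(command, check=False)
    return completed.returncode, time.monotonic() - start


# Only runs OCR when called as: python this_script.py input.pdf output.pdf
if __name__ == "__main__" and len(sys.argv) == 3:
    code, seconds = run_timed(ocr_command(sys.argv[1], sys.argv[2]))
    print(f"ocrmypdf exited with code {code} after {seconds:.1f}s")
```

Returning the exit code alongside the duration matters here, because a run that ends quickly with a non-zero or negative code did not simply finish fast.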

wittd19 · 2022-07-30T03:43:33Z

Thanks for taking the time to look at this.

Here's the result of that test:

wittd19 · 2022-08-08T02:20:29Z

Anything else to look at here?
I'm not sure what that 'killed' message means, but it's not hitting the configured timeout: 'killed' seems to happen in less than 30 seconds. This is actually happening on many of the files I am attempting to upload, so I can provide more examples if that helps diagnose.

stumpylog · 2022-08-08T14:28:45Z

For the timeout not being so visible, I'm working on a solution for that.

The "Killed" printed above is from the out-of-memory (OOM) killer terminating the process. That also might be the cause of an eternally processing document, and I don't think there's any way to surface that to a user.
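When running a command by hand, as in the test above, a process ended this way shows up as terminated by SIGKILL. The following POSIX-only sketch (hypothetical helper name, not something paperless-ngx does) shows how Python's subprocess module reports that. A SIGKILL is only a hint: it can come from other sources too, and confirming the OOM killer would need the host's kernel log.

```python
import signal
import subprocess
import sys


def killed_by_sigkill(returncode: int) -> bool:
    """True if a subprocess return code reports termination by SIGKILL.

    On POSIX, subprocess reports death by signal N as return code -N.
    """
    return returncode == -signal.SIGKILL


# Demonstration: a child process that sends SIGKILL to itself, standing in
# for a process terminated from outside.
child = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"],
    check=False,
)
print("killed by SIGKILL:", killed_by_sigkill(child.returncode))
```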

wittd19 · 2022-08-22T02:23:07Z

Just a note: after updating to v1.8.0 I can see these failed documents in the new "File Tasks" view (which is pretty cool, btw), but lots of documents are still failing for me.

hawkinspeter · 2022-08-22T17:40:38Z

Running v1.8.0 on an RPi4 Docker swarm, and I am getting this problem too.

Edit: It looks like they did eventually get processed.

stumpylog · 2022-09-07T18:50:01Z

Our next release will include improvements to how worker timeouts are handled. They will be much more visible (see examples paperless-ngx/django-q#2 (comment)) in the UI.

If it's the OOM killer, that still won't be obvious; there just isn't a way to detect that. Hopefully, upcoming improvements in underlying libraries like pikepdf and qpdf will help reduce the occurrences.

I'm going to close this out, as I believe what we can do here is now fixed.

nheine · 2022-12-29T09:28:35Z

For me the timeout works well: it appears in the logs after the default 1800s. However, on the dashboard the task for that document is still shown as processing, in green. Only once the browser is closed and reopened, or I open the page in another browser, does this disappear. So it looks like there is no feedback to the web UI once processing times out.

github-actions · 2023-04-15T03:02:36Z

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

zandadoum added the bug and unconfirmed labels Jul 25, 2022

paperlessngx-bot added this to Paperless-ngx Jul 25, 2022

paperlessngx-bot moved this to Todo in Paperless-ngx Jul 25, 2022

shamoon added the cant-reproduce label Jul 25, 2022

stumpylog removed the unconfirmed label Jul 25, 2022

stumpylog changed the title ~~[BUG] Concise description of the issue~~ [BUG] Document stuck at "processing" Jul 27, 2022

stumpylog added the backend and dependencies labels and removed the cant-reproduce label Jul 30, 2022

stumpylog self-assigned this Aug 8, 2022

stumpylog closed this as completed Sep 7, 2022

Repository owner moved this from Todo to Done in Paperless-ngx Sep 7, 2022

github-actions bot locked as resolved and limited conversation to collaborators Apr 15, 2023
