[BUG] Document stuck at "processing" #1278

Closed
zandadoum opened this issue Jul 25, 2022 · 15 comments
Assignees
Labels
backend, bug, dependencies

Comments

@zandadoum

Description

A certain PDF causes Paperless to get stuck at "processing".
I have attached the file:

B1fDMHqLFES.pdf

I only recently started using Paperless and have consumed 20 documents with no issues so far, but it refuses to consume this file.
It stays at "processing" for some minutes, and if I then refresh the page (F5), it is as if nothing had happened.

Steps to reproduce

  1. Put the attached PDF file in the consume folder.
  2. That's it.

Webserver logs

[2022-07-25 09:21:19,890] [INFO] [paperless.consumer] Consuming B1fDMHqLFES.pdf

[2022-07-25 09:21:19,896] [DEBUG] [paperless.consumer] Detected mime type: application/pdf

[2022-07-25 09:21:19,910] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser

[2022-07-25 09:21:19,923] [DEBUG] [paperless.consumer] Parsing B1fDMHqLFES.pdf...

[2022-07-25 09:22:04,047] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/src/../consume/B1fDMHqLFES.pdf

[2022-07-25 09:22:05,264] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/usr/src/paperless/src/../consume/B1fDMHqLFES.pdf', 'output_file': '/tmp/paperless/paperless-6gm3biet/archive.pdf', 'use_threads': True, 'jobs': 1, 'language': 'spa', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-6gm3biet/sidecar.txt'}

[2022-07-25 09:22:44,856] [DEBUG] [paperless.parsing.tesseract] Incomplete sidecar file: discarding.

[2022-07-25 09:23:29,129] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-6gm3biet/archive.pdf

[2022-07-25 09:23:29,130] [WARNING] [paperless.parsing.tesseract] Encountered an error while running OCR: No text was found in the original document. Attempting force OCR to get the text.

[2022-07-25 09:23:29,131] [DEBUG] [paperless.parsing.tesseract] Fallback: Calling OCRmyPDF with args: {'input_file': '/usr/src/paperless/src/../consume/B1fDMHqLFES.pdf', 'output_file': '/tmp/paperless/paperless-6gm3biet/archive-fallback.pdf', 'use_threads': True, 'jobs': 1, 'language': 'spa', 'output_type': 'pdfa', 'progress_bar': False, 'force_ocr': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-6gm3biet/sidecar-fallback.txt'}

Paperless-ngx version

1.7.1

Host OS

synology DSM 218+ docker

Installation method

Docker - official image

Browser

Chrome

Configuration changes

No response

Other

No response

@shamoon
Member

shamoon commented Jul 25, 2022

I would guess your machine is simply running out of resources while processing this. It processed fine for me, but note the logs (the pages are images, etc.):

[2022-07-25 07:15:38,046] [INFO] [paperless.consumer] Consuming B1fDMHqLFES.pdf
[2022-07-25 07:15:51,224] [WARNING] [ocrmypdf._pipeline] Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
[2022-07-25 07:16:07,035] [WARNING] [paperless.parsing.tesseract] Encountered an error while running OCR: No text was found in the original document. Attempting force OCR to get the text.
[2022-07-25 07:16:08,626] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,629] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,630] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,634] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,635] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,635] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,641] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,645] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,648] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,653] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,657] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:08,662] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:26,502] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:26,571] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:26,625] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:26,713] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:27,513] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:27,617] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:31,096] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2022-07-25 07:16:31,096] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:32,928] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2022-07-25 07:16:32,928] [WARNING] [ocrmypdf._pipeline] page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
[2022-07-25 07:16:34,213] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2022-07-25 07:16:44,846] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2022-07-25 07:16:44,924] [WARNING] [ocrmypdf._exec.tesseract] [tesseract] lots of diacritics - possibly poor OCR
[2022-07-25 07:16:51,155] [WARNING] [ocrmypdf._pipeline] Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
[2022-07-25 07:16:56,500] [WARNING] [ocrmypdf._validation] The output file size is 2.23× larger than the input file.
Possible reasons for this include:
The argument --force-ocr was issued, causing transcoding.
The argument --deskew was issued, causing transcoding.
PDF/A conversion was enabled. (Try `--output-type pdf`.)
[2022-07-25 07:16:58,282] [INFO] [paperless.handlers] Assigning document type Report to 2020-07-11 B1fDMHqLFES
[2022-07-25 07:16:58,414] [INFO] [paperless.consumer] Document 2020-07-11 B1fDMHqLFES consumption finished
07:16:58 [Q] INFO Process-1:58 stopped doing work
07:16:58 [Q] INFO Processed [B1fDMHqLFES.pdf]
07:16:59 [Q] INFO recycled worker Process-1:58

@stumpylog
Member

I'm also able to successfully load the document.

@stumpylog changed the title from '[BUG] Concise description of the issue' to '[BUG] Document stuck at "processing"' on Jul 27, 2022
@wittd19

wittd19 commented Jul 29, 2022

I installed paperless-ngx a few weeks ago, and I too am running into problems processing documents: as with the OP, they get stuck in "processing" and then seem to fail silently. This has happened with multiple documents. Initially I thought it was size related, as all the documents I've seen fail have been > 5 MB; however, I also get errors when trying to process the OP's file.

I'm deployed in Unraid using the lsio docker image.
I've tried all sorts of changes to the startup attributes.

Here are the logs when trying to process the OP's file by dropping it in the consume folder:

[2022-07-29 20:13:11,377] [INFO] [paperless.management.consumer] Adding /data/consume/B1fDMHqLFES-test4.pdf to the task queue.
[2022-07-29 20:13:11,381] [INFO] [paperless.management.consumer] Using inotify to watch directory for changes: /data/consume
[2022-07-29 20:13:11,648] [INFO] [paperless.consumer] Consuming B1fDMHqLFES-test4.pdf
[2022-07-29 20:13:11,658] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-07-29 20:13:11,662] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-07-29 20:13:11,664] [DEBUG] [paperless.consumer] Parsing B1fDMHqLFES-test4.pdf...
[2022-07-29 20:13:24,620] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /data/consume/B1fDMHqLFES-test4.pdf
[2022-07-29 20:13:25,029] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/data/consume/B1fDMHqLFES-test4.pdf', 'output_file': '/tmp/paperless/paperless-d5cnpdg0/archive.pdf', 'use_threads': True, 'jobs': 4, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-d5cnpdg0/sidecar.txt'}
[2022-07-29 20:13:35,822] [DEBUG] [paperless.parsing.tesseract] Incomplete sidecar file: discarding.
[2022-07-29 20:13:48,585] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-d5cnpdg0/archive.pdf
[2022-07-29 20:13:48,587] [WARNING] [paperless.parsing.tesseract] Encountered an error while running OCR: No text was found in the original document. Attempting force OCR to get the text.
[2022-07-29 20:13:48,587] [DEBUG] [paperless.parsing.tesseract] Fallback: Calling OCRmyPDF with args: {'input_file': '/data/consume/B1fDMHqLFES-test4.pdf', 'output_file': '/tmp/paperless/paperless-d5cnpdg0/archive-fallback.pdf', 'use_threads': True, 'jobs': 4, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'force_ocr': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-d5cnpdg0/sidecar-fallback.txt'}

Dropping it in the UI results in this:
image

and seemingly the same log output:

[2022-07-29 20:30:35,821] [INFO] [paperless.consumer] Consuming B1fDMHqLFES.pdf
[2022-07-29 20:30:35,824] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-07-29 20:30:35,826] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-07-29 20:30:35,828] [DEBUG] [paperless.consumer] Parsing B1fDMHqLFES.pdf...
[2022-07-29 20:30:48,672] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-upload-csbzt7_j
[2022-07-29 20:30:48,794] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/tmp/paperless/paperless-upload-csbzt7_j', 'output_file': '/tmp/paperless/paperless-t48pb7wb/archive.pdf', 'use_threads': True, 'jobs': 4, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-t48pb7wb/sidecar.txt'}
[2022-07-29 20:30:59,523] [DEBUG] [paperless.parsing.tesseract] Incomplete sidecar file: discarding.
[2022-07-29 20:31:11,832] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-t48pb7wb/archive.pdf
[2022-07-29 20:31:11,833] [WARNING] [paperless.parsing.tesseract] Encountered an error while running OCR: No text was found in the original document. Attempting force OCR to get the text.
[2022-07-29 20:31:11,834] [DEBUG] [paperless.parsing.tesseract] Fallback: Calling OCRmyPDF with args: {'input_file': '/tmp/paperless/paperless-upload-csbzt7_j', 'output_file': '/tmp/paperless/paperless-t48pb7wb/archive-fallback.pdf', 'use_threads': True, 'jobs': 4, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'force_ocr': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-t48pb7wb/sidecar-fallback.txt'}

You mentioned resources. While I am running on Unraid with a lot of other Docker containers, I never see a spike in CPU or RAM usage during file processing, and I am running an i7-8700 with 8 GB RAM.

Happy to test anything needed. Are there any performance related settings I can try to change?

Thanks..

@stumpylog
Member

I retested with 1.8.0, and the document still processes for me. For this file, there's no text content found, so it forces OCR, meaning it has to process a lot of images, which means a lot of time.

The default timeout for working on a file is 1800 s (30 minutes). If the document doesn't complete by then, it will be marked as failed.
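
To compare a run against that limit, the bracketed timestamps in the logs can be subtracted. A minimal Python sketch, assuming every line starts with a `[YYYY-MM-DD HH:MM:SS,mmm]` stamp as in the logs quoted in this thread. The helper names are made up for illustration, and the second sample line is a shortened copy of the last line of the original report's log:

```python
from datetime import datetime

# Default worker timeout quoted above, in seconds.
WORKER_TIMEOUT_S = 1800


def log_time(line: str) -> datetime:
    """Parse the leading '[YYYY-MM-DD HH:MM:SS,mmm]' stamp of a log line."""
    stamp = line[1:line.index("]")]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")


def elapsed_seconds(first_line: str, last_line: str) -> float:
    """Seconds between the stamps of two log lines."""
    return (log_time(last_line) - log_time(first_line)).total_seconds()


# First and last lines of the log in the original report (second one shortened).
start = "[2022-07-25 09:21:19,890] [INFO] [paperless.consumer] Consuming B1fDMHqLFES.pdf"
fallback = "[2022-07-25 09:23:29,131] [DEBUG] [paperless.parsing.tesseract] Fallback: Calling OCRmyPDF with args"

elapsed = elapsed_seconds(start, fallback)
print(f"{elapsed:.3f} s elapsed, {WORKER_TIMEOUT_S - elapsed:.3f} s left before the timeout")
```

For the original report this shows that roughly 129 seconds had passed when the forced OCR pass started, so most of the 1800 s budget was still available at that point.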

@wittd19

wittd19 commented Jul 29, 2022

I've also tried adding this to my docker config, to give it a full hour to process, same result.
image

@wittd19

wittd19 commented Jul 29, 2022

You mentioned 'marked as failed'.
Should there be a log event at 30/60 minutes when it times out?
Also, under admin, failed tasks, there is no entry when this occurs.

@stumpylog
Member

Ok, I think I see what the issue is. I would bet you'll see in the log a single line like:

[Q] WARNING reincarnated worker Process-1:7 after timeout

That's not much, and it's certainly not helpful to see the WebUI seemingly still working away when the background worker has given up. I'll need to look into what a dependency does and see if it can be improved.
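
A quick way to check a saved log for that message is a small Python sketch like the one below. It assumes the message has exactly the wording of the line quoted above; real log lines may carry a time prefix, which the pattern tolerates. The function name is hypothetical:

```python
import re

# Matches the worker-timeout warning quoted above and captures the worker name.
TIMEOUT_RE = re.compile(
    r"\[Q\] WARNING reincarnated worker (?P<worker>\S+) after timeout"
)


def timed_out_workers(log_text: str) -> list[str]:
    """Return the names of all workers that were restarted after a timeout."""
    return [match.group("worker") for match in TIMEOUT_RE.finditer(log_text)]


sample_log = (
    "07:16:58 [Q] INFO Process-1:58 stopped doing work\n"
    "[Q] WARNING reincarnated worker Process-1:7 after timeout\n"
)
print(timed_out_workers(sample_log))
```

An empty list means no worker timeout was logged, which would point away from the timeout and toward another cause.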

That doesn't help with the document still timing out, but I don't see anything which can be done for that besides increasing the timeout. It does complete, it's just a lot of processing. From within the container, you could run `time ocrmypdf --force-ocr --clean --deskew --rotate-pages B1fDMHqLFES.pdf output.pdf` and see how long it will actually take. And you could just use that output, since it will be OCRed.
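
The same measurement can be scripted. A rough Python sketch, assuming `ocrmypdf` is on the PATH inside the container; the flags are the ones from the command above, while `build_ocr_command` and `timed_run` are hypothetical helper names:

```python
import shlex
import subprocess
import time
from typing import Optional


def build_ocr_command(input_pdf: str, output_pdf: str) -> list[str]:
    """Build the forced-OCR command line suggested above."""
    return [
        "ocrmypdf", "--force-ocr", "--clean", "--deskew", "--rotate-pages",
        input_pdf, output_pdf,
    ]


def timed_run(command: list[str], timeout_s: Optional[float] = None) -> tuple[int, float]:
    """Run a command and return (exit code, wall-clock seconds).

    Raises subprocess.TimeoutExpired if timeout_s is set and exceeded.
    """
    started = time.monotonic()
    completed = subprocess.run(command, timeout=timeout_s, check=False)
    return completed.returncode, time.monotonic() - started


# Show the command that would be executed, without running it.
print(shlex.join(build_ocr_command("B1fDMHqLFES.pdf", "output.pdf")))
```

Calling `timed_run(build_ocr_command("B1fDMHqLFES.pdf", "output.pdf"))` inside the container then returns the exit code together with the elapsed seconds, which can be compared with the configured worker timeout.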

@stumpylog added the backend and dependencies labels and removed the cant-reproduce label on Jul 30, 2022
@wittd19

wittd19 commented Jul 30, 2022

Thanks for taking the time to look at this..

here's the result of that test

image

@wittd19

wittd19 commented Aug 8, 2022

Anything else to look at here?
Not sure what that 'killed' message means but it's not hitting the configured timeout... 'killed' seems to be happening in less than 30 seconds. This is actually happening on many of the files I am attempting to upload so I can provide more examples if that helps diagnose.

@stumpylog
Member

For the timeout not being so visible, I'm working on a solution for that.

The "Killed" printed above comes from the out-of-memory (OOM) killer terminating the process. That might also be the cause of an eternally processing document, and I don't think there's any way to raise that up to a user.
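
When repeating the manual test, the exit status gives a hint about this. A process killed with SIGKILL shows up as return code -9 in Python's `subprocess`, and as exit status 137 (128 + 9) in a POSIX shell. The sketch below only interprets such codes; `describe_exit` is a hypothetical helper, and a SIGKILL on its own does not prove the OOM killer was responsible (the kernel log would have to confirm that):

```python
import signal


def describe_exit(returncode: int) -> str:
    """Describe a subprocess return code or a shell exit status."""
    if returncode == 0:
        return "finished normally"
    if returncode < 0:
        signum = -returncode        # subprocess reports death by signal N as -N
    elif returncode > 128:
        signum = returncode - 128   # POSIX shells report it as 128 + N
    else:
        return f"exited with error code {returncode}"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Not a known signal number, so treat it as a plain exit code.
        return f"exited with error code {returncode}"
    if signum == signal.SIGKILL:
        return f"killed by {name}, possibly by the out-of-memory killer"
    return f"terminated by {name}"


for code in (0, 2, -9, 137):
    print(code, "->", describe_exit(code))
```

Note that a program can also choose an exit code above 128 by itself, so the shell-style interpretation is a heuristic.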

@stumpylog stumpylog self-assigned this Aug 8, 2022
@wittd19

wittd19 commented Aug 22, 2022

Just a note: after updating to v1.8.0 I can see these failed documents in the new "File Tasks" view (which is pretty cool, btw), but lots of documents are still failing for me...
image

@hawkinspeter

hawkinspeter commented Aug 22, 2022

Running v1.8.0 on RPi4 docker swarm and am getting this problem too.

Edit: It looks like they did eventually get processed

@stumpylog
Member

Our next release will include improvements to how worker timeouts are handled. They will be much more visible (see examples paperless-ngx/django-q#2 (comment)) in the UI.

If it's the OOM killer, that still won't be obvious; there just isn't a way to detect that. Hopefully, upcoming improvements in underlying libraries like pikepdf and qpdf will help reduce the occurrences.

I'm going to close this out, as I believe what we can do here is now fixed.

Repository owner moved this from Todo to Done in Paperless-ngx Sep 7, 2022
@nheine

nheine commented Dec 29, 2022

For me the timeout works well: it appears in the logs after the default 1800 s. However, on the dashboard the task for that document is still shown in green as processing. It only disappears once the browser is closed and reopened, or when I open the page in another browser. So it looks like there is no feedback to the web UI once processing times out.

@github-actions
Contributor

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 15, 2023

6 participants