Skip to content
This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Feature: Worker timeout improvements #2

Merged
merged 4 commits into from
Sep 3, 2022

Conversation

stumpylog
Copy link
Member

@stumpylog stumpylog commented Jul 21, 2022

This pull request updates how django-q handles its timeout checking to be more effecient, while making the status of a worker more clear.

Previous Architecture

Previously, the Sentinel guard loop decremented a shared Value, then terminated the process if the shared value reached 0, as this value indicated a timeout. Somewhat confusingly, this same timer value was used to reflect worker status, where certain magic numbers indicated the worker was idle (-1) or had ceased working due to reaching a limit for recycling (-2).

New Architecture

Now, these have been split up and the worker is responsible for setting its own status, while also handling its own timeout.

Worker Status

Instead of magic numbers, a IntEnum is utilized alongside a Value. This is still shared between the Sentinel and the worker, but thanks to the enum, it is more clear what statuses are possible and what they mean. The value is an I or unsigned integer, which is why an IntEnum was chosen.

Worker Timeout

Directly before beginning the task given to it, the worker utilizes signal.alarm to schedule an exception to be raised in a given number of seconds. If the work completes in time, the alarm is canceled. Otherwise, the task fails and the worker status will be set to indicate a timeout, causing the sentinel to reincarnate the worker.

In the case of no timeout, the alarm is passed a value of 0, which causes no alarm to be scheduled, ending up with infinite time to do the work.

As mentioned in the file, this technique is borrowed from rq and seems pretty straightforward. After testing with paperless, this is actually quite useful, as the timeout exception makes it much more clear a task has failed due to timing out. However, if a task catches Exception or BaseException and doesn't re-raise something, a worker could keep trying to work forever. Nothing in a paperless async task should do that though, so these changes are alright for us.

@stumpylog stumpylog force-pushed the timeout-improvements branch 2 times, most recently from dd9dbae to 3fccfd3 Compare July 22, 2022 20:36
@stumpylog stumpylog force-pushed the timeout-improvements branch 5 times, most recently from 1b03b29 to 3fccfd3 Compare July 31, 2022 01:35
@stumpylog stumpylog force-pushed the timeout-improvements branch from 3a6d8cb to 939ae3e Compare August 10, 2022 16:18
…ed for some work in the guard loop. Also removes the double duty of timer as a timer and a status, status is now a dedicated enum
@stumpylog stumpylog force-pushed the timeout-improvements branch from 939ae3e to b33b4a5 Compare August 10, 2022 22:06
@stumpylog
Copy link
Member Author

Here's an idea of what a user will see when a timeout happens during processing:

image
image

as opposed to:
image

@stumpylog stumpylog force-pushed the timeout-improvements branch from b33b4a5 to b58fc7d Compare August 11, 2022 14:38
@stumpylog
Copy link
Member Author

@paperless-ngx/backend when folks have time, this would be nice to have reviewed and merged.

@shamoon
Copy link
Member

shamoon commented Aug 31, 2022

@stumpylog this and the other PRs seem great, Im a little worried that you are the only person at the moment who knows the code well enough to have a meaningful review. Im not the right person for that but Im not even sure how best to test this either, I could at least do that.

@stumpylog
Copy link
Member Author

I'm helped by the original having some pretty decent documentation of how the pieces all fit together: https://django-q.readthedocs.io/en/latest/architecture.html. Plus rightly or not, I'm trusting the original suite of tests to be pretty good and will catch it if something majorly breaks.

Another eventual option is ditching django-q for something like celery or rq which is maintained. That's a major effort though.

Good point about testing. I'm not sure where I could document this, but I do:

  1. Push my changes to Github (creating commit 123abc)
  2. Edit Pipfile.lock django-q.ref value to the commit sha 123abc
  3. Locally build the image
  4. Start it up and kick the tires:
    • Are files consumed from webui and directory?
    • Do scheduled tasks still run?
    • Are scheduled task times (based on log times) still when they should be?

@shamoon
Copy link
Member

shamoon commented Aug 31, 2022

Cool thanks. Well then, at least in my testing this works a-ok, I cant actually get it to timeout (set PAPERLESS_WORKER_TIMEOUT very low 100 but still didnt, maybe Im missing something there?)

shamoon
shamoon previously approved these changes Aug 31, 2022
@stumpylog
Copy link
Member Author

I used a setting of 30 with this pdf: https://health.mo.gov/lab/pdf/PublicWaterMassMailing.pdf. It's scanned with no digital layer, so it takes a while, at least on my machine.

@shamoon
Copy link
Member

shamoon commented Aug 31, 2022

Weird, im now getting

Exception has occurred: ImportError
cannot import name 'WorkerStatus' from partially initialized module 'django_q.cluster' (most likely due to a circular import) (.../src/django-q/django_q/cluster.py)
  File ".../src/django-q/django_q/cluster.py", line 22, in <module>
    apps.check_apps_ready()

During handling of the above exception, another exception occurred:

  File ".../src/django-q/django_q/tasks.py", line 18, in <module>
    from django_q.cluster import WorkerStatus
  File src/documents/models.py", line 15, in <module>
    from django_q.tasks import Task
  File ".../src/django-q/django_q/cluster.py", line 26, in <module>
    django.setup()
  File "<string>", line 1, in <module> (Current frame)

@stumpylog
Copy link
Member Author

Weird, I'm not sure how I never ran into that, but I believe it is fixed now.

@shamoon
Copy link
Member

shamoon commented Aug 31, 2022

Sweet, that fixed that weird one. I still cant get it to timeout, this is with PAPERLESS_WORKER_TIMEOUT=300 in my paperless.conf, something I'm missing?

[2022-08-31 14:33:01,508] [WARNING] [ocrmypdf._pipeline] Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
[2022-08-31 14:33:06,864] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file
[2022-08-31 14:33:06,870] [DEBUG] [paperless.consumer] Generating thumbnail for PublicWaterMassMailing.pdf...
[2022-08-31 14:33:06,880] [DEBUG] [paperless.parsing] Execute: /usr/local/bin/convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-0vc5ujv8/archive.pdf[0] /var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-0vc5ujv8/convert.webp
OMP: Warning #96: Cannot form a team with 8 threads, using 1 instead.
OMP: Hint Consider unsetting KMP_DEVICE_THREAD_LIMIT (KMP_ALL_THREADS), KMP_TEAMS_THREAD_LIMIT, and OMP_THREAD_LIMIT (if any are set).
[2022-08-31 14:33:11,886] [DEBUG] [paperless.consumer] Saving record to database
[2022-08-31 14:33:11,887] [DEBUG] [paperless.consumer] Creation date from parse_date: 2015-11-10 00:00:00-08:00
[2022-08-31 14:33:12,405] [DEBUG] [paperless.consumer] Deleting file /private/var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-upload-8ra4updh
[2022-08-31 14:33:12,412] [DEBUG] [paperless.parsing.tesseract] Deleting directory /var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-0vc5ujv8
[2022-08-31 14:33:12,415] [INFO] [paperless.consumer] Document 2015-11-10 PublicWaterMassMailing consumption finished
14:33:12 [Q] INFO Process-1:4 stopped doing work
14:33:12 [Q] INFO Processed [PublicWaterMassMailing.pdf]
14:33:16 [Q] INFO recycled worker Process-1:4
14:33:19 [Q] INFO Process-1:5 ready for work at 68501

or an even longer one:

14:34:11 [Q] INFO Process-1:5 processing [Ad Lib 2010.pdf]
[2022-08-31 14:34:12,720] [INFO] [paperless.consumer] Consuming Ad Lib 2010.pdf
[2022-08-31 14:34:12,723] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-08-31 14:34:12,726] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-08-31 14:34:12,733] [DEBUG] [paperless.consumer] Parsing Ad Lib 2010.pdf...
[2022-08-31 14:34:29,480] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /private/var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-upload-zzr9xtbq
[2022-08-31 14:34:29,691] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/private/var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-upload-zzr9xtbq'), 'output_file': '/var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-p83vo8uk/archive.pdf', 'use_threads': True, 'jobs': 8, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-p83vo8uk/sidecar.txt'}
[2022-08-31 14:34:46,658] [WARNING] [ocrmypdf._pipeline] Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
[2022-08-31 14:36:17,382] [DEBUG] [paperless.parsing.tesseract] Incomplete sidecar file: discarding.
[2022-08-31 14:36:37,404] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-p83vo8uk/archive.pdf
[2022-08-31 14:36:37,405] [DEBUG] [paperless.consumer] Generating thumbnail for Ad Lib 2010.pdf...
[2022-08-31 14:36:37,418] [DEBUG] [paperless.parsing] Execute: /usr/local/bin/convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-p83vo8uk/archive.pdf[0] /var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-p83vo8uk/convert.webp
OMP: Warning #96: Cannot form a team with 8 threads, using 1 instead.
OMP: Hint Consider unsetting KMP_DEVICE_THREAD_LIMIT (KMP_ALL_THREADS), KMP_TEAMS_THREAD_LIMIT, and OMP_THREAD_LIMIT (if any are set).
14:36:47 [Q] INFO Enqueued 1
14:36:47 [Q] DEBUG queueing from django_q:paperless:q
14:36:47 [Q] DEBUG Pushed ('jupiter-solar-solar-tango', '28835f9a9a654776b18804f33b6dd55a')
14:36:47 [Q] INFO Process-1 created a task from schedule [Train the classifier]
[2022-08-31 14:36:51,928] [DEBUG] [paperless.consumer] Saving record to database
[2022-08-31 14:36:51,929] [DEBUG] [paperless.consumer] Creation date from parse_date: 1945-08-01 00:00:00-07:00
[2022-08-31 14:36:53,938] [DEBUG] [paperless.consumer] Deleting file /private/var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-upload-zzr9xtbq
[2022-08-31 14:36:53,956] [DEBUG] [paperless.parsing.tesseract] Deleting directory /var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-p83vo8uk
[2022-08-31 14:36:53,971] [INFO] [paperless.consumer] Document 1945-08-01 Ad Lib 2010 consumption finished
14:36:54 [Q] INFO Process-1:5 stopped doing work
14:36:54 [Q] INFO Processed [Ad Lib 2010.pdf]

@stumpylog
Copy link
Member Author

Timeout of 300 or 30? It looks like the second document finished in about 3 minutes, so that would align with a 300s timeout not triggering.

@shamoon
Copy link
Member

shamoon commented Aug 31, 2022

Doh, stupid mistake sorry that setting is seconds, not ms. Ok progress! django_q.timeouts.JobTimeoutException: Task exceeded maximum timeout value (5 seconds)

But Im a bit confused because after that it still consumed:

[2022-08-31 14:50:48,161] [WARNING] [paperless.parsing.tesseract] Error while getting text from PDF document with pdfminer.six
Traceback (most recent call last):
  File "src/paperless_tesseract/parsers.py", line 118, in extract_text
    stripped = post_process_text(pdfminer_extract_text(pdf_file))
  File "lib/python3.8/site-packages/pdfminer/high_level.py", line 164, in extract_text
    interpreter.process_page(page)
  File "lib/python3.8/site-packages/pdfminer/pdfinterp.py", line 991, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "lib/python3.8/site-packages/pdfminer/pdfinterp.py", line 1010, in render_contents
    self.execute(list_value(streams))
  File "lib/python3.8/site-packages/pdfminer/pdfinterp.py", line 1021, in execute
    (_, obj) = parser.nextobject()
  File "lib/python3.8/site-packages/pdfminer/psparser.py", line 607, in nextobject
    (pos, token) = self.nexttoken()
  File "lib/python3.8/site-packages/pdfminer/psparser.py", line 524, in nexttoken
    self.fillbuf()
  File "src/django-q/django_q/timeouts.py", line 55, in handle_death_penalty
    raise self._exception('Task exceeded maximum timeout value '
django_q.timeouts.JobTimeoutException: Task exceeded maximum timeout value (5 seconds)
[2022-08-31 14:50:48,408] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/private/var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-upload-14rix8z4'), 'output_file': '/var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-x87p000j/archive.pdf', 'use_threads': True, 'jobs': 8, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-x87p000j/sidecar.txt'}
[2022-08-31 14:51:05,817] [WARNING] [ocrmypdf._pipeline] Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
[2022-08-31 14:52:36,232] [DEBUG] [paperless.parsing.tesseract] Incomplete sidecar file: discarding.
[2022-08-31 14:52:54,543] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-x87p000j/archive.pdf
[2022-08-31 14:52:54,544] [DEBUG] [paperless.consumer] Generating thumbnail for Ad Lib 2010.pdf...
[2022-08-31 14:52:54,559] [DEBUG] [paperless.parsing] Execute: /usr/local/bin/convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-x87p000j/archive.pdf[0] /var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-x87p000j/convert.webp
OMP: Warning #96: Cannot form a team with 8 threads, using 1 instead.
OMP: Hint Consider unsetting KMP_DEVICE_THREAD_LIMIT (KMP_ALL_THREADS), KMP_TEAMS_THREAD_LIMIT, and OMP_THREAD_LIMIT (if any are set).
[2022-08-31 14:53:10,450] [DEBUG] [paperless.consumer] Saving record to database
[2022-08-31 14:53:10,451] [DEBUG] [paperless.consumer] Creation date from parse_date: 1945-08-01 00:00:00-07:00
[2022-08-31 14:53:12,383] [DEBUG] [paperless.consumer] Deleting file /private/var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-upload-14rix8z4
[2022-08-31 14:53:12,408] [DEBUG] [paperless.parsing.tesseract] Deleting directory /var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-x87p000j
[2022-08-31 14:53:12,422] [INFO] [paperless.consumer] Document 1945-08-01 Ad Lib 2010 consumption finished
14:53:12 [Q] INFO Process-1:1 stopped doing work
14:53:12 [Q] INFO Processed [Ad Lib 2010.pdf]
14:53:13 [Q] INFO recycled worker Process-1:1
14:53:16 [Q] INFO Process-1:4 ready for work at 69503

@stumpylog
Copy link
Member Author

Ah, very interesting. Looks like the timeout happened around here, then thanks to this TODO catching everything, it was just thought of as a pdfminer exception, and onward goes the consumption.

Hm, I'll have to think about that.

@shamoon
Copy link
Member

shamoon commented Aug 31, 2022

Ha, yes I thought you might find that interesting =)

When I set it to a more reasonable 30 seconds I get exactly the result you were looking for:

15:09:24 [Q] INFO Process-1:1 processing [Ad Lib 2010.pdf]
[2022-08-31 15:09:25,805] [INFO] [paperless.consumer] Consuming Ad Lib 2010.pdf
[2022-08-31 15:09:25,809] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-08-31 15:09:25,811] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-08-31 15:09:25,817] [DEBUG] [paperless.consumer] Parsing Ad Lib 2010.pdf...
[2022-08-31 15:09:42,911] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /private/var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-upload-0ni5giit
[2022-08-31 15:09:43,121] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/private/var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-upload-0ni5giit'), 'output_file': '/var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-djla2jtn/archive.pdf', 'use_threads': True, 'jobs': 8, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-djla2jtn/sidecar.txt'}
[2022-08-31 15:09:55,714] [DEBUG] [paperless.parsing.tesseract] Deleting directory /var/folders/50/p_n18fw96lj8y3jnd69gy2w40000gn/T/paperless/paperless-djla2jtn
[2022-08-31 15:09:55,725] [ERROR] [paperless.consumer] Error while consuming document Ad Lib 2010.pdf: JobTimeoutException: Task exceeded maximum timeout value (30 seconds)
Traceback (most recent call last):
  File "src/paperless_tesseract/parsers.py", line 277, in parse
    ocrmypdf.ocr(**args)
  File "lib/python3.8/site-packages/ocrmypdf/api.py", line 340, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
  File "lib/python3.8/site-packages/ocrmypdf/_sync.py", line 392, in run_pipeline
    optimize_messages = exec_concurrent(context, executor)
  File "lib/python3.8/site-packages/ocrmypdf/_sync.py", line 308, in exec_concurrent
    pdf, messages = post_process(pdf, context, executor)
  File "lib/python3.8/site-packages/ocrmypdf/_sync.py", line 238, in post_process
    pdf_out = convert_to_pdfa(pdf_out, ps_stub_out, context)
  File "lib/python3.8/site-packages/ocrmypdf/_pipeline.py", line 734, in convert_to_pdfa
    context.plugin_manager.hook.generate_pdfa(
  File "lib/python3.8/site-packages/pluggy/_hooks.py", line 265, in __call__
    return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
  File "lib/python3.8/site-packages/pluggy/_manager.py", line 80, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "lib/python3.8/site-packages/pluggy/_callers.py", line 60, in _multicall
    return outcome.get_result()
  File "lib/python3.8/site-packages/pluggy/_result.py", line 60, in get_result
    raise ex[1].with_traceback(ex[2])
  File "lib/python3.8/site-packages/pluggy/_callers.py", line 39, in _multicall
    res = hook_impl.function(*args)
  File "lib/python3.8/site-packages/ocrmypdf/builtin_plugins/ghostscript.py", line 90, in generate_pdfa
    ghostscript.generate_pdfa(
  File "lib/python3.8/site-packages/ocrmypdf/_exec/ghostscript.py", line 248, in generate_pdfa
    p = run_polling_stderr(
  File "lib/python3.8/site-packages/ocrmypdf/subprocess/__init__.py", line 102, in run_polling_stderr
    for msg in iter(proc.stderr.readline, ''):
  File "src/django-q/django_q/timeouts.py", line 55, in handle_death_penalty
    raise self._exception('Task exceeded maximum timeout value '
django_q.timeouts.JobTimeoutException: Task exceeded maximum timeout value (30 seconds)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "src/documents/consumer.py", line 320, in try_consume_file
    document_parser.parse(self.path, mime_type, self.filename)
  File "src/paperless_tesseract/parsers.py", line 333, in parse
    raise ParseError(f"{e.__class__.__name__}: {str(e)}") from e
documents.parsers.ParseError: JobTimeoutException: Task exceeded maximum timeout value (30 seconds)

and the helpful little failed task:
Screen Shot 2022-08-31 at 3 29 30 PM

This is great! I think it will reduce a lot of confusion from end-users about what happened when consumption fails.

shamoon
shamoon previously approved these changes Aug 31, 2022
@stumpylog
Copy link
Member Author

I'll have to test this, but I believe it will work for even crazy low timeouts as well.

@shamoon
Copy link
Member

shamoon commented Sep 1, 2022

Awesome yea that worked for 5 seconds too!
Screen Shot 2022-08-31 at 5 24 59 PM

Copy link
Member

@shamoon shamoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bravo again

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants