Fix OCR on Qubes: PyMuPDF required TESSDATA_PREFIX
PyMuPDF versions 1.22.5 and higher accept the tesseract data path as
an argument to `pixmap.pdfocr_tobytes()` [1], but lower versions require
setting the TESSDATA_PREFIX environment variable instead [2].

Because on Qubes the pixels-to-PDF conversion happens on the host and
Qubes ships a lower PyMuPDF package version, we need to pass the path
via the environment variable instead.
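
A minimal sketch of the version gate described above. The helper name
`ocr_kwargs` is hypothetical and is not part of dangerzone or PyMuPDF; the
cutoff is the build timestamp that the diff in this commit compares
`fitz.version[2]` against.

```python
from typing import Any, Dict

# Build timestamp used in this commit's diff as the cutoff for v1.22.5.
TESSDATA_ARG_CUTOFF = 20230621000001


def ocr_kwargs(version_date: str, ocr_lang: str, tessdata_dir: str) -> Dict[str, Any]:
    """Hypothetical helper: build keyword arguments for pixmap.pdfocr_tobytes().

    version_date is assumed to be the timestamp string from fitz.version[2].
    """
    kwargs: Dict[str, Any] = {"compress": True, "language": ocr_lang}
    if int(version_date) >= TESSDATA_ARG_CUTOFF:
        # Newer PyMuPDF: pass the tesseract data path explicitly.
        kwargs["tessdata"] = tessdata_dir
    # Older PyMuPDF: omit the argument and rely on TESSDATA_PREFIX instead.
    return kwargs
```

Under these assumptions the call site would be
`pixmap.pdfocr_tobytes(**ocr_kwargs(fitz.version[2], ocr_lang, get_tessdata_dir()))`.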

NOTE: the TESSDATA_PREFIX environment variable is set in dangerzone-cli
rather than closer to the calling method in `doc_to_pixels.py` because
PyMuPDF reads this variable as soon as the fitz module is imported
[3][4].
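
The ordering constraint in the note can be sketched as follows. The helper
`ensure_tessdata_prefix` is hypothetical; the default path is the one used in
this commit's diff, and the claim that the variable is read at import time is
taken from the note above.

```python
import os
from typing import MutableMapping

# Default path taken from this commit's diff; other distros may differ.
DEFAULT_TESSDATA_DIR = "/usr/share/tesseract/tessdata"


def ensure_tessdata_prefix(
    environ: MutableMapping[str, str] = os.environ,
    default: str = DEFAULT_TESSDATA_DIR,
) -> str:
    """Hypothetical helper: set TESSDATA_PREFIX only if the user has not set it."""
    return environ.setdefault("TESSDATA_PREFIX", default)


# Must run before the first `import fitz` anywhere in the process, since
# (per the note above) the variable is read when the module is imported.
ensure_tessdata_prefix()
# import fitz  # safe only after the call above
```

Using `setdefault` keeps any value the user already exported, which matches
the `os.environ.get(...)` fallback in the diff.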

[1]: https://pymupdf.readthedocs.io/en/latest/pixmap.html#Pixmap.pdfocr_tobytes
[2]: https://pymupdf.readthedocs.io/en/latest/installation.html#enabling-integrated-ocr-support
[3]: pymupdf/PyMuPDF#2439
[4]: https://github.com/pymupdf/PyMuPDF/blob/5d6a7db/src/__init__.py#L159

Fixes #682
deeplow committed Feb 7, 2024
1 parent d1afe4c commit 6006bee
Showing 2 changed files with 18 additions and 6 deletions.
18 changes: 13 additions & 5 deletions dangerzone/conversion/pixels_to_pdf.py
@@ -57,11 +57,19 @@ async def convert(
                 self.update_progress(
                     f"Converting page {page_num}/{num_pages} from pixels to searchable PDF"
                 )
-                page_pdf_bytes = pixmap.pdfocr_tobytes(
-                    compress=True,
-                    language=ocr_lang,
-                    tessdata=get_tessdata_dir(),
-                )
+                if int(fitz.version[2]) >= 20230621000001:
+                    page_pdf_bytes = pixmap.pdfocr_tobytes(
+                        compress=True,
+                        language=ocr_lang,
+                        tessdata=get_tessdata_dir(),
+                    )
+                else:
+                    # XXX method signature changed in v1.22.5 to add tessdata arg
+                    # TODO remove after oldest distro has PyMuPDF >= v1.22.5
+                    page_pdf_bytes = pixmap.pdfocr_tobytes(
+                        compress=True,
+                        language=ocr_lang,
+                    )
                 ocr_pdf = fitz.open("pdf", page_pdf_bytes)
             else:  # Don't OCR
                 self.update_progress(
6 changes: 5 additions & 1 deletion dev_scripts/dangerzone
@@ -1,10 +1,14 @@
 #!/usr/bin/env python3
 # -*- coding: utf-8 -*-
 
-# Load dangerzone module and resources from the source code tree
 import os
 import sys
 
+# XXX workaround lack of tessdata path arg for PyMuPDF < v1.22.5
+# for context see https://github.com/freedomofpress/dangerzone/issues/682
+os.environ["TESSDATA_PREFIX"] = os.environ.get("TESSDATA_PREFIX", "/usr/share/tesseract/tessdata")
+
+# Load dangerzone module and resources from the source code tree
 sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 sys.dangerzone_dev = True
