PDF parsing error handling #14

ricomnl · 2023-04-26T20:41:56Z

Hi, it would be useful to have some error handling for the case where a PDF fails to parse. I got the error below after parsing 1000s of PDFs and had to restart from scratch (not a big deal, since I was using a small model for embedding, but it would be annoying with a large OpenAI model).

(semantra) rico@xxx:~/src/semantra$ semantra --model sgpt-1.3B data/*pdf
semantra --model sgpt-1.3B data/test.pdf 
test.pdf:   0%|  | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/rico/.local/bin/semantra", line 8, in <module>
    sys.exit(main())
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/semantra.py", line 594, in main
    documents[fn] = process(
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/semantra.py", line 146, in process
    content = get_text_content(md5, filename, semantra_dir, force, silent)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/semantra.py", line 45, in get_text_content
    return get_pdf_content(md5, filename, semantra_dir, force, silent)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/pdf.py", line 53, in get_pdf_content
    pdf = pdfium.PdfDocument(filename)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/pypdfium2/_helpers/document.py", line 86, in __init__
    self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/pypdfium2/_helpers/document.py", line 721, in _open_pdf
    raise PdfiumError(f"Failed to load document (PDFium: {consts.ErrorToStr.get(err_code)}).")
pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Data format error).

ricomnl · 2023-04-26T20:42:28Z

Never mind, I just realized it caches everything. Error handling would still be nice to have, though.

freedmand · 2023-04-27T04:44:58Z

I should probably make the cache handling more clear in the docs so folks are reassured.

Great point re: error handling. Logging an error message and continuing is the way to go here. Also, if there's a PDF that's not parsing correctly that should be (and you're comfortable sharing), let me know!
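
A minimal sketch of the "log an error message and continue" approach, not semantra's actual code. `process_all`, `process_one`, and the logger name are hypothetical; `process_one` stands in for whatever per-file step currently raises. The sketch catches `Exception` broadly for simplicity; a real change would probably catch the `PdfiumError` class shown in the traceback instead.

```python
import logging
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("semantra.sketch")


def process_all(
    filenames: Iterable[str],
    process_one: Callable[[str], object],
) -> Tuple[Dict[str, object], List[Tuple[str, str]]]:
    """Process each file; log and skip any file whose processing raises."""
    documents: Dict[str, object] = {}
    skipped: List[Tuple[str, str]] = []
    for fn in filenames:
        try:
            documents[fn] = process_one(fn)
        except Exception as exc:  # broad on purpose in this sketch
            logger.error("Skipping %s: %s", fn, exc)
            skipped.append((fn, str(exc)))
    return documents, skipped


if __name__ == "__main__":
    def fake_process(fn: str) -> str:
        # Hypothetical stand-in for the real per-file processing step.
        if fn.endswith("bad.pdf"):
            raise ValueError("Failed to load document (PDFium: Data format error).")
        return "text of " + fn

    docs, skipped = process_all(["a.pdf", "bad.pdf", "b.pdf"], fake_process)
    print(sorted(docs), skipped)
```

Returning the skipped files alongside the results lets the caller print a summary at the end, so a bad file in a batch of thousands is visible without stopping the run.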

ricomnl · 2023-04-28T16:08:18Z

It was a fault on my end; the PDF was empty for some reason.
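
For anyone hitting the same thing, a cheap pre-check can flag empty files before they reach the parser. This is a sketch under two assumptions: that "empty" means a zero-byte file, and that a valid PDF starts with the `%PDF-` marker at byte 0. `looks_like_pdf` is a hypothetical helper, not part of semantra, and the header check is stricter than some PDF readers, which tolerate a few leading bytes before the marker.

```python
import os


def looks_like_pdf(path: str) -> bool:
    """Return True if the file is non-empty and begins with b"%PDF-"."""
    if os.path.getsize(path) == 0:
        return False  # zero-byte file: nothing to parse
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"
```

Running something like `[p for p in paths if not looks_like_pdf(p)]` before indexing would list the suspicious files up front.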

ricomnl · 2023-04-28T16:09:18Z

I also realized the search is quite slow for 1000s of PDFs. Is this because I'm using a relatively big model or just because they're in PDF format? Would it be faster if it was raw text or if I use a smaller model?

sam33r · 2024-06-24T21:52:41Z

Just here to +1, it would be great to skip PDFs that have errors. It also currently breaks if it encounters any password-protected PDFs (see below). Thank you for this very useful tool!

Traceback (most recent call last):
  File "/Users/sameer/.local/bin/semantra", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/semantra/semantra.py", line 619, in main
    documents[fn] = process(
                    ^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/semantra/semantra.py", line 158, in process
    content = get_text_content(md5, filename, semantra_dir, force, silent, encoding)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/semantra/semantra.py", line 50, in get_text_content
    return get_pdf_content(md5, filename, semantra_dir, force, silent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/semantra/pdf.py", line 53, in get_pdf_content
    pdf = pdfium.PdfDocument(filename)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/pypdfium2/_helpers/document.py", line 78, in __init__
    self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/pypdfium2/_helpers/document.py", line 678, in _open_pdf
    raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Incorrect password error).
-> Cannot close object, library is destroyed. This may cause a memory leak!
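
The two tracebacks in this thread end with different PDFium messages ("Data format error" and "Incorrect password error"), so a skip-and-log handler could report why each file was skipped. The sketch below is only an illustration: `classify_pdf_error` and its category names are hypothetical, and matching on message text is brittle, since the wording may differ between pypdfium2 versions.

```python
def classify_pdf_error(message: str) -> str:
    """Map an error message to a rough skip reason (hypothetical categories)."""
    text = message.lower()
    if "incorrect password" in text:
        return "password-protected"
    if "data format" in text:
        return "corrupt-or-empty"
    return "other"


if __name__ == "__main__":
    # Messages copied from the tracebacks in this thread.
    for msg in (
        "Failed to load document (PDFium: Data format error).",
        "Failed to load document (PDFium: Incorrect password error).",
    ):
        print(classify_pdf_error(msg), "<-", msg)
```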

freedmand added the enhancement label (New feature or request) on Apr 27, 2023
