PDF parsing error handling #14

Open
ricomnl opened this issue Apr 26, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@ricomnl

ricomnl commented Apr 26, 2023

Hi, it would be useful to add error handling for the case where a PDF fails to parse. I hit the error below after parsing thousands of PDFs and had to restart from scratch. That was not a big deal since I used a small embedding model, but it would be annoying with a large OpenAI model.

(semantra) rico@xxx:~/src/semantra$ semantra --model sgpt-1.3B data/*pdf
semantra --model sgpt-1.3B data/test.pdf 
test.pdf:   0%|  | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/rico/.local/bin/semantra", line 8, in <module>
    sys.exit(main())
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/semantra.py", line 594, in main
    documents[fn] = process(
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/semantra.py", line 146, in process
    content = get_text_content(md5, filename, semantra_dir, force, silent)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/semantra.py", line 45, in get_text_content
    return get_pdf_content(md5, filename, semantra_dir, force, silent)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/pdf.py", line 53, in get_pdf_content
    pdf = pdfium.PdfDocument(filename)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/pypdfium2/_helpers/document.py", line 86, in __init__
    self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/pypdfium2/_helpers/document.py", line 721, in _open_pdf
    raise PdfiumError(f"Failed to load document (PDFium: {consts.ErrorToStr.get(err_code)}).")
pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Data format error).
@ricomnl
Author

ricomnl commented Apr 26, 2023

Never mind, I just realized it caches everything. Error handling would still be nice to have, though.

@freedmand freedmand added the enhancement New feature or request label Apr 27, 2023
@freedmand
Owner

I should probably make the cache handling clearer in the docs so folks are reassured.
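For readers wondering why a crash does not force a full restart: the traceback above shows `md5`, `semantra_dir`, and `force` being passed into `get_text_content`, which suggests results are cached by content hash. Below is a minimal sketch of that general pattern, not Semantra's actual code. The names `file_md5` and `cached_result` and the JSON file layout are assumptions made for illustration.

```python
import hashlib
import json
import os
from typing import Any, Callable


def file_md5(path: str) -> str:
    """Hash the file contents so the cache key changes when the file changes."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks to avoid loading large files into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_result(
    path: str,
    cache_dir: str,
    compute: Callable[[str], Any],
    force: bool = False,
) -> Any:
    """Return the cached result for `path`, computing and storing it if needed.

    Hypothetical helper: the result must be JSON-serializable.
    """
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, file_md5(path) + ".json")
    if not force and os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return json.load(f)
    result = compute(path)
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return result
```

With a cache like this, rerunning after a crash only redoes the files that had not finished, which matches the behavior described in this thread.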

Great point re: error handling. Logging an error message and continuing is the way to go here. Also, if there's a PDF that fails to parse but should parse (and you're comfortable sharing it), let me know!
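A minimal sketch of what "log and continue" could look like, assuming a per-file loop like the `documents[fn] = process(...)` call in the traceback. The `process_all` helper and its `process_one` callback are hypothetical names, not part of Semantra. The broad `except Exception` is deliberate here, so that any per-file failure (including `PdfiumError`) is skipped rather than aborting the batch.

```python
import logging
from typing import Callable, Dict, Iterable, List, Tuple

logger = logging.getLogger("pdf_batch")  # hypothetical logger name


def process_all(
    filenames: Iterable[str],
    process_one: Callable[[str], object],
) -> Tuple[Dict[str, object], List[Tuple[str, str]]]:
    """Process each file; log and skip failures instead of aborting the run."""
    documents: Dict[str, object] = {}
    failures: List[Tuple[str, str]] = []
    for fn in filenames:
        try:
            documents[fn] = process_one(fn)
        except Exception as exc:  # broad on purpose: one bad file must not stop the batch
            logger.error("Skipping %s: %s", fn, exc)
            failures.append((fn, str(exc)))
    return documents, failures
```

Returning the list of failures alongside the results lets the caller print a summary at the end, so skipped files are not silently lost.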

@ricomnl
Author

ricomnl commented Apr 28, 2023

It was a fault on my end: the PDF was empty for some reason.

@ricomnl
Author

ricomnl commented Apr 28, 2023

I also noticed that search is quite slow for thousands of PDFs. Is this because I'm using a relatively big model, or because the files are in PDF format? Would it be faster with raw text input or a smaller model?

@sam33r

sam33r commented Jun 24, 2024

Just here to +1: it would be great to skip PDFs that have errors. It also currently breaks if it encounters any password-protected PDFs (see below). Thank you for this very useful tool!

Traceback (most recent call last):
  File "/Users/sameer/.local/bin/semantra", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/semantra/semantra.py", line 619, in main
    documents[fn] = process(
                    ^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/semantra/semantra.py", line 158, in process
    content = get_text_content(md5, filename, semantra_dir, force, silent, encoding)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/semantra/semantra.py", line 50, in get_text_content
    return get_pdf_content(md5, filename, semantra_dir, force, silent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/semantra/pdf.py", line 53, in get_pdf_content
    pdf = pdfium.PdfDocument(filename)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/pypdfium2/_helpers/document.py", line 78, in __init__
    self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sameer/.local/pipx/venvs/semantra/lib/python3.12/site-packages/pypdfium2/_helpers/document.py", line 678, in _open_pdf
    raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Incorrect password error).
-> Cannot close object, library is destroyed. This may cause a memory leak!
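The two tracebacks in this thread fail with different PDFium messages, so a skip routine could report why a file was skipped. Here is a rough sketch, assuming the message strings stay as shown above. The `skip_reason` helper and the reason labels are hypothetical, and matching on message text is brittle across library versions.

```python
from typing import Optional

# Substrings copied from the two tracebacks in this thread.
KNOWN_REASONS = {
    "Incorrect password error": "password-protected",
    "Data format error": "corrupt or empty",
}


def skip_reason(error_message: str) -> Optional[str]:
    """Map a PDFium load-failure message to a short reason, or None if unknown."""
    for needle, reason in KNOWN_REASONS.items():
        if needle in error_message:
            return reason
    return None
```

A caller might log the reason next to the filename and fall back to the raw error message when `skip_reason` returns `None`.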
