TesseractError: Estimating resolution as x #1920

Closed
sentry-io bot opened this issue Oct 27, 2023 · 0 comments · Fixed by #1996

sentry-io bot commented Oct 27, 2023

API users are seeing the following error. We've caught this exception in the past, so we likely just need a try/except in this new location.

Sentry Issue: UNSTRUCTURED-API-3Z

TesseractError: Estimating resolution as 178
  File "unstructured/partition/pdf.py", line 157, in partition_pdf
    return partition_pdf_or_image(
  File "unstructured/partition/pdf.py", line 287, in partition_pdf_or_image
    _layout_elements = _partition_pdf_or_image_local(
  File "unstructured/utils.py", line 178, in wrapper
    return func(*args, **kwargs)
  File "unstructured/partition/pdf.py", line 412, in _partition_pdf_or_image_local
    final_layout = process_data_with_ocr(
  File "unstructured/partition/ocr.py", line 67, in process_data_with_ocr
    merged_layouts = process_file_with_ocr(
  File "unstructured/partition/ocr.py", line 147, in process_file_with_ocr
    raise e
  File "unstructured/partition/ocr.py", line 137, in process_file_with_ocr
    merged_page_layout = supplement_page_layout_with_ocr(
  File "unstructured/partition/ocr.py", line 176, in supplement_page_layout_with_ocr
    ocr_layout = get_ocr_layout_from_image(
  File "unstructured/partition/ocr.py", line 246, in get_ocr_layout_from_image
    ocr_data = unstructured_pytesseract.image_to_data(
  File "unstructured_pytesseract/pytesseract.py", line 591, in image_to_data
    return {
  File "unstructured_pytesseract/pytesseract.py", line 597, in <lambda>
    Output.DICT: lambda: file_to_dict(run_and_get_output(*args), '\t', -1),
  File "unstructured_pytesseract/pytesseract.py", line 347, in run_and_get_output
    run_tesseract(**kwargs)
  File "unstructured_pytesseract/pytesseract.py", line 279, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
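
The exception handling suggested in the comment above could look roughly like the following. This is a minimal sketch, not the project's actual code: `ocr_or_empty` and `run_ocr` are hypothetical names, and `TesseractError` is redefined here as a stand-in so the example runs without Tesseract installed. Returning an empty result is only one possible policy; re-raising a clearer error is another.

```python
import logging

logger = logging.getLogger(__name__)


class TesseractError(RuntimeError):
    """Stand-in for the library's TesseractError so this sketch is self-contained."""

    def __init__(self, status, message):
        super().__init__(status, message)
        self.status = status
        self.message = message


def ocr_or_empty(run_ocr, image):
    """Call run_ocr(image); on TesseractError, log a warning and return {}.

    `run_ocr` is a hypothetical callable wrapping the real OCR call.
    """
    try:
        return run_ocr(image)
    except TesseractError as e:
        logger.warning("OCR failed (status %s): %s", e.status, e.message)
        return {}
```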
badGarnet pushed a commit that referenced this issue Nov 7, 2023
Summary:

Close: #1920

* Stop passing empty strings from `languages` to tesseract. An empty string
ends up as the value of the `-l` language option on the tesseract CLI.
* Stop passing duplicate language codes from `languages` to tesseract OCR.
* If no ISO language code in the `languages` parameter can be converted,
proceed with OCR using `eng` as the default.
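
The three rules above can be sketched as a single cleanup step. This is an illustration under assumptions, not the code from the commit: `clean_ocr_languages` and the `iso_to_tesseract` mapping are hypothetical names. The one external fact relied on is that the tesseract CLI joins multiple languages with `+` (for example `-l eng+deu`).

```python
import logging

logger = logging.getLogger(__name__)

DEFAULT_LANG = "eng"  # fallback named in the summary above


def clean_ocr_languages(languages, iso_to_tesseract):
    """Build a value for tesseract's `-l` option, e.g. "eng+deu".

    `iso_to_tesseract` is a hypothetical mapping from user-supplied
    codes to Tesseract language codes.
    """
    codes = []
    for lang in languages or []:
        lang = lang.strip().lower()
        if not lang:
            continue  # never forward an empty string to `-l`
        code = iso_to_tesseract.get(lang)
        if code is None:
            logger.warning("Skipping unrecognized language %r", lang)
            continue
        if code not in codes:  # drop duplicates, keep first-seen order
            codes.append(code)
    if not codes:
        logger.warning("No valid OCR language; defaulting to %s", DEFAULT_LANG)
        codes = [DEFAULT_LANG]
    return "+".join(codes)
```

With a mapping such as `{"en": "eng", "eng": "eng", "de": "deu"}`, an input of `[""]` falls back to `eng`, and `["en", "eng", "de"]` collapses to `eng+deu`.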
  
### Test
* First, confirm the tesseract error `Estimating resolution as X` before this change:
  * In the `unstructured-api` repo on the main branch, run `make run-web-app`.
  * Use curl to trigger the error with an empty string, or with any invalid input such as `-F 'languages="eng,de"'`:
    ```
    curl -X 'POST' 'http://0.0.0.0:8000/general/v0/general' \
      -H 'accept: application/json' \
      -H 'Content-Type: multipart/form-data' \
      -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
      -F 'languages=""' \
      -F 'strategy=hi_res' \
      -F 'pdf_infer_table_structure=True' \
      | jq -C . | less -R
    ```

* After this change:
  * In your unstructured API env, cd to the unstructured repo and install it locally with `pip install -e .`.
  * Check out this branch.
  * Run `make run-web-app` again in the API repo.
  * The curl command now returns output, and a warning appears in the log.
---------

Co-authored-by: qued <[email protected]>