TesseractError: Estimating resolution as 178
File "unstructured/partition/pdf.py", line 157, in partition_pdf
return partition_pdf_or_image(
File "unstructured/partition/pdf.py", line 287, in partition_pdf_or_image
_layout_elements = _partition_pdf_or_image_local(
File "unstructured/utils.py", line 178, in wrapper
return func(*args, **kwargs)
File "unstructured/partition/pdf.py", line 412, in _partition_pdf_or_image_local
final_layout = process_data_with_ocr(
File "unstructured/partition/ocr.py", line 67, in process_data_with_ocr
merged_layouts = process_file_with_ocr(
File "unstructured/partition/ocr.py", line 147, in process_file_with_ocr
raise e
File "unstructured/partition/ocr.py", line 137, in process_file_with_ocr
merged_page_layout = supplement_page_layout_with_ocr(
File "unstructured/partition/ocr.py", line 176, in supplement_page_layout_with_ocr
ocr_layout = get_ocr_layout_from_image(
File "unstructured/partition/ocr.py", line 246, in get_ocr_layout_from_image
ocr_data = unstructured_pytesseract.image_to_data(
File "unstructured_pytesseract/pytesseract.py", line 591, in image_to_data
return {
File "unstructured_pytesseract/pytesseract.py", line 597, in <lambda>
Output.DICT: lambda: file_to_dict(run_and_get_output(*args), '\t', -1),
File "unstructured_pytesseract/pytesseract.py", line 347, in run_and_get_output
run_tesseract(**kwargs)
File "unstructured_pytesseract/pytesseract.py", line 279, in run_tesseract
raise TesseractError(proc.returncode, get_errors(error_string))
Summary:
Close: #1920
* stop passing an empty string from `languages` to tesseract, which
results in an empty string being passed to the `-l` language option of
the tesseract CLI
* also stop passing duplicate language codes from `languages` to
tesseract OCR
* if none of the ISO languages in the `languages` parameter can be
converted, proceed with OCR using `eng` as the default
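
The three rules above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper name `prepare_tesseract_languages` and the small `ISO_TO_TESSERACT` table are hypothetical, and the only external fact relied on is that the tesseract CLI joins multiple languages for `-l` with `+`.

```python
import logging
from typing import Iterable, List

logger = logging.getLogger(__name__)

DEFAULT_LANG = "eng"

# Hypothetical, deliberately tiny lookup table used only for this sketch.
ISO_TO_TESSERACT = {
    "en": "eng", "eng": "eng",
    "de": "deu", "deu": "deu",
    "fr": "fra", "fra": "fra",
}


def prepare_tesseract_languages(languages: Iterable[str]) -> str:
    """Build the value for tesseract's `-l` option from user-supplied codes."""
    converted: List[str] = []
    for lang in languages:
        code = ISO_TO_TESSERACT.get(lang.strip().lower())
        if code is None:
            # Covers empty strings and anything that cannot be converted.
            logger.warning("skipping unrecognized language %r", lang)
            continue
        if code not in converted:
            # Skip duplicates while preserving the caller's order.
            converted.append(code)
    if not converted:
        # Nothing usable was supplied: fall back to English.
        logger.warning("no valid languages given, defaulting to %r", DEFAULT_LANG)
        return DEFAULT_LANG
    return "+".join(converted)
```

With this sketch, `[""]` and `["eng,de"]` both fall back to `eng`, while `["eng", "eng", "de"]` yields `eng+deu`.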
### Test
* First, reproduce the tesseract error `Estimating resolution as X` before
this change:
* on the main branch of the `unstructured-api` repo, run `make
run-web-app`
* use curl to trigger the error with an empty string, or with any invalid input such as `-F
'languages="eng,de"'`:
```
curl -X 'POST' 'http://0.0.0.0:8000/general/v0/general' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
-F 'languages=""' \
-F 'strategy=hi_res' \
-F 'pdf_infer_table_structure=True' \
| jq -C . | less -R
```
* after this change:
* in your unstructured API env, cd to the unstructured repo and install it locally with `pip install -e .`
* check out this branch
* run `make run-web-app` again in the api repo
* the curl command now returns output, and a warning appears in the log
---------
Co-authored-by: qued <[email protected]>
API users are seeing the following error. We've caught this exception in the past, so we likely just need a try/except in this new location.
Sentry Issue: UNSTRUCTURED-API-3Z
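
A rough sketch of what such a guard might look like. Everything here is an assumption for illustration: `TesseractError` is a local stand-in with the `(status, message)` shape seen in the traceback, `ocr_or_empty` is a hypothetical wrapper, and returning an empty result after logging is just one possible policy, not necessarily how the exception was handled previously.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)


class TesseractError(RuntimeError):
    """Local stand-in for the OCR library's error, for this sketch only."""

    def __init__(self, status: int, message: str) -> None:
        super().__init__(status, message)
        self.status = status
        self.message = message


def ocr_or_empty(run_ocr: Callable[[Any], Dict], image: Any) -> Dict:
    """Run an OCR callable; on TesseractError, log a warning and return {}."""
    try:
        return run_ocr(image)
    except TesseractError as e:
        # Assumed policy: degrade to "no OCR data" instead of failing the request.
        logger.warning("tesseract failed (status %s): %s", e.status, e.message)
        return {}
```

Whether to swallow the error, re-raise it as a client-facing error, or retry with different settings is a design decision this sketch does not settle.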