fix: handle errors from Tesseract #165
Certain regions of a document are failing OCR with this error (for some value of the resolution):

`pytesseract.pytesseract.TesseractError: (-8, 'Estimating resolution as 1250')`

When I try the same region on the CLI, I get:

```
$ tesseract bad_tile.jpeg output
Estimating resolution as 1813
Floating point exception
```

Whatever the root cause, let's catch this error and return an empty string.
Looks like those ingest tests have been failing on main for a while, so I'm assuming we can ignore them here.
Yeah, that just indicates the current version of unstructured-inference would produce different outputs if the ingest tests are run. Hopefully, they are good different outputs!
Code looks good to me! Tested in docker to reproduce the issue, and this change results in a 200 status for otherwise failing documents.
* build(release): bump unstructured-inference Related to downstream issue: Unstructured-IO/unstructured-api#182 And upstream PR: Unstructured-IO/unstructured-inference#165 --------- Co-authored-by: Shreya Nidadavolu <[email protected]>
Related to downstream issue: #182 And upstream PR: Unstructured-IO/unstructured-inference#165 * remove test_parallel_mode_correct_result * dropped the file_directory field from elements metadata
    try:
        return agent.detect(cropped_image)
    except tesseract.TesseractError:
Add a `logger.warn("<something>", exc_info=True)`?
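For illustration, the catch-and-log pattern discussed here could look roughly like the sketch below. It is not the code merged in this PR: `safe_detect`, `FailingAgent`, and the log message are hypothetical names, and the local `TesseractError` class is a stand-in for `pytesseract.TesseractError` so the example runs without Tesseract installed. It uses `logger.warning`, since `logger.warn` is a deprecated alias in the standard library.

```python
import logging

logger = logging.getLogger(__name__)


class TesseractError(RuntimeError):
    """Stand-in for pytesseract.TesseractError (assumption: status + message)."""

    def __init__(self, status, message):
        super().__init__(status, message)
        self.status = status
        self.message = message


class FailingAgent:
    """Hypothetical OCR agent whose detect() fails the way the bad tile does."""

    def detect(self, image):
        raise TesseractError(-8, "Estimating resolution as 1250")


def safe_detect(agent, cropped_image):
    """Return the agent's OCR result, or an empty string if Tesseract errors."""
    try:
        return agent.detect(cropped_image)
    except TesseractError:
        # exc_info=True attaches the current traceback to the log record,
        # so the failure stays visible even though it is swallowed here.
        logger.warning("TesseractError: skipping region", exc_info=True)
        return ""
```

With this shape, a region that crashes Tesseract yields `""` while the warning keeps the underlying error traceable in the logs.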