Chore: don't pass empty language code to tesseract CLI (#1996)
Summary:

Close: #1920

* Stop passing empty strings from `languages` to tesseract; an empty string
ends up as the value of the `-l` language option of the tesseract
CLI.
* Also stop passing duplicate language codes from `languages` to
tesseract OCR.
* If no ISO language code can be converted from the `languages`
parameter, proceed with OCR using `eng` as the default.
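
The three rules above can be sketched roughly as follows. This is an illustration only, not the code changed in this commit: the function name `build_tesseract_lang_arg` and the tiny lookup table `_TO_TESSERACT` are hypothetical stand-ins for whatever `lang.py` actually does. The only external fact relied on is that the tesseract CLI accepts several languages joined with `+` in `-l`.

```python
from typing import Dict, Iterable, List

# Hypothetical, deliberately tiny lookup table used only for this sketch.
_TO_TESSERACT: Dict[str, str] = {
    "en": "eng",
    "eng": "eng",
    "de": "deu",
    "deu": "deu",
    "fr": "fra",
    "fra": "fra",
}


def build_tesseract_lang_arg(languages: Iterable[str], default: str = "eng") -> str:
    """Return a value for tesseract's `-l` option that is never empty."""
    codes: List[str] = []
    for raw in languages:
        # Empty strings and unknown codes have no mapping and are skipped.
        code = _TO_TESSERACT.get(raw.strip().lower())
        if code is None:
            continue
        # Keep the first occurrence only, so duplicates are not passed on.
        if code not in codes:
            codes.append(code)
    # Fall back to the default when nothing valid survived.
    return "+".join(codes) if codes else default


print(build_tesseract_lang_arg([""]))
print(build_tesseract_lang_arg(["eng", "en", "de"]))
```

With this shape, an empty or entirely invalid `languages` value still produces a usable `-l` argument instead of an empty string.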
  
### Test
* First, reproduce the tesseract error `Estimating resolution as X` without
this change:
   * in the `unstructured-api` repo, on the `main` branch, run `make run-web-app`
   * use curl to trigger the error with an empty string, or with any invalid input such as `-F 'languages="eng,de"'`:
```
curl -X 'POST' 'http://0.0.0.0:8000/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
  -F 'languages=""' \
  -F 'strategy=hi_res' \
  -F 'pdf_infer_table_structure=True' \
  | jq -C . | less -R
```

* after this change:
   * in your unstructured API environment, cd to the unstructured repo and install it locally with `pip install -e .`
   * check out this branch
   * run `make run-web-app` again in the api repo
   * the curl command now returns output, and a warning appears in the log

---------

Co-authored-by: qued <[email protected]>
yuming-long and qued authored Nov 7, 2023
1 parent 38ab35d commit ad14321
Showing 49 changed files with 720 additions and 748 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
## 0.10.29-dev13
## 0.10.29

### Enhancements

@@ -21,6 +21,7 @@
* **Ingest session handler not being shared correctly** All ingest docs that leverage the session handler should only need to set it once per process. It was recreating it each time because the right values weren't being set nor available given how dataclasses work in python.
* **Ingest download-only fix.** Previously the download only flag was being checked after the doc factory pipeline step, which occurs before the files are actually downloaded by the source node. This check was moved after the source node to allow for the files to be downloaded first before exiting the pipeline.
* **Fix flaky chunk-metadata.** Prior implementation was sensitive to element order in the section resulting in metadata values sometimes being dropped. Also, not all metadata items can be consolidated across multiple elements (e.g. coordinates) and so are now dropped from consolidated metadata.
* **Fix tesseract error `Estimating resolution as X`** caused by invalid language parameter input. Proceed with the default language `eng` when `lang.py` fails to find a valid language code for tesseract, so that we don't pass an empty string to the tesseract CLI and raise an exception downstream.

## 0.10.28

50 changes: 25 additions & 25 deletions docs/requirements.txt
@@ -2,25 +2,25 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile --constraint=requirements/constraints.in requirements/build.in
# pip-compile --output-file=build.txt build.in
#
alabaster==0.7.13
# via sphinx
babel==2.13.0
babel==2.13.1
# via sphinx
beautifulsoup4==4.12.2
# via
# -c requirements/base.txt
# -c base.txt
# furo
certifi==2023.7.22
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# -r requirements/build.in
# -c base.txt
# -c constraints.in
# -r build.in
# requests
charset-normalizer==3.3.0
charset-normalizer==3.3.2
# via
# -c requirements/base.txt
# -c base.txt
# requests
docutils==0.18.1
# via
@@ -29,10 +29,10 @@ docutils==0.18.1
# sphinx-rtd-theme
# sphinx-tabs
furo==2023.7.26
# via -r requirements/build.in
# via -r build.in
idna==3.4
# via
# -c requirements/base.txt
# -c base.txt
# requests
imagesize==1.4.1
# via sphinx
@@ -53,10 +53,10 @@ mdit-py-plugins==0.4.0
mdurl==0.1.2
# via markdown-it-py
myst-parser==2.0.0
# via -r requirements/build.in
# via -r build.in
packaging==23.2
# via
# -c requirements/base.txt
# -c base.txt
# sphinx
pygments==2.16.1
# via
@@ -69,17 +69,17 @@ pyyaml==6.0.1
# via myst-parser
requests==2.31.0
# via
# -c requirements/base.txt
# -c base.txt
# sphinx
snowballstemmer==2.2.0
# via sphinx
soupsieve==2.5
# via
# -c requirements/base.txt
# -c base.txt
# beautifulsoup4
sphinx==6.2.1
# via
# -r requirements/build.in
# -r build.in
# furo
# myst-parser
# sphinx-basic-ng
@@ -89,37 +89,37 @@ sphinx==6.2.1
sphinx-basic-ng==1.0.0b2
# via furo
sphinx-rtd-theme==1.2.2
# via -r requirements/build.in
sphinx-tabs==3.4.1
# via -r requirements/build.in
# via -r build.in
sphinx-tabs==3.4.4
# via -r build.in
sphinxcontrib-applehelp==1.0.4
# via
# -r requirements/build.in
# -r build.in
# sphinx
sphinxcontrib-devhelp==1.0.2
# via
# -r requirements/build.in
# -r build.in
# sphinx
sphinxcontrib-htmlhelp==2.0.1
# via
# -r requirements/build.in
# -r build.in
# sphinx
sphinxcontrib-jquery==4.1
# via sphinx-rtd-theme
sphinxcontrib-jsmath==1.0.1
# via sphinx
sphinxcontrib-qthelp==1.0.3
# via
# -r requirements/build.in
# -r build.in
# sphinx
sphinxcontrib-serializinghtml==1.1.5
# via
# -r requirements/build.in
# -r build.in
# sphinx
urllib3==1.26.18
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# -c base.txt
# -c constraints.in
# requests
zipp==3.17.0
# via importlib-metadata
44 changes: 22 additions & 22 deletions requirements/base.txt
@@ -2,73 +2,73 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile --constraint=requirements/constraints.in requirements/base.in
# pip-compile --output-file=base.txt base.in
#
backoff==2.2.1
# via -r requirements/base.in
# via -r base.in
beautifulsoup4==4.12.2
# via -r requirements/base.in
# via -r base.in
certifi==2023.7.22
# via
# -c requirements/constraints.in
# -c constraints.in
# requests
chardet==5.2.0
# via -r requirements/base.in
charset-normalizer==3.3.1
# via -r base.in
charset-normalizer==3.3.2
# via requests
click==8.1.7
# via nltk
dataclasses-json==0.6.1
# via -r requirements/base.in
# via -r base.in
emoji==2.8.0
# via -r requirements/base.in
# via -r base.in
filetype==1.2.0
# via -r requirements/base.in
# via -r base.in
idna==3.4
# via requests
joblib==1.3.2
# via nltk
langdetect==1.0.9
# via -r requirements/base.in
# via -r base.in
lxml==4.9.3
# via -r requirements/base.in
# via -r base.in
marshmallow==3.20.1
# via dataclasses-json
mypy-extensions==1.0.0
# via typing-inspect
nltk==3.8.1
# via -r requirements/base.in
# via -r base.in
numpy==1.24.4
# via
# -c requirements/constraints.in
# -r requirements/base.in
# -c constraints.in
# -r base.in
packaging==23.2
# via marshmallow
python-iso639==2023.6.15
# via -r requirements/base.in
# via -r base.in
python-magic==0.4.27
# via -r requirements/base.in
rapidfuzz==3.4.0
# via -r requirements/base.in
# via -r base.in
rapidfuzz==3.5.2
# via -r base.in
regex==2023.10.3
# via nltk
requests==2.31.0
# via -r requirements/base.in
# via -r base.in
six==1.16.0
# via langdetect
soupsieve==2.5
# via beautifulsoup4
tabulate==0.9.0
# via -r requirements/base.in
# via -r base.in
tqdm==4.66.1
# via nltk
typing-extensions==4.8.0
# via
# -r requirements/base.in
# -r base.in
# typing-inspect
typing-inspect==0.9.0
# via dataclasses-json
urllib3==1.26.18
# via
# -c requirements/constraints.in
# -c constraints.in
# requests
46 changes: 23 additions & 23 deletions requirements/build.txt
@@ -2,25 +2,25 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile --constraint=requirements/constraints.in requirements/build.in
# pip-compile --output-file=build.txt build.in
#
alabaster==0.7.13
# via sphinx
babel==2.13.1
# via sphinx
beautifulsoup4==4.12.2
# via
# -c requirements/base.txt
# -c base.txt
# furo
certifi==2023.7.22
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# -r requirements/build.in
# -c base.txt
# -c constraints.in
# -r build.in
# requests
charset-normalizer==3.3.1
charset-normalizer==3.3.2
# via
# -c requirements/base.txt
# -c base.txt
# requests
docutils==0.18.1
# via
@@ -29,10 +29,10 @@ docutils==0.18.1
# sphinx-rtd-theme
# sphinx-tabs
furo==2023.7.26
# via -r requirements/build.in
# via -r build.in
idna==3.4
# via
# -c requirements/base.txt
# -c base.txt
# requests
imagesize==1.4.1
# via sphinx
@@ -53,10 +53,10 @@ mdit-py-plugins==0.4.0
mdurl==0.1.2
# via markdown-it-py
myst-parser==2.0.0
# via -r requirements/build.in
# via -r build.in
packaging==23.2
# via
# -c requirements/base.txt
# -c base.txt
# sphinx
pygments==2.16.1
# via
@@ -69,17 +69,17 @@ pyyaml==6.0.1
# via myst-parser
requests==2.31.0
# via
# -c requirements/base.txt
# -c base.txt
# sphinx
snowballstemmer==2.2.0
# via sphinx
soupsieve==2.5
# via
# -c requirements/base.txt
# -c base.txt
# beautifulsoup4
sphinx==6.2.1
# via
# -r requirements/build.in
# -r build.in
# furo
# myst-parser
# sphinx-basic-ng
@@ -89,37 +89,37 @@ sphinx==6.2.1
sphinx-basic-ng==1.0.0b2
# via furo
sphinx-rtd-theme==1.2.2
# via -r requirements/build.in
# via -r build.in
sphinx-tabs==3.4.4
# via -r requirements/build.in
# via -r build.in
sphinxcontrib-applehelp==1.0.4
# via
# -r requirements/build.in
# -r build.in
# sphinx
sphinxcontrib-devhelp==1.0.2
# via
# -r requirements/build.in
# -r build.in
# sphinx
sphinxcontrib-htmlhelp==2.0.1
# via
# -r requirements/build.in
# -r build.in
# sphinx
sphinxcontrib-jquery==4.1
# via sphinx-rtd-theme
sphinxcontrib-jsmath==1.0.1
# via sphinx
sphinxcontrib-qthelp==1.0.3
# via
# -r requirements/build.in
# -r build.in
# sphinx
sphinxcontrib-serializinghtml==1.1.5
# via
# -r requirements/build.in
# -r build.in
# sphinx
urllib3==1.26.18
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# -c base.txt
# -c constraints.in
# requests
zipp==3.17.0
# via importlib-metadata