chore: stop passing language code from tesseract mapping to paddle (#226)

### Summary

A user flagged an assertion error for the paddle language code:
```
AssertionError: param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng
```
and tried setting the `ocr_languages` param to 'en' (the correct language
code for English in paddle), but that didn't work either.
The reason is that `ocr_languages` uses the tesseract mapping,
which converts `en` to `eng`, since that is the correct language code
for English in tesseract.

The quick workaround is to stop passing the language code to paddle and
let it use the default `en`; this will be addressed properly once we have a
language code mapping for paddle.
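
The mismatch described above can be sketched in a few lines. This is a hypothetical illustration, not the actual library code: `TESSERACT_CODES` and `to_tesseract` are stand-ins for the real tesseract mapping, while `PADDLE_LANGS` is taken from the assertion error message itself.

```python
# Illustrative sketch of the language-code mismatch (names are hypothetical).
TESSERACT_CODES = {"en": "eng", "ko": "kor"}  # standard code -> tesseract code
PADDLE_LANGS = {"ch", "en", "korean", "japan", "chinese_cht",
                "ta", "te", "ka", "latin", "arabic", "cyrillic", "devanagari"}

def to_tesseract(code: str) -> str:
    """Convert a standard language code to tesseract's code."""
    return TESSERACT_CODES.get(code, code)

# Passing the tesseract-mapped code to paddle fails its validation:
mapped = to_tesseract("en")          # "eng"
assert mapped not in PADDLE_LANGS    # this is where paddle raises AssertionError
assert "en" in PADDLE_LANGS          # the unmapped code would have been accepted
```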

### Test
It looks like the user used this branch and got the lang parameter working,
per the [linked
comment](Unstructured-IO/unstructured-api#247 (comment)) :)

On the api repo:
```
pip install paddlepaddle
pip install "unstructured.PaddleOCR"
export ENTIRE_PAGE_OCR=paddle
make run-web-app
```
* Check the error before this change:
```
curl  -X 'POST'  'http://localhost:8000/general/v0/general'   -H 'accept: application/json'  -F 'files=@sample-docs/english-and-korean.png'   -F 'ocr_languages=en'  | jq -C . | less -R
```
You will see the error:
```
{
  "detail": "param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng"
}
```
You will also see `INFO Loading paddle with CPU on
language=eng...` in the logger, since the tesseract mapping converts `en` to `eng`.
* Check after this change:

Check out this branch and install the inference repo into your env (the
same env that's running the api) with `pip install -e .`

Rerun `make run-web-app`

Run the curl command again. You won't get a result on an M1 chip since
paddle doesn't work on it, but in the logger info you can see
`2023-09-27 12:48:48,120 unstructured_inference INFO Loading paddle with
CPU on language=en...`, which means the lang parameter is using the default
`en` (the logger info comes from [this
line](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/paddle_ocr.py#L22)).
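
The before/after logger behavior can be mimicked with a small sketch. This is not the real `paddle_ocr.load_agent`; `load_agent_message` is a hypothetical helper that only reproduces the INFO line quoted above, assuming the agent's language parameter defaults to `"en"`.

```python
DEFAULT_PADDLE_LANG = "en"

def load_agent_message(language: str = DEFAULT_PADDLE_LANG) -> str:
    # Hypothetical stand-in: mirrors only the INFO line logged when
    # the paddle agent is loaded, not the actual loading logic.
    return f"Loading paddle with CPU on language={language}..."

# Before the fix: the tesseract-mapped code leaked through to paddle.
before = load_agent_message("eng")   # "Loading paddle with CPU on language=eng..."

# After the fix: no language is passed, so paddle's default "en" applies.
after = load_agent_message()         # "Loading paddle with CPU on language=en..."
```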

---------

Co-authored-by: shreyanid <[email protected]>
yuming-long and shreyanid authored Sep 27, 2023
1 parent 12ca9d9 commit cf15726
Showing 3 changed files with 8 additions and 5 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,7 @@
+## 0.6.6
+
+* Stop passing the ocr_languages parameter into paddle to avoid an invalid paddle language code error; this will be
+properly fixed once we have the mapping from standard language codes to paddle language codes.
 ## 0.6.5
 
 * Add functionality to keep extracted image elements while merging inferred layout with extracted layout
2 changes: 1 addition & 1 deletion unstructured_inference/__version__.py
@@ -1 +1 @@
-__version__ = "0.6.5"  # pragma: no cover
+__version__ = "0.6.6"  # pragma: no cover
7 changes: 3 additions & 4 deletions unstructured_inference/inference/layout.py
@@ -275,12 +275,11 @@ def get_elements_with_detection_model(
)

 if entrie_page_ocr == "paddle":
-    logger.info("Processing entrie page OCR with paddle...")
+    logger.info("Processing entire page OCR with paddle...")
     from unstructured_inference.models import paddle_ocr
 
-    # TODO(yuming): paddle only support one language at once,
-    # change ocr to tesseract if passed in multilanguages.
-    ocr_data = paddle_ocr.load_agent(language=self.ocr_languages).ocr(
+    # TODO(yuming): pass ocr language to paddle when we have language mapping for paddle
+    ocr_data = paddle_ocr.load_agent().ocr(
         np.array(self.image),
         cls=True,
     )
