Refactor: Remove OCR related code for entire page OCR #231

yuming-long · 2023-09-29T17:01:53Z

Summary

First part of the OCR refactor, which moves OCR from the inference repo to the unstructured repo. This PR removes all code related to entire page OCR. Table-related OCR remains unchanged for now; it will be moved once the table code is refactored to accept preprocessed OCR data.

Test

Please see the test description in Unstructured-IO/unstructured#1579, since the two PRs need to work together.

Note

The ingest test won't pass until the unstructured refactor PR is merged.

unstructured_inference/inference/layout.py

Dockerfile

unstructured_inference/inference/elements.py

CHANGELOG.md

Co-authored-by: cragwolfe <[email protected]>

christinestraub

LGTM!

…#1579)

## Summary

Second part of the OCR refactor, which moves OCR from the inference repo to the unstructured repo; the first part was done in Unstructured-IO/unstructured-inference#231. This PR adds the OCR processing logic for entire page OCR and supports two OCR modes, "entire_page" and "individual_blocks".

The updated workflow for the `hi_res` partition:

* pass the document as data/filename to the inference repo to get `inferred_layout` (DocumentLayout)
* pass the document as data/filename to the OCR module, which first opens the document (creating a temp file/dir as needed) and splits it by pages (converting PDF pages to image pages for PDF files)
* if ocr mode is `"entire_page"`
  * OCR the entire image
  * merge the OCR layout with the inferred page layout
* if ocr mode is `"individual_blocks"`
  * from the inferred page layout, find elements with no extracted text and crop the entire image by the bboxes of those elements
  * replace each empty text element with the text obtained by running OCR on the cropped image
* return all merged PageLayouts and form a DocumentLayout for later processing

This PR also bumps `unstructured-inference==0.7.2`, since the branch relies on the OCR refactor in unstructured-inference.

## Test

```
from unstructured.partition.auto import partition

entrie_page_ocr_mode_elements = partition(
    filename="example-docs/english-and-korean.png",
    ocr_mode="entire_page",
    ocr_languages="eng+kor",
    strategy="hi_res",
)
individual_blocks_ocr_mode_elements = partition(
    filename="example-docs/english-and-korean.png",
    ocr_mode="individual_blocks",
    ocr_languages="eng+kor",
    strategy="hi_res",
)
print([el.text for el in entrie_page_ocr_mode_elements])
print([el.text for el in individual_blocks_ocr_mode_elements])
```

latest output:

```
# entrie_page
['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. 
Its better to have many email', 'accounts.', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASUREWH HARUTOM|2] 팬 입니다. 팬 으 로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 불 공 평 함 을 LRU, 이 일 을 통해 저 희 의 의 혹 을 전 달 하여 귀 사 의 진지한 민 과 적극적인 답 변 을 받을 수 있 기 를 바랍니다.', '3. CC [email protected] so we can keep track of how many emails were', 'successfully sent', '4. Use the hashtag of Haruto on your tweet to show that vou have sent vour email]', '메 고']
# individual_blocks
['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. Its better to have many email', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASURES HARUTOM| 2] 팬 입니다. 팬 으로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 habe ERO, 이 머 일 을 적극 저 희 의 ASS 전 달 하여 귀 사 의 진지한 고 2 있 기 를 바랍니다.', '3. CC [email protected] so we can keep track of how many emails were ciiccecefisliy cant', 'VULLESSIULY Set 4. Use the hashtag of Haruto on your tweet to show that you have sent your email']
```

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: yuming-long <[email protected]>
Co-authored-by: christinestraub <[email protected]>
Co-authored-by: christinestraub <[email protected]>
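To make the difference between the two OCR modes in the commit message above concrete, here is a minimal sketch. It is not the code from either repo: `Region`, `apply_ocr`, and `fake_ocr` are hypothetical names, the OCR engine is replaced by a stub, and the "center of the OCR box falls inside the inferred box" merge rule is an assumption chosen for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

BBox = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in page pixels


@dataclass
class Region:
    """Hypothetical stand-in for a layout element: a box plus optional text."""
    bbox: BBox
    text: Optional[str] = None


def _center_inside(inner: BBox, outer: BBox) -> bool:
    # Assumed merge rule: an OCR box belongs to an element if its center is inside it.
    cx = (inner[0] + inner[2]) / 2
    cy = (inner[1] + inner[3]) / 2
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]


def apply_ocr(
    inferred: List[Region],
    page_bbox: BBox,
    ocr_fn: Callable[[BBox], List[Region]],
    ocr_mode: str = "entire_page",
) -> List[Region]:
    """Fill in missing text on inferred regions using one of the two modes."""
    if ocr_mode == "entire_page":
        # OCR the whole page once, then merge the OCR regions into the layout.
        ocr_regions = ocr_fn(page_bbox)
        used = set()
        merged = []
        for el in inferred:
            inside = [i for i, r in enumerate(ocr_regions) if _center_inside(r.bbox, el.bbox)]
            used.update(inside)
            text = el.text or " ".join(ocr_regions[i].text or "" for i in inside)
            merged.append(Region(el.bbox, text))
        # OCR regions that matched no inferred element are kept as extra elements.
        merged.extend(r for i, r in enumerate(ocr_regions) if i not in used)
        return merged
    if ocr_mode == "individual_blocks":
        # OCR only the crops of elements that have no extracted text.
        result = []
        for el in inferred:
            if el.text:
                result.append(el)
            else:
                text = " ".join(r.text or "" for r in ocr_fn(el.bbox))
                result.append(Region(el.bbox, text))
        return result
    raise ValueError(f"unknown ocr_mode: {ocr_mode!r}")


# Stub OCR engine: returns the fixed "words" whose centers fall inside the crop.
WORDS = [
    Region((0, 0, 10, 10), "hello"),
    Region((20, 0, 30, 10), "world"),
    Region((0, 50, 10, 60), "stray"),
]


def fake_ocr(crop: BBox) -> List[Region]:
    return [w for w in WORDS if _center_inside(w.bbox, crop)]


page = (0, 0, 100, 100)
inferred = [Region((0, 0, 40, 20))]  # one detected block with no text yet

entire_page = apply_ocr(inferred, page, fake_ocr, "entire_page")
individual = apply_ocr(inferred, page, fake_ocr, "individual_blocks")
print([r.text for r in entire_page])
print([r.text for r in individual])
```

In this sketch, `"entire_page"` keeps text that lies outside every inferred block as an extra element, while `"individual_blocks"` only fills blocks the layout model already found. That is one plausible reason the two modes in the test output above return different numbers of elements.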

yuming-long added 13 commits September 27, 2023 16:47

move func merge_inferred_layout_with_ocr_layout

34af41f

aggregate_ocr_text_by_block

fc98bd6

supplement_layout_with_ocr_elements

57fb359

get_elements_from_ocr_regions

08abf44

merge_text_regions

0f38f5c

remove ocr_layout

b8d08f7

remove text region ocr

d7025df

more clean up

2f56748

add back

97c6e02

remove tests clean up

365f3f8

disable tesseract

b64e546

move test fixture

bc89ffa

remove paddle install in docker

a5fc90d

yuming-long force-pushed the yuming/remove_ocr_code branch from 041c465 to 12c30c3 September 29, 2023 17:08

empty

abd95d1

yuming-long force-pushed the yuming/remove_ocr_code branch from 12c30c3 to abd95d1 September 29, 2023 17:09

yuming-long mentioned this pull request Sep 29, 2023

Refactor: support entire page OCR with ocr_mode and ocr_languages Unstructured-IO/unstructured#1579

Merged

yuming-long added 5 commits September 29, 2023 15:52

Merge branch 'main' into yuming/remove_ocr_code

bfaf2bb

ocr param in new test

5726d2f

changlog version

1a968c4

remove ocr constant

28e1bc7

remove all comment ocr

3f43f06

yuming-long changed the title ~~Yuming/remove ocr code~~ Chore: Remove OCR related code for entire page OCR Oct 3, 2023

yuming-long marked this pull request as ready for review October 3, 2023 23:47

yuming-long requested review from christinestraub, benjats07 and qued October 3, 2023 23:50

benjats07 suggested changes Oct 3, 2023

unstructured_inference/inference/layout.py (resolved)

add deduplicate_detected_elements back

4768a8e

benjats07 approved these changes Oct 4, 2023

christinestraub requested changes Oct 4, 2023

refactor: remove ocr layout visualization

5f0bbff

christinestraub changed the title ~~Chore: Remove OCR related code for entire page OCR~~ Refactor: Remove OCR related code for entire page OCR Oct 4, 2023

cragwolfe reviewed Oct 4, 2023

CHANGELOG.md (outdated, resolved)

yuming-long and others added 3 commits October 4, 2023 16:16

remove teseract module since table won't use it

2b7a2fc

remove using image in extract_text

d319c9d

Update CHANGELOG.md

e6bb6a3

Co-authored-by: cragwolfe <[email protected]>

yuming-long requested a review from christinestraub October 4, 2023 20:40

christinestraub requested changes Oct 4, 2023

unstructured_inference/inference/elements.py (outdated, resolved)

yuming-long added 3 commits October 4, 2023 17:07

version bump

4de3ff3

fix: remove value error in extract text

2f3c0db

remove test since won't raise errir

6a4677e

yuming-long requested a review from christinestraub October 4, 2023 22:48

christinestraub approved these changes Oct 5, 2023

cragwolfe approved these changes Oct 5, 2023

cragwolfe merged commit ffb1f0b into main Oct 5, 2023
5 of 8 checks passed

cragwolfe deleted the yuming/remove_ocr_code branch October 5, 2023 18:23
