Refactor: support entire page OCR with `ocr_mode` and `ocr_languages` #1579

yuming-long · 2023-09-29T16:57:27Z

Summary

Second part of the OCR refactor, which moves OCR from the inference repo to the unstructured repo; the first part was done in Unstructured-IO/unstructured-inference#231. This PR adds the OCR processing logic for entire-page OCR and supports two OCR modes: "entire_page" and "individual_blocks".

The updated workflow for the hi_res partition:

Pass the document (as data or filename) to the inference repo to get the inferred_layout (a DocumentLayout).
Pass the document (as data or filename) to the OCR module, which first opens the document (creating a temp file/dir as needed) and splits it into pages (for a PDF file, each PDF page is converted to an image).
If the OCR mode is "entire_page":
- OCR the entire page image.
- Merge the OCR layout with the inferred page layout.
If the OCR mode is "individual_blocks":
- From the inferred page layout, find the elements with no extracted text and crop the page image to the bbox of each such element.
- Replace each empty text element with the text obtained by running OCR on the cropped image.
Return all merged PageLayouts as a DocumentLayout for later processing.
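
The per-page branch of this workflow can be sketched roughly as follows. This is an illustration only, not the code in this PR: `Region`, `ocr_page`, `contains_center`, and the `run_ocr`/`crop` callables are hypothetical names, and the merge rule used here (an OCR region belongs to an inferred element when its center falls inside that element's bbox) is an assumption chosen to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Region:
    """Hypothetical stand-in for a layout element: a bbox plus its text."""
    bbox: BBox
    text: str = ""


def contains_center(outer: BBox, inner: BBox) -> bool:
    """True if the center point of `inner` lies inside `outer`."""
    cx = (inner[0] + inner[2]) / 2
    cy = (inner[1] + inner[3]) / 2
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]


def ocr_page(image, inferred: List[Region], ocr_mode: str,
             run_ocr: Callable, crop: Callable) -> List[Region]:
    """Combine OCR text with an inferred page layout for one page."""
    if ocr_mode == "entire_page":
        ocr_regions = run_ocr(image)  # OCR the whole page once
        merged, used = [], set()
        for el in inferred:
            inside = [i for i, r in enumerate(ocr_regions)
                      if i not in used and contains_center(el.bbox, r.bbox)]
            used.update(inside)
            # Assumption: OCR text wins when present, else keep existing text.
            text = " ".join(ocr_regions[i].text for i in inside) or el.text
            merged.append(Region(el.bbox, text))
        # OCR regions that matched no inferred element are kept as-is.
        merged.extend(r for i, r in enumerate(ocr_regions) if i not in used)
        return merged
    if ocr_mode == "individual_blocks":
        out = []
        for el in inferred:
            if el.text:  # already has extracted text: leave untouched
                out.append(el)
            else:        # OCR only the cropped block
                words = run_ocr(crop(image, el.bbox))
                out.append(Region(el.bbox, " ".join(w.text for w in words)))
        return out
    raise ValueError(f"unsupported ocr_mode: {ocr_mode!r}")


# Toy "image": a list of word regions, so no real OCR engine is needed.
image = [Region((0, 0, 10, 10), "hello"),
         Region((12, 0, 20, 10), "world"),
         Region((0, 50, 10, 60), "footer")]
fake_ocr = lambda img: list(img)
fake_crop = lambda img, bbox: [w for w in img if contains_center(bbox, w.bbox)]

inferred = [Region((0, 0, 25, 12), "")]
entire = ocr_page(image, inferred, "entire_page", fake_ocr, fake_crop)
blocks = ocr_page(image, inferred, "individual_blocks", fake_ocr, fake_crop)
```

In this toy setup, "entire_page" also returns the word that no inferred element covers, while "individual_blocks" only fills in the elements the layout model found. That difference is one plausible reason the two modes can return different element lists for the same document.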

This PR also bumps unstructured-inference==0.7.2, since the branch relies on the OCR refactor in unstructured-inference.

Test

from unstructured.partition.auto import partition

entire_page_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="entire_page", ocr_languages="eng+kor", strategy="hi_res")
individual_blocks_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="individual_blocks", ocr_languages="eng+kor", strategy="hi_res")
print([el.text for el in entire_page_ocr_mode_elements])
print([el.text for el in individual_blocks_ocr_mode_elements])

Latest output:

# entire_page
['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. Its better to have many email', 'accounts.', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASUREWH HARUTOM|2] 팬 입니다. 팬 으 로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 불 공 평 함 을 LRU, 이 일 을 통해 저 희 의 의 혹 을 전 달 하여 귀 사 의 진지한 민 과 적극적인 답 변 을 받을 수 있 기 를 바랍니다.', '3. CC [email protected] so we can keep track of how many emails were', 'successfully sent', '4. Use the hashtag of Haruto on your tweet to show that vou have sent vour email]', '메 고']
# individual_blocks
['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. Its better to have many email', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASURES HARUTOM| 2] 팬 입니다. 팬 으로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 habe ERO, 이 머 일 을 적극 저 희 의 ASS 전 달 하여 귀 사 의 진지한 고 2 있 기 를 바랍니다.', '3. CC [email protected] so we can keep track of how many emails were ciiccecefisliy cant', 'VULLESSIULY Set 4. Use the hashtag of Haruto on your tweet to show that you have sent your email']
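
One way to compare the two outputs beyond reading them side by side is a rough text-similarity score plus a list of element texts that appear in only one mode. The sketch below uses the standard-library `difflib`; the helper names `text_similarity` and `only_in` are hypothetical and not part of unstructured, and the sample lists are shortened excerpts of the output above.

```python
from difflib import SequenceMatcher
from typing import List


def text_similarity(a: List[str], b: List[str]) -> float:
    """Similarity ratio in [0, 1] between the concatenated element texts."""
    return SequenceMatcher(None, " ".join(a), " ".join(b)).ratio()


def only_in(a: List[str], b: List[str]) -> List[str]:
    """Element texts present in `a` but absent from `b`, in original order."""
    seen = set(b)
    return [t for t in a if t not in seen]


# Shortened excerpts of the two outputs, for illustration only.
entire_page_texts = ["accounts.", "successfully sent", "메 고"]
individual_blocks_texts = ["successfully sent"]

score = text_similarity(entire_page_texts, individual_blocks_texts)
extra = only_in(entire_page_texts, individual_blocks_texts)
print(f"similarity={score:.2f}, only in entire_page: {extra}")
```

With real inputs, pass `[el.text for el in ...]` for each mode instead of the excerpts.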


yuming-long and others added 30 commits September 22, 2023 11:06

stage

4468863

stage

df85466

need tp update test

0abd264

stage

1385b33

Merge branch 'main' into yuming/refactor_ocr

8e924b6

stage

3f0c0db

Merge branch 'main' into yuming/refactor_ocr

9f66d68

change to import

327aa5b

stage

35376ab

revert code back to 5.31 inference

468e1e5

update mock test

97962c1

some todo note

bd6107b

Revert "some todo note"

58c38ac

This reverts commit bd6107b.

fix test

593f23e

TODO...

9874b63

fix all tests

8d8a0d9

cance; out the wrong guy

1d0a81b

add paddle ocr func

38c8db3

feel like missing some texts...

fdbe8a9

update todo

cac87a6

Merge branch 'main' into yuming/refactor_ocr

aaee4cd

test ingest

db23355

null <- Ingest test fixtures update (#1571)

bf7d427

This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: yuming-long <[email protected]>

tidy and add paddle entire page

21d598a

test file and more doc string

2978d91

todo note

04f4a81

note todo

54bfde2

move test to unst

c58621a

let ci depends on inference branch

0052d92

Merge branch 'main' into yuming/refactor_ocr

58a2ab4

yuming-long and others added 20 commits October 5, 2023 14:35

revert force pip install -e .

ae97449

pip unstructured-inference==0.7.0 and dep conlicts

73f3453

Merge branch 'main' into yuming/refactor_ocr

b6881e8

version bump

73ef72f

add test coverage

88fbf5c

Merge branch 'main' into yuming/refactor_ocr

a93644d

add coverage: skip converage check on paddle init

92dc988

Merge branch 'main' into yuming/refactor_ocr

a63b07e

Refactor: support entire page OCR with ocr_mode and ocr_languages…

ea323e5

… <- Ingest test fixtures update (#1658) This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: yuming-long <[email protected]>

Merge branch 'main' into yuming/refactor_ocr

4e349ae

fix: element with text=None in final_layout

25b7ea5

Refactor: support entire page OCR with ocr_mode and ocr_languages…

a311259

… <- Ingest test fixtures update (#1661) This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: christinestraub <[email protected]>

chore: update ingest test fixtures

856d3ff

chore: revert ingest test fixtures

3bd6256

chore: bump unstructured-inference==0.7.2 & make pip-compile

cc36149

Merge branch 'main' into yuming/refactor_ocr

b29f8bc

# Conflicts: # CHANGELOG.md # requirements/constraints.in # requirements/extra-pdf-image.txt # requirements/huggingface.txt # requirements/ingest-openai.txt # requirements/ingest-salesforce.txt # requirements/test.txt

chore: update version

e5b6925

Refactor: support entire page OCR with ocr_mode and ocr_languages…

a148486

… <- Ingest test fixtures update (#1677) This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: christinestraub <[email protected]>

chore: update dependencies

3957fa6

cragwolfe approved these changes Oct 6, 2023


christinestraub approved these changes Oct 6, 2023


cragwolfe added this pull request to the merge queue Oct 6, 2023

yuming-long mentioned this pull request Oct 6, 2023

Benjamin/bump unstructured inference 0.7.1 #1675

Closed

Merged via the queue into main with commit dcd6d0f Oct 6, 2023
39 checks passed

cragwolfe deleted the yuming/refactor_ocr branch October 6, 2023 23:32

yuming-long mentioned this pull request Oct 20, 2023

Missing text "Signature" in image output after entire page OCR refactor #1813

Closed
