Refactor: support entire page OCR with ocr_mode and ocr_languages #1579

Merged on Oct 6, 2023 (105 commits)
Commits
4468863
stage
yuming-long Sep 22, 2023
df85466
stage
yuming-long Sep 25, 2023
0abd264
need tp update test
yuming-long Sep 26, 2023
1385b33
stage
yuming-long Sep 26, 2023
8e924b6
Merge branch 'main' into yuming/refactor_ocr
yuming-long Sep 26, 2023
3f0c0db
stage
yuming-long Sep 27, 2023
9f66d68
Merge branch 'main' into yuming/refactor_ocr
yuming-long Sep 27, 2023
327aa5b
change to import
yuming-long Sep 27, 2023
35376ab
stage
yuming-long Sep 27, 2023
468e1e5
revert code back to 5.31 inference
yuming-long Sep 27, 2023
97962c1
update mock test
yuming-long Sep 27, 2023
bd6107b
some todo note
yuming-long Sep 27, 2023
58c38ac
Revert "some todo note"
yuming-long Sep 27, 2023
593f23e
fix test
yuming-long Sep 27, 2023
9874b63
TODO...
yuming-long Sep 27, 2023
8d8a0d9
fix all tests
yuming-long Sep 27, 2023
1d0a81b
cance; out the wrong guy
yuming-long Sep 28, 2023
38c8db3
add paddle ocr func
yuming-long Sep 28, 2023
fdbe8a9
feel like missing some texts...
yuming-long Sep 28, 2023
cac87a6
update todo
yuming-long Sep 28, 2023
aaee4cd
Merge branch 'main' into yuming/refactor_ocr
yuming-long Sep 28, 2023
db23355
test ingest
yuming-long Sep 28, 2023
bf7d427
null <- Ingest test fixtures update (#1571)
ryannikolaidis Sep 29, 2023
21d598a
tidy and add paddle entire page
yuming-long Sep 29, 2023
2978d91
test file and more doc string
yuming-long Sep 29, 2023
04f4a81
todo note
yuming-long Sep 29, 2023
54bfde2
note todo
yuming-long Sep 29, 2023
c58621a
move test to unst
yuming-long Sep 29, 2023
0052d92
let ci depends on inference branch
yuming-long Sep 29, 2023
58a2ab4
Merge branch 'main' into yuming/refactor_ocr
yuming-long Sep 29, 2023
f9ec23e
changelog versoin
yuming-long Sep 29, 2023
afaa5f3
lint check
yuming-long Sep 29, 2023
19a9b70
no source
yuming-long Sep 29, 2023
ee6859a
Yuming/refactor ocr <- Ingest test fixtures update (#1582)
ryannikolaidis Sep 29, 2023
56374de
update test ficture ci
yuming-long Sep 29, 2023
652c3f4
update copyied code
yuming-long Sep 29, 2023
ff628ce
Merge branch 'main' into yuming/refactor_ocr
yuming-long Sep 29, 2023
6ea82c2
update ci
yuming-long Oct 2, 2023
8cab7b2
aviod conflict
yuming-long Oct 2, 2023
4663361
Revert "aviod conflict"
yuming-long Oct 2, 2023
f3c0df8
Merge branch 'main' into yuming/refactor_ocr
yuming-long Oct 2, 2023
539f4c5
depilicate name
yuming-long Oct 2, 2023
fb1eaf1
new line?
yuming-long Oct 2, 2023
593b9c5
Yuming/refactor ocr <- Ingest test fixtures update (#1617)
ryannikolaidis Oct 2, 2023
cf7901a
add individual blockers to ocr mode
yuming-long Oct 2, 2023
f6684f2
moe mote
yuming-long Oct 3, 2023
e92b714
fix bug for tests
yuming-long Oct 3, 2023
abb8f67
nit on mock ocr func name
yuming-long Oct 3, 2023
7915128
should fix all TODO with no ticket number
yuming-long Oct 3, 2023
22ad3b6
add dostring
yuming-long Oct 3, 2023
0539dd1
assume to use image from pade.image
yuming-long Oct 3, 2023
cea28da
bug fix
yuming-long Oct 3, 2023
e3a6577
Revert "assume to use image from pade.image"
yuming-long Oct 3, 2023
0426811
add ocr text
yuming-long Oct 3, 2023
7204ec6
from file test
yuming-long Oct 3, 2023
cd9473e
more test coverage
yuming-long Oct 3, 2023
22c1f6d
Merge branch 'main' into yuming/refactor_ocr
yuming-long Oct 3, 2023
43093e7
rewite try except
yuming-long Oct 3, 2023
398c96a
revert some fixed changes; only import paddle in func
yuming-long Oct 3, 2023
7691a11
add pip install -e . right before ingest update
yuming-long Oct 4, 2023
0c5b0a4
updaye for ci test
yuming-long Oct 4, 2023
b9ea113
revert all ci yaml changes
yuming-long Oct 4, 2023
0725bea
Chore: support entire page OCR with `ocr_mode` and `ocr_languages` <-…
ryannikolaidis Oct 4, 2023
ef44c8c
install branch right before test
yuming-long Oct 4, 2023
404fb71
Merge branch 'main' into yuming/refactor_ocr
yuming-long Oct 4, 2023
d3b5a8f
Chore: support entire page OCR with `ocr_mode` and `ocr_languages` <-…
ryannikolaidis Oct 4, 2023
cd82e31
refactor: add `OCRMode` enum
christinestraub Oct 4, 2023
5cdf327
move tesseract env; move constant
yuming-long Oct 4, 2023
db2e48b
Merge branch 'main' into yuming/refactor_ocr
yuming-long Oct 4, 2023
f61ee9a
add padding logic to individual blocks
yuming-long Oct 4, 2023
904d85e
Merge branch 'main' into yuming/refactor_ocr
christinestraub Oct 5, 2023
21e93c1
refactor: keep original element when adding padding
christinestraub Oct 5, 2023
463d85f
test: add test cases for `pad_element_bboxes()`
christinestraub Oct 5, 2023
68e41f0
refactor: remove unused index
christinestraub Oct 5, 2023
819047a
refactor: fix spelling mistakes
christinestraub Oct 5, 2023
3293f9f
Merge branch 'main' into yuming/refactor_ocr
christinestraub Oct 5, 2023
9c8ea7e
fix test: add index to title since xy cut
yuming-long Oct 5, 2023
6c12c24
fix test: update title output since ocr change it
yuming-long Oct 5, 2023
d421949
lint
yuming-long Oct 5, 2023
6ac3505
feat: update logic to merge "out layout" (returned by `unstructured-i…
christinestraub Oct 5, 2023
223038e
fix test and doc nit inferred_layout -> out_layout
yuming-long Oct 5, 2023
aa17d8e
Merge branch 'main' into yuming/refactor_ocr
yuming-long Oct 5, 2023
dfeba46
Merge branch 'main' into yuming/refactor_ocr
yuming-long Oct 5, 2023
2260b99
refactor: keep passing parameters used to extract images from PDF's t…
christinestraub Oct 5, 2023
428ba60
update ocr output in test
yuming-long Oct 5, 2023
ae97449
revert force pip install -e .
yuming-long Oct 5, 2023
73f3453
pip unstructured-inference==0.7.0 and dep conlicts
yuming-long Oct 5, 2023
b6881e8
Merge branch 'main' into yuming/refactor_ocr
yuming-long Oct 5, 2023
73ef72f
version bump
yuming-long Oct 5, 2023
88fbf5c
add test coverage
yuming-long Oct 5, 2023
a93644d
Merge branch 'main' into yuming/refactor_ocr
yuming-long Oct 5, 2023
92dc988
add coverage: skip converage check on paddle init
yuming-long Oct 5, 2023
a63b07e
Merge branch 'main' into yuming/refactor_ocr
yuming-long Oct 5, 2023
ea323e5
Refactor: support entire page OCR with `ocr_mode` and `ocr_languages`…
ryannikolaidis Oct 5, 2023
4e349ae
Merge branch 'main' into yuming/refactor_ocr
christinestraub Oct 5, 2023
25b7ea5
fix: element with `text=None` in final_layout
christinestraub Oct 6, 2023
d19a55f
Merge branch 'main' into yuming/refactor_ocr
christinestraub Oct 6, 2023
a311259
Refactor: support entire page OCR with `ocr_mode` and `ocr_languages`…
ryannikolaidis Oct 6, 2023
856d3ff
chore: update ingest test fixtures
christinestraub Oct 6, 2023
3bd6256
chore: revert ingest test fixtures
christinestraub Oct 6, 2023
cc36149
chore: bump unstructured-inference==0.7.2 & make pip-compile
christinestraub Oct 6, 2023
b29f8bc
Merge branch 'main' into yuming/refactor_ocr
christinestraub Oct 6, 2023
e5b6925
chore: update version
christinestraub Oct 6, 2023
a148486
Refactor: support entire page OCR with `ocr_mode` and `ocr_languages`…
ryannikolaidis Oct 6, 2023
3957fa6
chore: update dependencies
christinestraub Oct 6, 2023
2 changes: 2 additions & 0 deletions .coveragerc
@@ -1,3 +1,5 @@
 [run]
 omit =
     unstructured/ingest/*
+    # TODO(yuming): please remove this line after adding tests for paddle (CORE-1886)
+    unstructured/partition/utils/ocr_models/paddle_ocr.py
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,7 +1,8 @@
-## 0.10.20-dev5
+## 0.10.20-dev6
 
 ### Enhancements
 
+* **Refactor OCR code** The OCR code for entire page is moved from unstructured-inference to unstructured. On top of continuing support for OCR language parameter, we also support two OCR processing modes, "entire_page" or "individual_blocks".
 * **Align to top left when shrinking bounding boxes for `xy-cut` sorting:** Update `shrink_bbox()` to keep top left rather than center.
 * **Add visualization script to annotate elements** This script is often used to analyze/visualize elements with coordinates (e.g. partition_pdf()).
 * **Adds data source properties to the Jira connector** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
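The two OCR modes named in the changelog entry above can be modelled as an enum with a validation step. The sketch below is illustrative only: an `OCRMode` enum is mentioned in this PR's commit list, but the member names, the helper `validate_ocr_mode`, and the error message are assumptions made for this example, not the library's code.

```python
from enum import Enum


class OCRMode(Enum):
    """Hypothetical enum; member names are assumed for illustration."""

    INDIVIDUAL_BLOCKS = "individual_blocks"
    FULL_PAGE = "entire_page"


def validate_ocr_mode(ocr_mode: str) -> OCRMode:
    """Return the matching mode, or raise ValueError for an unknown string."""
    try:
        return OCRMode(ocr_mode)
    except ValueError:
        valid = ", ".join(mode.value for mode in OCRMode)
        raise ValueError(
            f"Invalid ocr_mode {ocr_mode!r}; expected one of: {valid}"
        ) from None
```

Validating the string once at the entry point means later code can branch on enum members rather than on raw strings, and an unsupported value fails with a `ValueError`, which is the behaviour the new `test_partition_image_hi_res_invalid_ocr_mode` test expects.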
1 change: 1 addition & 0 deletions requirements/constraints.in
@@ -45,3 +45,4 @@ anyio<4.0
 opencv-python==4.8.0.76
 opencv-contrib-python==4.8.0.76
 onnxruntime==1.15.1
+platformdirs==3.10.0
20 changes: 11 additions & 9 deletions requirements/dev.txt
@@ -8,6 +8,10 @@ anyio==3.7.1
     # via
     #   -c requirements/constraints.in
     #   jupyter-server
+appdirs==1.4.4
+    # via
+    #   -c requirements/test.txt
+    #   virtualenv
 appnope==0.1.3
     # via
     #   ipykernel
@@ -34,7 +38,7 @@ beautifulsoup4==4.12.2
     # via
     #   -c requirements/base.txt
     #   nbconvert
-bleach==6.0.0
+bleach==6.1.0
     # via nbconvert
 build==1.0.3
     # via pip-tools
@@ -153,7 +157,7 @@ jupyter-client==8.3.1
     #   qtconsole
 jupyter-console==6.6.3
     # via jupyter
-jupyter-core==5.3.2
+jupyter-core==4.12.0
     # via
     #   -c requirements/constraints.in
     #   ipykernel
@@ -245,12 +249,7 @@ pip-tools==7.3.0
     # via -r requirements/dev.in
 pkgutil-resolve-name==1.3.10
     # via jsonschema
-platformdirs==3.11.0
-    # via
-    #   -c requirements/test.txt
-    #   jupyter-core
-    #   virtualenv
-pre-commit==3.4.0
+pre-commit==2.20.0
     # via -r requirements/dev.in
 prometheus-client==0.17.1
     # via jupyter-server
@@ -333,6 +332,7 @@ six==1.16.0
     #   bleach
     #   python-dateutil
     #   rfc3339-validator
+    #   virtualenv
 sniffio==1.3.0
     # via anyio
 soupsieve==2.5
@@ -347,6 +347,8 @@ terminado==0.17.1
     #   jupyter-server-terminals
 tinycss2==1.2.1
     # via nbconvert
+toml==0.10.2
+    # via pre-commit
 tomli==2.0.1
     # via
     #   -c requirements/test.txt
@@ -395,7 +397,7 @@ urllib3==1.26.17
     #   -c requirements/constraints.in
     #   -c requirements/test.txt
     #   requests
-virtualenv==20.24.5
+virtualenv==20.4.7
     # via pre-commit
 wcwidth==0.2.8
     # via prompt-toolkit
2 changes: 1 addition & 1 deletion requirements/extra-markdown.txt
@@ -6,7 +6,7 @@
 #
 importlib-metadata==6.8.0
     # via markdown
-markdown==3.4.4
+markdown==3.5
     # via -r requirements/extra-markdown.in
 zipp==3.17.0
     # via importlib-metadata
2 changes: 1 addition & 1 deletion requirements/extra-paddleocr.txt
@@ -45,7 +45,7 @@ flask==3.0.0
     #   visualdl
 flask-babel==4.0.0
     # via visualdl
-fonttools==4.43.0
+fonttools==4.43.1
     # via matplotlib
 future==0.18.3
     # via bce-python-sdk
2 changes: 1 addition & 1 deletion requirements/extra-pdf-image.in
@@ -5,7 +5,7 @@ pdf2image
 pdfminer.six
 # Do not move to contsraints.in, otherwise unstructured-inference will not be upgraded
 # when unstructured library is.
-unstructured-inference==0.6.6
+unstructured-inference==0.7.2
 # unstructured fork of pytesseract that provides an interface to allow for multiple output formats
 # from one tesseract call
 unstructured.pytesseract>=0.3.12
8 changes: 4 additions & 4 deletions requirements/extra-pdf-image.txt
@@ -35,14 +35,14 @@ filelock==3.12.4
     #   transformers
 flatbuffers==23.5.26
     # via onnxruntime
-fonttools==4.43.0
+fonttools==4.43.1
     # via matplotlib
 fsspec==2023.9.1
     # via
     #   -c requirements/constraints.in
     #   huggingface-hub
     #   torch
-huggingface-hub==0.16.4
+huggingface-hub==0.17.3
     # via
     #   timm
     #   tokenizers
@@ -199,7 +199,7 @@ sympy==1.12
     #   torch
 timm==0.9.7
     # via effdet
-tokenizers==0.14.0
+tokenizers==0.14.1
     # via transformers
 torch==2.1.0
     # via
@@ -229,7 +229,7 @@ typing-extensions==4.8.0
     #   torch
 tzdata==2023.3
     # via pandas
-unstructured-inference==0.6.6
+unstructured-inference==0.7.2
     # via -r requirements/extra-pdf-image.in
 unstructured-pytesseract==0.3.12
     # via
4 changes: 2 additions & 2 deletions requirements/huggingface.txt
@@ -27,7 +27,7 @@ fsspec==2023.9.1
     #   -c requirements/constraints.in
     #   huggingface-hub
     #   torch
-huggingface-hub==0.16.4
+huggingface-hub==0.17.3
     # via
     #   tokenizers
     #   transformers
@@ -90,7 +90,7 @@ six==1.16.0
     #   sacremoses
 sympy==1.12
     # via torch
-tokenizers==0.14.0
+tokenizers==0.14.1
     # via transformers
 torch==2.1.0
     # via -r requirements/huggingface.in
6 changes: 4 additions & 2 deletions requirements/ingest-openai.txt
@@ -40,6 +40,8 @@ frozenlist==1.4.0
     # via
     #   aiohttp
     #   aiosignal
+greenlet==3.0.0
+    # via sqlalchemy
 idna==3.4
     # via
     #   -c requirements/base.txt
@@ -50,9 +52,9 @@ jsonpatch==1.33
     # via langchain
 jsonpointer==2.4
     # via jsonpatch
-langchain==0.0.309
+langchain==0.0.310
     # via -r requirements/ingest-openai.in
-langsmith==0.0.42
+langsmith==0.0.43
     # via langchain
 marshmallow==3.20.1
     # via
6 changes: 4 additions & 2 deletions requirements/ingest-salesforce.txt
@@ -33,8 +33,10 @@ more-itertools==10.1.0
     # via simple-salesforce
 pendulum==2.1.2
     # via simple-salesforce
-platformdirs==3.11.0
-    # via zeep
+platformdirs==3.10.0
+    # via
+    #   -c requirements/constraints.in
+    #   zeep
 pycparser==2.21
     # via cffi
 pyjwt==2.8.0
6 changes: 4 additions & 2 deletions requirements/test.txt
@@ -68,8 +68,10 @@ packaging==23.2
     #   pytest
 pathspec==0.11.2
     # via black
-platformdirs==3.11.0
-    # via black
+platformdirs==3.10.0
+    # via
+    #   -c requirements/constraints.in
+    #   black
 pluggy==1.3.0
     # via pytest
 pycodestyle==2.11.0
61 changes: 46 additions & 15 deletions test_unstructured/partition/pdf-image/test_image.py
@@ -8,7 +8,7 @@
 from unstructured_inference.inference import layout
 
 from unstructured.chunking.title import chunk_by_title
-from unstructured.partition import image, pdf
+from unstructured.partition import image, ocr, pdf
 from unstructured.partition.json import partition_json
 from unstructured.partition.utils.constants import UNSTRUCTURED_INCLUDE_DEBUG_METADATA
 from unstructured.staging.base import elements_to_json
@@ -84,7 +84,10 @@ def pages(self):
 
 @pytest.mark.parametrize(
     ("filename", "file"),
-    [("example-docs/example.jpg", None), (None, b"0000")],
+    [
+        ("example-docs/example.jpg", None),
+        (None, b"0000"),
+    ],
 )
 def test_partition_image_local(monkeypatch, filename, file):
     monkeypatch.setattr(
@@ -97,6 +100,16 @@ def test_partition_image_local(monkeypatch, filename, file):
         "process_file_with_model",
         lambda *args, **kwargs: MockDocumentLayout(),
     )
+    monkeypatch.setattr(
+        ocr,
+        "process_data_with_ocr",
+        lambda *args, **kwargs: MockDocumentLayout(),
+    )
+    monkeypatch.setattr(
+        ocr,
+        "process_data_with_ocr",
+        lambda *args, **kwargs: MockDocumentLayout(),
+    )
 
     partition_image_response = pdf._partition_pdf_or_image_local(
         filename,
@@ -146,8 +159,8 @@ def test_partition_image_with_multipage_tiff(
 
 def test_partition_image_with_language_passed(filename="example-docs/example.jpg"):
     with mock.patch.object(
-        layout,
-        "process_file_with_model",
+        ocr,
+        "process_file_with_ocr",
         mock.MagicMock(),
     ) as mock_partition:
         image.partition_image(
@@ -163,8 +176,8 @@ def test_partition_image_from_file_with_language_passed(
     filename="example-docs/example.jpg",
 ):
     with mock.patch.object(
-        layout,
-        "process_data_with_model",
+        ocr,
+        "process_data_with_ocr",
         mock.MagicMock(),
     ) as mock_partition, open(filename, "rb") as f:
         image.partition_image(file=f, strategy="hi_res", ocr_languages="eng+swe")
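
The `ocr_languages="eng+swe"` argument in the test above follows Tesseract's convention of joining language codes with `+`. A minimal sketch of building such a string from a list, assuming a hypothetical helper `join_tesseract_languages` that is not part of this PR or of `unstructured`:

```python
from typing import Iterable


def join_tesseract_languages(languages: Iterable[str]) -> str:
    """Join language codes into a Tesseract-style string such as 'eng+swe'.

    Hypothetical helper for illustration; blank entries are dropped.
    """
    cleaned = [lang.strip() for lang in languages if lang.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)
```

Called with `["eng", "swe"]` this returns `"eng+swe"`, and a single code such as `["jpn_vert"]` is passed through unchanged.
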
@@ -437,16 +450,13 @@ def test_partition_image_with_ocr_coordinates_are_not_nan_from_filename(
 
 def test_partition_image_formats_languages_for_tesseract():
     filename = "example-docs/jpn-vert.jpeg"
-    with mock.patch.object(layout, "process_file_with_model", mock.MagicMock()) as mock_process:
+    with mock.patch(
+        "unstructured.partition.ocr.process_file_with_ocr",
+    ) as mock_process_file_with_ocr:
         image.partition_image(filename=filename, strategy="hi_res", languages=["jpn_vert"])
-        mock_process.assert_called_once_with(
-            filename,
-            is_image=True,
-            ocr_languages="jpn_vert",
-            ocr_mode="entire_page",
-            extract_tables=False,
-            model_name=pdf.default_hi_res_model(),
-        )
+        _, kwargs = mock_process_file_with_ocr.call_args_list[0]
+        assert "ocr_languages" in kwargs
+        assert kwargs["ocr_languages"] == "jpn_vert"
 
 
 def test_partition_image_warns_with_ocr_languages(caplog):
@@ -493,3 +503,24 @@ def test_partition_image_uses_model_name():
         print(mockpartition.call_args)
         assert "model_name" in mockpartition.call_args.kwargs
         assert mockpartition.call_args.kwargs["model_name"]
+
+
+@pytest.mark.parametrize(
+    ("ocr_mode", "idx_title_element"),
+    [
+        ("entire_page", 2),
+        ("individual_blocks", 1),
+    ],
+)
+def test_partition_image_hi_res_ocr_mode(ocr_mode, idx_title_element):
+    filename = "example-docs/layout-parser-paper-fast.jpg"
+    elements = image.partition_image(filename=filename, ocr_mode=ocr_mode, strategy="hi_res")
+    first_line = "LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis"
+    # Note(yuming): idx_title_element is different based on xy-cut and ocr mode
+    assert elements[idx_title_element].text == first_line
+
+
+def test_partition_image_hi_res_invalid_ocr_mode():
+    filename = "example-docs/layout-parser-paper-fast.jpg"
+    with pytest.raises(ValueError):
+        _ = image.partition_image(filename=filename, ocr_mode="invalid_ocr_mode", strategy="hi_res")