Fix/pdf miner source property #228

benjats07 · 2023-09-23T00:28:30Z

This PR adds three possible values for the source field:

pdfminer as the source for elements obtained directly from PDFs.
OCR-tesseract and OCR-paddle for elements obtained with the respective OCR engines.

All of these new values are stored in a new class, Source, in unstructured_inference/constants.py.

This helps users filter elements depending on how they were obtained.
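As a rough illustration of the filtering use case, here is a minimal sketch. It assumes Source is an Enum whose values are the three strings listed above; the member names, the Element stand-in class, and the filter_by_source helper are hypothetical and are not taken from the library.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Optional, Set


class Source(Enum):
    # Member names are assumptions; the values are the strings named in this PR.
    PDFMINER = "pdfminer"
    OCR_TESSERACT = "OCR-tesseract"
    OCR_PADDLE = "OCR-paddle"


@dataclass
class Element:
    # Hypothetical stand-in for a layout element; only the fields needed here.
    text: str
    source: Optional[Source] = None


def filter_by_source(elements: Iterable[Element], wanted: Set[Source]) -> List[Element]:
    """Return only the elements whose source is one of `wanted`."""
    return [e for e in elements if e.source in wanted]


elements = [
    Element("Lorem ipsum", Source.PDFMINER),
    Element("dolor sit", Source.OCR_TESSERACT),
    Element("amet", Source.OCR_PADDLE),
]

# Keep only the elements that came from an OCR engine.
ocr_only = filter_by_source(elements, {Source.OCR_TESSERACT, Source.OCR_PADDLE})
print([e.text for e in ocr_only])
```

Importing the names from one place means a caller does not have to remember how each string is spelled.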

For testing purposes, you can execute this script:

from unstructured_inference.constants import OCRMode
from unstructured_inference.inference import layout
from unstructured_inference.models.base import get_model

file = "sample-docs/loremipsum-flat.pdf"
model = get_model("yolox_tiny")

# Force full-page OCR so that the page contains OCR-derived elements.
doc = layout.DocumentLayout.from_file(
    file,
    model,
    ocr_mode=OCRMode.FULL_PAGE.value,
    supplement_with_ocr_elements=True,
    ocr_strategy="force",
)
assert "OCR-tesseract" in {e.source for e in doc.pages[0].elements}
print("OCR-tesseract in sources")

# Elements extracted directly from the PDF should report pdfminer as their source.
elements = layout.load_pdf(file)
assert elements[0][0][0].source == "pdfminer"
print("pdfminer in sources")

The output should be:

OCR-tesseract in sources
pdfminer in sources

qued

LGTM. I wonder if as a follow on we could make the source property mandatory... I don't know if we ever want to allow an element without a source?

badGarnet · 2023-09-26T01:40:23Z

unstructured_inference/inference/layout.py

+                x2 * coef,
+                y2 * coef,
+                text=_text,
+                source="pdfminer",


Can we have those strings defined as constants? That way users can import the names instead of needing to read the code to find out how they are spelled and what options exist.

Sure, but there will be a special case: the merged type in

https://github.com/Unstructured-IO/unstructured-inference/blob/2de61a4d5c60dfda4d789e7d7fdcdf49f1f04960/unstructured_inference/inference/layoutelement.py#L314C62-L314C62

If we want to turn these into constants, then the source for those elements needs to be just "merged" 🤔 (losing the sources of the original elements). Any ideas about how to solve this (apart from enumerating all possible combinations)?

Edit: I added another attribute to merged elements, but I'm not sure it is the best approach (see

unstructured-inference/unstructured_inference/inference/layoutelement.py

Line 320 in 46f2c8d

setattr(element, "merged_sources", sources)

)
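The idea discussed above can be sketched roughly as follows: the merged element gets a single "merged" source, and the original sources are kept in a separate merged_sources attribute set with setattr. This is only an illustration under stated assumptions. The Source member names, the Element class, and the merge_elements helper are hypothetical; only the "merged" value and the merged_sources attribute name come from the conversation.

```python
from enum import Enum


class Source(Enum):
    # Member names are assumptions; MERGED mirrors the "merged" value discussed above.
    PDFMINER = "pdfminer"
    OCR_TESSERACT = "OCR-tesseract"
    MERGED = "merged"


class Element:
    # Hypothetical stand-in for a layout element.
    def __init__(self, text, source):
        self.text = text
        self.source = source


def merge_elements(first, second):
    """Merge two elements, recording where each part originally came from."""
    merged = Element(f"{first.text} {second.text}", Source.MERGED)
    sources = []
    for element in (first, second):
        # An element that was itself merged already carries its origin list.
        for origin in getattr(element, "merged_sources", [element.source]):
            if origin not in sources:
                sources.append(origin)
    setattr(merged, "merged_sources", sources)
    return merged


a = Element("Lorem", Source.PDFMINER)
b = Element("ipsum", Source.OCR_TESSERACT)
m = merge_elements(a, b)
print(m.source.value, [s.value for s in m.merged_sources])
```

With this shape the set of possible source values stays fixed, and the provenance of a merged element is still available without enumerating every combination.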

benjats07 · 2023-09-26T04:47:47Z

LGTM. I wonder if as a follow on we could make the source property mandatory... I don't know if we ever want to allow an element without a source?

🤔 Yeah, this must be mandatory; it makes no sense to have elements coming from an unknown source.
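One possible way to make the source mandatory, sketched here under assumptions, is to give the field no default and validate it on construction. The Element dataclass and the Source member names below are hypothetical and do not reflect the library's actual classes.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    # Member names are assumptions; the values are the strings named in this PR.
    PDFMINER = "pdfminer"
    OCR_TESSERACT = "OCR-tesseract"
    OCR_PADDLE = "OCR-paddle"


@dataclass
class Element:
    # Hypothetical stand-in for a layout element.
    text: str
    source: Source  # no default, so callers must always say where the element came from

    def __post_init__(self):
        # Also reject None or a bare string passed explicitly.
        if not isinstance(self.source, Source):
            raise TypeError(f"source must be a Source member, got {self.source!r}")


ok = Element("Lorem ipsum", Source.PDFMINER)

try:
    Element("no provenance")  # missing required argument
except TypeError as exc:
    print("rejected:", exc)
```

The trade-off is that every code path that creates an element has to be updated to pass a source, which is probably why this was suggested as a follow-up rather than part of this PR.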

Benjamin Torres added 2 commits September 22, 2023 18:28

fix: add 'pdfminer' as source

a477190

CHANGELOG update

fc403dc

benjats07 linked an issue Sep 23, 2023 that may be closed by this pull request

Bug: pdf miner elements don't contain source property correctly filled #227

Closed

Benjamin Torres and others added 6 commits September 22, 2023 18:53

refactor: change names of OCR sources

06817a8

test: add source checking

01eaf42

fix: adds OCR specific source to elements

988ee35

Merge branch 'main' into fix/pdf-miner-source-property

2375eee

test: add test for OCR source

08ba663

CHANGELOG update

2de61a4

benjats07 requested review from qued and badGarnet September 26, 2023 01:10

benjats07 marked this pull request as ready for review September 26, 2023 01:11

qued approved these changes Sep 26, 2023

badGarnet reviewed Sep 26, 2023

refactor: sources are now an Enum

46f2c8d

benjats07 merged commit f4236c8 into main Sep 26, 2023

benjats07 deleted the fix/pdf-miner-source-property branch September 26, 2023 17:05
