Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/pdf miner source property #228

Merged
merged 9 commits into from
Sep 26, 2023
Merged

Conversation

benjats07
Copy link
Contributor

@benjats07 benjats07 commented Sep 23, 2023

This PR adds three possible values for source field:

  • pdfminer as source for elements directly obtained from PDFs.
  • OCR-tesseract and OCR-paddle for elements obtained with the respective OCR engines.

All those new values are stored in a new class Source in unstructured_inference>constants.py

This would help users to filter certain elements depending on how were obtained.

For testing purposes, you can execute this script:

from unstructured_inference.constants import OCRMode
from unstructured_inference.inference import layout
from unstructured_inference.models.base import get_model

file = "sample-docs/loremipsum-flat.pdf"
model = get_model("yolox_tiny")
doc = layout.DocumentLayout.from_file(
    file,
    model,
    ocr_mode=OCRMode.FULL_PAGE.value,
    supplement_with_ocr_elements=True,
    ocr_strategy="force",
)
assert "OCR-tesseract" in {e.source for e in doc.pages[0].elements}
print("OCR-tesseract in sources")

elements = layout.load_pdf(file)
assert elements[0][0][0].source == 'pdfminer'
print("pdfminer in sources")

The output should be:

OCR-tesseract in sources
pdfminer in sources

@benjats07 benjats07 requested review from qued and badGarnet September 26, 2023 01:10
@benjats07 benjats07 marked this pull request as ready for review September 26, 2023 01:11
Copy link
Contributor

@qued qued left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I wonder if as a follow on we could make the source property mandatory... I don't know if we ever want to allow an element without a source?

x2 * coef,
y2 * coef,
text=_text,
source="pdfminer",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have those string constants defined as constants? this is so that user can import the names instead of need to read the code to find out how they are spelled and what options are there.

Copy link
Contributor Author

@benjats07 benjats07 Sep 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but...will be a special case: the merged type in

https://github.com/Unstructured-IO/unstructured-inference/blob/2de61a4d5c60dfda4d789e7d7fdcdf49f1f04960/unstructured_inference/inference/layoutelement.py#L314C62-L314C62

If we want to transform this into constants, then the source for those elements needs to be just "merged" 🤔 (losing the source of origin elements), any ideas about how to solve this? (apart from enumerating all possible combinations)

Edit: I added other attribute to merged elements, but not sure if is the best approach (see

setattr(element, "merged_sources", sources)
)

@benjats07
Copy link
Contributor Author

LGTM. I wonder if as a follow on we could make the source property mandatory... I don't know if we ever want to allow an element without a source?

🤔 Yeah, this must be mandatory, no makes sense elements coming from an unknown source.

@benjats07 benjats07 merged commit f4236c8 into main Sep 26, 2023
@benjats07 benjats07 deleted the fix/pdf-miner-source-property branch September 26, 2023 17:05
benjats07 added a commit that referenced this pull request Sep 30, 2023
This PR adds three possible values for `source` field:
* `pdfminer` as source for elements directly obtained from PDFs.
* `OCR-tesseract` and `OCR-paddle` for elements obtained with the
respective OCR engines.

All those new values are stored in a new class `Source` in unstructured_inference>constants.py

This would help users filter certain elements depending on how were
obtained.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: pdf miner elements don't contain source property correctly filled
3 participants