Fix/pdf miner source property #228
LGTM. I wonder if, as a follow-up, we could make the `source` property mandatory... I don't know if we ever want to allow an element without a source?
x2 * coef,
y2 * coef,
text=_text,
source="pdfminer",
Can we have those strings defined as constants? That way users can import the names instead of needing to read the code to find out how they are spelled and what options exist.
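A minimal sketch of what importable constants could look like, assuming an `Enum` is acceptable. The class name `Source` and the three string values come from the PR description; the member names (`PDFMINER`, `OCR_TESSERACT`, `OCR_PADDLE`) are hypothetical and may not match what the PR actually uses.

```python
from enum import Enum


class Source(Enum):
    """Where an element came from (member names are assumed, values are from the PR)."""

    PDFMINER = "pdfminer"
    OCR_TESSERACT = "OCR-tesseract"
    OCR_PADDLE = "OCR-paddle"


# Callers import the name instead of retyping the string.
element_source = Source.PDFMINER

# The available options can be listed without reading the implementation.
available_values = [member.value for member in Source]
```

With an enum, a misspelled member name such as `Source.PDFMINR` raises `AttributeError` immediately, whereas a misspelled string would only show up as a filter that silently matches nothing.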
Sure, but...will be a special case: the merged
type in
If we want to turn these into constants, then the source for those elements would need to be just "merged" 🤔 (losing the sources of the original elements). Any ideas on how to solve this, apart from enumerating all possible combinations?
Edit: I added another attribute to merged elements, but I'm not sure it is the best approach (see `setattr(element, "merged_sources", sources)`).
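To make the trade-off concrete, here is a rough sketch of the "extra attribute" idea: the merged element gets a single constant as its `source`, and the original sources are kept in `merged_sources`. The `Element` dataclass, the `MERGED` member, and `merge_elements` are hypothetical stand-ins for illustration, not the library's real classes or functions; only the `setattr` call mirrors the line referenced above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Source(Enum):
    PDFMINER = "pdfminer"
    OCR_TESSERACT = "OCR-tesseract"
    MERGED = "merged"  # assumed constant for merged elements


@dataclass
class Element:
    """Hypothetical stand-in for a layout element."""

    text: str
    source: Optional[Source] = None


def merge_elements(first: Element, second: Element) -> Element:
    """Combine two elements and remember where each one came from."""
    merged = Element(text=f"{first.text} {second.text}", source=Source.MERGED)
    sources = [first.source, second.source]
    # Same pattern as in the PR: attach the original sources dynamically.
    setattr(merged, "merged_sources", sources)
    return merged


merged = merge_elements(
    Element("Hello", Source.PDFMINER),
    Element("world", Source.OCR_TESSERACT),
)
```

This keeps `source` restricted to a fixed set of constants while still preserving provenance. One drawback is that an attribute set dynamically is invisible to type checkers, so declaring `merged_sources` as a regular field might be cleaner.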
🤔 Yeah, this must be mandatory; elements coming from an unknown source make no sense.
This PR adds three possible values for the `source` field:
* `pdfminer` as the source for elements obtained directly from PDFs.
* `OCR-tesseract` and `OCR-paddle` for elements obtained with the respective OCR engines.

All these new values are stored in a new class `Source` in unstructured_inference>constants.py. This helps users filter elements depending on how they were obtained.
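As an illustration of the filtering use case, the sketch below keeps only elements whose `source` is in a given set. The `Element` dataclass and `filter_by_source` helper are hypothetical and exist only for this example; the string values are the ones listed in this description.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Element:
    """Hypothetical stand-in for an extracted element."""

    text: str
    source: str


def filter_by_source(elements: Iterable[Element], wanted: Set[str]) -> List[Element]:
    """Return the elements whose source is one of the wanted values."""
    return [element for element in elements if element.source in wanted]


elements = [
    Element("Title", "pdfminer"),
    Element("Scanned paragraph", "OCR-tesseract"),
    Element("Scanned table cell", "OCR-paddle"),
]

# Keep only elements produced by an OCR engine.
ocr_elements = filter_by_source(elements, {"OCR-tesseract", "OCR-paddle"})
```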
For testing purposes, you can execute this script:
The output should be: