Extract subscripts / superscripts #1976

Closed
malinphy opened this issue Jul 18, 2023 · 4 comments
Labels
workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow

Comments

@malinphy
malinphy commented Jul 18, 2023

PyPDF loader is a great tool. However, I am having a problem extracting chemical formulas with subscripts. For example, H2O is extracted as \nHO \2. Is there any way to fix this issue (maybe with visitor functions)? Thanks in advance.
Best regards

@pubpub-zz
Collaborator

PDF is an electronic printing format in which "glyphs" are printed at defined positions with a defined size. Subscripts are not always printed in the right order. You can try with visitors, but it may be tricky.
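
To make the "glyphs at defined positions" point concrete, here is a minimal sketch of a visitor that records where each text fragment is printed. It relies on the `visitor_text` argument of pypdf's `extract_text`, whose callback receives the text, the current transformation matrix, the text matrix, the font dictionary and the font size. The `Fragment` type and the helpers `make_visitor` and `collect_fragments` are hypothetical names introduced for illustration. The sketch also assumes that `tm[4]` and `tm[5]` are usable as x and y, which ignores any extra scaling in the transformation matrix and may not hold for every PDF.

```python
from typing import Callable, List, NamedTuple


class Fragment(NamedTuple):
    text: str
    x: float          # horizontal position, taken from the text matrix (tm[4])
    y: float          # baseline position, taken from the text matrix (tm[5])
    font_size: float  # font size as passed to the visitor


def make_visitor(sink: List[Fragment]) -> Callable:
    """Return a visitor_text callback that appends non-blank fragments to sink."""
    def visitor(text, cm, tm, font_dict, font_size):
        # Whitespace-only chunks carry no glyphs worth tracking here.
        if text.strip():
            sink.append(Fragment(text, tm[4], tm[5], font_size))
    return visitor


def collect_fragments(pdf_path: str, page_index: int = 0) -> List[Fragment]:
    """Run text extraction on one page and return the recorded fragments."""
    from pypdf import PdfReader  # imported lazily; requires pypdf to be installed

    fragments: List[Fragment] = []
    page = PdfReader(pdf_path).pages[page_index]
    page.extract_text(visitor_text=make_visitor(fragments))
    return fragments
```

Printing the result of `collect_fragments("some.pdf")` (the path is a placeholder) should show whether the subscript arrives as its own fragment with a lower `y` and a smaller `font_size`, which is the information any reordering would have to work from.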

@malinphy
Author

> PDF is an electronic printing format in which "glyphs" are printed at defined positions with a defined size. Subscripts are not always printed in the right order. You can try with visitors, but it may be tricky.

Thanks for the advice. Do you have any idea what should be changed, or which parameters should be tweaked?
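
One possible direction, offered only as a sketch and not as something pypdf does: the parameters most likely to matter are the position and the font size of each fragment. Assuming the fragments of a single line have already been gathered as `(text, x, y, font_size)` tuples, for example from a `visitor_text` callback, a heuristic could sort them left to right and tag small fragments that sit below or above the baseline. The function `rebuild_line`, the `size_ratio` threshold and the `_{...}` / `^{...}` output notation are all assumptions made for this example. It also assumes that y grows upward, as in default PDF user space.

```python
from typing import List, Tuple

# One fragment of a line: (text, x, y, font_size)
Frag = Tuple[str, float, float, float]


def rebuild_line(frags: List[Frag], size_ratio: float = 0.85) -> str:
    """Reorder one line's fragments by x and mark small shifted text."""
    if not frags:
        return ""
    # Take the largest font size as the body size, and the y of the first
    # fragment printed at that size as the baseline of the line.
    body_size = max(f[3] for f in frags)
    baseline = next(f[2] for f in frags if f[3] == body_size)

    parts = []
    for text, x, y, size in sorted(frags, key=lambda f: f[1]):
        is_small = size < size_ratio * body_size
        if is_small and y < baseline:
            parts.append("_{" + text + "}")   # lowered and smaller: subscript
        elif is_small and y > baseline:
            parts.append("^{" + text + "}")   # raised and smaller: superscript
        else:
            parts.append(text)
    return "".join(parts)


# Fragments in the order they might be emitted, with the subscript last.
water = [("H", 10.0, 100.0, 12.0), ("O", 24.0, 100.0, 12.0), ("2", 18.0, 97.0, 8.0)]
print(rebuild_line(water))  # H_{2}O
```

This only handles one line at a time and will misfire on PDFs where subscripts are drawn at body size or where fragments span several glyphs, so the threshold would need tuning per document.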

@pubpub-zz
Collaborator

no ideas...😞

@MartinThoma MartinThoma added the workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow label Jul 18, 2023
@MartinThoma MartinThoma changed the title from "subscripts" to "Extract subscripts / superscripts" Jul 18, 2023
@MartinThoma
Member

I'm closing this issue because I don't know how we could tackle this. It's a good question and a desirable result, but I don't see this happening in the next few years. I've added it to #1181 just in case somebody has an idea.
