Extract subscripts / superscripts #1976

malinphy · 2023-07-18T08:30:14Z

PyPDF loader is a great tool. However, I am having a problem extracting chemical formulas with subscripts. For example, H₂O is extracted as `\nHO \2`. Is there any way to fix this issue (maybe with visitor functions)? Thanks in advance.
Best regards

pubpub-zz · 2023-07-18T09:04:05Z

PDF is an electronic printing format where "glyphs" are printed at defined positions with a defined size. Subscripts are not always printed in the expected order. You can try with visitors, but it may be tricky.

malinphy · 2023-07-18T09:11:27Z

> PDF is an electronic printing format where "glyphs" are printed at defined positions with a defined size. Subscripts are not always printed in the expected order. You can try with visitors, but it may be tricky.

Thanks for the advice. Do you have any idea what should be changed, or which parameters should be tweaked?

pubpub-zz · 2023-07-18T09:21:38Z

no ideas...😞

MartinThoma · 2023-07-18T16:03:24Z

I'm closing this issue because I don't know how we could tackle this. It's a good question and a desirable result, but I don't see this happening in the next few years. I've added it to #1181 just in case somebody has an idea.

MartinThoma added the workflow-text-extraction label ("From a users perspective, text extraction is the affected feature/workflow") Jul 18, 2023

MartinThoma changed the title ~~subscripts~~ Extract subscripts / superscripts Jul 18, 2023

MartinThoma closed this as completed Jul 18, 2023
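
For readers landing here later, the visitor idea mentioned above could be sketched roughly as follows. This is not a pypdf feature or a maintainer-endorsed fix, only a minimal illustration under several assumptions: that the visitor receives each subscript as its own text chunk, that `tm[4]` and `tm[5]` hold the chunk's x and y position, and that subscripts use a smaller `font_size` than the body text. PDFs that shift glyphs with the text-rise operator, scale them through the matrix, or batch them with neighbouring text will not be handled. The helper names `collect` and `rebuild` are hypothetical.

```python
from collections import Counter

# Unicode replacements for the characters that commonly appear in scripts.
SUB = str.maketrans("0123456789+-", "₀₁₂₃₄₅₆₇₈₉₊₋")
SUP = str.maketrans("0123456789+-", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻")


def collect(fragments):
    """Return a visitor with pypdf's visitor_text signature.

    The visitor appends (text, x, y, font_size) tuples to `fragments`.
    Assumption: tm[4] and tm[5] give the position of the chunk.
    """
    def visitor(text, cm, tm, font_dict, font_size):
        if text.strip():  # skip whitespace-only chunks such as "\n"
            fragments.append((text.strip(), tm[4], tm[5], font_size))
    return visitor


def rebuild(fragments):
    """Reassemble ONE line of text from collected fragments.

    Fragments smaller than the body font are treated as subscripts if
    they sit below the body baseline and as superscripts if above it.
    """
    if not fragments:
        return ""
    # Assumption: the most common font size is the body size.
    body_size = Counter(size for _, _, _, size in fragments).most_common(1)[0][0]
    body_y = [y for _, _, y, size in fragments if size >= body_size]
    baseline = sum(body_y) / len(body_y)

    out = []
    # Sort by x so that out-of-order glyphs return to reading order.
    for text, x, y, size in sorted(fragments, key=lambda f: f[1]):
        if size < body_size and y < baseline:
            out.append(text.translate(SUB))
        elif size < body_size and y > baseline:
            out.append(text.translate(SUP))
        else:
            out.append(text)
    return "".join(out)


# Hand-made fragments imitating "H", "O", then a late, lowered "2".
demo = [("H", 10.0, 700.0, 12.0), ("O", 22.0, 700.0, 12.0), ("2", 18.0, 697.0, 8.0)]
print(rebuild(demo))
```

With pypdf, the collector would be wired up along the lines of `page.extract_text(visitor_text=collect(fragments))`, followed by `rebuild(fragments)`. Real pages span many lines, so the fragments would first need grouping by line, which this sketch leaves out.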
