ENH: Extract LaTeX characters #2016

pubpub-zz · 2023-07-25T22:38:44Z

closes #2009

note: code clean up removed duplicates from adobe_glyphs

pubpub-zz · 2023-07-25T22:44:46Z

@MartinThoma
I'm interested in your position about phi / phi1 being crossed

codecov · 2023-07-25T22:55:27Z

Codecov Report

Patch coverage: 92.85% and project coverage change: +0.03% 🎉

Comparison is base (890c93a) 94.03% compared to head (9a06598) 94.07%.
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2016      +/-   ##
==========================================
+ Coverage   94.03%   94.07%   +0.03%     
==========================================
  Files          33       33              
  Lines        7076     7104      +28     
  Branches     1413     1421       +8     
==========================================
+ Hits         6654     6683      +29     
  Misses        263      263              
+ Partials      159      158       -1

| Files Changed | Coverage Δ | |
|---|---|---|
| pypdf/_codecs/adobe_glyphs.py | `100.00% <ø> (ø)` | |
| pypdf/_cmap.py | `95.01% <92.85%> (-0.28%)` | ⬇️ |

... and 1 file with indirect coverage changes

pubpub-zz · 2023-07-26T09:03:25Z

Except for my comment above, this PR is all yours.

MartinThoma · 2023-07-26T20:16:18Z

This is amazing 😲 😍 Thank you so much 🤗

MartinThoma · 2023-07-26T20:21:13Z

I'm looking forward to the release on the weekend + an update of https://github.com/py-pdf/benchmarks/blob/main/benchmark.py 🎉

## What's new

### New Features (ENH)
- Accelerate image list keys generation (#2014)
- Use `cryptography` for encryption/decryption as a fallback for PyCryptodome (#2000)
- Extract LaTeX characters (#2016)
- ASCIIHexDecode.decode now returns bytes instead of str (#1994)

### Bug Fixes (BUG)
- Add RunLengthDecode filter (#2012)
- Process /Separation ColorSpace (#2007)
- Handle single element ColorSpace list (#2026)
- Process lookup decoded as TextStringObjects (#2008)

### Robustness (ROB)
- Cope with garbage collector during cloning (#1841)

### Maintenance (MAINT)
- Cleanup of annotations (#1745)

[Full Changelog](3.13.0...3.14.0)

MartinThoma · 2023-07-29T14:48:39Z

@pubpub-zz I've updated the benchmark:

The text extraction quality metric increased from 96% to 97%. I've also found a couple of places where the ground truth was wrong 🎉 We are now on par with Tika / PyMuPDF. However, the perceived quality is still slightly worse, as Tika / PyMuPDF typically deal with whitespace better.

I had a look at what would be necessary to lift the text extraction to the next level (from a user's perspective):

Local optimizations

Ligature replacement:

ﬁ should be fi
ﬂ should be fl
ﬀ should be ff

I know that this actually goes away from "raw" text extraction, but I think this is what most users want. Maybe we need to re-define what we want to achieve and potentially add flags / methods for common post-processing 🤔
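As a rough illustration of such a post-processing step, the sketch below expands the Latin ligatures listed above with an explicit lookup table. The function name `replace_ligatures` and the table are hypothetical and not part of pypdf; the `ffi` and `ffl` entries are added on the assumption that they should be treated the same way.

```python
# Hypothetical post-processing helper, not a pypdf API.
# Maps the Unicode Latin ligature code points to their letter sequences.
LIGATURES = {
    "\ufb00": "ff",   # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",   # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",   # LATIN SMALL LIGATURE FL
    "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI (assumed to be wanted as well)
    "\ufb04": "ffl",  # LATIN SMALL LIGATURE FFL (assumed to be wanted as well)
}
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def replace_ligatures(text: str) -> str:
    """Return text with each ligature code point expanded to plain letters."""
    return text.translate(_LIGATURE_TABLE)


if __name__ == "__main__":
    print(replace_ligatures("\ufb01nal \ufb02ow o\ufb00er"))
```

The helper could be applied to the string returned by `page.extract_text()`. `unicodedata.normalize("NFKC", text)` would expand the same ligatures, but it also rewrites other compatibility characters (for example `²` becomes `2`), so an explicit table keeps the change narrower.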

Composed characters:

¯x should be x̄
ˆx should be x̂
Chinese characters in arXiv 2201.00021: The name on the first page.
Removal of hyphens inserted solely to fit on the line: Here I'm uncertain. I think most people use the text extraction to do Natural Language Processing (NLP). For them, the hyphens are just noise. But some might need the layout mode to do post-processing on their own. Then hyphen removal might actually do harm.
Superscript / subscripts: Especially squares (x²) and cubes (x³) as well as zero-subscripts (x₀) and one-subscripts (x₁)
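Two of the items above (composed characters and line-end hyphens) could be sketched as optional post-processing functions. This is only an illustration under stated assumptions: the names `compose_accents` and `join_hyphenated_lines` are hypothetical, the accents are assumed to be extracted as U+00AF and U+02C6 directly before the base letter, and the hyphen rule is a naive heuristic.

```python
import re
import unicodedata

# Assumption: the spacing accents are extracted as these code points,
# placed before the letter they belong to.
SPACING_TO_COMBINING = {
    "\u00af": "\u0304",  # MACRON -> COMBINING MACRON
    "\u02c6": "\u0302",  # MODIFIER LETTER CIRCUMFLEX ACCENT -> COMBINING CIRCUMFLEX ACCENT
}
# A spacing accent followed by a letter (word character that is no digit or underscore).
_ACCENT_BEFORE_LETTER = re.compile(r"([\u00af\u02c6])([^\W\d_])")
# A hyphen at a line end with lowercase letters on both sides of the break.
_LINE_END_HYPHEN = re.compile(r"(?<=[a-z])-\n(?=[a-z])")


def compose_accents(text: str) -> str:
    """Move a leading spacing accent behind its letter as a combining mark."""
    def swap(match: re.Match) -> str:
        return match.group(2) + SPACING_TO_COMBINING[match.group(1)]

    # NFC merges letter + mark into one code point where one exists.
    return unicodedata.normalize("NFC", _ACCENT_BEFORE_LETTER.sub(swap, text))


def join_hyphenated_lines(text: str) -> str:
    """Drop hyphen + newline between two lowercase letters (naive heuristic)."""
    return _LINE_END_HYPHEN.sub("", text)


if __name__ == "__main__":
    print(compose_accents("\u00afx and \u02c6x"))
    print(join_hyphenated_lines("text extrac-\ntion"))
```

The hyphen heuristic cannot tell a layout hyphen from a real one, so a compound such as "well-known" broken after the hyphen would be joined incorrectly. That is one reason such a step would probably need to be opt-in.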

Whitespace

Most important are inner-word spaces that often occur after the first letter of a word. See Random whitespaces are inserted when using page.extract_text() #1507
Newlines, especially for arXiv 2201.00029
Spaces around math-mode stuff
Spaces after dots: New line character missing and URLs adding periods and space #1974

Layout-mode

Indentation of code blocks currently completely breaks.
Multiple newlines to represent paragraph / section boundaries

Advanced text extraction normalization

This will likely never go into pypdf as it requires a level of document understanding that is likely only achievable with machine learning. Still interesting to think about it:

Detection of tables + automatic application of layout mode for them, while not using layout mode for e.g. two-column pages.
Removal of footers (page numbers)
Removal of headers
Removal of spaces used for thousands separation
Detection of text that belongs to an image / diagram
Re-structuring of text that is broken up by an image to ensure a smooth text flow

ENH : extract latex characters

bdfaa49

closes py-pdf#2009 note: code clean up removed duplicates from adobe_glyphs

pubpub-zz mentioned this pull request Jul 25, 2023

Fix copying of the reduced Planck constant mozilla/pdf.js#16735

Merged

typo in comment

9a06598

MartinThoma changed the title ~~ENH : extract latex characters~~ ENH: Extract LaTeX characters Jul 26, 2023

MartinThoma merged commit a327df6 into py-pdf:main Jul 26, 2023

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this pull request Jul 30, 2023

ENH : Process /uniHHHH for text_extract

8df2dfa

/uniHHHH glyphs seems to be generated in laTeX but is ok for other characters addressed partially in py-pdf#2016

pubpub-zz mentioned this pull request Jul 30, 2023

ENH: Process /uniHHHH for text_extract #2043

Merged

MartinThoma pushed a commit that referenced this pull request Jul 30, 2023

ENH: Process /uniHHHH for text_extract (#2043)

534c7b4

`/uniHHHH` (H is a hexadecimal) glyphs seems to be generated in LaTeX but is ok for other characters This was mentioned in #2016 / #2038
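For readers unfamiliar with the `/uniHHHH` glyph names in the commit message above, the sketch below shows how such a name can be turned into text: the hexadecimal digits after `uni` are read in groups of four, each group giving one code point. This is not the code from #2043; the function name `uni_glyph_to_text` is hypothetical, and the rule of rejecting surrogate values is an assumption based on the usual glyph naming convention.

```python
import re
from typing import Optional

# "uni" followed by one or more groups of four uppercase hex digits,
# with an optional leading slash as written in PDF name objects.
_UNI_NAME = re.compile(r"^/?uni((?:[0-9A-F]{4})+)$")


def uni_glyph_to_text(glyph_name: str) -> Optional[str]:
    """Decode a /uniHHHH glyph name, or return None if it does not match."""
    match = _UNI_NAME.match(glyph_name)
    if match is None:
        return None
    digits = match.group(1)
    code_points = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
    # Assumption: surrogate values are not valid in such names.
    if any(0xD800 <= cp <= 0xDFFF for cp in code_points):
        return None
    return "".join(chr(cp) for cp in code_points)


if __name__ == "__main__":
    print(uni_glyph_to_text("/uni0041"))
    print(uni_glyph_to_text("/uni210F"))
    print(uni_glyph_to_text("/phi"))
```

Names that do not follow the pattern, such as `/phi`, return `None` here and would still need a lookup in a glyph list like `adobe_glyphs`.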

pubpub-zz deleted the Type1asUnicode branch September 2, 2023 09:45
