ENH: Extract LaTeX characters #2016

Merged 2 commits on Jul 26, 2023

Conversation

pubpub-zz
Collaborator

closes #2009

Note: the code cleanup removed duplicate entries from adobe_glyphs.

@pubpub-zz
Collaborator Author

@MartinThoma
I'm interested in your opinion on phi / phi1 being crossed.

@codecov

codecov bot commented Jul 25, 2023

Codecov Report

Patch coverage: 92.85% and project coverage change: +0.03% 🎉

Comparison is base (890c93a) 94.03% compared to head (9a06598) 94.07%.
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2016      +/-   ##
==========================================
+ Coverage   94.03%   94.07%   +0.03%     
==========================================
  Files          33       33              
  Lines        7076     7104      +28     
  Branches     1413     1421       +8     
==========================================
+ Hits         6654     6683      +29     
  Misses        263      263              
+ Partials      159      158       -1     
| Files Changed | Coverage Δ |
|---|---|
| pypdf/_codecs/adobe_glyphs.py | 100.00% <ø> (ø) |
| pypdf/_cmap.py | 95.01% <92.85%> (-0.28%) ⬇️ |

... and 1 file with indirect coverage changes


@pubpub-zz
Collaborator Author

Except for my comment above, this PR is all yours.

@MartinThoma MartinThoma changed the title from "ENH : extract latex characters" to "ENH: Extract LaTeX characters" on Jul 26, 2023
@MartinThoma
Member

This is amazing 😲 😍 Thank you so much 🤗

@MartinThoma MartinThoma merged commit a327df6 into py-pdf:main on Jul 26, 2023
@MartinThoma
Member

I'm looking forward to the release on the weekend + an update of https://github.com/py-pdf/benchmarks/blob/main/benchmark.py 🎉

MartinThoma added a commit that referenced this pull request Jul 29, 2023
## What's new

### New Features (ENH)
-  Accelerate image list keys generation (#2014)
-  Use `cryptography` for encryption/decryption as a fallback for PyCryptodome (#2000)
-  Extract LaTeX characters (#2016)
-  ASCIIHexDecode.decode now returns bytes instead of str (#1994)

### Bug Fixes (BUG)
-  Add RunLengthDecode filter (#2012)
-  Process /Separation ColorSpace (#2007)
-  Handle single element ColorSpace list (#2026)
-  Process lookup decoded as TextStringObjects (#2008)

### Robustness (ROB)
-  Cope with garbage collector during cloning (#1841)

### Maintenance (MAINT)
-  Cleanup of annotations (#1745)

[Full Changelog](3.13.0...3.14.0)
@MartinThoma
Member

@pubpub-zz I've updated the benchmark:

The text extraction quality metric increased from 96% to 97%. I've also found a couple of places where the ground truth was wrong 🎉 We are now on par with Tika / PyMuPDF. However, the perceived quality is still slightly worse, as Tika / PyMuPDF typically handle whitespace better.

I had a look at what would be necessary to lift the text extraction to the next level (from a user's perspective):

Local optimizations

Ligature replacement:

  • ﬁ (U+FB01) should be fi
  • ﬂ (U+FB02) should be fl
  • ﬀ (U+FB00) should be ff

I know that this actually goes away from "raw" text extraction, but I think this is what most users want. Maybe we need to re-define what we want to achieve and potentially add flags / methods for common post-processing 🤔
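Such a post-processing step could look roughly like the sketch below. The function name `replace_ligatures` and the mapping table are assumptions for illustration only; this is not an existing pypdf function or flag.

```python
# Hypothetical post-processing helper; not part of pypdf's API.
# Maps Unicode ligature code points to their plain-letter spellings.
LIGATURES = {
    "\ufb00": "ff",   # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",   # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",   # LATIN SMALL LIGATURE FL
    "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\ufb04": "ffl",  # LATIN SMALL LIGATURE FFL
}
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def replace_ligatures(text: str) -> str:
    """Return `text` with the ligatures listed above expanded to plain letters."""
    return text.translate(_LIGATURE_TABLE)
```

It would be applied to the string returned by text extraction, so the "raw" output stays available to anyone who wants it. `unicodedata.normalize("NFKC", text)` also expands these ligatures, but it rewrites many other characters too (for example superscript digits), which is why the sketch uses an explicit table.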

Composed characters:

  • ¯x should be x̄

  • ˆx should be x̂

  • Chinese characters in arXiv 2201.00021: The name on the first page.

  • Removal of hyphens inserted solely to fit on the line: Here I'm uncertain. I think most people use text extraction for Natural Language Processing (NLP). For them, the hyphens are just noise. But some might need the layout mode to do their own post-processing. In that case, hyphen removal might actually do harm.

  • Superscript / subscripts: Especially squares (x²) and cubes (x³) as well as zero-subscripts (x₀) and one-subscripts (x₁)
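Two of the items above (composed characters and line-break hyphens) can be sketched as plain string post-processing. The helper names `compose_accents` and `dehyphenate` are hypothetical, the accent mapping covers only the macron and circumflex cases, and none of this is pypdf's implementation.

```python
import re
import unicodedata

# Assumed mapping for illustration: spacing accent -> combining accent.
SPACING_TO_COMBINING = {
    "\u00af": "\u0304",  # MACRON -> COMBINING MACRON
    "\u02c6": "\u0302",  # MODIFIER LETTER CIRCUMFLEX ACCENT -> COMBINING CIRCUMFLEX ACCENT
}
# A spacing accent immediately followed by an ASCII letter.
_ACCENT_THEN_LETTER = re.compile("([\u00af\u02c6])([A-Za-z])")
# A hyphen directly before a newline, with word characters on both sides.
_LINE_BREAK_HYPHEN = re.compile(r"(\w)-\n(\w)")


def compose_accents(text: str) -> str:
    """Move a leading spacing accent behind its letter as a combining mark."""
    swapped = _ACCENT_THEN_LETTER.sub(
        lambda m: m.group(2) + SPACING_TO_COMBINING[m.group(1)], text
    )
    # NFC merges letter + combining mark where a precomposed character exists.
    return unicodedata.normalize("NFC", swapped)


def dehyphenate(text: str) -> str:
    """Join words that were split by a hyphen at the end of a line."""
    return _LINE_BREAK_HYPHEN.sub(r"\1\2", text)
```

Note that `dehyphenate` cannot tell a layout hyphen from a real one, so a compound such as "well-" / "known" split across lines would be joined into "wellknown". That is the risk described in the hyphen item above, and a reason to keep such a step optional.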

Whitespace

Layout-mode

  • Indentation of code blocks currently completely breaks.
  • Multiple newlines to represent paragraph / section boundaries

Advanced text extraction normalization

This will likely never go into pypdf as it requires a level of document understanding that is likely only achievable with machine learning. Still interesting to think about it:

  • Detection of tables + automatic application of layout mode for them, while not using layout mode for e.g. two-column pages.
  • Removal of footers (page numbers)
  • Removal of headers
  • Removal of spaces used for thousands separation
  • Detection of text that belongs to an image / diagram
  • Re-structuring of text that is broken up by an image to ensure a smooth text flow
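Most of these need real document understanding, but the thousands-separator item can at least be approximated with a naive rule. The sketch below assumes a hypothetical helper `join_thousands` and is not part of pypdf.

```python
import re

# A space (regular, no-break, or narrow no-break) that sits between a digit
# and a group of exactly three digits.
_THOUSANDS_GAP = re.compile(r"(?<=\d)[ \u00a0\u202f](?=\d{3}(?!\d))")


def join_thousands(text: str) -> str:
    """Remove spaces that look like thousands separators inside numbers."""
    return _THOUSANDS_GAP.sub("", text)
```

The rule has obvious false positives: two unrelated numbers such as "2023 100" would be merged. That illustrates why this kind of normalization is hard to do reliably without more context.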

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this pull request Jul 30, 2023
/uniHHHH glyphs seem to be generated in LaTeX but are OK for other characters
addressed partially in  py-pdf#2016
MartinThoma pushed a commit that referenced this pull request Jul 30, 2023
`/uniHHHH` (H is a hexadecimal digit) glyphs seem to be generated in LaTeX but are OK for other characters

This was mentioned in #2016 / #2038
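For illustration, a glyph name of the form `/uniHHHH` can be turned into the character it names by reading the four hexadecimal digits as a code point. The function `uni_glyph_to_text` below is a hypothetical sketch under that assumption, not the code from the referenced commit.

```python
import re
from typing import Optional

# "/uni" (leading slash optional) followed by exactly four hexadecimal digits.
_UNI_GLYPH = re.compile(r"^/?uni([0-9A-Fa-f]{4})$")


def uni_glyph_to_text(name: str) -> Optional[str]:
    """Return the character for a /uniHHHH glyph name, or None if it does not match."""
    match = _UNI_GLYPH.match(name)
    if match is None:
        return None
    return chr(int(match.group(1), 16))
```

Names that do not follow the pattern, such as `/phi`, return `None` here and would still need a lookup in a glyph list like adobe_glyphs.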
@pubpub-zz pubpub-zz deleted the Type1asUnicode branch September 2, 2023 09:45
Development

Successfully merging this pull request may close these issues.

Improve math character extraction