ENH: Extract LaTeX characters #2016
Conversation
Closes py-pdf#2009. Note: code cleanup removed duplicates from adobe_glyphs.
@MartinThoma
Codecov Report

Patch coverage:

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #2016      +/-   ##
==========================================
+ Coverage   94.03%   94.07%   +0.03%
==========================================
  Files          33       33
  Lines        7076     7104      +28
  Branches     1413     1421       +8
==========================================
+ Hits         6654     6683      +29
  Misses        263      263
+ Partials      159      158       -1
==========================================
```

☔ View full report in Codecov by Sentry.
Except for my comment above, this PR is all yours.
This is amazing 😲 😍 Thank you so much 🤗
I'm looking forward to the release on the weekend + an update of https://github.com/py-pdf/benchmarks/blob/main/benchmark.py 🎉
## What's new

### New Features (ENH)
- Accelerate image list keys generation (#2014)
- Use `cryptography` for encryption/decryption as a fallback for PyCryptodome (#2000)
- Extract LaTeX characters (#2016)
- ASCIIHexDecode.decode now returns bytes instead of str (#1994)

### Bug Fixes (BUG)
- Add RunLengthDecode filter (#2012)
- Process /Separation ColorSpace (#2007)
- Handle single element ColorSpace list (#2026)
- Process lookup decoded as TextStringObjects (#2008)

### Robustness (ROB)
- Cope with garbage collector during cloning (#1841)

### Maintenance (MAINT)
- Cleanup of annotations (#1745)

[Full Changelog](3.13.0...3.14.0)
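As a quick usage illustration (not part of the original conversation), here is a minimal sketch of extracting text with pypdf's public `PdfReader` API, the code path this release improves; `example.pdf` is a placeholder file name:

```python
from pypdf import PdfReader

# Open a PDF and concatenate the extracted text of every page.
# With #2016, /uniHHHH glyph names (common in LaTeX output) are
# partially resolved to their Unicode characters during extraction.
reader = PdfReader("example.pdf")  # placeholder path
text = "\n".join(page.extract_text() for page in reader.pages)
print(text)
```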
@pubpub-zz I've updated the benchmark: the text extraction quality metric increased from 96% to 97%. I've also found a couple of places where the ground truth was wrong 🎉 We are now on par with Tika / PyMuPDF. However, the perceived quality is still slightly worse, as Tika / PyMuPDF typically handle whitespace better.

I had a look at what would be necessary to lift text extraction to the next level (from a user's perspective):

### Local optimizations

- Ligature replacement (see the sketch after this list):
  I know that this actually moves away from "raw" text extraction, but I think it is what most users want. Maybe we need to re-define what we want to achieve and potentially add flags / methods for common post-processing 🤔
- Composed characters (also covered in the sketch below):
- Whitespace
- Layout-mode
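A minimal sketch of the first two items, assuming plain Python rather than any pypdf API; the ligature table is a small illustrative subset:

```python
import unicodedata

# Small illustrative subset of typographic ligatures (not exhaustive).
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

def postprocess(text: str) -> str:
    """Expand ligatures, then recompose combining characters (NFC)."""
    for ligature, expansion in LIGATURES.items():
        text = text.replace(ligature, expansion)
    # "e" + U+0301 (combining acute) becomes the single code point "é".
    return unicodedata.normalize("NFC", text)

assert postprocess("e\u0301\ufb01n") == "éfin"
```

Unicode NFKC normalization would also expand these ligatures, but an explicit table keeps the replacement set predictable and avoids NFKC's other compatibility mappings.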
### Advanced text extraction normalization

This will likely never go into pypdf, as it requires a level of document understanding that is probably only achievable with machine learning. Still, it is interesting to think about.
/uniHHHH glyphs seem to be generated by LaTeX, but extraction is fine for other characters; partially addressed in py-pdf#2016.
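For context, a minimal sketch of how such a glyph name can be mapped to its Unicode character, assuming the `uniXXXX` naming convention from the Adobe Glyph List specification; the function name is illustrative, not pypdf's API:

```python
import re
from typing import Optional

def glyph_name_to_char(name: str) -> Optional[str]:
    """Map an AGL-style 'uniHHHH' glyph name to its Unicode character.

    Names such as 'uni0041' encode the code point as four hex digits,
    so 'uni0041' maps to 'A'. Returns None for any other glyph name.
    """
    match = re.fullmatch(r"uni([0-9A-Fa-f]{4})", name)
    if match:
        return chr(int(match.group(1), 16))
    return None

assert glyph_name_to_char("uni0041") == "A"
assert glyph_name_to_char("fi") is None  # named glyph, not uniHHHH form
```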