ENH: Add orientation param for text_extraction (# 1071) #1175
add new capability to filter text extraction on orientation
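To make the idea concrete, here is a minimal, standard-library-only sketch of what filtering text by orientation can look like. It is not the code from this PR. The names `orientation_of` and `filter_by_orientation` and the `(text, matrix)` fragment format are hypothetical. It assumes each text fragment carries a PDF-style text matrix `[a, b, c, d, e, f]` and that the rotation can be read from the signs of `d` and `b`.

```python
from typing import Iterable, Sequence, Tuple, Union

# Hypothetical fragment format: (text, [a, b, c, d, e, f]) where the list
# is a PDF-style text matrix. This is an assumption made for illustration.
Fragment = Tuple[str, Sequence[float]]


def orientation_of(matrix: Sequence[float], eps: float = 1e-6) -> int:
    """Classify a text matrix as 0, 90, 180 or 270 degrees.

    For a pure rotation by theta, a = cos, b = sin, c = -sin, d = cos,
    so the sign of d separates 0 from 180 and the sign of b separates
    90 from 270.
    """
    b, d = matrix[1], matrix[3]
    if d > eps:
        return 0
    if d < -eps:
        return 180
    if b > 0:
        return 90
    return 270


def filter_by_orientation(
    fragments: Iterable[Fragment],
    orientations: Union[int, Tuple[int, ...]] = (0, 90, 180, 270),
) -> str:
    """Join only the fragments whose orientation is in `orientations`."""
    if isinstance(orientations, int):
        orientations = (orientations,)
    return "".join(
        text for text, matrix in fragments if orientation_of(matrix) in orientations
    )


fragments = [
    ("Hello ", [1, 0, 0, 1, 0, 0]),      # upright text
    ("side", [0, 1, -1, 0, 0, 0]),       # rotated 90 degrees
    ("flip", [-1, 0, 0, -1, 0, 0]),      # rotated 180 degrees
    ("down", [0, -1, 1, 0, 0, 0]),       # rotated 270 degrees
]

upright_only = filter_by_orientation(fragments, (0,))
sideways_only = filter_by_orientation(fragments, 90)
```

Accepting either a single integer or a tuple keeps the common "upright text only" case short while still allowing any combination of the four orientations.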
Codecov Report
@@            Coverage Diff             @@
##             main    #1175      +/-   ##
==========================================
+ Coverage   92.08%   92.11%   +0.02%
==========================================
  Files          24       24
  Lines        4866     4897      +31
  Branches      996     1011      +15
==========================================
+ Hits         4481     4511      +30
  Misses        242      242
- Partials      143      144       +1
Very nice! It looks good to me. I will merge it tomorrow if the text extraction benchmark looks fine as well, so it should get into the release on Sunday :-)
Interestingly, it seems to have killed a lot of newlines. I think I need to design a new benchmark which measures how well newlines are captured. At the moment, newlines are completely ignored when calculating the score.
However, getting the spaces in and between words right is far more important, and that is where the improvement was 👍
New Features (ENH):
- Add ability to add hex encoded colors to outline items (#1186)
- Add support for pathlib.Path in PdfMerger.merge (#1190)
- Add link annotation (#1189)
- Add capability to filter text extraction by orientation (#1175)

Bug Fixes (BUG):
- Named Dest in PDF1.1 (#1174)
- Incomplete Graphic State save/restore (#1172)

Documentation (DOC):
- Update changelog url in package metadata (#1180)
- Table extraction (#1179)
- Mention pyHanko for signing PDF documents (#1178)
- We now have CMAP support (#1177)

Maintenance (MAINT):
- Consistent usage of warnings / log messages (#1164)
- Consistent terminology for outline items (#1156)

Code Style (STY):
- Apply pre-commit (#1188)

Full Changelog: 2.8.1...2.9.0
add new capability to filter text extraction on orientation
Deprecations: PageObject.extract_text no longer uses the Tj_sep and TJ_sep parameters. cf. #1071