MAINT: Unnecessary character mapping process #2888
This reverts commit 5400f5a.
BUG: Missing spaces in extract_text() method (py-pdf#1328)
add test
…nt size comparison to ratio
Co-authored-by: Stefan <[email protected]>
…he assertion process
Co-authored-by: Stefan <[email protected]>
Co-authored-by: Stefan <[email protected]>
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #2888 +/- ##
=======================================
Coverage 96.35% 96.36%
=======================================
Files 52 52
Lines 8735 8738 +3
Branches 1723 1727 +4
=======================================
+ Hits 8417 8420 +3
Misses 186 186
Partials 132 132
I have fixed the parts that were flagged by the code check errors.
Thanks. I could not see any apparent issue with this and all the tests still pass without having to change anything. Thus I am going to merge this for now.
Your previous PR was from your main branch, while this PR used a separate branch, but all branches usually originate from main. Thus your previous commits were included here in full. I recommend that you reset the changes on your main branch and sync it with upstream. If this is too complex, consider deleting and re-creating your fork, provided no work would be lost by doing so, and always use dedicated branches for further PRs.
## What's new

### New Features (ENH)
- Add `layout_mode_font_height_weight` argument to `PageObject.extract_text()` (#2920) by @hpierre001

### Bug Fixes (BUG)
- Fix font specificier for FreeText annotation (#2893) by @ssjkamei
- Line breaks are not generated due to incorrect calculation of text leading (#2890) by @ssjkamei
- Improve handling of spaces in text extraction (#2882) by @ssjkamei

### Robustness (ROB)
- Soft failure for flate encode image mode 1 with wrong LUT size (#2900) by @stefan6419846

### Documentation (DOC)
- Use latest package versions (#2907) by @stefan6419846
- Correct example of reading FileAttachment annotation (#2906) by @j-t-1

### Developer Experience (DEV)
- Update pinned requirements (#2918) by @stefan6419846
- Make make_release.py compatible with Windows environment (#2894) by @pubpub-zz

### Maintenance (MAINT)
- Remove references to outdated Python versions (#2919) by @stefan6419846
- Generalize the method of obtaining space_code (#2891) by @ssjkamei
- Unnecessary character mapping process (#2888) by @ssjkamei
- New LZW decoding implementation (#2887) by @MartinThoma

### Testing (TST)
- Add LzwCodec for encoding (#2883) by @MartinThoma

### Code Style (STY)
- Capitalize error messages (#2903) by @j-t-1
- Modify error messages in PdfWriter (#2902) by @j-t-1

[Full Changelog](5.0.1...5.1.0)
This fixes a problem introduced by the changes in #2882.
The string length of each character was checked after conversion by the cmap, but a single character code can map to a string longer than one character, so the length cannot be measured accurately after conversion.
This matters, for example, when deciding whether to measure the distance from the ligature or from the base character corresponding to the ligature when fixing #1351.
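To illustrate the length problem, here is a minimal sketch. The `cmap` dictionary and both helper functions are hypothetical and are not pypdf internals; they only assume that a character map can translate one character code into more than one Unicode character, as happens with a ligature.

```python
# Hypothetical character map: one character code -> Unicode text.
# These names are illustrative only and do not mirror pypdf's internals.
cmap = {
    "\x01": "A",
    "\x02": "fi",  # a ligature glyph that expands to two characters
    "\x03": " ",
}


def count_before_mapping(raw: str) -> int:
    """Count character codes in the raw string (one per glyph shown)."""
    return len(raw)


def count_after_mapping(raw: str) -> int:
    """Count characters after applying the map (ligatures inflate this)."""
    return len("".join(cmap.get(code, code) for code in raw))


raw = "\x01\x02\x03"
print(count_before_mapping(raw))  # 3 character codes
print(count_after_mapping(raw))   # 4 characters after mapping
```

Under these assumptions, any check that relies on one character per glyph has to run on the unmapped codes, since the mapped text no longer corresponds one-to-one with glyphs.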
The change in handle_tj is needed because the function otherwise fails Ruff's check:
Error: PLR0915 Too many statements (nnn > 176)
The following code is only used to get the character code for a space.
However, I think it would be better to split out the part that obtains the character code.
Style changes will be considered in a separate PR.