MAINT: Text extraction improvements #1126
Credits to pubpub-zz, see #1118 (comment) Co-authored-by: pubpub-zz <[email protected]>
Codecov Report
    @@           Coverage Diff           @@
    ##             main    #1126   +/-   ##
    =======================================
      Coverage   92.02%   92.02%
    =======================================
      Files          24       24
      Lines        4667     4667
      Branches      964      964
    =======================================
      Hits         4295     4295
      Misses        227      227
      Partials      145      145
c08fa8f to 7740a6e
Note that the modification to
But see also my comment over on #1118 about why I think this approach is less correct than the approach that you've already merged.
I've opened py-pdf/sample-files#13 to put
With

    elif process_char:
        lst = [x for x in l.split(b" ") if x]
        map_dict[-1] = len(lst[0]) // 2
        if len(lst) == 1:  # some case where the 2nd param is empty (seems not IAW pdfspec)
            map_dict[
                unhexlify(lst[0]).decode(
                    "charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
                )
            ] = ""
        else:
            while len(lst) > 0:
                map_dict[
                    unhexlify(lst[0]).decode(
                        "charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
                    )
                ] = unhexlify(lst[1]).decode(
                    "utf-16-be", "surrogatepass"
                )  # join is here as some cases where the code was split
                int_entry.append(int(lst[0], 16))
                lst = lst[2:]

I get
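As a standalone illustration of the branch quoted above (not the code in this PR), here is a minimal sketch of how a single `beginbfchar` line could be turned into a mapping. It assumes the line has already been reduced to space-separated hex tokens with the angle brackets stripped. `parse_bfchar_line` is a hypothetical helper name, and unlike the snippet above it silently ignores a trailing unpaired token instead of indexing past the end of the list.

```python
from binascii import unhexlify


def parse_bfchar_line(line: bytes) -> dict:
    """Hypothetical helper: map source codes to Unicode strings for one bfchar line."""
    tokens = [t for t in line.split(b" ") if t]
    mapping = {}
    if not tokens:
        return mapping
    # Source codes are decoded as 1 byte (2 hex digits) or as UTF-16-BE otherwise.
    src_encoding = "charmap" if len(tokens[0]) // 2 == 1 else "utf-16-be"
    if len(tokens) == 1:
        # Destination string is missing: map the source code to an empty string.
        src = unhexlify(tokens[0]).decode(src_encoding, "surrogatepass")
        mapping[src] = ""
        return mapping
    # Consume (source, destination) pairs; an unpaired trailing token is skipped.
    for i in range(0, len(tokens) - 1, 2):
        src = unhexlify(tokens[i]).decode(src_encoding, "surrogatepass")
        mapping[src] = unhexlify(tokens[i + 1]).decode("utf-16-be", "surrogatepass")
    return mapping
```

With this sketch, `parse_bfchar_line(b"0003")` yields `{"\x03": ""}`, which is the empty-dstString case from #1118, and `parse_bfchar_line(b"0003 0020 0004 0041")` yields `{"\x03": " ", "\x04": "A"}`.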
New Features (ENH):
- Add color and font_format to PdfReader.outlines[i] (#1104)
- Extract Text Enhancement (whitespaces) (#1084)

Bug Fixes (BUG):
- Use `build_destination` for named destination outlines (#1128)
- Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar (#1118)
- Prevent deduplication of PageObject (#1105)
- None-check in DictionaryObject.read_from_stream (#1113)
- Avoid IndexError in _cmap.parse_to_unicode (#1110)

Documentation (DOC):
- Explanation for git submodule
- Watermark and stamp (#1095)

Maintenance (MAINT):
- Text extraction improvements (#1126)
- Destination.color returns ArrayObject instead of tuple as fallback (#1119)
- Use add_bookmark_destination in add_bookmark (#1100)
- Use add_bookmark_destination in add_bookmark_dict (#1099)

Testing (TST):
- Remove xfail from test_outline_title_issue_1121
- Add test for arab text (#1127)
- Add xfail for decryption fail (#1125)
- Add xfail test for IndexError when extracting text (#1124)
- Add MCVE showing outline title issue (#1123)

Code Style (STY):
- Apply black and isort
- Use IntFlag for permissions_flag / update_page_form_field_values (#1094)
- Simplify code (#1101)

Full Changelog: 2.5.0...2.6.0