MAINT: Text extraction improvements #1126

Merged
merged 4 commits into from
Jul 17, 2022

Conversation

MartinThoma
Member

Credits to pubpub-zz, see
#1118 (comment)

Co-authored-by: pubpub-zz [email protected]

@codecov

codecov bot commented Jul 17, 2022

Codecov Report

Merging #1126 (40925ed) into main (0b693e1) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main    #1126   +/-   ##
=======================================
  Coverage   92.02%   92.02%           
=======================================
  Files          24       24           
  Lines        4667     4667           
  Branches      964      964           
=======================================
  Hits         4295     4295           
  Misses        227      227           
  Partials      145      145           
Impacted Files Coverage Δ
PyPDF2/_page.py 92.60% <ø> (ø)

@dkg
Contributor

dkg commented Jul 17, 2022

Note that the modification to `parse_to_unicode` here attempts to clean up part of ae0ff49, but doesn't seem to account for the earlier modification in that commit, where the null dstString was mapped to `.`. If you are going to go with this approach, you should also stop mapping the null dstString to `.` (that is, revert the first hunk of ae0ff49).

But see also my comment over on #1118 about why i think this approach is less correct than the approach that you've already merged.
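
For illustration only, here is a minimal sketch of parsing `beginbfchar` entries without any placeholder, so that a null dstString (`<>`) decodes directly to an empty string. The helper `parse_bfchar_line` is hypothetical and is not PyPDF2's code; it assumes the angle brackets are still present on each token.

```python
from binascii import unhexlify


def parse_bfchar_line(line: bytes) -> dict:
    """Hypothetical helper: parse one line of srcCode/dstString pairs.

    Assumes tokens keep their angle brackets, e.g. b"<0003> <0020>"
    or b"<0003> <>" for a null dstString.
    """
    tokens = line.split()
    if len(tokens) % 2:
        raise ValueError("expected srcCode/dstString pairs")
    mapping = {}
    for src, dst in zip(tokens[0::2], tokens[1::2]):
        src_hex = src.strip(b"<>")
        dst_hex = dst.strip(b"<>")  # b"<>" becomes b"", i.e. zero hex digits
        # One-byte source codes are decoded byte-for-byte, longer ones as UTF-16-BE.
        codec = "charmap" if len(src_hex) == 2 else "utf-16-be"
        key = unhexlify(src_hex).decode(codec, "surrogatepass")
        # unhexlify(b"") is b"", which decodes to "", so no placeholder is needed.
        mapping[key] = unhexlify(dst_hex).decode("utf-16-be", "surrogatepass")
    return mapping


normal = parse_bfchar_line(b"<0003> <0020>")
empty = parse_bfchar_line(b"<0003> <>")
```

Because an empty hex string is a valid input to `unhexlify`, the null dstString needs no special case in this sketch.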

@dkg
Contributor

dkg commented Jul 17, 2022

I've opened py-pdf/sample-files#13 to put habibi.pdf in the sample-files repo. i recommend including a test for it before merging this.

@MartinThoma
Member Author

With

        elif process_char:
            lst = [x for x in l.split(b" ") if x]
            map_dict[-1] = len(lst[0]) // 2
            if len(lst) == 1:       # some case where the 2nd param is empty (seems not IAW pdfspec)
                map_dict[
                    unhexlify(lst[0]).decode(
                        "charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
                    )
                ] = ""
            else:
                while len(lst) > 0:
                    map_dict[
                        unhexlify(lst[0]).decode(
                            "charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
                        )
                    ] = unhexlify(lst[1]).decode(
                        "utf-16-be", "surrogatepass"
                    )  # join is here as some cases where the code was split
                    int_entry.append(int(lst[0], 16))
                    lst = lst[2:]

I get

>                       ] = unhexlify(lst[1]).decode(
                            "utf-16-be", "surrogatepass"
                        )  # join is here as some cases where the code was split
E                       binascii.Error: Odd-length string

@MartinThoma MartinThoma merged commit e24b0a0 into main Jul 17, 2022
@MartinThoma MartinThoma deleted the text-extraction-impr branch July 17, 2022 18:53
MartinThoma added a commit that referenced this pull request Jul 17, 2022
New Features (ENH):
-  Add color and font_format to PdfReader.outlines[i] (#1104)
-  Extract Text Enhancement (whitespaces) (#1084)

Bug Fixes (BUG):
-  Use `build_destination` for named destination outlines (#1128)
-  Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar (#1118)
-  Prevent deduplication of PageObject (#1105)
-  None-check in DictionaryObject.read_from_stream (#1113)
-  Avoid IndexError in _cmap.parse_to_unicode (#1110)

Documentation (DOC):
-  Explanation for git submodule
-  Watermark and stamp (#1095)

Maintenance (MAINT):
-  Text extraction improvements (#1126)
-  Destination.color returns ArrayObject instead of tuple as fallback (#1119)
-  Use add_bookmark_destination in add_bookmark (#1100)
-  Use add_bookmark_destination in add_bookmark_dict (#1099)

Testing (TST):
-  Remove xfail from test_outline_title_issue_1121
-  Add test for arab text (#1127)
-  Add xfail for decryption fail (#1125)
-  Add xfail test for IndexError when extracting text (#1124)
-  Add MCVE showing outline title issue (#1123)

Code Style (STY):
-  Apply black and isort
-  Use IntFlag for permissions_flag / update_page_form_field_values (#1094)
-  Simplify code (#1101)

Full Changelog: 2.5.0...2.6.0