Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improved ExtractText(3) #969

Merged
merged 45 commits into from
Jun 13, 2022
Merged

improved ExtractText(3) #969

merged 45 commits into from
Jun 13, 2022

Conversation

pubpub-zz
Copy link
Collaborator

@pubpub-zz pubpub-zz commented Jun 10, 2022

New corrections for extract_text()
fixes extraction in cmap
#953
#431
#242
#591 /#954 should be good but doubts on arabic

PyPDF2/_cmap.py Outdated Show resolved Hide resolved
@MartinThoma
Copy link
Member

There are two minor Flake8 issues:

./tests/test_utils.py:7:1: F401 'PyPDF2._utils.read_block_backwards' imported but unused
./tests/test_utils.py:7:1: F401 'PyPDF2._utils.read_previous_line' imported but unused

Do you prefer to fix them yourself or should I do it? (also as a general question)

@pubpub-zz
Copy link
Collaborator Author

@MartinThoma,
I need your help !! 😥
I have an issue in test_utils.py : My changes on tag 2.1.0 works but I get regressions on main.

Can you have a look please

@MartinThoma
Copy link
Member

@pubpub-zz I might be sleepy-dumb, but I don't see what you mean. I think you only have to minor stylistic / mypy adjustments you need to make: #971

@MartinThoma
Copy link
Member

I'll have a more detailed look tomorrow at all the goodness you're bringing to PyPDF2 this time :-)

@MartinThoma
Copy link
Member

MartinThoma commented Jun 10, 2022

Oh, if you worry about the code coverage: That's not so bad. It's especially not a blocker from getting your improvements merged.

I will run various tests (especially https://github.com/py-pdf/benchmarks ) to check things are improved. I can live if coverage drops a bit (and I will have a more detailed look at the places which are not covered)

@pubpub-zz
Copy link
Collaborator Author

@MartinThoma
If you look at the changed files I had to drastically revert in test_utils.py as I had major issues with it. give me 5 min and I will confirm/infirm my issue

@pubpub-zz
Copy link
Collaborator Author

@MartinThoma
my problems are this section of code


@pytest.mark.parametrize(
    ("dat", "pos", "expected", "expected_pos"),
    [
        (b"abc", 1, b"a", 0),
        (b"abc", 2, b"ab", 0),
        (b"abc", 3, b"abc", 0),
        (b"abc\n", 3, b"abc", 0),
        (b"abc\n", 4, b"", 3),
        (b"abc\n\r", 4, b"", 3),
        (b"abc\nd", 5, b"d", 3),
        # Skip over multiple CR/LF bytes
        (b"abc\n\r\ndef", 9, b"def", 3),
        # Include a block full of newlines...
        (
            b"abc" + b"\n" * (2 * io.DEFAULT_BUFFER_SIZE) + b"d",
            2 * io.DEFAULT_BUFFER_SIZE + 4,
            b"d",
            3,
        ),
        # Include a block full of non-newline characters
        (
            b"abc\n" + b"d" * (2 * io.DEFAULT_BUFFER_SIZE),
            2 * io.DEFAULT_BUFFER_SIZE + 4,
            b"d" * (2 * io.DEFAULT_BUFFER_SIZE),
            3,
        ),
        # Both
        (
            b"abcxyz"
            + b"\n" * (2 * io.DEFAULT_BUFFER_SIZE)
            + b"d" * (2 * io.DEFAULT_BUFFER_SIZE),
            4 * io.DEFAULT_BUFFER_SIZE + 6,
            b"d" * (2 * io.DEFAULT_BUFFER_SIZE),
            6,
        ),
    ],
)
def test_read_previous_line(dat, pos, expected, expected_pos):
    s = io.BytesIO(dat)
    s.seek(pos)
    assert read_previous_line(s) == expected
    assert s.tell() == expected_pos

@MartinThoma
Copy link
Member

Oh damn. That sounds as if it's related to #646

I'll have a closer look tomorrow

@pubpub-zz
Copy link
Collaborator Author

I still have some work to fix text extraction with the "paper rotated"
Chinese / russian/ .... are working
I have some doubts about arabic as the text is written right to left.

@codecov
Copy link

codecov bot commented Jun 10, 2022

Codecov Report

Merging #969 (2aea3e9) into main (9c4e7f5) will increase coverage by 0.16%.
The diff coverage is 86.43%.

@@            Coverage Diff             @@
##             main     #969      +/-   ##
==========================================
+ Coverage   84.25%   84.42%   +0.16%     
==========================================
  Files          18       18              
  Lines        4115     4179      +64     
  Branches      868      887      +19     
==========================================
+ Hits         3467     3528      +61     
- Misses        465      468       +3     
  Partials      183      183              
Impacted Files Coverage Δ
PyPDF2/_page.py 82.65% <ø> (+1.10%) ⬆️
PyPDF2/_cmap.py 76.43% <76.43%> (+3.70%) ⬆️
PyPDF2/generic.py 89.70% <81.81%> (-0.13%) ⬇️
PyPDF2/_utils.py 98.03% <98.03%> (ø)
PyPDF2/__init__.py 100.00% <100.00%> (ø)
PyPDF2/filters.py 81.81% <0.00%> (+0.64%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9c4e7f5...2aea3e9. Read the comment docs.

@MartinThoma
Copy link
Member

@pubpub-zz I've added the test back, without any adjustment. It works: #971
Did you maybe solve the issue in between?

@MartinThoma
Copy link
Member

@pubpub-zz I cannot answer that. It would be ok for me to have the behavior as in the specs, which also means that crazyones.txt needs to get adjusted.

I just re-ran the benchmark and the results look very similar.

@pubpub-zz
Copy link
Collaborator Author

@pubpub-zz I cannot answer that. It would be ok for me to have the behavior as in the specs, which also means that crazyones.txt needs to get adjusted.

I just re-ran the benchmark and the results look very similar.

@MartinThoma,
I did more research and I think I've got the solution. should be available by diner

@MartinThoma
Copy link
Member

@pubpub-zz I love how committed you are to improve PyPDF2, but please don't feel pressured because I said that I wanted to make a release today. It's unpaid so it should be fun. If it doesn't work today or even for some weeks, it would be fine 🤗

@pubpub-zz
Copy link
Collaborator Author

Devlivered just before diner and I've mowed the lawn 😁

@pubpub-zz
Copy link
Collaborator Author

pubpub-zz commented Jun 12, 2022

@MartinThoma , can you have a look I do not understand : there is an error on test_utils but it did not changed it.
and it's working fine locally

@MartinThoma
Copy link
Member

That is a merge with main which got wrong. You need to adjust the "ids" parameter to match the number of tests. I think there is currently an "11" but it should be "8" (in the range function)

@MartinThoma
Copy link
Member

I've set the ids because the auto-generated I'd takes just all of the parameters which was extremely long

@MartinThoma
Copy link
Member

Devlivered just before diner and I've mowed the lawn

Good job 😁👍 I was just making burgers for my girlfriend and we will now have an relaxed evening 😊

tests/test_utils.py Outdated Show resolved Hide resolved
@MartinThoma
Copy link
Member

@pubpub-zz I've updated the PR so that the tests run. It was weird that they didn't succeed ... apparently, the tests ran on code as if it was already having the automatic merge. The automatic merge didn't adjust the ids range: 0ba91aa

I try to go through the PR today evening / night :-)

PyPDF2/_cmap.py Outdated Show resolved Hide resolved
PyPDF2/_cmap.py Outdated Show resolved Hide resolved
@MartinThoma
Copy link
Member

@pubpub-zz Looks good to me! I would squash-commit with the following text:

ENH: Text Extraction improvements

- Improvements around /Encoding / /ToUnicode
- Extraction of CMaps improved
- Fallback for font def missing
- Support for /Identity-H and /Identity-V: utf-16-be
- Support for /GB-EUC-H / /GB-EUC-V: gbk
- Support for /GBpc-EUC-H / /GBpc-EUC-V : gb2312
- Store default font space width for 18 commonly used fonts to improve
  whitespace extraction

Does that represent the changes well to users?

@MartinThoma
Copy link
Member

Besides the two typos I've just commented, there is one robustness-change I would do: The .decode("utf-16-be") fails 167x for 22847 PDF files (0.7% of my dataset, so not too wild) with:

    return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1125, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_cmap.py", line 21, in build_char_map
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_cmap.py", line 221, in parse_to_unicode
    ] = unhexlify(sq).decode("utf-16-be")
  File "/home/moose/.pyenv/versions/3.10.2/lib/python3.10/encodings/utf_16_be.py", line 16, in decode
    return codecs.utf_16_be_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-be' codec can't decode bytes in position 0-1: unexpected end of data

I would just wrap it in a try-except UnicodeDecodeError:

import logging
logger = logging.getLogger(__name__)

...

                while a <= b:
                    sq = fmt2 % c
                    key = unhexlify(fmt % a).decode(
                                "charmap" if map_dict[-1] == 1 else "utf-16-be"
                            )
                    unhexlified = unhexlify(sq)
                    try:
                         decoded = unhexlified.decode("utf-16-be")
                    except UnicodeDecodeError as exc:
                        logger.warning("UnicodeDecodeError when parsing cmap")
                        a += 1
                        c += 1
                        continue
                    map_dict[key] = decoded
                    int_entry.append(a)
                    a += 1
                    c += 1

pubpub-zz and others added 2 commits June 13, 2022 18:35
Co-authored-by: Martin Thoma <[email protected]>
Co-authored-by: Martin Thoma <[email protected]>
@pubpub-zz
Copy link
Collaborator Author

under analysis

@pubpub-zz
Copy link
Collaborator Author

pubpub-zz commented Jun 13, 2022

ENH: Text Extraction improvements

  • Improvements around /Encoding / /ToUnicode
  • Extraction of CMaps improved
  • Fallback for font def missing
  • Support for /Identity-H and /Identity-V: utf-16-be
  • Support for /GB-EUC-H / /GB-EUC-V / GBp/c-EUC-H / /GBpc-EUC-V (beta release for evaluation)
  • Arabic (for evaluation)
  • whitespace extraction improvement

…end of data

use surrogatepass  in _cmap and _page
@pubpub-zz
Copy link
Collaborator Author

pubpub-zz commented Jun 13, 2022

@MartinThoma
This latest mod fixed the 'utf-16-be' codec can't decode bytes in position 0-1: unexpected end of data
This should close the issue on the .7% remaining

@MartinThoma MartinThoma merged commit 72fcaae into py-pdf:main Jun 13, 2022
MartinThoma added a commit that referenced this pull request Jun 13, 2022
The 2.2.0 release improves text extraction again via (#969):

* Improvements around /Encoding / /ToUnicode
* Extraction of CMaps improved
* Fallback for font def missing
* Support for /Identity-H and /Identity-V: utf-16-be
* Support for /GB-EUC-H / /GB-EUC-V / GBp/c-EUC-H / /GBpc-EUC-V (beta release for evaluation)
* Arabic (for evaluation)
* Whitespace extraction improvements

Those changes should mainly improve the text extraction for non-ASCII alphabets,
e.g. Russian / Chinese / Japanese / Korean / Arabic.

Full Changelog: 2.1.1...2.2.0
@pubpub-zz pubpub-zz deleted the ExtractText branch June 14, 2022 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants