BUG: Incorrect text matrix passed to visitor_text in page.extract_text #2059

Closed
troethe opened this issue Aug 3, 2023 · 0 comments · Fixed by #2060 or #2206
Labels
is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-advanced-text-extraction: Getting coordinates, font weight, font type, ...

Comments

Contributor

troethe commented Aug 3, 2023

When page.extract_text is given a visitor_text function, the callback is invoked with an incorrect tm_matrix (text matrix) argument.

I believe this happens because the visitor is called with the new tm_matrix even though, at that point, the operator currently being handled may already have changed the text matrix.
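To make the suspected mechanism concrete, here is a toy model in plain Python. It is not pypdf's code: the function `extract`, its `use_previous_matrix` flag, and the simplified operator list are all hypothetical. It also assumes that pending text is flushed on Tf as well as on Td, and that Td simply adds to the translation of an unscaled text matrix.

```python
# Toy model of the reported behaviour; this is NOT pypdf's implementation.
# Assumptions: the text matrix starts at (0, 0), Td adds to its translation,
# and pending text is flushed on Tf and on Td.

OPS = [
    (["/F29", 14.3462], "Tf"),
    ([133.768, 707.125], "Td"),
    (["1 Testsection"], "TJ"),
    (["/F19", 9.9626], "Tf"),
    ([0, -21.821], "Td"),
    (["A B C"], "TJ"),
    ([169.365, -546.049], "Td"),
    (["1"], "TJ"),
]


def extract(ops, use_previous_matrix):
    """Return (text, x, y) tuples as a visitor would receive them."""
    x, y = 0.0, 0.0  # translation part of the text matrix (tm[4], tm[5])
    pending = ""     # text collected but not yet handed to the visitor
    seen = []

    for operands, operator in ops:
        prev = (x, y)  # state before the current operator is applied
        if operator == "Td":
            x, y = x + operands[0], y + operands[1]
            if pending:
                # The text was drawn under `prev`; (x, y) belongs to the next line.
                pos = prev if use_previous_matrix else (x, y)
                seen.append((pending, pos[0], pos[1]))
                pending = ""
        elif operator == "Tf":
            if pending:
                seen.append((pending, x, y))
                pending = ""
        elif operator == "TJ":
            pending += operands[0]

    if pending:
        seen.append((pending, x, y))
    return seen


for label, flag in (("new matrix", False), ("previous matrix", True)):
    print(label, extract(OPS, use_previous_matrix=flag))
```

With `use_previous_matrix=False`, the toy reports "A B C" at the position set by the following Td. With `use_previous_matrix=True`, it reports the position that was active when the text was drawn.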

Environment

Running the current main branch of pypdf on Debian.

$ python -m platform
Linux-5.10.0-23-amd64-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf.__version__)"
3.14.0

Code + PDF

Here is an example script that converts testdoc.pdf to SVG and logs both the operators on the page and each call to the visitor with its x, y coordinates.

from pypdf import PdfReader
import svgwrite

file_name = "./testdoc.pdf"

reader = PdfReader(file_name)
page = reader.pages[0]

print("####\nOps:\n####")
print(page.get_contents().operations)

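# Upper-right corner of the media box; equals width and height when the box starts at (0, 0).
width, height = page.mediabox[2:]

dwg = svgwrite.Drawing(file_name[:-3] + "svg", size=(width, height), profile="full")


def visitor_svg_text(text, cm, tm, fontDict, fontSize):
    # tm[4] and tm[5] hold the translation (x, y) of the text matrix.
    (x, y) = (tm[4], tm[5])
    print(x, y, text)
    # The SVG y axis points down while the PDF y axis points up, so flip y.
    dwg.add(dwg.text(text, insert=(x, height - y), fill="blue", style=f"font-size: {fontSize}px"))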


print("\n####\nParsed Lines:\n####")
page.extract_text(visitor_text=visitor_svg_text)
dwg.save()

Output:

####
Ops:
####
[([], b'BT'), (['/F29', 14.3462], b'Tf'), ([133.768, 707.125], b'Td'), ([['1', -1125, 'T', 94, 'estsection']], b'TJ'), (['/F19', 9.9626000000000001], b'Tf'), ([0, -21.821000000000002], b'Td'), ([['A', -333, 'B', -334, 'C']], b'TJ'), ([169.36500000000001, -546.04899999999998], b'Td'), ([['1']], b'TJ'), ([], b'ET')]

####
Parsed Lines:
####
0.0 0.0 
133.768 707.125 1 Testsection
133.768 685.304 

303.13300000000004 139.255 A B C

303.13300000000004 139.255 1

As the output shows (and the resulting SVG confirms), the coordinates reported for the line "A B C" are affected by the Td that follows it, because that is the point at which crlf_space_check detects a new line. However, it then passes the already altered tm_matrix to the visitor instead of the one that was active before the operator was applied.
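For comparison, the expected origin of each TJ can be worked out by hand from the logged operations by accumulating the Td offsets. The sketch below does that arithmetic. The helper `expected_origins` is hypothetical, and it assumes an identity text matrix at BT and no scaling, so that Td reduces to adding offsets.

```python
# Operations as logged above (long float reprs shortened for readability).
OPERATIONS = [
    ([], b"BT"),
    (["/F29", 14.3462], b"Tf"),
    ([133.768, 707.125], b"Td"),
    ([["1", -1125, "T", 94, "estsection"]], b"TJ"),
    (["/F19", 9.9626], b"Tf"),
    ([0, -21.821], b"Td"),
    ([["A", -333, "B", -334, "C"]], b"TJ"),
    ([169.365, -546.049], b"Td"),
    ([["1"]], b"TJ"),
    ([], b"ET"),
]


def expected_origins(operations):
    """Return (text, x, y) for each TJ, using only BT and Td to track position."""
    x = y = 0.0
    origins = []
    for operands, operator in operations:
        if operator == b"BT":
            x = y = 0.0  # BT resets the text matrix to identity
        elif operator == b"Td":
            x += operands[0]
            y += operands[1]
        elif operator == b"TJ":
            # Keep the string pieces; the numbers are kerning adjustments.
            text = "".join(p for p in operands[0] if isinstance(p, str))
            origins.append((text, round(x, 3), round(y, 3)))
    return origins


for text, x, y in expected_origins(OPERATIONS):
    print(x, y, text)
```

Under these assumptions the second TJ ("A", "B", "C") starts at about (133.768, 685.304), not at (303.133, 139.255) as reported to the visitor. The kerning numbers are dropped here, so the joined strings contain no spaces.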

MartinThoma changed the title from "Incorrect text matrix passed to visitor_text in page.extract_text" to "BUG: Incorrect text matrix passed to visitor_text in page.extract_text" Aug 3, 2023
MartinThoma added the is-bug label Aug 3, 2023
troethe added a commit to troethe/pypdf that referenced this issue Aug 13, 2023
Supply the old tm_matrix when flushing out `text` to the `visitor_text`
in `crlf_space_check`. The new one might already be changed and
unrelated to the current text.

Also add a test for the tm_matrix and cm_matrix that are given to
`visitor_text` when extracting text.
The test computes the coordinates of three letters in different
parts of a test page based on the matrices and checks, if they are
roughly where they should be.

Fixes py-pdf#2059
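
The test itself is not reproduced in this thread. As a rough illustration of what "computing coordinates based on the matrices" can look like, the sketch below maps the text origin through tm_matrix and then cm_matrix, assuming the six-element [a, b, c, d, e, f] layout used for PDF matrices. The helper names and the tolerance are hypothetical and are not taken from the commit.

```python
def apply_matrix(matrix, point):
    """Transform (x, y) by a PDF matrix given as [a, b, c, d, e, f]."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


def text_origin(tm, cm):
    """Map the text-space origin through the text matrix, then the CTM."""
    return apply_matrix(cm, apply_matrix(tm, (0.0, 0.0)))


def roughly_at(actual, expected, tolerance=5.0):
    """True if both coordinates are within `tolerance` units of `expected`."""
    return (abs(actual[0] - expected[0]) <= tolerance
            and abs(actual[1] - expected[1]) <= tolerance)


# Example: an unscaled text matrix under an identity cm.
tm = [1.0, 0.0, 0.0, 1.0, 133.768, 685.304]
cm = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
print(roughly_at(text_origin(tm, cm), (133.0, 685.0)))
```

A tolerance-based comparison fits the "roughly where they should be" wording, since glyph origins rarely coincide exactly with a hand-measured position.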
MartinThoma pushed a commit that referenced this issue Aug 13, 2023
Supply the old tm_matrix when flushing out `text` to the `visitor_text`
in `crlf_space_check`. The new one might already be changed and
unrelated to the current text.

Also add a test for the tm_matrix and cm_matrix that are given to
`visitor_text` when extracting text.
The test computes the coordinates of three letters in different
parts of a test page based on the matrices and checks, if they are
roughly where they should be.

Fixes #2059
MartinThoma added the workflow-advanced-text-extraction label Aug 14, 2023
MartinThoma pushed a commit that referenced this issue Oct 8, 2023
Reworks and is still valid to close #2059

Closes #2200
Closes #2075