BUG: Incorrect text matrix passed to visitor_text in page.extract_text #2059

troethe · 2023-08-03T10:57:18Z

When page.extract_text is given a visitor_text function, the callback receives an incorrect tm_matrix (text matrix) argument.

I believe this happens because the visitor is called with the new tm_matrix, even though at that point the text matrix may already have been changed by the operator (OP) currently being handled.

Environment

Running the current main branch of pypdf on Debian.

$ python -m platform
Linux-5.10.0-23-amd64-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf.__version__)"
3.14.0

Code + PDF

Here is an example script that converts a testdoc.pdf to SVG and logs both the operators (OPs) on the page and each call to the visitor, together with its x, y coordinates.

from pypdf import PdfReader
import svgwrite

file_name = "./testdoc.pdf"

reader = PdfReader(file_name)
page = reader.pages[0]

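# Log the raw content-stream operators of the page.
print("####\nOps:\n####")
print(page.get_contents().operations)

# Upper-right corner of the media box; equals the page size when the
# lower-left corner is at the origin.
width, height = page.mediabox[2:]

dwg = svgwrite.Drawing(file_name[:-3] + "svg", size=(width, height), profile="full")


def visitor_svg_text(text, cm, tm, fontDict, fontSize):
    # tm[4] and tm[5] are the translation components of the text matrix.
    (x, y) = (tm[4], tm[5])
    print(x, y, text)
    # The SVG y axis points down while the PDF y axis points up, so flip y.
    dwg.add(dwg.text(text, insert=(x, height-y), fill="blue", style=f"font-size: {fontSize}px"))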


print("\n####\nParsed Lines:\n####")
page.extract_text(visitor_text=visitor_svg_text)
dwg.save()

Output:

####
Ops:
####
[([], b'BT'), (['/F29', 14.3462], b'Tf'), ([133.768, 707.125], b'Td'), ([['1', -1125, 'T', 94, 'estsection']], b'TJ'), (['/F19', 9.9626000000000001], b'Tf'), ([0, -21.821000000000002], b'Td'), ([['A', -333, 'B', -334, 'C']], b'TJ'), ([169.36500000000001, -546.04899999999998], b'Td'), ([['1']], b'TJ'), ([], b'ET')]

####
Parsed Lines:
####
0.0 0.0 
133.768 707.125 1 Testsection
133.768 685.304 

303.13300000000004 139.255 A B C

303.13300000000004 139.255 1

As can be seen (also in the resulting SVG), the coordinates of the line "A B C" are affected by the Td that follows it, because that is when crlf_space_check detects a new line. However, it then passes the already altered tm_matrix to the visitor, instead of the one that was active before the operator was applied.
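To make the ordering problem concrete, here is a minimal sketch that replays the last four text operators from the log above. It is a toy model under stated assumptions, not pypdf's implementation: the text matrix is reduced to its translation part, only Td and TJ are handled, font operators are left out, and the names `replay`, `OPS`, `START` and `use_old_matrix` are made up for this illustration.

```python
# Toy model of the flush ordering; this is NOT pypdf's implementation.
# Assumption: the text matrix is reduced to a plain (x, y) translation.

START = (133.768, 707.125)  # position reached after the first Td in the log

# The last four text operators from the log (TJ arrays joined into strings).
OPS = [
    ("Td", (0, -21.821)),
    ("TJ", "A B C"),
    ("Td", (169.365, -546.049)),
    ("TJ", "1"),
]


def replay(ops, start, use_old_matrix):
    """Return the (x, y, text) tuples a visitor would receive."""
    calls = []
    pos = start      # current text position
    buffered = ""    # text collected since the last flush
    for operator, operand in ops:
        if operator == "Td":
            old_pos = pos
            pos = (pos[0] + operand[0], pos[1] + operand[1])
            if buffered:
                # A move ends the current line, so the buffer is flushed.
                # The buffered text belongs to old_pos, not to pos.
                x, y = old_pos if use_old_matrix else pos
                calls.append((x, y, buffered))
                buffered = ""
        elif operator == "TJ":
            buffered += operand
    if buffered:
        # Flush whatever is left at the end of the text object.
        calls.append((pos[0], pos[1], buffered))
    return calls


for label, flag in (("new matrix (as reported)", False), ("old matrix", True)):
    print(label)
    for x, y, text in replay(OPS, START, flag):
        print(f"  {x:.3f} {y:.3f} {text}")
```

With `use_old_matrix=False` the sketch reports "A B C" at roughly (303.133, 139.255), the same position as the trailing "1", which matches the log above. With `use_old_matrix=True` it reports "A B C" at roughly (133.768, 685.304), where the line actually starts.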

Supply the old tm_matrix when flushing out `text` to the `visitor_text` in `crlf_space_check`. The new one might already be changed and unrelated to the current text. Also add a test for the tm_matrix and cm_matrix that are given to `visitor_text` when extracting text. The test computes the coordinates of three letters in different parts of a test page based on the matrices and checks whether they are roughly where they should be. Fixes py-pdf#2059
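For readers who want to run a similar check on their own documents, the following is a rough sketch of how a position can be derived from the two matrices a visitor receives. It is not the test added by the fix. It assumes the usual PDF convention that a matrix is the six-element list [a, b, c, d, e, f] and that the text origin in user space is the translation part of tm multiplied by cm. The helper names `mult`, `text_origin` and `roughly_at` are hypothetical.

```python
# Sketch only: derive the user-space position of the text origin from the
# cm (current transformation) and tm (text) matrices passed to a visitor.
# Each matrix is a 6-element sequence [a, b, c, d, e, f] standing for
#     | a b 0 |
#     | c d 0 |
#     | e f 1 |


def mult(m, n):
    """Return the matrix product m x n in 6-element form."""
    return [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]


def text_origin(cm, tm):
    """Map the text-space origin (0, 0) to user space via tm, then cm."""
    combined = mult(tm, cm)
    return combined[4], combined[5]


def roughly_at(actual, expected, tolerance=1.0):
    """True if both coordinates are within `tolerance` units of `expected`."""
    return (
        abs(actual[0] - expected[0]) <= tolerance
        and abs(actual[1] - expected[1]) <= tolerance
    )


# Example: identity cm, tm translated to the start of the "A B C" line.
identity = [1, 0, 0, 1, 0, 0]
position = text_origin(identity, [1, 0, 0, 1, 133.768, 685.304])
print(position, roughly_at(position, (133.8, 685.3)))
```

Inside a visitor, `text_origin(cm, tm)` could be called in place of reading `tm[4]` and `tm[5]` directly, which also accounts for pages where cm is not the identity.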

troethe mentioned this issue Aug 3, 2023

BUG: Fix incorrect tm_matrix in call to visitor_text #2060

Merged

MartinThoma changed the title ~~Incorrect text matrix passed to visitor_text in page.extract_text~~ BUG: Incorrect text matrix passed to visitor_text in page.extract_text Aug 3, 2023

MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Aug 3, 2023

MartinThoma closed this as completed in #2060 Aug 13, 2023

MartinThoma mentioned this issue Aug 14, 2023

x values in the tm_matrix are wrong #2075

Closed

MartinThoma added the workflow-advanced-text-extraction Getting coordinates, font weight, font type, ... label Aug 14, 2023

pubpub-zz mentioned this issue Sep 19, 2023

BUG: invalid cm/tm in visitor functions #2206

Merged

MartinThoma pushed a commit that referenced this issue Oct 8, 2023

BUG: invalid cm/tm in visitor functions (#2206)

bcd85c4

Reworks and is still valid to close #2059 Closes #2200 Closes #2075
