"extract_text" doesn't output the same transformation matrix in version 3.17 as in 3.16. #2353

ghbm-itk · 2023-12-20T10:16:34Z

I'm trying to extract text from a pdf together with the position of the text.
When I do it in pypdf 3.16 I get the expected result, but I don't in 3.17.

Environment

Windows-10-10.0.19045-SP0
pypdf==3.16.0, crypt_provider=('cryptography', '41.0.3'), PIL=9.5.0
AND
pypdf==3.17.3, crypt_provider=('cryptography', '41.0.7'), PIL=9.5.0

Code + PDF

This is a minimal, complete example that shows the issue:

import pypdf
file_path = "list.pdf"
reader = pypdf.PdfReader(file_path)

text_parts = []

def visitor(text, cm, tm, fd, fs):
    if text.strip() == "Flyttesagsnr.:":
        text_parts.append((cm, tm, text))

reader.pages[0].extract_text(visitor_text=visitor)

print(text_parts)

Unfortunately, I can't share the PDF since it's confidential. I haven't been able to anonymize the document without losing the bug.
I know this might make the bug hard to replicate.

Results

In version 3.17 I get:

[([0.75, 0.0, 0.0, -0.75, 0.0, 841.68], [1.0, 0.0, 0.0, 1.0, 0.0, 0.0], ' Flyttesagsnr.:')]

In version 3.16 I get:

[([0.75, 0.0, 0.0, -0.75, 0.0, 841.68], [1.0, 0.0, 0.0, -1.0, 448.313, 352.05], ' Flyttesagsnr.:')]

As you can see tm[4] and tm[5] are both 0 in version 3.17, which is definitely wrong.
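
To make the difference explicit, here is a small sketch that compares the two reported outputs element by element. The helper name `differing_indices` is made up for illustration; the matrices are copied from the results above.

```python
# Matrices copied from the two results above.
cm_317 = [0.75, 0.0, 0.0, -0.75, 0.0, 841.68]
tm_317 = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
cm_316 = [0.75, 0.0, 0.0, -0.75, 0.0, 841.68]
tm_316 = [1.0, 0.0, 0.0, -1.0, 448.313, 352.05]


def differing_indices(a, b, tol=1e-9):
    """Return the indices at which two 6-element matrices differ (hypothetical helper)."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if abs(x - y) > tol]


cm_diff = differing_indices(cm_316, cm_317)  # cm is identical in both versions
tm_diff = differing_indices(tm_316, tm_317)  # tm differs at indices 3, 4 and 5
print(cm_diff, tm_diff)
```

Besides tm[4] and tm[5], the sign of tm[3] also differs between the two versions.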

stefan6419846 · 2023-12-20T10:27:37Z

If you have a look at the changelog, you will see that there have been some changes/improvements to the text extraction in the meantime. This is probably related to those changes, and the new behavior is most likely either intended or a fix for a previous bug.

ghbm-itk · 2023-12-20T10:30:06Z

But 3.17 outputs a wrong answer, whereas 3.16 outputs the correct one. This seems like a new bug.

stefan6419846 · 2023-12-20T15:18:43Z

Are you able to pinpoint this to one of the versions in-between to further see which change actually introduced this?

pubpub-zz · 2023-12-20T19:47:32Z

In order to be more consistent, you should use the CM matrix to get the absolute position whatever transformation is applied, and not TM, which should be considered an intermediate matrix.

ghbm-itk · 2023-12-21T06:59:28Z

> Are you able to pinpoint this to one of the versions in-between to further see which change actually introduced this?

I will try this when I have some time.

> In order to be more consistent, you should use the CM matrix to get the absolute position whatever transformation is applied, and not TM, which should be considered an intermediate matrix.

I don't think this is true. The actual transformation matrix is a combination of cm and tm as far as I understand. At least for the PDF I was reading here the cm was the same for all text on the page, but the tm wasn't.
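
As a rough sketch of what "a combination of cm and tm" could look like: in the PDF imaging model, a matrix `[a, b, c, d, e, f]` stands for the 3x3 matrix `[[a, b, 0], [c, d, 0], [e, f, 1]]`, and the text matrix is applied before the current transformation matrix (font size, horizontal scaling and rise are ignored here). Whether the `cm` and `tm` values passed to the visitor follow exactly this convention is an assumption, and `multiply` is a hand-written helper, not a pypdf function.

```python
def multiply(m, n):
    """Multiply two PDF-style 6-element matrices, i.e. compute m x n (hypothetical helper)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return [
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    ]


# Values reported by pypdf 3.16 in the original post.
cm = [0.75, 0.0, 0.0, -0.75, 0.0, 841.68]
tm = [1.0, 0.0, 0.0, -1.0, 448.313, 352.05]

combined = multiply(tm, cm)      # text matrix first, then the CTM
x, y = combined[4], combined[5]  # origin of the text in page space
print(x, y)
```

With the 3.16 values this places the text origin at roughly (336.23, 577.64). With the identity `tm` reported by 3.17, the combined matrix is just `cm`, so every such text item would end up at (0, 841.68).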

ghbm-itk · 2023-12-21T08:00:39Z

@stefan6419846
I tested the code snippet in different versions with the following results:
3.16.0: Correct
3.16.1: Correct
3.16.2: Correct
3.16.3: Wrong
3.17.3: Wrong

I suspect the change happened with #2206
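
For reference, the narrowing-down above can be written as a tiny sketch. The `results` list just restates my manual test results, and `first_wrong` is a made-up helper name; nothing here installs or runs pypdf.

```python
# (version, output_was_correct) pairs, in release order, from the manual tests above.
results = [
    ("3.16.0", True),
    ("3.16.1", True),
    ("3.16.2", True),
    ("3.16.3", False),
    ("3.17.3", False),
]


def first_wrong(ordered_results):
    """Return the first version whose output was wrong, or None (hypothetical helper)."""
    for version, correct in ordered_results:
        if not correct:
            return version
    return None


regressed_in = first_wrong(results)
print(regressed_in)
```

So among the versions I tested, the behavior changed between 3.16.2 and 3.16.3.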

pubpub-zz · 2023-12-21T08:22:38Z

> I don't think this is true. The actual transformation matrix is a combination of cm and tm as far as I understand. At least for the PDF I was reading here the cm was the same for all text on the page, but the tm wasn't.

Oops, you are right. I had to keep the existing definitions, even though they are more complex to use.

> I suspect the change happened with #2206

The change was made because the TM was not captured at the beginning of the line. Would you be willing to share the file privately by emailing it to @MartinThoma?

ghbm-itk · 2023-12-21T08:25:43Z

I'm sorry but it would be illegal for me to share the document with anyone outside my org.
Is there a good way to remove all other text from the PDF without affecting the "Flyttesagsnr.:" text?

Whenever I try to edit the pdf, the matrices change completely.

stefan6419846 · 2023-12-24T20:05:01Z

In general, there is no easy/general purpose approach to do this as far as I know. A possible way would be to manually mess with the internal page source, but this requires some deeper understanding of the PDF format.
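
As a very rough illustration of what "messing with the internal page source" might look like, the sketch below works on raw content stream bytes and blanks every literal string that does not contain the text to keep, leaving all operators (`cm`, `Tm`, `Td`, ...) untouched. This is a naive sketch under strong assumptions: the sample stream is invented, `blank_other_strings` is a hypothetical helper, and the regular expression does not handle nested parentheses, hex strings such as `<48656C6C6F>`, text split across `TJ` arrays, or fonts with custom encodings. Blanking strings can also change how far the text position advances within a text block, so the reported matrices may still shift.

```python
import re

# Matches a PDF literal string such as (Hello), including backslash escapes.
# Assumption: no nested unescaped parentheses and no hex strings.
LITERAL_STRING = re.compile(rb"\((?:\\.|[^\\()])*\)")


def blank_other_strings(content: bytes, keep: bytes) -> bytes:
    """Replace every literal string that does not contain `keep` with ()."""

    def _replace(match):
        literal = match.group(0)
        return literal if keep in literal else b"()"

    return LITERAL_STRING.sub(_replace, content)


# Invented content stream for illustration only; not taken from the real PDF.
sample = (
    b"0.75 0 0 -0.75 0 841.68 cm\n"
    b"BT /F1 12 Tf 1 0 0 -1 10 20 Tm (Secret name) Tj ET\n"
    b"BT /F1 12 Tf 1 0 0 -1 448.313 352.05 Tm (Flyttesagsnr.:) Tj ET\n"
)

cleaned = blank_other_strings(sample, b"Flyttesagsnr.:")
print(cleaned.decode("latin-1"))
```

Getting the stream out of a real file and writing it back is a separate step and depends on how the PDF stores its text, so treat this only as a starting point.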

MartinThoma added the workflow-advanced-text-extraction (Getting coordinates, font weight, font type, ...) label Dec 20, 2023

stefan6419846 mentioned this issue Mar 16, 2024

PageObject.extract_texts text_visitor reports a wrong matrix for some text nodes #2513

Open
