"extract_text" doesn't output the same transformation matrix in version 3.17 as in 3.16. #2353
Comments
If you look at the changelog, you will see that there have been some changes/improvements to text extraction in the meantime. This is probably related to those changes, and is most likely either intended or the fix for a previous bug.
But 3.17 outputs a wrong answer where 3.16 outputs the correct one. This seems like a new bug.
Are you able to pinpoint this to one of the versions in between, to see which change actually introduced it?
To be more consistent, you should use the CM matrix to get an absolute position whatever transformation is applied, and not the TM, which should be considered an intermediate matrix.
I will try this when I have some time.
I don't think this is true. The actual transformation matrix is a combination of cm and tm as far as I understand. At least for the PDF I was reading here the cm was the same for all text on the page, but the tm wasn't.
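To illustrate the point about combining the two matrices, here is a minimal sketch, not the code from this issue. It assumes the usual PDF convention that a matrix `[a, b, c, d, e, f]` acts on row vectors, so that text space maps to user space through `tm × cm`. The helpers `mult` and `make_visitor` are hypothetical names introduced for this example; only the visitor signature `(text, cm, tm, font_dict, font_size)` comes from pypdf's `visitor_text` hook.

```python
from typing import Callable, List, Sequence, Tuple

Matrix = Sequence[float]  # PDF affine matrix in [a, b, c, d, e, f] form


def mult(m: Matrix, n: Matrix) -> List[float]:
    """Return the product m x n of two PDF matrices (row-vector convention)."""
    return [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]


def make_visitor(records: List[Tuple[str, float, float]]) -> Callable:
    """Build a visitor that stores each text chunk with its combined origin."""

    def visitor(text, cm, tm, font_dict, font_size):
        # Skip chunks that contain only whitespace.
        if text.strip():
            combined = mult(tm, cm)  # assumption: text space -> user space
            records.append((text, combined[4], combined[5]))

    return visitor
```

With a real document, the visitor would be passed as `page.extract_text(visitor_text=make_visitor(records))`. Whether the combined origin matches what 3.16 reported for the confidential PDF cannot be checked here.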
@stefan6419846 I suspect the change happened with #2206
Oops, you are right: I had to keep the existing definitions, even though they are more complex to use.
The change was made because the TM was not captured at the beginning of the line. Would you be willing to share the file privately by emailing it to @MartinThoma?
I'm sorry but it would be illegal for me to share the document with anyone outside my org. Whenever I try to edit the pdf, the matrices change completely.
In general, there is no easy/general purpose approach to do this as far as I know. A possible way would be to manually mess with the internal page source, but this requires some deeper understanding of the PDF format.
I'm trying to extract text from a pdf together with the position of the text.
When I do it in pypdf 3.16 I get the expected result, but I don't in 3.17.
Environment
Windows-10-10.0.19045-SP0
pypdf==3.16.0, crypt_provider=('cryptography', '41.0.3'), PIL=9.5.0
AND
pypdf==3.17.3, crypt_provider=('cryptography', '41.0.7'), PIL=9.5.0
Code + PDF
This is a minimal, complete example that shows the issue:
Unfortunately I can't share the PDF since it's confidential. I haven't been able to declassify the document and still keep the bug.
I know this might make the bug hard to replicate.
Results
In version 3.17 I get:
In version 3.16 I get:
As you can see tm[4] and tm[5] are both 0 in version 3.17, which is definitely wrong.