PageObject.transfer_rotation_to_content() hides some content since pypdf 4.3.0 #2927

Open
stefan6419846 opened this issue Oct 30, 2024 · 2 comments
Labels
is-regression Regression introduced as a side-effect of another change PdfWriter The PdfWriter component is affected

Comments

@stefan6419846
Collaborator

stefan6419846 commented Oct 30, 2024

For some PDF files, calling page.transfer_rotation_to_content() changes the visibility of some content after upgrading from version 4.2.0 to 4.3.0. The corresponding text layer is invisible, but can still be selected.

When viewing the diff, two Q operators are missing in version 4.3.0.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.4.0-150600.23.25-default-x86_64-with-glibc2.38

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfWriter

writer = PdfWriter(clone_from='file.pdf')
for page in writer.pages:
    page.transfer_rotation_to_content()
writer.write('out.pdf')

I do not have a suitable PDF file at the moment, but I am working on getting one.

@stefan6419846 stefan6419846 added PdfWriter The PdfWriter component is affected is-regression Regression introduced as a side-effect of another change labels Oct 30, 2024
@stefan6419846 stefan6419846 changed the title PageObject.transfer_rotation_to_content() hides content since pypdf 4.3.0 PageObject.transfer_rotation_to_content() hides some content since pypdf 4.3.0 Oct 30, 2024
@stefan6419846
Collaborator Author

I managed to create a standalone example in the meantime: test_clean.pdf. Please note that it might show further issues because of the cleanup I did.

After running the above code with pypdf version 4.2.0 and 4.3.0, I get the following diff:

diff --git a/result_4.2.0.pdf b/result_4.3.0.pdf
index 04d3347..72ec47e 100644
--- a/result_4.2.0.pdf
+++ b/result_4.3.0.pdf
@@ -72,7 +72,7 @@ endstream
 endobj
 8 0 obj
 <<
-/Length 992
+/Length 990
 >>
 stream
 q
@@ -122,7 +122,6 @@ BI
 ID /221̎215346^PT^PBS377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377377360^A^@^P
 EI
 Q
-Q
 q
 110 170 5520 7850 re
 W
@@ -177,8 +176,8 @@ xref
 0000000576 00000 n 
 0000000845 00000 n 
 0000001785 00000 n 
-0000002828 00000 n 
-0000002865 00000 n 
+0000002826 00000 n 
+0000002863 00000 n 
 trailer
 <<
 /Size 11
@@ -186,5 +185,5 @@ trailer
 /Info 10 0 R
 >>
 startxref
-2929
+2927
 %%EOF

The most apparent change seems to be that there is one fewer Q operator than before.
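To check for this kind of imbalance without diffing whole files, a rough sketch like the following could count the q (save graphics state) and Q (restore graphics state) operators in a decoded content stream. The helper name graphics_state_balance is hypothetical, and the inline-image handling is a naive assumption (it simply drops everything between ID and the next whitespace-delimited EI), so it is a debugging aid rather than a real content stream parser.

```python
import re


def graphics_state_balance(content: bytes) -> int:
    """Return (#q - #Q) for a decoded content stream; 0 means balanced.

    Hypothetical helper for debugging only, not part of pypdf.
    """
    # Naively drop inline image payloads (ID ... EI) so that binary bytes
    # are not mistaken for operators. A payload that itself contains a
    # whitespace-delimited "EI" would be cut too early.
    stripped = re.sub(rb"\bID\s.*?\sEI\b", b"", content, flags=re.DOTALL)
    tokens = stripped.split()
    return tokens.count(b"q") - tokens.count(b"Q")


# Synthetic stream: two q, an inline image whose payload happens to contain
# " Q ", then only one real Q before the next q ... Q pair.
sample = (
    b"q\nq\nBI\n/W 1 /H 1\nID \xff Q \x00\nEI\nQ\n"
    b"q\n1 0 0 1 0 0 cm\nQ\n"
)
balance = graphics_state_balance(sample)
print(balance)
```

A positive result means that at least one q is never closed by a Q, which matches the situation in the diff above. With pypdf, the bytes could presumably be obtained per page via page.get_contents().get_data(), guarding against get_contents() returning None for empty pages.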

The output files: result_4.2.0.pdf, result_4.3.0.pdf

You can already see that the "abc" text disappeared. When rendering this as PNG through Ghostscript, we can see that the white circles disappear as well.

For 4.2.0:

[Image: result_4.2.0 rendered as PNG]

For 4.3.0:

[Image: result_4.3.0 rendered as PNG]

@stefan6419846
Collaborator Author

The offending commit appears to be 23a81ba, which makes sense as the affected image is an inline image (although the code above never requests it explicitly).
