Incorrect image generation and content extraction #4184

Open
1339503169 opened this issue Dec 30, 2024 · 11 comments

Comments

@1339503169

Description of the bug

Here is the raw PDF:
BOW2429730S1.pdf
How it looks in WPS:
image
The picture I rendered with PyMuPDF:
BOW2429730S1 pdf
The two appear inconsistent.

There are two issues with this document. First, the text I extracted is missing content compared to the source PDF. Second, the layout of the generated image differs from the source file. Is there any configuration or approach that allows this document to be extracted correctly?

How to reproduce the bug

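import fitz  # PyMuPDF

document = fitz.open('BOW2429730S1.pdf')
page = document.load_page(0)
texts = page.get_text()  # plain-text extraction of the first page
img = page.get_pixmap()  # render the page at the default resolution
img.save('page0.png')  # save() needs an output filename; this name is illustrative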

PyMuPDF version

1.24.10

Operating system

Linux

Python version

3.9

@JorjMcKie
Collaborator

Please be specific:

  • what are the layout differences exactly
  • which parts of the text are missing

@HuJianE

HuJianE commented Jan 9, 2025

I found a similar issue. This PDF has only 4 pictures, but what I see in WPS is different.
SAMPLE+5.pdf

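import fitz  # PyMuPDF

file = '/Users/hujian/Downloads/SAMPLE+5.pdf'
pdf = fitz.open(file)
# metadata of the images displayed on page 0, including their xrefs
images = pdf[0].get_image_info(hashes=False, xrefs=True)
# raw image data and properties stored at xref 7
image_info = pdf.extract_image(xref=7)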

please help check this, I assume this is a submasked picture and thus
image

Waiting for your response. I hope you can point out what I missed about how to process this image.

@HuJianE

HuJianE commented Jan 13, 2025

Any feedback?
@JorjMcKie

@JorjMcKie
Collaborator

Everything works as it should!
The page does have 4 image references in page.get_images(), but it only displays 3 of them. The 4th one (with the mask, KSPX48) is used as a watermark only and thus does not fall under this category, meaning it does not appear in page.get_image_info() / page.get_text("dict").
The other 3 images at xrefs 7, 14, 16 are accessible as expected.
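
The difference between referenced and displayed images described above can be checked with a small helper. This is only a sketch: the function name `undisplayed_image_xrefs` is made up for illustration, and it assumes `page` is a PyMuPDF page whose `get_images()` items carry the xref at index 0 and whose `get_image_info(xrefs=True)` dictionaries carry an `"xref"` key.

```python
def undisplayed_image_xrefs(page):
    """Return xrefs that the page references but does not report as displayed.

    Hypothetical helper, not part of PyMuPDF.
    """
    # Every image reference of the page, whether shown or not.
    referenced = {item[0] for item in page.get_images(full=True)}
    # Only the images reported as displayed (same set as get_text("dict")).
    displayed = {info["xref"] for info in page.get_image_info(xrefs=True)}
    return sorted(referenced - displayed)


# Usage with PyMuPDF (file name taken from this thread):
#   import fitz
#   page = fitz.open("SAMPLE+5.pdf")[0]
#   print(undisplayed_image_xrefs(page))
```

If the list is non-empty, those xrefs are candidates for images that are used in some other way, such as the watermark mentioned above.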

@HuJianE

HuJianE commented Jan 16, 2025

@JorjMcKie thanks, I understood it as an smask.
But how can I know whether xref 7 has a mask?
I tried to use
page.get_image_info() / page.get_text("dict")
and found that the picture at xref 7 has no mask image.
image
How can I find out whether the KSPX48 you mentioned is on top of xref 7?
Thanks for any reply to my stupid question.

@JorjMcKie
Collaborator

Image KSPX1 is used as a watermark. This is the only one that has a mask!
It cannot be extracted via .get_text()/.get_image_info() at all.
It is therefore not easy to find out whether this image is "below" or "above" other page content. But being a watermark and having transparency make it probable that it is above the content.
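
Which images declare a mask can be read from `page.get_images()`, whose items hold the soft-mask xref at index 1 (0 when there is none). A minimal sketch, assuming a PyMuPDF page; the helper name `images_with_smask` is invented for this example:

```python
def images_with_smask(page):
    """Map image xref -> smask xref for images that declare a soft mask.

    Hypothetical helper, not part of PyMuPDF.
    """
    result = {}
    for item in page.get_images(full=True):
        xref, smask = item[0], item[1]
        if smask > 0:  # 0 means the image has no soft mask
            result[xref] = smask
    return result


# Usage with PyMuPDF (file name taken from this thread):
#   import fitz
#   page = fitz.open("SAMPLE+5.pdf")[0]
#   print(images_with_smask(page))
```

This only reports which images carry a mask; it says nothing about the stacking order of images on the page.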

@HuJianE

HuJianE commented Jan 16, 2025

No,
I think you misunderstood my issue. I am referring to the picture with xref 7; the content I extracted for that picture is shown below:
image
But when I look at this picture in WPS or Microsoft edit, after removing the two yellow pictures I get this:
image

So I am curious: when I extract this picture, it shows no mask, yet in WPS or the Microsoft Word editor the two pictures look different. Why?

@HuJianE

HuJianE commented Jan 16, 2025

@JorjMcKie This is my issue. I know that KSPX1 is the watermark of the whole page; it is the red picture I found on the page, rotated by 30 or 45 degrees.

@JorjMcKie
Collaborator

I don't know what WPS or Microsoft edit are doing ... and I won't investigate their behavior either.

But PyMuPDF allows you to "delete" images selectively. If you execute page.delete_image() with xrefs 7, 14, 16 and save the resulting PDF each time separately, you will get these results:
without-7.pdf
without-14.pdf
without-16.pdf
This shows nicely how the overall page appearance comes about.
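
The procedure above could be scripted roughly as follows. This is a sketch, not the exact code used to produce the attached files: the function name and the `opener` parameter are made up for illustration (the latter only allows passing a stub instead of `fitz.open`), and the output names simply mirror the attachments.

```python
def save_without_each(pdf_path, xrefs, page_number=0, opener=None):
    """Write one copy of the PDF per xref, each with that one image removed.

    Hypothetical helper, not part of PyMuPDF.
    """
    if opener is None:
        import fitz  # PyMuPDF, imported lazily so a stub opener can be used

        opener = fitz.open
    outputs = []
    for xref in xrefs:
        doc = opener(pdf_path)  # reopen so every copy starts from the original
        page = doc[page_number]
        page.delete_image(xref)  # swaps the image for a small transparent one
        out_name = f"without-{xref}.pdf"
        doc.save(out_name)
        doc.close()
        outputs.append(out_name)
    return outputs


# Usage (file name taken from this thread):
#   save_without_each("SAMPLE+5.pdf", [7, 14, 16])
```

Reopening the file for each xref keeps the deletions independent, so each output differs from the original by exactly one image.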

@HuJianE

HuJianE commented Jan 16, 2025

I will look into those myself for now; I am still trying to understand what they did to the PDF.

@JorjMcKie
Collaborator

JorjMcKie commented Jan 16, 2025

The PDF page also has a number of "clips" defined. They can make areas appear empty.
All images on the page are wrapped in clip rectangles, so when "deleting" them, their areas appear empty.
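
One way to look at the clips mentioned above is `page.get_drawings(extended=True)`, which also returns clip paths. The sketch below assumes those entries have `"type": "clip"` and a `"scissor"` rectangle, as described in the PyMuPDF documentation; the helper name is invented, and it has not been checked here that the image clips of this particular file are reported this way.

```python
def clip_scissors(page):
    """Return the scissor rectangles of all clip paths reported for the page.

    Hypothetical helper, not part of PyMuPDF.
    """
    scissors = []
    for path in page.get_drawings(extended=True):
        if path.get("type") == "clip":  # skip fills, strokes and groups
            scissors.append(path.get("scissor"))
    return scissors


# Usage with PyMuPDF (file name taken from this thread):
#   import fitz
#   page = fitz.open("SAMPLE+5.pdf")[0]
#   for rect in clip_scissors(page):
#       print(rect)
```

Comparing these rectangles with the `"bbox"` values from `page.get_image_info()` may show which clip surrounds which image.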
