Incorrect image generation and content extraction #4184

1339503169 · 2024-12-30T06:52:49Z

Description of the bug

Here is the raw PDF:
BOW2429730S1.pdf
This is what I see in WPS:

This is the image I rendered with PyMuPDF:

They appear to be inconsistent.

There are two issues with this document. First, the text I extracted is missing content compared to the source PDF. Second, the layout of the rendered image differs from the source file. Is there any configuration or approach that allows this document to be extracted correctly?

How to reproduce the bug

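import fitz  # PyMuPDF

document = fitz.open('BOW2429730S1.pdf')
page = document.load_page(0)
texts = page.get_text()  # plain text of the first page
img = page.get_pixmap()  # render the page to a raster image
img.save('page-0.png')  # save() needs a file name; this one is only an example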

PyMuPDF version

1.24.10

Operating system

Linux

Python version

3.9

JorjMcKie · 2024-12-30T09:12:03Z

Please be specific:

What exactly are the layout differences?
Which parts of the text are missing?

HuJianE · 2025-01-09T07:27:02Z

I found a similar issue. This PDF has only 4 images, but what I see in WPS is different.
SAMPLE+5.pdf

file = '/Users/hujian/Downloads/SAMPLE+5.pdf'
pdf = fitz.open(file)
images = pdf[0].get_image_info(hashes=False, xrefs=True)
image_info = pdf.extract_image(xref=7)

Please help check this. I assume this is an image with a soft mask (smask), and thus

I am waiting for your response. I hope I have simply missed how to process this image.

HuJianE · 2025-01-13T03:38:08Z

Any feedback?
@JorjMcKie

JorjMcKie · 2025-01-13T11:43:32Z

Everything works as it should!
The page does have 4 image references in page.get_images(), but it only displays 3 of them. The 4th one (with the mask, KSPX48) is used as a watermark only and is therefore not counted as a displayed image, meaning it does not appear in page.get_image_info() / page.get_text("dict").
The other 3 images at xrefs 7, 14, 16 are accessible as expected.
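
The comparison described above can be sketched in a few lines. This is only an illustration, not part of the original reply: the helper name `referenced_but_not_displayed` is hypothetical, and the sketch assumes that each tuple from `page.get_images(full=True)` starts with the xref and carries the resource name as its eighth item, and that `page.get_image_info(xrefs=True)` returns dicts with an `"xref"` key.

```python
def referenced_but_not_displayed(page):
    """Return (xref, name) pairs for images that the page references
    but that are not reported by page.get_image_info().

    `page` is expected to be a PyMuPDF Page. The helper name is
    hypothetical and only illustrates the comparison.
    """
    # Every image reference found in the page's resources.
    # Assumed tuple layout: (xref, smask, width, height, bpc, colorspace,
    #                        alt_colorspace, name, filter, referencer)
    referenced = page.get_images(full=True)

    # Images reported by get_image_info(); xrefs=True adds the "xref" key.
    displayed = {info["xref"] for info in page.get_image_info(xrefs=True)}

    return [(img[0], img[7]) for img in referenced if img[0] not in displayed]


# Usage (assumes the sample file from this thread is available locally):
#   import fitz
#   doc = fitz.open("SAMPLE+5.pdf")
#   print(referenced_but_not_displayed(doc[0]))
```

For the page discussed here, one would expect the watermark image to be the only entry in the result, but that depends on the file.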

HuJianE · 2025-01-16T06:51:45Z

@JorjMcKie Thanks, I understand it as an smask.
But how can I know whether xref 7 has a mask?
I tried
page.get_image_info() / page.get_text("dict")
and found that the picture at xref 7 has no mask image.

How can I find out whether the KSPX48 you mentioned is on top of xref 7?
Thanks for any reply to my stupid question.

JorjMcKie · 2025-01-16T07:19:10Z

Image KSPX1 is used as a watermark. This is the only one that has a mask!
It cannot be extracted via .get_text()/.get_image_info() at all.
It is therefore not easy to find out whether this image is "below" or "above" other page content. But being a watermark and having transparency make it probable that it is above the content.
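
Whether a given image carries a mask can be read from the image list itself. The following is a minimal sketch, assuming that the second item of each `page.get_images()` tuple is the xref of the image's soft mask and is 0 when there is none. The helper name `smask_by_xref` is hypothetical.

```python
def smask_by_xref(page):
    """Map each referenced image xref to its soft-mask xref (0 = no mask).

    `page` is expected to be a PyMuPDF Page.
    """
    # Assumed tuple layout: item 0 is the image xref, item 1 the smask xref.
    return {img[0]: img[1] for img in page.get_images(full=True)}


# Usage (assumes the sample file from this thread is available locally):
#   import fitz
#   doc = fitz.open("SAMPLE+5.pdf")
#   masks = smask_by_xref(doc[0])
#   print(masks.get(7))  # 0 would mean that xref 7 has no soft mask
#   # doc.extract_image(7)["smask"] is expected to report the same value
```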

HuJianE · 2025-01-16T07:27:07Z

No.
I think you misunderstood my issue. I am referring to the picture with xref 7. The content I extracted for that picture is shown below:

But when I looked at this picture in WPS or the Microsoft editor, I removed those two yellow pictures and got this:

So I am curious: when I extract this picture, it reports no mask, yet in WPS or the Microsoft Word editor the two pictures look different. Why is that?

HuJianE · 2025-01-16T07:28:31Z

@JorjMcKie This is my issue. I know that KSPX1 is the watermark for the whole page; it is the red picture I found on the page, rotated by 30 or 45 degrees.

JorjMcKie · 2025-01-16T08:00:40Z

I don't know what WPS or Microsoft edit are doing ... and I won't investigate their behavior either.

But PyMuPDF allows you to "delete" images selectively. If you execute page.delete_image() with xrefs 7, 14, 16 and save the resulting PDF each time separately, you will get these results:
without-7.pdf
without-14.pdf
without-16.pdf
This shows nicely how the overall page appearance comes about.
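
The procedure above can be sketched as follows. This is an illustration under assumptions rather than the exact script used for the attached files: the helper name `save_without_image` is hypothetical, and the output names simply follow the `without-<xref>.pdf` pattern of the attachments.

```python
def save_without_image(doc, xref, page_number=0):
    """Delete one image from a page and save the result as a new PDF.

    `doc` is expected to be a freshly opened PyMuPDF Document, so that
    the saved file lacks only the image with the given xref.
    Returns the output file name.
    """
    out_name = f"without-{xref}.pdf"
    doc[page_number].delete_image(xref)  # removes the image's content
    doc.save(out_name)
    return out_name


# Usage: reopen the source for every xref so the deletions do not add up.
#   import fitz
#   for xref in (7, 14, 16):
#       doc = fitz.open("SAMPLE+5.pdf")
#       save_without_image(doc, xref)
#       doc.close()
```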

HuJianE · 2025-01-16T08:04:29Z

I will look into those myself for now. I am still trying to understand what they did to the PDF.

JorjMcKie · 2025-01-16T08:15:05Z

The PDF page also has a number of "clips" defined. They can make areas appear empty.
All images on the page are wrapped in clip rectangles, so when "deleting" them, their areas appear empty.
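
The clips can be listed for inspection. The sketch below assumes that `page.get_drawings(extended=True)` also returns clip entries, marked with `"type": "clip"` and carrying their rectangle under the `"scissor"` key. The helper name `clip_rects` is hypothetical.

```python
def clip_rects(page):
    """Return the rectangles of the clip paths reported for the page.

    `page` is expected to be a PyMuPDF Page.
    """
    # extended=True asks get_drawings() to include clip and group entries.
    paths = page.get_drawings(extended=True)
    # Assumption: clip entries store their rectangle under "scissor".
    return [p["scissor"] for p in paths if p.get("type") == "clip"]


# Usage (assumes the sample file from this thread is available locally):
#   import fitz
#   doc = fitz.open("SAMPLE+5.pdf")
#   for rect in clip_rects(doc[0]):
#       print(rect)
```

Comparing these rectangles with the `bbox` values from `page.get_image_info()` may show which clip wraps which image.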

JorjMcKie added the Waiting for information label Dec 30, 2024
