Can't extract images for this PDF #3936

bbfrog · 2024-10-10T18:23:31Z

Description of the bug

Monaleesa_full.pdf
PyMuPDF can't extract the images on page 2 and page 4 of this PDF.

How to reproduce the bug

import pymupdf
doc = pymupdf.open('Monaleesa_full.pdf')

page_num = 0
for page in doc:
  page_num += 1
  images = page.get_images(full=True)
  print(f'page {page_num}: {len(images)} images')

PyMuPDF version

1.24.11

Operating system

macOS

Python version

3.12

JorjMcKie · 2024-10-10T20:16:16Z

Except for page 7 (0-based), none of the pages contains an image.
What you see are vector graphics - no images.

JorjMcKie · 2024-10-10T20:18:01Z

Vector graphics cannot be extracted. All you can do is make a "photo" of the respective page area ...
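
To check which pages hold raster images and which hold only vector graphics, the two kinds of content can be counted per page. This is a minimal sketch, not part of the original reply: the helper name `count_content` is hypothetical, and `doc` is assumed to be an opened PyMuPDF document. It relies only on `page.get_images()` and `page.get_drawings()`.

```python
def count_content(doc):
    """Return (page number, image count, drawing count) for each page.

    `doc` is assumed to be an opened pymupdf.Document; any iterable of
    page-like objects offering the same methods works as well.
    """
    rows = []
    for page in doc:
        images = page.get_images(full=True)  # raster images referenced by the page
        drawings = page.get_drawings()       # vector paths (lines, curves, rectangles)
        rows.append((page.number, len(images), len(drawings)))
    return rows
```

With PyMuPDF installed, calling `count_content(pymupdf.open("Monaleesa_full.pdf"))` lists the counts with 0-based page numbers. A page that reports zero images but some drawings is one where only rendering a page area will produce a picture.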

bbfrog · 2024-10-12T05:38:37Z

The Acrobat API can extract the vector graphics and save them as PNG or SVG. How does it do this? Is it hard to support in PyMuPDF? Thanks!

JorjMcKie · 2024-10-12T06:36:38Z

You can try this script. Or do this:

import pymupdf

doc = pymupdf.open("input.pdf")
for page in doc:
    for i, bbox in enumerate(page.cluster_drawings()):
        pix = page.get_pixmap(clip=bbox, dpi=150)
        pix.save(f"{doc.name}-{page.number}-{i}.png")

bbfrog · 2024-10-15T19:11:35Z

Thanks very much, @JorjMcKie. It works and extracts the image I want. But it also extracted the tables in this PDF as drawings. Is there any field that can differentiate tables from other drawings? Thanks!
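
This question is not answered in the thread. `page.cluster_drawings()` returns only rectangles, so there is no field on them that marks a table. One possible workaround, offered here as an assumption rather than a confirmed answer, is to detect tables separately with `page.find_tables()` and skip every drawing cluster that overlaps a detected table. The helper names `intersects` and `drop_table_clusters` below are hypothetical, and table detection is heuristic, so it may miss tables or report false ones.

```python
def intersects(a, b):
    """True if rectangles a and b, given as (x0, y0, x1, y1), overlap."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


def drop_table_clusters(clusters, table_bboxes):
    """Keep only the drawing clusters that do not overlap any table bbox."""
    return [c for c in clusters
            if not any(intersects(c, t) for t in table_bboxes)]


# Assumed usage inside the per-page loop of the script above:
#   table_bboxes = [t.bbox for t in page.find_tables().tables]
#   keep = drop_table_clusters(page.cluster_drawings(), table_bboxes)
#   for i, bbox in enumerate(keep):
#       pix = page.get_pixmap(clip=bbox, dpi=150)
```

Rectangles that only touch along an edge are treated as not overlapping. Whether a partial overlap should discard a cluster depends on the document, so the test in `intersects` may need loosening or tightening.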

JorjMcKie added the "not a bug" (not a bug / user error / unable to reproduce) label Oct 10, 2024

JorjMcKie closed this as completed Oct 10, 2024

pymupdf locked and limited conversation to collaborators Oct 15, 2024

JorjMcKie converted this issue into discussion #3948 Oct 15, 2024

This issue was moved to a discussion.
