pix.color_topusage raise Segmentation fault (core dumped) #3994

Closed
ytcpub opened this issue Oct 28, 2024 · 11 comments

Comments

@ytcpub

ytcpub commented Oct 28, 2024

Description of the bug

test3.pdf

I tested different colorspaces and reduced the bbox, but it always raises "Segmentation fault". I don't know why.

How to reproduce the bug

import fitz
doc = fitz.open('test3.pdf')
page = doc[0]
txt_blocks = [blk for blk in page.get_text('dict')['blocks'] if blk['type']==0]
for blk in txt_blocks:
	pix = page.get_pixmap(clip=fitz.Rect([int(v) for v in blk['bbox']]), colorspace=fitz.csRGB, alpha=False)
	percent, color = pix.color_topusage()

PyMuPDF version

1.24.12

Operating system

Linux

Python version

3.12

@JorjMcKie
Collaborator

JorjMcKie commented Oct 28, 2024

I cannot reproduce your problem: everything works fine under Windows and Linux.
However, there is a known issue (fixed in next version):

The method crashes the interpreter if the clip represents an empty rectangle or leads to an empty pixmap.
Your bizarre way of computing rectangles with integer coordinates may well lead to this condition.
So please for now check this condition yourself.
Why don't you use IRect by the way?

import pymupdf

doc = pymupdf.open("test.pdf")
page = doc[0]
txt_blocks = [blk for blk in page.get_text("dict")["blocks"] if blk["type"] == 0]
for blk in txt_blocks:
    clip = pymupdf.Rect([int(v) for v in blk["bbox"]])
    if clip.is_empty:
        print(f"this is empty: {clip=}")
        continue
    pix = page.get_pixmap(
        clip=clip,
        colorspace=pymupdf.csRGB,
        alpha=False,
    )
    print(f"{pix.color_topusage()=}")
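To illustrate how truncating bbox coordinates with int() can produce an empty rectangle, here is a minimal sketch in plain Python that does not need PyMuPDF. The helper names (truncated, rounded_outward, is_empty) and the sample bbox are hypothetical and are not part of the library; the emptiness test mirrors the usual convention that a rectangle with x0 >= x1 or y0 >= y1 has no area.

```python
import math


def truncated(bbox):
    """Truncate every coordinate toward zero, as int(v) does in the report."""
    return tuple(int(v) for v in bbox)


def rounded_outward(bbox):
    """Hypothetical alternative: grow the box to the enclosing integer grid."""
    x0, y0, x1, y1 = bbox
    return (math.floor(x0), math.floor(y0), math.ceil(x1), math.ceil(y1))


def is_empty(box):
    """A box with no positive width or height covers no pixels."""
    x0, y0, x1, y1 = box
    return x0 >= x1 or y0 >= y1


# Made-up bbox of a narrow text block, less than one unit wide.
thin = (10.2, 5.1, 10.9, 7.8)

print(truncated(thin), is_empty(truncated(thin)))
print(rounded_outward(thin), is_empty(rounded_outward(thin)))
```

Truncation collapses the narrow box to zero width, while rounding outward keeps at least one unit of width. Either way, checking for emptiness before creating the pixmap remains the safer habit.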

You do not need to convert to an integer rectangle - the method can deal with lists or tuples of 4 numbers directly: just pass blk["bbox"].
And instead of creating zillions of pixmaps (one for each text block), you can create one pixmap for the whole page and then use the clip parameter of .color_topusage().

import pymupdf

print(pymupdf.version)
doc = pymupdf.open("test.pdf")
page = doc[0]
pix = page.get_pixmap()
txt_blocks = [blk for blk in page.get_text("dict")["blocks"] if blk["type"] == 0]
for blk in txt_blocks:
    clip = blk["bbox"]
    print(f"{pix.color_topusage(clip=clip)=}")

@JorjMcKie
Collaborator

BTW a colleague also tried on a Mac and doesn't see a segv either.

@ytcpub
Author

ytcpub commented Oct 29, 2024

> BTW a colleague also tried on a Mac and doesn't see a segv either.

I am sorry, I uploaded the wrong file. This is the correct file:
test3.pdf

@ytcpub
Author

ytcpub commented Oct 29, 2024

import fitz
doc = fitz.open("test3.pdf")
page = doc[0]
txt_blocks = [blk for blk in page.get_text("dict")["blocks"] if blk["type"] == 0]
for blk in txt_blocks:
    clip = fitz.Rect([int(v) for v in blk["bbox"]])
    if clip.is_empty:
        print(f"this is empty: {clip=}")
        continue
    pix = page.get_pixmap(
        clip=clip,
        colorspace=fitz.csRGB,
        alpha=False,
    )
    print(clip)
    print(f"{pix.color_topusage()=}")

I see a text block bbox with a coordinate larger than the page width: Rect(35.0, 636.0, 63.0, 738.0). The page width from page.get_text('dict')['width'] is 612, but the bbox contains 636 (its y0 value).

@JorjMcKie
Collaborator

Ok, now that we have that file, here again my recommendation to use the function with more care until we have immunized it against wrong calls:

import pymupdf

doc = pymupdf.open("test3.pdf")
page = doc[0]

txt_blocks = page.get_text("dict", flags=pymupdf.TEXTFLAGS_TEXT)["blocks"]
for blk in txt_blocks:
    clip = pymupdf.IRect(blk["bbox"]) & page.rect  # only inside visible page!
    if clip.is_empty:  # and never with empty clips!
        print(f"empty: {clip=}")
        continue
    pix = page.get_pixmap(clip=clip)
    pix.color_topusage()

Shows this:

empty: clip=IRect(35, 636, 63, 612)
empty: clip=IRect(523, 618, 536, 612)
empty: clip=IRect(132, 637, 142, 612)
empty: clip=IRect(159, 725, 169, 612)
empty: clip=IRect(167, 693, 177, 612)
empty: clip=IRect(288, 633, 298, 612)
empty: clip=IRect(334, 633, 344, 612)
empty: clip=IRect(395, 612, 405, 612)
empty: clip=IRect(426, 619, 435, 612)
empty: clip=IRect(464, 612, 474, 612)
empty: clip=IRect(502, 612, 512, 612)
empty: clip=IRect(514, 735, 527, 612)

Note: Text written outside the page rectangle is not illegal and sometimes used by PDF creators to store hidden information. So we must make sure that the color count is happening on the visible part of the page only.
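The clamping step in the snippet above (intersect the block bbox with the page rectangle, then skip empty results) can be sketched in plain Python without PyMuPDF. The function name visible_clip is hypothetical, and the page rectangle (0, 0, 792, 612) is an assumption chosen only because it is consistent with the y1 value of 612 in the printed output; the real page.rect of test3.pdf is not stated in this thread.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def visible_clip(bbox: Box, page_rect: Box) -> Optional[Box]:
    """Intersect bbox with page_rect; return None if nothing is visible."""
    x0 = max(bbox[0], page_rect[0])
    y0 = max(bbox[1], page_rect[1])
    x1 = min(bbox[2], page_rect[2])
    y1 = min(bbox[3], page_rect[3])
    if x0 >= x1 or y0 >= y1:
        # The block lies entirely outside the page (or has no area).
        return None
    return (x0, y0, x1, y1)


# Assumed page rectangle; not taken from the actual file.
page_rect = (0, 0, 792, 612)

# First bbox from the output above: it starts below the assumed page bottom.
print(visible_clip((35, 636, 63, 738), page_rect))  # None

# Made-up bbox that only partly leaves the page: it is cut at y1 = 612.
print(visible_clip((35, 600, 63, 738), page_rect))  # (35, 600, 63, 612)
```

Returning None for invisible blocks makes the "never with empty clips" rule explicit at the call site: the caller simply skips any block for which no clip comes back.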

@JorjMcKie
Collaborator

In the future, please provide your code properly indented in a code block.

@julian-smith-artifex-com
Collaborator

Thanks for the updated test file. I've reproduced the segv with the current release PyMuPDF-1.24.12.

Happily the bug is already fixed in PyMuPDF git (with the fix for #3848), so will be fixed in our next release.

@julian-smith-artifex-com
Collaborator

Fixed in 1.24.13.

@ytcpub
Author

ytcpub commented Oct 30, 2024

thanks very much!
Another point: color_topusage might be more useful if it accepted a parameter such as n, for example to get the top 1 or top 2 colors.

@JorjMcKie
Collaborator

> thanks very much! Another point: color_topusage might be more useful if it accepted a parameter such as n, for example to get the top 1 or top 2 colors.

Take the pix.color_count method and build your own evaluator around its output.
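As a rough sketch of such an evaluator: the function below ranks a mapping of pixel values to counts and returns the n most frequent colors with their share of all pixels. The name top_colors and the sample data are hypothetical. With a real pixmap, the mapping would presumably come from pix.color_count(colors=True), which the PyMuPDF documentation describes as returning a dictionary of pixel bytes to counts; that call is an assumption here and is not exercised by this example.

```python
from typing import Dict, List, Tuple


def top_colors(counts: Dict[bytes, int], n: int = 2) -> List[Tuple[float, bytes]]:
    """Return up to n (fraction, color) pairs, most frequent color first."""
    total = sum(counts.values())
    if total == 0:
        return []  # no pixels counted, nothing to rank
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(count / total, color) for color, count in ranked[:n]]


# Made-up counts for an RGB pixmap: mostly white, some black, a little red.
sample = {
    b"\xff\xff\xff": 90,
    b"\x00\x00\x00": 8,
    b"\xff\x00\x00": 2,
}

for fraction, color in top_colors(sample, n=2):
    print(f"{fraction:.0%} {color.hex()}")
```

The (fraction, color) ordering follows the shape of the tuple unpacked from pix.color_topusage() earlier in this thread, so with n=1 the result can be compared directly against that method.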

@ytcpub
Author

ytcpub commented Oct 30, 2024

> > thanks very much! Another point: color_topusage might be more useful if it accepted a parameter such as n, for example to get the top 1 or top 2 colors.
>
> Take the pix.color_count method and build your own evaluator around its output.

Got it, thanks very much!
