pix.color_topusage raise Segmentation fault (core dumped) #3994

ytcpub · 2024-10-28T07:21:12Z

Description of the bug

I tested different colorspaces and reduced the bbox, but it always raises "Segmentation fault". I don't know why.

How to reproduce the bug

import fitz
doc = fitz.open('test3.pdf')
page = doc[0]
txt_blocks = [blk for blk in page.get_text('dict')['blocks'] if blk['type']==0]
for blk in txt_blocks:
	pix = page.get_pixmap(clip=fitz.Rect([int(v) for v in blk['bbox']]), colorspace=fitz.csRGB, alpha=False)
	percent, color = pix.color_topusage()

PyMuPDF version

1.24.12

Operating system

Linux

Python version

3.12

JorjMcKie · 2024-10-28T10:18:13Z

I cannot reproduce your problem: everything works fine under Windows and Linux.
However, there is a known issue (fixed in next version):

The method crashes the interpreter if the clip represents an empty rectangle or leads to an empty pixmap.
Your bizarre way of computing rectangles with integer coordinates may well lead to this condition.
So please for now check this condition yourself.
Why don't you use IRect by the way?

import pymupdf

doc = pymupdf.open("test.pdf")
page = doc[0]
txt_blocks = [blk for blk in page.get_text("dict")["blocks"] if blk["type"] == 0]
for blk in txt_blocks:
    clip = pymupdf.Rect([int(v) for v in blk["bbox"]])
    if clip.is_empty:
        print(f"this is empty: {clip=}")
        continue
    pix = page.get_pixmap(
        clip=clip,
        colorspace=pymupdf.csRGB,
        alpha=False,
    )
    print(f"{pix.color_topusage()=}")
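To illustrate how integer truncation can produce an empty clip, here is a minimal pure-Python sketch. It does not import PyMuPDF. The emptiness rule (x0 >= x1 or y0 >= y1) is an assumption modeled on `Rect.is_empty`, the helper names are hypothetical, and the sample bbox is made up rather than taken from test3.pdf.

```python
import math


def is_empty(rect):
    """True if the rectangle has no positive area (assumed rule: x0 >= x1 or y0 >= y1)."""
    x0, y0, x1, y1 = rect
    return x0 >= x1 or y0 >= y1


def truncate(bbox):
    """Mimic [int(v) for v in bbox]: every coordinate is cut toward zero."""
    return tuple(int(v) for v in bbox)


def round_outward(bbox):
    """Smallest integer rectangle that still contains bbox."""
    x0, y0, x1, y1 = bbox
    return (math.floor(x0), math.floor(y0), math.ceil(x1), math.ceil(y1))


# Hypothetical bbox of a very narrow text block (less than one unit wide).
bbox = (10.2, 20.1, 10.9, 30.5)

truncated = truncate(bbox)      # x0 and x1 both collapse to 10
outward = round_outward(bbox)   # x1 is rounded up to 11 instead

print(truncated, is_empty(truncated))
print(outward, is_empty(outward))
```

Truncating both edges can collapse a narrow block to zero width, while rounding outward keeps at least one pixel column, so an emptiness check is still worthwhile before requesting a pixmap.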

You do not need to convert to an integer rectangle - the method can deal with lists or tuples of 4 numbers directly: just pass blk["bbox"].
And instead of creating zillions of pixmaps (one for each text block), you can create one pixmap for the whole page and then use the clip parameter of .color_topusage().

import pymupdf

print(pymupdf.version)
doc = pymupdf.open("test.pdf")
page = doc[0]
pix = page.get_pixmap()
txt_blocks = [blk for blk in page.get_text("dict")["blocks"] if blk["type"] == 0]
for blk in txt_blocks:
    clip = blk["bbox"]
    print(f"{pix.color_topusage(clip=clip)=}")
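The idea of rendering once and counting colors per clip can be sketched without PyMuPDF. The following is a simplified stand-in, not the library's implementation: the pixel grid is a plain list of rows, `top_color` is a hypothetical helper, and returning `None` for an empty clip is a design choice of this sketch.

```python
from collections import Counter


def top_color(pixels, clip=None):
    """Return (ratio, color) of the most frequent color inside clip,
    or None if the clipped area contains no pixels."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    x0, y0, x1, y1 = clip if clip is not None else (0, 0, width, height)
    # Restrict the clip to the image and convert to integer pixel bounds.
    x0, y0 = max(int(x0), 0), max(int(y0), 0)
    x1, y1 = min(int(x1), width), min(int(y1), height)
    if x0 >= x1 or y0 >= y1:
        return None  # nothing to count
    counts = Counter(pixels[y][x] for y in range(y0, y1) for x in range(x0, x1))
    color, count = counts.most_common(1)[0]
    return count / ((x1 - x0) * (y1 - y0)), color


WHITE, BLACK = (255, 255, 255), (0, 0, 0)
# 4x4 "page": white, with a 2x2 black patch in the top-left corner.
page_pixels = [[BLACK if x < 2 and y < 2 else WHITE for x in range(4)] for y in range(4)]

whole = top_color(page_pixels)                        # white covers 12 of 16 pixels
patch = top_color(page_pixels, clip=(0, 0, 2, 2))     # only the black patch
outside = top_color(page_pixels, clip=(0, 6, 2, 9))   # entirely below the image

print(whole, patch, outside)
```

The pixel data is built once and each call only restricts the counted region, which is the same trade-off as creating a single page pixmap and passing different clips.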

JorjMcKie · 2024-10-28T10:26:33Z

BTW a colleague also tried on a Mac and doesn't see a segv either.

ytcpub · 2024-10-29T02:50:40Z

I am sorry, I uploaded the wrong file. This is the correct file:
test3.pdf

ytcpub · 2024-10-29T03:08:43Z

import fitz
doc = fitz.open("test3.pdf")
page = doc[0]
txt_blocks = [blk for blk in page.get_text("dict")["blocks"] if blk["type"] == 0]
for blk in txt_blocks:
    clip = fitz.Rect([int(v) for v in blk["bbox"]])
    if clip.is_empty:
        print(f"this is empty: {clip=}")
        continue
    pix = page.get_pixmap(
        clip=clip,
        colorspace=fitz.csRGB,
        alpha=False,
    )
    print(clip)
    print(f"{pix.color_topusage()=}")

I see a text block whose bbox lies partly outside the page: Rect(35.0, 636.0, 63.0, 738.0). page.get_text('dict')['width'] reports 612, but the bbox contains the coordinate 636 (its y0, since a Rect is (x0, y0, x1, y1)).

JorjMcKie · 2024-10-29T08:42:47Z

Ok, now that we have that file, here again is my recommendation: use the function with more care until we have immunized it against wrong calls:

import pymupdf

doc = pymupdf.open("test3.pdf")
page = doc[0]

txt_blocks = page.get_text("dict", flags=pymupdf.TEXTFLAGS_TEXT)["blocks"]
for blk in txt_blocks:
    clip = pymupdf.IRect(blk["bbox"]) & page.rect  # only inside visible page!
    if clip.is_empty:  # and never with empty clips!
        print(f"empty: {clip=}")
        continue
    pix = page.get_pixmap(clip=clip)
    pix.color_topusage()

Shows this:

empty: clip=IRect(35, 636, 63, 612)
empty: clip=IRect(523, 618, 536, 612)
empty: clip=IRect(132, 637, 142, 612)
empty: clip=IRect(159, 725, 169, 612)
empty: clip=IRect(167, 693, 177, 612)
empty: clip=IRect(288, 633, 298, 612)
empty: clip=IRect(334, 633, 344, 612)
empty: clip=IRect(395, 612, 405, 612)
empty: clip=IRect(426, 619, 435, 612)
empty: clip=IRect(464, 612, 474, 612)
empty: clip=IRect(502, 612, 512, 612)
empty: clip=IRect(514, 735, 527, 612)

Note: Text written outside the page rectangle is not illegal and is sometimes used by PDF creators to store hidden information. So we must make sure that the color count happens on the visible part of the page only.
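The clipping step can be traced by hand with a small pure-Python sketch. It is only an illustration of the rectangle arithmetic, not PyMuPDF code: `intersect` and `is_empty` are hypothetical helpers, the page width of 792 is an assumed value, and only the bottom edge of 612 is implied by the output shown above. The first bbox is the one reported in this thread; the second is made up.

```python
def intersect(a, b):
    """Component-wise intersection of two (x0, y0, x1, y1) rectangles."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def is_empty(rect):
    """True if the rectangle has no positive area (assumed rule: x0 >= x1 or y0 >= y1)."""
    x0, y0, x1, y1 = rect
    return x0 >= x1 or y0 >= y1


# Assumption: width 792 is hypothetical; the bottom edge 612 matches the output above.
PAGE_RECT = (0, 0, 792, 612)

bboxes = [
    (35, 636, 63, 738),    # block reported in this thread, entirely below the page
    (100, 100, 200, 120),  # hypothetical block inside the page
]

results = []
for bbox in bboxes:
    clip = intersect(bbox, PAGE_RECT)
    results.append((clip, is_empty(clip)))
    print(clip, "empty" if is_empty(clip) else "usable")
```

For the first block the intersection keeps y0 = 636 but pulls y1 back to 612, so y0 > y1 and the clip is empty, which matches the first line of the output above.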

JorjMcKie · 2024-10-29T08:45:38Z

In the future, please provide us code properly indented in a code block like this:

julian-smith-artifex-com · 2024-10-29T11:41:11Z

Thanks for the updated test file. I've reproduced the segv with the current release PyMuPDF-1.24.12.

Happily the bug is already fixed in PyMuPDF git (with the fix for #3848), so it will be fixed in our next release.

julian-smith-artifex-com · 2024-10-29T16:25:02Z

Fixed in 1.24.13.

ytcpub · 2024-10-30T03:40:17Z

Thanks very much!
One more point: color_topusage might be more useful if it accepted a parameter such as n, for example to get the top 1 or top 2 colors.

JorjMcKie · 2024-10-30T07:56:34Z

Take the pix.color_count method and build your own evaluator around its output.
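A minimal sketch of such an evaluator follows. It assumes the input is a dictionary mapping pixel bytes to pixel counts, which is the shape I understand `pix.color_count(colors=True)` to return; please verify this against the documentation of your PyMuPDF version. The function name `top_colors` and the sample data are hypothetical, and PyMuPDF itself is not imported here.

```python
def top_colors(color_counts, n=2):
    """Return up to n (ratio, color) pairs, most frequent color first.

    color_counts: dict mapping a color (e.g. RGB bytes) to its pixel count.
    """
    total = sum(color_counts.values())
    if total == 0:
        return []  # empty input: nothing to rank
    ranked = sorted(color_counts.items(), key=lambda item: item[1], reverse=True)
    return [(count / total, color) for color, count in ranked[:n]]


# Hand-made sample standing in for the dictionary of a real pixmap.
sample_counts = {
    b"\xff\xff\xff": 900,  # white
    b"\x00\x00\x00": 80,   # black
    b"\xff\x00\x00": 20,   # red
}

top2 = top_colors(sample_counts, n=2)
print(top2)
```

With a real pixmap the call would presumably look like `top_colors(pix.color_count(colors=True), n=2)`, applied only after checking that the clip is not empty.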

ytcpub · 2024-10-30T08:39:24Z

Got it, thanks very much!

julian-smith-artifex-com added the Fixed in next release label Oct 29, 2024

julian-smith-artifex-com closed this as completed Oct 29, 2024
