Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

insert_htmlbox error: int too large to convert to float #3916

Closed
Mominadar opened this issue Oct 4, 2024 · 4 comments
Closed

insert_htmlbox error: int too large to convert to float #3916

Mominadar opened this issue Oct 4, 2024 · 4 comments

Comments

@Mominadar
Copy link

Description of the bug

Hi,

I am using the PyMuPDF library to translate pdfs keeping the overall structure intact. Now, this has been working successfully but while trying to translate an RTL langue. I got this error in processing the pdf on insert_htmlbox function.

int too large to convert to float

The bbox for which this occured is:

[393.83990478515625, 245.69000244140625, 393.83990478515625, 256.7300109863281]

I tried to round off the values to 2 decimal places even but still got the same issue. This bbox value is gotten using this code snippet:

 pages_blocks = []
        for page_num in range(len(doc)):
            page = doc.load_page(page_num)
            blocks = page.get_text("dict")["blocks"]
            page_blocks = []
            for block in blocks:
                if "lines" in block:
                    for line in block["lines"]:
                        for span in line["spans"]:
                            if span["text"].strip():
                                bbox = span["bbox"]
                                if isinstance(bbox, (list, tuple)) and len(bbox) == 4:
                                    bbox = [float(coord) for coord in bbox]
                                    if span["text"]:
                                        page_blocks.append({
                                            "text": span["text"],
                                            "bbox": bbox,
                                        })
            pages_blocks.append(page_blocks)
        return pages_blocks   

Note that is value of bbox works:

[393.83990478515625, 245.69000244140625, 400.83990478515625, 256.7300109863281]

Only difference is the third value's whole part so I am unsure what exactly are the constraints here.

How to reproduce the bug

Code to reproduce output:

import fitz

class TranslatePdfClient: 
    def extract_text_blocks(self, doc):
        pages_blocks = []
        for page_num in range(len(doc)):
            page = doc.load_page(page_num)
            blocks = page.get_text("dict")["blocks"]
            page_blocks = []
            for block in blocks:
                if "lines" in block:
                    for line in block["lines"]:
                        for span in line["spans"]:
                            font_properties = {"size": span["size"], "color": "#%06x" % span["color"], "font-family": '%s' % span["font"]}
                            if span["text"].strip():
                                bbox = span["bbox"]
                                if isinstance(bbox, (list, tuple)) and len(bbox) == 4:
                                    bbox = [float(coord) for coord in bbox]
                                    if span["text"]:
                                        page_blocks.append({
                                            "text": span["text"],
                                            "bbox": bbox,
                                            "font_properties": font_properties
                                        })
            pages_blocks.append(page_blocks)
        return pages_blocks   
    
    def translate(self, source_language, target_language):
        scale = 0.5
        try:
            doc = fitz.open('test.pdf')
            pages_blocks = self.extract_text_blocks(doc)
            
            print("doc", doc)
            for page_num, page_blocks in enumerate(pages_blocks):
                page = doc.load_page(page_num)
                for block in page_blocks:
                    text = block["text"]
                    bbox = block["bbox"]
                    font_properties = block["font_properties"]
                    if isinstance(bbox, (list, tuple)) and len(bbox) == 4:
                        try:
                            translated_text = text #hardcoding for bug reproduction only
                            if any(c.isalpha() for c in text):
                                #bbox = [388.3199157714844, 250.489990234375, 400.7757873535156, 261.5299987792969] #works
                                #bbox = [393.83990478515625, 245.69000244140625, 393.83990478515625, 256.7300109863281] #crashes
                                #bbox = [393.83990478515625, 245.69000244140625, 400.83990478515625, 256.7300109863281] works
                                rect = fitz.Rect(bbox)
                                page.add_redact_annot(rect, text="")
                                page.apply_redactions(images=0, graphics=0, text=0)
                                adjusted_size = font_properties['size'] - 2 
                                adjusted_size = "%g" % adjusted_size
                                html = f'''
                                        <div style="font-size:{adjusted_size}px; font-family:{font_properties["font-family"]}; color:{font_properties["color"]}; overflow:visible;">
                                            {translated_text}
                                        </div>
                                        ''' 
                                page.insert_htmlbox(rect, html, scale_low=scale)
                            else:
                                print(f'No need to translate non-alpha text: {text}')
                        except Exception as e:
                            print(f"Error processing block: '{text}' with bbox: {bbox}")
                            print(e)
                            raise
                    else:
                        print(f"Invalid bbox: {bbox}")
            
            doc.save('output.pdf', garbage=4, clean=True, deflate=True, deflate_images=True, deflate_fonts=True)
            doc.close()
        except Exception as e:
            print(fitz.TOOLS.mupdf_warnings())
            print(f"Error processing block: {str(e)}")
            print(e)


if __name__ == "__main__":
    client = TranslatePdfClient()
    client.translate('arabic', 'english')

test file:
Getting error when using file directly as:
Error processing block: 'ر' with bbox: [393.83990478515625, 245.69000244140625, 393.83990478515625, 256.7300109863281]
'NoneType' object has no attribute 'y1'

Error processing block: 'NoneType' object has no attribute 'y1'
'NoneType' object has no attribute 'y1'

but when I use a file stream which I use on prod, I get:
Error processing block: 'ر' with bbox: [393.83990478515625, 245.69000244140625, 393.83990478515625, 256.7300109863281]
int too large to convert to float

file: test.pdf

PyMuPDF version

1.24.11

Operating system

Linux

Python version

3.10

@julian-smith-artifex-com
Copy link
Collaborator

Thanks for the detailed report, i was able to reproduce it easily.

It's a bug in Page.insert_htmlbox() where it fails to notice if the html content doesn't fit even at the minimum scaling scale_low (0.5 in your code).

I have a fix in my tree.

@Mominadar
Copy link
Author

Hi Juilan,

Thanks so much for the quick response and fix. Really appreciate it. Will this be released soon?

@julian-smith-artifex-com
Copy link
Collaborator

I'm hoping to make a new release this week, but it's quite likely to be delayed until sometime next week.

@julian-smith-artifex-com
Copy link
Collaborator

Fixed in 1.24.12.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants