Lines of text are sometimes split into two #3653

rezemika · 2024-07-03T15:18:29Z

Description of the bug

In the same context as #3650, I am parsing many tables of contents of French PDF documents.
I'm not sure if it's a real "bug", but sometimes, in some documents, blocks of text are separated into two distinct lines, while other similar blocks are returned unified.

How to reproduce the bug

Here is an example with this file (on its second page): 2024-06-18-6670a9f1447abe73af0e9179fda392cc.pdf

>>> import pymupdf
>>> f = "2024-06-18-6670a9f1447abe73af0e9179fda392cc.pdf"
>>> doc = pymupdf.Document(f)
>>> doc[1].get_text("blocks")
[
    ...
    (56.79999923706055, 620.2137451171875, 565.9680786132812, 647.05078125, "Arrêté préfectoral n° 2023-CAB-29, du 17 juin 2024, portant interdiction temporaire de port et\ntransport d'objets pouvant constituer une arme par destination dans le centre ville de Nantes.\n", 12, 0),
    (56.79999923706055, 660.5637817382812, 566.5072631835938, 673.9508056640625, 'Arrêté  CAB/SPAS/2024/n°567\n,  en  date  du  17  juin  2024,  portant  interdiction  temporaire\n', 13, 0),
    (56.79999923706055, 674.0137329101562, 372.8849182128906, 687.4007568359375, "d'utilisation et de transport des artifices de divertissement.\n", 14, 0),
]

And here is the corresponding part of the PDF:

As you can see, the first item spans two lines and get_text() returns it as a single block with just a line break, whereas the second one is returned as two separate blocks. I'm not sure it's a real "bug", though; sorry if it's a false report...

In the meantime, just in case someone has the same problem, I partially fixed it by iterating on the blocks and, for each one, checking its bbox to see if it's close enough to the bbox of the last block (comparing last_block[y1] with block[y0]), and merging both if they're separated by less than 2pts.
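The workaround described above could be sketched roughly as follows. This is a minimal sketch, not the exact code I use: the function name `merge_close_blocks` and the `max_gap` parameter are made up for illustration, and it assumes the 7-item tuples returned by `get_text("blocks")` (x0, y0, x1, y1, text, block number, block type).

```python
def merge_close_blocks(blocks, max_gap=2.0):
    """Merge consecutive text blocks separated vertically by less than max_gap points.

    `blocks` is assumed to be the list returned by page.get_text("blocks"):
    tuples of (x0, y0, x1, y1, text, block_no, block_type).
    """
    merged = []
    for block in blocks:
        x0, y0, x1, y1, text, block_no, block_type = block
        # Only text blocks (type 0) are candidates for merging.
        if merged and block_type == 0 and merged[-1][6] == 0:
            px0, py0, px1, py1, ptext, pno, ptype = merged[-1]
            # Compare the bottom of the previous block with the top of this one.
            if abs(y0 - py1) < max_gap:
                merged[-1] = (
                    min(px0, x0), min(py0, y0),
                    max(px1, x1), max(py1, y1),
                    ptext + text,  # keep the line break that ends the previous text
                    pno, ptype,
                )
                continue
        merged.append(tuple(block))
    return merged
```

With the output shown above, blocks 13 and 14 are about 0.06 pt apart and would be merged, while blocks 12 and 13 (about 13.5 pt apart) stay separate. The 2 pt threshold is only what worked for my documents and may need tuning for others.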

PyMuPDF version

1.24.7

Operating system

Linux

Python version

3.11


JorjMcKie · 2024-07-03T15:38:45Z

This is no bug.
MuPDF's separation of text into blocks/lines/spans is a multi-causal algorithm. A large horizontal distance between pieces of text may lead to different formal line items although they have the same bottom coordinate.
This often happens with text in table cells or text that has been stored justified, as in your case.
You can use a script like the following to cope with this:
recovered-lines.zip
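To illustrate the idea only (this is not the content of the attached script): a minimal sketch that re-joins line items of one text block when their bottom coordinates are nearly equal. The function name `recover_lines` and the `tolerance` value are hypothetical, and the block is assumed to be one entry of `page.get_text("dict")["blocks"]`.

```python
def recover_lines(block, tolerance=2.0):
    """Join the line items of one text block that share (almost) the same bottom.

    `block` is assumed to be a text block from page.get_text("dict"), i.e. a
    dict with a "lines" list whose items have "bbox" and "spans" keys.
    Returns one string per recovered line, ordered top to bottom.
    """
    pieces = []
    for line in block["lines"]:
        x0, _, _, y1 = line["bbox"]
        text = "".join(span["text"] for span in line["spans"])
        pieces.append((y1, x0, text))
    pieces.sort(key=lambda piece: piece[0])  # order by bottom coordinate

    groups = []  # each entry: (bottom of first piece, [(x0, text), ...])
    for y1, x0, text in pieces:
        if groups and abs(y1 - groups[-1][0]) <= tolerance:
            groups[-1][1].append((x0, text))
        else:
            groups.append((y1, [(x0, text)]))

    # Inside each group, restore left-to-right reading order before joining.
    return ["".join(text for _, text in sorted(items)) for _, items in groups]
```

The pieces are joined without a separator because the spans in this example already carry their own spaces; other documents may need a space inserted between pieces.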

rezemika · 2024-07-03T15:42:51Z

Oh ok, my bad, sorry for the inconvenience (and thank you for the script)!

JorjMcKie · 2024-07-03T16:13:28Z

No problem at all. The question was absolutely justified, and a reproducing PDF was attached.
So no complaints to be raised 😎.

krish-tech02 · 2024-11-08T16:46:10Z

Hi @JorjMcKie, I have reviewed your recovered-lines script. My question is how to use it: does it edit the PDF with the recovered lines, or, after reading the PDF, do we need to call the recovered-lines function ourselves? I need to recover the lines and make them a single block in the PDF, so that I can later save all the properties in a data frame. Below is the code that builds my data frame. Could you please help with saving the block text as recovered lines with the HTML tags applied?

Step 5: Group by block_id and concatenate HTML-formatted text

import pandas as pd
from unidecode import unidecode


# The function name and signature are assumptions; the original post showed only the body.
def build_html_rows(block_dict, tag):
    # block_dict is assumed to map page numbers to the "blocks" list of
    # page.get_text("dict"); tag is assumed to map rounded font sizes to HTML tag names.
    rows_with_html = []
    for page_num, blocks in block_dict.items():
        for block in blocks:
            if block['type'] != 0:  # Only text blocks
                continue
            block_id = block['number']
            block_text = []  # HTML-formatted text for this block
            original_text = []  # Unmodified text for this block
            for line in block['lines']:
                for span in line['spans']:
                    font_size = span['size']
                    text = span['text'].strip().replace('\n', '').replace('\r', '')
                    span_font = span['font']
                    color = span['color']
                    is_bold = 'bold' in span_font.lower()

                    # Validate and format color value
                    if isinstance(color, int):
                        font_color = f'#{color:06x}'  # Ensure it's a 6-digit hex
                    elif isinstance(color, tuple) and len(color) >= 3:
                        font_color = f'#{color[0]:02x}{color[1]:02x}{color[2]:02x}'
                    else:
                        font_color = '#000000'  # Fallback to black if invalid
                    # Validate color length (should be 7 characters including #)
                    if len(font_color) != 7 or not font_color.startswith('#'):
                        font_color = '#000000'  # Fallback to black if invalid

                    if not text:  # Skip spans that are empty after stripping
                        continue

                    original_text.append(text)
                    text = unidecode(text)
                    tag_for_text = tag.get(round(font_size), 'span')  # Default to 'span' if not found

                    if font_size > 14:
                        tag_for_text = 'h1'
                    elif is_bold and tag_for_text.startswith('h'):
                        tag_for_text = 'h2'
                    elif tag_for_text.startswith('h'):
                        tag_for_text = 'h3'

                    if is_bold:
                        # Wrap bold text in <b> tags (headings included)
                        text = f"<b>{text}</b>"

                    if tag_for_text != 'p':
                        text_with_tag = f"<{tag_for_text} style='display:inline; color:{font_color};'>{text}</{tag_for_text}>\n"
                        block_text.append(text_with_tag)
                    else:
                        block_text.append(text)

            # Wrap non-heading blocks in <p>; follow heading blocks with an empty paragraph
            if not block_text or not block_text[0].startswith('<h'):
                rows_with_html.append((page_num, block_id, f"<p>{' '.join(block_text)}</p>", ' '.join(original_text)))
            else:
                rows_with_html.append((page_num, block_id, ' '.join(block_text) + "<p></p>", ' '.join(original_text)))

    # Create the final DataFrame
    grouped_df = pd.DataFrame(rows_with_html, columns=['page_num', 'block_id', 'text', 'originalText'])
    grouped_df.to_excel('test.xlsx')
    return grouped_df

JorjMcKie added the "not a bug" label (not a bug / user error / unable to reproduce) Jul 3, 2024

JorjMcKie closed this as completed Jul 3, 2024

krish-tech02 mentioned this issue Nov 8, 2024

Lines of text are sometimes split into two #4031

Closed
