Lines of text are sometimes split into two #3653

Closed
rezemika opened this issue Jul 3, 2024 · 4 comments
Labels: not a bug (not a bug / user error / unable to reproduce)

rezemika commented Jul 3, 2024

Description of the bug

In the same context as #3650, I am parsing many tables of contents of French PDF documents.
I'm not sure whether it's a real "bug", but in some documents, blocks of text are sometimes split into two distinct lines, while other similar blocks are returned as one.

How to reproduce the bug

Here is an example with this file (on its second page): 2024-06-18-6670a9f1447abe73af0e9179fda392cc.pdf

>>> import pymupdf
>>> f = "2024-06-18-6670a9f1447abe73af0e9179fda392cc.pdf"
>>> doc = pymupdf.Document(f)
>>> doc[1].get_text("blocks")
[
    ...
    (56.79999923706055, 620.2137451171875, 565.9680786132812, 647.05078125, "Arrêté préfectoral n° 2023-CAB-29, du 17 juin 2024, portant interdiction temporaire de port et\ntransport d'objets pouvant constituer une arme par destination dans le centre ville de Nantes.\n", 12, 0),
    (56.79999923706055, 660.5637817382812, 566.5072631835938, 673.9508056640625, 'Arrêté  CAB/SPAS/2024/n°567\n,  en  date  du  17  juin  2024,  portant  interdiction  temporaire\n', 13, 0),
    (56.79999923706055, 674.0137329101562, 372.8849182128906, 687.4007568359375, "d'utilisation et de transport des artifices de divertissement.\n", 14, 0),
]

And here is the corresponding part of the PDF:

[Image: two lines of the table of contents.]

As you can see, the first item spans two lines, and get_text() returns it as a single block with just a line break. The second item also spans two lines, but it is returned as two separate blocks. I'm not sure this is a real "bug", so sorry if it's a false report...

In the meantime, in case someone has the same problem, I partially worked around it by iterating over the blocks and, for each one, comparing its bbox with the bbox of the previous block (last_block[y1] against block[y0]), then merging the two if they are separated by less than 2 pt.
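
The workaround described above could be sketched roughly as follows. This is a minimal illustration, not the exact code used: the function name `merge_close_blocks` and the `max_gap` parameter are made up for the example, and it assumes the blocks are tuples shaped like the output of `get_text("blocks")`, i.e. `(x0, y0, x1, y1, text, block_no, block_type)`, in reading order.

```python
def merge_close_blocks(blocks, max_gap=2.0):
    """Merge consecutive text blocks whose vertical gap is below max_gap points.

    Hypothetical helper: `blocks` is assumed to be a list of tuples
    (x0, y0, x1, y1, text, block_no, block_type) in reading order.
    """
    merged = []
    for block in blocks:
        x0, y0, x1, y1, text, block_no, block_type = block
        # Only text blocks (type 0) are candidates for merging.
        if merged and block_type == 0 and merged[-1][6] == 0:
            px0, py0, px1, py1, ptext, pno, ptype = merged[-1]
            # Compare the bottom of the previous block with the top of this one.
            if abs(y0 - py1) < max_gap:
                merged[-1] = (
                    min(px0, x0), min(py0, y0),
                    max(px1, x1), max(py1, y1),
                    ptext + text,  # keep the existing line breaks
                    pno, ptype,    # keep the number of the first block
                )
                continue
        merged.append(block)
    return merged
```

With the three blocks shown earlier, blocks 13 and 14 are less than 0.1 pt apart and would be merged, while the gap between blocks 12 and 13 is about 13.5 pt, so they stay separate. The check only looks at vertical distance, so it may need an extra horizontal test on multi-column pages.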

PyMuPDF version

1.24.7

Operating system

Linux

Python version

3.11

JorjMcKie added the "not a bug" label (not a bug / user error / unable to reproduce) Jul 3, 2024

JorjMcKie (Collaborator) commented

This is not a bug.
MuPDF's separation of text into blocks, lines, and spans depends on several factors. A large horizontal distance between pieces of text may lead to separate line items even though they share the same bottom coordinate.
This often happens with text in table cells, or with text that has been stored justified, as in your case.
You can use a script like the following to cope with this:
recovered-lines.zip
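
The attached script is not reproduced in this thread. As a rough illustration of the idea only (not the script's actual contents), line items of one block that share approximately the same bottom coordinate could be rejoined as sketched below. The function name `recover_visual_lines` and the `tolerance` value are assumptions made for the example; the input is assumed to be one text block from `page.get_text("dict")["blocks"]`, with `"lines"`, `"bbox"`, `"spans"`, and `"text"` keys.

```python
def recover_visual_lines(block, tolerance=3.0):
    """Rejoin line items of a text block that sit on the same visual line.

    Hypothetical helper: line items whose bottom coordinates differ by at
    most `tolerance` points are treated as one visual line and joined
    from left to right.
    """
    rows = []  # each entry: [reference bottom, [(x0, text), ...]]
    for line in sorted(block["lines"], key=lambda ln: ln["bbox"][3]):
        x0, _, _, bottom = line["bbox"]
        text = "".join(span["text"] for span in line["spans"])
        if rows and abs(bottom - rows[-1][0]) <= tolerance:
            rows[-1][1].append((x0, text))  # same visual line
        else:
            rows.append([bottom, [(x0, text)]])  # start a new visual line
    # Order the pieces of each visual line by their left coordinate.
    return [" ".join(t.strip() for _, t in sorted(parts)) for _, parts in rows]
```

This only reads the extracted structure in memory; it does not modify the PDF itself.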

rezemika (Author) commented Jul 3, 2024

Oh ok, my bad, sorry for the inconvenience (and thank you for the script)!

JorjMcKie (Collaborator) commented

No problem at all. The question was absolutely justified, and a reproducing PDF was attached.
So no complaints to be raised 😎.

krish-tech02 commented

Hi @JorjMcKie, I have reviewed your recovered-lines script. My question is how to use it: does it edit the PDF so that it contains the recovered lines, or do we need to call the recovered-lines function after reading the PDF? I need to recover the lines and turn them into a single block, so that I can later save all their properties in a data frame. Below is the code that builds my data frame. Could you please help with saving the block text as recovered lines with the HTML tags applied?

Step 5: Group by block_id and concatenate HTML-formatted text

import pandas as pd
from unidecode import unidecode


def build_grouped_df(block_dict, tag):
    # Function name and signature are placeholders for the enclosing function.
    # block_dict (page number -> list of "dict"-style blocks) and tag
    # (rounded font size -> HTML tag name) come from earlier steps not shown here.
    rows_with_html = []
    for page_num, blocks in block_dict.items():
        for block in blocks:
            if block['type'] == 0:  # Only text blocks
                block_id = block['number']
                block_text = []  # HTML-formatted text for this block
                original_text = []  # Unmodified span text for this block
                for line in block['lines']:
                    for span in line['spans']:
                        xmin, ymin, xmax, ymax = list(span['bbox'])
                        font_size = span['size']
                        text = span['text'].strip().replace('\n', '').replace('\r', '')
                        span_font = span['font']
                        color = span["color"]

                        is_upper = "uppercase" in span_font.lower()
                        is_bold = "bold" in span_font.lower()

                        # Validate and format color value
                        if isinstance(color, int):
                            font_color = f'#{color:06x}'  # Ensure it's a 6-digit hex
                        elif isinstance(color, tuple) and len(color) >= 3:
                            font_color = f'#{color[0]:02x}{color[1]:02x}{color[2]:02x}'
                        else:
                            font_color = '#000000'  # Fallback to black if invalid
                        # Validate color length (should be 7 characters including #)
                        if len(font_color) != 7 or not font_color.startswith('#'):
                            font_color = '#000000'  # Fallback to black if invalid

                        if text.replace(" ", "") != "":
                            original_text.append(text)
                            text = unidecode(text)
                            tag_for_text = tag.get(round(font_size), 'span')  # Default to 'span' if not found

                            if font_size > 14:
                                tag_for_text = 'h1'
                            elif is_bold and tag_for_text.startswith('h'):
                                tag_for_text = 'h2'
                            elif tag_for_text.startswith('h'):
                                tag_for_text = 'h3'

                            if is_bold:
                                # Wrap bold text in <b> tags (this also applies to headings)
                                text = f"<b>{text}</b>"

                            # if is_upper:
                            #     text = f"<span style='text-transform:uppercase'>{text}</span>"
                            # Only execute if text is not None, not empty, and not whitespace
                            # if text and text.strip():
                            #     text_with_tag = f"<{tag_for_text} style='display:inline; color:{font_color};'>{text}</{tag_for_text}>\n"
                            #     block_text.append(text_with_tag)
                            if tag_for_text != 'p':
                                text_with_tag = f"<{tag_for_text} style='display:inline; color:{font_color};'>{text}</{tag_for_text}>\n"
                                block_text.append(text_with_tag)
                            else:
                                block_text.append(text)

                # Wrap non-heading blocks in <p>; headings get an empty <p></p> appended
                if not block_text or not block_text[0].startswith('<h'):
                    rows_with_html.append((page_num, block_id, f"<p>{' '.join(block_text)}</p>", ' '.join(original_text)))
                else:
                    rows_with_html.append((page_num, block_id, ' '.join(block_text) + "<p></p>", ' '.join(original_text)))
                #rows_with_html.append((page_num, block_id, ' '.join(block_text) + "<br><br>", ' '.join(original_text)))

    # Create the final DataFrame
    grouped_df = pd.DataFrame(rows_with_html, columns=['page_num', 'block_id', 'text', 'originalText'])
    grouped_df.to_excel('test.xlsx')
    return grouped_df
