How to get formatted text from a strange messed up pdf #850

I already have a better script. It works similarly, but goes down to the individual characters, and it automatically replaces doubled characters:

import fitz

doc = fitz.open("strange_pdf_text.pdf")
page = doc[0]
# get_text is the current method name; older PyMuPDF releases spelled it getText
blocks = page.get_text("rawdict", flags=0)["blocks"]
chars = []
for b in blocks:
    for line in b["lines"]:
        for s in line["spans"]:
            for char in s["chars"]:
                bbox = fitz.Rect(char["bbox"])
                chars.append((bbox.y1, bbox.x0, char["c"]))
chars.sort(key=lambda x: (x[0], x[1]))
lines = {}
for char in chars:
    y = char[0]  # y1 = bottom of the char
    x = round(char[1])  # x0 = start (left) of the char
    ch = lines.get(y, {})
    ch[x] = char[2]  # st…
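
The script above is cut off in this copy, so its remaining steps are not reproduced here. As a rough, separate illustration of the idea it describes, the sketch below groups characters by their position and keeps only one character per rounded (y, x) slot, so that a glyph drawn twice at nearly the same place collapses into one. The function name `collapse_doubled_chars` and the sample data are hypothetical, and rounding the y coordinate is an assumption of this sketch, not something the original script is known to do.

```python
def collapse_doubled_chars(chars):
    """Turn (y1, x0, c) tuples into text lines, one char per rounded position.

    Hypothetical helper for illustration; not taken from the original script.
    """
    lines = {}
    # Sort top-to-bottom, then left-to-right, like the script above.
    for y1, x0, c in sorted(chars, key=lambda t: (t[0], t[1])):
        row = lines.setdefault(round(y1), {})
        # A second char landing on the same rounded x overwrites the first,
        # which is what removes the doubled glyph.
        row[round(x0)] = c
    return ["".join(row[x] for x in sorted(row)) for _, row in sorted(lines.items())]


# Made-up input: "H" and "i" are each drawn twice at almost the same position.
sample = [
    (100.0, 10.2, "H"), (100.0, 10.4, "H"),
    (100.0, 18.0, "i"), (100.0, 18.1, "i"),
    (120.0, 10.0, "!"),
]
print(collapse_doubled_chars(sample))  # → ['Hi', '!']
```

With real data, the tuples would come from the `chars` list built in the script above. Rounding to whole points is a coarse tolerance: it can merge distinct narrow glyphs that sit very close together, and it can miss duplicates that straddle a rounding boundary, so the tolerance may need tuning per document.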

Answer selected by JorjMcKie