How to get formatted text from a strange messed up pdf #850
-
The strange, messed-up PDF looks like this: [image]
Result of page.getText("words"): [image]
Result of page.getText("blocks"): [image]
The text seems to be split up at random, and there are even some images in it. Does anyone have a suggestion, or is it better to convert the page to an image and OCR it?
-
To get rid of images, simply use the flags parameter in text extraction. Otherwise, the PDF indeed looks like it was messed up on purpose. But there is hope. The following snippet brings at least some order to all that sloppiness:

    import fitz

    doc = fitz.open("strange_pdf_text.pdf")
    page = doc[0]
    wl = page.getText("words")
    wl.sort(key=lambda w: (w[3], w[0]))  # sort ascending: vertical, then horizontal coordinate
    lines = {}
    for w in wl:
        y = w[3]  # y1 = bottom of the word
        x = round(w[0])  # x0 = start (left) of the word
        words = lines.get(y, {})
        words[x] = w[4]  # store word text and its start coordinate
        lines[y] = words  # store back words for this line
    for y in lines.keys():
        words = lines[y]
        print(" ".join([words[x] for x in words.keys()]))

This produces the following, which is a lot closer: [image]
As you can see, some characters are still doubled. But you can see the direction of it all.
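The snippet above puts two words on the same line only if their bottom coordinates are exactly equal. If the PDF places words at slightly different heights, a tolerance may help. The following is a minimal sketch, not part of the original answer: `words_to_lines` and `y_tol` are hypothetical names, and the default of 2.0 points is a guess to tune per document. It works on plain tuples in the layout returned by the "words" extraction, (x0, y0, x1, y1, text, ...), so it can be tried without a PDF.

```python
def words_to_lines(words, y_tol=2.0):
    """Group word tuples (x0, y0, x1, y1, text, ...) into lines of text.

    y_tol is an assumed tolerance (in points) for treating bottom
    coordinates as belonging to the same line.
    """
    # sort ascending: bottom coordinate first, then left coordinate
    ordered = sorted(words, key=lambda w: (w[3], w[0]))
    lines = []     # one {rounded x0: text} dict per line, top to bottom
    line_y = None  # bottom coordinate of the first word in the current line
    for w in ordered:
        if line_y is None or w[3] - line_y > y_tol:
            lines.append({})  # too far below the current line: start a new one
            line_y = w[3]
        # a word starting at the same rounded x0 overwrites the earlier copy
        lines[-1][round(w[0])] = w[4]
    # within each line, emit the words from left to right
    return [" ".join(text for _, text in sorted(line.items())) for line in lines]


if __name__ == "__main__":
    sample = [
        (50.2, 10, 80, 20.0, "world"),
        (10.0, 10, 45, 20.3, "Hello"),  # duplicate, slightly lower
        (10.1, 10, 45, 20.0, "Hello"),
        (10.0, 30, 40, 40.0, "next"),
    ]
    for text in words_to_lines(sample):
        print(text)
```

With a real page, the list passed in would be the result of the "words" extraction. Keying each line on the rounded x0 keeps the de-duplication behaviour of the original snippet.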
-
I have a better script already. It works similarly, but goes down to each character. This automatically replaces doubled characters:

    import fitz

    doc = fitz.open("strange_pdf_text.pdf")
    page = doc[0]
    blocks = page.getText("rawdict", flags=0)["blocks"]  # flags=0: no images
    chars = []
    for b in blocks:
        for line in b["lines"]:
            for s in line["spans"]:
                for char in s["chars"]:
                    bbox = fitz.Rect(char["bbox"])
                    chars.append((bbox.y1, bbox.x0, char["c"]))
    chars.sort(key=lambda x: (x[0], x[1]))  # sort ascending: vertical, then horizontal
    lines = {}
    for char in chars:
        y = char[0]  # y1 = bottom of the char
        x = round(char[1])  # x0 = start (left) of the char
        ch = lines.get(y, {})
        ch[x] = char[2]  # store char text and its start coordinate
        lines[y] = ch  # store back ch for this line
    for y in lines.keys():
        ch = lines[y]
        print("".join([ch[x] for x in ch.keys()]))

This produces the following: [image]
Now all you have to do is add logic that fills large distances between characters in a line with multiple spaces.
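One possible shape for that spacing logic is sketched below. It is not part of the original script, and it rests on assumptions: each character also carries its right edge x1 (available as `bbox.x1` in the loop above), and `render_line`, `space_width` and `min_gap` are hypothetical names with guessed defaults that would need tuning against the actual font size.

```python
def render_line(chars, space_width=3.0, min_gap=1.5):
    """Join the (x0, x1, c) character tuples of one line into a string.

    Gaps wider than min_gap are filled with spaces, one per space_width
    points of distance (both values are assumptions, in points).
    """
    # keep one character per rounded start coordinate to drop doubled chars
    unique = {}
    for x0, x1, c in chars:
        unique[round(x0)] = (x0, x1, c)

    out = []
    prev_x1 = None  # right edge of the previously emitted character
    for x0, x1, c in sorted(unique.values()):
        if prev_x1 is not None:
            gap = x0 - prev_x1
            if gap > min_gap:
                # translate the horizontal distance into a number of spaces
                out.append(" " * max(1, round(gap / space_width)))
        out.append(c)
        prev_x1 = x1
    return "".join(out)


if __name__ == "__main__":
    sample = [
        (10.0, 15.0, "A"),
        (15.0, 20.0, "b"),
        (15.2, 20.2, "b"),  # doubled character at nearly the same position
        (32.0, 37.0, "C"),
    ]
    print(repr(render_line(sample)))
```

To combine it with the script above, one would store `(bbox.x0, bbox.x1, char["c"])` per line instead of only the start coordinate, and call `render_line` once per line.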