How to get formatted text from a strange messed up pdf #850

jindili · 2021-01-21T07:52:04Z


The strange, messed-up PDF:
strange_pdf_text.pdf

In a PDF viewer it seems all right, but this is the result of page.getText("text"):

o. Euro 
10 to 49
< 1
< 10 M
1  -  - 
Mio
io. Euro 
500 to 999 
>1
>100 Mi
Mio
Mi
< < 50 Mio. Euro 
10 to 49 
10 -- <

Result of page.getText("words"):

[(38.463714599609375, 10.389280319213867, 43.76622009277344, 19.771780014038086, 'o.', 0, 0, 0), 
(45.46875, 10.389280319213867, 59.44049072265625, 19.771780014038086, 'Euro', 0, 0, 1), 
(80.9737548828125, 10.389280319213867, 87.76874542236328, 19.771780014038086, '10', 0, 1, 0), 
(89.47125244140625, 10.389280319213867, 95.29873657226562, 19.771780014038086, 'to', 0, 1, 1), 
(97.00125122070312, 10.389280319213867, 103.7962417602539, 19.771780014038086, '49', 0, 1, 2), 
(17.823699951171875, 10.389280319213867, 21.40869903564453, 19.771780014038086, '<', 0, 2, 0), 
(23.1112060546875, 10.389280319213867, 26.508705139160156, 19.771780014038086, '1', 0, 2, 1), 
(17.823699951171875, 10.389280319213867, 21.40869903564453, 19.771780014038086, '<', 0, 3, 0), 
(23.1112060546875, 10.389280319213867, 29.90619659423828, 19.771780014038086, '10', 0, 3, 1), 
(31.60870361328125, 10.389280319213867, 36.731204986572266, 19.771780014038086, 'M', 0, 3, 2), 
(8.97369384765625, 10.389280319213867, 12.371193885803223, 19.771780014038086, '1', 0, 4, 0), 
(14.07421875, 10.389280319213867, 16.121719360351562, 19.771780014038086, '-', 0, 4, 1), 
(14.07421875, 10.389280319213867, 16.121719360351562, 19.771780014038086, '-', 0, 4, 2), 
(31.609710693359375, 10.389280319213867, 42.244720458984375, 19.771780014038086, 'Mio', 0, 5, 0), 
(29.576690673828125, 24.391294479370117, 36.61170959472656, 33.7737922668457, 'io.', 1, 0, 0), 
(38.314239501953125, 24.391294479370117, 52.2867431640625, 33.7737922668457, 'Euro', 1, 0, 1), 
(80.97500610351562, 24.391294479370117, 91.16748809814453, 33.7737922668457, '500', 1, 1, 0), 
(92.8699951171875, 24.391294479370117, 98.69747924804688, 33.7737922668457, 'to', 1, 1, 1), 
(100.39999389648438, 24.391294479370117, 110.59247589111328, 33.7737922668457, '999', 1, 1, 2), 
(8.97369384765625, 24.391294479370117, 15.956185340881348, 33.7737922668457, '>1', 1, 2, 0), 
(8.97369384765625, 24.391294479370117, 22.75116729736328, 33.7737922668457, '>100', 1, 3, 0), 
(24.45367431640625, 24.391294479370117, 31.30919075012207, 33.7737922668457, 'Mi', 1, 3, 1), 
(24.453704833984375, 24.391294479370117, 35.088714599609375, 33.7737922668457, 'Mio', 1, 4, 0), 
(24.453704833984375, 24.391294479370117, 31.30870246887207, 33.7737922668457, 'Mi', 1, 5, 0), 
(21.226715087890625, 38.394283294677734, 24.81171417236328, 47.77678298950195, '<', 2, 0, 0), 
(21.226715087890625, 38.394283294677734, 24.81171417236328, 47.77678298950195, '<', 2, 0, 1), 
(26.51422119140625, 38.394283294677734, 33.30921173095703, 47.77678298950195, '50', 2, 0, 2), 
(35.01171875, 38.394283294677734, 47.16923522949219, 47.77678298950195, 'Mio.', 2, 0, 3), 
(48.87176513671875, 38.394283294677734, 62.844268798828125, 47.77678298950195, 'Euro', 2, 0, 4), 
(80.98077392578125, 38.394283294677734, 87.77576446533203, 47.77678298950195, '10', 2, 1, 0), 
(89.478271484375, 38.394283294677734, 95.30575561523438, 47.77678298950195, 'to', 2, 1, 1), 
(97.00827026367188, 38.394283294677734, 103.80326080322266, 47.77678298950195, '49', 2, 1, 2), 
(8.97869873046875, 38.394283294677734, 15.773690223693848, 47.77678298950195, '10', 2, 2, 0), 
(17.4761962890625, 38.394283294677734, 19.524215698242188, 47.77678298950195, '--', 2, 2, 1), 
(21.226715087890625, 38.394283294677734, 24.81171417236328, 47.77678298950195, '<', 2, 2, 2)]

And the result of page.getText("blocks"):

[(8.97369384765625, 10.389280319213867, 103.7962417602539, 19.771780014038086, 'o. Euro \n10 to 49\n< 1\n< 10 M\n1  -  - \n', 0, 0), 
(8.440704345703125, 11.41131591796875, 22.440704345703125, 18.131315231323242, '<image: ICCBased(RGB,sRGB IEC61966-2.1), width 14, height 7, bpc 8>', 1, 1), 
(31.609710693359375, 10.389280319213867, 42.244720458984375, 19.771780014038086, 'Mio\n', 2, 0), 
(8.97369384765625, 24.391294479370117, 112.29496765136719, 33.7737922668457, 'io. Euro \n500 to 999 \n>1\n>100 M\n', 3, 0), 
(7.960693359375, 25.331298828125, 16.960693359375, 31.091299057006836, '<image: ICCBased(RGB,sRGB IEC61966-2.1), width 9, height 6, bpc 8>', 4, 1), 
(7.960693359375, 28.6912841796875, 11.960693359375, 32.05128479003906, '<image: ICCBased(RGB,sRGB IEC61966-2.1), width 4, height 4, bpc 8>', 5, 1), 
(24.453704833984375, 24.391294479370117, 35.088714599609375, 33.7737922668457, 'i\nMio\nMi\n', 6, 0), 
(8.97869873046875, 38.394283294677734, 105.50575256347656, 47.77678298950195, '< < 50 Mio. Euro \n10 to 49 \n10 -- < \n', 7, 0)]

The text seems to be split up randomly, and there are even some images in it.
I don't know how such a strange PDF was produced.
I want to format the text as it looks in a PDF viewer by analysing the result of page.getText("words"), but the output is messed up and hard to format.

Does anyone have a suggestion? Or would it be better to convert the page to an image and OCR it?


JorjMcKie (Maintainer) · 2021-01-21T11:51:53Z

To get rid of the images, simply use the flags parameter in text extraction: page.getText(..., flags=0).
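If the blocks have already been extracted with the images included, the image entries can also be dropped afterwards. The following is only a minimal sketch, assuming the tuple layout visible in the "blocks" output above, where the last item is the block type (0 for text, 1 for image). The helper name text_blocks_only and the shortened sample coordinates are made up for illustration.

```python
def text_blocks_only(blocks):
    """Keep only the text blocks of a page.getText("blocks") result."""
    # Each block is (x0, y0, x1, y1, content, block_no, block_type).
    return [b for b in blocks if b[6] == 0]


# Shortened sample modelled on the output in the question (assumption).
sample = [
    (8.97, 10.39, 103.80, 19.77, "o. Euro \n10 to 49\n", 0, 0),
    (8.44, 11.41, 22.44, 18.13,
     "<image: ICCBased(RGB,sRGB IEC61966-2.1), width 14, height 7, bpc 8>", 1, 1),
    (31.61, 10.39, 42.24, 19.77, "Mio\n", 2, 0),
]

for block in text_blocks_only(sample):
    print(repr(block[4]))
```

Passing flags=0 at extraction time is still the simpler route, because the images are then never reported in the first place.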

Otherwise, the PDF does indeed look as if it was messed up on purpose. But there is hope. The following snippet brings at least some order to all that sloppiness:

import fitz

doc = fitz.open("strange_pdf_text.pdf")
page = doc[0]
wl = page.getText("words")
wl.sort(key=lambda w: (w[3], w[0]))  # sort asc: vertical, horizontal coordinates
lines = {}
for w in wl:
    y = w[3]  # y1 = bottom of the word
    x = round(w[0])  # x0 = start (left) of the word
    words = lines.get(y, {})
    words[x] = w[4]  # store word text and its start coord
    lines[y] = words  # store back words for this line

for y in lines.keys():
    words = lines[y]
    print(" ".join([words[x] for x in words.keys()]))

Produces this, which is a lot closer:

1 - < 10 Mio o. Euro 10 to 49
>100 Mi io. Euro 500 to 999
10 -- < 50 Mio. Euro 10 to 49

As you can see, some characters are still doubled, but this shows the direction to take.

1 reply

jindili Jan 21, 2021
Author

@JorjMcKie, thank you for your solution.
I think I can check the bboxes in the word list and get rid of the words that start at the same top-left corner but contain less text.
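The idea described in this reply could look roughly like the sketch below. It assumes the word tuple layout shown in the question, (x0, y0, x1, y1, text, block_no, line_no, word_no). The function name drop_shadowed_words, the rounding precision, and the shortened sample coordinates are assumptions made for illustration.

```python
def drop_shadowed_words(words, ndigits=1):
    """For words sharing a top-left corner, keep only the longest text."""
    best = {}
    for w in words:
        x0, y0, text = w[0], w[1], w[4]
        # Round so that tiny coordinate differences still count as equal.
        key = (round(x0, ndigits), round(y0, ndigits))
        if key not in best or len(text) > len(best[key][4]):
            best[key] = w
    # Restore reading order: top to bottom, then left to right.
    return sorted(best.values(), key=lambda w: (w[3], w[0]))


# Shortened sample modelled on the word list in the question (assumption).
sample = [
    (17.82, 10.39, 21.41, 19.77, "<", 0, 2, 0),
    (23.11, 10.39, 26.51, 19.77, "1", 0, 2, 1),
    (17.82, 10.39, 21.41, 19.77, "<", 0, 3, 0),
    (23.11, 10.39, 29.91, 19.77, "10", 0, 3, 1),
    (31.61, 10.39, 36.73, 19.77, "M", 0, 3, 2),
    (31.61, 10.39, 42.24, 19.77, "Mio", 0, 5, 0),
]

print(" ".join(w[4] for w in drop_shadowed_words(sample)))
```

This only catches fragments that start at the same corner. A fragment such as 'o.', which starts inside the area of 'Mio', would survive, so a character-level pass is likely still needed for this PDF.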

JorjMcKie (Maintainer) · 2021-01-21T12:19:36Z

I already have a better script. It works similarly, but goes down to the individual characters. This automatically replaces doubled characters:

import fitz

doc = fitz.open("strange_pdf_text.pdf")
page = doc[0]
blocks = page.getText("rawdict", flags=0)["blocks"]
chars = []
for b in blocks:
    for line in b["lines"]:
        for s in line["spans"]:
            for char in s["chars"]:
                bbox = fitz.Rect(char["bbox"])
                chars.append((bbox.y1, bbox.x0, char["c"]))
chars.sort(key=lambda x: (x[0], x[1]))
lines = {}
for char in chars:
    y = char[0]  # y1 = bottom of the char
    x = round(char[1])  # x0 = start (left) of the char
    ch = lines.get(y, {})
    ch[x] = char[2]  # store char text and its start coord
    lines[y] = ch  # store back ch for this line

for y in lines.keys():
    ch = lines[y]
    print("".join([ch[x] for x in ch.keys()]))

Produces this:

1 - < 10 Mio. Euro 10 to 49
>100 Mio. Euro 500 to 999
10 - < 50 Mio. Euro 10 to 49

Now all you have to do is add logic that renders large distances between characters in a line as multiple spaces.

3 replies

JorjMcKie Jan 21, 2021
Maintainer

The last comment in the previous post entails also storing the bbox width of each character and then replacing the print statement with something more complex:

Check the difference between the end of a character and the start of the next character, and insert an appropriate number of spaces.

jindili Jan 21, 2021
Author

@JorjMcKie, that's awesome!

JorjMcKie Jan 21, 2021
Maintainer

Thanks 😎!
But you can always do better:

import fitz

doc = fitz.open("strange_pdf_text.pdf")
page = doc[0]
blocks = page.getText("rawdict", flags=0)["blocks"]
chars = []  # list of characters and their bboxes
spc_width = 0  # width of space character
for b in blocks:
    for line in b["lines"]:
        for s in line["spans"]:
            for char in s["chars"]:
                bbox = fitz.Rect(char["bbox"])
                chars.append((bbox, char["c"]))
                if bbox.width > spc_width and char["c"] == " ":
                    spc_width = bbox.width
chars.sort(key=lambda x: (x[0].y1, x[0].x0))  # sort characters: vertical, horizontal
lines = {}  # collects char infos per line (key: line bottom coord)
for char in chars:
    y = char[0].y1  # y1 = bottom of the char
    x = round(char[0].x0)  # x0 = left of the char, don't be too precise!
    ch = lines.get(y, {})  # dictionary of char infos in this line
    ch[x] = char  # store this char info under its left coordinate
    lines[y] = ch  # store back dictionary of char infos

for y in lines.keys():  # iterate by ascending line coordinates
    chars = lines[y]  # dict of char infos
    keys = list(chars.keys())  # all the x coordinates
    nchars = len(keys) - 1  # index of the last character
    text = ""  # will contain text of line
    for i in range(nchars + 1):  # process all the characters
        char = chars[keys[i]]  # current char
        text += char[1]  # append it to line text
        if i < nchars:  # now check for large gap to following char
            folchar = chars[keys[i + 1]]  # following char
            nx0 = folchar[0].x0  # its start coord
            x1 = char[0].x1  # end of current char
            while spc_width > 0 and nx0 > x1 + spc_width:  # next char too far away (skipped if no space char was found)
                text += " "  # add a filler space
                x1 += spc_width  # add one space width
    print(text)

Produces this:

1 - < 10 Mio. Euro            10 to 49
>100 Mio. Euro                500 to 999
10 - < 50 Mio. Euro          10 to 49
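The gap-filling step of the script above can also be isolated into a small function that is testable without a PDF. This is only a sketch of the same loop, not part of the original answer. The function name fill_gaps, the (x0, x1, text) tuple layout, and the sample coordinates are assumptions for illustration. It includes a guard against a zero space width, which can occur when a page contains no space characters.

```python
def fill_gaps(chars, spc_width):
    """Join (x0, x1, text) characters, padding wide gaps with spaces."""
    chars = sorted(chars)  # order by left coordinate
    text = ""
    for i, (x0, x1, c) in enumerate(chars):
        text += c
        # Without a positive space width the loop below would never end.
        if spc_width > 0 and i + 1 < len(chars):
            next_x0 = chars[i + 1][0]  # start of the following char
            while next_x0 > x1 + spc_width:  # following char too far away
                text += " "  # add a filler space
                x1 += spc_width  # advance by one space width
    return text


# Made-up coordinates: 'a' and 'b' are adjacent, 'c' is far to the right.
sample = [(0.0, 5.0, "a"), (5.5, 10.0, "b"), (22.0, 27.0, "c")]
print(repr(fill_gaps(sample, 3.0)))
```

In the script above, the space width comes from the widest space character found on the page, so the amount of padding depends on the fonts in use.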
