Two files which look identical (on first inspection) produce different line breaks when extracting text #1395

Open
dl-racing opened this issue Oct 14, 2022 · 12 comments
Labels
is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
whitespace: While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow

Comments

@dl-racing

I'm raising this issue as a result of a super useful (and helpful!) chat with @MartinThoma.

For simplicity, I am trying to extract the first page of the 'SECTOR ANALYSIS' sections from both the attached PDFs.

One file (correct_newlines.pdf) produces each row as expected as a new line of text (albeit the columns are in a different but consistent order).

The other file (missing_newlines.pdf) has very similar data but produces fewer lines of text, with multiple lines concatenated without spaces between them.

correct_newlines.pdf
missing_newlines.pdf

@dl-racing changed the title from "FAO Martin Thoma: Two files which look identical (on first inspection) produce different line breaks when extracting text" to "Two files which look identical (on first inspection) produce different line breaks when extracting text" on Oct 14, 2022
@MartinThoma added the is-bug and workflow-text-extraction labels on Oct 14, 2022
@MartinThoma
Member

Hi @dl-racing

I just gave this a try:

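import PyPDF2

# Print the installed version so the result can be tied to a specific release
print(f"PyPDF2=={PyPDF2.__version__}\n\n")

reader = PyPDF2.PdfReader("missing_newlines.pdf")
# reader.pages is zero-indexed, so index 6 is the seventh page of the PDF
print(reader.pages[6].extract_text())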

which gives me

PyPDF2==2.11.1

2022 Intelligent Money British GT Championship
TEST SESSION 1 - SECTOR ANALYSIS
SECTOR 1 = FL to I1,    SECTOR 2 = I1 to I2,    SECTOR 3 = I2 to FL,    DIFF = Difference To Personal Best Lap,   P = Crossed F inish Line in Pit Lane,   D = Time Disallowed
77 Enduro Motorsport P1
LAP LAP TIME DIFF TIME OF DAY SECTOR 1 SECTOR 2 SECTOR 3GT3PA McLaren 720S GT3
MPHIDEAL LAP TIME :  1:26.866 BEST LAP TIME :  1:26.942 DIFFERENCE :   0.076
D1: Morgan TILLBROOK D2: Marcus CLUTTON
1 - D1 11:03:12.126 OUTLAP 116.7 37.750 139.2 35.381 101.0
2 - D1 1:28.713 1.771 11:04:40.839 19.626 144.6 34.707 140.9 34.380 100.4 100.93
3 - D1 1:28.523 1.581 11:06:09.362 19.636 134.4 34.731 140.9 34.156 101.8 101.15
4 - D1 1:27.561 0.619 11:07:36.923 19.362 145.8 34.269 140.9 33.930 101.2 102.26
5 - D1 1:27.515 0.573 11:09:04.438 19.200 146.5 34.323 140.9 33.992 100.7 102.31
6 - D1 1:29.908 2.966 11:10:34.346 P 19.302 146.2 34.527 141.8 IN PIT 99.59
7 - D1 5:09.987 3:43.045 11:15:44.333 OUTLAP 125.9 35.711 140.6 34.663 101.3 28.88
8 - D1 1:28.833 1.891 11:17:13.166 19.579 118.9 35.110 140.3 34.144 101.9 100.80
9 - D1 1:27.324 0.382 11:18:40.490 19.215 145.8 34.132 141.2 33.977 101.6 102.54
10 - D1 1:27.312 0.370 11:20:07.802 (3) 19.167 145.8 34.215 141.2 33.930 101.3 102.55
11 - D1 1:28.904 1.962 11:21:36.706 P 19.194 147.1 34.112 141.5 IN PIT 100.72
12 - D1 3:20.279 1:53.337 11:24:56.985 OUTLAP 142.4 34.842 141.8 41.436 102.7 44.70
13 - D1 1:27.303 0.361 11:26:24.288 (2) 19.093 146.8 33.945 142.1 34.265 101.6 102.56
14 - D1 1:26.942 11:27:51.230 (1) 19.116 145.8 33.990 142.1 33.836 101.5 102.99
15 - D1 1:29.334 2.392 11:29:20.564 P 19.085 146.2 34.645 141.5 IN PIT 100.23
16 - D1 6:47.450 5:20.508 11:36:08.014 OUTLAP 133.9 36.242 139.5 35.600 99.5 21.97
17 - D1 1:32.720 5.778 11:37:40.734 19.852 139.2 37.286 139.8 35.582 100.1 96.57
18 - D1 1:30.155 3.213 11:39:10.889 19.622 143.0 35.781 138.0 34.752 100.1 99.32
19 - D1 1:32.253 5.311 11:40:43.142 19.456 144.9 36.385 141.5 36.412 97.9 97.06
20 - D1 1:30.489 3.547 11:42:13.631 19.514 144.9 34.885 141.2 36.090 96.4 98.95
21 - D1 1:32.417 5.475 11:43:46.048 20.315 133.4 37.147 139.5 34.955 101.5 96.89
22 - D1 1:29.515 2.573 11:45:15.563 19.366 145.2 34.815 140.9 35.334 100.9 100.03
23 - D1 1:31.117 4.175 11:46:46.680 19.308 144.9 36.004 140.9 35.805 101.6 98.27
24 - D1 1:41.788 14.846 11:48:28.468 19.251 144.3 46.134 139.5 36.403 102.9 87.97
25 - D1 1:28.198 1.256 11:49:56.666 19.328 144.6 34.575 140.9 34.295 100.4 101.52
26 - D1 1:30.756 3.814 11:51:27.422 19.288 145.2 36.824 140.3 34.644 100.6 98.66
27 - D1 1:29.093 2.151 11:52:56.515 19.644 143.7 35.093 141.2 34.356 101.3 100.50
28 - D1 1:28.870 1.928 11:54:25.385 19.250 145.5 35.408 141.8 34.212 101.5 100.75
29 - D1 1:29.468 2.526 11:55:54.853 19.294 146.5 34.896 141.8 35.278 101.5 100.08
Results can be found at www.tsl-timing.com Page 1 of 16 Printed - 12:02 Thursday, 26 May 2022Date: 26/05/2022  Start: 11:00  Finish: 11:55Weather / Track :  / Donington Park GP: 2.4873 miles
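
Whether rows have been fused can also be checked programmatically rather than by eye. The following is a minimal sketch, assuming that every lap row starts with a marker of the form "<lap number> - D<driver number>", as in the output above. The function name and the regular expression are illustrative only and are not part of PyPDF2.

```python
import re

# Assumption: a lap row starts with "<lap number> - D<driver number>",
# for example "2 - D1". This pattern is derived from the output in this thread.
LAP_ROW_START = re.compile(r"\b\d+ - D\d\b")


def find_fused_lines(text):
    """Return (line_number, line) pairs that contain more than one lap-row start.

    More than one row marker on a single line suggests that a newline
    went missing during extraction and two rows were concatenated.
    """
    fused = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(LAP_ROW_START.findall(line)) > 1:
            fused.append((number, line))
    return fused
```

Passing the result of extract_text() to find_fused_lines gives an empty list when each lap row is on its own line. Rows fused in some other way (for example header lines) would need their own pattern.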

@MartinThoma
Member

That actually looks fine. Could it be that you're using an older PyPDF2 version?

Please try pip install PyPDF2 --upgrade to check :-)
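
To confirm which version is actually installed before and after upgrading, a small script along these lines can help. It is only a sketch: the helper names are made up for illustration, and 2.11.1 is used as the reference simply because it is the version that produced the output above, not because it is a documented minimum.

```python
import re
from importlib import metadata


def parse_version(version):
    """Turn a version string such as '2.11.1' into a tuple such as (2, 11, 1)."""
    numbers = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first piece without a leading number
        numbers.append(int(match.group()))
    return tuple(numbers)


def installed_version(package="PyPDF2"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_older_than(installed, reference):
    """True if 'installed' is a lower version than 'reference'."""
    return parse_version(installed) < parse_version(reference)


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("PyPDF2 is not installed")
    elif is_older_than(found, "2.11.1"):
        print(f"PyPDF2 {found} is older than 2.11.1; try upgrading")
    else:
        print(f"PyPDF2 {found} is at least 2.11.1")
```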

@MartinThoma
Member

@dl-racing For your project, a layout-preserving text extraction might be the best fit. pdftotext from https://poppler.freedesktop.org/ offers that:

pdftotext -layout -f 7 -l 7 missing_newlines.pdf

gives

2022 Intelligent Money British GT Championship
TEST SESSION 1 - SECTOR ANALYSIS

  SECTOR 1 = FL to I1, SECTOR 2 = I1 to I2, SECTOR 3 = I2 to FL, DIFF = Difference To Personal Best Lap, P = Crossed Finish Line in Pit Lane, D = Time Disallowed
   P1         77 GT3PA          Enduro Motorsport                                              McLaren 720S GT3
   IDEAL LAP TIME : 1:26.866                BEST LAP TIME : 1:26.942                 DIFFERENCE : 0.076
   D1: Morgan TILLBROOK       D2: Marcus CLUTTON

    LAP               SECTOR 1                     SECTOR 2                     SECTOR 3                LAP TIME           MPH              DIFF     TIME OF DAY
     1 - D1     OUTLAP        116.7            37.750       139.2            35.381       101.0                                                        11:03:12.126
     2 - D1      19.626       144.6            34.707       140.9            34.380       100.4          1:28.713         100.93            1.771      11:04:40.839
     3 - D1      19.636       134.4            34.731       140.9            34.156       101.8          1:28.523         101.15            1.581      11:06:09.362
     4 - D1      19.362       145.8            34.269       140.9            33.930       101.2          1:27.561         102.26            0.619      11:07:36.923
     5 - D1      19.200       146.5            34.323       140.9            33.992       100.7          1:27.515         102.31            0.573      11:09:04.438
     6 - D1      19.302       146.2            34.527       141.8            IN PIT                      1:29.908    P     99.59            2.966      11:10:34.346
     7 - D1     OUTLAP        125.9            35.711       140.6            34.663       101.3          5:09.987          28.88         3:43.045      11:15:44.333
     8 - D1      19.579       118.9            35.110       140.3            34.144       101.9          1:28.833         100.80            1.891      11:17:13.166
     9 - D1      19.215       145.8            34.132       141.2            33.977       101.6          1:27.324         102.54            0.382      11:18:40.490
    10 - D1      19.167       145.8            34.215       141.2            33.930       101.3          1:27.312   (3)   102.55            0.370      11:20:07.802
    11 - D1      19.194       147.1            34.112       141.5            IN PIT                      1:28.904    P    100.72            1.962      11:21:36.706
    12 - D1     OUTLAP        142.4            34.842       141.8            41.436       102.7          3:20.279          44.70         1:53.337      11:24:56.985
    13 - D1      19.093       146.8            33.945       142.1            34.265       101.6          1:27.303   (2)   102.56            0.361      11:26:24.288
    14 - D1      19.116       145.8            33.990       142.1            33.836       101.5          1:26.942   (1)   102.99                       11:27:51.230
    15 - D1      19.085       146.2            34.645       141.5            IN PIT                      1:29.334    P    100.23            2.392      11:29:20.564
    16 - D1     OUTLAP        133.9            36.242       139.5            35.600        99.5          6:47.450          21.97         5:20.508      11:36:08.014
    17 - D1      19.852       139.2            37.286       139.8            35.582       100.1          1:32.720          96.57            5.778      11:37:40.734
    18 - D1      19.622       143.0            35.781       138.0            34.752       100.1          1:30.155          99.32            3.213      11:39:10.889
    19 - D1      19.456       144.9            36.385       141.5            36.412        97.9          1:32.253          97.06            5.311      11:40:43.142
    20 - D1      19.514       144.9            34.885       141.2            36.090        96.4          1:30.489          98.95            3.547      11:42:13.631
    21 - D1      20.315       133.4            37.147       139.5            34.955       101.5          1:32.417          96.89            5.475      11:43:46.048
    22 - D1      19.366       145.2            34.815       140.9            35.334       100.9          1:29.515         100.03            2.573      11:45:15.563
    23 - D1      19.308       144.9            36.004       140.9            35.805       101.6          1:31.117          98.27            4.175      11:46:46.680
    24 - D1      19.251       144.3            46.134       139.5            36.403       102.9          1:41.788          87.97           14.846      11:48:28.468
    25 - D1      19.328       144.6            34.575       140.9            34.295       100.4          1:28.198         101.52            1.256      11:49:56.666
    26 - D1      19.288       145.2            36.824       140.3            34.644       100.6          1:30.756          98.66            3.814      11:51:27.422
    27 - D1      19.644       143.7            35.093       141.2            34.356       101.3          1:29.093         100.50            2.151      11:52:56.515
    28 - D1      19.250       145.5            35.408       141.8            34.212       101.5          1:28.870         100.75            1.928      11:54:25.385
    29 - D1      19.294       146.5            34.896       141.8            35.278       101.5          1:29.468         100.08            2.526      11:55:54.853




Weather / Track : /                                                                                                                        Donington Park GP: 2.4873 miles
                                                                                                                                 Date: 26/05/2022 Start: 11:00 Finish: 11:55

Results can be found at www.tsl-timing.com                                     Page 1 of 16                                           Printed - 12:02 Thursday, 26 May 2022
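
If the layout-preserving output is needed inside a Python script, the same command can be driven through subprocess. This is a rough sketch, assuming pdftotext from poppler is installed and on the PATH. The function names are hypothetical, and the trailing "-" argument asks pdftotext to write to standard output instead of a .txt file next to the PDF.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, page_number):
    """Build the argument list for a single page, numbered from 1 like -f/-l."""
    page = str(page_number)
    # "-" as the output file makes pdftotext print the text to stdout.
    return ["pdftotext", "-layout", "-f", page, "-l", page, pdf_path, "-"]


def extract_page_with_layout(pdf_path, page_number):
    """Run pdftotext for one page and return its layout-preserving text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, page_number),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

For the page shown above, the call would be extract_page_with_layout("missing_newlines.pdf", 7). Note that pdftotext counts pages from 1, whereas reader.pages in PyPDF2 is indexed from 0.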

@MartinThoma
Member

MartinThoma commented Oct 14, 2022

@pubpub-zz / @srogmann Just out of curiosity: Do you think such a layout-preserving mode could be possible with PyPDF2 as well?

I'm uncertain what that would entail and how often users would prefer it compared to the current "reading-flow" extraction mode. This is especially important when there is a multi-column layout (not tables, but actual text columns).

For tables, I think the layout preserving mode is pretty much always desirable. However, I don't see how we could reliably detect that there is a table.

@dl-racing
Author

dl-racing commented Oct 14, 2022 via email

@MartinThoma
Member

MartinThoma commented Oct 14, 2022

I'm happy to hear that it works now!

Just to make sure I've got it right: The upgrade of PyPDF2 did the trick with the newlines, right? So the newlines work, but the whitespace is still something we could improve. Right?

@dl-racing
Author

dl-racing commented Oct 14, 2022 via email

@pubpub-zz
Collaborator

> @pubpub-zz / @srogmann Just out of curiosity: Do you think such a layout-preserving mode could be possible with PyPDF2 as well?
>
> I'm uncertain what that would entail and how often users would prefer it compared to the current "reading-flow" extraction mode. This is especially important when there is a multi-column layout (not tables, but actual text columns).
>
> For tables, I think the layout preserving mode is pretty much always desirable. However, I don't see how we could reliably detect that there is a table.

I hope to be able to do so; that was part of my roadmap in #1181 (comment).
I'm just finishing my current PR, and this will be my next job 😀

@MartinThoma
Member

Very nice!

I'm closing this issue now as the original problem was solved by upgrading. I'll use the files to create a test / benchmark so that we can track our progress in the layout presentation area :-)

Thank you @dl-racing and @pubpub-zz for your input and the nice discussion ❤️

@dl-racing
Author

dl-racing commented Oct 22, 2022

I've come across another erroneous example (even with the upgraded library).

Page 8, Free Practice 1 SECTOR ANALYSIS
(I've attached the page of interest, but the full PDF is available here: https://www.tsl-timing.com/file/?f=BF3GT/2022/221805bgt.pdf)

page_8_extracted_from_full_pdf
extracted_text_with_incorrect_linebreaks

@MartinThoma I've posted here instead of opening a new ticket as keeping the two cases together might be useful...can we reopen this ticket?

pdftotext works very well for my use case, but I'd like to help fix this case for PyPDF2 :)

@MartinThoma reopened this on Oct 22, 2022
@MartinThoma added the whitespace label on Jan 14, 2023
@ssjkamei
Contributor

ssjkamei commented Oct 6, 2024

The whitespace problem reported in this issue has been resolved.
What remains is that text needs to be placed by its position on the page, not by the order in which it appears in the content stream (BT, ET order).
I think the output could be changed from simply appending text to building a list with row and column position information. Is this something you would like to see addressed?

If so, is this a new feature?
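
The idea of placing text by position instead of by input order can be illustrated with a small, self-contained sketch. It assumes the text fragments are already available as (x, y, text) tuples in PDF points with y growing upward; how they are collected is left open (in PyPDF2, a visitor function passed to extract_text is one possible source). The function name, the fixed character width, and the row tolerance are assumptions for illustration, not PyPDF2 behaviour.

```python
from collections import defaultdict


def layout_lines(fragments, char_width=6.0, row_tolerance=2.0):
    """Arrange (x, y, text) fragments into lines by position, not input order.

    Fragments whose y values fall into the same bucket form one row; rows are
    emitted top to bottom, and each fragment is padded out to a column derived
    from its x coordinate.
    """
    rows = defaultdict(list)
    for x, y, text in fragments:
        # Bucket by y so small baseline differences end up in the same row.
        rows[round(y / row_tolerance)].append((x, text))

    lines = []
    for key in sorted(rows, reverse=True):  # larger y is higher on the page
        line = ""
        for x, text in sorted(rows[key]):  # left to right within the row
            column = int(x / char_width)
            if line and column <= len(line):
                # Keep at least one space between overlapping fragments.
                column = len(line) + 1
            line = line.ljust(column) + text
        lines.append(line)
    return lines


if __name__ == "__main__":
    # Deliberately out of reading order, as can happen in a content stream.
    sample = [
        (120.0, 700.0, "19.626"),
        (30.0, 688.0, "3 - D1"),
        (30.0, 700.0, "2 - D1"),
        (120.0, 688.2, "19.636"),
    ]
    print("\n".join(layout_lines(sample)))
```

Bucketing by a rounded y value is the simplest possible grouping; two fragments on the same visual row can still land in different buckets when they straddle a bucket boundary, so a real implementation would likely need a more careful clustering step.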

@ssjkamei
Contributor

ssjkamei commented Oct 6, 2024

Sorry, you mentioned a bug with whitespace in layout mode. My mistake.
