Two files which look identical (on first inspection) produce different line breaks when extracting text #1395
Comments
Hi @dl-racing, I just gave this a try:
import PyPDF2
print(f"PyPDF2=={PyPDF2.__version__}\n\n")
reader = PyPDF2.PdfReader("missing_newlines.pdf")
print(reader.pages[6].extract_text())
which gives me
|
That actually looks fine. Could it be that you're using an older PyPDF2 version? Please try |
@dl-racing For your project, a layout-preserving text extraction might be the best fit.
gives
|
@pubpub-zz / @srogmann Just out of curiosity: Do you think such a layout-preserving mode could be possible with PyPDF2 as well? I'm uncertain what that would entail and how often users would prefer it compared to the current "reading-flow" extraction mode. This is especially important when there is a multi-column layout (not tables, but actual text columns). For tables, I think the layout preserving mode is pretty much always desirable. However, I don't see how we could reliably detect that there is a table. |
Many thanks Martin, both files now work correctly. FWIW I think layout
preservation is very important as the layout often carries the
meaning/context for a piece of data. The '37.750' is part of the 'SECTOR 2'
class of data by virtue of its position directly underneath the phrase
'SECTOR 2'. If 'SECTOR 2' itself sat underneath another field, the 37.750
would inherit not only 'SECTOR 2' but also the field above it.
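To illustrate the point about position carrying meaning, here is a minimal sketch that assumes layout-preserved text in which columns line up by character position. The function names, the 'SECTOR 1' and 'SECTOR 3' headers, and the 29.101 value are made up for illustration; only 'SECTOR 2' and 37.750 come from the example above. This is not something PyPDF2 does itself.

```python
import re


def column_centres(header_line, headers):
    """Return {header: centre character position} for each header in the line."""
    centres = {}
    for name in headers:
        start = header_line.index(name)  # raises ValueError if a header is absent
        centres[name] = start + len(name) / 2
    return centres


def assign_to_headers(header_line, value_line, headers):
    """Attach each whitespace-separated value to the header nearest above it."""
    centres = column_centres(header_line, headers)
    result = {}
    for match in re.finditer(r"\S+", value_line):
        value_centre = (match.start() + match.end()) / 2
        nearest = min(centres, key=lambda h: abs(centres[h] - value_centre))
        result[nearest] = match.group()
    return result


# Illustrative data: the first column is deliberately left empty.
headers = ["SECTOR 1", "SECTOR 2", "SECTOR 3"]
header = "SECTOR 1" + " " * 4 + "SECTOR 2" + " " * 4 + "SECTOR 3"
values = " " * 13 + "37.750" + " " * 6 + "29.101"

print(assign_to_headers(header, values, headers))
```

With the horizontal position kept, the empty first column shows up as a missing key rather than shifting the remaining values one column to the left.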
I'm happy to hear that it works now! Just to make sure I've got it right: The upgrade of PyPDF2 did the trick with the newlines, right? So the newlines work, but the whitespace is still something we could improve. Right? |
I believe so. Annoyingly, I didn’t document the output and I haven’t been
able to replicate it since…
Whitespace, and for sure the ordering/layout. The column order changes and,
because whitespace is truncated, you can’t detect missing values in
place…
I hope to be able to do so (that was part of my roadmap in #1181 (comment)) |
Very nice! I'm closing this issue now as the original problem was solved by upgrading. I'll use the files to create a test / benchmark so that we can track our progress in the layout presentation area :-) Thank you @dl-racing and @pubpub-zz for your input and the nice discussion ❤️ |
I've come across another erroneous example (even with the upgraded library): page 8, Free Practice 1 SECTOR ANALYSIS (page_8_extracted_from_full_pdf). @MartinThoma I've posted here instead of opening a new ticket, as keeping the two cases together might be useful. Can we reopen this ticket? pdftotext works very well for my use case, but I'd like to help fix this case for PyPDF2 :) |
The whitespace issue has been resolved in this thread, hasn't it? If so, is this a new feature request? |
Sorry, you mentioned a bug with whitespace in layout mode. My mistake. |
I'm raising this issue as a result of a super useful (and helpful!) chat with @MartinThoma.
For simplicity, I am trying to extract the first page of the 'SECTOR ANALYSIS' sections from both the attached PDFs.
One file (correct_newlines.pdf) produces each row as expected as a new line of text (albeit the columns are in a different but consistent order).
The other file (missing_newlines.pdf) has very similar data but produces fewer lines of text, with multiple lines concatenated without spaces between.
correct_newlines.pdf
missing_newlines.pdf
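One way to quantify the difference between the two files is to compare the number of lines and the longest line in the extracted text. The following is a rough diagnostic sketch, not part of PyPDF2: `line_stats` and `page_line_stats` are hypothetical helper names, and the page index has to be chosen per file.

```python
def line_stats(text):
    """Return (number of lines, length of the longest line) for extracted text."""
    lines = text.splitlines()
    longest = max((len(line) for line in lines), default=0)
    return len(lines), longest


def page_line_stats(pdf_path, page_index):
    """Extract one page with PyPDF2 and summarise its line structure."""
    from PyPDF2 import PdfReader  # imported lazily so line_stats works without PyPDF2

    reader = PdfReader(pdf_path)
    return line_stats(reader.pages[page_index].extract_text())


# Example call (index 6 is the page used in the snippet earlier in this thread):
# print(page_line_stats("missing_newlines.pdf", 6))
```

Fewer lines combined with a much longer maximum line length would be consistent with rows being concatenated instead of separated by newlines.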