ExtractText yields nothing for apparently good PDF #168
Comments
I would like to mention that I have many unprotected, machine-searchable (i.e., non-image) PDF files like this - I just posted one link. Unlike the last issue I opened about a freak PDF with a botched header, in this case PyPDF2 fails to get text from annoyingly many of the files I'm trying to process. Thanks for listening.
@chrisinmtown I ran into a similar issue today; my PDF and yours are "page extraction: not allowed" according to Adobe Reader. :(
@zevav thanks for the comment but please let's not confuse issues. Protected files are a whole different ball of wax and I don't expect PyPDF2 to extract anything from such files given no password. The link I provided above yields a PDF that is not password protected. On this document Adobe Acrobat makes no complaint about extracting text, it happily saves-as plain text and the result is totally usable.
@chrisinmtown hey, sorry to send you in the wrong direction; in Acrobat on my machine that document does show (document info) as "page extraction not allowed."
Thanks for clarifying. Now I'm concerned, I don't want to waste anyone's time here on non-issues! I am using Adobe Acrobat XI on Win7_x64. With this document open in Acrobat I pick File -> Properties, switch to the Security tab of the Document Properties dialog, and there I read "Security Method: No Security", and under the restrictions everything is allowed (Printing, Changing, Copying ...). Could there possibly be a difference in behavior between Reader and Acrobat on this document?
Okay, my bad: I wrote "Acrobat" in my second comment, but I meant "Reader." Here's a screenshot of your file's info in that, on OS X.10, Reader 11.0.10.
I see the exact same thing in the Win7 version of Acrobat Reader XI: Document Assembly and Page Extract Not Allowed; all the rest (Content Copying ..) are Allowed. FWIW, PyPDF2 declares this document unprotected. I'm starting to think the Properties window is reflecting features of Acrobat Reader rather than the document, do you agree? In my tests of Reader on other PDF documents, it invariably declares "Page Extract Not Allowed". Reader by definition cannot extract pages, right? Just to be clear, I am sticking to my position :) that the original document is a valid PDF, unprotected, with text content, and I really would like PyPDF2 to be extended so it can handle this doc.
Dang, you're right! I didn't think to check a PDF that I know PyPDF2 can extract the text of; Reader does indeed show that property for all PDFs. :( What method in PyPDF2 tells you whether or not a document is protected?
The relevant method on PdfFileReader is getIsEncrypted() |
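For anyone landing here later, a minimal sketch of that check, assuming the legacy PyPDF2 1.x API discussed in this thread (`PdfFileReader` and `getIsEncrypted()`); the helper names `describe_protection` and `open_reader` are made up for illustration, and newer releases renamed these calls:

```python
def describe_protection(reader):
    """Report whether a PdfFileReader-like object says the PDF is encrypted."""
    # getIsEncrypted() is the PyPDF2 1.x method named in this thread.
    return "encrypted" if reader.getIsEncrypted() else "not encrypted"


def open_reader(path):
    """Open a PDF with the legacy PyPDF2 1.x reader (hypothetical helper)."""
    # Imported lazily so describe_protection() can be tried without PyPDF2 installed.
    from PyPDF2 import PdfFileReader

    return PdfFileReader(open(path, "rb"))
```

Note that this only reports encryption; it says nothing about whether text extraction will succeed on an unencrypted file, which is the actual problem in this issue.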
I realise this is an old post, did you ever find the reason for text not being extracted? |
Facing same problem. PyPDF2 version 1.26 |
Sadly, the PDF mentioned above is no longer reachable. I think that #924 fixed the issue, so I am closing this issue. It might also be a duplicate of the underlying cause of #242. If you face the same issue, please open a new bug ticket and upload a PDF with the issue (to which you must have the copyright).
The highlight of the 2.1.0 release is the most massive improvement to the text extraction capabilities of PyPDF2 since 2016 🥳🎊 A very big thank you goes to [pubpub-zz](https://github.com/pubpub-zz) who took a lot of time and knowledge about the PDF format to finally get those improvements into PyPDF2. Thank you 🤗💚

In case the new function causes any issues, you can use `_extract_text_old` for the old functionality. Please also open a bug ticket in that case.

There were several people who have attempted to bring similar improvements to PyPDF2. All of those were valuable. The main reason why they didn't get merged is the big amount of open PRs / issues. pubpub-zz was the most comprehensive PR which also incorporated the latest changes of PyPDF2 2.0.0. Thank you to [VictorCarlquist](https://github.com/VictorCarlquist) for #858 and [asabramo](https://github.com/asabramo) for #464 🤗

New Features (ENH):
- Massive text extraction improvement (#924). Closed many open issues:
  - Exceptions / missing spaces in extract_text() method (#17) 🕺
  - Whitespace issues in extract_text() (#42) 💃
  - pypdf2 reads the hifenated words in a new line (#246)
  - PyPDF2 failing to read unicode character (#37)
  - Unable to read bullets (#230)
  - ExtractText yields nothing for apparently good PDF (#168) 🎉
  - Encoding issue in extract_text() (#235)
  - extractText() doesn't work on Chinese PDF (#252)
  - encoding error (#260)
  - Trouble with apostophes in names in text "O'Doul" (#384)
  - extract_text works for some PDF files, but not the others (#437)
  - Euro sign not being recognized by extractText (#443)
  - Failed extracting text from French texts (#524)
  - extract_text doesn't extract ligatures correctly (#598)
  - reading spanish text - mark convert issue (#635)
  - Read PDF changed from text to random symbols (#654)
  - .extractText() reads / as 1. (#789)
- Update glyphlist (#947) - inspired by #464
- Allow adding PageRange objects (#948)

Bug Fixes (BUG):
- Delete .python-version file (#944)
- Compare StreamObject.decoded_self with None (#931)

Robustness (ROB):
- Fix some conversion errors on non conform PDF (#932)

Documentation (DOC):
- Elaborate on PDF text extraction difficulties (#939)
- Add logo (#942)
- rotate vs Transformation().rotate (#937)
- Example how to use PyPDF2 with AWS S3 (#938)
- How to deprecate (#930)
- Fix typos on robustness page (#935)
- Remove scripts (pdfcat) from docs (#934)

Developer Experience (DEV):
- Ignore .python-version file
- Mark deprecated code with no-cover (#943)
- Automatically create Github releases from tags (#870)

Testing (TST):
- Text extraction for non-latin alphabets (#954)
- Ignore PdfReadWarning in benchmark (#949)
- writer.remove_text (#946)
- Add test for Tree and _security (#945)

Code Style (STY):
- black, isort, Flake8, splitting buildCharMap (#950)

Full Changelog: 2.0.0...2.1.0
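The fallback mentioned in the release notes could be used roughly as follows. This is only a sketch: it assumes `_extract_text_old` is a method on the page object that can be called without arguments, and `extract_text_with_fallback` is a made-up helper name, not part of PyPDF2.

```python
def extract_text_with_fallback(page):
    """Try the current extractor first, then the legacy one if nothing came back."""
    text = page.extract_text() or ""
    if text.strip():
        return text
    # `_extract_text_old` is the name given in the 2.1.0 notes. It is private,
    # so guard against it being absent in other versions (assumption).
    legacy = getattr(page, "_extract_text_old", None)
    if callable(legacy):
        return legacy() or ""
    return text
```

Since the method is private, it may disappear in later releases; the `getattr` guard keeps the helper from raising when it does.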
PyPDF2 version 1.23 fails to extract any text from the first 3 pages of this PDF file:
http://emma.msrb.org/EP295293-EP10300-EP632440.pdf
The file seems well-formed to me; both Acrobat and evince display it nicely. The Linux utility pdftotext converts it to text and I see the expected content just fine.
Here's the relevant bit of my little script:
Is there a gotcha here that I'm missing? Pls advise, thanks in advance for help.
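To narrow down a report like this one, a check along the following lines can list which pages come back empty. It is a sketch, not the script from this issue: it assumes the PyPDF2 1.x calls `getNumPages()`, `getPage()` and `extractText()`, and both `find_empty_pages` and the file name `document.pdf` are placeholders.

```python
def find_empty_pages(reader):
    """Return zero-based indices of pages whose extracted text is blank."""
    empty = []
    for index in range(reader.getNumPages()):
        text = reader.getPage(index).extractText() or ""
        if not text.strip():
            empty.append(index)
    return empty


# Possible usage with PyPDF2 1.x (file name is a placeholder):
# from PyPDF2 import PdfFileReader
# with open("document.pdf", "rb") as fh:
#     print(find_empty_pages(PdfFileReader(fh)))
```

Comparing the result against the output of pdftotext for the same pages helps show whether the text is really present in the file.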