Document.select() behaves weirdly in some particular kind of pdf files #3705
Comments
The motivation behind your approach is unclear to me. If the reason is just to limit the number of pages, use a different way of doing this:
text = chr(12).join([page.get_text() for page in doc if page.number < 30])
pathlib.Path("out.txt").write_bytes(text.encode())
I do, however, notice a bug in the base library which in fact yields a PDF from which text can no longer be extracted, as you describe. |
Text from sub-selected out.pdf:
MuPDF issue number: https://bugs.ghostscript.com/show_bug.cgi?id=707890 |
The motivation behind the approach is to limit text extraction based on pages for larger pdf files as the extraction can take more time. |
Ok, I see.
|
Just as intermediate information: |
Probably the approach with the best performance is this:
text = ""
for page in doc:
    if page.number >= 30:  # leave the iterator immediately
        break
    text += page.get_text()
# etc. |
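For reuse, the early-exit loop above can be wrapped in a small helper. This is only a sketch: the names `extract_first_pages`, `max_pages` and `sep` are made up for illustration, and it assumes nothing beyond what the thread already shows, namely that iterating over the document yields pages and that each page has a `get_text()` method.

```python
from itertools import islice


def extract_first_pages(doc, max_pages=30, sep=chr(12)):
    """Return the text of at most `max_pages` leading pages, joined by `sep`.

    `doc` can be any iterable of page-like objects that offer `get_text()`.
    The helper name and its parameters are hypothetical, not PyMuPDF API.
    """
    # islice stops pulling pages from the iterator once max_pages
    # have been consumed, so later pages are never requested.
    return sep.join(page.get_text() for page in islice(doc, max_pages))
```

With an already opened document `doc`, `extract_first_pages(doc, 30)` should cover the same pages as the loop above, joined with form feeds as in the earlier one-liner. It avoids `Document.select()` altogether, so the document itself is not modified.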
Thank you, Jorj. |
Fixed in 1.24.10. |
Description of the bug
Document.select() does not work correctly on a particular kind of PDF file.
I want to extract text from PDF files. If a PDF has more than 30 pages, I extract only the first 30 pages.
The attached PDF file has 33 pages, so the code should select the first 30 pages and extract text from them.
But it only extracts some bullets and dashes from the file, and I can't figure out why.
The code works perfectly on other PDF files.
946f8445-6373-4f32-994c-04c495e2e7e9.pdf
Here is my code.
How to reproduce the bug
You can reproduce the issue by running the given script on the attached PDF file.
PyMuPDF version
1.24.7
Operating system
Linux
Python version
3.10