Extract text from the pdf #1

Open
edsu opened this issue Jul 28, 2018 · 0 comments
Collaborator

edsu commented Jul 28, 2018

It ought to be possible to extract the text directly with PyPDF2 instead of rendering each page as an image and running OCR on it. I had mistakenly assumed that the PDFs contained no text and that the pages were stored only as images. Thanks @pkiraly for the suggestion.
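
A minimal sketch of what direct extraction might look like, assuming a recent PyPDF2 release that exposes `PdfReader`, `reader.pages`, and `page.extract_text()` (older releases used different names). The helper names `texts_from_pages`, `has_text_layer`, and `extract_pdf_text` are made up for illustration and are not part of this project.

```python
from typing import Iterable, List


def texts_from_pages(pages: Iterable) -> List[str]:
    """Return the extracted text of each page, or "" when a page yields none."""
    # Each page only needs an extract_text() method, as PyPDF2 page objects have.
    return [(page.extract_text() or "").strip() for page in pages]


def has_text_layer(page_texts: List[str]) -> bool:
    """True if at least one page produced non-empty text."""
    return any(page_texts)


def extract_pdf_text(path: str) -> List[str]:
    """Read a PDF from disk and return one text string per page."""
    # Imported lazily so the helpers above work without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumption: PyPDF2 2.x or later

    reader = PdfReader(path)
    return texts_from_pages(reader.pages)
```

If `has_text_layer` returns false for a file, that file probably does contain only page images, and OCR would still be the fallback.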

Ironically, getting the text blocks out of the PDF and making sure they are in the correct order might take a fair amount of work compared with running OCR on the image. But the result ought to be more accurate if it can be done.
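
One possible way to handle the ordering, sketched under the assumption that each text block can be obtained together with its position on the page. The `Block` type, the `reading_order` function, and the `line_tolerance` value are hypothetical, and the approach only covers a single-column layout; multi-column pages would need more work.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """Hypothetical text block: x/y position in PDF units plus its text."""
    x: float
    y: float
    text: str


def reading_order(blocks: List[Block], line_tolerance: float = 3.0) -> List[Block]:
    """Sort blocks top-to-bottom, then left-to-right within each line."""
    # PDF coordinates normally start at the bottom-left corner,
    # so a larger y means higher up on the page.
    by_height = sorted(blocks, key=lambda b: -b.y)

    lines: List[List[Block]] = []
    for block in by_height:
        # Blocks within line_tolerance of the line's first block share that line.
        if lines and abs(lines[-1][0].y - block.y) <= line_tolerance:
            lines[-1].append(block)
        else:
            lines.append([block])

    ordered: List[Block] = []
    for line in lines:
        ordered.extend(sorted(line, key=lambda b: b.x))
    return ordered
```

The tolerance is there because blocks on the same visual line often have slightly different baselines, so an exact comparison on `y` would split them apart.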
