Extract text from the pdf #1

edsu · 2018-07-28T14:09:29Z

It ought to be possible to extract the text directly with PyPDF2 instead of extracting the image of each page and running OCR on it. I had mistakenly assumed that the PDFs contained only page images and no embedded text. Thanks @pkiraly for the suggestion.
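As a rough sketch of what direct extraction might look like, assuming a PyPDF2 release that provides `PdfReader` and `page.extract_text()` (2.x or later; older releases used different names). The function names and the example path are made up for illustration, and this has not been tried against the PDFs in question:

```python
from typing import List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the embedded text of each page, one string per page."""
    # Imported here so the helper below can be used without PyPDF2 installed.
    # Assumes a PyPDF2 release that provides PdfReader (2.x or later).
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may return an empty string for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


def has_text_layer(page_texts: List[str]) -> bool:
    """True if at least one page yielded non-whitespace text."""
    return any(text.strip() for text in page_texts)


# Hypothetical usage; "issue.pdf" is a placeholder path:
# texts = extract_page_texts("issue.pdf")
# if not has_text_layer(texts):
#     print("No embedded text found; OCR may still be needed.")
```

Checking `has_text_layer` first would be a cheap way to confirm that a given PDF really carries text before dropping the OCR step for it.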

Ironically, getting the text blocks out of the PDF and making sure they are in the correct order might take a fair amount of work compared with running OCR on the image. But it ought to be more accurate if it can be done.
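To give a sense of the ordering problem, here is a minimal sketch that sorts positioned text blocks into reading order. It assumes each block is already available as an `(x, y, text)` tuple in PDF coordinates (origin at the bottom left, so larger `y` is higher on the page) and that the page has a single column. The tuple shape, the `order_blocks` name, and the `line_tolerance` parameter are all hypothetical; how the positions are obtained from the PDF is left open.

```python
from typing import List, Tuple

# Hypothetical block shape: (x, y, text) in PDF coordinates.
Block = Tuple[float, float, str]


def order_blocks(blocks: List[Block], line_tolerance: float = 2.0) -> List[str]:
    """Return block texts top to bottom, then left to right within a line."""
    # Larger y is higher on the page, so sort by descending y.
    by_height = sorted(blocks, key=lambda b: -b[1])

    # Group blocks whose y is within line_tolerance of the line's first block.
    lines: List[List[Block]] = []
    for block in by_height:
        if lines and abs(lines[-1][0][1] - block[1]) <= line_tolerance:
            lines[-1].append(block)
        else:
            lines.append([block])

    # Within each line, read left to right.
    ordered: List[str] = []
    for line in lines:
        for _x, _y, text in sorted(line, key=lambda b: b[0]):
            ordered.append(text)
    return ordered


sample = [(300.0, 700.0, "world"), (50.0, 701.0, "Hello"), (50.0, 650.0, "Second")]
print(order_blocks(sample))
```

Multi-column layouts would need more than this, since a plain top-to-bottom sort interleaves the columns.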
