It ought to be possible to extract the text directly with PyPDF2 instead of extracting the image of the page and running OCR on that. I was under the mistaken assumption that the text was not present in the PDFs and was only in there as images. Thanks @pkiraly for the suggestion.
Ironically, getting the text blocks out of the PDF and making sure they are in the correct order might take a fair amount of work compared with running OCR on the image. But it ought to be more accurate if it can be done.
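A rough sketch of what this might look like, in case it helps the discussion. The `PdfReader` class and `extract_text()` method are from recent PyPDF2 releases (older versions used different names), so that part assumes a reasonably current install. `order_blocks`, the `(x, y, text)` tuple shape, and the `line_tolerance` value are hypothetical, made up here for illustration; they are not part of PyPDF2, and the ordering only handles a simple single-column layout.

```python
from typing import List, Tuple

# Hypothetical block shape: (x, y, text), with y growing upward from the
# bottom of the page, as in PDF user space.
Block = Tuple[float, float, str]


def extract_page_text(pdf_path: str, page_index: int) -> str:
    """Pull the embedded text of one page, with no OCR involved."""
    # Imported here so the ordering helper below works without PyPDF2.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return reader.pages[page_index].extract_text()


def order_blocks(blocks: List[Block], line_tolerance: float = 2.0) -> List[Block]:
    """Sort blocks top to bottom, then left to right within each line.

    Blocks whose y values differ by at most line_tolerance from the first
    block of the current line are treated as sitting on the same line.
    This is an assumption that suits single-column pages only.
    """
    # Highest y first, i.e. the top of the page first.
    by_height = sorted(blocks, key=lambda b: -b[1])

    lines: List[List[Block]] = []
    for block in by_height:
        if lines and abs(lines[-1][0][1] - block[1]) <= line_tolerance:
            lines[-1].append(block)
        else:
            lines.append([block])

    ordered: List[Block] = []
    for line in lines:
        # Left to right within a line.
        ordered.extend(sorted(line, key=lambda b: b[0]))
    return ordered


sample = [
    (200.0, 700.0, "world"),
    (50.0, 700.5, "Hello"),
    (50.0, 650.0, "second"),
    (120.0, 650.0, "line"),
]
ordered_text = [text for _, _, text in order_blocks(sample)]
```

Multi-column pages would need the blocks split into columns first, which is probably where most of the extra work mentioned above would go.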