Azure ocr with ocrmypdf #595
OCRmyPDF has a plugin interface that would allow you to replace Tesseract with a different OCR engine such as Azure. To the best of my knowledge no one has published a plugin that does this (or for that matter, any plugin, since the plugin interface is quite new). OCRmyPDF can only interpret the hOCR format or a text only PDF, so you'd have to convert Azure's output to one of those two as well, since unfortunately it does not support either standard (last time I looked, anyway). |
The Azure output looks something like: Is it possible to convert this into one of the formats that you mentioned? |
If you look at the hOCR format example given on Wikipedia, I would say yes. Besides, have a look here: Another alternative could be https://github.com/JaidedAI/EasyOCR, but it outputs a simple list only. I think that could be converted to hOCR easily. |
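As a rough illustration of the list-to-hOCR idea, here is a minimal sketch. It assumes the input is a list of `(quad, text, confidence)` tuples, which is the shape EasyOCR's `readtext` returns by default, where `quad` is four `[x, y]` corner points in pixels. The function names and the choice to emit each detection as one `ocr_line` holding a single `ocrx_word` are simplifications made for this example, not something taken from OCRmyPDF or EasyOCR.

```python
from html import escape


def quad_to_bbox(quad):
    """Collapse four [x, y] corner points into an axis-aligned (x0, y0, x1, y1) box."""
    xs = [int(round(x)) for x, _ in quad]
    ys = [int(round(y)) for _, y in quad]
    return min(xs), min(ys), max(xs), max(ys)


def easyocr_to_hocr(results, page_width, page_height):
    """Render (quad, text, confidence) tuples as a single-page hOCR document.

    Assumption: each detection becomes one ocr_line containing one ocrx_word,
    even when the detected text is a multi-word phrase.
    """
    body = []
    for i, (quad, text, conf) in enumerate(results, start=1):
        x0, y0, x1, y1 = quad_to_bbox(quad)
        box = f"bbox {x0} {y0} {x1} {y1}"
        wconf = round(conf * 100)  # hOCR confidences are percentages
        body.append(
            f"<span class='ocr_line' id='line_{i}' title='{box}'>"
            f"<span class='ocrx_word' id='word_{i}' title='{box}; x_wconf {wconf}'>"
            f"{escape(text)}</span></span>"
        )
    lines = "\n".join(body)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        "<head><meta name='ocr-capabilities' content='ocr_page ocr_line ocrx_word'/></head>\n"
        f"<body><div class='ocr_page' id='page_1' title='bbox 0 0 {page_width} {page_height}'>\n"
        f"{lines}\n"
        "</div></body></html>\n"
    )
```

The page width and height have to come from the image that was sent to the OCR engine, since the detection list itself does not carry them.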
I asked about the Google API in #915, which is essentially the same question as yours. The ideal would be to grab the information from these services' APIs and then "paste" it inside the PDF as invisible text, like OCRmyPDF already does. |
It's not that hard to implement. I have some very basic Python code that uses the Google Vision API to get better results. For the orientation of the page I use Tesseract, because it's the easiest way. But for text recognition the image is sent to the Vision API, the JSON response gets converted to hOCR, and you have your text layer. |
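The first half of a JSON-to-hOCR conversion is pulling word texts and boxes out of the response. The sketch below is not the commenter's code. It assumes a Vision-style response dict with a `fullTextAnnotation` containing `pages`, `blocks`, `paragraphs`, `words` and `symbols`, where each word has `boundingBox.vertices` as a list of `{"x": ..., "y": ...}` dicts. The function names are hypothetical.

```python
def vertices_to_bbox(vertices):
    """Turn a list of {"x": ..., "y": ...} dicts into (x0, y0, x1, y1).

    Assumption: a coordinate equal to 0 may be omitted from the JSON,
    so missing keys default to 0.
    """
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    return min(xs), min(ys), max(xs), max(ys)


def iter_words(response):
    """Yield (text, bbox) for every word in a Vision-style response dict."""
    annotation = response.get("fullTextAnnotation", {})
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    # A word's text is the concatenation of its symbols.
                    text = "".join(s.get("text", "") for s in word.get("symbols", []))
                    yield text, vertices_to_bbox(word["boundingBox"]["vertices"])
```

Each `(text, bbox)` pair can then be written out as an hOCR `ocrx_word` span with a `title='bbox x0 y0 x1 y1'` attribute.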
@kkrell2016 May I ask how the conversion from JSON to hOCR happens? Have you written your own script for that purpose? |
@isspid I found a project called gcv2hocr and combined it with some custom Python code. The custom Python script can be run as a plugin in OCRmyPDF. I also uploaded it to my GitHub; it should be publicly available. I had to modify gcv2hocr a bit to make it work with the current Google Vision API. If you have any questions, please contact me. |
I think the above plugin interface (e.g. Then I can call
# something like this for multiple pages?
helper = hocrtransform.HocrTransform(
    hocr_filename=hocr_file,  # or list_of_hocr_files
    dpi=150
)
helper.to_pdf(out_filename=output_pdf)  # a multi-page PDF
Our use case is that we send a batch of pages to Azure OCR (otherwise it would be very slow to process many pages) and it returns an Azure OCR result object for all the pages. I can loop through each page object of the Azure OCR result and generate either a list of hOCR objects (one per page) or a single hOCR object (if that's possible). |
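The per-page loop described here could look roughly like the following sketch, which produces one hOCR string per page. The input shape is an assumption modelled on the Read API's `readResults` entries: each page dict has `width`, `height`, `unit`, and `lines`, each line has a flat eight-number `boundingBox` and a list of `words` with `boundingBox`, `text` and `confidence`. Coordinates in inches are scaled to pixels with a `dpi` value. All function names are hypothetical.

```python
from html import escape


def flat_quad_to_bbox(quad, scale):
    """Convert [x1, y1, ..., x4, y4] into integer pixel (x0, y0, x1, y1)."""
    xs = [round(v * scale) for v in quad[0::2]]
    ys = [round(v * scale) for v in quad[1::2]]
    return min(xs), min(ys), max(xs), max(ys)


def page_to_hocr(page, dpi=150):
    """Render one page dict as a standalone hOCR document string."""
    # Assumption: unit is "inch" for PDF input and "pixel" for image input.
    scale = dpi if page.get("unit") == "inch" else 1
    width = round(page["width"] * scale)
    height = round(page["height"] * scale)
    parts = [f"<div class='ocr_page' title='bbox 0 0 {width} {height}'>"]
    for line in page.get("lines", []):
        lbox = "bbox {} {} {} {}".format(*flat_quad_to_bbox(line["boundingBox"], scale))
        parts.append(f"<span class='ocr_line' title='{lbox}'>")
        for word in line.get("words", []):
            wbox = "bbox {} {} {} {}".format(*flat_quad_to_bbox(word["boundingBox"], scale))
            wconf = round(word.get("confidence", 0) * 100)
            text = escape(word["text"])
            parts.append(
                f"<span class='ocrx_word' title='{wbox}; x_wconf {wconf}'>{text}</span>"
            )
        parts.append("</span>")
    parts.append("</div>")
    body = "\n".join(parts)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        f"<head><title></title></head>\n<body>\n{body}\n</body></html>\n"
    )


def azure_pages_to_hocr(pages, dpi=150):
    """Return a list of hOCR strings, one per page dict."""
    return [page_to_hocr(page, dpi=dpi) for page in pages]
```

Keeping one hOCR document per page fits how OCRmyPDF works through a file page by page. Whether a single multi-page hOCR file would be accepted is not something this sketch addresses.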
I guess I can call the Azure engine in the global part of the file and cache it, and then when I noticed |
Curious about all this myself. Anyone have a working example of converting e.g. the Azure output to hOCR? |
@shamoon Looks like we found the same thread ^^ |
Although the documentation says they can produce hOCR, I cannot find any workable solution. Issue: Azure-Samples/cognitive-services-REST-api-samples#109 |
I have an hOCR file and a PDF file, but how do I actually apply the hOCR file to the PDF? It came from Google Cloud via this: https://cloud.google.com/document-ai/docs/samples/documentai-toolbox-document-to-hocr |
@jbarlow83 I'm considering building Azure support for OCRmyPDF. Would you recommend going the hOCR route or the direct-to-PDF route from the EasyOCR plugin? |
Personally I say just direct to PDF. The PDF standard is already a nightmare as it is without having to figure out how to even create the hOCR file in the first place 💀 |
OCRmyPDF works great with PDFs of scanned images. However, with handwritten letters, the Tesseract OCR engine often struggles.
How do I use the Azure OCR API as the OCR engine while keeping everything else the same?