Azure ocr with ocrmypdf #595

Open
sandipan1 opened this issue Jul 20, 2020 · 16 comments

Comments

@sandipan1

ocrmypdf works great with PDFs of scanned images. However, with handwritten letters the Tesseract OCR engine often struggles.
How do I use the Azure OCR API as the OCR engine while keeping everything else the same?

@jbarlow83
Collaborator

OCRmyPDF has a plugin interface that would allow you to replace Tesseract with a different OCR engine such as Azure. To the best of my knowledge no one has published a plugin that does this (or for that matter, any plugin, since the plugin interface is quite new).

OCRmyPDF can only interpret the hOCR format or a text only PDF, so you'd have to convert Azure's output to one of those two as well, since unfortunately it does not support either standard (last time I looked, anyway).

@sandipan1
Author

The Azure output looks something like this:
{"status": "Succeeded", "recognitionResult": {"lines": [{"boundingBox": [292, 146, 780, 144, 781, 218, 293, 220], "text": "string1", "words": [{"boundingBox": [297, 150, 774, 145, 775, 218, 300, 218], "text": "string2"}]}, {"boundingBox": [327, 215, 748, 219, 747, 255, 326, 252], "text": "string3 string4", "words": [{"boundingBox": [330, 219, 496, 219, 498, 253, 332, 251], "text": "string3"}, {"text": "string4"}]}]}}

Is it possible to convert this into one of the formats that you mentioned?

@PackElend

> Is it possible to convert this into one of the formats that you mentioned?

If you look at the hOCR format example given on Wikipedia, I would say yes. Also, have a look here:
https://stackoverflow.com/questions/62074677/generate-hocr-from-microsoft-computer-vision-ocr
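
A minimal sketch of such a conversion, using the JSON shape posted above (`recognitionResult` → `lines` → `words`, each with an 8-number `boundingBox`). The function names are hypothetical, and the page size has to be passed in because the sample does not include it. Wrapping everything in a single `ocr_par` paragraph is a guess at what an hOCR consumer wants; this has not been tested against OCRmyPDF's hOCR parser.

```python
from html import escape


def polygon_to_bbox(polygon):
    """Collapse an 8-number polygon (x1, y1, ..., x4, y4) into the
    axis-aligned (left, top, right, bottom) box that hOCR uses."""
    xs, ys = polygon[0::2], polygon[1::2]
    return min(xs), min(ys), max(xs), max(ys)


def azure_to_hocr(result, page_width, page_height):
    """Render one page of Azure-style output as an hOCR string.

    page_width and page_height (pixels) are caller-supplied assumptions:
    the JSON sample in this thread does not include the page size.
    """
    parts = [
        "<html><body>",
        f'<div class="ocr_page" title="bbox 0 0 {page_width} {page_height}">',
        '<p class="ocr_par">',
    ]
    for line in result["recognitionResult"]["lines"]:
        left, top, right, bottom = polygon_to_bbox(line["boundingBox"])
        parts.append(
            f'<span class="ocr_line" title="bbox {left} {top} {right} {bottom}">'
        )
        for word in line.get("words", []):
            if "boundingBox" not in word:
                continue  # no coordinates, so the word cannot be placed
            left, top, right, bottom = polygon_to_bbox(word["boundingBox"])
            parts.append(
                f'<span class="ocrx_word" '
                f'title="bbox {left} {top} {right} {bottom}">'
                f'{escape(word["text"])}</span>'
            )
        parts.append("</span>")
    parts += ["</p>", "</div>", "</body></html>"]
    return "\n".join(parts)
```

Handwriting is often slanted, so reducing each polygon to an axis-aligned box loses the rotation; that is usually acceptable for an invisible search layer but worth keeping in mind.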

Another alternative could be https://github.com/JaidedAI/EasyOCR, but it outputs only a simple list. I think that could be converted to hOCR easily.
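
As a rough illustration of the EasyOCR case, assuming each list entry is a `(corner_points, text, confidence)` tuple with four `[x, y]` corners and a confidence between 0 and 1. The helper name is hypothetical. EasyOCR entries are often phrases rather than single words, so tagging them as `ocrx_word` is an approximation.

```python
from html import escape


def easyocr_to_hocr_words(detections):
    """Turn EasyOCR-style detections into hOCR word spans.

    Each detection is assumed to be (corner_points, text, confidence), with
    corner_points a list of four [x, y] pairs and confidence in 0..1.
    """
    spans = []
    for n, (points, text, confidence) in enumerate(detections, 1):
        xs = [int(round(x)) for x, _ in points]
        ys = [int(round(y)) for _, y in points]
        spans.append(
            f'<span class="ocrx_word" id="word_{n}" '
            f'title="bbox {min(xs)} {min(ys)} {max(xs)} {max(ys)}; '
            f'x_wconf {int(round(confidence * 100))}">{escape(text)}</span>'
        )
    return spans
```

The spans still need to be wrapped in line and page elements before they form a complete hOCR document.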

@All3xJ

All3xJ commented Feb 11, 2022

I asked about the Google API in #915, but it is the same question as yours. The ideal would be to grab the information from these services' APIs and then "paste" it into the PDF as invisible text, like OCRmyPDF already does.

@kkrell2016

> I asked about the Google API in #915, but it is the same question as yours. The ideal would be to grab the information from these services' APIs and then "paste" it into the PDF as invisible text, like OCRmyPDF already does.

It's not that hard to implement. I have some very basic Python code that uses the Google Vision API to get better results. For the orientation of the page I use Tesseract, because that's the easiest way. For text recognition, the image is sent to the Vision API, the JSON response is converted to hOCR, and you have your text layer.
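
For the JSON-to-hOCR step, the first job is pulling words and boxes out of the response. Below is a small sketch assuming the Vision response has been loaded as a dict with the `fullTextAnnotation` → `pages` → `blocks` → `paragraphs` → `words` → `symbols` nesting. The function name is hypothetical, and this is not the code described in the comment above.

```python
def gcv_words(response):
    """Yield (text, (left, top, right, bottom)) for every word in a
    Vision-style response dict."""
    for page in response["fullTextAnnotation"]["pages"]:
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    # A word's text is the concatenation of its symbols.
                    text = "".join(s["text"] for s in word["symbols"])
                    vertices = word["boundingBox"]["vertices"]
                    # Zero-valued coordinates may be omitted from the JSON.
                    xs = [v.get("x", 0) for v in vertices]
                    ys = [v.get("y", 0) for v in vertices]
                    yield text, (min(xs), min(ys), max(xs), max(ys))
```

Each yielded pair maps directly onto an hOCR `ocrx_word` span with a `bbox` title.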

@isspid

isspid commented Mar 29, 2023

@kkrell2016 May I ask how the conversion from JSON to hOCR happens? Have you written your own script for that purpose?

@kkrell2016

@isspid I found a project called gcv2hocr and combined it with some custom Python code. The custom Python script can be run as a plugin in ocrmypdf. I also uploaded it to my GitHub; it should be publicly available. I had to modify gcv2hocr a bit to make it work with the current Google Vision API.

If you have any questions, please contact me.

@RAbraham

I think the above plugin interface (e.g. generate_pdf(input_file, output_pdf, output_text, options)) is called once per page rather than once for the whole PDF? Is there an interface that receives the entire PDF, so that we can return either a list of hOCR files generated from an Azure OCR result or a single hOCR file covering many pages (if the hOCR format allows that; I still have to learn it)?

Then I could call:

    from ocrmypdf import hocrtransform

    # Something like this for multiple pages?
    # hocr_file and output_pdf are placeholders for real paths.
    helper = hocrtransform.HocrTransform(
        hocr_filename=hocr_file,  # or list_of_hocr_files
        dpi=150,
    )

    helper.to_pdf(out_filename=output_pdf)  # a multi-page PDF

Our use case is that we send a batch of pages to Azure OCR (processing many pages one at a time would be very slow for us), and it returns a single Azure OCR result object for all the pages. I can loop through each page object in that result and generate either a list of hOCR objects (one per page) or a single hOCR object (if that's possible).

@RAbraham

I guess I can call the Azure engine in the global part of the file, cache the result, and then pick it up from there when generate_pdf is called. But how do I know which key in the cache to pick? For example, if I key the cache by page number, I won't know from generate_pdf which page it is being called for, as it does not provide a page number, IIUC.

I noticed generate_pdfa. Would that be useful here? I could call the helper above in a loop and then merge the single-page files produced by helper.to_pdf?
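
One way the cache idea might be sketched, under two loudly stated assumptions that are not confirmed anywhere in this thread: first, that the batch response has a top-level `recognitionResults` list whose items carry a 1-based `page` field; second, that the per-page `input_file` handed to the plugin has a name starting with a zero-padded page number (something like `000003_ocr.png`). All names below are hypothetical.

```python
import re
from pathlib import Path

# Hypothetical module-level cache: page number -> per-page OCR result.
_page_cache = {}


def cache_batch_result(batch_result):
    """Split a batch response into per-page entries keyed by page number.

    Assumes a top-level "recognitionResults" list whose items carry a
    1-based "page" field; adjust to whatever your API version returns.
    """
    for page in batch_result["recognitionResults"]:
        _page_cache[page["page"]] = page


def page_number_from_path(input_file):
    """Guess the page number from leading digits in the file name.

    Returns None when the name does not start with digits. That the name
    encodes the page number at all is an assumption, not a documented API.
    """
    match = re.match(r"(\d+)", Path(input_file).name)
    return int(match.group(1)) if match else None


def lookup_page(input_file):
    """Return the cached result for the page behind input_file, or None."""
    return _page_cache.get(page_number_from_path(input_file))
```

If pages are processed in separate worker processes, a module-level dict filled in one process may not be visible in the others, so writing the per-page results to files on disk could be a safer variant of the same idea.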

@shamoon

shamoon commented Mar 2, 2024

Curious about all this myself. Anyone have a working example of converting e.g. the Azure output to hOCR?

@deajan

deajan commented Mar 15, 2024

@shamoon Looks like we found the same thread ^^
I'm currently trying to make EasyOCR compatible with paperless-ngx, see paperless-ngx/paperless-ngx#6056 (reply in thread).
I found an Azure-to-hOCR script, and will probably write my own for EasyOCR.

@hcoona

hcoona commented Jun 27, 2024

@ThioJoe

ThioJoe commented Nov 28, 2024

I have an hOCR file and a PDF file, but how do I actually apply the hOCR file to the PDF?

It came from Google Cloud via this: https://cloud.google.com/document-ai/docs/samples/documentai-toolbox-document-to-hocr

@jonashaag

@jbarlow83 I'm considering building Azure support for OCRmyPDF. Would you recommend going the hOCR route or the direct-to-PDF route from the EasyOCR plugin?

@ThioJoe

ThioJoe commented Dec 17, 2024

> @jbarlow83 I'm considering building Azure support for OCRmyPDF. Would you recommend going the hOCR route or the direct-to-PDF route from the EasyOCR plugin?

Personally, I'd say just go direct to PDF. The PDF standard is already a nightmare as it is, without also having to figure out how to create the hOCR file in the first place 💀
