In this problem we are given a set of restaurant bills in PDF format. We have to extract the text from the bill images contained in the ".pdf" files.
The major Python modules used to solve this problem are as follows:
- Wand
- OpenCV
- PyTesseract OCR
The approach is as follows:
- Use Wand to convert each PDF to an image of any resolution (here we have used 700 x 700) and save it in the images folder.
- Read the generated image using OpenCV.
- Use PyTesseract to read text from the images and save the data obtained in a JSON file (here json/img-to-text.json).
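The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the project's notebooks: the function names, the output file naming, the BGR-to-RGB conversion, and the JSON layout (image path mapped to extracted text) are all assumptions. The third-party imports are placed inside the functions so the file can be loaded even before Wand, OpenCV, and PyTesseract are fully set up.

```python
import json
import os


def pdf_to_images(pdf_path, out_dir="images", resolution=700):
    """Render each page of a PDF to a PNG with Wand and return the file paths."""
    from wand.image import Image  # needs ImageMagick (and Ghostscript for PDFs)

    os.makedirs(out_dir, exist_ok=True)
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    paths = []
    with Image(filename=pdf_path, resolution=resolution) as pdf:
        for page_no, page in enumerate(pdf.sequence, start=1):
            with Image(image=page) as img:
                img.format = "png"
                # Hypothetical naming scheme: <pdf name>-<page number>.png
                out_path = os.path.join(out_dir, f"{stem}-{page_no}.png")
                img.save(filename=out_path)
                paths.append(out_path)
    return paths


def image_to_text(image_path):
    """Read an image with OpenCV and extract its text with PyTesseract."""
    import cv2
    import pytesseract

    img = cv2.imread(image_path)
    if img is None:  # cv2.imread returns None instead of raising
        raise FileNotFoundError(image_path)
    # OpenCV loads pixels as BGR; convert to RGB before OCR (an assumption,
    # the notebooks may preprocess differently).
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    return pytesseract.image_to_string(rgb)


def save_results(results, json_path="json/img-to-text.json"):
    """Write the {image path: text} mapping to a JSON file."""
    os.makedirs(os.path.dirname(json_path) or ".", exist_ok=True)
    with open(json_path, "w", encoding="utf-8") as fh:
        json.dump(results, fh, indent=2, ensure_ascii=False)


def run(pdf_path):
    """Convert one PDF, OCR every page image, and save the results."""
    results = {path: image_to_text(path) for path in pdf_to_images(pdf_path)}
    save_results(results)
    return results
```

Calling `run("some-bill.pdf")` (a placeholder file name) would write the page images to `images/` and the extracted text to `json/img-to-text.json`.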
In this project, Python version 3.7.7 is used.
First create a new Anaconda environment and then activate it:
# Create environment.
conda create -n bill-reader python=3.7
# Activate environment.
conda activate bill-reader
Then install the following Python packages using pip:
$ pip install wand
$ pip install pytesseract
$ pip install opencv-python
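As a quick sanity check after installation, a small script along these lines can report whether the three packages are visible to the active environment. It is only a sketch: `check_modules` is a hypothetical helper name, and finding a package this way does not guarantee that it imports cleanly (for example, `wand` can still fail at import time if ImageMagick itself is missing), so the manual import checks described in the steps are still worth doing.

```python
import importlib.util


def check_modules(names=("wand", "pytesseract", "cv2")):
    """Return {module name: True/False} without actually importing anything."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```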
STEP-1
Open a Python terminal by typing the following command in the Anaconda command prompt:
$ python
This will open a Python terminal.
STEP-2
from wand.image import Image as wi
If you get any error, proceed to Step-3:
Visit the following link and follow the instructions given for your respective OS.
For Windows.
Checkboxes that must be ticked while installing are as follows:
Then repeat Steps 1 and 2 to check again. Hopefully this will solve the import error with the wand module.
STEP-3
If there is no error, then the wand module is working fine, and we can exit the terminal.
quit()
STEP-4
Now open the 01-pdf-to-image.ipynb file and run the cells in your Jupyter notebook.
If you get a DelegateError, do the following:
- INSTALL GHOSTSCRIPT
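A DelegateError from Wand when reading a PDF usually indicates that ImageMagick could not hand the file off to Ghostscript. One way to see whether a Ghostscript executable is reachable is sketched below. The helper name `find_ghostscript` is hypothetical, and the candidate names assume the usual executables (`gs` on Linux/macOS, `gswin64c` or `gswin32c` on Windows); an installation that is not on PATH will not be detected.

```python
import shutil


def find_ghostscript(candidates=("gs", "gswin64c", "gswin32c")):
    """Return the path of the first Ghostscript executable on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    location = find_ghostscript()
    print(location or "Ghostscript not found on PATH")
```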
STEP-1
Open a Python terminal by typing the following command in the Anaconda command prompt:
$ python
This will open a Python terminal.
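Since the installer in the next step depends on whether Python is 32-bit or 64-bit, one way to check this from the terminal is shown below. This is a small sketch using only the standard library; `python_bitness` is a hypothetical helper name.

```python
import struct


def python_bitness():
    """Return 32 or 64, based on the pointer size of the running interpreter."""
    return struct.calcsize("P") * 8


print(f"This Python is {python_bitness()}-bit")
```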
STEP-2
Visit this link and download the right installer according to your Python architecture (32 or 64).
Then install it and make a note of the installation location.
Then open the '02-image-to-text.ipynb' file and, in cell 1, update the path mentioned there to your installation location.
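The notebook cell itself is not reproduced here, so the following is only a guess at what such a path update could look like. It assumes the path in question is the Tesseract executable that PyTesseract calls, which is normally configured through `pytesseract.pytesseract.tesseract_cmd`. The helper names and the example directory are hypothetical.

```python
import os
import shutil


def resolve_tesseract_cmd(install_dir=None):
    """Return a path to the tesseract executable, or None if none is found."""
    exe = "tesseract.exe" if os.name == "nt" else "tesseract"
    if install_dir:
        candidate = os.path.join(install_dir, exe)
        if os.path.isfile(candidate):
            return candidate
    # Fall back to whatever is on PATH, if anything.
    return shutil.which("tesseract")


def configure_pytesseract(install_dir=None):
    """Point PyTesseract at the resolved executable and return its path."""
    import pytesseract  # imported lazily so the helper above works without it

    cmd = resolve_tesseract_cmd(install_dir)
    if cmd is None:
        raise FileNotFoundError("tesseract executable not found")
    pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```

For example, `configure_pytesseract(r"C:\path\to\your\install\location")` would use the directory noted during installation; the path shown is a placeholder, not a real default.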
STEP-1
Open a Python terminal by typing the following command in the Anaconda command prompt:
$ python
This will open a Python terminal.
STEP-2
import cv2
If you get any error, proceed to Step-3:
Refer to this answer: https://stackoverflow.com/questions/19876079/cannot-find-module-cv2-when-using-opencv
Otherwise, refer to this: opencv_installation_instructions
It worked for me.
- First run 01-pdf-to-image.ipynb. It will take some time to execute completely, depending on your computer hardware.
- Now run 02-image-to-text.ipynb. It will also take some time to execute.
MIT © 2020 Deepankar
1.0.0