<!DOCTYPE html>
<html>
<head>
<title>Rohit Saluja Webpage</title>
<meta charset="UTF-8">
</head>
<body>
<hr>
<h2 style="color:blue;">IndicOCR:</h2>
<p>Optical Character Recognition (OCR) is the process of converting document images into an editable electronic format. This has many advantages, such as data compression, enabling search and edit operations on the text in images, and creating databases for other applications such as Machine Translation and Speech Recognition, as well as enhancing dictionaries and language models.</p>
<p>OCR in Indian languages is quite challenging because of their rich inflection. In our experiments with open-source and commercial OCR systems, we observed Word Error Rates (WER) of around 20-50% on typewriter-printed documents. Moreover, even a highly accurate OCR system, with an accuracy as high as 90%, is not useful unless it is aided by a mechanism to identify errors. We therefore started with the problem of developing an end-to-end framework for error detection and correction in Indic OCR. We have beaten the state-of-the-art results in “Error Detection in Indic-OCR” for languages with varied inflections and have solved the out-of-vocabulary problem for “Error Correction in Indic-OCR” in our ICDAR-2017 conference paper. We have also developed OpenOCRCorrect, an adaptive framework for correcting OCR errors in Indian documents. Our workshop paper at ICDAR-OST 2017 presents the reduction in human effort achieved with OpenOCRCorrect.</p>
<p>While developing an end-to-end OCR system, we are now facing issues related to layout segmentation in highly fragmented Indic documents. We plan to use Differentiable Neural Computers for this task. In addition to marking segments as text, heading, list, table, image, etc., we are also interested in identifying the font of each text segment (or the nearest machine font if the document is typewriter-printed).</p>
<p>
1. <a href="icdar2017.html">ICDAR2017</a> contains the model details, dataset, etc. for the ICDAR 2017 conference paper.
</p>
<p>
2. <a href="icdarOst2017.html">ICDAR-OST 2017</a> contains the video, framework, etc. for the ICDAR-OST 2017 workshop paper.
</p>
</body>
</html>