In the modern era, everybody wants information to be stored, retrieved and processed as fast as possible. Paper-based documentation has become largely obsolete; information is now captured in digital (electronic) format and stored in a secure place. The methodology discussed in the present paper reduces storage cost, ensures security and increases data accessibility. Most organizations have already stepped into the digital world. The concern is that their past information has yet to be migrated from image or paper-based documents into digital text-based documents. Every such organization therefore expects progress in this kind of conversion or migration, or a modern methodology that serves the purpose.
This need has made Optical Character Recognition an emerging technology in image processing.
Here, we present a brief analysis to identify, understand, evaluate and execute this methodology as a new solution. We also propose ways to overcome the issues identified in existing Optical Character Recognition solutions.
[ The present work proposes a solution only for typed text characters, not for handwritten text ]
1. INTRODUCTION
Optical Character Recognition, henceforth termed OCR, is a process of extracting relevant information from raw data. In the text domain it is defined as “sentence, word or character extraction from an image source, in order to preserve the content of a high volume of image data as lightweight text data in an electronic format”. The flow diagram (Image 1.1) depicts the process.
There are six steps involved, two of which need user intervention (feeding in the input image and storing the result in the preferred document format). The other four steps are explained in detail below:
a. Data Feed-In:
Scanned documents are provided to the system as images, via scanners or by other means.
The input can be a single file or multiple files.
b. Pre-Processing:
Some existing solutions do not treat this step as mandatory, but it improves the output quality.
Different pre-processing techniques are used, such as noise removal (salt-and-pepper noise removal, ink blot removal, etc.) and zero-degree positioning (a scanned document might be tilted, which increases ambiguity and gives poor results, because text direction always matters in OCR processing).
Pre-processing is achieved through basic image processing techniques.
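As an illustration of the noise removal mentioned above, the following is a minimal sketch of a 3x3 median filter, a common remedy for salt-and-pepper noise. It is not necessarily the filter used in the present work; the function name `median_filter_3x3` and the representation of a grayscale image as a list of rows of integers are assumptions made for this example.

```python
from statistics import median


def median_filter_3x3(image):
    """Return a filtered copy of a grayscale image (list of rows of ints).

    Each pixel is replaced by the median of its 3x3 neighbourhood.
    Coordinates outside the image are clamped to the nearest valid pixel.
    """
    height, width = len(image), len(image[0])
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = []
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny = min(max(y + dy, 0), height - 1)
                    nx = min(max(x + dx, 0), width - 1)
                    window.append(image[ny][nx])
            # Nine values, so the median is always one of the pixel values.
            result[y][x] = int(median(window))
    return result


# Hypothetical input: a white page (255) with one isolated black "pepper" pixel.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 0
cleaned = median_filter_3x3(page)
```

An isolated noisy pixel is outvoted by its neighbours, whereas a character stroke several pixels wide largely survives, which is why the median is usually preferred over a plain average for this kind of noise.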
c. Character Recognition:
After pre-processing, grouped text or words are segmented into single characters, which are processed individually. Each character is recognized according to the technology used. There are different approaches to character recognition; the pattern-based approach is the most commonly used and provides more accuracy than the other approaches.
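The segmentation of grouped text into single characters can be sketched with a vertical projection profile, one common technique that may differ from the one used in the present work. The sketch assumes a single binarized text line stored as rows of 0/1 values (1 = ink) and characters separated by at least one empty column; the name `segment_columns` is hypothetical.

```python
def segment_columns(binary):
    """Split a binarized text line into character spans.

    `binary` is a list of rows of 0/1 values (1 = ink). Returns a list of
    (start, end) column indices, end exclusive, one per run of inked columns.
    """
    width = len(binary[0])
    # Count the ink pixels in every column (the vertical projection profile).
    ink_per_column = [sum(row[x] for row in binary) for x in range(width)]
    spans, start = [], None
    for x, ink in enumerate(ink_per_column):
        if ink and start is None:
            start = x                      # a character begins
        elif not ink and start is not None:
            spans.append((start, x))       # an empty column ends it
            start = None
    if start is not None:
        spans.append((start, width))       # character touching the right edge
    return spans


# Hypothetical text line containing three characters.
line = [
    [1, 1, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 1, 0],
    [1, 1, 0, 0, 1, 0, 1, 1],
]
spans = segment_columns(line)
# Cut each character out of the line as its own small image.
glyphs = [[row[a:b] for row in line] for a, b in spans]
```

Touching or overlapping characters would not be separated by this simple rule and would need a more elaborate method.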
d. Text-Optimization:
The extracted character images are compared against the pre-defined patterns. The matched image will be converted to text...
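The comparison against pre-defined patterns can be sketched as a simple template match. The 3x3 templates, the pixel-agreement score and the names `TEMPLATES`, `match_score` and `recognize` are all assumptions chosen for illustration; a real system would use larger, size-normalized patterns and possibly a different similarity measure.

```python
# Hypothetical pre-defined patterns: 3x3 binary images (1 = ink).
TEMPLATES = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}


def match_score(glyph, template):
    """Fraction of pixels on which the glyph and the template agree."""
    total = sum(len(row) for row in template)
    agree = sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )
    return agree / total


def recognize(glyph, templates=TEMPLATES):
    """Return the label of the template that best matches the glyph."""
    return max(templates, key=lambda label: match_score(glyph, templates[label]))


# A slightly damaged "L": the bottom-right pixel is missing.
noisy_l = [[1, 0, 0], [1, 0, 0], [1, 1, 0]]
label = recognize(noisy_l)
```

Because the score tolerates a few disagreeing pixels, a slightly damaged character can still be mapped to its closest pattern and emitted as text.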