Text Extraction from Document Image


Create Date 9 May 2023
Last Updated 9 May 2023


Anusha C, Saket Mishra, Rohit Metre, Harsh Gurawalia

Department of CSE, PES University, Bangalore-79, Karnataka

Email: anusha20c@gmail.com, saketmishra113@gmail.com, rohitmetre2000@gmail.com, harshrocks2442@gmail.com

Contact: +91 7676240083, +91 9731963460, +91 9611910110, +91 9901360787

Guided By:

Dr. Sapna V.M, Assistant Professor, Dept. of CSE, PES University, Bangalore, Karnataka

Email: sapnavm@pes.edu

Abstract: We build an optical character recognition (OCR) pipeline in which a convolutional neural network (CNN) classifies each individual character. A CNN has fewer parameters than a fully connected network and therefore requires less training. The input image is segmented first into lines, then into words, and finally into individual characters, which are passed to the CNN. The CNN is trained on English character datasets. The EMNIST dataset (Extended Modified National Institute of Standards and Technology) has around 8 lakh (800,000) samples divided into 62 classes (10 digits + 26 lowercase letters + 26 uppercase letters). Because EMNIST comprises handwritten characters, we also use the CHARS74k dataset, which has the same 62 classes as EMNIST and is normalized, with 1016 samples per character class. Words are built by concatenating the character labels predicted by the CNN. Because these predictions may contain misclassifications, a correction step is required: an English spell checker locates all similar words and selects the most appropriate one.
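
The segmentation and correction stages described in the abstract can be illustrated with a minimal sketch. The abstract does not say how lines, words, and characters are split or which spell checker is used, so everything here is an assumption: the sketch uses projection profiles (row and column ink counts) for segmentation and Python's standard `difflib.get_close_matches` as a stand-in for the spell checker. The names `ink_runs`, `segment`, `correct_word`, and the `word_gap` parameter are hypothetical, and the CNN classifier is left out entirely.

```python
from difflib import get_close_matches


def ink_runs(profile, min_gap=1):
    """Return [start, end) spans of non-zero entries in a 1-D profile.

    Gaps of fewer than min_gap zeros are bridged, so a larger min_gap
    merges neighbouring runs (used here to group characters into words).
    """
    spans, start, gap = [], None, 0
    for i, value in enumerate(profile):
        if value:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        spans.append((start, len(profile) - gap))
    return spans


def segment(image, word_gap=3):
    """Split a binary image (rows of 0/1, 1 = ink) into lines, words, characters.

    Assumption: blank rows separate lines, blank column gaps of at least
    word_gap separate words, and any blank column separates characters.
    """
    result = []
    row_profile = [sum(row) for row in image]
    for top, bottom in ink_runs(row_profile):
        band = image[top:bottom]
        col_profile = [sum(col) for col in zip(*band)]
        words = []
        for w_start, w_end in ink_runs(col_profile, min_gap=word_gap):
            chars = [
                (w_start + c0, w_start + c1)
                for c0, c1 in ink_runs(col_profile[w_start:w_end])
            ]
            words.append(chars)
        result.append({"rows": (top, bottom), "words": words})
    return result


def correct_word(word, vocabulary, cutoff=0.6):
    """Replace a predicted word with its closest vocabulary entry, if any.

    vocabulary is a list of lowercase words; cutoff is the minimum
    similarity ratio accepted by difflib.
    """
    lowered = word.lower()
    if lowered in vocabulary:
        return word
    matches = get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word
```

In such a sketch, each character span returned by `segment` would be cropped from the image and passed to the trained CNN, the predicted labels would be joined into a word, and `correct_word` would then be applied to that word. Real scans usually need binarization and skew correction before projection profiles work reliably.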
