Text Extraction from Document Image
- Create Date 9 May 2023
- Last Updated 9 May 2023
Anusha C, Saket Mishra, Rohit Metre, Harsh Gurawalia
Department of CSE, PES University, Bangalore-79, Karnataka
Email: anusha20c@gmail.com, saketmishra113@gmail.com, rohitmetre2000@gmail.com, harshrocks2442@gmail.com
Contact: +91 7676240083, +91 9731963460, +91 9611910110, +91 9901360787
Guided By:
Dr. Sapna V. M., Assistant Professor, Dept. of CSE, PES University, Bangalore, Karnataka
Email: sapnavm@pes.edu
Abstract: We will build an optical character recognition (OCR) pipeline in which a convolutional neural network (CNN) classifies each individual character. A CNN has fewer parameters than a fully connected network and therefore requires less training. To make this work, we will first split the document image into lines, then into words, and finally into individual characters, which are passed to the CNN. The CNN will be trained on English character datasets. The EMNIST dataset (Extended Modified National Institute of Standards and Technology) has around 8 lakh (800,000) samples divided into 62 classes (10 digits, 26 lowercase letters, and 26 uppercase letters). Because EMNIST consists of handwritten characters, we also use the Chars74K dataset, which has the same 62 classes as EMNIST and is normalized, with 1016 samples per character class. To build words, we will concatenate the character labels predicted by the CNN. Some predictions may be misclassified, so the resulting words need correction. For this, we will use an English spell checker to find all similar words and select the most appropriate one.
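The segmentation and correction stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the page has already been binarized into rows of 0/1 values (1 = ink), that text lines and characters are separated by fully blank rows and columns (a projection-profile approach), and that words are separated by gaps of at least `min_gap` blank columns. The sketch finds character columns first and then groups them into words, which yields the same line, word, character hierarchy. All function names are hypothetical, and Python's standard `difflib.get_close_matches` stands in for the English spell checker, which the abstract does not name. The CNN classification step is omitted.

```python
from difflib import get_close_matches


def split_runs(profile):
    """Return (start, end) index pairs for each run of non-zero values."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                     # a run of ink begins here
        elif not value and start is not None:
            runs.append((start, i))       # run covers indices start .. i-1
            start = None
    if start is not None:                 # run reaches the end of the profile
        runs.append((start, len(profile)))
    return runs


def segment_lines(image):
    """Split a binary image into text-line row ranges (horizontal projection)."""
    row_profile = [sum(row) for row in image]
    return split_runs(row_profile)


def segment_characters(image, top, bottom):
    """Split one text line into character column ranges (vertical projection)."""
    width = len(image[0])
    col_profile = [sum(image[r][c] for r in range(top, bottom))
                   for c in range(width)]
    return split_runs(col_profile)


def group_words(char_runs, min_gap):
    """Group character ranges into words; a gap >= min_gap starts a new word."""
    words = []
    for run in char_runs:
        if words and run[0] - words[-1][-1][1] < min_gap:
            words[-1].append(run)         # small gap: same word
        else:
            words.append([run])           # first character or wide gap
    return words


def correct_word(predicted, vocabulary):
    """Return the closest vocabulary word, or the prediction if none is close."""
    # difflib is an assumed stand-in for the spell checker in the abstract.
    matches = get_close_matches(predicted.lower(), vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else predicted
```

In this sketch, each character range returned by `segment_characters` would be cropped, resized to the CNN's input size, and classified. The predicted labels within one word group are then concatenated and passed to `correct_word`. The `cutoff` of 0.6 and the value of `min_gap` are illustrative choices that would need tuning on real scans.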