IMAGE TO SPEECH CONVERTER SOFTWARE USING DEEP LEARNING
Authors:
Chelimila Manasa, M. Parushuram, K. Ayyappa
Department of Computer Engineering, Methodist College of Engineering and Technology,
Abids, Hyderabad, Telangana, 500001, India.
Dr. Syed Azahad
Department of Computer Engineering, Methodist College of Engineering and Technology, Abids, Hyderabad, Telangana, 500001, India.
ABSTRACT
This project presents a real-time image-to-speech system designed to enhance accessibility for individuals with visual impairments by transforming visual content into meaningful spoken descriptions. The proposed framework leverages deep learning models to interpret images and generate natural language captions that accurately describe the visual scene. Convolutional Neural Networks (CNNs) are utilised to extract detailed and discriminative visual features, forming the foundation for understanding objects and their spatial relationships within an image. These extracted features are then processed by a Long Short-Term Memory (LSTM) network equipped with an attention mechanism, enabling the model to focus on relevant regions of the image while producing contextually rich and coherent textual descriptions.
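The attention step described above can be sketched in miniature as follows. This is an illustrative example only: the additive (Bahdanau-style) scoring function, the toy dimensions, and the weight matrices are assumptions for demonstration, not the exact architecture used in this work.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(features, hidden, W_f, W_h, v):
    """Additive attention over CNN feature regions.

    features: (R, D) array -- R spatial regions, each a D-dim CNN feature
    hidden:   (H,) array   -- current LSTM decoder hidden state
    Returns the attention weights over regions and the weighted context
    vector that the decoder consumes at this time step.
    """
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v  # (R,) alignment scores
    alpha = softmax(scores)        # non-negative weights summing to 1
    context = alpha @ features     # (D,) weighted summary of the image
    return alpha, context

# Toy dimensions: 4 regions, 8-dim features, 6-dim hidden, 5-dim attention space
rng = np.random.default_rng(0)
R, D, H, A = 4, 8, 6, 5
features = rng.standard_normal((R, D))
hidden = rng.standard_normal(H)
W_f = rng.standard_normal((D, A))
W_h = rng.standard_normal((H, A))
v = rng.standard_normal(A)

alpha, context = attend(features, hidden, W_f, W_h, v)
```

At each decoding step the LSTM's hidden state re-weights the image regions, so the generated word can be conditioned on the most relevant part of the scene.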
To convert the generated captions into audible speech, the system incorporates a Text-to-Speech (TTS) engine, completing a seamless pipeline from image acquisition to spoken output. The model is trained and evaluated using the MS-COCO dataset, which provides diverse and complex image-caption pairs. Performance is assessed through widely recognised captioning metrics, including BLEU and METEOR scores, ensuring a reliable evaluation of linguistic accuracy and descriptive quality.
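A simplified, single-reference version of the BLEU metric mentioned above can be sketched as follows. Real evaluations use corpus-level BLEU-4 with multiple references (as in MS-COCO); this pure-Python sketch computes modified n-gram precision with a brevity penalty, and the example sentences are invented for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU: geometric mean of modified n-gram
    precisions up to max_n, times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # clip candidate n-gram counts by reference counts
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # penalise captions shorter than the reference
    bp = 1.0 if len(candidate) > len(reference) else math.exp(
        1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(log_mean)

reference = "a dog runs across the grassy field".split()
generated = "a dog runs across the field".split()
score = bleu(generated, reference)
```

A perfect match scores 1.0, while dropped or substituted words reduce the n-gram overlap and hence the score, which is why BLEU serves as a proxy for caption fidelity.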
By integrating techniques from computer vision and natural language processing, this system demonstrates the potential of artificial intelligence in assistive technology. The architecture is optimised for real-time execution, making it suitable for deployment on low-cost devices and edge-based platforms. Overall, the proposed solution offers a practical, efficient, and scalable tool aimed at improving independence and enhancing the quality of life for visually impaired users.
Keywords: Image Captioning, Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Attention Mechanism, Deep Learning, Computer Vision, Natural Language Processing (NLP), Text-to-Speech (TTS), Assistive Technology, MS-COCO Dataset