IMAGE TO SPEECH CONVERTER SOFTWARE USING DEEP LEARNING
Authors:
Chelimila Manasa, M. Parushuram, K. Ayyappa
Department of Computer Engineering, Methodist College of Engineering and Technology,
Abids, Hyderabad, Telangana, 500001, India.
Dr. Syed Azahad
Department of Computer Engineering, Methodist College of Engineering and Technology, Abids, Hyderabad, Telangana, 500001, India.
ABSTRACT
This project presents a real-time image-to-speech system designed to enhance accessibility for individuals with visual impairments by transforming visual content into meaningful spoken descriptions. The proposed framework leverages deep learning models to interpret images and generate natural language captions that accurately describe the visual scene. Convolutional Neural Networks (CNNs) are utilised to extract detailed and discriminative visual features, forming the foundation for understanding objects and their spatial relationships within an image. These extracted features are then processed by a Long Short-Term Memory (LSTM) network equipped with an attention mechanism, enabling the model to focus on relevant regions of the image while producing contextually rich and coherent textual descriptions.
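The attention step described above can be sketched in miniature as follows. This is an illustrative example only: the additive (Bahdanau-style) scoring function, the toy dimensions, and the weight matrices are assumptions for demonstration, not the exact architecture used in this work.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(features, hidden, W_f, W_h, v):
    """Additive attention over CNN feature regions.

    features: (R, D) array -- R spatial regions, each a D-dim CNN feature
    hidden:   (H,) array   -- current LSTM decoder hidden state
    Returns the attention weights over regions and the weighted context
    vector that the decoder consumes at this time step.
    """
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v  # (R,) alignment scores
    alpha = softmax(scores)        # non-negative weights summing to 1
    context = alpha @ features     # (D,) weighted summary of the image
    return alpha, context

# Toy dimensions: 4 regions, 8-dim features, 6-dim hidden, 5-dim attention space
rng = np.random.default_rng(0)
R, D, H, A = 4, 8, 6, 5
features = rng.standard_normal((R, D))
hidden = rng.standard_normal(H)
W_f = rng.standard_normal((D, A))
W_h = rng.standard_normal((H, A))
v = rng.standard_normal(A)

alpha, context = attend(features, hidden, W_f, W_h, v)
```

At each decoding step the LSTM's hidden state re-weights the image regions, so the generated word can be conditioned on the most relevant part of the scene.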
To convert the generated captions into audible speech, the system incorporates a Text-to-Speech (TTS) engine, completing a seamless pipeline from image acquisition to spoken output. The model is trained and evaluated using the MS-COCO dataset, which provides diverse and complex image-caption pairs. Performance is assessed through widely recognised captioning metrics, including BLEU and METEOR scores, ensuring a reliable evaluation of linguistic accuracy and descriptive quality.
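A simplified, single-reference version of the BLEU metric mentioned above can be sketched as follows. Real evaluations use corpus-level BLEU-4 with multiple references (as in MS-COCO); this pure-Python sketch computes modified n-gram precision with a brevity penalty, and the example sentences are invented for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU: geometric mean of modified n-gram
    precisions up to max_n, times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # clip candidate n-gram counts by reference counts
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # penalise captions shorter than the reference
    bp = 1.0 if len(candidate) > len(reference) else math.exp(
        1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(log_mean)

reference = "a dog runs across the grassy field".split()
generated = "a dog runs across the field".split()
score = bleu(generated, reference)
```

A perfect match scores 1.0, while dropped or substituted words reduce the n-gram overlap and hence the score, which is why BLEU serves as a proxy for caption fidelity.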
By integrating techniques from computer vision and natural language processing, this system demonstrates the potential of artificial intelligence in assistive technology. The architecture is optimised for real-time execution, making it suitable for deployment on low-cost devices and edge-based platforms. Overall, the proposed solution offers a practical, efficient, and scalable tool aimed at improving independence and enhancing the quality of life for visually impaired users.
Keywords: Image Captioning, Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Attention Mechanism, Deep Learning, Computer Vision, Natural Language Processing (NLP), Text-to-Speech (TTS), Assistive Technology, MS-COCO Dataset