Real-Time Offline AI Image Describer for Visually Impaired: BLIP-Based Captioning with Hazard Prioritisation, OCR Integration, and Cosine-Similarity Change Detection

Published 17 May 2026

Updated 17 May 2026

Authors:

Kanishka Singhal, Mansi, Dr. Archana Kumar

Abstract—More than 285 million people worldwide live with some form of visual impairment, yet the AI-powered assistive tools most widely available to them (Microsoft Seeing AI, Google Lookout, and similar cloud-hosted applications) share practical shortcomings that limit real-world usefulness: they stop working whenever internet connectivity drops, they read every scene element in the same neutral tone regardless of danger, and they repeat themselves needlessly when nothing meaningful has changed. This paper describes a system built to address each of those problems within a single, self-contained pipeline. The approach combines the Salesforce BLIP vision-language model for natural-language scene description, Tesseract OCR v5 for reading text embedded in the scene, cosine-similarity comparison of BLIP image embeddings to decide whether a new description is worth speaking, and a keyword-based hazard detector that escalates urgent warnings through a faster, more prominent text-to-speech voice. Speech is synthesised offline with pyttsx3, and the whole pipeline runs on ordinary laptop hardware.

We evaluated the system on 150 annotated frames drawn from five scene types. Overall caption accuracy was 76.0%, rising to 86.7% in well-lit indoor conditions. OCR reached an F1 score of 80.7% across scene-text categories, hazard recall was 91.3% with a non-hazard precision of 100%, and change detection cut redundant audio output by 55%. A side-by-side comparison with four deployed assistive systems shows that none of them covers all four capability dimensions at once without requiring GPU hardware. We also provide a formal mathematical treatment of the caption generation objective, the change detection criterion, and the hazard function. Because the system runs entirely on a CPU-only machine with 4 GB of RAM, it represents a meaningful step toward genuinely accessible assistive technology in resource-constrained settings.

Index Terms—assistive technology, BLIP, cosine similarity, hazard detection, image captioning, OCR, offline AI, text-to-speech, visually impaired, Vision Transformer (ViT).
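
The change-detection and hazard-escalation steps summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the similarity threshold, the keyword list, the speech rates, and the function names (`scene_changed`, `is_hazard`, `plan_announcement`) are all hypothetical, and the embeddings are plain lists of floats standing in for BLIP image embeddings. The sketch also assumes that an unchanged scene is never announced, which the abstract does not state.

```python
import math
import re
from typing import Optional, Sequence

# Hypothetical values: the abstract gives neither the threshold nor the keywords.
SIMILARITY_THRESHOLD = 0.90
HAZARD_KEYWORDS = {"stairs", "car", "fire", "knife", "traffic"}
NORMAL_RATE = 150   # assumed words-per-minute for ordinary descriptions
URGENT_RATE = 200   # assumed faster rate for hazard warnings


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def scene_changed(prev: Optional[Sequence[float]],
                  curr: Sequence[float],
                  threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """A frame counts as new if there is no previous frame or similarity is low."""
    if prev is None:
        return True
    return cosine_similarity(prev, curr) < threshold


def is_hazard(caption: str) -> bool:
    """Whole-word keyword match, so 'carpet' does not trigger on 'car'."""
    words = re.findall(r"[a-z]+", caption.lower())
    return any(word in HAZARD_KEYWORDS for word in words)


def plan_announcement(caption: str,
                      prev: Optional[Sequence[float]],
                      curr: Sequence[float]) -> Optional[dict]:
    """Return None for an unchanged scene, else the text and speech rate to use."""
    if not scene_changed(prev, curr):
        return None
    urgent = is_hazard(caption)
    return {
        "text": ("Warning: " + caption) if urgent else caption,
        "urgent": urgent,
        "rate": URGENT_RATE if urgent else NORMAL_RATE,
    }
```

In a full pipeline the returned `rate` could be applied with pyttsx3's `engine.setProperty("rate", ...)` before calling `engine.say(...)`. Matching on whole words rather than substrings is a design choice of this sketch, made to avoid false alarms on words that merely contain a keyword.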