Real-Time Offline AI Image Describer for Visually Impaired: BLIP-Based Captioning with Hazard Prioritisation, OCR Integration, and Cosine-Similarity Change Detection
Authors:
Kanishka Singhal, Mansi, Dr. Archana Kumar
Abstract—More than 285 million people worldwide live with some form of visual impairment, yet the most widely available AI-powered assistive tools (Microsoft Seeing AI, Google Lookout, and similar cloud-hosted applications) share three practical shortcomings that limit their real-world usefulness: they stop working whenever internet connectivity drops, they read every scene element in the same neutral tone regardless of danger, and they repeat themselves needlessly when nothing meaningful has changed. This paper describes a system built to address each of those problems within a single, self-contained pipeline. The approach combines the Salesforce BLIP vision-language model for natural-language scene description, Tesseract OCR v5 for reading text embedded in the scene, cosine-similarity comparison of BLIP image embeddings to decide whether a new description is worth speaking, and a keyword-based hazard detector that escalates urgent warnings through a faster, more prominent text-to-speech voice. The entire pipeline runs offline on ordinary laptop hardware, with speech synthesised by pyttsx3.
We evaluated the system on 150 annotated frames drawn from five scene types. Overall caption accuracy was 76.0%, rising to 86.7% in well-lit indoor conditions. OCR reached an F1 score of 80.7% across scene-text categories, hazard recall was 91.3% with a non-hazard precision of 100%, and change detection cut redundant audio output by 55%. A side-by-side comparison with four deployed assistive systems shows that none of them covers all four capability dimensions simultaneously without requiring GPU hardware. We also provide a formal mathematical treatment of the caption generation objective, the change detection criterion, and the hazard function. Because the system runs entirely on a CPU with 4 GB of RAM, it represents a meaningful step toward accessible assistive technology in resource-constrained settings.
Index Terms—assistive technology, BLIP, cosine similarity, hazard detection, image captioning, OCR, offline AI, text-to-speech, visually impaired, Vision Transformer (ViT).
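As a rough illustration of two ideas named in the abstract, the sketch below shows how cosine-similarity change detection and keyword-based hazard flagging might be expressed in Python. It is not the authors' implementation: the threshold value, the keyword list, and the function names `cosine_similarity`, `scene_changed`, and `is_hazard` are all hypothetical, and the embeddings are plain lists of floats standing in for BLIP image embeddings.

```python
import math

# Hypothetical values: the abstract states neither the threshold nor the keyword list.
SIMILARITY_THRESHOLD = 0.90
HAZARD_KEYWORDS = {"stairs", "fire", "knife", "car", "traffic"}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a degenerate embedding as "no similarity"
    return dot / (norm_a * norm_b)


def scene_changed(prev_embedding, new_embedding, threshold=SIMILARITY_THRESHOLD):
    """Return True when the new frame differs enough to merit a fresh description."""
    if prev_embedding is None:
        return True  # the first frame is always described
    return cosine_similarity(prev_embedding, new_embedding) < threshold


def is_hazard(caption, keywords=HAZARD_KEYWORDS):
    """Return True when any hazard keyword appears as a whole word in the caption."""
    words = {w.strip(".,;:!?").lower() for w in caption.split()}
    return not words.isdisjoint(keywords)
```

In a loop over camera frames, one would presumably call `scene_changed` on consecutive embeddings and speak a new caption only when it returns `True`, switching to the more urgent voice when `is_hazard` flags the caption. Matching whole words rather than substrings is a design choice made here so that, for example, "carpet" does not trigger the "car" keyword.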