Visionary AI: Multimodal Image Captioning Using BLIP-2
J. Janaki Ram, M. Siddardha, D. Vinodh Kumar, G. Sunil Kumar, B. Anjanadevi
Department of Information Engineering and Computational Technology, MVGR College of Engineering (A),
Vizianagaram, Andhra Pradesh, India
Abstract—Generating coherent natural language descriptions from visual content sits at a genuinely difficult intersection of computer vision and natural language processing, one where research progress has accelerated sharply in recent years while deployment-ready systems remain comparatively rare. This work introduces Visionary AI, a full-stack multimodal web application that harnesses the Bootstrapping Language-Image Pre-training 2 (BLIP-2) model to produce semantically rich captions for static images, video sequences, and live camera feeds. At its core, BLIP-2 couples a frozen vision encoder with a lightweight Querying Transformer (Q-Former) and a frozen large language model (LLM) decoder, yielding descriptions that are both contextually grounded and grammatically fluent while requiring substantially fewer trainable parameters than conventional end-to-end architectures. Beyond baseline English captioning, the platform extends its utility through four operationally meaningful modules: multilingual output spanning eight or more languages (including Hindi, Spanish, French, German, and Japanese); browser-native voice narration via the Web Speech API for hands-free caption consumption; video processing through configurable key-frame extraction and temporal narrative synthesis; and six contextual caption variants (Creative, Technical, Social, Minimal, Narrative, and Atmospheric) tailored to specific deployment needs. On the Flickr8k benchmark, the system attains BLEU-4 = 0.34, METEOR = 0.27, and CIDEr = 0.72, surpassing prior encoder-decoder baselines by a meaningful margin.
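As a concrete illustration of the captioning core summarized above, the sketch below loads a publicly available BLIP-2 checkpoint through the Hugging Face transformers library and generates a caption for a single image. The checkpoint name (Salesforce/blip2-opt-2.7b), the use of transformers rather than the paper's own serving stack, the input path, and the generation settings are assumptions made for illustration, not the system's deployed configuration.

# Illustrative sketch only: captioning one image with a public BLIP-2 checkpoint.
# Checkpoint name, input path, and generation settings are assumptions, not the
# paper's deployed configuration.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# The frozen vision encoder, Q-Former, and frozen LLM decoder are bundled in this model.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
)
model.to(device)

image = Image.open("example.jpg").convert("RGB")  # hypothetical input image
inputs = processor(images=image, return_tensors="pt").to(device, model.dtype)

# Beam search keeps the caption fluent; max_new_tokens bounds its length.
generated_ids = model.generate(**inputs, num_beams=5, max_new_tokens=40)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)

In the full application, a generated caption of this kind would then feed the downstream modules described in the abstract, such as translation into the supported languages and browser-side voice narration.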