VISUAL QUESTION ANSWERING

Create Date 24 April 2025
Last Updated 24 April 2025

Authors:

Dr. Rhea Srinivas, R Navneeth Naidu, P Anil Kumar, S Rohith Kumar, Santosh Nallala, Mithun P

Abstract - Vision-Language Pre-Training (VLP) has significantly improved performance on a variety of multimodal tasks. However, existing models are often specialized in either understanding or generation, which limits their versatility. Furthermore, reliance on large, noisy web-collected text remains a suboptimal source of supervision. To address these challenges, we propose VLX, a unified VLP framework that handles both vision-language understanding and generation tasks. VLX introduces a new data optimization strategy in which a generator creates high-quality synthetic training data and noisy samples are identified, so that data collected from the web is used more efficiently. Our framework achieves state-of-the-art results on important benchmarks, including image-text retrieval (+3.1% average recall@1), visual question answering (+2.0% accuracy), and multimodal captioning (+2.5% CIDEr). Additionally, VLX demonstrates robust zero-shot transfer to video-language tasks without any additional fine-tuning. We release code, models, and datasets to promote future research.
