VISUAL QUESTION ANSWERING
Authors:
Dr. Rhea Srinivas, R Navneeth Naidu, P Anil Kumar, S Rohith Kumar, Santosh Nallala, Mithun P
Abstract - Vision-Language Pre-training (VLP) has significantly improved performance on a wide range of multimodal tasks. However, most existing models specialize in either understanding or generation, which limits their versatility. Furthermore, reliance on large-scale, noisy web text remains a suboptimal source of supervision. To address these challenges, we propose VLX, a unified VLP framework that transfers flexibly to both vision-language understanding and generation tasks. VLX introduces a novel data bootstrapping strategy, in which a generator produces high-quality synthetic training data and a filter removes noisy samples, enabling more effective use of web-collected data. Our framework achieves state-of-the-art results on key benchmarks, including image-text retrieval (+3.1% in average recall@1), visual question answering (+2.0% in accuracy), and image captioning (+2.5% in CIDEr). Additionally, VLX demonstrates strong zero-shot transferability to video-language tasks without any additional fine-tuning. We release our code, models, and datasets to promote future research.
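To make the bootstrapping strategy concrete, the sketch below shows one plausible generator-and-filter loop over noisy web image-text pairs. It is a minimal illustration based only on the abstract's description; the class and method names (`Pair`, `captioner.generate`, `filter_model.score`, the `threshold` parameter) are hypothetical and do not reflect VLX's actual implementation.

```python
# Hypothetical sketch of the data bootstrapping loop: a generator (captioner)
# produces synthetic captions and a filter removes noisy image-text pairs.
# All names are illustrative, not the paper's API.

from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Pair:
    image: bytes   # raw image data (placeholder)
    text: str      # caption, either web-harvested or synthetic


def bootstrap(web_pairs: Iterable[Pair],
              captioner,                 # assumed model: image -> synthetic caption
              filter_model,              # assumed model: (image, text) -> match score in [0, 1]
              threshold: float = 0.5) -> List[Pair]:
    """Return a cleaned training set built from noisy web image-text pairs."""
    cleaned: List[Pair] = []
    for pair in web_pairs:
        # Keep the original web caption only if the filter judges it a match.
        if filter_model.score(pair.image, pair.text) >= threshold:
            cleaned.append(pair)
        # Generate a synthetic caption and keep it if it also passes the filter.
        synthetic = captioner.generate(pair.image)
        if filter_model.score(pair.image, synthetic) >= threshold:
            cleaned.append(Pair(pair.image, synthetic))
    return cleaned
```

Under this reading, the filter serves double duty: it prunes noisy web captions and vets the generator's synthetic ones, so the final training set mixes only the pairs that pass the image-text match check.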