AI Knowledge Extraction System
Sura Chandu
Department of Computer Science
Rajiv Gandhi University of Knowledge Technologies
Basar, Telangana, India
b200152@rgukt.ac.in
Vollala Saiprakash
Department of Computer Science
Rajiv Gandhi University of Knowledge Technologies
Basar, Telangana, India
b200770@rgukt.ac.in
Sayyam Sai Kumar
Department of Computer Science
Rajiv Gandhi University of Knowledge Technologies
Basar, Telangana, India
B201489@rgukt.ac.in
Buddannagari Latha
Assistant Professor, Dept. of Computer Science
Rajiv Gandhi University of Knowledge Technologies
Basar, Telangana, India
latha.reddy5808@gmail.com
Abstract—The proposed system is a RAG + CAG (Retrieval-Augmented Generation and Context-Augmented Generation) Multi-Source Knowledge Extraction System designed to ingest, process, and synthesize information from diverse digital media. At its foundation, a Content Extraction Module gathers unstructured data from several kinds of sources: PDF documents, websites (via web scrapers), YouTube videos (via transcript extraction), and images (via Optical Character Recognition, OCR). The raw data is pre-processed and split by “smart chunking”, then transformed by an Embedding Generator and indexed in a vector database (Chroma). When a user submits a question, the system generates a query embedding and retrieves the most contextually relevant chunks. Finally, the RAG + CAG Engine combines the retrieved chunks with source metadata and conversation context to produce a synthesized answer. Quantitative evaluations show 92% retrieval accuracy and a near-zero hallucination rate, supporting the viability of this localized, hybrid architecture for enterprise and academic deployment.
Index Terms—Retrieval-Augmented Generation (RAG), Large Language Models (LLM), Optical Character Recognition (OCR), Natural Language Processing (NLP), Document Extraction, Vector Database, Context-Augmented Generation (CAG), Multi-Modal Processing.
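The data flow summarized in the abstract (chunking, embedding, indexing, query-time retrieval, and prompt assembly with source metadata and conversation context) can be illustrated with a minimal, dependency-free sketch. It is not the system's implementation: the bag-of-words `embed` function is a toy placeholder for the Embedding Generator, `InMemoryIndex` stands in for the Chroma vector database, fixed-size overlapping windows stand in for “smart chunking”, and all names and parameters (`chunk_text`, `max_words`, `overlap`, `build_prompt`) are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40, overlap=10):
    """Split text into overlapping word windows (a simple stand-in for smart chunking)."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def embed(text):
    """Toy embedding: term-frequency vector (placeholder for a neural embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryIndex:
    """Assumed stand-in for the vector database: stores chunks with source metadata."""

    def __init__(self):
        self.records = []

    def add(self, text, source):
        for position, chunk in enumerate(chunk_text(text)):
            self.records.append(
                {"chunk": chunk, "source": source,
                 "position": position, "vector": embed(chunk)}
            )

    def query(self, question, k=2):
        """Return the k stored chunks most similar to the question."""
        q = embed(question)
        ranked = sorted(self.records,
                        key=lambda r: cosine(q, r["vector"]), reverse=True)
        return ranked[:k]


def build_prompt(question, hits, history):
    """Combine retrieved chunks, their sources, and prior turns into one prompt string."""
    context = "\n".join(f"[{h['source']}] {h['chunk']}" for h in hits)
    past = "\n".join(history)
    return f"Context:\n{context}\n\nConversation:\n{past}\n\nQuestion: {question}"
```

In this sketch, each extracted source (a PDF, a scraped page, a transcript, or OCR output) would be passed to `InMemoryIndex.add` with a source label, and the string returned by `build_prompt` would be handed to the language model. A real deployment would replace the toy pieces with an embedding model and a persistent vector store.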