AI Knowledge Extraction System

Published 20 April 2026

Updated 20 April 2026

Sura Chandu
Department of Computer Science
Rajiv Gandhi University of Knowledge Technologies
Basar, Telangana, India
b200152@rgukt.ac.in

Vollala Saiprakash
Department of Computer Science
Rajiv Gandhi University of Knowledge Technologies
Basar, Telangana, India
b200770@rgukt.ac.in

Sayyam Sai Kumar
Department of Computer Science
Rajiv Gandhi University of Knowledge Technologies
Basar, Telangana, India
B201489@rgukt.ac.in

Buddannagari Latha
Assistant Professor, Dept. of Computer Science
Rajiv Gandhi University of Knowledge Technologies
Basar, Telangana, India
latha.reddy5808@gmail.com

Abstract: This paper presents a RAG + CAG (Retrieval-Augmented Generation and Context-Augmented Generation) multi-source knowledge extraction system that ingests, processes, and synthesizes information from diverse digital media. At its foundation, a Content Extraction Module gathers unstructured data from PDF documents, websites (via web scrapers), YouTube videos (via transcript extraction), and images (via Optical Character Recognition, OCR). The raw data is pre-processed and split by “smart chunking”, then converted to vectors by an Embedding Generator and indexed in a vector database (Chroma). When a user submits a question, the system generates a query embedding and retrieves the most contextually relevant chunks. A RAG + CAG Engine then combines the retrieved chunks with source metadata and conversation context to produce a synthesized answer. Quantitative evaluations show 92% retrieval accuracy and a near-zero hallucination rate, which supports the viability of this localized, hybrid architecture for enterprise and academic deployment.

Index Terms: Retrieval-Augmented Generation (RAG), Large Language Models (LLM), Optical Character Recognition (OCR), Natural Language Processing (NLP), Document Extraction, Vector Database, Context-Augmented Generation (CAG), Multi-Modal Processing.
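The pipeline summarized in the abstract (chunk, embed, index, retrieve, then assemble context for generation) can be illustrated with a minimal, self-contained Python sketch. It is not the authors' implementation. The hashed bag-of-words `embed` function and the in-memory `VectorIndex` are simple stand-ins for the Embedding Generator and the Chroma database, and sentence grouping is only one possible reading of "smart chunking". All names (`Chunk`, `chunk_text`, `embed`, `VectorIndex`, `build_prompt`, `DIM`) are hypothetical and chosen for illustration.

```python
from __future__ import annotations

import hashlib
import math
import re
from dataclasses import dataclass

DIM = 4096  # size of the toy embedding vector (assumption, not from the paper)


@dataclass
class Chunk:
    text: str
    source: str  # source label kept as metadata, e.g. "pdf", "web", "youtube", "ocr"


def chunk_text(text: str, source: str, max_words: int = 40) -> list[Chunk]:
    """Group whole sentences into chunks of roughly max_words words.

    A sentence is never split, so one very long sentence can exceed the limit.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[Chunk] = []
    current: list[str] = []
    for sentence in sentences:
        words = sentence.split()
        if current and len(current) + len(words) > max_words:
            chunks.append(Chunk(" ".join(current), source))
            current = []
        current.extend(words)
    if current:
        chunks.append(Chunk(" ".join(current), source))
    return chunks


def embed(text: str) -> list[float]:
    """Toy hashed bag-of-words embedding, L2-normalized (stand-in for a real model)."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class VectorIndex:
    """In-memory stand-in for the vector database."""

    def __init__(self) -> None:
        self._items: list[tuple[list[float], Chunk]] = []

    def add(self, chunks: list[Chunk]) -> None:
        for chunk in chunks:
            self._items.append((embed(chunk.text), chunk))

    def query(self, question: str, k: int = 2) -> list[Chunk]:
        """Return the k chunks with the highest cosine similarity to the question."""
        q = embed(question)
        ranked = sorted(
            self._items,
            key=lambda item: sum(a * b for a, b in zip(q, item[0])),
            reverse=True,
        )
        return [chunk for _, chunk in ranked[:k]]


def build_prompt(question: str, retrieved: list[Chunk], history: list[str]) -> str:
    """Combine retrieved chunks, their source labels, and prior conversation turns."""
    context = "\n".join(f"[{c.source}] {c.text}" for c in retrieved)
    past = "\n".join(history)
    return (
        f"Conversation so far:\n{past}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Example run: index two sources, retrieve for a question, assemble the prompt.
index = VectorIndex()
index.add(chunk_text("Chroma stores embedding vectors. It supports similarity search.", "pdf"))
index.add(chunk_text("The video explains how to bake sourdough bread at home.", "youtube"))
question = "Where are embedding vectors stored?"
print(build_prompt(question, index.query(question, k=1), ["User: hello"]))
```

In a real deployment the string returned by `build_prompt` would be sent to a language model. Tagging each chunk with its source is what lets the answer point back to the originating PDF, web page, video, or image.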