Speculative Retrieval-Augmented Generation for Cost-Efficient Large Language Model Inference
Author: Manoj Kumar
Department of Computer Science & Artificial Intelligence, Central University of Andhra Pradesh
Co-Author: Dr. P. Sumalatha
Department of Computer Science & Artificial Intelligence, Central University of Andhra Pradesh
Abstract - This work addresses two crucial bottlenecks in Retrieval-Augmented Generation (RAG): high inference latency and expensive computation cost. RAG enhances the factual correctness of Large Language Models (LLMs) by drawing on external knowledge sources; however, the autoregressive decoding of large models remains slow and costly. We introduce a novel speculative RAG framework that combines hybrid retrieval with speculative decoding. Our framework employs hybrid dense vector search (BGE-small) and sparse keyword search (TF-IDF) for enhanced recall, followed by a lightweight re-ranker. A compact draft model (TinyLlama-1.1B) proposes tentative future tokens, and a larger verifier model (Mistral-7B) checks each candidate token for correctness. Empirical experiments demonstrate a 33% reduction in inference latency and a 29% reduction in generated tokens compared to a baseline of a typical 7B model, while retaining 94% accuracy on factoid tasks.
To achieve good recall, the system applies a two-way retrieval mechanism:
- Dense retrieval: This module uses BAAI/bge-small-en-v1.5 to map queries and documents into a shared 384-dimensional embedding space. The embeddings are stored in a FAISS IndexFlatL2.
- Sparse retrieval: This module uses TF-IDF vectorization to capture exact keyword matches.
- Reranking: A simple heuristic matches substrings of the query terms against the retrieved documents to produce the final ranking for the prompt.
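The two-way retrieval mechanism described above can be sketched in miniature. The following is an illustrative, self-contained sketch: a tiny toy corpus, cosine similarity over raw term counts standing in for the dense BGE-small embeddings, a minimal TF-IDF for the sparse side, and the paper's substring heuristic for reranking; in the real system the dense scores would come from FAISS over 384-dimensional BGE vectors.

```python
import math
from collections import Counter

# Toy corpus standing in for the external knowledge base (illustrative only).
DOCS = [
    "FAISS builds a flat L2 index over dense embedding vectors",
    "TF-IDF gives rare keywords a higher weight than common ones",
    "A draft model proposes tokens that a verifier model checks",
]

def _cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def _tfidf(tokens, df, n_docs):
    """Minimal TF-IDF: term frequency scaled by log inverse document frequency."""
    tf = Counter(tokens)
    return {t: c * math.log(n_docs / df[t]) for t, c in tf.items() if t in df}

def hybrid_retrieve(query, docs, k=2, alpha=0.5):
    """Blend a dense-style score with sparse TF-IDF, then rerank by the
    substring heuristic: docs containing more query terms verbatim rank first."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))
    sparse_docs = [_tfidf(doc, df, n) for doc in tokenized]
    q_tokens = query.lower().split()
    q_sparse = _tfidf(q_tokens, df, n)
    # Stand-in "dense" similarity: cosine over raw term counts. A real
    # system would embed with BAAI/bge-small-en-v1.5 and search FAISS.
    q_dense = dict(Counter(q_tokens))
    scored = []
    for i, doc in enumerate(tokenized):
        dense = _cosine(q_dense, dict(Counter(doc)))
        sparse = _cosine(q_sparse, sparse_docs[i])
        scored.append((alpha * dense + (1 - alpha) * sparse, i))
    top = sorted(scored, reverse=True)[:k]
    rerank = sorted(
        top,
        key=lambda s_i: sum(t in docs[s_i[1]].lower() for t in q_tokens),
        reverse=True,
    )
    return [docs[i] for _, i in rerank]
```

For example, `hybrid_retrieve("dense embedding index", DOCS)` ranks the FAISS document first, since it scores highest on both channels and matches all three query terms verbatim.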
Key Words: Retrieval-Augmented Generation, Speculative Decoding, Large Language Models, Inference Optimization, Hybrid Retrieval
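The draft-and-verify loop summarized in the abstract can also be sketched. This is a greedy toy version under strong assumptions: the "models" are deterministic functions standing in for TinyLlama-1.1B (draft) and Mistral-7B (verifier), and the draft is wired to disagree at one position so the rejection path is exercised. The point is the control flow: the draft proposes a block of tokens, the verifier accepts the longest agreeing prefix in one pass, then emits one correction token.

```python
# Toy token stream the "verifier" considers correct (illustrative only).
TARGET_TEXT = "retrieval augmented generation cuts latency".split()

def target_next(prefix):
    """Expensive verifier stand-in: always knows the true next token."""
    return TARGET_TEXT[len(prefix)] if len(prefix) < len(TARGET_TEXT) else None

def draft_next(prefix):
    """Cheap draft stand-in: agrees with the verifier except on one token."""
    tok = target_next(prefix)
    return "reduces" if tok == "cuts" else tok  # deliberate disagreement

def speculative_decode(gamma=4):
    """Draft gamma tokens, verify in one pass, accept the agreeing prefix."""
    out, verifier_calls = [], 0
    while True:
        # 1) Draft model proposes up to gamma tokens autoregressively.
        block = []
        for _ in range(gamma):
            tok = draft_next(out + block)
            if tok is None:
                break
            block.append(tok)
        # 2) Verifier checks the whole block in a single pass: keep the
        #    longest prefix it agrees with, reject the rest.
        verifier_calls += 1
        accepted = []
        for tok in block:
            if target_next(out + accepted) == tok:
                accepted.append(tok)
            else:
                break
        out += accepted
        # 3) Verifier emits one token of its own (the correction, or the
        #    next token if the whole block was accepted).
        fix = target_next(out)
        if fix is None:
            return out, verifier_calls
        out.append(fix)
```

Here `speculative_decode()` reproduces all five target tokens with only two verifier passes instead of five, even though the draft mispredicts "cuts": that token is rejected and replaced by the verifier's correction, which is how the scheme trades cheap draft steps for fewer expensive verifier steps without losing output quality.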