International Scientific Journal of Engineering and Management

An International Scholarly || Multidisciplinary || Open Access || Indexing in all major Database & Metadata
The journal follows the UGC Guidelines and is evaluated for inclusion in the Web of Science
ISSN: 2583-6129

Impact Factor: 8.072

Speculative Retrieval-Augmented Generation for Cost-Efficient Large Language Model Inference

Published 7 May 2026
Updated 7 May 2026

Author: Manoj Kumar
Department: Computer Science & Artificial Intelligence, Central University of Andhra Pradesh

Co-Author: Dr. P. Sumalatha
Department: Computer Science & Artificial Intelligence, Central University of Andhra Pradesh

Abstract - This work addresses two crucial bottlenecks in Retrieval-Augmented Generation (RAG): high inference latency and high computational cost. RAG improves the factual correctness of Large Language Models (LLMs) by drawing on external knowledge sources; however, the auto- [...]

To achieve good recall, the system applies a two-way retrieval mechanism followed by a reranking step:

  • Dense retrieval: This module uses BAAI/bge-small-en-v1.5 to map queries and documents into a 384-dimensional embedding space. The embeddings are stored in a FAISS IndexFlatL2 index.
  • Sparse retrieval: This module uses TF-IDF vectorization to capture exact keyword matches.
  • Reranking: A simple heuristic that matches query terms as substrings in the selected documents performs the final ranking for the prompt
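The three stages listed above can be illustrated with a small standard-library sketch. It is not the authors' implementation: `toy_embed` is a hypothetical hashed bag-of-words stand-in for the BGE encoder, the brute-force L2 search only mimics what a flat index computes, and merging the two candidate lists by union is an assumption, since the text does not say how dense and sparse results are combined.

```python
import math
import re
import zlib
from collections import Counter

DOCS = [
    "FAISS IndexFlatL2 performs exact nearest-neighbour search.",
    "TF-IDF weights rare keywords more heavily than common ones.",
    "Speculative decoding pairs a draft model with a verifier model.",
]

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def toy_embed(text, dim=32):
    """Hypothetical stand-in for a real encoder: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for tok in tokenize(text):
        vec[zlib.crc32(tok.encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def dense_top_k(query, docs, k):
    """Exact brute-force search by squared L2 distance (smaller is closer)."""
    q = toy_embed(query)
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(q, toy_embed(docs[i])))
    return sorted(range(len(docs)), key=dist)[:k]

def sparse_top_k(query, docs, k):
    """TF-IDF cosine similarity with smoothed IDF, computed by hand."""
    toks = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(t for ts in toks for t in set(ts))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def vec(ts):
        tf = Counter(t for t in ts if t in idf)  # ignore out-of-vocabulary terms
        return {t: c * idf[t] for t, c in tf.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vec(tokenize(query))
    scores = [cosine(q, vec(ts)) for ts in toks]
    return sorted(range(n), key=lambda i: -scores[i])[:k]

def rerank(query, docs, candidates):
    """Heuristic reranker: count query terms that occur as substrings of the document."""
    terms = set(tokenize(query))
    def hits(i):
        return sum(term in docs[i].lower() for term in terms)
    return sorted(candidates, key=lambda i: -hits(i))

def hybrid_retrieve(query, docs, k=2):
    # Assumption: candidates from both retrievers are merged by order-preserving union.
    merged = list(dict.fromkeys(dense_top_k(query, docs, k) + sparse_top_k(query, docs, k)))
    return [docs[i] for i in rerank(query, docs, merged)]

print(hybrid_retrieve("draft model verifier", DOCS)[0])
```

In a real pipeline the stand-in encoder would be replaced by the embedding model and the brute-force loop by the vector index; the union-then-rerank flow would stay the same under the assumptions stated above.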

[...] we introduce a novel speculative RAG framework that combines hybrid retrieval with speculative decoding. Our framework employs hybrid dense vector search (BGE-small) and sparse keyword search (TF-IDF) for enhanced recall, followed by a lightweight re-ranker. A compact draft model (TinyLlama-1.1B) proposes candidate future tokens, and a larger verifier model (Mistral-7B) then checks whether those candidates are correct. Empirical experiments demonstrate a 33% improvement in inference latency and a 29% reduction in tokens compared to a typical 7B baseline model, while retaining 94% accuracy on factoid tasks.
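The draft-then-verify loop described in the abstract can be sketched as follows. This is a simplified greedy variant written under stated assumptions, not the paper's algorithm: the acceptance rule (exact match against the verifier's greedy choice), the block size `k`, and the toy `draft` and `verifier` callables are all hypothetical. A real system would score every drafted position in a single verifier forward pass, whereas this sketch calls the verifier once per position for clarity.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to the next token.
Model = Callable[[List[str]], str]

def speculative_decode(draft: Model, verifier: Model, prompt: List[str],
                       max_new_tokens: int, k: int = 4) -> List[str]:
    out = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(out) < target_len:
        # 1) The draft model proposes up to k tokens autoregressively.
        proposal: List[str] = []
        for _ in range(min(k, target_len - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The verifier checks the proposal left to right. Matching tokens
        #    are accepted; the first mismatch is replaced by the verifier's
        #    own token and the rest of the proposal is discarded.
        accepted: List[str] = []
        for tok in proposal:
            expected = verifier(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)
                break
        out.extend(accepted)
    return out[len(prompt):]

# Toy stand-ins (assumptions): both models read a fixed script, and the draft
# model is wrong at one position ("in" instead of "on").
TARGET = "the cat sat on the mat".split()
DRAFT = "the cat sat in the mat".split()

def verifier(prefix: List[str]) -> str:
    return TARGET[len(prefix) - 1]

def draft(prefix: List[str]) -> str:
    return DRAFT[len(prefix) - 1]

print(speculative_decode(draft, verifier, ["<s>"], max_new_tokens=6))
```

Under this greedy rule the output always equals what the verifier alone would generate, so the draft model affects only how many verifier steps are needed, not the final text.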

Key Words: Retrieval-Augmented Generation, Speculative Decoding, Large Language Models, Inference Optimization, Hybrid Retrieval
