A Review of Quantization Techniques for Large Language Models: From Post-Training Quantization to Extreme 1-Bit Methods
Smit Vaghasiya, Baisampayan Dey
Mentor: Sinalben Patel
Abstract: Quantization reduces the numerical precision of LLM weights and activations from floating-point to low-bit integer formats, making it possible to run large models on hardware that would otherwise be off-limits. This review covers post-training quantization (PTQ), quantization-aware training (QAT), mixed precision, extreme sub-1-bit methods, and Key-Value (KV) cache compression, drawing on transformer-based LLM research up to early 2026. Standard PTQ methods like GPTQ and SpQR achieve near-lossless compression at 3–4 bits, with memory reductions exceeding 4× on 175B-parameter models. BitNet b1.58 pushes further, matching full-precision performance with 1.58-bit ternary weights from 3B parameters onward. More recently, sub-1-bit PTQ frameworks like NanoQuant and LittleBit have made it possible to run 70B-scale models on consumer GPUs, while KV cache methods like TurboQuant and CommVQ support million-token contexts within reasonable memory limits. We also cover two underappreciated issues: how Chain-of-Thought reasoning models break under standard quantization, and how low-bit quantization can inadvertently reverse machine unlearning.

Index Terms—Large Language Models, Post-Training Quantization, Quantization-Aware Training, KV Cache, Sub-1-Bit Compression, Model Compression, Low-Bit Inference
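To make the core operation concrete, the sketch below shows the simplest form of the precision reduction the abstract describes: round-to-nearest symmetric (absmax) quantization of a weight tensor to signed low-bit integers. This is a minimal illustration of the general idea, not the algorithm of any specific method surveyed (GPTQ, SpQR, and others add error-compensation and outlier handling on top of this primitive); the function names are ours.

```python
import numpy as np

def quantize_symmetric(weights: np.ndarray, bits: int = 4):
    """Round-to-nearest symmetric quantization with absmax scaling.

    Maps float weights onto signed integers in [-(2^(b-1)-1), 2^(b-1)-1],
    sharing one float scale per tensor.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(weights).max() / qmax          # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Reconstruction used at inference time: integer weight times float scale.
    return q.astype(np.float32) * scale

# Tiny worked example: 4-bit quantization of four weights.
w = np.array([0.12, -0.5, 0.33, 0.07], dtype=np.float32)
q, s = quantize_symmetric(w, bits=4)
w_hat = dequantize(q, s)
```

Storing `q` at 4 bits instead of `w` at 32 bits is the source of the 4×-and-beyond memory reductions cited above; the reconstruction error per weight is bounded by half the scale step.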