A Review of Quantization Techniques for Large Language Models: From Post-Training Quantization to Extreme 1-Bit Methods
Smit Vaghasiya, Baisampayan Dey
Mentor: Sinalben Patel
Abstract: Quantization reduces the numerical precision of LLM weights and activations from floating-point to low-bit integer formats, making it possible to run large models on hardware that would otherwise be off-limits. This review covers post-training quantization (PTQ), quantization-aware training (QAT), mixed precision, extreme sub-1-bit methods, and Key-Value (KV) cache compression, drawing on transformer-based LLM research up to early 2026. Standard PTQ methods like GPTQ and SpQR achieve near-lossless compression at 3–4 bits, with memory reductions exceeding 4× on 175B-parameter models. BitNet b1.58 pushes further, matching full-precision performance with 1.58-bit ternary weights from 3B parameters onward. More recently, sub-1-bit PTQ frameworks like NanoQuant and LittleBit have made it possible to run 70B-scale models on consumer GPUs, while KV cache methods like TurboQuant and CommVQ support million-token contexts within reasonable memory limits. We also cover two underappreciated issues: how Chain-of-Thought reasoning models break under standard quantization, and how low-bit quantization can inadvertently reverse machine unlearning.

Index Terms—Large Language Models, Post-Training Quantization, Quantization-Aware Training, KV Cache, Sub-1-Bit Compression, Model Compression, Low-Bit Inference
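To make the core operation concrete, the sketch below shows the simplest form of the precision reduction the abstract describes: round-to-nearest symmetric (absmax) quantization of a weight tensor to signed low-bit integers. This is a minimal illustration of the general idea, not the algorithm of any specific method surveyed (GPTQ, SpQR, and others add error-compensation and outlier handling on top of this primitive); the function names are ours.

```python
import numpy as np

def quantize_symmetric(weights: np.ndarray, bits: int = 4):
    """Round-to-nearest symmetric quantization with absmax scaling.

    Maps float weights onto signed integers in [-(2^(b-1)-1), 2^(b-1)-1],
    sharing one float scale per tensor.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(weights).max() / qmax          # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Reconstruction used at inference time: integer weight times float scale.
    return q.astype(np.float32) * scale

# Tiny worked example: 4-bit quantization of four weights.
w = np.array([0.12, -0.5, 0.33, 0.07], dtype=np.float32)
q, s = quantize_symmetric(w, bits=4)
w_hat = dequantize(q, s)
```

Storing `q` at 4 bits instead of `w` at 32 bits is the source of the 4×-and-beyond memory reductions cited above; the reconstruction error per weight is bounded by half the scale step.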