TSR-GEMM: Tile-Selective Precision Recovery for Robust Mixed-Precision Matrix Multiplication on GPU Tensor Cores
Authors:
Dr. Pavithra L¹, Vedant Singh Chauhan², Ananya Singh³, Vinayak Shrivastava⁴, Abhinav Rai⁵
¹Department of Computational Intelligence, SRMIST
Chennai, India
{vc2685, as1178, vs, ar}@srmist.edu.in
Abstract:
Modern deep learning frameworks increasingly rely on mixed-precision general matrix multiplication (GEMM) to exploit the throughput advantages of half-precision (FP16) Tensor Cores on NVIDIA GPUs. While FP16 GEMM delivers substantial speedups over FP32 computation, it introduces numerical errors that are spatially non-uniform across the output matrix: they concentrate in tiles whose input sub-blocks exhibit high condition numbers or significant cancellation. Existing recovery mechanisms, such as iterative refinement, operate at matrix-global granularity and therefore cannot exploit this spatial locality. We introduce TSR-GEMM (Tile-Selective Residual GEMM), a three-phase mixed-precision GEMM pipeline that (1) performs the bulk computation in FP16 using Tensor Cores while simultaneously accumulating per-tile norm statistics, (2) evaluates a lightweight instability score for each output tile based on input panel and output tile norms, and (3) selectively re-computes only flagged tiles in FP32 via cuBLAS. TSR-GEMM exposes a single tunable threshold τ that governs the precision–performance trade-off. On an NVIDIA RTX 3050 Ti GPU across matrix dimensions from 512×512 to 4096×4096, TSR-GEMM achieves FP32-comparable accuracy (5.4 × 10⁻⁸ relative error) at full recovery, while at 70% tile recovery it reduces error by 8× over pure FP16 with only a 12% throughput reduction relative to the no-recovery baseline. The τ sweep reveals a smooth, well-behaved Pareto frontier, confirming the instability score as a reliable predictor of per-tile numerical risk.
Index Terms—mixed-precision arithmetic, GEMM, Tensor Cores, CUDA, numerical accuracy, tile-selective recovery, GPU computing
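The three phases summarized in the abstract can be illustrated with a small CPU-only sketch. It is a minimal emulation under stated assumptions, not the CUDA implementation evaluated in this paper: FP16 is emulated by rounding through Python's `struct` half-precision format, Python doubles stand in for the FP32 cuBLAS recovery path, and the score `‖A panel‖·‖B panel‖ / ‖C tile‖` is only one plausible norm-based choice, since the abstract does not give the exact formula. The names `to_fp16`, `frob`, `tile_product`, and `tsr_gemm` are hypothetical, and the matrix dimensions are assumed to be multiples of the tile size `t`.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def frob(rows):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in rows for v in row))


def tile_product(A, B, r0, c0, t, low_precision):
    """Compute one t x t output tile, optionally with FP16-rounded inputs and output."""
    k = len(B)
    q = to_fp16 if low_precision else (lambda v: v)
    return [[q(sum(q(A[i][p]) * q(B[p][j]) for p in range(k)))
             for j in range(c0, c0 + t)]
            for i in range(r0, r0 + t)]


def tsr_gemm(A, B, t, tau):
    """Sketch of the pipeline: low-precision pass, per-tile score, selective recompute."""
    n, m = len(A), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    flagged = []
    for r0 in range(0, n, t):
        a_norm = frob(A[r0:r0 + t])                       # row-panel norm of A
        for c0 in range(0, m, t):
            b_norm = frob([row[c0:c0 + t] for row in B])  # column-panel norm of B
            # Phase 1: emulated FP16 tile.
            tile = tile_product(A, B, r0, c0, t, True)
            # Phase 2: ASSUMED score; large when the output is small relative
            # to its inputs, which is a sign of cancellation.
            score = a_norm * b_norm / (frob(tile) + 1e-30)
            # Phase 3: recompute flagged tiles at full precision.
            if score > tau:
                tile = tile_product(A, B, r0, c0, t, False)
                flagged.append((r0 // t, c0 // t))
            for i in range(t):
                C[r0 + i][c0:c0 + t] = tile[i]
    return C, flagged
```

In this sketch, `tau = 0.0` flags every tile (full recovery) and `tau = float("inf")` flags none (pure low precision). Intermediate values trade accuracy against the number of recomputed tiles, mirroring the role of τ described above.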