Quantization: Running Big Models on Smaller GPUs (GenAI Series, Part 20)

Quantization stores a model at lower precision so it needs far less memory. This part covers how FP16, INT8, and INT4 trade a little quality for big savings, and how distillation and pruning attack the same problem.

By Dr. Pranay Jha · June 18, 2026 · 9-minute read

Generative AI Series · Part 20 of 30

TL;DR · Key Takeaways

Quantization stores a model’s numbers at lower precision, fewer bits each, so the whole model takes far less memory.
A model’s memory footprint is roughly its parameter count times the bytes per parameter. Halve the bytes and you halve the memory.
Going from 16-bit to 4-bit can shrink a model around 4x, often with surprisingly little quality loss, which is what lets big models run on modest GPUs.
Related tricks, distillation (train a small model to mimic a big one) and pruning (remove dead weights), attack the same problem from other angles.

Here is a problem with hard numbers. A 70-billion-parameter model, stored at the usual 16 bits per parameter, needs about 140 gigabytes of memory just to hold its weights. No single consumer GPU has that, and even many data-center cards do not. By the arithmetic, that model simply cannot run on the hardware most people have. And yet people run 70B models on a single GPU all the time. The thing that makes the impossible routine is quantization, and it rests on one almost suspiciously simple idea: store each number using fewer bits.

Precision is just bits per number. Drop it and the memory bill drops in direct proportion.

What “precision” actually means

Every parameter in a model is a number, and a number has to be stored in some number of bits. Precision is just how many bits you spend on each one. More bits mean you can represent a value more exactly, with finer gradations and a wider range; fewer bits mean coarser steps and a narrower range. The common formats are a ladder: 32-bit floating point (the old default, very precise, 4 bytes per number), 16-bit (the modern training and serving default, 2 bytes), 8-bit, and 4-bit, each halving the storage of the one before.

Quantization is the act of taking a model trained at high precision and re-expressing its weights at lower precision, rounding each number to the nearest value the smaller format can hold. Think of it like saving a photo at lower bit depth: you lose some subtle gradations, but if done well the picture still looks right. The memory math is immediate: footprint is roughly parameter count times bytes per parameter, so a 70B model is about 140GB at 16-bit, 70GB at 8-bit, and 35GB at 4-bit. That last number is the difference between needing a small cluster and fitting on one card.
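The memory arithmetic above fits in a few lines. This is a minimal sketch, and it counts the weights only: the function name and the precision table are my own illustrative choices, and a real deployment also needs room for activations, the KV cache, and the small scale factors that quantized formats store alongside the weights.

```python
# Bytes per parameter for the precisions named in the text.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_footprint_gb(num_params: float, precision: str) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    # A 70-billion-parameter model at each precision.
    for precision in ("fp32", "fp16", "int8", "int4"):
        print(f"{precision}: {weight_footprint_gb(70e9, precision):.0f} GB")
```

Treat the result as a floor, not a budget: the card needs headroom beyond whatever this prints.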

What you trade, and what you keep

Nothing is free, so what does lower precision cost? In principle, accuracy: round every weight to a coarser grid and the model’s outputs drift slightly from what the full-precision version would say. The genuinely surprising empirical result of the last few years is how little this costs when done carefully. Modern 8-bit quantization is, for most purposes, indistinguishable from 16-bit. Even 4-bit, with the right method, keeps quality close enough that the small loss is a bargain for a 4x reduction in memory. There is a floor (going to extremely low precision does break things), but the practical sweet spots hold up far better than intuition suggests.
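To make "round every weight to a coarser grid" concrete, here is a small sketch of the simplest scheme I know: symmetric round-to-nearest with a single scale factor. The weights are synthetic (random Gaussian values standing in for a real layer), and the function names are hypothetical. Production methods are more careful than this, so read it as the naive baseline, not as how any particular library does it.

```python
import random


def quantize_absmax(weights, bits):
    """Symmetric round-to-nearest quantization with one scale for the whole list."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax > 0 else 1.0
    # Each weight becomes a small integer code on the coarser grid.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Synthetic stand-in for one layer's weights (an assumption, not real model data).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

for bits in (8, 4):
    codes, scale = quantize_absmax(weights, bits)
    err = mean_abs_error(weights, dequantize(codes, scale))
    print(f"{bits}-bit: step size={scale:.6f}, mean abs error={err:.6f}")
```

The rounding error per weight is bounded by half the step size, and the step grows as the bit count shrinks. That is the drift the paragraph above describes.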

And you gain more than memory. Smaller numbers mean less data to move between memory and the processor, and as a later part will hammer home, moving data is often the real bottleneck in inference. So a quantized model frequently runs faster too, not just lighter. This is the triangle every deployment juggles: precision, size, and quality pull against each other, and quantization lets you slide along that trade deliberately rather than being stuck at the expensive corner. For most real workloads, a well-quantized 4-bit or 8-bit model is the pragmatic default, not a compromise.
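A back-of-the-envelope sketch shows why fewer bytes can mean more speed. It assumes the simplest possible picture: one request at a time, and every generated token has to stream all the weights from memory once. The bandwidth figure is a hypothetical round number, not a measurement of any specific GPU, and real throughput also depends on the KV cache, batching, and the cost of unpacking low-bit weights.

```python
def max_tokens_per_second(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Rough ceiling on decode speed if each token must read every weight once."""
    weight_bytes = num_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


# Hypothetical memory bandwidth of 1 TB/s, chosen for round numbers only.
HYPOTHETICAL_BANDWIDTH = 1e12

for label, bytes_per_param in (("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)):
    ceiling = max_tokens_per_second(70e9, bytes_per_param, HYPOTHETICAL_BANDWIDTH)
    print(f"{label}: at most ~{ceiling:.1f} tokens/s")
```

Under these assumptions the ceiling scales inversely with bytes per parameter, so moving from 16-bit to 4-bit raises it by about 4x. Measured gains are usually smaller.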

It is worth knowing there are two points at which you can quantize, because they behave differently. Post-training quantization is the common one: take a finished model and compress it afterward, optionally using a small calibration sample to tune the rounding. It is fast, needs no retraining, and is what most downloadable “4-bit” models use. Quantization-aware training goes further by simulating the low precision during training, so the model learns weights that survive the squeeze gracefully. It costs a full training run, but it can hold quality at bit-widths where naive post-training compression would stumble. For most teams pulling a community-quantized model off a hub, post-training is the whole story; quantization-aware training is the tool you reach for when you are building the model yourself and need to push precision lower than the easy methods allow.

The practical impact of all this is hard to overstate: it is the reason capable models now run on a single workstation card, or even a laptop, instead of demanding a rack.

You cannot have maximum quality, minimum size, and full precision at once. You pick a point.

Two cousins: distillation and pruning

Quantization shrinks a model by spending fewer bits per weight, but it is not the only way to make a model smaller and cheaper. Distillation takes a different route: you train a small “student” model to imitate the outputs of a large “teacher,” transferring much of the big model’s behaviour into a fraction of the parameters. The student never matches the teacher exactly, but it can get remarkably close on the tasks that matter while being far cheaper to run. Many of the small, fast models you can run locally are distilled from larger ones.
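Here is a minimal sketch of what "imitate the outputs" can mean in practice, following the soft-target idea from Hinton et al. (2015): soften both models' output distributions with a temperature, then penalize the student for diverging from the teacher. The logits below are made up for illustration, and in real training this term is typically combined with an ordinary loss on the true labels.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-probabilities of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_total for z in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


# Made-up logits over three classes, for illustration only.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]

print("close student loss:", distillation_loss(close_student, teacher))
print("far student loss:  ", distillation_loss(far_student, teacher))
```

A higher temperature flattens the teacher's distribution, so the student also learns how the teacher ranks the wrong answers, not just which answer it picks.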

Pruning attacks the model’s wiring directly. Not every weight in a trained network pulls its weight, so to speak, and pruning removes the ones that contribute least, leaving a sparser network that does nearly the same job with fewer connections. In practice these techniques stack: a model can be distilled to a smaller size, pruned of dead weight, and then quantized to low precision, each step compounding the savings. The shared goal across all three is the theme of this whole phase: getting the most capability out of the least hardware, because the hardware is where the money is.
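The simplest version of pruning is magnitude pruning, sketched below. It treats "small absolute value" as a proxy for "contributes least". That proxy is an assumption of this sketch, not the only criterion in use, and the weights are toy values.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Toy weights, for illustration only.
weights = [0.9, -0.01, 0.4, 0.002, -0.7, 0.05]
print(magnitude_prune(weights, 0.5))
```

Zeros save memory or time only when the storage format or the hardware can skip them, which is why pruning is often done in structured patterns rather than weight by weight.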

Reality check: “quantized” is not one thing, and the method matters as much as the bit count. A lazily quantized 4-bit model can be noticeably worse, while a carefully quantized one of the same size is nearly lossless. So do not judge by the number alone; judge by running your own eval (Part 18) on the actual quantized build you plan to ship.

Go Deeper (optional, for technical readers)

Why can 4-bit inference come so close to 16-bit when, naively, you have thrown away three-quarters of the bits? Two reasons. First, neural network weights are highly redundant and noise-tolerant; the model was trained to be robust, and small rounding perturbations across millions of weights tend to average out rather than compound. Second, good quantization is not uniform rounding. Methods like GPTQ and AWQ are smarter: they look at which weights actually matter for the model’s outputs (using a small calibration dataset) and protect those, spending the limited precision budget where it counts. They also quantize in small groups with their own scale factors, so a few large values do not force the whole block onto a coarse grid.
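The effect of per-group scale factors is easy to see in a toy experiment. The sketch below is plain round-to-nearest with and without groups. It is not GPTQ or AWQ, which add calibration on top. The synthetic weights, the planted large value, and the group size of 32 are all illustrative assumptions.

```python
import random


def quantize_dequantize(values, bits):
    """Round-to-nearest with one absmax scale, returned as approximate reals."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def groupwise_quantize_dequantize(values, bits, group_size):
    """Same rounding, but each group of values gets its own scale."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(quantize_dequantize(values[start:start + group_size], bits))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Synthetic weights with one planted large value (an assumption for the demo).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]
weights[10] = 1.0

one_scale = quantize_dequantize(weights, 4)
grouped = groupwise_quantize_dequantize(weights, 4, group_size=32)
print("single scale error:", mean_abs_error(weights, one_scale))
print("per-group error:   ", mean_abs_error(weights, grouped))
```

With a single scale, the one large value stretches the grid so far that most ordinary weights round to zero. With groups, only the group containing it pays that price. The cost is one extra scale factor per group, so the effective size is slightly more than 4 bits per weight.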

The thing that breaks naive quantization is outliers. In large models, a small number of activation values are enormous compared to the rest, and if you scale everything to accommodate them, the ordinary values collapse into too few levels and information is lost. Methods like LLM.int8() handle this by isolating the outlier dimensions and keeping them at higher precision while quantizing the rest aggressively. When does low precision still fail? Very low bit-widths (2-bit and below) generally do degrade quality meaningfully today, sensitive layers sometimes need to stay at higher precision, and tasks demanding exact numerical work are less forgiving. The honest summary: 8-bit is nearly always safe, 4-bit is usually safe with a good method and a quick eval, and below that you are trading real quality for memory and should measure carefully before believing it is free.
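The outlier-isolation idea can be sketched on a single dot product. This is a heavy simplification in the spirit of LLM.int8(), not its implementation: the real method works on whole matrices with per-row and per-column scales. The synthetic data, the planted outlier, and the threshold value are assumptions made for the demo.

```python
import random


def quantize_int8(values):
    """Symmetric absmax quantization to integers in [-127, 127]."""
    amax = max((abs(v) for v in values), default=0.0)
    scale = amax / 127 if amax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dot_int8(x, w):
    """Dot product with both vectors quantized to 8-bit integers."""
    xq, xs = quantize_int8(x)
    wq, ws = quantize_int8(w)
    return sum(a * b for a, b in zip(xq, wq)) * xs * ws


def dot_mixed(x, w, threshold=6.0):
    """Keep outlier activation dimensions in full precision; quantize the rest."""
    outlier = [i for i, v in enumerate(x) if abs(v) >= threshold]
    regular = [i for i, v in enumerate(x) if abs(v) < threshold]
    high = sum(x[i] * w[i] for i in outlier)
    low = dot_int8([x[i] for i in regular], [w[i] for i in regular])
    return high + low


def compare(trials=200, dim=256, seed=0):
    """Average absolute error of each approach over synthetic trials."""
    rng = random.Random(seed)
    naive_err = mixed_err = 0.0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[7] = 60.0  # planted outlier activation
        w = [rng.gauss(0.0, 0.02) for _ in range(dim)]
        exact = sum(a * b for a, b in zip(x, w))
        naive_err += abs(dot_int8(x, w) - exact)
        mixed_err += abs(dot_mixed(x, w) - exact)
    return naive_err / trials, mixed_err / trials


naive, mixed = compare()
print(f"quantize everything: mean error {naive:.5f}")
print(f"isolate outliers:    mean error {mixed:.5f}")
```

Only a handful of dimensions take the high-precision path, so most of the arithmetic still happens in 8-bit while the ordinary values keep a usable grid.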

This is Part 20 of a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. It builds on what parameters are (Part 4) and sets up the GPU memory wall in the parts ahead.

Precision at a glance (70B model)

Precision	Bytes/param	Weights (70B model)	Trade-off
FP32	4	~280 GB	Most precise, rarely needed
FP16 / BF16	2	~140 GB	The standard default
INT8	1	~70 GB	Near-lossless with a good method
INT4	0.5	~35 GB	Big savings, small quality risk

Halve the bytes per parameter and the memory footprint halves with it.

The Bottom Line

Quantization is the quiet workhorse that puts big models within reach of normal hardware. Spend fewer bits per parameter and a model’s memory footprint falls in lock step, turning a 140GB monster into a 35GB model that fits on one GPU and often runs faster too. The remarkable part is how cheap the quality cost has become with good methods, which is why low-precision serving is now the default rather than a desperate measure. Distillation and pruning push from other directions toward the same goal.

The practical stance I would take: reach for 8-bit without a second thought, reach for 4-bit with a quick eval to confirm, and be suspicious of anyone promising lossless results below that. These tricks all exist to dodge a single hard limit, the amount of fast memory on a GPU and the cost of moving data through it. That limit is so central to how generative AI runs that it deserves its own part, which is exactly where Phase 5 begins: why GenAI runs on GPUs at all, and the memory wall that governs everything.

References

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Dettmers et al., 2022)
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (Frantar et al., 2022)
Distilling the Knowledge in a Neural Network (Hinton et al., 2015)


About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
