TL;DR · Key Takeaways
- Scaling inference is a tug-of-war between latency (speed for one user) and throughput (total work across all users). Pushing one usually costs the other.
- Batching is the main dial: bigger batches raise throughput and lower cost per token, but make each individual user wait a little longer.
- When a model outgrows one GPU, you split it with tensor parallelism (each layer sliced across GPUs in one box) or pipeline parallelism (consecutive layers placed on different GPUs, possibly across boxes). When only the traffic outgrows one GPU, you add replicas instead.
- Autoscale on the right signals (GPU and queue metrics), not CPU usage. Scaling on the wrong number is a classic and expensive mistake.
Every team scaling an AI service eventually runs into the same wall, and it is not a bug, it is a law. You can make the service fast for each user, or you can make it cheap by serving many users at once, but pushing hard on one tends to give ground on the other. There is no setting that maximises both, only a curve of trade-offs you get to choose a point on. Understanding that curve, and the handful of techniques for bending it, is what scaling inference actually means. The serving engine from Part 24 gives you the controls; this part is about how to set them.
The trade-off at the centre of everything
Latency is how quickly a single user gets their answer. Throughput is how many tokens the whole system produces per second across everyone. They sound complementary, but on a fixed GPU they fight, and the referee is batching. Put more requests in a batch and you use the hardware more efficiently, because, as Part 23 explained, you read the model’s weights from memory once and serve the whole batch with them. Throughput climbs and cost per token falls. But every user in that batch is now sharing the GPU’s attention with more neighbours, so each one’s tokens arrive a little slower. Bigger batch, better economics, worse individual latency.
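To make the shape of that trade-off concrete, here is a toy model, not a benchmark. It assumes each decode step pays a fixed cost to read the weights plus a small per-request cost; the function names and the 20 ms and 0.5 ms figures are invented for illustration and should be replaced with numbers measured on your own hardware.

```python
def decode_step_ms(batch_size, weight_read_ms=20.0, per_request_ms=0.5):
    """Time for one decode step: a fixed weight read plus per-request work.

    Both constants are hypothetical placeholders, not measurements.
    """
    return weight_read_ms + per_request_ms * batch_size


def throughput_tok_per_s(batch_size, weight_read_ms=20.0, per_request_ms=0.5):
    """Each step emits one token per request in the batch."""
    step = decode_step_ms(batch_size, weight_read_ms, per_request_ms)
    return batch_size / step * 1000.0


if __name__ == "__main__":
    print("batch  ms/token per user  tokens/s total")
    for b in (1, 8, 32, 128):
        print(f"{b:5d}  {decode_step_ms(b):17.1f}  {throughput_tok_per_s(b):14.0f}")
```

In this model the per-user time per token only ever goes up with batch size, while total tokens per second climbs steeply at first and then flattens as the per-request work starts to dominate the fixed weight read.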
This is why there is no universally correct configuration, only the right one for your product. A live chatbot lives and dies on responsiveness, so you keep batches smaller and accept a higher cost per token to keep latency low. An overnight job summarising a million documents has no human waiting, so you push batches as large as memory allows and optimise purely for throughput and cost. The same model on the same hardware should be tuned completely differently for those two cases. The first skill of scaling is knowing which side of this trade your workload actually sits on, because optimising for the wrong one quietly wastes money or annoys users.
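One way to turn "which side of the trade are you on" into a setting is to state a per-token latency budget and take the largest batch that still fits under it. The sketch below assumes a made-up linear cost model (`step_ms`) and a hypothetical helper name; a real deployment would substitute measured step times and a memory-derived `max_batch`.

```python
def step_ms(batch_size):
    """Hypothetical cost model: fixed weight read plus per-request work."""
    return 20.0 + 0.5 * batch_size


def largest_batch_within_budget(budget_ms, max_batch):
    """Return the biggest batch whose per-token latency fits the budget.

    Returns 0 if even a batch of one is too slow for the budget.
    """
    best = 0
    for b in range(1, max_batch + 1):
        if step_ms(b) <= budget_ms:
            best = b
    return best


# A chatbot with a tight per-token budget versus an overnight job with none.
chat_batch = largest_batch_within_budget(budget_ms=30.0, max_batch=256)
offline_batch = largest_batch_within_budget(budget_ms=float("inf"), max_batch=256)
```

With these invented numbers the chatbot settles on a small batch while the offline job runs at whatever memory allows, which is the same model on the same hardware tuned two different ways.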
When one GPU is not enough
Sometimes the model itself is too big for a single card’s VRAM, or the load is too high for one GPU to carry. Then you split the work, and there are two main ways to do it. Tensor parallelism slices each layer’s math across several GPUs that work on the same token simultaneously, each handling a fraction of every matrix multiply and exchanging results constantly. It keeps latency low because all the GPUs push on each token together, but that constant exchange demands a very fast connection between them, which is why tensor parallelism is normally kept inside a single server where the GPUs are linked by high-speed interconnect.
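The "fraction of every matrix multiply" idea can be shown without any GPUs. The sketch below simulates one common form of tensor parallelism, a column split, in plain Python: each simulated device holds a slice of the weight matrix's columns, computes its share of the output, and the shares are concatenated. The function names are illustrative, the column count is assumed to divide evenly, and real frameworks do this with collective operations over the interconnect rather than Python lists.

```python
def matmul(x, w):
    """Multiply a vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, parts):
    """Give each simulated GPU a contiguous slice of w's columns.

    Assumes the number of columns is divisible by `parts`.
    """
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


def tensor_parallel_matmul(x, w, parts):
    """Each shard computes part of the output; the results are then gathered."""
    shards = split_columns(w, parts)
    partials = [matmul(x, shard) for shard in shards]  # concurrent on real GPUs
    return [value for part in partials for value in part]  # the "gather" step


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, parts=2)
```

The gather at the end is the "exchanging results constantly" part: it happens for every layer and every token, which is why the link speed between the GPUs matters so much.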
Pipeline parallelism splits the model the other way, by layers. GPU one holds the first chunk of layers, GPU two the next, and so on, and a request flows through them like an assembly line. Because each GPU passes only a small result to the next stage, pipeline parallelism tolerates slower links and can stretch across multiple servers, which is how truly enormous models are served. Its cost is “pipeline bubbles,” idle gaps when stages wait for work to reach them, which add latency and take care to minimise. Real large-scale deployments combine both: tensor parallelism within each box for speed, pipeline parallelism across boxes for capacity. The shape of your hardware and network decides the mix.
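The bubble cost can be put in numbers with a simple idealised model. Assume every stage takes the same time and a batch is cut into equal micro-batches that are fed through one after another. The pipeline then needs `micro_batches + stages - 1` time slots to drain, and each stage sits idle for `stages - 1` of them. The function name below is mine, and real schedules with uneven stages will behave less neatly.

```python
def bubble_fraction(stages, micro_batches):
    """Idle share of each stage's time in an idealised pipeline.

    Assumes equal-cost stages and equal-size micro-batches.
    """
    total_slots = micro_batches + stages - 1
    return (stages - 1) / total_slots


if __name__ == "__main__":
    for m in (1, 4, 16, 64):
        print(f"4 stages, {m:2d} micro-batches: {bubble_fraction(4, m):.0%} idle")
```

Feeding the pipeline more micro-batches shrinks the idle share, which is the usual way bubbles are kept small, at the price of more requests in flight at once.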
It is worth separating two questions people blur together. Splitting one model across GPUs (tensor and pipeline parallelism) is how you handle a model too big for one card, or push the latency of a single huge model down. Serving more users is usually a different and simpler move: run many independent copies of the model, each on its own GPU or group of GPUs, behind a load balancer that spreads requests across them. This is replica, or data-parallel, scaling, and it is the everyday workhorse of production AI. If your model already fits on one GPU and you just need to handle more traffic, you do not split the model at all, you clone it and balance across the clones. The clean rule of thumb: split the model when the model is the problem, add replicas when the volume is the problem, and combine them only when you genuinely have both. Confusing these two is how teams reach for complex parallelism they never needed.
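The rule of thumb fits in a few lines of arithmetic. This sketch uses hypothetical inputs (model size, GPU memory, peak requests per second, and the measured capacity of one model copy) and deliberately ignores KV cache and activation headroom, so treat its output as a first estimate rather than a sizing tool.

```python
import math


def scaling_plan(model_gb, gpu_gb, peak_rps, rps_per_copy):
    """Split the model only if it does not fit; add replicas only for volume.

    Simplification: ignores the memory needed for KV cache and activations.
    """
    gpus_per_copy = max(1, math.ceil(model_gb / gpu_gb))
    replicas = max(1, math.ceil(peak_rps / rps_per_copy))
    return {
        "split_model": gpus_per_copy > 1,
        "gpus_per_copy": gpus_per_copy,
        "replicas": replicas,
        "total_gpus": gpus_per_copy * replicas,
    }


# A model that fits on one card but faces heavy traffic: clone it.
small_busy = scaling_plan(model_gb=14, gpu_gb=80, peak_rps=50, rps_per_copy=10)

# A model too big for one card with light traffic: split it.
large_quiet = scaling_plan(model_gb=140, gpu_gb=80, peak_rps=5, rps_per_copy=10)
```

The two answers come from independent calculations, which is the point: model size decides whether you split, traffic decides how many copies you run.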
Autoscale on the right number
Demand is never flat, so you add and remove GPU capacity as load changes, the same autoscaling idea used everywhere in computing. But AI serving has a trap waiting in it. The default metric most autoscalers reach for is CPU usage, and for an AI workload that number is almost meaningless: the CPU can look bored while the GPU is completely saturated and requests are piling up. Scale on CPU and you will under-provision badly, watching latency spike while your dashboard insists everything is fine. This is one of the most common and costly mistakes teams make moving AI to production.
The fix is to scale on signals that reflect the real bottleneck. GPU utilization, the depth of the request queue, and time-to-first-token are the meaningful ones: they tell you when the accelerators are full and users are starting to wait. Add capacity when the queue grows or first-token latency creeps up, and shed it when the GPUs go quiet. The other half of the trap is that AI capacity is slow and expensive to add: spinning up a GPU and loading a large model is not instant. Good autoscaling for inference is therefore more deliberate than the snappy CPU-based scaling of a stateless web app. You provision for the spikes you can predict and scale gracefully around the ones you cannot.
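As a sketch of what "scale on the right signals" looks like as a decision rule, the function below takes queue depth, time-to-first-token, and GPU utilization, and never looks at CPU. Every threshold and name here is a hypothetical default chosen for illustration. A real autoscaler would also add cooldown periods, since new GPU capacity takes a while to become useful.

```python
def desired_replicas(current, queue_depth, ttft_ms, gpu_util,
                     max_queue_per_replica=4, ttft_budget_ms=500.0,
                     idle_util=0.30, min_replicas=1, max_replicas=16):
    """Step the replica count up or down by one based on GPU-side signals."""
    overloaded = (queue_depth > max_queue_per_replica * current
                  or ttft_ms > ttft_budget_ms)
    idle = queue_depth == 0 and gpu_util < idle_util
    if overloaded:
        target = current + 1  # users are waiting: add capacity
    elif idle:
        target = current - 1  # GPUs have gone quiet: shed capacity
    else:
        target = current
    return max(min_replicas, min(max_replicas, target))
```

Moving one replica at a time is a deliberately cautious choice that matches the slow, expensive nature of GPU capacity. A predictable daily peak is usually better handled by scheduling capacity ahead of it than by reacting.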
Operating it: orchestration, GPU sharing, and observability
Everything so far assumes the GPUs are there and kept busy. In production, something has to place workloads onto them, share them safely between teams, and tell you when one is sick, and that operational layer is where a great deal of real GenAI infrastructure work actually lives. The de facto control plane is Kubernetes: a GPU operator installs the drivers and exposes each card to the scheduler, and serving frameworks like KServe or Ray Serve run your model as an autoscaling service on top, wiring the latency-and-throughput dials from this part into something you can deploy, roll back, and monitor. If you are not on Kubernetes, the same jobs (scheduling, health checks, rolling updates) still have to be solved some other way.
A modern accelerator is often too big to hand to one small workload, so you partition it. NVIDIA’s MIG (Multi-Instance GPU) slices a card into hardware-isolated instances, each with its own memory and compute; vGPU and time-slicing share a card more softly. This is how you pack several light models, or several tenants, onto expensive hardware without them treading on each other, the multi-tenancy problem every platform team eventually hits. Getting partitioning right is a direct lever on the utilization that, as these parts keep arguing, decides your real cost: an H100 carved into four MIG slices serving four models can beat one model that leaves three-quarters of the card idle.
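The utilization argument is simple arithmetic: what you pay per hour, divided by the fraction of the card doing real work. The price and utilization figures below are invented, and the example assumes the four models pack perfectly onto one card, which real partition profiles will only approximate.

```python
def cost_per_useful_gpu_hour(price_per_hour, utilization):
    """Hourly price divided by the fraction of the card doing real work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return price_per_hour / utilization


PRICE = 4.0  # hypothetical price per card-hour

# One light model alone on a whole card, using a quarter of it.
dedicated = cost_per_useful_gpu_hour(PRICE, 0.25)

# Four such models packed onto one partitioned card (idealised packing).
shared = cost_per_useful_gpu_hour(PRICE, sum([0.25] * 4))
```

Under these assumptions the dedicated card costs four times as much per unit of useful work, before counting the three extra cards the other models would need.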
And you cannot manage what you cannot see. CPU dashboards lie about GPU health, so you watch GPU-native signals: per-card utilization and memory (exposed by NVIDIA’s DCGM), the request queue depth, and the token latencies from earlier. Those are what reveal a saturated card, a model thrashing its KV cache, or a node about to fail. If you want these operational pieces worked through on a concrete platform, I cover the trade-offs in my Private AI series: GPU partitioning (vGPU, MIG, passthrough), GPU monitoring, and self-service model serving.
Go Deeper (optional, for technical readers)
Two advanced techniques bend the latency-throughput curve rather than just sliding along it. Speculative decoding attacks the fundamental slowness of generating one token at a time. A small, cheap “draft” model rapidly proposes several tokens ahead, and the large model then verifies all of them in a single forward pass. Because verification is parallel and the big model’s weights were going to be read anyway, several tokens can be confirmed for roughly the memory traffic of a single decode step. The large model keeps the proposals it agrees with and discards everything from the first wrong guess onward. When the draft model is well matched to the target, this can speed up generation substantially without changing what the large model would have produced, partly defeating the memory-bound penalty from Part 23.
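The propose-then-verify loop is easier to see in a toy. The sketch below uses two made-up "models" that are just functions over integer tokens, and it implements the greedy variant: a proposal is accepted only if it equals the target's own next token. The published method uses a sampling-based acceptance rule instead. A real system scores all proposed positions in one batched forward pass, whereas this toy calls the target once per position purely to keep the code short.

```python
def target_next(context):
    """Hypothetical 'large' model: next token is the sum of the last two, mod 10."""
    return (context[-1] + context[-2]) % 10


def draft_next(context):
    """Hypothetical 'draft' model: matches the target except when the answer is 7."""
    guess = (context[-1] + context[-2]) % 10
    return 0 if guess == 7 else guess


def speculative_step(context, k):
    """Draft proposes k tokens; keep the agreeing prefix plus one target token."""
    ctx, proposed = list(context), []
    for _ in range(k):
        proposed.append(draft_next(ctx))
        ctx.append(proposed[-1])
    ctx, accepted = list(context), []
    for token in proposed:  # one batched verification pass in a real system
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)  # replace the first wrong guess, drop the rest
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target_next(ctx))  # every guess was right: one bonus token
    return accepted


def generate_speculative(prompt, n_tokens, k=4):
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(out, k))
        passes += 1  # counts verification passes of the large model
    return out[:len(prompt) + n_tokens], passes


def generate_plain(prompt, n_tokens):
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out


tokens, passes = generate_speculative([1, 1], n_tokens=12)
```

In this toy the output is identical to decoding with the target alone, but the large model is consulted in far fewer passes than there are tokens. How much that saves in practice depends on how often the draft's guesses are accepted.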
Disaggregated prefill and decode tackles the split-personality problem of inference. Recall that prefill is compute-bound and decode is memory-bound; running them on the same GPUs means each interferes with the other’s ideal scheduling. Disaggregation puts them on separate pools of GPUs, a set tuned for chewing through prompts and a set tuned for streaming out tokens, with the KV cache handed across between them. This lets each pool run at its own optimal batch size and configuration, improving both the time-to-first-token and the steady token rate at scale. Both techniques are increasingly standard in large serving stacks, and both are reminders that “scaling” is not only adding hardware; it is also being smarter about the work each piece of hardware does. The catch is operational complexity, so they earn their keep at scale and are overkill for a modest single-GPU deployment.
This is Part 25 of a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. It builds on the memory wall (Part 23) and the inference engines (Part 24).
The Bottom Line
Scaling inference is the art of placing yourself on the latency-throughput curve on purpose. Batching is the dial that trades one for the other, parallelism (tensor within a box, pipeline across boxes) is how you grow past a single GPU, and autoscaling keeps capacity matched to demand, as long as you scale on GPU and queue signals rather than the CPU number that lies to you. Speculative decoding and disaggregation are the clever moves that bend the curve when you reach real scale.
The discipline I would insist on: name your latency budget first, then chase throughput up to that line and no further. Almost every scaling mistake is really a failure to decide which side of the trade the product needs. One thread keeps recurring through all of this, the speed of the links between GPUs, and it points at a layer we have so far treated as a given. The next part goes there, into the network and storage that quietly govern large-scale AI.
References
- Fast Inference from Transformers via Speculative Decoding (Leviathan et al., 2022)
- Megatron-LM: tensor and pipeline parallelism (Shoeybi et al., 2019)
- LLM inference performance engineering (Databricks)