TL;DR · Key Takeaways
- An inference engine is the software that serves a model efficiently. The same GPU can serve a handful of users or many, depending on it.
- Two ideas do most of the heavy lifting: continuous batching (keep the GPU full by swapping requests in and out) and paged attention (manage the KV cache without waste).
- vLLM is the easy, flexible default. TensorRT-LLM squeezes maximum performance on NVIDIA hardware. SGLang shines on shared-prefix and structured workloads.
- Managed options like NVIDIA NIM package these up so you trade some control for a supported, ready-to-run service.
Two teams can run the exact same model on the exact same GPU and get wildly different results: one serves five users before it chokes, the other serves fifty smoothly. The model did not change. What changed is the inference engine, the unglamorous serving software sitting between the model and the users, and it is one of the highest-leverage choices in the whole stack. Phase 23 ended on the memory wall and the KV cache; this part is about the software built specifically to beat them, and how the leading options differ when you have to pick one.
What an inference engine actually does
You could, in principle, load a model with a few lines of basic code and generate text. It would work, and it would be catastrophically inefficient, serving one request at a time while the expensive GPU sat mostly idle between tokens. An inference engine is the specialised layer that turns that toy into a service. Its job is to keep the GPU as busy as possible across many simultaneous users, schedule their requests intelligently, and squeeze the most tokens per second out of the hardware you are paying for. In the economics of Part 22, the engine is what determines your real cost per token, because it sets your utilization.
Everything an engine does traces back to the constraints from the last part. The GPU is fast but memory-bound, and the KV cache grows and fragments as users come and go. So the engine’s cleverness is mostly about two things: keeping work flowing so the cores never starve, and packing the KV cache so tightly that more users and longer contexts fit in the same fixed VRAM. Almost every headline feature you will read about is a variation on those two goals.
It helps to know the numbers engines are actually optimising, because they pull against each other. Time to first token is how long the user waits before anything appears, dominated by the prefill phase and what makes a chat feel responsive. Inter-token latency is the speed of the words streaming out after that, the steady drip of decode. And throughput is the total tokens per second across all users, which is what sets your cost. The tension is that the moves which raise throughput, bigger batches packing the GPU, can hurt the latency any single user feels, because their request now shares the chip with more neighbours. An engine’s scheduler is constantly refereeing this trade, and the “best” settings depend entirely on whether you are running a latency-sensitive chatbot or a throughput-sensitive batch job. Keep these three metrics in mind, because every engine claim is really a claim about where it lands on them.
The two ideas that changed serving
The first is continuous batching. The naive way to batch is static: gather a group of requests, run them together until all are done, then start the next group. The problem is that requests finish at different times, a short answer is done while a long one keeps going, so the GPU slots freed by the finished requests sit empty, waiting for the slowest in the batch. Continuous batching fixes this by treating the batch as a living thing: the instant one request finishes, a waiting one takes its place, mid-flight. The GPU stays packed, utilization soars, and total throughput can multiply several times over with no change to the model.
The second is paged attention, and it borrows a trick from operating systems. Traditionally the KV cache for each request needed a single contiguous block of memory, sized for the longest possible conversation, which wasted enormous amounts of VRAM on space that might never be used and left the memory fragmented. Paged attention splits the KV cache into small fixed-size pages that can live anywhere in memory, exactly like virtual memory pages on your computer. Nothing is reserved that is not needed, fragmentation vanishes, and far more requests fit in the same VRAM. Pages can even be shared between requests with a common prefix. This one idea, introduced by vLLM, roughly doubled how many users a given GPU could hold, and it is now table stakes.
The contenders, and when each fits
vLLM is the popular default, and for good reason. It is open-source, supports a huge range of models, is straightforward to stand up, and delivers excellent throughput thanks to the paged attention it pioneered. For most teams self-hosting, vLLM is the sensible starting point: you will get strong performance without a research project, and the community is large. Reach past it only when you have a specific reason.
In practice, standing up vLLM is close to a one-liner. You point it at a model and it serves an OpenAI-compatible endpoint:
vllm serve meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2
That single command pulls the weights, builds the paged-attention KV cache, and exposes an API, which is much of why vLLM became the default. Two more names you will meet: Hugging Face’s TGI (Text Generation Inference), another capable open server, and NVIDIA Triton, which is a serving layer that often runs TensorRT-LLM underneath rather than an engine in its own right. This space moves quickly, so treat any specific ranking here as a snapshot of the field as of 2026 and benchmark the current releases on your own workload before committing.
TensorRT-LLM is NVIDIA’s high-performance engine. It compiles a model into highly optimised kernels tuned for specific NVIDIA GPUs, which can deliver the lowest latency and highest efficiency available, the last drop of performance from the hardware. The cost is complexity and rigidity: the build step is heavier, it is tied to NVIDIA, and changing models is less casual. It earns its place when performance per GPU directly drives your economics and you are willing to invest engineering to get it. SGLang is the newer challenger, strong on workloads with shared prefixes (think many requests that reuse the same long system prompt or document) and on structured, constrained generation, which makes it attractive for agent-style and high-volume templated applications. Finally, managed packages like NVIDIA NIM wrap optimised engines (often TensorRT-LLM) into ready-to-run containers with stable APIs, trading some control and flexibility for a supported, deploy-it-and-go experience that enterprises often prefer.
One capability cuts across all of them and is easy to overlook until you need it: spreading a model across multiple GPUs. When a model is too large to fit in one card’s VRAM, the engine has to split it, a technique called tensor parallelism that shares each layer’s math across several GPUs working in lockstep. Every serious engine supports this, but they differ in how efficiently they do it and how much fast interconnect between the GPUs they assume. That dependency is not a footnote; for multi-GPU serving the network between the cards can matter as much as the cards themselves, which is exactly the thread the next two parts pull on. For now, just note that “which engine” and “how many GPUs” are linked questions, not separate ones.
| Engine | Strength | Reach for it when |
|---|---|---|
| vLLM | Easy, flexible, great throughput | Your default for self-hosting |
| TensorRT-LLM | Peak performance on NVIDIA GPUs | Squeezing every drop, at scale |
| SGLang | Shared prefixes, structured output | Agents, templated, prefix-heavy loads |
| Managed (NIM) | Packaged, supported, stable API | You want it run for you |
▾ Go Deeper (optional, for technical readers)
The reason engines schedule so carefully is the split personality of inference from Part 23: prefill and decode have opposite bottlenecks. Prefill processes the whole prompt in parallel and is compute-bound; decode generates one token at a time and is memory-bound. Mixed naively, they interfere, a big prefill can stall the steady drip of decode tokens for everyone in the batch, spiking latency. Modern engines handle this with techniques like chunked prefill (slicing a long prompt so it interleaves with ongoing decodes) and, at larger scale, disaggregated serving that runs prefill and decode on separate pools of GPUs so each is tuned for its own bottleneck.
Two other levers are worth naming. Prefix caching (a strength SGLang leans on heavily, via its radix-tree approach) stores the KV cache for a shared prefix once and reuses it across every request that begins the same way, which is enormous when thousands of calls share a long system prompt or document. And speculative decoding, covered more next part, uses a small draft model to guess several tokens ahead that the big model then verifies in one pass, partly sidestepping the one-token-at-a-time memory penalty. The takeaway for choosing an engine: the right pick depends on which of these your workload stresses. Prefix-heavy, agentic, or structured workloads reward different optimisations than a stream of unique one-off prompts. If you are deploying NVIDIA’s packaged stack specifically, I walk through NIM microservices in practice in my Private AI NIM microservices write-up.
This is Part 24 of a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. It is the direct sequel to Part 23 on the memory wall.
The Bottom Line
The inference engine is the difference between a GPU that serves a few users and one that serves many, and it does its work through two big ideas: continuous batching to keep the hardware full, and paged attention to use the KV-cache memory without waste. Get the engine right and your cost per token, your latency, and your capacity all improve at once, without touching the model.
My verdict for most teams: start with vLLM because it is capable and easy, move to TensorRT-LLM when peak performance pays for the added complexity, look at SGLang when your workload is prefix-heavy or structured, and consider a managed package like NIM when you would rather buy the operations than build them. Whichever you choose, benchmark on your own traffic, because the only ranking that matters is the one on your workload. Engine chosen, the next question is how to scale beyond one GPU, and the tug-of-war between latency and throughput that governs it.
References
- vLLM and PagedAttention (vLLM project)
- TensorRT-LLM (NVIDIA)
- SGLang and RadixAttention (LMSYS)
« Part 23: GPUs and the memory wall | Generative AI Complete Guide | Next: Part 25, scaling inference »








