The quantize.py script exits cleanly. The trtllm-build command finishes. The engine loads. Throughput is 30 percent below projections and the first accuracy check on a coding task fails hard. That sequence happens often enough that it has a name in my practice: the quiet failure of the optimization pass. The cause is almost always one of three things: the engineer picked a precision format the GPU does not support at full speed, calibrated on the wrong dataset, or never checked whether TensorRT-LLM was actually doing in-flight batching. This post breaks down the layers beneath a NIM so you can stop debugging in the dark.
What TensorRT Does: Build-Time Compilation, Not a Runtime Wrapper
TensorRT is a model compiler. You give it a network graph (ONNX, or a TensorRT API definition), a target GPU, and a precision policy. It produces a binary engine file that is tied to that GPU architecture, the TensorRT version (and, in practice, the driver and CUDA stack it was built against), and the declared input shape range. The engine is not portable. A Hopper H100 engine will not run on an Ampere A100, and vice versa. That portability boundary bites teams that build engines on a CI runner with a different GPU than production.
Three things happen during the build phase. First, graph optimization: TensorRT fuses adjacent operations, eliminates redundant casts, and simplifies constants. Second, kernel selection: for each fused subgraph it profiles candidate CUDA kernels against the target GPU and picks the fastest one for the declared input shape range. Third, precision calibration: if you request INT8 or FP8, TensorRT either reads per-tensor scales from Q/DQ nodes embedded in the graph, or runs a calibration dataset through the network to derive dynamic ranges.
The kernel selection step is why build times are long. A 70B parameter model on four H100s can take 20 to 40 minutes to build. That cost is paid once; the resulting engine launches in seconds and runs at near-peak Tensor Core utilization because every op already has its kernel selected and fused.
Kernel Fusion: What It Actually Looks Like
A transformer attention block in an unfused graph runs as five or more separate CUDA kernel launches: a QKV projection GEMM, a softmax, a context GEMM, a layer norm, a residual add. Each kernel launch has overhead, and each intermediate result makes a round trip through GPU memory. Fused, that sequence collapses into one or two kernels that keep intermediate results in registers or shared memory. For multi-head attention, TensorRT-LLM ships a custom Flash MHA (FMHA) kernel that handles the entire QKV + softmax + context attention in one pass with tiling. For gated-MLP blocks on Hopper, it adds GEMM + SwiGLU fusion that merges two matmul ops and one SwiGLU activation into a single dispatch. The throughput difference is not marginal: on Llama-3.3-70B across four H100-SXM-80GB GPUs, NVIDIA’s own benchmarks show the full suite of FP8 fusions yielding 6,049 tokens/sec versus 2,474 tokens/sec for tuned FP16 without those fusions, a gain of roughly 145 percent (about 2.4x).
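The "one pass with tiling" idea can be illustrated with the generic online-softmax trick used by flash-attention-style kernels. This is a minimal pure-Python sketch for a single query with scalar values, not TensorRT-LLM's actual FMHA kernel; the function names and the tile size are made up for illustration.

```python
import math

def attention_two_pass(scores, values):
    """Reference: materialize all softmax weights, then take the weighted sum."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def attention_tiled(scores, values, tile=2):
    """Process scores tile by tile, keeping only three running scalars."""
    running_max = float("-inf")
    running_sum = 0.0   # softmax denominator, relative to running_max
    running_out = 0.0   # weighted sum of values, relative to running_max
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Rescale what was accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        running_out *= correction
        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum

scores = [0.1, 2.0, -1.0, 3.5, 0.7]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(attention_two_pass(scores, values))
print(attention_tiled(scores, values, tile=2))
```

The tiled version never stores the full row of softmax weights, which is what lets a fused kernel keep intermediates in registers or shared memory instead of writing them back to GPU memory.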
What TensorRT-LLM Adds: The LLM Runtime Layer
TensorRT handles general neural networks well. But LLMs have a problem TensorRT alone cannot address: the autoregressive decode loop means request lengths are unknown at build time, batches grow and shrink dynamically, and KV cache memory balloons as sequence length increases. TensorRT-LLM is the layer that solves those problems.
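To make "KV cache memory balloons" concrete, here is a back-of-the-envelope calculation. The layer count, KV head count, and head dimension are assumptions typical of a Llama-70B-class model with grouped-query attention; the post does not state them, so substitute the values from your model's config.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_value):
    """Total KV cache size; the factor 2 covers keys plus values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Assumed model shape (not taken from the post).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
BATCH = 64

for seq_len in (128, 2048, 8192):
    fp16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len, BATCH, 2)
    fp8 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len, BATCH, 1)
    print(f"seq_len={seq_len:5d}  FP16={fp16 / 2**30:7.2f} GiB  "
          f"FP8={fp8 / 2**30:7.2f} GiB")
```

The cache grows linearly in both sequence length and batch size, which is why it cannot be sized once at build time the way weights can.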
TensorRT-LLM's four critical additions are:
Paged KV cache. KV cache is partitioned into fixed-size blocks allocated from a pool. New tokens consume only the blocks they need, and blocks are freed when a request completes. This eliminates the fragmentation that kills GPU memory utilization when you mix short and long sequences in the same batch. The behavior mirrors paged virtual memory in an OS: the logical sequence is contiguous; the physical allocation is not.
In-flight batching (continuous batching). Rather than waiting for every request in a batch to finish before accepting new ones, TensorRT-LLM ejects a completed request from the batch mid-flight and slots in a new arrival. The batch composition changes every decode step. This keeps GPU occupancy near-constant under variable-length workloads instead of the sawtooth idle pattern of static batching.
Tensor and pipeline parallelism. Engine files can be built for TP=N to shard weight matrices column-wise across N GPUs, with NCCL all-reduce after each layer. PP=M pipelines M groups of layers across M GPU stages, reducing per-GPU memory at the cost of pipeline bubble latency. Hybrid TP+PP is common on large NVL72 pods.
Quantization integration. TensorRT-LLM ships a ModelOpt-backed quantize.py workflow that produces calibrated checkpoints, which trtllm-build then compiles to precision-specific engines with the relevant Tensor Core paths selected.
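The first two additions, paged KV cache and in-flight batching, can be sketched together as a toy scheduler. This is an illustrative simulation only, not TensorRT-LLM's implementation: the class and function names, the block size of 4 tokens, and the choice to ignore prompt tokens are all assumptions made to keep the example short.

```python
from collections import deque
from dataclasses import dataclass, field

BLOCK_TOKENS = 4  # tokens stored per KV block (illustrative value)

class BlockPool:
    """Fixed pool of KV blocks; sequences borrow and return block ids."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
    def allocate(self):
        return self.free.pop()
    def release(self, blocks):
        self.free.extend(blocks)

@dataclass
class Request:
    name: str
    tokens_to_generate: int
    generated: int = 0
    blocks: list = field(default_factory=list)

def decode_step(req, pool):
    """Generate one token; grab a new block only at a block boundary."""
    if req.generated % BLOCK_TOKENS == 0:
        req.blocks.append(pool.allocate())
    req.generated += 1
    return req.generated >= req.tokens_to_generate

def run_inflight(requests, pool, max_batch):
    """Refill free batch slots before every decode step."""
    waiting, active, steps = deque(requests), [], 0
    while waiting or active:
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        steps += 1
        still_running = []
        for req in active:
            if decode_step(req, pool):
                pool.release(req.blocks)  # blocks return to the pool at once
                req.blocks = []
            else:
                still_running.append(req)
        active = still_running
    return steps

def run_static(lengths, max_batch):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + max_batch])
               for i in range(0, len(lengths), max_batch))

lengths = [2, 8, 2, 8]
pool = BlockPool(num_blocks=8)
reqs = [Request(f"r{i}", n) for i, n in enumerate(lengths)]
print("in-flight steps:", run_inflight(reqs, pool, max_batch=2))
print("static steps:   ", run_static(lengths, max_batch=2))
```

With mixed short and long requests, the in-flight scheduler needs fewer decode steps because a finished short request frees its slot and its blocks for the next arrival instead of idling until the long one completes.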
TensorRT vs TensorRT-LLM: Scope Comparison
| Capability | TensorRT (core) | TensorRT-LLM (adds) |
|---|---|---|
| Primary input | ONNX / TRT API network graph | HuggingFace checkpoints, ModelOpt quantized checkpoints |
| Graph optimization | Yes (layer fusion, kernel profiling) | Yes + LLM-specific FMHA, GEMM+SwiGLU, Reduce-Norm fusion |
| Batching | Static batch at build time | In-flight / continuous batching with paged KV cache |
| KV cache | Not managed | Paged, block-allocated, optional FP8 KV quantization |
| Multi-GPU | Manual; not built-in | Tensor parallelism (TP), pipeline parallelism (PP), hybrid |
| Quantization | INT8 PTQ, FP8 via Q/DQ | FP8, NVFP4 (confirm availability for your release), INT4 AWQ, INT8 SmoothQuant, W4A8 AWQ |
| Serving API | C++ API, Python bindings | trtllm-serve (OpenAI-compatible), LLM Python API, Triton backend |
| GPU scope | Volta and newer | FP8/FP4 paths require Ada Lovelace / Hopper / Blackwell or newer |
Quantization: Precision Choices and Their Real Cost
Every quantization decision trades some amount of model accuracy for throughput and memory. On H100 and B200 the question is not whether to quantize, because you can rarely afford not to. The question is where on the precision spectrum your accuracy budget allows you to land.
Precision Tiers in Detail
FP8 (E4M3 / E5M2). Supported on Ada Lovelace, Hopper, and Blackwell. This is the default precision tier for NIM-packaged models on H100. The build workflow runs post-training quantization (PTQ) with a calibration dataset to derive per-tensor or per-channel scales, embeds those as Q/DQ nodes, and fuses them into adjacent GEMM kernels during engine build. FP8 KV cache quantization is an additional flag that meaningfully boosts throughput (from 3,390 to 5,300 tokens/sec in the worked example later in this post). The configuration described here was validated with 512 calibration batches at a 2048-token length. The accuracy loss on standard benchmarks at 70B scale is small but not zero; always run your task-specific evals before promoting to production.
NVFP4. A 4-bit float format introduced with Blackwell. Requires aggressive kernel fusion, including a Random Hadamard Transform and 2D scaling, to keep quantization overhead under control. NVFP4 is the precision that enables very large models to fit on fewer GPUs using Blackwell's dedicated FP4 Tensor Core paths; confirm production availability on B200 for your TensorRT-LLM release before planning around it. The throughput ceiling is higher than FP8 but so is the accuracy risk; treat it as experimental unless you have benchmark evidence on your model.
INT4 AWQ (Activation-aware Weight Quantization). Quantizes weights to INT4 (W4A16) while keeping activations in FP16, using per-channel scales found by a search over activation statistics, which still requires a small calibration set. Less sensitive to calibration quality than symmetric INT8 PTQ. W4A8 AWQ is a variant that uses 4-bit weights with 8-bit activations, supported on Hopper with FP8 hardware. Useful when your model does not have an FP8 checkpoint available from NVIDIA or the model zoo.
INT8 SmoothQuant. Migrates quantization difficulty from activations to weights using a mathematically equivalent per-channel rescaling before quantization. Works on a wider GPU range than FP8 but yields lower throughput headroom than FP8 on Hopper. Use it when you need to quantize on Ampere (A100) where FP8 Tensor Cores do not exist.
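The "mathematically equivalent per-channel rescaling" in SmoothQuant can be shown in a few lines. This is a minimal pure-Python sketch of the smoothing step only (no actual INT8 quantization), using the scale formula s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) from the SmoothQuant paper; the matrices and helper names are made up for illustration.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def smooth_scales(x, w, alpha=0.5):
    """One scale per input channel j (column of x, row of w)."""
    scales = []
    for j in range(len(w)):
        act_max = max(abs(row[j]) for row in x)
        w_max = max(abs(v) for v in w[j])
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales

def apply_smoothing(x, w, scales):
    """Divide activations by s_j and multiply the matching weight row by s_j."""
    x_s = [[v / scales[j] for j, v in enumerate(row)] for row in x]
    w_s = [[v * scales[j] for v in row] for j, row in enumerate(w)]
    return x_s, w_s

# Channel 1 of the activations is an outlier channel.
X = [[1.0, 100.0], [-2.0, 50.0]]
W = [[0.5, -1.0], [0.25, 0.5]]

scales = smooth_scales(X, W)
X_s, W_s = apply_smoothing(X, W, scales)

print("max |X| before:", max(abs(v) for row in X for v in row))
print("max |X| after: ", max(abs(v) for row in X_s for v in row))
print("outputs match: ", all(math.isclose(a, b)
                             for ra, rb in zip(matmul(X, W), matmul(X_s, W_s))
                             for a, b in zip(ra, rb)))
```

The product X·W is unchanged, but the activation outlier shrinks, so a single INT8 range wastes far fewer levels on it. The cost is a wider weight range, which weights tolerate better because they are static.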
| Format | GPU Minimum | Throughput vs FP16 | Accuracy Risk | Calibration Needed? |
|---|---|---|---|---|
| FP16 / BF16 | Volta+ | Baseline (1x) | None | No |
| INT8 SmoothQuant | Ampere (A100) | ~1.5x | Low-medium | Yes (activation stats) |
| INT4 AWQ (W4A16) | Ampere+ | ~2x (memory-bound) | Medium | Yes (AWQ search) |
| FP8 (E4M3) | Ada / Hopper / Blackwell | ~2.4x (tuned) | Low (70B+) | Yes (PTQ calib dataset) |
| W4A8 AWQ | Hopper (FP8 hw) | ~2.5-3x (unverified estimate) | Medium | Yes |
| NVFP4 | Blackwell (B200+) | Higher than FP8 (unverified) | Medium-high | Yes (complex 2D scaling) |
Gotcha
Enabling FP8 quantization on a Turing or Ampere GPU does not silently fall back to FP16. The engine build fails with an error along the lines of "unsupported data type FP8 for device capability 7.x" (or 8.0 on an A100). If your NGC container was built against a Hopper image and you deploy it to an A100 node, the NIM pod will not start. Check GPU compute capability before selecting a quantization format: Turing is 7.5, Ampere A100 is 8.0, Ada Lovelace is 8.9, and Hopper is 9.0.
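A pre-flight check for this gotcha might look like the sketch below. The minimum capabilities for FP16, INT8 SmoothQuant, INT4 AWQ, and FP8 follow the table and numbers in this post; the NVFP4 entry of 10.0 for Blackwell is an assumption, and the format keys are hypothetical labels, not TensorRT-LLM identifiers.

```python
# Minimum (major, minor) compute capability per format.
MIN_CAPABILITY = {
    "fp16": (7, 0),               # Volta+
    "int8_smoothquant": (8, 0),   # Ampere (A100)
    "int4_awq": (8, 0),           # Ampere+
    "fp8": (8, 9),                # Ada Lovelace / Hopper / Blackwell
    "nvfp4": (10, 0),             # assumption: Blackwell B200-class parts
}

def supports(fmt, capability):
    """True if a GPU with this (major, minor) capability can run the format."""
    return tuple(capability) >= MIN_CAPABILITY[fmt]

def usable_formats(capability):
    return [fmt for fmt in MIN_CAPABILITY if supports(fmt, capability)]

# On a real node the tuple could come from
# torch.cuda.get_device_capability(); hard-coded here to stay self-contained.
for name, cap in [("A100", (8, 0)), ("L40S", (8, 9)), ("H100", (9, 0))]:
    print(name, cap, usable_formats(cap))
```

Running a check like this in CI, against the capability of the production GPU rather than the build runner, catches the mismatch before the pod fails to start.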
The Real Artifact: Building an FP8 Engine with trtllm-build
The CLI workflow has two stages: quantize the checkpoint, then build the engine. Here is the sequence for a Llama-3.3-70B model on 4xH100-SXM with FP8 weights and FP8 KV cache, pulled from the TensorRT-LLM documentation:
# Stage 1: quantize to FP8 checkpoint
# Run this on an Ada / Hopper / Blackwell GPU
# --tp_size 4 shards the checkpoint for the four-GPU build in Stage 2
python examples/quantization/quantize.py \
--model_dir /path/to/Llama-3.3-70B \
--qformat fp8 \
--kv_cache_dtype fp8 \
--calib_size 512 \
--tp_size 4 \
--output_dir ./checkpoint_fp8_4gpu
# Stage 2: build the TRT-LLM engine
trtllm-build \
--checkpoint_dir ./checkpoint_fp8_4gpu \
--output_dir ./engine_fp8_4gpu \
--gemm_plugin disable \
--reduce_fusion enable \
--user_buffer enable \
--gemm_swiglu_plugin fp8 \
--tp_size 4 \
--max_batch_size 512 \
--max_num_tokens 16384
Expected output: After 20-40 minutes of kernel profiling, the engine directory contains four .engine files (one per TP rank) plus a config.json. You can verify with:
ls -lh ./engine_fp8_4gpu/
# rank0.engine rank1.engine rank2.engine rank3.engine config.json
# Quick benchmark after build:
trtllm-bench throughput \
--engine_dir ./engine_fp8_4gpu \
--dataset /path/to/bench_tokens.json
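The directory check above can also be scripted so a deploy pipeline fails early on an incomplete build. This is a small sketch that assumes the file naming shown in the listing (rankN.engine plus config.json); the helper name is hypothetical, and the naming may differ across TensorRT-LLM versions.

```python
from pathlib import Path

def missing_engine_files(engine_dir, tp_size):
    """Return the expected files that are absent from the engine directory."""
    engine_dir = Path(engine_dir)
    expected = [f"rank{rank}.engine" for rank in range(tp_size)]
    expected.append("config.json")
    return [name for name in expected if not (engine_dir / name).is_file()]

if __name__ == "__main__":
    missing = missing_engine_files("./engine_fp8_4gpu", tp_size=4)
    if missing:
        raise SystemExit(f"incomplete engine build, missing: {missing}")
    print("engine directory looks complete")
```

A single rank0.engine where four were expected usually means the checkpoint was produced for a different TP degree than the one intended.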
Common failure mode: If the accuracy eval shows a meaningful drop on a coding or math benchmark, the calibration dataset is likely mismatched. The quantize.py script defaults to cnn_dailymail summarization. A model fine-tuned for code or reasoning will have activation distributions that differ from news summaries. Re-run with a calibration set drawn from the same domain as your production traffic. The second failure mode is an unsupported layer type error during build if the model contains custom ops (rotary embeddings or attention variants) not yet in the TensorRT-LLM op library. Check the supported models list before attempting a hand-build for a new architecture.
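Why a mismatched calibration set hurts can be seen with a toy per-tensor scale calculation. This sketch assumes simple absolute-max calibration and uses 448 as the largest finite FP8 E4M3 value; the activation lists are invented, and real calibrators (including the one behind quantize.py) may use different statistics.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3

def calibrate_amax(calibration_activations):
    """Absolute-max calibration: the largest magnitude seen in the sample."""
    return max(abs(x) for x in calibration_activations)

def fp8_scale(amax):
    """Per-tensor scale that maps amax onto the top of the FP8 range."""
    return amax / FP8_E4M3_MAX

def clipped_fraction(activations, amax):
    """Share of values that exceed the calibrated range and would saturate."""
    return sum(1 for x in activations if abs(x) > amax) / len(activations)

# Invented activations: the calibration domain vs. the production domain.
news_like = [0.5, -2.0, 1.0, 1.5]
code_like = [0.5, -6.0, 3.0, 1.0]

amax = calibrate_amax(news_like)
print("scale from news-like data:", fp8_scale(amax))
print("clipped on news-like data:", clipped_fraction(news_like, amax))
print("clipped on code-like data:", clipped_fraction(code_like, amax))
```

Nothing crashes when production activations fall outside the calibrated range; they simply saturate, which is why this failure only shows up in an accuracy eval.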
Worked Example
Llama-3.3-70B on four H100-SXM-80GB GPUs with TP=4, FP8 weights + FP8 KV cache, reduce-norm fusion, user buffers, and GEMM+SwiGLU plugin enabled. Benchmark workload: ISL 128, OSL 128, 512 concurrent requests.
FP16 baseline (tuned): 2,474 tokens/sec, 147.6 ms TTFT, 14.7 ms inter-token latency.
FP8 baseline (light tuning): 3,390 tokens/sec, 96.2 ms TTFT, 12.4 ms inter-token latency.
FP8 fully tuned (KV cache + reduce fusion + user buffers + SwiGLU): 6,049 tokens/sec, 88.0 ms TTFT, 10.8 ms inter-token latency.
Adding FP8 KV cache alone pushed throughput from 3,390 to 5,300 tokens/sec. The further 14 percent gain came from reduce-norm fusion, user buffers, and GEMM+SwiGLU fusion. Memory footprint dropped from ~140 GB (FP16) to ~72 GB (FP8), a saving of roughly 68 GB, or most of one 80 GB H100's worth of memory.
Source: NVIDIA TensorRT-LLM FP8 quantization guide (nvidia.github.io/TensorRT-LLM).
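The step-by-step gains in this worked example can be recomputed directly from the throughput numbers quoted above. The dictionary keys are just labels for this sketch.

```python
# Tokens/sec figures quoted in the worked example.
throughput = {
    "fp16_tuned": 2474,
    "fp8_light": 3390,
    "fp8_kv_cache": 5300,
    "fp8_full": 6049,
}

def gain_percent(before, after):
    """Relative throughput gain of `after` over `before`, in percent."""
    return (after / before - 1) * 100

steps = [
    ("fp16_tuned", "fp8_light"),
    ("fp8_light", "fp8_kv_cache"),
    ("fp8_kv_cache", "fp8_full"),
    ("fp16_tuned", "fp8_full"),
]
for before, after in steps:
    pct = gain_percent(throughput[before], throughput[after])
    print(f"{before:>13} -> {after:<13} {pct:6.1f}%")
```

Laying the steps out this way shows that FP8 KV cache is the single largest contributor, so it is the first flag to check when a hand-built engine underperforms.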
When to Hand-Build Engines vs Let a NIM Do It
A NIM microservice ships with pre-built, NVIDIA-validated engines for supported GPU SKUs. Those engines are tested, have known accuracy numbers from NVIDIA, and cover the standard configuration (a given model, a given TP degree, FP8 or FP16). For the vast majority of production deployments, you should use the NIM-packaged engine and not rebuild from scratch.
You need to hand-build an engine when:
- You have a fine-tuned model not available in the NGC catalog.
- You need a non-standard TP degree, for example TP=8 for a 70B model when you want lower per-request latency rather than maximum throughput.
- You want to experiment with quantization settings the NIM does not expose, such as a custom calibration dataset or a non-default KV cache precision.
- You are running a GPU SKU for which no pre-built NIM engine exists (some newer Blackwell configurations).
- You need to enable speculative decoding with a custom draft model not part of the standard NIM bundle.
What to validate first before hand-building: confirm the GPU compute capability matches the chosen precision format, confirm the model architecture is in the TensorRT-LLM supported models list, run the engine builder inside the official NGC container for your TensorRT-LLM version (do not mix container versions and checkpoint versions), and always run an accuracy eval before declaring the engine production-ready.
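The last item, the accuracy eval, is easiest to enforce as an automated promotion gate. Below is a minimal sketch under stated assumptions: the task names, scores, and the 1 percent tolerance are all hypothetical, and the scores would come from whatever eval harness you already run.

```python
def accuracy_regressions(baseline, candidate, max_relative_drop=0.01):
    """Return tasks where the candidate engine falls too far below baseline.

    A task missing from the candidate results counts as a regression.
    """
    failures = {}
    for task, base_score in baseline.items():
        cand_score = candidate.get(task)
        if cand_score is None:
            failures[task] = "no result for candidate engine"
        elif cand_score < base_score * (1 - max_relative_drop):
            drop = (1 - cand_score / base_score) * 100
            failures[task] = f"dropped {drop:.1f}% ({base_score} -> {cand_score})"
    return failures

# Hypothetical scores: FP16 reference vs. a freshly built FP8 engine.
fp16_scores = {"summarization": 0.412, "code": 0.655, "math": 0.580}
fp8_scores = {"summarization": 0.410, "code": 0.590, "math": 0.578}

failures = accuracy_regressions(fp16_scores, fp8_scores)
for task, reason in failures.items():
    print(f"BLOCK PROMOTION: {task}: {reason}")
```

Gating per task rather than on an average matters here, because a calibration mismatch can leave summarization scores intact while a code benchmark drops sharply.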
The Verdict
TensorRT and TensorRT-LLM are not optional components you can bypass for convenience. They are the reason you get 6,000 tokens per second out of a four-GPU node instead of 2,400. The optimization pass is not magic; it is a deterministic compilation step with predictable failure modes. The right mental model is: build time is your compiler, run time is your runtime, and the engine is the binary you ship. Treat engine files as release artifacts with version pinning, GPU-target tagging, and accuracy regression tests before promotion.
My recommendation: start with FP8 on Hopper or Blackwell with the NIM-packaged engine. Validate accuracy on your task before locking in precision. If the NIM engine does not meet your latency SLA, hand-build with FP8 KV cache, reduce-norm fusion, and user buffers enabled, benchmark with trtllm-bench at your production ISL/OSL mix, and only then evaluate whether NVFP4 or INT4 AWQ is worth the additional accuracy risk.
When NOT to use hand-built engines: if you do not have a repeatable accuracy evaluation pipeline, do not hand-build in production. The silent accuracy collapse from mismatched calibration data is harder to detect than a crash, and the consequences are harder to explain.
Next: Part 19 covers Triton Inference Server vs NIM, the serving-layer decision above the engine. If this post clarified where the performance budget actually lives, leave a comment with the GPU SKU and model you are running; the answers are surprisingly SKU-specific.
References
- NVIDIA TensorRT-LLM: FP8 Quantization Performance Guide
- NVIDIA TensorRT-Cloud: Building a TensorRT-LLM Engine
- NVIDIA Technical Blog: In-Flight Batching in TensorRT-LLM
- NVIDIA TensorRT: Optimizing TensorRT Performance
- NVIDIA Technical Blog: Optimizing Inference on LLMs with TensorRT-LLM



