Dr. Pranay Jha


TensorRT and TensorRT-LLM: Optimization, Quantization, and Engine Building (NVIDIA AI Series, Part 18)

What TensorRT does at build time versus what TensorRT-LLM adds at runtime — kernel fusion, paged KV cache, in-flight batching, and quantization choices from FP8 to NVFP4 — and when to hand-build engines instead of relying on a NIM.

NVIDIA AI Series · Part 18 of 30
TL;DR: TensorRT is a GPU inference compiler for general neural networks. TensorRT-LLM wraps it with LLM-specific runtime machinery: paged KV cache, in-flight batching, tensor and pipeline parallelism, and a quantization toolkit covering FP8, NVFP4, INT4 AWQ, and INT8 SmoothQuant. Together they sit underneath every NIM microservice. Understand what happens at engine build time versus request time, pick the right precision tier for your accuracy budget, and you will avoid the most common production failures: unsupported-layer crashes on Turing or older GPUs, silent accuracy collapse from aggressive INT4 quantization, and throughput ceilings from under-configured in-flight batching.
Who this is for: AI infrastructure architects and platform engineers who deploy NIM microservices or Triton backends on Hopper or Blackwell clusters, and need to understand what the inference stack is actually doing so they can diagnose latency regressions, memory overflows, or accuracy failures before escalating to NVIDIA support.

The quantize.py script exits cleanly. The trtllm-build command finishes. The engine loads. Throughput is 30 percent below projections and the first accuracy check on a coding task fails hard. That sequence happens often enough that it has a name in my practice: the quiet failure of the optimization pass. The cause is almost always one of three things: the engineer picked a precision format the GPU does not support at full speed, or calibrated on the wrong dataset, or never checked whether TensorRT-LLM was actually doing in-flight batching. This post breaks down the layers beneath a NIM so you can stop debugging in the dark.

What TensorRT Does: Build-Time Compilation, Not a Runtime Wrapper

TensorRT is a model compiler. You give it a network graph (ONNX, or a TensorRT API definition), a target GPU, and a precision policy. It produces a binary engine file that is tied to that GPU architecture, driver version, and input shape range. The engine is not portable. A Hopper H100 engine will not run on an Ampere A100, and vice versa. That portability boundary bites teams that build engines on a CI runner with a different GPU than production.
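One way to make that portability boundary visible is to store a small manifest next to each engine and check it before loading. The sketch below is a minimal illustration, not a TensorRT-LLM feature: `EngineManifest`, `load_problems`, and every field name are hypothetical, and the version strings in the usage note are placeholders.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EngineManifest:
    """Hypothetical sidecar metadata written at build time (not a TensorRT-LLM file)."""
    model: str
    compute_capability: Tuple[int, int]  # (major, minor) of the build GPU
    toolkit_version: str                 # version string of the build container
    tp_size: int                         # number of ranks the engine was built for


def load_problems(manifest: EngineManifest,
                  runtime_cc: Tuple[int, int],
                  runtime_version: str,
                  visible_gpus: int) -> List[str]:
    """Return the reasons this engine should not be loaded; an empty list means OK."""
    problems = []
    if manifest.compute_capability != runtime_cc:
        problems.append(
            f"built for compute capability {manifest.compute_capability}, "
            f"runtime GPU is {runtime_cc}"
        )
    if manifest.toolkit_version != runtime_version:
        problems.append(
            f"built with {manifest.toolkit_version}, runtime has {runtime_version}"
        )
    if visible_gpus < manifest.tp_size:
        problems.append(
            f"engine needs {manifest.tp_size} GPUs, only {visible_gpus} visible"
        )
    return problems
```

For example, a manifest recorded on an H100 build runner (compute capability 9.0) checked against an A100 node (8.0) returns one problem instead of failing later inside the pod.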

Three things happen during the build phase. First, graph optimization: TensorRT fuses adjacent operations, eliminates redundant casts, and simplifies constants. Second, kernel selection: for each fused subgraph it profiles candidate CUDA kernels against the target GPU and picks the fastest one for the declared input shape range. Third, precision calibration: if you request INT8 or FP8, TensorRT either reads per-tensor scales from Q/DQ nodes embedded in the graph, or runs a calibration dataset through the network to derive dynamic ranges.
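To make the calibration step concrete, here is a minimal sketch of the simplest possible calibrator: take the largest absolute activation seen in the calibration data and map it to the top of the FP8 E4M3 range (448). This is an assumption-laden illustration, not TensorRT's calibrator; the function names are hypothetical and mantissa rounding is deliberately omitted.

```python
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def per_tensor_scale(calibration_batches):
    """Max calibration: one scale for the whole tensor, derived from observed amax."""
    amax = max(abs(v) for batch in calibration_batches for v in batch)
    return amax / E4M3_MAX


def fake_quantize(value, scale):
    """Scale into the E4M3 range, clamp, and scale back.

    Mantissa rounding is omitted, so this only shows the clipping behaviour.
    """
    q = max(-E4M3_MAX, min(E4M3_MAX, value / scale))
    return q * scale
```

The design point worth noticing is the clamp: any production activation larger than the calibration maximum is clipped to that maximum, which is one reason a calibration set from the wrong domain can damage accuracy without producing any error.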

The kernel selection step is why build times are long. A 70B parameter model on four H100s can take 20 to 40 minutes to build. That cost is paid once; the resulting engine launches in seconds and runs at near-peak Tensor Core utilization because every op already has its kernel selected and fused.

[Figure 1 diagram: model graph (ONNX / TRT API) → graph optimization (layer fusion, cast elimination) → kernel selection (profile and pick per op) → precision calibration (Q/DQ fusion, scales) → binary engine (GPU-locked). Build time is 20-40 minutes for 70B on 4xH100. The engine is architecture-locked: do not copy a Hopper engine to Blackwell; rebuild for each GPU generation.]
Figure 1. TensorRT build-time pipeline. The binary engine is architecture-locked to the GPU it was compiled for.

Kernel Fusion: What It Actually Looks Like

A transformer attention block in an unfused graph runs as five or more separate CUDA kernel launches: a QKV projection GEMM, a softmax, a context GEMM, a layer norm, and a residual add. Each kernel launch has overhead, and each intermediate result makes a round trip through GPU global memory. Fused, that sequence collapses into one or two kernels that keep intermediate results in registers or shared memory. For multi-head attention, TensorRT-LLM ships a custom Flash MHA (FMHA) kernel that handles the entire QKV + softmax + context attention in one pass with tiling. For gated-MLP blocks on Hopper, it adds GEMM + SwiGLU fusion that merges two matmul ops and one SwiGLU activation into a single dispatch. The throughput difference is not marginal: on Llama-3.3-70B across four H100-SXM-80GB GPUs, NVIDIA’s own benchmarks show the full suite of FP8 fusions yielding 6,049 tokens/sec versus 2,474 tokens/sec for tuned FP16 without those fusions, a 144 percent gain.
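The "one pass with tiling" idea can be shown on the CPU. The sketch below computes single-query attention two ways: the naive way, which materializes every score before the softmax, and a tiled way that keeps only a running max, a running sum, and an accumulator (the online-softmax trick that flash-attention-style kernels rely on). It is an illustration of the technique, not the TensorRT-LLM FMHA kernel; the 1/sqrt(d) scaling and multi-head layout are left out, and all names are mine.

```python
import math


def attention_naive(q, keys, values):
    """Materialize all scores, then softmax, then the weighted sum of values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_tiled(q, keys, values, tile=2):
    """Process keys/values tile by tile, never holding the full score vector."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max (factor is 0.0 on the first tile).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same vector up to floating-point error. On a GPU, the tiled form is what lets the intermediates stay in shared memory and registers instead of being written out between kernels.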

[Figure 2 diagram: before fusion, the attention block runs QKV GEMM, softmax, context GEMM, LayerNorm, and residual add as 5 kernel launches with multiple global-memory round trips; after fusion, one fused FMHA kernel keeps intermediates in shared memory and registers. FP8 GEMM + SwiGLU fusion on Hopper further merges 2 matmuls and the activation into 1 kernel. Result: 144% throughput gain on Llama-3.3-70B (FP8 tuned vs FP16 tuned, 4xH100-SXM).]
Figure 2. Kernel fusion collapses five kernel launches into one fused FMHA kernel with shared-memory intermediates.

What TensorRT-LLM Adds: The LLM Runtime Layer

TensorRT handles general neural networks well. But LLMs have a problem TensorRT alone cannot address: the autoregressive decode loop means request lengths are unknown at build time, batches grow and shrink dynamically, and KV cache memory balloons as sequence length increases. TensorRT-LLM is the layer that solves those problems.

Its four critical additions are:

Paged KV cache. KV cache is partitioned into fixed-size blocks allocated from a pool. New tokens consume only the blocks they need, and blocks are freed when a request completes. This eliminates the fragmentation that kills GPU memory utilization when you mix short and long sequences in the same batch. The behavior mirrors paged virtual memory in an OS: the logical sequence is contiguous; the physical allocation is not.
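The block-pool behavior described above can be sketched in a few lines. This is a toy allocator under my own assumptions (class and method names are hypothetical, and it tracks block IDs only, not tensors); it is not how TensorRT-LLM implements its cache manager.

```python
class PagedKVCache:
    """Toy block allocator: logical sequences map to non-contiguous physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> number of tokens stored

    def append_token(self, request_id):
        """Reserve room for one more token, taking a new block only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:  # all owned blocks are full
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def release(self, request_id):
        """Return every block owned by a finished request to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)
```

With a block size of 16, a request that has produced 17 tokens owns exactly two blocks, and a one-token request owns one. Nothing is reserved up front for the maximum sequence length, which is where the fragmentation savings come from.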

In-flight batching (continuous batching). Rather than waiting for every request in a batch to finish before accepting new ones, TensorRT-LLM ejects a completed request from the batch mid-flight and slots in a new arrival. The batch composition changes every decode step. This keeps GPU occupancy near-constant under variable-length workloads instead of the sawtooth idle pattern of static batching.
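A small simulation makes the difference from static batching measurable. The sketch assumes each request is just a count of decode steps it needs (at least one), ignores prefill, and counts how many decode steps the GPU runs in total; the function names are mine, not a TensorRT-LLM API.

```python
def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes before the next batch starts."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def inflight_batching_steps(output_lengths, batch_size):
    """Finished requests leave and queued requests join at every decode step."""
    queue = list(output_lengths)
    active = []
    steps = 0
    while queue or active:
        # Fill any free slots from the queue before running the step.
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1
        # Every active request emits one token; drop the ones that are now done.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps
```

For four requests needing 8, 2, 2, and 8 decode steps with two batch slots, the static version needs 16 steps while the in-flight version needs 12, because the short requests stop holding a slot hostage.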

Tensor and pipeline parallelism. Engine files can be built for TP=N to shard weight matrices column-wise across N GPUs, with NCCL all-reduce after each layer. PP=M pipelines M groups of layers across M GPU stages, reducing per-GPU memory at the cost of pipeline bubble latency. Hybrid TP+PP is common on large NVL72 pods.
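The sharding arithmetic behind tensor parallelism can be checked with plain lists. The sketch assumes the common Megatron-style pairing for an MLP block, where the first weight matrix is split by columns, the second by rows, and a single sum across ranks (the all-reduce) recovers the full output. TensorRT-LLM's actual per-layer sharding may differ; ReLU stands in for the real activation and all helper names are hypothetical.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def split_columns(m, parts):
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]


def split_rows(m, parts):
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]


def all_reduce_sum(partials):
    """Element-wise sum of one partial result per rank (stand-in for NCCL all-reduce)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def mlp_reference(x, w_up, w_down):
    return matmul(relu(matmul(x, w_up)), w_down)


def mlp_tensor_parallel(x, w_up, w_down, tp_size):
    up_shards = split_columns(w_up, tp_size)   # each rank holds a column slice
    down_shards = split_rows(w_down, tp_size)  # and the matching row slice
    partials = [matmul(relu(matmul(x, up)), down)
                for up, down in zip(up_shards, down_shards)]
    return all_reduce_sum(partials)
```

Each simulated rank holds 1/tp_size of both matrices, and the element-wise activation needs no communication because a column split keeps every hidden unit whole on one rank.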

Quantization integration. TensorRT-LLM ships a ModelOpt-backed quantize.py workflow that produces calibrated checkpoints, which trtllm-build then compiles to precision-specific engines with the relevant Tensor Core paths selected.

[Figure 3 diagram: in-flight batching timeline over decode steps T1-T7. Requests A and B are generating; when A finishes, request C joins mid-batch. There is no idle gap: the active batch size stays constant as requests complete and new ones enter.]
Figure 3. In-flight batching ejects completed requests and slots in new arrivals per decode step, keeping GPU occupancy constant.

TensorRT vs TensorRT-LLM: Scope Comparison

| Capability | TensorRT (core) | TensorRT-LLM (adds) |
|---|---|---|
| Primary input | ONNX / TRT API network graph | HuggingFace checkpoints, ModelOpt quantized checkpoints |
| Graph optimization | Yes (layer fusion, kernel profiling) | Yes, plus LLM-specific FMHA, GEMM+SwiGLU, Reduce-Norm fusion |
| Batching | Static batch at build time | In-flight / continuous batching with paged KV cache |
| KV cache | Not managed | Paged, block-allocated, optional FP8 KV quantization |
| Multi-GPU | Manual; not built-in | Tensor parallelism (TP), pipeline parallelism (PP), hybrid |
| Quantization | INT8 PTQ, FP8 via Q/DQ | FP8, NVFP4 (confirm GA status for your release), INT4 AWQ, INT8 SmoothQuant, W4A8 AWQ |
| Serving API | C++ API, Python bindings | trtllm-serve (OpenAI-compatible), LLM Python API, Triton backend |
| GPU scope | Volta and newer | FP8 paths require Ada Lovelace, Hopper, or Blackwell; FP4 paths require Blackwell |

Quantization: Precision Choices and Their Real Cost

Every quantization decision trades some model accuracy for throughput and memory. On H100 and B200 the question is not whether to quantize, because you can rarely afford not to; it is where on the precision spectrum your accuracy budget lets you land.

[Figure 4 diagram: quantization precision spectrum, ordered FP16 / BF16, FP8, INT8, NVFP4, INT4 AWQ, with throughput rising and accuracy falling toward the INT4 end. FP8 is the sweet spot for Hopper/Blackwell: ~144% throughput gain with minimal accuracy loss at 70B scale.]
Figure 4. Precision spectrum from FP16 baseline to INT4. Throughput rises left to right; accuracy risk rises with it.

Precision Tiers in Detail

FP8 (E4M3 / E5M2). Supported on Ada Lovelace, Hopper, and Blackwell. This is the default precision tier for NIM-packaged models on H100. The build workflow runs post-training quantization (PTQ) with a calibration dataset to derive per-tensor or per-channel scales, embeds those as Q/DQ nodes, and fuses them into adjacent GEMM kernels during engine build. FP8 KV cache quantization is an additional flag that meaningfully boosts throughput; in the worked example later in this post it accounts for the jump from 3,390 to 5,300 tokens/sec. The recipe was validated with 512 calibration batches at a 2048-token length. The accuracy loss on standard benchmarks at 70B scale is small but not zero; always run your task-specific evals before promoting to production.

NVFP4. A 4-bit float format introduced with Blackwell. It requires aggressive kernel fusion, including a Random Hadamard Transform and 2D scaling, to keep quantization overhead under control. NVFP4 is the precision that lets very large models fit on fewer GPUs by using Blackwell's dedicated FP4 Tensor Core paths; confirm production availability on B200 for your TensorRT-LLM release before planning around it. The throughput ceiling is higher than FP8 but so is the accuracy risk; treat it as experimental unless you have benchmark evidence on your model.

INT4 AWQ (Activation-aware Weight Quantization). Quantizes weights to 4 bits while keeping activations in FP16 (W4A16), using per-channel scaling derived from activation statistics gathered on a small calibration set. It is less sensitive to calibration quality than symmetric INT8 PTQ. W4A8 AWQ is a variant that pairs 4-bit weights with 8-bit activations and is supported on Hopper with FP8 hardware. Useful when your model does not have an FP8 checkpoint available from NVIDIA or the model zoo.

INT8 SmoothQuant. Migrates quantization difficulty from activations to weights using a mathematically equivalent per-channel rescaling before quantization. Works on a wider GPU range than FP8 but yields lower throughput headroom than FP8 on Hopper. Use it when you need to quantize on Ampere (A100) where FP8 Tensor Cores do not exist.
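The "mathematically equivalent rescaling" is easy to verify numerically. The sketch below uses the per-channel smoothing factor from the SmoothQuant paper, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), divides activations by it, multiplies the matching weight rows by it, and leaves the product unchanged. It is a toy on nested lists with hypothetical function names, not the ModelOpt implementation, and it assumes no channel is all zeros.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def smoothing_factors(x, w, alpha=0.5):
    """One factor per input channel j, from activation and weight maxima."""
    factors = []
    for j in range(len(w)):
        act_max = max(abs(row[j]) for row in x)
        weight_max = max(abs(v) for v in w[j])
        factors.append(act_max ** alpha / weight_max ** (1 - alpha))
    return factors


def smooth(x, w, s):
    """Divide activation channel j by s[j] and multiply weight row j by s[j]."""
    x_s = [[v / s[j] for j, v in enumerate(row)] for row in x]
    w_s = [[v * s[j] for v in w[j]] for j in range(len(w))]
    return x_s, w_s
```

If one activation channel peaks at 50 while another peaks at 0.2, smoothing with alpha = 0.5 pulls the outlier channel down to a peak of 1.0 and pushes the corresponding weight row up by the same factor. The output of the layer is the same, but both tensors now have ranges that INT8 can represent with less error.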

| Format | GPU Minimum | Throughput vs FP16 | Accuracy Risk | Calibration Needed? |
|---|---|---|---|---|
| FP16 / BF16 | Volta+ | Baseline (1x) | None | No |
| INT8 SmoothQuant | Ampere (A100) | ~1.5x | Low-medium | Yes (activation stats) |
| INT4 AWQ (W4A16) | Ampere+ | ~2x (memory-bound) | Medium | Yes (AWQ search) |
| FP8 (E4M3) | Ada / Hopper / Blackwell | ~2.4x (tuned) | Low (70B+) | Yes (PTQ calib dataset) |
| W4A8 AWQ | Hopper (FP8 hw) | ~2.5-3x (unverified estimate) | Medium | Yes |
| NVFP4 | Blackwell (B200+) | Higher than FP8 (unverified) | Medium-high | Yes (complex 2D scaling) |
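The memory side of the table is simple arithmetic on bits per weight. The sketch below is a rough estimate under stated assumptions: it counts weights only, ignores quantization scales, KV cache, activations, and runtime buffers, and assumes tensor parallelism splits weights evenly. The dictionary keys and function name are hypothetical.

```python
BITS_PER_WEIGHT = {
    "fp16": 16,
    "int8_smoothquant": 8,
    "fp8": 8,
    "int4_awq": 4,
    "nvfp4": 4,  # block scale overhead is ignored in this estimate
}


def weight_footprint_gb(num_params, fmt, tp_size=1):
    """Approximate weight memory per GPU in GB (10^9 bytes), weights only."""
    total_bytes = num_params * BITS_PER_WEIGHT[fmt] / 8
    return total_bytes / 1e9 / tp_size
```

For a 70B-parameter model this gives about 140 GB in FP16 and 70 GB in FP8 before overheads, or 35 GB and 17.5 GB per GPU at TP=4. The gap between that estimate and the footprint you actually measure is scales, KV cache, and workspace.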

Gotcha

Enabling FP8 quantization on a Turing or Ampere GPU does not silently fall back to FP16. The engine build fails with an error like unsupported data type FP8 for device capability 7.x. If your NGC container was built against a Hopper image and you deploy it to an A100 node, the NIM pod will not start. Check GPU compute capability before selecting a quantization format: Turing is 7.5, Ampere (A100) is 8.0, Ada Lovelace is 8.9, and Hopper is 9.0.
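That check is cheap to automate as a preflight step. The sketch below encodes the minimums from this post's format table as (major, minor) tuples and compares them against the node's compute capability. The mapping and function names are hypothetical, the 10.0 value for B200 is my reading of NVIDIA's published compute capability list, and the table itself should be re-verified against the support matrix for your TensorRT-LLM release.

```python
# Minimum compute capability per format, transcribed from the table in this post.
MIN_COMPUTE_CAPABILITY = {
    "fp16": (7, 0),              # Volta and newer
    "int8_smoothquant": (8, 0),  # Ampere (A100)
    "int4_awq": (8, 0),          # Ampere and newer
    "fp8": (8, 9),               # Ada Lovelace, Hopper, Blackwell
    "w4a8_awq": (9, 0),          # Hopper with FP8 hardware
    "nvfp4": (10, 0),            # Blackwell (B200), assumed to report 10.0
}


def supports(fmt, compute_capability):
    """True if a GPU with this (major, minor) capability meets the format's minimum."""
    return tuple(compute_capability) >= MIN_COMPUTE_CAPABILITY[fmt]


def usable_formats(compute_capability):
    """All formats from the table that this GPU can run, in table order."""
    return [fmt for fmt in MIN_COMPUTE_CAPABILITY if supports(fmt, compute_capability)]
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability()` returns the (major, minor) tuple to pass in. An A100 at (8, 0) fails the FP8 check here instead of failing twenty minutes into a build.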

The Real Artifact: Building an FP8 Engine with trtllm-build

The CLI workflow has two stages: quantize the checkpoint, then build the engine. Here is the sequence for a Llama-3.3-70B model on 4xH100-SXM with FP8 weights and FP8 KV cache, pulled from the TensorRT-LLM documentation:

# Stage 1: quantize to FP8 checkpoint
# Run this on an Ada / Hopper / Blackwell GPU
python examples/quantization/quantize.py \
  --model_dir /path/to/Llama-3.3-70B \
  --qformat fp8 \
  --kv_cache_dtype fp8 \
  --calib_size 512 \
  --output_dir ./checkpoint_fp8_4gpu

# Stage 2: build the TRT-LLM engine
trtllm-build \
  --checkpoint_dir ./checkpoint_fp8_4gpu \
  --output_dir ./engine_fp8_4gpu \
  --gemm_plugin disable \
  --reduce_fusion enable \
  --user_buffer enable \
  --gemm_swiglu_plugin fp8 \
  --tp_size 4 \
  --max_batch_size 512 \
  --max_num_tokens 16384

Expected output: After 20-40 minutes of kernel profiling, the engine directory contains four .engine files (one per TP rank) plus a config.json. You can verify with:

ls -lh ./engine_fp8_4gpu/
# rank0.engine  rank1.engine  rank2.engine  rank3.engine  config.json

# Quick benchmark after build:
trtllm-bench throughput \
  --engine_dir ./engine_fp8_4gpu \
  --dataset /path/to/bench_tokens.json

Common failure mode: If the accuracy eval shows a meaningful drop on a coding or math benchmark, the calibration dataset is likely mismatched. The quantize.py script defaults to cnn_dailymail summarization. A model fine-tuned for code or reasoning will have activation distributions that differ from news summaries. Re-run with a calibration set drawn from the same domain as your production traffic. The second failure mode is an unsupported layer type error during build if the model contains custom ops (rotary embeddings or attention variants) not yet in the TensorRT-LLM op library. Check the supported models list before attempting a hand-build for a new architecture.

Worked Example

Llama-3.3-70B on four H100-SXM-80GB GPUs with TP=4, FP8 weights + FP8 KV cache, reduce-norm fusion, user buffers, and GEMM+SwiGLU plugin enabled. Benchmark workload: ISL 128, OSL 128, 512 concurrent requests.

FP16 baseline (tuned): 2,474 tokens/sec, 147.6 ms TTFT, 14.7 ms inter-token latency.
FP8 baseline (light tuning): 3,390 tokens/sec, 96.2 ms TTFT, 12.4 ms inter-token latency.
FP8 fully tuned (KV cache + reduce fusion + user buffers + SwiGLU): 6,049 tokens/sec, 88.0 ms TTFT, 10.8 ms inter-token latency.

Adding FP8 KV cache alone pushed throughput from 3,390 to 5,300 tokens/sec. The further 14 percent gain, from 5,300 to 6,049 tokens/sec, came from reduce-norm fusion, user buffers, and GEMM+SwiGLU fusion. Memory footprint dropped from ~140 GB (FP16) to ~72 GB (FP8), which frees roughly 68 GB, a little under one H100-80GB worth of HBM.
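The percentages quoted in this example follow directly from the throughput figures, and it is worth recomputing them whenever you rerun the benchmark on your own hardware. A minimal helper, using only the numbers listed above (the variable names are mine):

```python
def percent_gain(new, old):
    """Relative improvement of new over old, in percent."""
    return (new / old - 1.0) * 100.0


# Tokens/sec figures from the worked example above.
FP16_TUNED = 2474
FP8_LIGHT = 3390
FP8_KV_CACHE = 5300
FP8_FULL = 6049

gains = {
    "fp8_full_vs_fp16_tuned": percent_gain(FP8_FULL, FP16_TUNED),
    "fp8_kv_cache_vs_fp8_light": percent_gain(FP8_KV_CACHE, FP8_LIGHT),
    "fp8_full_vs_fp8_kv_cache": percent_gain(FP8_FULL, FP8_KV_CACHE),
}
```

The first entry is the headline gain of roughly 144 percent, the second is the FP8 KV cache step at roughly 56 percent, and the third is the final tuning step at roughly 14 percent. Stacking the steps multiplies the ratios rather than adding the percentages.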

Source: NVIDIA TensorRT-LLM FP8 quantization guide (nvidia.github.io/TensorRT-LLM).

When to Hand-Build Engines vs Let a NIM Do It

A NIM microservice ships with pre-built, NVIDIA-validated engines for supported GPU SKUs. Those engines are tested, have known accuracy numbers from NVIDIA, and cover the standard configuration (a given model, a given TP degree, FP8 or FP16). For the vast majority of production deployments, you should use the NIM-packaged engine and not rebuild from scratch.

You need to hand-build an engine when:

  • You have a fine-tuned model not available in the NGC catalog.
  • You need a non-standard TP degree, for example TP=8 for a 70B model when you want lower per-request latency rather than maximum throughput.
  • You want to experiment with quantization settings the NIM does not expose, such as a custom calibration dataset or a non-default KV cache precision.
  • You are running a GPU SKU for which no pre-built NIM engine exists (some newer Blackwell configurations).
  • You need to enable speculative decoding with a custom draft model not part of the standard NIM bundle.

What to validate first before hand-building: confirm the GPU compute capability matches the chosen precision format, confirm the model architecture is in the TensorRT-LLM supported models list, run the engine builder inside the official NGC container for your TensorRT-LLM version (do not mix container versions and checkpoint versions), and always run an accuracy eval before declaring the engine production-ready.

In practice: I have seen teams spend a week hand-building and tuning an FP8 engine for a 7B model that was already available as a NIM, shaving 8 percent from latency versus the NIM default. Whether that is worth the maintenance burden depends entirely on your SLA tightness. If you are at the scale where 8 percent of latency at p99 has real cost consequences, then yes, go hand-build and maintain your own engine versioning. If you are running a development environment or a moderate-traffic service, take the NIM engine, profile it, and invest the saved time in your application layer.

The Verdict

TensorRT and TensorRT-LLM are not optional components you can bypass for convenience. They are the reason you get 6,000 tokens per second out of a four-GPU node instead of 2,400. The optimization pass is not magic; it is a deterministic compilation step with predictable failure modes. The right mental model is: build time is your compiler, run time is your runtime, and the engine is the binary you ship. Treat engine files as release artifacts with version pinning, GPU-target tagging, and accuracy regression tests before promotion.

My recommendation: start with FP8 on Hopper or Blackwell with the NIM-packaged engine. Validate accuracy on your task before locking in precision. If the NIM engine does not meet your latency SLA, hand-build with FP8 KV cache, reduce-norm fusion, and user buffers enabled, benchmark with trtllm-bench at your production ISL/OSL mix, and only then evaluate whether NVFP4 or INT4 AWQ is worth the additional accuracy risk.

When NOT to use hand-built engines: if you do not have a repeatable accuracy evaluation pipeline, do not hand-build in production. The silent accuracy collapse from mismatched calibration data is harder to detect than a crash, and the consequences are harder to explain.
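A repeatable pipeline does not need to be elaborate to be useful. The sketch below is one minimal shape for a promotion gate: compare per-task scores from the baseline engine and the candidate engine, and block promotion if any task drops by more than an agreed tolerance. Everything here is hypothetical, including the function name and the idea of a single absolute tolerance; how the scores are produced is up to your eval harness.

```python
def regression_gate(baseline_scores, candidate_scores, max_drop):
    """Return {task: drop} for every task that regressed by more than max_drop.

    Scores are in the same units for both engines (for example accuracy in
    percentage points). A task missing from the candidate counts as a failure.
    """
    failures = {}
    for task, base in baseline_scores.items():
        if task not in candidate_scores:
            failures[task] = base
            continue
        drop = base - candidate_scores[task]
        if drop > max_drop:
            failures[task] = drop
    return failures
```

As a usage illustration with made-up placeholder numbers: a baseline of `{"code": 70, "summarization": 80}` against a candidate of `{"code": 61, "summarization": 79}` with `max_drop=2` flags only the coding task. That is the pattern to expect when a code model was calibrated on news summaries.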

Next: Part 19 covers Triton Inference Server vs NIM, the serving-layer decision above the engine. If this post clarified where the performance budget actually lives, leave a comment with the GPU SKU and model you are running; the answers are surprisingly SKU-specific.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.