Every team I talk to has the same complaint six months after their first NIM deployment: "We thought the GPU was the expensive part. Turns out we just did not know how to use it." They are running $3-per-hour H100s at 8% utilization and wondering why the cost-per-token math does not work. The answer is almost always the same: batch size of one, no continuous batching, no quantization, and a KV cache that is half-empty. This post fixes that.
The Two Latency Numbers That Actually Matter
LLM inference has two distinct latency components, and they respond to different pressure points. Conflating them is the most common mistake I see in SLO design.
TTFT — Time to First Token: The wall-clock time from the moment your API call lands to when the first output token is returned. TTFT is dominated by the prefill phase: the model processes the entire input prompt in one forward pass, filling the KV cache. Prefill is compute-bound (lots of matrix multiplications in parallel). A 4,096-token system prompt on an H100 takes materially longer to prefill than a 128-token prompt. For interactive applications — chatbots, coding assistants, anything where a human is watching — TTFT above 2-3 seconds is where users start abandoning sessions.
ITL — Inter-Token Latency (also called TPOT, time per output token): The average time between successive output tokens once generation starts. ITL is bandwidth-bound: each decode step reads the full KV cache for all active tokens from HBM, does a relatively small amount of math, and emits one token per request in the batch. For real-time streaming (voice, live typing), ITL needs to stay below ~30-50ms per token to feel fluid. For batch document generation, you care much less about ITL and much more about total throughput.
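To make the two definitions concrete, here is a minimal sketch that derives TTFT and mean ITL from the arrival timestamps of a single streamed response. The function name and the sample trace are hypothetical; real tooling reports percentiles over many requests, not one trace.

```python
from statistics import mean


def latency_metrics(request_sent_s, token_times_s):
    """Return (ttft_ms, mean_itl_ms) for one streamed response.

    request_sent_s: time the request was sent, in seconds.
    token_times_s:  arrival time of each output token, in seconds, in order.
    """
    if not token_times_s:
        raise ValueError("need at least one output token")
    # TTFT: request sent -> first output token received.
    ttft_ms = (token_times_s[0] - request_sent_s) * 1000.0
    # ITL: average gap between consecutive output tokens.
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    itl_ms = mean(gaps) * 1000.0 if gaps else 0.0
    return ttft_ms, itl_ms


# Hypothetical trace: first token after 0.8 s, then one token every 25 ms.
ttft_ms, itl_ms = latency_metrics(0.0, [0.8, 0.825, 0.85, 0.875])
```

Measuring the two numbers separately matters because a long prompt inflates only the first, while a crowded decode batch inflates only the second.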
The Trade-off: Throughput vs Latency
Throughput (tokens/second across all users) and per-request latency pull in opposite directions. To maximize throughput, you want large batches — many requests processed simultaneously, amortizing the per-layer weight reads across as many tokens as possible. But each additional request in a batch adds memory pressure (more KV cache), and during prefill, batching multiple long prompts together increases TTFT for requests that land mid-batch. There is no free lunch. The operating point depends on your workload mix and your SLOs.
In practice: if you are serving a chatbot where humans read the output in real time, you probably need TTFT under 1.5s and ITL under 40ms. If you are running document summarization pipelines overnight, you can saturate the GPU with massive batches and TTFT of 20 seconds is irrelevant — you care only about total job throughput and cost per million output tokens.
Metric Definitions You Need at Your Fingertips
| Metric | Full Name | What It Measures | Why It Matters |
|---|---|---|---|
| TTFT | Time to First Token | ms from request arrival to first output token | Interactive UX; chatbot responsiveness |
| ITL / TPOT | Inter-Token Latency / Time Per Output Token | ms between successive output tokens | Streaming smoothness; voice app floor |
| TPS | Tokens Per Second | Total output tokens/sec across all requests | System throughput; cost denominator |
| MBU | Memory Bandwidth Utilization | % of peak HBM bandwidth consumed | True utilization; decode efficiency ceiling |
| GPU Util % | SM Activity | % of SMs active during inference window | Often misleading during decode; use MBU instead |
| KV Cache Hit Rate | Prefix Cache Reuse | % of prefill work skipped via cached prefixes | TTFT reduction for repeated system prompts |
| $/1M tokens | Cost per Million Output Tokens | GPU-hour cost / (TPS x 3600) x 1,000,000 | The unit economics number your CFO will ask about |
GPU Memory and the KV Cache: The Real Constraint
Before you can understand cost, you need to understand where GPU memory goes. On an H100 SXM with 80GB HBM3, the memory budget splits roughly as follows for a Llama-3 70B model at FP16:
- Model weights: ~140GB at FP16 — this is why Llama 70B requires at least 2x H100s for FP16, or a single H100/H200 with FP8 quantization bringing it to ~70GB.
- Activations and framework overhead: 2-6GB depending on batch size and sequence length.
- KV cache: Everything left over. This is where the battle is fought.
The KV cache stores the key and value tensors for every attention layer and every token currently in flight. For a 70B model (80 layers, 8 KV heads, 128 head dim, FP16), each token costs roughly 80 x 2 x 8 x 128 x 2 bytes = 327,680 bytes, or about 327KB of KV cache. A batch of 100 concurrent requests each holding 2,048 tokens consumes over 65GB just for KV state, which is nearly as much as the FP8 weights themselves and far more than a single 80GB GPU has left once those weights are loaded. This is why KV cache is the binding constraint on batch size, and batch size is the binding constraint on cost per token.
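The per-token arithmetic above is easy to reproduce. This is a small sketch using the same model shape quoted in the text (80 layers, 8 KV heads, head dimension 128, 2 bytes per element); the function name is mine, not from any library.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """KV cache bytes for one token across all layers.

    The factor of 2 covers one key tensor plus one value tensor per layer.
    """
    return layers * 2 * kv_heads * head_dim * bytes_per_elem


# Shape quoted in the text for a 70B model with FP16 KV.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2)

# 100 concurrent requests, each holding 2,048 tokens of context.
batch_total = per_token * 100 * 2048

print(f"{per_token:,} bytes per token")
print(f"{batch_total / 1e9:.1f} GB for the whole batch")
```

Swapping in your own model's layer count and KV head count is usually the fastest way to sanity-check whether a target batch size can fit at all.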
The GPU Utilization metric is nearly useless for LLM decode monitoring. It counts SM activity, and the SMs are "active" even when they are stalled waiting for HBM reads. A decode phase can show 85% GPU util while the GPU is spending 70% of cycles waiting for memory. The number you want is Memory Bandwidth Utilization, which DCGM exposes through its DRAM-active profiling metric (DCGM_FI_PROF_DRAM_ACTIVE); the utilization.memory field in nvidia-smi is a coarser proxy. Anything below 50% MBU during decode means your batch is too small or your KV cache is under-filled. Target 70-85% MBU sustained for a cost-efficient deployment.
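If you do not have DCGM wired up yet, you can get a rough back-of-envelope MBU from numbers you already know. The sketch below assumes each decode step reads the full weights plus the KV cache currently in flight, and it takes the peak bandwidth as an input (3.35 TB/s is the H100 SXM spec-sheet figure). The step time and KV size in the example are hypothetical, and this is an estimate, not a substitute for the measured metric.

```python
def estimate_mbu(weight_bytes, kv_bytes_in_flight, step_time_s, peak_bw_bytes_per_s):
    """Estimate memory bandwidth utilization for one decode step.

    Assumption: one decode step streams all weights and all in-flight KV
    cache from HBM exactly once. Real kernels differ, so treat the result
    as a rough indicator only.
    """
    if step_time_s <= 0 or peak_bw_bytes_per_s <= 0:
        raise ValueError("step time and peak bandwidth must be positive")
    bytes_per_step = weight_bytes + kv_bytes_in_flight
    achieved_bw = bytes_per_step / step_time_s
    return achieved_bw / peak_bw_bytes_per_s


# Hypothetical: 70 GB of FP8 weights, 5 GB of KV in flight, 35 ms per step.
mbu = estimate_mbu(
    weight_bytes=70e9,
    kv_bytes_in_flight=5e9,
    step_time_s=0.035,
    peak_bw_bytes_per_s=3.35e12,  # H100 SXM peak HBM3 bandwidth
)
print(f"estimated MBU: {mbu:.0%}")
```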
Continuous Batching: The Biggest Free Lunch in Inference
Static batching was the original approach: group N requests, run them together, return results when all N finish. The problem: requests finish at different times. If seven of the eight requests in a batch finish by step 200 but request 7 runs to step 800, the GPU spends steps 201-800 processing request 7 alone while the other 7 slots sit empty. You paid for a batch, got throughput of 1.
Continuous batching (also called in-flight batching or iteration-level scheduling) solves this by treating the batch as a fluid pool. After each decode iteration, the scheduler checks for finished requests and immediately slots in new ones from the waiting queue. The GPU batch stays filled without waiting for the slowest request in a cohort. This is the default mode in vLLM, TensorRT-LLM, and NIM since at least 2024. If you are not on a continuous-batching runtime, switch before doing anything else. The throughput improvement is routinely 5-10x at matched latency vs static batching.
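The scheduling difference can be sketched with a toy simulation that counts decode iterations only. It ignores prefill, memory limits, and everything else a real scheduler handles; the function names and the request lengths are illustrative, and this is not how vLLM or TensorRT-LLM implement it.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each group of `slots` requests runs until its longest member finishes."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total


def continuous_batching_steps(lengths, slots):
    """After every iteration, finished requests leave and queued ones join."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        # Fill any free slots from the waiting queue.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; those on their last token exit.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# 16 requests on 8 slots: one long request (800 tokens), the rest short (200).
lengths = [200] * 7 + [800] + [200] * 8
static_steps = static_batching_steps(lengths, slots=8)
continuous_steps = continuous_batching_steps(lengths, slots=8)
print(static_steps, continuous_steps)
```

In this toy case the second wave of short requests finishes inside the shadow of the long one instead of waiting for it. The gap widens as output lengths become more varied and the queue gets deeper.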
Chunked Prefill: Protecting Latency at High Load
A newer refinement available in TensorRT-LLM and vLLM: chunked prefill. When a long prompt arrives (say, 8K tokens), the naive scheduler would run its entire prefill in one shot, blocking all decode work on other active requests for hundreds of milliseconds. Chunked prefill splits that prefill into smaller chunks (e.g., 512 tokens per chunk) interleaved with decode iterations. TTFT for the new request increases slightly, but ITL for existing requests stays stable. For mixed interactive/batch workloads, this is the right default. NIM configures this automatically based on the model and GPU.
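As a rough illustration of the interleaving, the sketch below splits one long prompt into fixed-size chunks and pairs each chunk with a decode step for the requests already generating. The function and its dictionary fields are hypothetical; real schedulers work against a per-iteration token budget and mix chunks from several prompts.

```python
def chunked_prefill_schedule(prompt_tokens, chunk_size):
    """Return the per-iteration work for prefilling one new prompt.

    Each entry is one scheduler iteration: a prefill chunk for the new
    request, sharing the iteration with a decode step for existing requests.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    schedule = []
    done = 0
    while done < prompt_tokens:
        chunk = min(chunk_size, prompt_tokens - done)
        schedule.append({"prefill_tokens": chunk, "decode_step": True})
        done += chunk
    return schedule


# The 8K-token prompt from the text, in 512-token chunks.
plan = chunked_prefill_schedule(prompt_tokens=8192, chunk_size=512)
print(len(plan), "iterations, each also advancing in-flight decodes by one token")
```

Without chunking, the same prompt would occupy a single long iteration in which no existing request emits a token, which is exactly the ITL stall described above.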
Quantization: Shrinking Weights and KV Cache Together
Quantization does two things simultaneously: it reduces the model weight size (freeing HBM for KV cache) and it increases the raw compute FLOPS available (H100 FP8 tensor cores deliver ~1,979 TFLOPS vs ~989 TFLOPS for FP16). Both effects help throughput. FP8 is now the default precision for production Llama deployments via NIM and TensorRT-LLM on Hopper, and FP4 is the target format on Blackwell (B200, GB200 NVL72) where native FP4 tensor cores deliver another 2x multiplier over FP8.
For most Llama-class models (7B to 70B), FP8 has negligible quality degradation on standard benchmarks — typically under 0.5% drop on MMLU and similar evals versus FP16. INT4 weight-only quantization (AWQ, GPTQ) cuts weight size another 2x but can degrade quality noticeably on long-context and reasoning tasks. I recommend FP8 as the default; drop to INT4 only when the model does not fit in FP8 and you cannot add GPUs, and always run your own quality evaluation on your actual task before shipping.
KV Cache Quantization: The Hidden Win
Beyond weight quantization, TensorRT-LLM and vLLM both support KV cache quantization (INT8 or FP8 KV). Since the KV cache is the primary bandwidth-consumer during decode, halving its size (FP16 to FP8 KV) increases the effective batch size that fits in HBM by roughly 1.6-1.8x and reduces per-step memory bandwidth demand proportionally. This is often a 20-30% throughput gain at zero quality impact on most models. Enable it. The NIM Llama containers have it on by default for H100/H200.
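The capacity side of this can be sketched with integer arithmetic. The 8 GB KV budget below is a hypothetical figure for what might remain on an 80GB GPU after FP8 weights and overhead; the model shape is the one quoted earlier. Pure capacity math lands near 2x, so treat it as an upper bound rather than a prediction of the measured gain.

```python
def max_concurrent_requests(kv_budget_bytes, tokens_per_request,
                            layers, kv_heads, head_dim, kv_bytes_per_elem):
    """How many requests of a given context length fit in the KV budget."""
    per_token = layers * 2 * kv_heads * head_dim * kv_bytes_per_elem
    return kv_budget_bytes // (per_token * tokens_per_request)


KV_BUDGET = 8_000_000_000  # hypothetical HBM left for KV cache, in bytes
shape = dict(layers=80, kv_heads=8, head_dim=128)

fp16_batch = max_concurrent_requests(KV_BUDGET, 2048, kv_bytes_per_elem=2, **shape)
fp8_batch = max_concurrent_requests(KV_BUDGET, 2048, kv_bytes_per_elem=1, **shape)
print(f"FP16 KV: {fp16_batch} requests, FP8 KV: {fp8_batch} requests")
```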
The Cost Per Token Formula
The math is simple once you have the right numbers. The cost formula for output tokens is:
Cost per 1M output tokens = (GPU_hour_rate / 3600) / output_TPS x 1,000,000
Where:
- GPU_hour_rate = $/GPU-hour (cloud spot or reserved, or TCO/hour on-prem)
- output_TPS = sustained output tokens/sec across all concurrent requests
- The denominator (output_TPS) scales with batch size, quantization, model size, and hardware
The only way to lower cost per token without changing hardware is to increase output TPS. The only ways to increase output TPS are: larger batch (more KV cache headroom via quantization or bigger GPU), faster compute (FP8/FP4 or newer GPU), or less work per token (smaller model, shorter context). Everything else is noise.
Worked Example
Scenario: Llama-3 70B, single H100 SXM 80GB, FP8 precision, cloud at $2.50/GPU-hour [VERIFY: spot pricing varies by provider and availability]. NIM container, continuous batching enabled, FP8 KV cache. Target: interactive chatbot with TTFT SLO of 1.5s, ITL SLO of 40ms.
Step 1: Establish peak throughput at SLO. At the TTFT/ITL SLOs above, benchmark shows sustained output TPS of approximately 1,800 tokens/sec at batch size 32 [VERIFY: run your own NIM benchmarking; this is an illustrative number consistent with published H100 FP8 data]. At batch size 64 with relaxed TTFT (3s), TPS climbs to roughly 2,800 tokens/sec.
Step 2: Compute cost per million tokens.
- Interactive (B=32, 1,800 TPS): ($2.50/3600) / 1,800 x 1,000,000 = ~$0.39/1M output tokens
- Batch (B=64, 2,800 TPS): ($2.50/3600) / 2,800 x 1,000,000 = ~$0.25/1M output tokens
- Naive batch-size-1 (~85 TPS, no continuous batching): ($2.50/3600) / 85 x 1,000,000 = ~$8.17/1M output tokens
Conclusion: Continuous batching + FP8 + right batch size cuts cost from ~$8/1M to under $0.40/1M on the same hardware. The tuned configuration does roughly 20x more work per GPU-hour than the naive batch-size-1 deployment (1,800 vs ~85 TPS). All numbers assume 100% GPU utilization across the hour. Real deployments average 40-70% utilization due to traffic variation, so multiply by 1.4-2.5 for real-world average cost. To estimate input token cost, substitute prefill TPS for output TPS in the same formula.
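The worked example can be reproduced in a few lines. The function below implements the cost formula from this post and adds an optional utilization factor for the real-world adjustment; the function name and the `utilization` parameter are my own additions, and the TPS inputs are the same illustrative figures used above.

```python
def cost_per_million_tokens(gpu_hour_rate, output_tps, utilization=1.0):
    """Dollars per 1M output tokens.

    gpu_hour_rate: $/GPU-hour.
    output_tps:    sustained output tokens/sec at full load.
    utilization:   average fraction of the hour spent at that load (0-1].
    """
    if output_tps <= 0 or not 0 < utilization <= 1:
        raise ValueError("output_tps must be > 0 and utilization in (0, 1]")
    return gpu_hour_rate / 3600 / (output_tps * utilization) * 1_000_000


RATE = 2.50  # $/GPU-hour from the scenario above
interactive = cost_per_million_tokens(RATE, 1800)
batch = cost_per_million_tokens(RATE, 2800)
naive = cost_per_million_tokens(RATE, 85)
# Same interactive deployment, but only 50% utilized on average.
interactive_half_loaded = cost_per_million_tokens(RATE, 1800, utilization=0.5)

for label, cost in [("interactive", interactive), ("batch", batch),
                    ("naive", naive), ("interactive @ 50%", interactive_half_loaded)]:
    print(f"{label:>18}: ${cost:.2f} per 1M output tokens")
```

Keeping utilization as an explicit input makes it harder to quote the best-case number in a budget by accident.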
Cost Per Million Tokens by GPU and Precision
The following table uses illustrative throughput numbers aligned with published benchmark ranges. All GPU-hour rates are cloud spot/on-demand estimates and vary by provider, region, and commitment level. All numbers should be validated against your specific workload before quoting in a budget. Throughput figures assume continuous batching, optimized KV cache, and batch sizes tuned to the ITL SLO noted.
| GPU | HBM | Precision | Output TPS (70B, B=32) | Est. $/GPU-hr [VERIFY] | Est. $/1M out tokens |
|---|---|---|---|---|---|
| H100 SXM 80GB | 80GB HBM3 | FP16 | ~900 [VERIFY] | $2.50 | ~$0.77 |
| H100 SXM 80GB | 80GB HBM3 | FP8 | ~1,800 [VERIFY] | $2.50 | ~$0.39 |
| H200 SXM 141GB | 141GB HBM3e | FP8 | ~3,200 [VERIFY] | $3.50 | ~$0.30 |
| B200 SXM 192GB | 192GB HBM3e | FP4 | ~8,000 [VERIFY] | $6.00-$8.00 | ~$0.21-$0.28 |
| H100 SXM 80GB (naive, no batching) | 80GB HBM3 | FP16, B=1 | ~85 | $2.50 | ~$8.17 |
All throughput and pricing figures are worked-example estimates. Verify every item marked [VERIFY] before using it in procurement. The FP16 rows assume Llama-3 70B is sharded across 2x H100; cost is shown per GPU, so a real FP16 deployment pays for two.
The Latency-Throughput Frontier: How to Read a Benchmark
Every serious inference benchmark publishes a latency-throughput frontier: a curve showing achievable ITL (or P99 TTFT) as a function of total output TPS. The curve bends sharply as you approach the GPU throughput ceiling. The region near the knee of the curve is where you want to operate: throughput is high, latency is still within SLO. Push past the knee and latency explodes without proportional throughput gain.
When you see a vendor quoting "20,000 tokens/sec on H100" without a latency qualifier, that number is almost certainly measured at or past the knee — TTFT and ITL are in the seconds range at that throughput. Useless for interactive workloads. Always ask: at what P95 TTFT and ITL does that TPS number hold?
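One simple way to read a concurrency sweep is to pick the highest-throughput point that still meets both latency SLOs. The sketch below does that over a list of measurements; the field names and every number in the sample sweep are made up for illustration and do not come from any benchmark.

```python
def best_operating_point(sweep, ttft_slo_ms, itl_slo_ms):
    """Return the highest-TPS point that satisfies both SLOs, or None."""
    within_slo = [
        point for point in sweep
        if point["p95_ttft_ms"] <= ttft_slo_ms and point["p95_itl_ms"] <= itl_slo_ms
    ]
    return max(within_slo, key=lambda point: point["tps"]) if within_slo else None


# Hypothetical sweep results: latency climbs faster than throughput past the knee.
sweep = [
    {"concurrency": 8,   "tps": 600,  "p95_ttft_ms": 300,  "p95_itl_ms": 18},
    {"concurrency": 16,  "tps": 1100, "p95_ttft_ms": 500,  "p95_itl_ms": 24},
    {"concurrency": 32,  "tps": 1800, "p95_ttft_ms": 1100, "p95_itl_ms": 36},
    {"concurrency": 64,  "tps": 2800, "p95_ttft_ms": 2900, "p95_itl_ms": 48},
    {"concurrency": 128, "tps": 3000, "p95_ttft_ms": 9000, "p95_itl_ms": 95},
]

chosen = best_operating_point(sweep, ttft_slo_ms=1500, itl_slo_ms=40)
print(chosen)
```

A headline TPS figure with no latency attached corresponds to the last row of a table like this one, which is why the question about P95 TTFT and ITL is worth asking.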
What to Do in Production: The Priority Stack
- Switch to a continuous-batching runtime (NIM, vLLM, TensorRT-LLM). This single change typically delivers 5-10x throughput improvement over naive serving. Cost per token drops by the same factor.
- Enable FP8 quantization (TensorRT-LLM or NIM default on H100/H200). Doubles throughput on Hopper vs FP16, halves weight memory, frees HBM for larger batches.
- Enable FP8 or INT8 KV cache quantization. Another 20-30% throughput on top of weight quantization. Usually on by default in NIM; verify with nim config show.
- Profile your actual batch size at your SLOs. Run genai-perf (NVIDIA’s benchmarking tool from the NIM ecosystem) against your deployment, sweeping concurrency from 1 to 128. Find the knee of the latency-throughput frontier. Set max-batch-size to the concurrency at the knee.
- Enable prefix caching for repeated system prompts. If your application uses a static or slowly-changing system prompt, the KV cache for that prefix can be reused across requests. TTFT drops by 50-90% for those requests.
- Consider disaggregated prefill/decode (NVIDIA Dynamo, covered in Part 20) when TTFT and throughput SLOs cannot be met simultaneously on shared GPUs. Dedicating separate GPU pools to prefill vs decode lets each be tuned independently.
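To show why prefix caching in the list above saves prefill work, here is a toy model of block-level prefix reuse. It chain-hashes fixed-size blocks of token IDs so that a cache key identifies the whole prefix up to that block. The block size, function names, and token IDs are all illustrative assumptions, and this is not the actual implementation in any serving runtime.

```python
import hashlib

BLOCK = 16  # tokens per cache block (illustrative value)


def block_keys(token_ids):
    """Chain-hash each full block so a key covers the entire prefix before it."""
    keys, prev = [], b""
    full_len = len(token_ids) - len(token_ids) % BLOCK
    for i in range(0, full_len, BLOCK):
        block = ",".join(map(str, token_ids[i:i + BLOCK])).encode()
        prev = hashlib.sha256(prev + block).digest()
        keys.append(prev)
    return keys


def cached_prefix_tokens(token_ids, cache):
    """Count leading tokens already cached, then record this prompt's blocks."""
    keys = block_keys(token_ids)
    hit_blocks = 0
    for key in keys:
        if key not in cache:
            break
        hit_blocks += 1
    cache.update(keys)
    return hit_blocks * BLOCK


cache = set()
system_prompt = list(range(64))  # stand-in for a shared 64-token system prompt
first_hit = cached_prefix_tokens(system_prompt + [1000, 1001, 1002], cache)
second_hit = cached_prefix_tokens(system_prompt + [2000] * 20, cache)
print(first_hit, second_hit)
```

The second request skips prefill for the 64 shared tokens and only computes its own suffix, which is where the TTFT reduction comes from when the system prompt dominates the input.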
When Not to Chase Lower Cost Per Token
Large batches and high utilization are right for predictable, sustained traffic. They are wrong for:
- Highly variable traffic with spiky peaks: A deployment tuned for 80% average utilization will blow its TTFT SLO during the 5-minute peak where utilization hits 100%. Size for peak, not average, unless you have autoscaling that can cold-start in under 30 seconds (you probably do not, since NIM images are large).
- Safety-critical or compliance-gated inference: If each response requires audit logging, PII scanning, or output guardrails running serially, the latency budget for those layers may make batching counterproductive. Each added step consumes TTFT budget.
- Very long context (>32K tokens): KV cache per request is enormous. At roughly 327KB per token for a 70B model with FP16 KV, a single 32K-token request holds over 10GB of KV state, so batch size collapses toward 1 on almost any single GPU. At that point, throughput optimization shifts to speculative decoding and disaggregated serving rather than batching.
What to Validate Before You Go to Production
Before declaring your inference stack production-ready, run through this checklist:
- Benchmark at your actual input/output distribution, not the benchmark distribution. A deployment tuned for 128-token inputs behaves completely differently with 4K-token inputs. Prefill time scales with input length; your TTFT SLO may be violated at long-context requests even when batch metrics look fine.
- Validate quality at your quantization setting. Run your model evaluation suite (MMLU, task-specific evals) against FP8 before shipping. For most Llama-class models this is a formality, but for fine-tuned models or models with unusual activation distributions, FP8 can be more destructive.
- Load test with realistic concurrency. Use genai-perf or locust with request arrival patterns matching your production traffic (Poisson arrivals, not synchronized bursts). P99 latency under realistic load is what matters for SLO design, not median latency under synthetic ramp tests.
- Monitor MBU, not just GPU utilization %. Wire DCGM metrics into your observability stack (Prometheus + Grafana). Alert on MBU below 40% (under-utilized) and on P99 TTFT exceeding 2x your SLO target (early warning before SLO breach).
- Account for model loading time in your autoscaling design. NIM containers cold-start in 3-10 minutes depending on model size and whether the image is cached. Your horizontal pod autoscaler (HPA) scale-out lead time must be shorter than your traffic ramp-up time, or you need pre-warmed standby replicas. This is the operability gap most teams discover only after their first traffic spike.
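For the load-testing item in the checklist, Poisson arrivals are straightforward to generate: inter-arrival gaps are exponentially distributed with mean 1/rate. This is a minimal standard-library sketch; the function name, rate, and duration are illustrative, and you would feed the resulting send times into whatever load generator you use.

```python
import random


def poisson_arrival_times(rate_per_s, duration_s, seed=0):
    """Return request send times (seconds) for a Poisson arrival process."""
    if rate_per_s <= 0:
        raise ValueError("rate_per_s must be positive")
    rng = random.Random(seed)  # seeded so a test run can be repeated
    t, times = 0.0, []
    while True:
        # Exponential gap between consecutive arrivals, mean 1 / rate.
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            return times
        times.append(t)


# Hypothetical load: 10 requests/sec on average for 100 seconds.
arrivals = poisson_arrival_times(rate_per_s=10, duration_s=100)
gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
print(f"{len(arrivals)} requests, shortest gap {min(gaps) * 1000:.2f} ms")
```

Unlike a fixed-interval ramp, this produces occasional clusters of near-simultaneous requests, which is what pushes P99 TTFT up in production.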
The Verdict
Inference economics come down to a single principle: the GPU is only as useful as the fraction of its memory bandwidth you are consuming with real work. Everything else follows from that. Continuous batching keeps the slots filled. FP8 quantization doubles the useful work per byte of HBM bandwidth. KV cache quantization extends how large a batch you can hold. Chunked prefill protects ITL SLOs at high load.
On a well-configured H100 FP8 deployment with NIM and continuous batching, a Llama-3 70B model runs at under $0.50/1M output tokens at interactive latencies. The same model on the same GPU with naive configuration runs at $8-15/1M tokens. That gap is not hardware. It is configuration and understanding.
I recommend H200 over H100 for new on-prem deployments specifically for its KV cache headroom advantage: the 141GB HBM3e enables 2-3x larger effective batch sizes on 70B models compared to H100 FP8, which directly translates to lower cost per token at similar hardware cost (once the H200 supply stabilizes). For teams already on H100 clusters, the configuration improvements above will yield far more cost reduction than a hardware refresh.
What I would not do: buy more GPUs before profiling the ones you have. Run genai-perf against your current deployment today. If your peak MBU is below 60%, you have a configuration problem, not a capacity problem. Fix it first.
If you are working through on-prem GPU cluster sizing and want to wire the economics from this post into a full cost model, the NVIDIA AI Guide covers the full stack. And for teams interested in training economics on the same hardware, Part 22 on the NeMo framework picks up where this part leaves off.
References
- NVIDIA TensorRT-LLM: H100 vs A100 Performance — 10,000 tok/s at 100ms TTFT
- NVIDIA TensorRT-LLM: H200 achieves 12,000 tok/s on Llama2-13B
- Anyscale Docs: LLM Latency and Throughput Metrics (TTFT, ITL definitions)
- GMI Cloud: Cost Per Million Tokens — LLM Inference on GPU Cloud
- Metrum AI: Llama 4 Maverick on H200 vs B200 using vLLM — Performance Analysis
- NVIDIA NIM LLM Benchmarking — Official Performance Data



