Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture


Inference Economics: Throughput, Latency, Batching and Cost Per Token (NVIDIA AI Series, Part 21)

TTFT, ITL, continuous batching, KV cache pressure, FP8 quantization — this is how you compute and actually drive down $/1M tokens on NVIDIA H100, H200, and B200 GPUs without breaking your latency SLOs.

NVIDIA AI Series · Part 21 of 30
TL;DR: Inference cost is almost never about raw FLOPS; it is about memory bandwidth and how well you keep the GPU busy with real work. TTFT (time to first token) and ITL (inter-token latency) are the two user-visible latency numbers you need SLOs on. Continuous batching is the single highest-impact change you can make to cut cost per token. Quantization (FP8 or FP4) is the second. On an H100 SXM at $2.50/GPU-hour, a well-tuned Llama-3 70B deployment lands around $0.25-$0.40 per million output tokens at full utilization in the worked example in this post. A naive batch-size-1 deployment of the same model on the same hardware costs about $8 per million, roughly 20x more. The difference is configuration, not hardware.
Who this is for: Platform engineers and AI infrastructure architects who are deploying or sizing LLM inference clusters. You should be familiar with GPU basics (covered in the full NVIDIA AI Guide). This part focuses on the economics: how the numbers connect, where the money goes, and what to actually do about it. Part 18 covers TensorRT-LLM quantization mechanics; Part 20 covers Dynamo disaggregated serving at scale. This part is the bridge that makes those choices financially legible.

Every team I talk to has the same complaint six months after their first NIM deployment: "We thought the GPU was the expensive part. Turns out we just did not know how to use it." They are running $3-per-hour H100s at 8% utilization and wondering why the cost-per-token math does not work. The answer is almost always the same: batch size of one, no continuous batching, no quantization, and a KV cache that is half-empty. This post fixes that.

The Two Latency Numbers That Actually Matter

LLM inference has two distinct latency components, and they respond to different pressure points. Conflating them is the most common mistake I see in SLO design.

TTFT (Time to First Token): The wall-clock time from the moment your API call lands to when the first output token is returned. TTFT is dominated by the prefill phase: the model processes the entire input prompt in one forward pass, filling the KV cache. Prefill is compute-bound (lots of matrix multiplications in parallel), and its duration grows with prompt length, so a 4,096-token system prompt on an H100 takes far longer to prefill than a 128-token prompt. For interactive applications (chatbots, coding assistants, anything where a human is watching), TTFT above 2-3 seconds is where users start abandoning sessions.

ITL (Inter-Token Latency, also called TPOT, time per output token): The average time between successive output tokens once generation starts. ITL is bandwidth-bound: each decode step streams the model weights and the KV cache for every active request from HBM, does a relatively small amount of math, and emits one token per request in the batch. For real-time streaming (voice, live typing), ITL needs to stay below ~30-50ms per token to feel fluid. For batch document generation, you care much less about ITL and much more about total throughput.

[Figure 1 diagram: timeline of a single inference request. The prompt arrives, the prefill phase runs (compute-bound, parallel over all input tokens, sets TTFT), then tokens are decoded one at a time (bandwidth-bound, one ITL per token). Typical SLOs: TTFT 1-3s for interactive; ITL 25-50ms for streaming, relaxed for batch.]
Figure 1: A single inference request consists of a compute-bound prefill phase (sets TTFT) followed by sequential bandwidth-bound decode steps (sets ITL). They respond to different tuning levers.
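To make the two definitions concrete, here is a minimal sketch of how TTFT and mean ITL could be computed from the timestamps of one streamed response. The function name and the sample trace are my own illustrations, not part of any NVIDIA tool.

```python
from statistics import mean

def latency_metrics(request_arrival_s, token_times_s):
    """Return (ttft_s, mean_itl_s) for one streamed response.

    request_arrival_s: when the request reached the server, in seconds.
    token_times_s: emission time of each output token, in order, in seconds.
    """
    if not token_times_s:
        raise ValueError("need at least one output token")
    ttft = token_times_s[0] - request_arrival_s
    # ITL is the gap between consecutive output tokens.
    gaps = [later - earlier for earlier, later in zip(token_times_s, token_times_s[1:])]
    itl = mean(gaps) if gaps else 0.0
    return ttft, itl

# Hypothetical trace: 0.8 s until the first token, then one token every 30 ms.
arrival = 10.0
tokens = [10.8 + 0.03 * i for i in range(5)]
ttft, itl = latency_metrics(arrival, tokens)
print(f"TTFT = {ttft * 1000:.0f} ms, mean ITL = {itl * 1000:.0f} ms")
```

In a real deployment you would collect these per request and put SLOs on the P95 or P99, not on the mean.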

The Trade-off: Throughput vs Latency

Throughput (tokens/second across all users) and per-request latency pull in opposite directions. To maximize throughput, you want large batches — many requests processed simultaneously, amortizing the per-layer weight reads across as many tokens as possible. But each additional request in a batch adds memory pressure (more KV cache), and during prefill, batching multiple long prompts together increases TTFT for requests that land mid-batch. There is no free lunch. The operating point depends on your workload mix and your SLOs.

In practice: if you are serving a chatbot where humans read the output in real time, you probably need TTFT under 1.5s and ITL under 40ms. If you are running document summarization pipelines overnight, you can saturate the GPU with massive batches and TTFT of 20 seconds is irrelevant — you care only about total job throughput and cost per million output tokens.
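The shape of this trade-off can be sketched with a deliberately crude bandwidth-only model: assume each decode step streams the weights once plus every active request's KV cache, and ignore compute and scheduling entirely. The function, the 70% efficiency factor, and the input values are my assumptions for illustration; the outputs show the trend, not benchmark numbers.

```python
def decode_step_estimate(batch, context_tokens, weight_bytes,
                         kv_bytes_per_token, hbm_bytes_per_s, efficiency=0.7):
    """Estimate (itl_seconds, total_tokens_per_second) for one decode step.

    Assumes decode is purely bandwidth-bound: every step reads the weights
    once plus each request's KV cache. Compute time is ignored.
    """
    bytes_per_step = weight_bytes + batch * context_tokens * kv_bytes_per_token
    itl_s = bytes_per_step / (hbm_bytes_per_s * efficiency)
    return itl_s, batch / itl_s

# Illustrative inputs (assumptions, not measurements):
WEIGHT_BYTES = 70e9           # roughly 70GB of FP8 weights for a 70B model
KV_BYTES_PER_TOKEN = 163_840  # assumed FP8 KV cache cost per token
HBM_BYTES_PER_S = 3.35e12     # H100 SXM peak HBM3 bandwidth

for batch in (1, 8, 32, 64):
    itl, tps = decode_step_estimate(batch, 1024, WEIGHT_BYTES,
                                    KV_BYTES_PER_TOKEN, HBM_BYTES_PER_S)
    print(f"batch={batch:3d}  ITL~{itl * 1000:5.1f} ms  total TPS~{tps:7.0f}")
```

The weight read is a fixed cost per step, so total TPS climbs almost linearly with batch size while ITL creeps up as the KV cache traffic grows. That is the tension the SLOs have to resolve.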

Metric Definitions You Need at Your Fingertips

| Metric | Full Name | What It Measures | Why It Matters |
| --- | --- | --- | --- |
| TTFT | Time to First Token | ms from request arrival to first output token | Interactive UX; chatbot responsiveness |
| ITL / TPOT | Inter-Token Latency / Time Per Output Token | ms between successive output tokens | Streaming smoothness; voice app floor |
| TPS | Tokens Per Second | Total output tokens/sec across all requests | System throughput; cost denominator |
| MBU | Memory Bandwidth Utilization | % of peak HBM bandwidth consumed | True utilization; decode efficiency ceiling |
| GPU Util % | SM Activity | % of SMs active during inference window | Often misleading during decode; use MBU instead |
| KV Cache Hit Rate | Prefix Cache Reuse | % of prefill work skipped via cached prefixes | TTFT reduction for repeated system prompts |
| $/1M tokens | Cost per Million Output Tokens | GPU-hour cost / (TPS x 3600) x 1,000,000 | The unit economics number your CFO will ask about |

GPU Memory and the KV Cache: The Real Constraint

Before you can understand cost, you need to understand where GPU memory goes. On an H100 SXM with 80GB HBM3, the memory budget splits roughly as follows for a Llama-3 70B model at FP16:

  • Model weights: ~140GB at FP16 — this is why Llama 70B requires at least 2x H100s for FP16, or a single H100/H200 with FP8 quantization bringing it to ~70GB.
  • Activations and framework overhead: 2-6GB depending on batch size and sequence length.
  • KV cache: Everything left over. This is where the battle is fought.

The KV cache stores the key and value tensors for every attention layer and every token currently in flight. For a 70B model (80 layers, 8 KV heads, 128 head dim, FP16), each token costs roughly 80 x 2 x 8 x 128 x 2 bytes = 327KB of KV cache. A batch of 100 concurrent requests, each holding 2,048 tokens, consumes about 67GB just for KV state, nearly as much as the full set of FP8 weights. This is why KV cache is the binding constraint on batch size, and batch size is the binding constraint on cost per token.
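The per-token arithmetic above is easy to reproduce. This is a small sketch using the layer and head counts quoted in the paragraph; the function name is mine.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """KV cache bytes needed for one token across all layers."""
    # The factor of 2 covers storing both a key and a value tensor per layer.
    return layers * 2 * kv_heads * head_dim * bytes_per_value

per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)
batch_total = per_token * 100 * 2048  # 100 requests, 2,048 tokens each

print(f"per token: {per_token} bytes")                    # 327680
print(f"batch of 100 x 2,048: {batch_total / 1e9:.1f} GB")  # 67.1
```

Setting bytes_per_value=1 models an FP8 KV cache and halves both numbers.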

Gotcha: The DCGM GPU Utilization metric is nearly useless for LLM decode monitoring. It reports how much of the time a kernel was running on the GPU, and the SMs count as "active" even when they are stalled waiting for HBM reads. A decode phase can show 85% GPU util while the GPU is spending 70% of cycles waiting for memory. The number you want is Memory Bandwidth Utilization, for example DCGM's DRAM-active profiling metric (DCGM_FI_PROF_DRAM_ACTIVE); nvidia-smi's utilization.memory field is only a coarse proxy. Anything below 50% MBU during decode means your batch is too small or your KV cache is under-filled. Target 70-85% MBU sustained for a cost-efficient deployment.
[Figure 2 diagram: GPU memory budget for Llama-3 70B in three configurations: H100 80GB at FP16 (two GPUs required, ~70GB of weights and ~8GB of KV cache per GPU), H100 80GB at FP8 (single GPU), and H200 141GB at FP8 (single GPU). More KV cache headroom means a larger batch and a lower cost per token.]
Figure 2: GPU memory allocation for Llama-3 70B at FP16 and FP8. With ~70GB of FP8 weights and 2-6GB of overhead, a single H100 80GB is left with only about 4-8GB for KV cache, while an H200 141GB has roughly 65GB, which directly enables larger batch sizes and lower per-token cost.
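To turn a memory budget into a batch-size ceiling, subtract weights and overhead from HBM and divide what is left by the KV cost of one request. The sketch below uses the ~70GB FP8 weight figure and a mid-range 4GB overhead from the list above, plus an FP8 KV cache at half the FP16 per-token cost; the function name and the 2,048-token request size are illustrative assumptions.

```python
def max_concurrent_requests(hbm_gb, weights_gb, overhead_gb,
                            kv_bytes_per_token, tokens_per_request):
    """How many requests fit in the HBM left after weights and overhead."""
    headroom_bytes = (hbm_gb - weights_gb - overhead_gb) * 1e9
    if headroom_bytes <= 0:
        return 0
    return int(headroom_bytes // (kv_bytes_per_token * tokens_per_request))

FP8_KV_BYTES_PER_TOKEN = 163_840  # half of the ~327KB FP16 figure

h100 = max_concurrent_requests(80, 70, 4, FP8_KV_BYTES_PER_TOKEN, 2048)
h200 = max_concurrent_requests(141, 70, 4, FP8_KV_BYTES_PER_TOKEN, 2048)
print(f"H100 80GB:  up to {h100} concurrent 2,048-token requests")
print(f"H200 141GB: up to {h200} concurrent 2,048-token requests")
```

This is an upper bound that ignores fragmentation and paging granularity, but it shows how directly KV headroom caps concurrency.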

Continuous Batching: The Biggest Free Lunch in Inference

Static batching was the original approach: group N requests, run them together, return results when all N finish. The problem: requests finish at different times. If seven of the eight requests in a batch finish by step 200 but request 7 runs to step 800, the GPU spends steps 201-800 decoding request 7 alone while the other seven slots sit empty. You paid for a batch of eight and got a batch of one.

Continuous batching (also called in-flight batching or iteration-level scheduling) solves this by treating the batch as a fluid pool. After each decode iteration, the scheduler checks for finished requests and immediately slots in new ones from the waiting queue. The GPU batch stays filled without waiting for the slowest request in a cohort. This is the default mode in vLLM, TensorRT-LLM, and NIM since at least 2024. If you are not on a continuous-batching runtime, switch before doing anything else. The throughput improvement is routinely 5-10x at matched latency vs static batching.
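A toy simulation makes the slot-filling effect visible. It counts decode steps only, treats every step as costing the same regardless of how many slots are busy, and uses made-up output lengths, so it understates the full gain a real runtime delivers. Both function names are mine.

```python
import heapq

def static_batch_steps(lengths, slots):
    """Each group of `slots` requests runs until its longest member finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batch_steps(lengths, slots):
    """A freed slot is refilled immediately from the waiting queue."""
    free_at = [0] * slots  # step at which each slot next becomes free
    heapq.heapify(free_at)
    finish = 0
    for length in lengths:
        start = heapq.heappop(free_at)
        end = start + length
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish

# Hypothetical output lengths (tokens) for 16 queued requests, 4 batch slots.
lengths = [200, 800, 150, 300] * 4
slots = 4
static_steps = static_batch_steps(lengths, slots)
continuous_steps = continuous_batch_steps(lengths, slots)
for name, steps in (("static", static_steps), ("continuous", continuous_steps)):
    utilization = sum(lengths) / (slots * steps)
    print(f"{name:10s} steps={steps:5d}  slot utilization={utilization:.0%}")
```

Even in this simplified form the same 16 requests finish in far fewer steps, because no slot waits on the slowest request in its cohort.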

Chunked Prefill: Protecting Latency at High Load

A newer refinement available in TensorRT-LLM and vLLM: chunked prefill. When a long prompt arrives (say, 8K tokens), the naive scheduler would run its entire prefill in one shot, blocking all decode work on other active requests for hundreds of milliseconds. Chunked prefill splits that prefill into smaller chunks (e.g., 512 tokens per chunk) interleaved with decode iterations. TTFT for the new request increases slightly, but ITL for existing requests stays stable. For mixed interactive/batch workloads, this is the right default. NIM configures this automatically based on the model and GPU.
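One way to picture chunked prefill is as a per-iteration token budget: active decodes take one token each, and whatever budget remains goes to the next slice of the waiting prompt. This sketch is my own simplification under that assumption; it is not how vLLM or TensorRT-LLM implement their schedulers, and the budget value is hypothetical.

```python
def plan_iterations(prompt_tokens, active_decodes, token_budget):
    """Return a list of (decode_tokens, prefill_tokens), one entry per iteration,
    until the new prompt has been fully prefilled."""
    if token_budget <= active_decodes:
        raise ValueError("token budget must leave room for prefill work")
    plan = []
    remaining = prompt_tokens
    while remaining > 0:
        # Decodes are served first so existing requests keep a steady ITL.
        prefill = min(remaining, token_budget - active_decodes)
        plan.append((active_decodes, prefill))
        remaining -= prefill
    return plan

# An 8K-token prompt arriving while 32 requests are decoding,
# with a hypothetical budget of 544 tokens per iteration (512 left for prefill).
plan = plan_iterations(prompt_tokens=8192, active_decodes=32, token_budget=544)
print(f"{len(plan)} iterations, {plan[0][1]} prefill tokens per iteration")
```

The new request waits for all of those iterations before its first token, which is the slight TTFT increase mentioned above, while every existing request still gets a token in each one.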

[Figure 3 diagram: GPU batch slot utilization over time, one row per slot. Static batching: 3 of 4 slots sit idle for 40-60% of the batch window. Continuous batching: new requests fill vacated slots, giving a 5-10x throughput improvement at matched latency. Continuous batching is not optional for production. It is the baseline.]
Figure 3: Static batching wastes GPU slots when fast requests finish early. Continuous batching immediately fills vacated slots, keeping throughput near the GPU ceiling.

Quantization: Shrinking Weights and KV Cache Together

Quantization does two things simultaneously: it reduces the model weight size (freeing HBM for KV cache) and it increases the raw compute FLOPS available (H100 FP8 tensor cores deliver ~1,979 TFLOPS vs ~989 TFLOPS for FP16). Both effects help throughput. FP8 is now the default precision for production Llama deployments via NIM and TensorRT-LLM on Hopper, and FP4 is the target format on Blackwell (B200, GB200 NVL72) where native FP4 tensor cores deliver another 2x multiplier over FP8.

For most Llama-class models (7B to 70B), FP8 has negligible quality degradation on standard benchmarks — typically under 0.5% drop on MMLU and similar evals versus FP16. INT4 weight-only quantization (AWQ, GPTQ) cuts weight size another 2x but can degrade quality noticeably on long-context and reasoning tasks. I recommend FP8 as the default; drop to INT4 only when the model does not fit in FP8 and you cannot add GPUs, and always run your own quality evaluation on your actual task before shipping.

KV Cache Quantization: The Hidden Win

Beyond weight quantization, TensorRT-LLM and vLLM both support KV cache quantization (INT8 or FP8 KV). Since the KV cache is a major bandwidth consumer during decode, halving its size (FP16 to FP8 KV) increases the effective batch size that fits in HBM by roughly 1.6-1.8x and reduces per-step KV read traffic proportionally. This is often a 20-30% throughput gain with negligible quality impact on most models. Enable it. The NIM Llama containers have it on by default for H100/H200.

The Cost Per Token Formula

The math is simple once you have the right numbers. The cost formula for output tokens is:

Cost per 1M output tokens =
  (GPU_hour_rate / 3600) / output_TPS x 1,000,000

Where:
  GPU_hour_rate     = $/GPU-hour (cloud spot or reserved, or TCO/hour on-prem)
  output_TPS        = sustained output tokens/sec across all concurrent requests

output_TPS (the denominator) scales with: batch size, quantization, model size, hardware

The only way to lower cost per token without changing hardware is to increase output TPS. The only ways to increase output TPS are: larger batch (more KV cache headroom via quantization or bigger GPU), faster compute (FP8/FP4 or newer GPU), or less work per token (smaller model, shorter context). Everything else is noise.
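The formula runs in both directions: given throughput it yields a cost, and given a cost target it yields the throughput you must sustain. Here is a minimal sketch; the function names and the $0.50 target are illustrative choices of mine.

```python
def cost_per_million_tokens(gpu_hour_rate, output_tps):
    """$ per 1M output tokens at a sustained output throughput."""
    if output_tps <= 0:
        raise ValueError("output_tps must be positive")
    return (gpu_hour_rate / 3600) / output_tps * 1_000_000

def required_tps(gpu_hour_rate, target_cost_per_million):
    """Sustained output tokens/sec needed to hit a cost target."""
    if target_cost_per_million <= 0:
        raise ValueError("target cost must be positive")
    return (gpu_hour_rate / 3600) / target_cost_per_million * 1_000_000

# Example: what does a $2.50/hour GPU need to sustain to reach $0.50 per 1M tokens?
needed = required_tps(2.50, 0.50)
print(f"required output TPS: {needed:.0f}")
```

Framing the target as a required TPS is useful in practice because TPS is what your benchmark reports directly.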

Worked Example

Scenario: Llama-3 70B, single H100 SXM 80GB, FP8 precision, cloud at $2.50/GPU-hour [VERIFY: spot pricing varies by provider and availability]. NIM container, continuous batching enabled, FP8 KV cache. Target: interactive chatbot with TTFT SLO of 1.5s, ITL SLO of 40ms.

Step 1: Establish peak throughput at SLO. At the TTFT/ITL SLOs above, benchmark shows sustained output TPS of approximately 1,800 tokens/sec at batch size 32 [VERIFY: run your own NIM benchmarking; this is an illustrative number consistent with published H100 FP8 data]. At batch size 64 with relaxed TTFT (3s), TPS climbs to roughly 2,800 tokens/sec.

Step 2: Compute cost per million tokens.

  • Interactive (B=32, 1,800 TPS): ($2.50/3600) / 1,800 x 1,000,000 = ~$0.39/1M output tokens
  • Batch (B=64, 2,800 TPS): ($2.50/3600) / 2,800 x 1,000,000 = ~$0.25/1M output tokens
  • Naive batch-size-1 (~85 TPS, no continuous batching): ($2.50/3600) / 85 x 1,000,000 = ~$8.17/1M output tokens

Conclusion: Continuous batching + FP8 + right batch size cuts cost from ~$8/1M to under $0.40/1M on the same hardware. The configuration does 20x more work than the default out-of-box deployment. All numbers assume 100% GPU utilization across the hour — real deployments average 40-70% utilization due to traffic variation, so multiply by 1.4-2.5 for real-world average cost. Add input token cost by substituting prefill TPS.
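The three scenarios and the utilization adjustment can be checked with a few lines. This sketch folds average utilization into the formula as a divisor on throughput; the function name and the scenario labels are mine, and the TPS inputs are the illustrative figures from the steps above.

```python
def cost_per_million(gpu_hour_rate, output_tps, utilization=1.0):
    """$ per 1M output tokens; utilization is the average fraction of each
    hour the GPU actually spends serving at output_tps."""
    if output_tps <= 0 or not 0 < utilization <= 1:
        raise ValueError("output_tps must be > 0 and utilization in (0, 1]")
    return (gpu_hour_rate / 3600) / (output_tps * utilization) * 1_000_000

RATE = 2.50  # $/GPU-hour, as in the scenario
scenarios = {"interactive, B=32": 1800, "batch, B=64": 2800, "naive, B=1": 85}

for name, tps in scenarios.items():
    at_full = cost_per_million(RATE, tps)
    at_40_pct = cost_per_million(RATE, tps, utilization=0.4)
    print(f"{name:18s} ${at_full:5.2f} at 100% utilization, ${at_40_pct:5.2f} at 40%")
```

At 40% average utilization the interactive configuration lands near $0.96 per million rather than $0.39, which is why the utilization multiplier belongs in any budget number.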

Cost Per Million Tokens by GPU and Precision

The following table uses illustrative throughput numbers aligned with published benchmark ranges. All GPU-hour rates are cloud spot/on-demand estimates and vary by provider, region, and commitment level. All numbers should be validated against your specific workload before quoting in a budget. Throughput figures assume continuous batching, optimized KV cache, and batch sizes tuned to the ITL SLO noted.

| GPU | HBM | Precision | Output TPS (70B, B=32) | Est. $/GPU-hr [VERIFY] | Est. $/1M out tokens |
| --- | --- | --- | --- | --- | --- |
| H100 SXM 80GB | 80GB HBM3 | FP16 | ~900 [VERIFY] | $2.50 | ~$0.77 |
| H100 SXM 80GB | 80GB HBM3 | FP8 | ~1,800 [VERIFY] | $2.50 | ~$0.39 |
| H200 SXM 141GB | 141GB HBM3e | FP8 | ~3,200 [VERIFY] | $3.50 | ~$0.30 |
| B200 SXM 192GB | 192GB HBM3e | FP4 | ~8,000 [VERIFY] | $6.00-$8.00 | ~$0.21-$0.28 |
| H100 SXM 80GB (naive, no batching) | 80GB HBM3 | FP16, B=1 | ~85 | $2.50 | ~$8.17 |

All throughput and pricing figures are worked-example estimates. Verify every item marked [VERIFY] before using it in procurement. The FP16 row assumes 2x H100 for the 70B model: the cost shown is per GPU, and the model requires two GPUs.

The Latency-Throughput Frontier: How to Read a Benchmark

Every serious inference benchmark publishes a latency-throughput frontier: a curve showing achievable ITL (or P99 TTFT) as a function of total output TPS. The curve bends sharply as you approach the GPU throughput ceiling. The region near the knee of the curve is where you want to operate: throughput is high, latency is still within SLO. Push past the knee and latency explodes without proportional throughput gain.

When you see a vendor quoting "20,000 tokens/sec on H100" without a latency qualifier, that number is almost certainly measured at or past the knee — TTFT and ITL are in the seconds range at that throughput. Useless for interactive workloads. Always ask: at what P95 TTFT and ITL does that TPS number hold?

[Figure 4 diagram: the latency-throughput frontier. X axis: output throughput (tokens/sec total), from low to near ceiling. Y axis: ITL (ms per token), from 10ms to 200ms+, with the SLO marked. ITL rises slowly, then bends sharply at the knee near the throughput ceiling. Operate near the knee, where TPS is high and latency is within SLO; past it, latency blows out.]
Figure 4: The latency-throughput frontier for LLM inference. ITL is flat until the GPU nears its throughput ceiling, then rises steeply. Target the operating region just left of the knee: throughput is near-maximum, ITL is within SLO.
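Reading the frontier programmatically comes down to one rule: among the measured operating points that meet both SLOs, take the one with the highest throughput. The sketch below applies that rule to a concurrency sweep. The field names and every number in the sweep are invented for illustration; substitute your own benchmark output.

```python
def pick_operating_point(sweep, itl_slo_ms, ttft_slo_ms):
    """Return the highest-throughput point that meets both SLOs, or None."""
    within_slo = [
        point for point in sweep
        if point["p95_itl_ms"] <= itl_slo_ms and point["p95_ttft_ms"] <= ttft_slo_ms
    ]
    return max(within_slo, key=lambda point: point["tps"]) if within_slo else None

# Made-up sweep results: throughput flattens while latency explodes past the knee.
sweep = [
    {"concurrency": 1,   "tps": 80,   "p95_itl_ms": 12,  "p95_ttft_ms": 150},
    {"concurrency": 8,   "tps": 600,  "p95_itl_ms": 14,  "p95_ttft_ms": 250},
    {"concurrency": 32,  "tps": 1800, "p95_itl_ms": 35,  "p95_ttft_ms": 900},
    {"concurrency": 64,  "tps": 2800, "p95_itl_ms": 48,  "p95_ttft_ms": 2600},
    {"concurrency": 128, "tps": 3000, "p95_itl_ms": 160, "p95_ttft_ms": 9000},
]

interactive = pick_operating_point(sweep, itl_slo_ms=40, ttft_slo_ms=1500)
relaxed = pick_operating_point(sweep, itl_slo_ms=50, ttft_slo_ms=3000)
print("interactive SLOs -> concurrency", interactive["concurrency"])
print("relaxed SLOs     -> concurrency", relaxed["concurrency"])
```

Note how the last step in this invented sweep doubles concurrency for about 7% more TPS while ITL more than triples. That pattern is what being past the knee looks like in a results table.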

What to Do in Production: The Priority Stack

In practice: The right order to attack inference economics. Do these in sequence — each step unlocks the next one.
  1. Switch to a continuous-batching runtime (NIM, vLLM, TensorRT-LLM). This single change typically delivers 5-10x throughput improvement over naive serving. Cost per token drops by the same factor.
  2. Enable FP8 quantization (TensorRT-LLM or NIM default on H100/H200). Doubles throughput on Hopper vs FP16, halves weight memory, frees HBM for larger batches.
  3. Enable FP8 or INT8 KV cache quantization. Another 20-30% throughput on top of weight quantization. Usually on by default in NIM; confirm it in the container's startup logs or configuration rather than assuming.
  4. Profile your actual batch size at your SLOs. Run genai-perf (NVIDIA’s benchmarking tool from the NIM ecosystem) against your deployment, sweeping concurrency from 1 to 128. Find the knee of the latency-throughput frontier. Set max-batch-size to the concurrency at the knee.
  5. Enable prefix caching for repeated system prompts. If your application uses a static or slowly-changing system prompt, the KV cache for that prefix can be reused across requests. TTFT drops by 50-90% for those requests.
  6. Consider disaggregated prefill/decode (NVIDIA Dynamo, covered in Part 20) when TTFT and throughput SLOs cannot be met simultaneously on shared GPUs. Dedicating separate GPU pools to prefill vs decode lets each be tuned independently.
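For step 5, the saving from prefix caching depends on how much of each prompt is a shared prefix. The sketch below estimates the reusable portion by comparing token IDs and rounding down to whole cache blocks. The block size of 16 and the token IDs are assumptions for illustration, and real runtimes manage this internally.

```python
def reusable_prefix_tokens(cached_tokens, prompt_tokens, block_size=16):
    """Count leading tokens shared with a cached prompt, rounded down to
    whole cache blocks (partial blocks are assumed not to be reusable)."""
    shared = 0
    for cached, new in zip(cached_tokens, prompt_tokens):
        if cached != new:
            break
        shared += 1
    return (shared // block_size) * block_size

# Hypothetical token IDs: a 1,000-token system prompt plus a 200-token user turn.
system_prompt = list(range(1000))
previous_request = system_prompt + [5000 + i for i in range(200)]
new_request = system_prompt + [9000 + i for i in range(200)]

reusable = reusable_prefix_tokens(previous_request, new_request)
skipped = reusable / len(new_request)
print(f"{reusable} of {len(new_request)} prompt tokens reusable ({skipped:.0%} of prefill skipped)")
```

The longer the static system prompt relative to the user turn, the closer the skipped fraction gets to the top of the 50-90% range quoted above.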

When Not to Chase Lower Cost Per Token

Large batches and high utilization are right for predictable, sustained traffic. They are wrong for:

  • Highly variable traffic with spiky peaks: A deployment tuned for 80% average utilization will blow its TTFT SLO during the 5-minute peak where utilization hits 100%. Size for peak, not average, unless you have autoscaling that can cold-start in under 30 seconds (you probably do not, since NIM images are large).
  • Safety-critical or compliance-gated inference: If each response requires audit logging, PII scanning, or output guardrails running serially, the latency budget for those layers may make batching counterproductive. Each added step consumes TTFT budget.
  • Very long context (>32K tokens): KV cache per-request is enormous. Batch size collapses to near 1 regardless of how much HBM you have. At that point, throughput optimization shifts to speculative decoding and disaggregated serving rather than batching.
My Take: The single most common mistake I see in on-prem NVIDIA AI Enterprise deployments is over-purchasing GPU hardware as a substitute for proper serving configuration. A team buys 8x H100s because "we need the capacity", then runs them at 5% GPU utilization with a batch size of 1 and wonders why cost-per-inference is 20x higher than a cloud API. The hardware is not the problem. Two H100s with proper NIM configuration, FP8 KV cache quantization, and continuous batching will outperform eight H100s running naive single-request serving in both throughput and cost. Do the configuration work first. Size the hardware second.

What to Validate Before You Go to Production

Before declaring your inference stack production-ready, run through this checklist:

  1. Benchmark at your actual input/output distribution, not the benchmark distribution. A deployment tuned for 128-token inputs behaves completely differently with 4K-token inputs. Prefill time scales with input length; your TTFT SLO may be violated at long-context requests even when batch metrics look fine.
  2. Validate quality at your quantization setting. Run your model evaluation suite (MMLU, task-specific evals) against FP8 before shipping. For most Llama-class models this is a formality, but for fine-tuned models or models with unusual activation distributions, FP8 can be more destructive.
  3. Load test with realistic concurrency. Use genai-perf or locust with request arrival patterns matching your production traffic (Poisson arrivals, not synchronized bursts). P99 latency under realistic load is what matters for SLO design, not median latency under synthetic ramp tests.
  4. Monitor MBU, not just GPU utilization %. Wire DCGM metrics into your observability stack (Prometheus + Grafana). Alert on MBU below 40% (under-utilized) and on P99 TTFT exceeding 2x your SLO target (early warning before SLO breach).
  5. Account for model loading time in your autoscaling design. NIM containers cold-start in 3-10 minutes depending on model size and whether the image is cached. Your horizontal pod autoscaler (HPA) scale-out lead time must be shorter than your traffic ramp-up time, or you need pre-warmed standby replicas. This is the operability gap most teams discover only after their first traffic spike.
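Two pieces of the checklist are simple to get wrong in a home-grown load test: generating Poisson arrivals instead of synchronized bursts, and computing P99 instead of the median. This is a minimal standard-library sketch of both; the function names and the 20 requests/sec rate are illustrative, and sending the actual requests is left to your load tool.

```python
import math
import random

def poisson_arrival_times(rate_per_s, duration_s, seed=0):
    """Arrival timestamps with exponential inter-arrival gaps (a Poisson process)."""
    rng = random.Random(seed)
    now, arrivals = 0.0, []
    while True:
        now += rng.expovariate(rate_per_s)
        if now >= duration_s:
            return arrivals
        arrivals.append(now)

def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer such as 99."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

arrivals = poisson_arrival_times(rate_per_s=20, duration_s=60)
print(f"{len(arrivals)} requests scheduled over 60 s")

# Usage: feed measured per-request TTFT values (ms) into percentile(..., 99)
# and compare the result against your SLO and the 2x early-warning threshold.
```

A fixed seed keeps the arrival pattern reproducible between runs, which makes before-and-after comparisons of a configuration change meaningful.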

The Verdict

Inference economics come down to a single principle: the GPU is only as useful as the fraction of its memory bandwidth you are consuming with real work. Everything else follows from that. Continuous batching keeps the slots filled. FP8 quantization doubles the useful work per byte of HBM bandwidth. KV cache quantization extends how large a batch you can hold. Chunked prefill protects TTFT SLOs at high load.

On a well-configured H100 FP8 deployment with NIM and continuous batching, a Llama-3 70B model runs at under $0.50/1M output tokens at interactive latencies. The same model on the same GPU with naive configuration runs at roughly $8/1M tokens or more. That gap is not hardware. It is configuration and understanding.

I recommend H200 over H100 for new on-prem deployments specifically for its KV cache headroom advantage: the 141GB HBM3e enables 2-3x larger effective batch sizes on 70B models compared to H100 FP8, which directly translates to lower cost per token at similar hardware cost (once the H200 supply stabilizes). For teams already on H100 clusters, the configuration improvements above will yield far more cost reduction than a hardware refresh.

What I would not do: buy more GPUs before profiling the ones you have. Run genai-perf against your current deployment today. If your peak MBU is below 60%, you have a configuration problem, not a capacity problem. Fix it first.

If you are working through on-prem GPU cluster sizing and want to wire the economics from this post into a full cost model, the NVIDIA AI Guide covers the full stack. And for teams interested in training economics on the same hardware, Part 22 on the NeMo framework picks up where this part leaves off.


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
