Your inference cluster is hitting a wall. Prefill on long prompts is consuming GPU compute while decode workers sit idle waiting for the first token to finish. Meanwhile, your decode pool is memory-bandwidth-starved and you are buying more H200s to add KV cache capacity. These are not independent problems. They share one root cause: prefill and decode have different optimal hardware configurations, and both phases are packed onto the same GPU. NVIDIA Dynamo exists to fix that.
Why Prefill and Decode Are Fundamentally Different Workloads
Every LLM inference request runs two phases. Prefill processes the entire input prompt in one forward pass — it is compute-bound because the GPU is doing a large matrix-matrix multiply across the full context. Decode generates tokens one at a time — each step is a matrix-vector multiply, and the bottleneck is how fast you can stream model weights and KV cache out of HBM. Same model, same GPU, radically different performance profiles.
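To make the compute-bound versus bandwidth-bound split concrete, here is a rough sketch of arithmetic intensity (FLOPs per byte of weight traffic) for a single linear layer. It counts only weight reads and ignores activations and KV cache traffic, and the layer shape and 2-byte weights are illustrative assumptions, not figures for any particular model.

```python
def arithmetic_intensity(tokens_per_pass: int, d_in: int, d_out: int,
                         bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weight traffic for one (d_in x d_out) linear layer."""
    flops = 2 * tokens_per_pass * d_in * d_out     # one multiply-add per weight per token
    weight_bytes = d_in * d_out * bytes_per_param  # weights are streamed once per pass
    return flops / weight_bytes


# Prefill pushes the whole prompt through at once; decode pushes a single token.
prefill = arithmetic_intensity(tokens_per_pass=4096, d_in=8192, d_out=8192)
decode = arithmetic_intensity(tokens_per_pass=1, d_in=8192, d_out=8192)
print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.0f} FLOPs/byte")
```

Under these assumptions the same layer does thousands of times more work per byte fetched during prefill than during an unbatched decode step, which is why one phase saturates compute while the other saturates HBM bandwidth.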
On a single GPU or a homogeneous pool, you compromise on both. You size batch sizes and tensor parallelism for one workload and hurt the other. At small scale this is acceptable — the cost of the compromise is low. At 32+ GPUs serving thousands of concurrent users on a 671B MoE model, it becomes your largest source of wasted capacity.
The Dynamo Architecture: Six Components That Matter
Dynamo 1.0, released in March 2026, is a production-grade distributed inference framework. It is not a replacement for vLLM, SGLang, or TensorRT-LLM; it sits above them, orchestrating and routing across pools of inference workers. The six components you need to understand are:
| Component | Role | Where it runs |
|---|---|---|
| Dynamo Frontend | Accepts OpenAI-compatible requests, injects agent hints (latency target, expected output length, cache control), passes to router | CPU node or lightweight pod |
| KV-Aware Router | Routes each request to the prefill worker with the highest KV prefix overlap, balancing cache hit rate against queue depth; uses a global radix tree | CPU, co-located with frontend |
| Prefill Workers | Run the compute-bound prefill pass; produce KV cache blocks and transfer them to decode workers via NIXL | GPU pool (smaller TP, higher batch) |
| Decode Workers | Run autoregressive decode; receive KV blocks from prefill; sized for KV cache capacity and memory bandwidth | GPU pool (larger TP, larger KV budget) |
| NIXL | NVIDIA Inference Xfer Library — point-to-point, async, non-blocking KV cache transfer from prefill VRAM to decode VRAM; works over NVLink, InfiniBand, RoCE, or Ethernet | Fabric layer |
| Planner / DGDR | Takes SLO targets (TTFT, ITL, throughput), profiles the cluster, and outputs an optimized Dynamo Graph Deployment (DGD) — prefill/decode pool sizes, TP degree, router weights — without manual tuning | Kubernetes controller |
The KV-Aware Router in Detail
The router is the piece most people underestimate. It is not round-robin and it is not just load-based. It hashes the token blocks of each incoming request and looks them up in a global radix tree that tracks which KV blocks exist on which worker. When a request shares a long prefix with a cached sequence (a system prompt, a shared document, a multi-turn conversation), the router sends it to the worker that already holds those blocks, avoiding the full prefill recompute cost.
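The lookup described above can be sketched with chained block hashes, which behave like a flattened radix tree: each hash identifies the entire prefix up to that block. This is an illustration only, not Dynamo's implementation. `PrefixIndex`, `block_hashes`, the worker names, and the 4-token block size are all hypothetical, and eviction is ignored.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 4  # tokens per KV block; illustrative, real block sizes are larger


def block_hashes(tokens: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Chain-hash each full block so a hash identifies the whole prefix before it."""
    hashes = []
    prev = ""
    full = len(tokens) - len(tokens) % block_size  # drop the trailing partial block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = prev + ":" + ",".join(str(t) for t in block)
        prev = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prev)
    return hashes


class PrefixIndex:
    """Maps prefix hashes to the workers that hold the matching KV blocks."""

    def __init__(self) -> None:
        self._owners = defaultdict(set)

    def record(self, worker: str, tokens: list[int]) -> None:
        for h in block_hashes(tokens):
            self._owners[h].add(worker)

    def overlap(self, tokens: list[int]) -> dict[str, int]:
        """Number of leading blocks of this request that each worker already holds."""
        matched = {}
        for depth, h in enumerate(block_hashes(tokens), start=1):
            holders = self._owners.get(h)
            if not holders:
                break  # nobody holds this prefix, so nobody holds a longer one
            for worker in holders:
                matched[worker] = depth
        return matched


index = PrefixIndex()
system_prompt = list(range(8))                                   # two full blocks
index.record("worker-a", system_prompt + [100, 101, 102, 103])  # three blocks cached
index.record("worker-b", system_prompt)                         # two blocks cached
result = index.overlap(system_prompt + [100, 101, 102, 103, 7, 7])
print(result)
```

Here `worker-a` matches three leading blocks and `worker-b` matches two, so a router that cares about cache reuse would prefer `worker-a` for this request.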
The key tuning knob is --router-kv-overlap-score-weight. At 1.0 it prioritises prefix cache overlap entirely, which collapses TTFT at the cost of decode queue imbalance. At 0.0 it reverts to pure load balancing. Production deployments typically sit between 0.3 and 0.7 depending on whether the workload is more multi-turn (higher overlap weight) or single-shot (lower). Based on real DeepSeek R1 user query traces, KV-aware routing delivered a 3x improvement in TTFT and 2x reduction in average request latency versus round-robin at scale.
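A toy model of the overlap weight helps show why the middle of the range is where most deployments land. The convex blend below is an assumption chosen to match the behaviour described above (0.0 is pure load balancing, 1.0 is pure cache overlap). It is not Dynamo's actual cost function, and `WorkerState`, `pick_worker`, and `max_queue` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    name: str
    overlap_blocks: int   # leading request blocks already cached on this worker
    queued_requests: int  # current queue depth


def pick_worker(workers: list[WorkerState], request_blocks: int,
                overlap_weight: float, max_queue: int = 32) -> WorkerState:
    """Blend cache overlap and idleness; the highest score wins."""
    def score(w: WorkerState) -> float:
        cache_hit = w.overlap_blocks / max(request_blocks, 1)        # 0..1
        idle = 1.0 - min(w.queued_requests, max_queue) / max_queue   # 0..1
        return overlap_weight * cache_hit + (1.0 - overlap_weight) * idle
    return max(workers, key=score)


workers = [
    WorkerState("warm-busy", overlap_blocks=30, queued_requests=24),
    WorkerState("cold-idle", overlap_blocks=0, queued_requests=2),
]
for weight in (0.0, 0.3, 0.5, 1.0):
    chosen = pick_worker(workers, request_blocks=32, overlap_weight=weight)
    print(weight, chosen.name)
```

In this toy setup the low weights send the request to the idle worker and the higher weights send it to the worker with the warm cache, which is the trade-off the knob exposes.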
NIXL: The KV Transfer Path
NIXL (NVIDIA Inference Xfer Library) is the piece that makes disaggregation practical rather than theoretical. It moves KV cache blocks directly from the prefill worker’s VRAM to the decode worker’s VRAM, non-blocking, so both GPUs keep their forward passes running while the transfer happens in the background. It abstracts the transport: on a GB200 NVL72 rack the KV blocks move over NVLink; across nodes they move over InfiniBand or RoCE; on AWS EFA deployments they move over libfabric. The application code does not change.
The failure mode to know about: at small scale or with short context lengths, the NIXL transfer latency dominates. If your average prompt is 200 tokens, the KV blocks are tiny and the network round-trip to move them may cost more wall time than just re-running prefill on the decode worker. Disaggregation only wins when the KV payload is large enough for the transfer to be cheaper than recompute — that threshold is roughly context lengths of 2k+ tokens at production concurrency.
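The transfer-versus-recompute comparison can be written as a small single-request model. Every constant below is a hypothetical input (KV bytes per token, prefill rate, link bandwidth, fixed overhead), so substitute your own measurements. The model also ignores queueing and concurrency, so its crossover point will not match a production threshold.

```python
# Hypothetical inputs; replace with measurements from your own cluster.
KV_BYTES_PER_TOKEN = 160 * 1024   # assumed KV footprint per token
PREFILL_TOKENS_PER_S = 13_000     # assumed prefill rate for one request
LINK_GB_PER_S = 25                # e.g. a single 200Gb/s link
FIXED_OVERHEAD_MS = 15            # assumed per-transfer setup and round-trip cost


def transfer_ms(kv_bytes: float, link_gb_per_s: float, fixed_overhead_ms: float) -> float:
    """Fixed per-transfer overhead plus time on the wire."""
    return fixed_overhead_ms + kv_bytes / (link_gb_per_s * 1e9) * 1e3


def recompute_ms(prompt_tokens: int, prefill_tokens_per_s: float) -> float:
    """Time to rebuild the KV cache by re-running prefill locally."""
    return prompt_tokens / prefill_tokens_per_s * 1e3


def transfer_is_cheaper(prompt_tokens: int) -> bool:
    move = transfer_ms(prompt_tokens * KV_BYTES_PER_TOKEN, LINK_GB_PER_S, FIXED_OVERHEAD_MS)
    return move < recompute_ms(prompt_tokens, PREFILL_TOKENS_PER_S)


for tokens in (200, 4000):
    print(tokens, transfer_is_cheaper(tokens))
```

With these made-up numbers the 200-token prompt is cheaper to recompute and the 4,000-token prompt is cheaper to transfer. The fixed overhead term is what short prompts cannot amortize.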
The Planner and DGDR: Removing the Sizing Guesswork
The hardest part of disaggregated serving is sizing the prefill and decode pools relative to each other. Get it wrong and you create a new bottleneck: either the prefill pool is undersized and decode workers queue waiting for KV blocks, or the decode pool is undersized and completed prefills pile up waiting for a free decode slot.
Dynamo 1.0 ships the Dynamo Graph Deployment Request (DGDR), which unifies the planner and AIConfigurator. You specify: model, target hardware, TTFT target, inter-token latency target, and expected traffic load. DGDR runs simulation-based recommendations for quick iteration, then optionally runs on-cluster profiling for production-grade sizing. The output is a Dynamo Graph Deployment (DGD) manifest — a Kubernetes-native YAML describing the full deployment including prefill worker count, decode worker count, tensor parallelism degree, and router configuration — ready to apply directly.
This matters because the ratio of prefill to decode workers is model-dependent and traffic-pattern-dependent. For DeepSeek R1 on GB200 NVL72 with 1k input / 1k output, the optimal ratio sits around 1:3 prefill-to-decode. For a code-generation model with 8k input and 128 output tokens, you flip it toward more prefill capacity. The planner finds this for you rather than requiring weeks of manual benchmark iteration.
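A back-of-the-envelope version of the sizing logic: split GPUs in proportion to the GPU-seconds each phase needs per request. This is a steady-state sketch, not what the planner actually does. The per-GPU token rates are hypothetical and were picked so that the 1k/1k case lands on the 1:3 ratio quoted above.

```python
def split_pools(total_gpus: int, input_tokens: int, output_tokens: int,
                prefill_tok_per_gpu_s: float, decode_tok_per_gpu_s: float) -> tuple[int, int]:
    """Return (prefill_gpus, decode_gpus) proportional to per-request GPU time."""
    prefill_demand = input_tokens / prefill_tok_per_gpu_s   # GPU-seconds per request
    decode_demand = output_tokens / decode_tok_per_gpu_s
    share = prefill_demand / (prefill_demand + decode_demand)
    # Keep at least one GPU in each pool.
    prefill_gpus = min(max(round(total_gpus * share), 1), total_gpus - 1)
    return prefill_gpus, total_gpus - prefill_gpus


# Hypothetical per-GPU rates; measure these on your own hardware and model.
PREFILL_RATE = 12_000
DECODE_RATE = 4_000

chat = split_pools(72, input_tokens=1000, output_tokens=1000,
                   prefill_tok_per_gpu_s=PREFILL_RATE, decode_tok_per_gpu_s=DECODE_RATE)
code = split_pools(72, input_tokens=8000, output_tokens=128,
                   prefill_tok_per_gpu_s=PREFILL_RATE, decode_tok_per_gpu_s=DECODE_RATE)
print("1k in / 1k out:", chat)
print("8k in / 128 out:", code)
```

The same cluster flips from decode-heavy to prefill-heavy purely because the input/output mix changed, which is why a ratio tuned for one traffic pattern does not carry over to another.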
Trade-Off Table: Monolithic NIM vs Dynamo Disaggregated
| Dimension | Monolithic NIM | Dynamo Disaggregated |
|---|---|---|
| Operational complexity | Low — one Helm chart, one pool | High — two pools, router, NIXL config, planner |
| Prefill/decode contention | Present — long prefills delay decode slots | Eliminated — independent queues |
| KV transfer overhead | None | Present — latency penalty at small scale or short context |
| Throughput at scale | Saturates at GPU memory ceiling | Up to 7x on GB200 NVL72 for MoE models with wide expert parallelism |
| Scaling granularity | Entire replica scales together | Prefill and decode pools scale independently |
| Hardware flexibility | Homogeneous — all GPUs same spec | Heterogeneous — can use H100 for prefill, H200 for decode |
| Best-fit workload | Small/medium scale, short context, stable latency SLO | Large scale, long context, MoE models, cost-per-token target |
| Break-even scale | Always wins below ~16 GPUs | Wins above ~32-50 GPUs at long context or high concurrency |
A Real Deployment Config
Below is a representative Dynamo disaggregated serving configuration for a vLLM backend, showing prefill and decode worker parameters with NIXL KV transfer enabled. Field names are modeled on the Dynamo 1.x documentation at docs.nvidia.com/dynamo; check them against your installed version before applying.
# dynamo-disagg-vllm.yaml
# Dynamo disaggregated inference config: prefill + decode pools with NIXL
# Tested pattern: vLLM backend on Dynamo 1.x
# [VERIFY exact field names against your Dynamo version before applying]
VllmWorker:
  # Prefill workers: compute-bound, smaller tensor parallelism
  prefill:
    tensor-parallel-size: 4
    max-num-seqs: 64
    gpu-memory-utilization: 0.90
    kv-transfer-config: '{
      "kv_connector":"NixlConnector",
      "kv_role":"kv_producer",
      "kv_connector_extra_config":{
        "backends":["NIXL"]
      }
      }'
    num_nodes: 1
    num_gpus: 4
  # Decode workers: memory-BW-bound, larger TP, more KV budget
  decode:
    tensor-parallel-size: 8
    max-num-seqs: 256
    gpu-memory-utilization: 0.92
    kv-transfer-config: '{
      "kv_connector":"NixlConnector",
      "kv_role":"kv_consumer",
      "kv_connector_extra_config":{
        "backends":["NIXL"]
      }
      }'
    num_nodes: 1
    num_gpus: 8
Router:
  kv-overlap-score-weight: 0.5  # balance between cache-hit and load-balance
Frontend:
  port: 8000
  model: meta-llama/Llama-3.1-70B-Instruct
# Expected behavior with 2k+ token prompts at high concurrency:
# - TTFT improves because prefill pool is not competing for decode slots
# - Total throughput increases as both pools run at their optimal batch size
# - NIXL transfer adds ~5-15ms per request (verify on your fabric)
#
# Failure mode at small scale (under 16 GPUs total, short prompts < 512 tokens):
# - KV transfer overhead dominates: total latency INCREASES vs monolithic NIM
# - Prefill pool sits near-idle while transfer completes for tiny KV payloads
# - Fix: revert to monolithic NIM or increase minimum context length threshold
Verify the field names kv_role and kv_connector, and the YAML nesting, against your specific Dynamo and vLLM versions before applying this in production.
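Because the kv-transfer-config values are JSON strings embedded in YAML, a swapped role or a stray quote is easy to miss. A small pre-flight check along these lines can catch that before anything is applied. It is a sketch that assumes only the field names used in the config above (`kv_connector`, `kv_role`); `check_kv_transfer` is a hypothetical helper, not part of Dynamo or vLLM.

```python
import json

# Roles as they appear in the config above; verify against your versions.
EXPECTED_ROLES = {"prefill": "kv_producer", "decode": "kv_consumer"}


def check_kv_transfer(configs: dict[str, str]) -> list[str]:
    """Return a list of problems found in the per-pool kv-transfer-config strings."""
    problems = []
    connectors = set()
    for pool, expected_role in EXPECTED_ROLES.items():
        raw = configs.get(pool)
        if raw is None:
            problems.append(f"{pool}: missing kv-transfer-config")
            continue
        try:
            cfg = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append(f"{pool}: invalid JSON ({exc.msg})")
            continue
        if cfg.get("kv_role") != expected_role:
            problems.append(f"{pool}: kv_role is {cfg.get('kv_role')!r}, expected {expected_role!r}")
        connectors.add(cfg.get("kv_connector"))
    if len(connectors) > 1:
        problems.append("pools use different kv_connector values")
    return problems


good = {
    "prefill": '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}',
    "decode": '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}',
}
swapped = dict(good, decode=good["prefill"])  # decode pool mistakenly set as producer
print(check_kv_transfer(good))
print(check_kv_transfer(swapped))
```

The strings would normally come from your parsed YAML; they are inlined here so the example has no dependencies beyond the standard library.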
Worked Example: DeepSeek R1 on GB200 NVL72
Scenario: Serving DeepSeek R1 (671B, FP4) on a GB200 NVL72 rack (72 Blackwell GPUs, roughly 13.5 TB of HBM3e in total), targeting 1k input / 1k output token lengths, a 50 tok/sec/user interactivity target, and 500 concurrent users.
Monolithic baseline: With a single NIM pool at TP=8 across 72 GPUs (9 replicas), throughput tops out at roughly 4,000 requests/hour before prefill latency causes TTFT violations. Each prefill on a 1k-token prompt holds a replica’s GPU batch for ~300ms. At 500 concurrent users, you queue on prefill constantly.
Dynamo disaggregated: Planner recommends 18 prefill GPUs (TP=2, high batch) + 54 decode GPUs (TP=6, larger KV budget). Prefill pool processes prompts in under 80ms at TP=2 with wide expert parallelism active. KV blocks transfer via NIXL over NVLink in approximately 8-12ms for 1k-token context. Decode pool handles autoregressive generation unblocked. Published SemiAnalysis InferenceX result: 7x throughput improvement on this exact hardware class with this model. Real-world deployments at Baseten, CoreWeave, and ByteDance have validated similar gains in production.
The catch: If you run this same config with an average prompt of 200 tokens, the 8-12ms NIXL transfer costs the same but the prefill would have completed in under 20ms monolithically. You are now slower. Disaggregation is a long-context, high-concurrency optimization — not a universal one.
[Figure: The Smart Router Diagram: Pool Split Decision]
Production Gotchas
Gotcha 1: KV Transfer Overhead at Short Context
Already mentioned but worth repeating explicitly: on prompts under about 1,000 tokens at low concurrency, Dynamo will produce worse TTFT than a single NIM. The NIXL transfer carries a latency floor of roughly 5-20ms even when the payload is tiny. For a 200-token prompt that prefills in 15ms natively, adding a 10ms transfer makes no sense. Use Dynamo only when your p90 prompt length and concurrency make the transfer cost negligible relative to what you save in prefill/decode contention.
Gotcha 2: Fabric Choice Matters More Than You Expect
NIXL transfers over NVLink on a GB200 NVL72 rack are qualitatively different from NIXL transfers over 200Gb/s InfiniBand across nodes. On-rack NVLink bandwidth is roughly 1.8 TB/s per GPU. Cross-node InfiniBand at 200Gb/s is 25 GB/s per link. KV cache size depends heavily on the attention layout: a grouped-query model such as Llama 3.1 70B with an FP8 cache needs about 0.7 GB for a 4k-token context, while models with more KV heads or a 16-bit cache can run to several GB. Take 3 GB as a working figure. Moving that over NVLink takes under 2ms at peak bandwidth; over a single 200Gb/s InfiniBand link it takes about 120ms. If your disaggregated prefill and decode pools are on separate nodes connected by standard IB, measure your actual transfer latency before assuming disaggregation helps.
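The payload arithmetic is easy to redo for your own model. The sketch below uses the standard KV cache size formula (keys plus values, per layer, per KV head) and a bandwidth-only transfer time that ignores setup latency and protocol overhead. The example shape is Llama 3.1 70B (80 layers, 8 KV heads, head dimension 128) with an FP8 cache; swap in your own model's values.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int) -> int:
    """KV cache size for one sequence: keys and values for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def transfer_time_ms(num_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Bandwidth-only lower bound; real transfers add latency and overhead."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


# Llama 3.1 70B shape with a 1-byte (FP8) KV cache and a 4k-token context.
kv = kv_cache_bytes(tokens=4096, layers=80, kv_heads=8, head_dim=128, bytes_per_value=1)
print(f"KV cache: {kv / 1e9:.2f} GB")
print(f"200Gb/s InfiniBand (25 GB/s): {transfer_time_ms(kv, 25):.1f} ms")
print(f"NVLink at 1.8 TB/s: {transfer_time_ms(kv, 1800):.2f} ms")
```

For this shape the cache is about 0.67 GB, which is roughly 27ms on a single 200Gb/s link and well under a millisecond at NVLink peak bandwidth. A 3 GB payload on the same InfiniBand link comes out at about 120ms.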
Gotcha 3: The Planner Needs Real Traffic Profiles
DGDR is only as good as the traffic profile you give it. If you profile with synthetic uniform requests and your production traffic is bimodal (short chatbot turns + long document analysis), the pool sizing will be wrong for at least one load type. Feed the planner actual production request traces. The Dynamo docs recommend sampling at least 100,000 requests from your real workload.
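Before handing traces to the planner, it is worth checking whether the prompt-length distribution is bimodal, because the mean hides it. The sketch below summarizes a list of prompt lengths with percentiles and the share of long requests. The 2,000-token cut-off and the synthetic sample are assumptions for illustration, and `summarize_prompt_lengths` is a hypothetical helper.

```python
import statistics


def summarize_prompt_lengths(lengths: list[int], long_threshold: int = 2000) -> dict[str, float]:
    """Percentiles plus the fraction of requests at or above long_threshold tokens."""
    cuts = statistics.quantiles(lengths, n=100, method="inclusive")  # 99 cut points
    return {
        "mean": statistics.fmean(lengths),
        "p50": cuts[49],
        "p90": cuts[89],
        "p99": cuts[98],
        "long_share": sum(1 for n in lengths if n >= long_threshold) / len(lengths),
    }


# Synthetic bimodal sample: 80% short chat turns, 20% long document prompts.
sample = [200] * 80 + [8000] * 20
summary = summarize_prompt_lengths(sample)
print(summary)
```

In this sample the mean is 1,760 tokens, a length that no request actually has, while p50 and p90 sit on the two separate modes. A large gap between p50 and p90 is the signal to size for both modes rather than for the average.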
When Dynamo Is NOT the Right Choice
Disaggregated inference is a scale optimization. It adds real complexity: two GPU pools to monitor, a router to configure, a NIXL fabric to validate, and a planner to run before any sizing change. Every one of those components can fail independently. That overhead is worth it when you are running a 70B+ model at 50+ GPU scale with long-context workloads and a hard cost-per-token target. It is not worth it in these cases:
- You are running a 7B or 13B model — the prefill is fast enough that contention is minimal
- Your GPU count is under 16 and you have no near-term plan to scale beyond it
- Average prompt length is under 1k tokens and output length is under 512 tokens
- You do not have a high-bandwidth fabric (NVLink rack or 400Gb/s IB) between prefill and decode nodes
- Your team does not yet have Kubernetes-native inference operations experience — the monolithic NIM failure modes are simpler to debug
In those cases, a NIM on Triton (covered in Part 19) or a direct TensorRT-LLM deployment gets you 80% of the performance at 30% of the operational burden. Save Dynamo for when you genuinely need it.
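The criteria in the list above can be turned into a quick screening function for architecture reviews. This simply encodes the thresholds stated in this section; the `Deployment` fields and the function name are hypothetical, and the output is a prompt for discussion rather than a verdict.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    model_params_b: float           # model size in billions of parameters
    gpu_count: int
    avg_prompt_tokens: int
    avg_output_tokens: int
    high_bandwidth_fabric: bool     # NVLink rack or 400Gb/s IB between pools
    k8s_inference_experience: bool


def reasons_to_stay_monolithic(d: Deployment) -> list[str]:
    """Collect every criterion from the list that argues against disaggregation."""
    reasons = []
    if d.model_params_b <= 13:
        reasons.append("small model: prefill contention is minimal")
    if d.gpu_count < 16:
        reasons.append("fewer than 16 GPUs")
    if d.avg_prompt_tokens < 1000 and d.avg_output_tokens < 512:
        reasons.append("short prompts and short outputs")
    if not d.high_bandwidth_fabric:
        reasons.append("no high-bandwidth fabric between pools")
    if not d.k8s_inference_experience:
        reasons.append("team lacks Kubernetes-native inference operations experience")
    return reasons


small = Deployment(7, 8, 500, 200, False, False)
large = Deployment(671, 72, 4000, 1000, True, True)
print(reasons_to_stay_monolithic(small))
print(reasons_to_stay_monolithic(large))
```

An empty list does not prove disaggregation will win; it only means none of the listed red flags apply and a measured comparison is worth running.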
What to Validate Before You Commit
If you are evaluating Dynamo for a production deployment, work through this checklist before signing off on the architecture:
- Measure your actual NIXL transfer latency on your specific fabric before benchmarking throughput. Run the NIXL bandwidth test on your NVLink / IB / RoCE fabric and get the p99 latency for your expected KV payload size.
- Profile with real traffic, not synthetic requests. Bimodal or long-tail prompt distributions change the optimal prefill-to-decode pool ratio significantly.
- Run DGDR planner output as a recommendation, not a prescription. Validate the suggested ratio with a 30-minute load test at production concurrency before scaling to full deployment.
- Set up KV Block Manager tiering carefully. KVBM now supports GPU VRAM, CPU pinned memory, local SSD, and S3-compatible object storage. For long-context workloads, the tiering policy (eviction thresholds between tiers) is as important as the pool sizes.
- Check Grove topology constraints if you are on GB300 NVL72. Grove’s unified topology API lets you pin prefill and decode to the same NVL72 rack for intra-rack NVLink transfers — do this explicitly rather than letting the scheduler place them on different racks.
The Verdict
NVIDIA Dynamo is the right inference framework when you are running large MoE models like DeepSeek R1 at 50+ GPU scale, with long-context workloads and a genuine cost-per-token pressure. The 7x throughput claim on GB200 NVL72 is real, verified by SemiAnalysis and corroborated by production deployments at multiple cloud providers. The disaggregation model is architecturally sound: the KV-aware router, NIXL transfer, and DGDR planner together address the real bottlenecks in monolithic serving at scale.
What I would not do: deploy Dynamo as a default choice for every LLM workload. The operational complexity is non-trivial and the KV transfer overhead is real. Teams that are not yet hitting the limits of monolithic NIM serving should stay there until they are. When you do cross the threshold — 50+ GPUs, long context, high concurrency, MoE model — Dynamo is not just worth the complexity. It is the only architectural choice that scales.
Inference economics — what disaggregation actually does to your cost per token and how to size for a budget — is covered in Part 21. If you want to see how Dynamo sits in the broader NVIDIA AI stack, the NVIDIA AI Guide maps all 30 parts together.
If you are sizing a Dynamo deployment right now, drop a comment with your model, GPU count, and context length distribution — I am happy to talk through whether disaggregation is the right call for your specific numbers.
References
- How NVIDIA Dynamo 1.0 Powers Multi-Node Inference at Production Scale — NVIDIA Technical Blog (March 2026)
- Router Guide: KV Cache Aware Routing — NVIDIA Dynamo Documentation
- Dynamo Disaggregation: Separating Prefill and Decode — NVIDIA Dynamo Documentation
- Introducing NVIDIA Dynamo: A Low-Latency Distributed Inference Framework — NVIDIA Technical Blog
- NVIDIA Dynamo GitHub Repository — ai-dynamo/dynamo



