Dr. Pranay Jha

NVIDIA Dynamo Disaggregated Inference: Prefill, Decode, and KV-Aware Routing at Scale (NVIDIA AI Series, Part 20)

NVIDIA Dynamo separates prefill and decode onto independent GPU pools, routing requests via a KV-aware smart router and transferring KV cache blocks via NIXL. Here is when disaggregation wins, when it does not, and what to validate before committing to the architecture.

NVIDIA AI Series · Part 20 of 30
TL;DR: NVIDIA Dynamo separates the compute-bound prefill phase and the memory-bandwidth-bound decode phase onto independent GPU pools. A KV-aware smart router directs requests to workers with the highest cache overlap, and NIXL transfers the KV cache from prefill VRAM to decode VRAM over NVLink or InfiniBand without blocking the forward pass. The planner (now unified in DGDR) profiles and sizes these pools automatically. The honest trade-off: disaggregation adds KV-transfer overhead and only beats a monolithic NIM when you are running long-context or large MoE models at genuine scale — roughly 50+ GPUs and high concurrency. Below that threshold, a single NIM is simpler and cheaper.
Who this is for: Platform engineers and AI infrastructure architects sizing GPU clusters for production LLM serving — especially those already running NVIDIA NIM or Triton and asking whether Dynamo is the next logical step. You should be comfortable with Kubernetes, tensor parallelism, and basic LLM serving concepts. If you are just deploying your first NIM, read Part 16 first.

Your inference cluster is hitting a wall. Prefill on long prompts monopolises GPU compute while in-flight decode steps stall behind it. Meanwhile, decode is memory-bandwidth-starved and you are buying more H200s just to add KV cache capacity. These are not independent problems. Both come from packing two phases with different optimal hardware configurations onto the same GPU. NVIDIA Dynamo exists to fix that.

Why Prefill and Decode Are Fundamentally Different Workloads

Every LLM inference request runs two phases. Prefill processes the entire input prompt in one forward pass — it is compute-bound because the GPU is doing a large matrix-matrix multiply across the full context. Decode generates tokens one at a time — each step is a matrix-vector multiply, and the bottleneck is how fast you can stream model weights and KV cache out of HBM. Same model, same GPU, radically different performance profiles.
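To make the compute-bound versus bandwidth-bound distinction concrete, here is a rough sketch that estimates arithmetic intensity (FLOPs per byte moved) for one weight matrix in each phase. The hidden size of 8192 and the 2-byte element size are my assumptions for a 70B-class dense model, and the model ignores caches and kernel fusion.

```python
def arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for an (m x k) @ (k x n) matmul, ignoring caches."""
    flops = 2 * m * k * n                                   # one multiply + one add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, read B, write C
    return flops / bytes_moved


HIDDEN = 8192  # assumed hidden size, roughly a 70B-class dense model

# Prefill: all 4096 prompt tokens hit the weight matrix at once (matrix-matrix).
prefill = arithmetic_intensity(m=4096, k=HIDDEN, n=HIDDEN)
# Decode: one new token per step for a single sequence (matrix-vector).
decode = arithmetic_intensity(m=1, k=HIDDEN, n=HIDDEN)

print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.2f} FLOPs/byte")
```

Under these assumptions prefill lands near 2,000 FLOPs per byte and single-sequence decode near 1, which is why the same GPU is limited by compute in one phase and by HBM bandwidth in the other. Batching decode raises its intensity, but the KV cache reads grow with it.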

On a single GPU or a homogeneous pool, you compromise on both. You size batch sizes and tensor parallelism for one workload and hurt the other. At small scale this is acceptable — the cost of the compromise is low. At 32+ GPUs serving thousands of concurrent users on a 671B MoE model, it becomes your largest source of wasted capacity.

Monolithic vs Disaggregated LLM Serving

Both phases share GPU resources in monolithic mode; disaggregated mode allocates each phase its own pool

[Figure: two panels. Monolithic: one GPU pool runs prefill and decode together, with resource contention, so prefill stalls decode. Disaggregated (Dynamo): a compute-optimized prefill pool (high batch / TP) hands KV cache via NIXL to a bandwidth-optimized decode pool (larger KV budget); each phase scales independently and prefill never blocks decode slots.]

In monolithic serving, prefill and decode compete for the same GPU. In Dynamo, they run on separate pools scaled to their own bottleneck.

The Dynamo Architecture: Six Components That Matter

Dynamo 1.0, released in March 2026, is a production-grade distributed inference framework. It is not a replacement for vLLM, SGLang, or TensorRT-LLM. It sits above them, orchestrating and routing across pools of inference workers. The six components you need to understand are:

| Component | Role | Where it runs |
| --- | --- | --- |
| Dynamo Frontend | Accepts OpenAI-compatible requests, injects agent hints (latency target, expected output length, cache control), passes to router | CPU node or lightweight pod |
| KV-Aware Router | Routes each request to the prefill worker with the highest KV prefix overlap, balancing cache hit rate against queue depth; uses a global radix tree | CPU, co-located with frontend |
| Prefill Workers | Run the compute-bound prefill pass; produce KV cache blocks and transfer them to decode workers via NIXL | GPU pool (smaller TP, higher batch) |
| Decode Workers | Run autoregressive decode; receive KV blocks from prefill; sized for KV cache capacity and memory bandwidth | GPU pool (larger TP, larger KV budget) |
| NIXL | NVIDIA Inference Xfer Library: point-to-point, async, non-blocking KV cache transfer from prefill VRAM to decode VRAM; works over NVLink, InfiniBand, RoCE, or Ethernet | Fabric layer |
| Planner / DGDR | Takes SLO targets (TTFT, ITL, throughput), profiles the cluster, and outputs an optimized Dynamo Graph Deployment (DGD) covering prefill/decode pool sizes, TP degree, and router weights, without manual tuning | Kubernetes controller |

The KV-Aware Router in Detail

The router is the piece most people underestimate. It is not round-robin and it is not just load-based. It hashes the token blocks of each incoming prompt and looks them up in a global radix tree that tracks which KV blocks exist on which worker. When a request shares a long prefix with a cached sequence (a system prompt, a shared document, a multi-turn conversation), the router sends it to the worker that already holds those blocks, avoiding the full prefill recompute cost.
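The prefix lookup can be illustrated with a small sketch. This is not Dynamo's implementation: it uses a flat set of chained block hashes per worker as a stand-in for the global radix tree, and the block size, worker names, and token values are all hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block; an assumed value, real engines use larger blocks


def block_hashes(tokens: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Chain-hash full blocks so each hash identifies the whole prefix up to that block."""
    hashes, prev = [], ""
    full_len = len(tokens) - len(tokens) % block_size  # drop the trailing partial block
    for i in range(0, full_len, block_size):
        chunk = ",".join(map(str, tokens[i:i + block_size]))
        prev = hashlib.sha256((prev + "|" + chunk).encode()).hexdigest()
        hashes.append(prev)
    return hashes


def overlap_blocks(request: list[str], cached: set[str]) -> int:
    """Count the leading request blocks that a worker already holds."""
    count = 0
    for h in request:
        if h not in cached:
            break
        count += 1
    return count


system_prompt = list(range(8))  # 8 shared tokens = 2 full blocks
worker_cache = {
    "worker-a": set(block_hashes(system_prompt + [100, 101, 102, 103])),
    "worker-b": set(),
}
request = block_hashes(system_prompt + [200, 201, 202, 203])
overlaps = {name: overlap_blocks(request, cache) for name, cache in worker_cache.items()}
print(overlaps)
```

Because each hash folds in the previous one, two requests only match on a block if everything before it also matched, which is the property a radix tree over prefixes gives you structurally.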

The key tuning knob is --router-kv-overlap-score-weight. At 1.0 it prioritises prefix cache overlap entirely, which minimises TTFT but can leave worker queues unbalanced. At 0.0 it reverts to pure load balancing. Production deployments typically sit between 0.3 and 0.7 depending on whether the workload is more multi-turn (higher overlap weight) or single-shot (lower). In NVIDIA's published tests on real DeepSeek R1 user query traces, KV-aware routing delivered a 3x improvement in TTFT and a 2x reduction in average request latency versus round-robin at scale.
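A toy scoring function shows how one weight can trade cache overlap against load. The formula below is a hypothetical illustration of the behaviour described above (1.0 means overlap only, 0.0 means load only), not Dynamo's actual cost function, and the worker names and numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    overlap_blocks: int  # request blocks already cached on this worker
    queue_depth: int     # requests waiting on this worker


def pick_worker(workers: list[Worker], request_blocks: int, weight: float) -> str:
    """Return the worker with the best blend of cache overlap and low load."""
    max_queue = max(w.queue_depth for w in workers) or 1  # avoid divide-by-zero

    def score(w: Worker) -> float:
        overlap = w.overlap_blocks / request_blocks  # 0..1, higher is better
        load = w.queue_depth / max_queue             # 0..1, lower is better
        return weight * overlap - (1.0 - weight) * load

    return max(workers, key=score).name


workers = [
    Worker("cached-but-busy", overlap_blocks=90, queue_depth=12),
    Worker("cold-but-idle", overlap_blocks=0, queue_depth=1),
]
for weight in (0.0, 0.3, 0.7, 1.0):
    print(weight, pick_worker(workers, request_blocks=100, weight=weight))
```

In this example the choice flips from the idle worker to the cached one somewhere between 0.3 and 0.7, which is the kind of crossover worth locating with your own traffic before fixing the weight.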

KV-Aware Routing Decision Flow

Router evaluates prefix overlap and queue depth before assigning a prefill worker

[Figure: incoming request → hash prefix → radix tree lookup → overlap score. High overlap routes to the worker with cached blocks; low overlap routes to the least-loaded worker. Weight: --router-kv-overlap-score-weight (0.0 to 1.0).]

The router chooses based on a weighted combination of KV overlap score and worker queue depth. The tuning knob shifts priority between cache efficiency and load balance.

NIXL: The KV Transfer Path

NIXL (NVIDIA Inference Xfer Library) is the piece that makes disaggregation practical rather than theoretical. It moves KV cache blocks directly from the prefill worker’s VRAM to the decode worker’s VRAM, non-blocking, so both GPUs keep their forward passes running while the transfer happens in the background. It abstracts the transport: on a GB200 NVL72 rack the KV blocks move over NVLink; across nodes they move over InfiniBand or RoCE; on AWS EFA deployments they move over LIBFABRIC. The application code does not change.

The failure mode to know about: at small scale or with short context lengths, the NIXL transfer latency dominates. If your average prompt is 200 tokens, the KV blocks are tiny and the network round-trip to move them may cost more wall time than just re-running prefill on the decode worker. Disaggregation only wins when the KV payload is large enough for the transfer to be cheaper than recompute — that threshold is roughly context lengths of 2k+ tokens at production concurrency.
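One way to reason about that threshold is to express the transfer time as a fraction of the prefill time it accompanies. The sketch below is a back-of-envelope model, not a benchmark: the fixed latency, link bandwidth, KV bytes per token, and prefill rate are all assumed values you would replace with measurements, and it treats prefill time as linear in prompt length.

```python
def transfer_overhead(
    prompt_tokens: int,
    kv_bytes_per_token: int,
    link_gb_per_s: float,
    fixed_ms: float,
    prefill_tokens_per_s: float,
) -> float:
    """Transfer time divided by prefill time; above ~0.5 the transfer is hard to justify."""
    copy_ms = prompt_tokens * kv_bytes_per_token / (link_gb_per_s * 1e9) * 1e3
    transfer_ms = fixed_ms + copy_ms                   # fixed setup cost + bandwidth-bound copy
    prefill_ms = prompt_tokens / prefill_tokens_per_s * 1e3
    return transfer_ms / prefill_ms


ASSUMED = dict(
    kv_bytes_per_token=160 * 1024,  # assumed: 70B-class GQA model with FP8 KV cache
    link_gb_per_s=25.0,             # assumed: one 200Gb/s InfiniBand link
    fixed_ms=10.0,                  # assumed fixed setup and round-trip latency
    prefill_tokens_per_s=13_000.0,  # assumed prefill rate (200 tokens in about 15ms)
)

short = transfer_overhead(200, **ASSUMED)
long = transfer_overhead(8192, **ASSUMED)
print(f"200-token prompt: transfer is {short:.0%} of prefill time")
print(f"8192-token prompt: transfer is {long:.0%} of prefill time")
```

With these inputs the fixed latency dominates short prompts and amortises away on long ones. The crossover point moves with your fabric and model, so the numbers matter less than running the same comparison on measured values.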

KV Cache Transfer Path via NIXL

Non-blocking async transfer from prefill VRAM to decode VRAM; forward passes continue on both workers

[Figure: the prefill worker (compute-bound, matrix-matrix) produces KV blocks in VRAM and sends them via NIXL over NVLink / IB / RoCE to the decode worker (bandwidth-bound, matrix-vector) while both forward passes continue. The KV Block Manager evicts through GPU VRAM → CPU pinned memory → local SSD → object storage (S3).]

NIXL transfers are async and non-blocking: both workers continue their forward passes while blocks move. The KV Block Manager handles tiered eviction from GPU VRAM down to object storage.
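The tiered eviction idea can be sketched as a cascade of LRU caches, where a block pushed out of one tier is demoted to the next. This is an illustration of the concept only: the class, tier names, and capacities are hypothetical, and the real KV Block Manager's policy and API may differ.

```python
from collections import OrderedDict
from typing import Optional

TIERS = ["gpu_vram", "cpu_pinned", "local_ssd", "object_storage"]


class TieredKVStore:
    def __init__(self, capacities: dict[str, int]):
        # Capacity in blocks per tier; tiers without an entry are treated as unbounded.
        self.capacities = capacities
        self.tiers = {name: OrderedDict() for name in TIERS}

    def put(self, block_id: str, tier_index: int = 0) -> None:
        name = TIERS[tier_index]
        tier = self.tiers[name]
        tier[block_id] = True
        tier.move_to_end(block_id)  # mark as most recently used
        cap = self.capacities.get(name)
        if cap is not None and len(tier) > cap and tier_index + 1 < len(TIERS):
            victim, _ = tier.popitem(last=False)  # evict the least recently used block
            self.put(victim, tier_index + 1)      # demote it one tier down

    def locate(self, block_id: str) -> Optional[str]:
        for name in TIERS:
            if block_id in self.tiers[name]:
                return name
        return None


store = TieredKVStore({"gpu_vram": 2, "cpu_pinned": 2, "local_ssd": 2})
for i in range(7):
    store.put(f"block-{i}")
placement = {f"block-{i}": store.locate(f"block-{i}") for i in range(7)}
print(placement)
```

After seven inserts the two newest blocks sit in GPU VRAM and the oldest has cascaded all the way to object storage. A production policy would also promote blocks back up on reuse, which this sketch leaves out.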

The Planner and DGDR: Removing the Sizing Guesswork

The hardest part of disaggregated serving is sizing the prefill and decode pools relative to each other. Get it wrong and you create a new bottleneck: either the prefill pool is undersized and decode workers queue waiting for KV blocks, or the decode pool is undersized and completed prefills pile up waiting for a free decode slot.

Dynamo 1.0 ships the Dynamo Graph Deployment Request (DGDR), which unifies the planner and AIConfigurator. You specify: model, target hardware, TTFT target, inter-token latency target, and expected traffic load. DGDR runs simulation-based recommendations for quick iteration, then optionally runs on-cluster profiling for production-grade sizing. The output is a Dynamo Graph Deployment (DGD) manifest — a Kubernetes-native YAML describing the full deployment including prefill worker count, decode worker count, tensor parallelism degree, and router configuration — ready to apply directly.

This matters because the ratio of prefill to decode workers is model-dependent and traffic-pattern-dependent. For DeepSeek R1 on GB200 NVL72 with 1k input / 1k output, the optimal ratio sits around 1:3 prefill-to-decode. For a code-generation model with 8k input and 128 output tokens, you flip it toward more prefill capacity. The planner finds this for you rather than requiring weeks of manual benchmark iteration.
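The intuition behind the ratio can be sketched with GPU-seconds per request. The per-GPU throughput figures below are hypothetical, chosen so that the 1k/1k case reproduces the 1:3 split mentioned above. They are not measured values, and the planner's real model accounts for far more (batching, TP degree, SLO targets).

```python
def decode_gpus_per_prefill_gpu(
    input_tokens: int,
    output_tokens: int,
    prefill_tok_s_per_gpu: float,
    decode_tok_s_per_gpu: float,
) -> float:
    """Ratio of decode to prefill GPU-seconds that one request consumes."""
    prefill_gpu_s = input_tokens / prefill_tok_s_per_gpu
    decode_gpu_s = output_tokens / decode_tok_s_per_gpu
    return decode_gpu_s / prefill_gpu_s


# Hypothetical aggregate per-GPU rates, not benchmark results.
PREFILL_RATE = 30_000.0  # prompt tokens/s per prefill GPU
DECODE_RATE = 10_000.0   # generated tokens/s per decode GPU (batched)

chat = decode_gpus_per_prefill_gpu(1000, 1000, PREFILL_RATE, DECODE_RATE)
codegen = decode_gpus_per_prefill_gpu(8000, 128, PREFILL_RATE, DECODE_RATE)
print(f"1k in / 1k out: about {chat:.1f} decode GPUs per prefill GPU")
print(f"8k in / 128 out: about {1 / codegen:.0f} prefill GPUs per decode GPU")
```

The same hardware rates give opposite pool shapes for the two traffic patterns, which is why a ratio tuned for one workload should not be carried over to another.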

In practice: Do not skip the planner. I have seen teams manually configure a 1:1 prefill-to-decode ratio for a reasoning model with long output sequences and end up with prefill workers idle 70% of the time while decode workers queued 10+ requests. DGDR would have suggested 1:4. The profiling run takes under 30 minutes for most configurations and saves days of production tuning.

Trade-Off Table: Monolithic NIM vs Dynamo Disaggregated

| Dimension | Monolithic NIM | Dynamo Disaggregated |
| --- | --- | --- |
| Operational complexity | Low: one Helm chart, one pool | High: two pools, router, NIXL config, planner |
| Prefill/decode contention | Present: long prefills delay decode slots | Eliminated: independent queues |
| KV transfer overhead | None | Present: latency penalty at small scale or short context |
| Throughput at scale | Saturates at GPU memory ceiling | Up to 7x on GB200 NVL72 for MoE models with wide expert parallelism |
| Scaling granularity | Entire replica scales together | Prefill and decode pools scale independently |
| Hardware flexibility | Homogeneous: all GPUs same spec | Heterogeneous: can use H100 for prefill, H200 for decode |
| Best-fit workload | Small/medium scale, short context, stable latency SLO | Large scale, long context, MoE models, cost-per-token target |
| Break-even scale | Always wins below ~16 GPUs | Wins above ~32-50 GPUs at long context or high concurrency |

A Real Deployment Config

Below is a representative Dynamo disaggregated serving configuration for a vLLM backend, showing prefill and decode worker parameters with NIXL KV transfer enabled. Treat the field names as illustrative and check them against the Dynamo 1.x documentation at docs.nvidia.com/dynamo for the version you run.

# dynamo-disagg-vllm.yaml
# Dynamo disaggregated inference config: prefill + decode pools with NIXL
# Tested pattern: vLLM backend on Dynamo 1.x
# [VERIFY exact field names against your Dynamo version before applying]

VllmWorker:
  # Prefill workers: compute-bound, smaller tensor parallelism
  prefill:
    tensor-parallel-size: 4
    max-num-seqs: 64
    gpu-memory-utilization: 0.90
    # "backends" lists NIXL transport plugins, not NIXL itself.
    # UCX is assumed here; adjust for your fabric and verify for your version.
    kv-transfer-config: '{
      "kv_connector":"NixlConnector",
      "kv_role":"kv_producer",
      "kv_connector_extra_config":{
        "backends":["UCX"]
      }
    }'
    num_nodes: 1
    num_gpus: 4

  # Decode workers: memory-BW-bound, larger TP, more KV budget
  decode:
    tensor-parallel-size: 8
    max-num-seqs: 256
    gpu-memory-utilization: 0.92
    # Same transport backend as the prefill side (assumed UCX).
    kv-transfer-config: '{
      "kv_connector":"NixlConnector",
      "kv_role":"kv_consumer",
      "kv_connector_extra_config":{
        "backends":["UCX"]
      }
    }'
    num_nodes: 1
    num_gpus: 8

Router:
  kv-overlap-score-weight: 0.5   # balance between cache-hit and load-balance

Frontend:
  port: 8000
  model: meta-llama/Llama-3.1-70B-Instruct

# Expected behavior with 2k+ token prompts at high concurrency:
# - TTFT improves because prefill pool is not competing for decode slots
# - Total throughput increases as both pools run at their optimal batch size
# - NIXL transfer adds ~5-15ms per request (verify on your fabric)
#
# Failure mode at small scale (under 16 GPUs total, short prompts < 512 tokens):
# - KV transfer overhead dominates: total latency INCREASES vs monolithic NIM
# - Prefill pool sits near-idle while transfer completes for tiny KV payloads
# - Fix: revert to monolithic NIM or increase minimum context length threshold

Verify the field names kv_role and kv_connector, and the YAML nesting, against your specific Dynamo and vLLM versions before applying this in production.
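Because the kv-transfer-config values are JSON embedded in YAML strings, a small pre-deployment check can catch quoting mistakes and mismatched roles early. The helper below is a hypothetical sketch using only the standard library. It checks the producer/consumer pairing used in the config above and knows nothing about what your Dynamo or vLLM version actually accepts.

```python
import json

PREFILL_CFG = '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'
DECODE_CFG = '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'


def check_pair(prefill_json: str, decode_json: str) -> list[str]:
    """Return a list of problems; an empty list means the pair looks consistent."""
    problems = []
    prefill = json.loads(prefill_json)  # raises ValueError on malformed JSON
    decode = json.loads(decode_json)
    if prefill.get("kv_connector") != decode.get("kv_connector"):
        problems.append("prefill and decode use different kv_connector values")
    if prefill.get("kv_role") != "kv_producer":
        problems.append("prefill kv_role should be kv_producer")
    if decode.get("kv_role") != "kv_consumer":
        problems.append("decode kv_role should be kv_consumer")
    return problems


print(check_pair(PREFILL_CFG, DECODE_CFG))  # consistent pair
print(check_pair(DECODE_CFG, PREFILL_CFG))  # roles swapped by mistake
```

Running a check like this in CI is cheap insurance, since a swapped role or a stray quote otherwise tends to surface only as a stalled decode pool at runtime.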

Worked Example: DeepSeek R1 on GB200 NVL72

Scenario: Serving DeepSeek R1 (671B, FP4) on a GB200 NVL72 rack (72 Blackwell GPUs, roughly 13.4 TB HBM3e total), targeting 1k input / 1k output token lengths, 50 tok/sec/user interactivity target, 500 concurrent users.

Monolithic baseline: With a single NIM pool at TP=8 across 72 GPUs (9 replicas), throughput tops out at roughly 4,000 requests/hour before prefill latency causes TTFT violations. Each prefill on a 1k-token prompt holds a replica’s GPU batch for ~300ms. At 500 concurrent users, you queue on prefill constantly.

Dynamo disaggregated: Planner recommends 18 prefill GPUs (TP=2, high batch) + 54 decode GPUs (TP=6, larger KV budget). Prefill pool processes prompts in under 80ms at TP=2 with wide expert parallelism active. KV blocks transfer via NIXL over NVLink in approximately 8-12ms for 1k-token context. Decode pool handles autoregressive generation unblocked. Published SemiAnalysis InferenceX result: 7x throughput improvement on this exact hardware class with this model. Real-world deployments at Baseten, CoreWeave, and ByteDance have validated similar gains in production.

The catch: If you run this same config with an average prompt of 200 tokens, the 8-12ms NIXL transfer costs the same but the prefill would have completed in under 20ms monolithically. You are now slower. Disaggregation is a long-context, high-concurrency optimization — not a universal one.
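The scenario's targets can be turned into aggregate demand with a few lines of arithmetic. This sketch assumes every one of the 500 users is generating continuously at the target rate and ignores prefill time within each request. It uses only the numbers quoted in the scenario and the 18/54 GPU split.

```python
def steady_state_targets(
    users: int, tok_s_per_user: float, input_tokens: int, output_tokens: int
) -> dict:
    """Aggregate demand if every user is continuously generating at the target rate."""
    seconds_per_request = output_tokens / tok_s_per_user  # decode time per request
    requests_per_s = users / seconds_per_request          # Little's law: L = lambda * W
    return {
        "decode_tok_s": users * tok_s_per_user,
        "prefill_tok_s": requests_per_s * input_tokens,
        "requests_per_s": requests_per_s,
        "requests_per_hour": requests_per_s * 3600,
    }


targets = steady_state_targets(users=500, tok_s_per_user=50, input_tokens=1000, output_tokens=1000)
PREFILL_GPUS, DECODE_GPUS = 18, 54  # planner split quoted in the text

print(targets)
print("decode tok/s needed per decode GPU:", round(targets["decode_tok_s"] / DECODE_GPUS))
print("prefill tok/s needed per prefill GPU:", round(targets["prefill_tok_s"] / PREFILL_GPUS))
```

Treat the result as an upper bound on demand, since real users pause between requests. It is still a useful yardstick for judging whether a given pool split can plausibly meet the interactivity target.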

The Smart Router Diagram: Pool Split Decision

Prefill / Decode Pool Split and Smart Router

End-to-end request path through Dynamo: frontend to router to prefill pool to decode pool

[Figure: client (OpenAI API) → frontend (injects agent hints, routes) → smart router (KV overlap score, queue depth, radix tree lookup, OSL hints) → prefill pool (TP=2..4, high batch, compute-optimized) → NIXL → decode pool (TP=6..8, large KV, bandwidth-optimized) → tokens streamed. Both pools scale independently.]

The smart router sits between the frontend and prefill pool, not between prefill and decode. Decode workers receive their assignment implicitly via KV block delivery from NIXL.

Production Gotchas

Gotcha 1: KV Transfer Overhead at Short Context

Already mentioned but worth repeating explicitly: on prompts under about 1,000 tokens at low concurrency, Dynamo will produce worse TTFT than a single NIM. The NIXL transfer carries roughly 5-20ms of fixed latency even when the payload is tiny. For a 200-token prompt that prefills in 15ms natively, adding a 10ms transfer makes no sense. Use Dynamo only when your p90 prompt length and concurrency make the transfer cost negligible relative to what you save in prefill/decode contention.

Gotcha 2: Fabric Choice Matters More Than You Expect

NIXL transfers over NVLink on a GB200 NVL72 rack are qualitatively different from NIXL transfers over 200Gb/s InfiniBand across nodes. On-rack NVLink bandwidth is roughly 1.8 TB/s per GPU (about 130 TB/s aggregate across a full NVL72 fabric). Cross-node InfiniBand at 200Gb/s is 25 GB/s per link. For a 70B model with an FP8 KV cache, a 4k-token context is on the order of 0.7 GB per sequence with grouped-query attention (Llama-3.1-70B: 80 layers, 8 KV heads, head dimension 128), so a handful of concurrent sequences puts roughly 3 GB in flight. Moving 3 GB over NVLink takes a couple of milliseconds; over a single 200Gb/s InfiniBand link it takes over 100ms. If your disaggregated prefill and decode pools are on separate nodes connected by standard IB, measure your actual transfer latency before assuming disaggregation helps.
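The fabric comparison is easy to reproduce for your own model. The sketch below assumes a Llama-3.1-70B-style layout with grouped-query attention (80 layers, 8 KV heads, head dimension 128) and a 1-byte FP8 KV cache. That layout is my assumption for illustration, and models without grouped-query attention, or batches of several sequences, reach multi-GB payloads quickly. The bandwidth figures are peak rates, so real transfers will be slower.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """KV cache size for one sequence: K and V per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def transfer_ms(num_bytes: int, gb_per_s: float) -> float:
    """Bandwidth-bound copy time at peak rate, ignoring fixed latency."""
    return num_bytes / (gb_per_s * 1e9) * 1e3


# Assumed Llama-3.1-70B-style layout with an FP8 KV cache.
per_seq = kv_cache_bytes(tokens=4096, layers=80, kv_heads=8, head_dim=128, bytes_per_elem=1)

NVLINK_GB_S = 1800.0  # per-GPU NVLink bandwidth on GB200 NVL72
IB_GB_S = 25.0        # one 200Gb/s InfiniBand link

print(f"KV cache per 4k-token sequence: {per_seq / 1e9:.2f} GB")
print(f"NVLink: {transfer_ms(per_seq, NVLINK_GB_S):.2f} ms per sequence")
print(f"InfiniBand: {transfer_ms(per_seq, IB_GB_S):.1f} ms per sequence")
```

Multiply the per-sequence figure by the number of sequences in flight to estimate the real load on the link. At a few concurrent long prompts, a single InfiniBand link is already in the 100ms range.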

Gotcha 3: The Planner Needs Real Traffic Profiles

DGDR is only as good as the traffic profile you give it. If you profile with synthetic uniform requests and your production traffic is bimodal (short chatbot turns + long document analysis), the pool sizing will be wrong for at least one load type. Feed the planner actual production request traces. The Dynamo docs recommend sampling at least 100,000 requests from your real workload.
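Before handing a trace to the planner, it helps to check whether the prompt-length distribution is bimodal at all. The sketch below summarises a list of prompt lengths with percentiles and the share of short and long requests. The 512 and 4096 token cut-offs and the synthetic trace are arbitrary assumptions for illustration.

```python
from statistics import quantiles


def summarize(prompt_lengths: list[int], short_max: int = 512, long_min: int = 4096) -> dict:
    """Percentiles plus the share of short and long prompts in a trace."""
    n = len(prompt_lengths)
    cuts = quantiles(prompt_lengths, n=100)  # 99 cut points: cuts[49] is p50, cuts[89] is p90
    return {
        "p50": cuts[49],
        "p90": cuts[89],
        "short_share": sum(length <= short_max for length in prompt_lengths) / n,
        "long_share": sum(length >= long_min for length in prompt_lengths) / n,
    }


# Synthetic bimodal trace: 70% short chatbot turns, 30% long document prompts.
trace = [200] * 70 + [8000] * 30
summary = summarize(trace)
print(summary)
```

A large gap between p50 and p90 together with sizeable short and long shares is a hint that a single average prompt length will mislead the sizing, and that each mode deserves its own profiling pass.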

My take: Dynamo 1.0 is genuinely production-ready, as evidenced by its deployment at AstraZeneca, ByteDance, CoreWeave, Baseten, and a dozen cloud providers. The architecture is sound. The piece I would watch is the fabric dependency: teams that do not have NVLink-connected GPU pools or fast IB fabrics will find the KV transfer overhead erodes a significant fraction of the throughput gains. If you are on commodity Ethernet between nodes, measure before committing.

When Dynamo Is NOT the Right Choice

Disaggregated inference is a scale optimization. It adds real complexity: two GPU pools to monitor, a router to configure, a NIXL fabric to validate, and a planner to run before any sizing change. Every one of those components can fail independently. That overhead is worth it when you are running a 70B+ model at 50+ GPU scale with long-context workloads and a hard cost-per-token target. It is not worth it in these cases:

  • You are running a 7B or 13B model — the prefill is fast enough that contention is minimal
  • Your GPU count is under 16 and you have no near-term plan to scale beyond it
  • Average prompt length is under 1k tokens and output length is under 512 tokens
  • You do not have a high-bandwidth fabric (NVLink rack or 400Gb/s IB) between prefill and decode nodes
  • Your team does not yet have Kubernetes-native inference operations experience — the monolithic NIM failure modes are simpler to debug

In those cases, a NIM on Triton (covered in Part 19) or a direct TensorRT-LLM deployment gets you 80% of the performance at 30% of the operational burden. Save Dynamo for when you genuinely need it.

What to Validate Before You Commit

If you are evaluating Dynamo for a production deployment, work through this checklist before signing off on the architecture:

  1. Measure your actual NIXL transfer latency on your specific fabric before benchmarking throughput. Run the NIXL bandwidth test on your NVLink / IB / RoCE fabric and get the p99 latency for your expected KV payload size.
  2. Profile with real traffic, not synthetic requests. Bimodal or long-tail prompt distributions change the optimal prefill-to-decode pool ratio significantly.
  3. Run DGDR planner output as a recommendation, not a prescription. Validate the suggested ratio with a 30-minute load test at production concurrency before scaling to full deployment.
  4. Set up KV Block Manager tiering carefully. KVBM now supports GPU VRAM, CPU pinned memory, local SSD, and S3-compatible object storage. For long-context workloads, the tiering policy (eviction thresholds between tiers) is as important as the pool sizes.
  5. Check Grove topology constraints if you are on GB300 NVL72. Grove’s unified topology API lets you pin prefill and decode to the same NVL72 rack for intra-rack NVLink transfers — do this explicitly rather than letting the scheduler place them on different racks.

The Verdict

NVIDIA Dynamo is the right inference framework when you are running large MoE models like DeepSeek R1 at 50+ GPU scale, with long-context workloads and a genuine cost-per-token pressure. The 7x throughput claim on GB200 NVL72 is real, verified by SemiAnalysis and corroborated by production deployments at multiple cloud providers. The disaggregation model is architecturally sound: the KV-aware router, NIXL transfer, and DGDR planner together address the real bottlenecks in monolithic serving at scale.

What I would not do: deploy Dynamo as a default choice for every LLM workload. The operational complexity is non-trivial and the KV transfer overhead is real. Teams that are not yet hitting the limits of monolithic NIM serving should stay there until they are. When you do cross the threshold — 50+ GPUs, long context, high concurrency, MoE model — Dynamo is not just worth the complexity. It is the only architectural choice that scales.

Inference economics — what disaggregation actually does to your cost per token and how to size for a budget — is covered in Part 21. If you want to see how Dynamo sits in the broader NVIDIA AI stack, the NVIDIA AI Guide maps all 30 parts together.

If you are sizing a Dynamo deployment right now, drop a comment with your model, GPU count, and context length distribution — I am happy to talk through whether disaggregation is the right call for your specific numbers.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
