TL;DR
You can run the NVIDIA AI stack three ways: bare-metal Kubernetes on-prem, virtualized on VMware Cloud Foundation through Private AI Foundation, or rented in the cloud. The stack is the same in all three (drivers, GPU Operator, NIM, NeMo, the models); what changes is who owns the platform underneath and the cost shape. Bare metal gives maximum performance and the most operational burden. VCF gives you virtualization, hard multi-tenancy, and self-service for teams that already run VMware, at a modest overhead. Cloud gives elasticity and zero capital outlay but bills per GPU-hour, which loses to on-prem once your utilization is steady and high. The real total cost is GPUs plus power and cooling plus AI Enterprise licensing plus the people to run it, and the deciding number is honest utilization. Build on-prem or on VCF when your GPUs stay busy and your data must stay home; rent when demand is bursty or you are still finding product-market fit.
Twenty-nine parts in, you know what NVIDIA ships and how the pieces fit. The last question is the one that actually shows up in a budget meeting: where does this run, and what does it really cost? The honest answer is that the technical stack is the easy part. The hard part is the platform decision underneath it and the total cost nobody fully tallies until the invoices arrive. This part closes the loop: the three places the stack runs, the costs that decide between them, and my verdict on building an AI factory at all.
Three places the same stack runs
The NVIDIA AI Enterprise stack does not change based on where you put it. The drivers, the GPU Operator, NIM, NeMo, and the models are identical whether they run on bare metal, in a virtual machine, or in someone else's data center. What differs is the platform layer beneath them, and that layer is the real decision.
Bare-metal Kubernetes on-prem is the performance ceiling and the operations floor: no virtualization overhead, full control, and every piece of the platform (the cluster, the fabric, the lifecycle) is yours to build and run. VMware Cloud Foundation with Private AI Foundation puts the same stack on vSphere, so you gain virtualization, live migration, hard multi-tenancy, and self-service provisioning through VCF Automation, in exchange for a modest virtualization overhead and on a platform your VMware team already knows how to operate. Cloud, including DGX Cloud and the hyperscalers, hands you GPUs on demand with none of the capital outlay and none of the facility work, billed by the hour.
| Dimension | Bare metal | VCF / Private AI | Cloud |
|---|---|---|---|
| Performance | Highest | High (small overhead) | High, varies by provider |
| Multi-tenancy | DIY (MIG + namespaces) | Built in (vSphere + MIG) | Provider-managed |
| Self-service | Build it yourself | VCF Automation catalog | Console / API |
| Cost shape | Capital + ops | Capital + ops + VCF | Operating, per hour |
| Best when | Peak performance, big team | You already run VMware | Bursty or early-stage |
The cost nobody fully tallies
GPU sticker price is the number everyone quotes, and it is far from the whole bill. The total cost of an on-prem AI factory has five components, and each one was covered earlier in this series. The GPUs and systems are the capital line. Power and cooling (the liquid-cooling reality from Part 10) are a recurring operating line that scales with every rack of Blackwell you add. NVIDIA AI Enterprise licensing (the per-GPU subscription from Part 2) is a recurring software line you cannot skip if you want support. Networking and storage (the fabric and the data path) are capital you under-budget at your peril. And people, the platform engineers who run all of it, are the line that quietly dominates at small scale.
The way to make this tractable is the framing from Part 21: reduce everything to cost per token, or cost per unit of useful work, and compare. That converts a messy capital-versus-operating argument into one number you can hold against a cloud GPU-hour rate. And the input that moves that number most is the honest utilization from Part 29, because an on-prem GPU you paid for and left at twenty percent occupancy is far more expensive per token than a cloud GPU you only rented when you needed it.
| Cost component | Type | Covered in |
|---|---|---|
| GPUs and systems | Capital | Parts 3 to 5 |
| Power and cooling | Operating | Part 10 |
| AI Enterprise licensing | Operating | Part 2 |
| Networking and storage | Capital | Parts 7 to 9 |
| Platform engineers | Operating | Parts 12 to 15, 29 |
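To make the cost-per-token framing concrete, here is a minimal sketch that folds the five components into one yearly figure for a single owned GPU and divides by the tokens it actually produces. Every number in it is a hypothetical placeholder, not a quote or a benchmark, and the names (`OwnedGpuCost`, `cost_per_million_tokens`, `peak_tokens_per_second`) are mine for illustration. It also assumes throughput scales linearly with utilization, which is a simplification.

```python
from dataclasses import dataclass

SECONDS_PER_YEAR = 365 * 24 * 3600


@dataclass
class OwnedGpuCost:
    """Yearly cost inputs for one owned GPU. All figures are hypothetical."""
    gpu_and_system_capex: float    # purchase price share for this GPU
    network_storage_capex: float   # fabric and storage share for this GPU
    amortization_years: float      # period the capital is spread over
    power_cooling_per_year: float
    licensing_per_year: float      # per-GPU software subscription
    staff_per_year: float          # share of platform-engineer time

    def annual_cost(self) -> float:
        # Spread the capital lines over the amortization period, then add
        # the recurring operating lines.
        capital = (self.gpu_and_system_capex + self.network_storage_capex) / self.amortization_years
        return capital + self.power_cooling_per_year + self.licensing_per_year + self.staff_per_year


def cost_per_million_tokens(annual_cost: float, peak_tokens_per_second: float, utilization: float) -> float:
    """Cost per million tokens, assuming output scales linearly with utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_year = peak_tokens_per_second * utilization * SECONDS_PER_YEAR
    return annual_cost / tokens_per_year * 1_000_000


# Placeholder inputs chosen only to show the shape of the calculation.
gpu = OwnedGpuCost(
    gpu_and_system_capex=40_000,
    network_storage_capex=10_000,
    amortization_years=4,
    power_cooling_per_year=3_000,
    licensing_per_year=4_500,
    staff_per_year=10_000,
)
busy = cost_per_million_tokens(gpu.annual_cost(), peak_tokens_per_second=1_000, utilization=0.8)
idle = cost_per_million_tokens(gpu.annual_cost(), peak_tokens_per_second=1_000, utilization=0.2)
print(f"80% utilized: {busy:.2f} per million tokens")
print(f"20% utilized: {idle:.2f} per million tokens")
```

Because the yearly cost is fixed once you own the hardware, dropping from eighty percent to twenty percent utilization multiplies the cost per token by four. Swap in your own figures before drawing any conclusion from the absolute values.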
Worked example
A team needs steady inference for an internal product, running roughly two GPUs' worth of load around the clock. In the cloud at a typical GPU-hour rate, two GPUs running continuously for a year is a large, purely operating bill that recurs for as long as the product runs. On-prem, the same two GPUs are a one-time purchase plus power, a slice of an AI Enterprise subscription, and a fraction of an engineer's time.
Because the load is steady and high, on-prem crosses below cloud within the first year and keeps widening the gap after the hardware is paid off. Flip the scenario to a research team whose GPUs sit idle most nights and spike during experiments, and the math inverts: the cloud bill follows the spikes, while owned GPUs would sit paid-for and idle. Same stack, opposite answer, and the only variable that changed was utilization.
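The two scenarios can be sketched with a small break-even calculation. All the figures below are hypothetical placeholders I picked to illustrate the shape of the argument, not real prices: an upfront purchase for two GPUs, a flat monthly on-prem operating cost, and a cloud GPU-hour rate that is only paid for hours actually used. The function name `break_even_month` is illustrative.

```python
from typing import Optional

HOURS_PER_MONTH = 730  # average hours in a month


def break_even_month(
    upfront: float,
    onprem_monthly: float,
    cloud_rate_per_gpu_hour: float,
    gpus: int,
    utilization: float,
    horizon_months: int = 60,
) -> Optional[int]:
    """First month in which cumulative on-prem cost is at or below cumulative cloud cost.

    Assumes the cloud bill follows utilization while on-prem costs are flat.
    Returns None if on-prem never catches up within the horizon.
    """
    cloud_monthly = cloud_rate_per_gpu_hour * gpus * HOURS_PER_MONTH * utilization
    for month in range(1, horizon_months + 1):
        onprem_total = upfront + onprem_monthly * month
        cloud_total = cloud_monthly * month
        if onprem_total <= cloud_total:
            return month
    return None


# Hypothetical inputs: two GPUs, same hardware and rates in both scenarios.
common = dict(upfront=60_000, onprem_monthly=3_000, cloud_rate_per_gpu_hour=6.0, gpus=2)

steady = break_even_month(utilization=1.0, **common)  # round-the-clock inference
bursty = break_even_month(utilization=0.2, **common)  # idle most nights

print("steady load breaks even in month:", steady)
print("bursty load breaks even in month:", bursty)
```

With these placeholder inputs the steady workload breaks even inside the first year, while the bursty one never does within the five-year horizon, because its monthly cloud bill stays below the on-prem operating cost alone. The only input that changed between the two calls is `utilization`.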
The arc, in one pass
It is worth seeing the whole journey at once, because the order was the argument. The series moved through seven phases, each building on the last. Foundations established what the stack is and how the GPUs and AI-factory systems are built. GPU infrastructure covered partitioning, the NVLink scale-up fabric, the InfiniBand and Spectrum-X scale-out fabric, the storage data path, and the power and cooling reality. The software platform took us through drivers, CUDA, the GPU and Network Operators, the NGC catalog, and air-gapped lifecycle. Inference covered NIM, TensorRT-LLM, Triton, Dynamo, and the economics that decide cost per token.
From there the series climbed the value chain. Customization covered the NeMo framework, LoRA and SFT and RLHF, data curation, and resilient multi-node training. Models and agents covered the open Nemotron family, NeMo Retriever and grounded RAG, and the Blueprints and agent toolkit that make agents observable. Operations closed it with DCGM and honest utilization, and this finale with the platform and cost decision. The shape is deliberate: you cannot reason about cost per token without inference, you cannot trust an agent without observability, and you cannot pick a platform without honest utilization. Each phase was a prerequisite for the verdict you are reading now.
What I would actually choose
My recommendation, stated plainly: if you already operate VMware, have steady GPU demand, and hold data that should stay on-premises, run the stack on VCF with Private AI Foundation; you reuse your platform team and get multi-tenancy and self-service without building them. The case against: if you have a strong bare-metal Kubernetes practice and are chasing the last few percent of performance, skip the virtualization layer. If your demand is bursty or you are still finding the use case, start in the cloud and move on-prem only once utilization is high enough to cross the break-even. Before any of it, validate two things: your honest utilization and your data-residency requirement, because those two inputs decide the platform more than any benchmark.
The mistake I see most is buying an on-prem AI factory on excitement and running it at twenty percent occupancy, paying owner prices for renter usage. Build to own only when you will keep the GPUs busy. Until then, rent, and instrument relentlessly so you know the day the math flips.
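One way to know the day the math flips is to track a break-even utilization. An owned GPU costs roughly the same per hour whether it is busy or not, while a rented one costs its hourly rate only when used, so owning wins once measured utilization exceeds the owned cost per GPU-hour divided by the cloud rate. The sketch below assumes you already have utilization samples between 0 and 1 (for example, averaged SM activity from your monitoring); the sample values, costs, and function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def break_even_utilization(owned_cost_per_gpu_hour: float, cloud_rate_per_gpu_hour: float) -> float:
    """Utilization above which an owned GPU is cheaper per useful hour than renting.

    A result above 1.0 means owning cannot win at these rates.
    """
    if cloud_rate_per_gpu_hour <= 0:
        raise ValueError("cloud rate must be positive")
    return owned_cost_per_gpu_hour / cloud_rate_per_gpu_hour


def owning_wins(
    utilization_samples: Sequence[float],
    owned_cost_per_gpu_hour: float,
    cloud_rate_per_gpu_hour: float,
) -> bool:
    """Compare average measured utilization against the break-even threshold."""
    measured = mean(utilization_samples)
    return measured > break_even_utilization(owned_cost_per_gpu_hour, cloud_rate_per_gpu_hour)


# Hypothetical rates: owning costs 3.00 per GPU-hour all-in, renting costs 6.00.
threshold = break_even_utilization(3.0, 6.0)
print(f"break-even utilization: {threshold:.0%}")
print("20% occupancy, own?", owning_wins([0.15, 0.25, 0.20], 3.0, 6.0))
print("80% occupancy, own?", owning_wins([0.70, 0.90, 0.80], 3.0, 6.0))
```

The averaging here is deliberately naive. In practice you would want samples taken over a long enough window to cover nights, weekends, and quiet periods, since a short busy window overstates utilization.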
The verdict on the whole stack
Thirty parts come down to a few honest claims. NVIDIA's advantage is not any single chip; it is that the whole stack (silicon, NVLink and the fabric, the GPU Operator, NIM and NeMo, the models, and the operational tooling) fits together and is supported as one platform. That coherence is real and it is worth paying for when you are running at scale, because the integration work it saves is the work that otherwise sinks projects. The flip side is lock-in and cost, and the discipline that keeps you honest is measurement: cost per token, SM activity, retrieval recall, profiled agents. Every part of this series pointed back to the same lesson, which is that the engineering is solved and the operations are where success is won or lost.
So here is the verdict. The NVIDIA AI stack is the most complete way to build an AI factory today, on-prem or on VCF, and for an enterprise with real workloads and real data-residency needs it is the safe, defensible choice. It is not the cheapest way to run a small or bursty workload; that is still the cloud. Decide with your utilization and your data, not with the hype, and you will build something that pencils out. If you take one thing from this whole series, take this: pick the platform with the cost model, instrument everything, and let the numbers, not the marketing, tell you when to own and when to rent. That is the end of the series; start your worksheet, and build accordingly.
Where to go from here: if you are starting fresh, begin with the two inputs that decide everything, your honest utilization and your data-residency requirement, and let them pick your platform before you pick a GPU. If you are already running, turn on the DCGM profiling fields and the agent profiler this week and look at the gap between what you are paying for and what you are actually using, because that gap is what funds the next decision. And if your road runs through VMware, the Private AI series picks up exactly where this one ends, with the VCF-specific deployment, sizing, and governance. Thirty parts on, the through-line is simple: the stack is ready, the integration is done, and the discipline of measuring before you spend is the part only you can supply. Thank you for reading the whole series.
References
NVIDIA AI Enterprise
VMware Private AI Foundation with NVIDIA on VCF 9.0
NVIDIA DGX Cloud



