Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture


Running NVIDIA AI On-Prem and on VCF: Cost, Trade-offs and the Verdict (NVIDIA AI Series, Part 30)

The finale: running the NVIDIA AI stack on bare metal, on VMware Cloud Foundation, or in the cloud; the real total cost of an AI factory; and the verdict on when to build versus rent.

NVIDIA AI Series · Part 30 of 30

TL;DR

You can run the NVIDIA AI stack three ways: bare-metal Kubernetes on-prem, virtualized on VMware Cloud Foundation through Private AI Foundation, or rented in the cloud. The stack is the same in all three (drivers, GPU Operator, NIM, NeMo, the models); what changes is who owns the platform underneath and the cost shape. Bare metal gives maximum performance and the most operational burden. VCF gives you virtualization, hard multi-tenancy, and self-service for teams that already run VMware, at a modest overhead. Cloud gives elasticity and zero capital outlay but bills per GPU-hour, which loses to on-prem once your utilization is steady and high. The real total cost is GPUs plus power and cooling plus AI Enterprise licensing plus the people to run it, and the deciding number is honest utilization. Build on-prem or on VCF when your GPUs stay busy and your data must stay home; rent when demand is bursty or you are still finding product-market fit.

Who this is for: AI-infrastructure architects and platform engineers making the build-versus-rent and on-prem-versus-VCF decision for a GPU platform. This is the finale of the series, so it assumes you know the whole stack, from the GPUs up, as covered in the earlier parts. Where it touches VMware specifics, it points to the Private AI series rather than repeating it.

Twenty-nine parts in, you know what NVIDIA ships and how the pieces fit. The last question is the one that actually shows up in a budget meeting: where does this run, and what does it really cost? The honest answer is that the technical stack is the easy part. The hard part is the platform decision underneath it and the total cost nobody fully tallies until the invoices arrive. This part closes the loop: the three places the stack runs, the costs that decide between them, and my verdict on building an AI factory at all.

Three places the same stack runs

The NVIDIA AI Enterprise stack does not change based on where you put it. The drivers, the GPU Operator, NIM, NeMo, and the models are identical whether they run on bare metal, in a virtual machine, or in someone else's data center. What differs is the platform layer beneath them, and that layer is the real decision.

Bare-metal Kubernetes on-prem is the performance ceiling and the operations floor: no virtualization overhead, full control, and every bit of the platform, the cluster, the fabric, the lifecycle, is yours to build and run. VMware Cloud Foundation with Private AI Foundation puts the same stack on vSphere, so you gain virtualization, live migration, hard multi-tenancy, and self-service provisioning through VCF Automation, in exchange for a modest virtualization overhead and a platform your VMware team already knows how to operate. Cloud, including DGX Cloud and the hyperscalers, hands you GPUs on demand with none of the capital outlay and none of the facility work, billed by the hour.

[Figure: Same stack, three platforms. The NVIDIA layer (NVIDIA AI Enterprise: NIM, NeMo, GPU Operator, models) is identical; the platform under it is the choice. Bare-metal on-prem: Kubernetes on hardware, max performance, max ops burden. VCF (Private AI): vSphere virtualization, multi-tenant + self-service, modest overhead. Cloud / DGX Cloud: rented by the hour, elastic, no capital, per-GPU-hour cost.]
Because the NVIDIA layer is portable across all three, the decision is a platform and cost decision, not a question of which one can run the models.
| Dimension | Bare metal | VCF / Private AI | Cloud |
|---|---|---|---|
| Performance | Highest | High (small overhead) | High, varies by provider |
| Multi-tenancy | DIY (MIG + namespaces) | Built in (vSphere + MIG) | Provider-managed |
| Self-service | Build it yourself | VCF Automation catalog | Console / API |
| Cost shape | Capital + ops | Capital + ops + VCF | Operating, per hour |
| Best when | Peak performance, big team | You already run VMware | Bursty or early-stage |

In practice: the VCF path wins most often not because it is the fastest but because the platform team already exists. If your organization runs vSphere, Private AI Foundation lets that same team operate GPUs with the tools, the multi-tenancy, and the governance they already use, instead of standing up a brand-new bare-metal Kubernetes practice from zero. The deployment details are in the VCF 9 deployment walkthrough.

The cost nobody fully tallies

GPU sticker price is the number everyone quotes, but it is only one part of the real bill. The total cost of an on-prem AI factory has five components, and each one was a part in this series. The GPUs and systems are the capital line. Power and cooling, the liquid-cooling reality from Part 10, are a recurring operating line that scales with every rack of Blackwell you add. NVIDIA AI Enterprise licensing, the per-GPU subscription from Part 2, is a recurring software line you cannot skip if you want support. Networking and storage, the fabric and the data path, are capital you under-budget at your peril. And people, the platform engineers who run all of it, are the line that quietly dominates at small scale.

The way to make this tractable is the framing from Part 21: reduce everything to cost per token, or cost per unit of useful work, and compare. That converts a messy capital-versus-operating argument into one number you can hold against a cloud GPU-hour rate. And the input that moves that number most is the honest utilization from Part 29, because an on-prem GPU you paid for and left at twenty percent occupancy is far more expensive per token than a cloud GPU you only rented when you needed it.
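To make the cost-per-token comparison concrete, here is a minimal sketch in Python. Every number in it (the monthly cost of an owned GPU, the cloud rate, the peak throughput) is a hypothetical placeholder rather than a quoted price, and the function names are illustrative. It assumes the owned GPU costs the same every month regardless of load, while the cloud GPU is rented only for the hours it is actually busy.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_MONTH = 730  # average month length; an assumption for illustration


def owned_cost_per_million_tokens(monthly_cost, peak_tokens_per_second, utilization):
    """Cost per million tokens for an owned GPU whose monthly cost is fixed."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens = peak_tokens_per_second * utilization * SECONDS_PER_HOUR * HOURS_PER_MONTH
    return monthly_cost / tokens * 1_000_000


def rented_cost_per_million_tokens(rate_per_gpu_hour, peak_tokens_per_second):
    """Cost per million tokens for a cloud GPU that is paid for only while busy."""
    tokens_per_hour = peak_tokens_per_second * SECONDS_PER_HOUR
    return rate_per_gpu_hour / tokens_per_hour * 1_000_000


# Hypothetical inputs: replace them with your own measured figures.
OWNED_MONTHLY_COST = 3000.0  # amortized hardware + power + licensing + people, per GPU
CLOUD_RATE = 6.0             # dollars per GPU-hour
PEAK_TPS = 2000.0            # tokens per second at full load

owned_busy = owned_cost_per_million_tokens(OWNED_MONTHLY_COST, PEAK_TPS, 0.80)
owned_idle = owned_cost_per_million_tokens(OWNED_MONTHLY_COST, PEAK_TPS, 0.20)
rented = rented_cost_per_million_tokens(CLOUD_RATE, PEAK_TPS)

for label, cost in [
    ("owned at 80% utilization", owned_busy),
    ("rented on demand", rented),
    ("owned at 20% utilization", owned_idle),
]:
    print(f"{label}: ${cost:.2f} per million tokens")
```

With these placeholder inputs the owned GPU is cheaper per token than the rented one at eighty percent utilization and several times more expensive at twenty percent. Only the ordering matters here, not the specific dollar figures.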

[Figure: Where on-prem overtakes cloud. Cost per month vs how busy the GPUs are. X axis: utilization (how busy the GPUs are); Y axis: cost. Two lines, on-prem (fixed) and cloud (per hour), cross at the break-even; cloud is cheaper below it, on-prem cheaper above it.]
On-prem cost is roughly fixed once you have bought the hardware; cloud cost climbs with use. Past the crossover, steady high utilization makes owning cheaper. Below it, renting wins.
| Cost component | Type | Covered in |
|---|---|---|
| GPUs and systems | Capital | Parts 3 to 5 |
| Power and cooling | Operating | Part 10 |
| AI Enterprise licensing | Operating | Part 2 |
| Networking and storage | Capital | Parts 7 to 9 |
| Platform engineers | Operating | Parts 12 to 15, 29 |

Worked example

A team needs steady inference for an internal product, running roughly two GPUs' worth of load around the clock. In the cloud at a typical GPU-hour rate, two GPUs running continuously for a year is a large, purely operating bill that recurs forever. On-prem, the same two GPUs are a one-time purchase plus power, a slice of an AI Enterprise subscription, and a fraction of an engineer's time.

Because the load is steady and high, on-prem crosses below cloud within the first year and keeps widening the gap after the hardware is paid off. Flip the scenario to a research team whose GPUs sit idle most nights and spike during experiments, and the math inverts: the cloud bill follows the spikes, while owned GPUs would sit paid-for and idle. Same stack, opposite answer, and the only variable that changed was utilization.
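The two scenarios above can be sketched as a small break-even calculation. This is an illustration under stated assumptions, not a pricing model: the capital cost, the amortization period, the monthly operating cost, and the cloud rate are all hypothetical, and the helper names are mine. The on-prem cost is treated as fixed and the cloud cost as proportional to utilization, which matches the shape described in this article.

```python
HOURS_PER_MONTH = 730  # average month length; an assumption for illustration


def monthly_cloud_cost(gpus, utilization, rate_per_gpu_hour):
    """Cloud bill for a month when GPUs are rented only while busy."""
    return gpus * utilization * HOURS_PER_MONTH * rate_per_gpu_hour


def monthly_onprem_cost(capex, amortization_months, opex_per_month):
    """Fixed monthly cost of owning: amortized hardware plus recurring costs."""
    return capex / amortization_months + opex_per_month


def break_even_utilization(onprem_monthly, gpus, rate_per_gpu_hour):
    """Utilization at which the cloud bill equals the fixed on-prem cost."""
    return onprem_monthly / (gpus * HOURS_PER_MONTH * rate_per_gpu_hour)


def payback_months(capex, cloud_monthly, opex_per_month):
    """Months until avoided cloud spend repays the hardware, or None if never."""
    monthly_saving = cloud_monthly - opex_per_month
    if monthly_saving <= 0:
        return None
    return capex / monthly_saving


# Hypothetical inputs for a two-GPU deployment.
GPUS = 2
CAPEX = 60_000.0   # hardware purchase
AMORTIZATION = 36  # months
OPEX = 2_500.0     # power, licensing slice, fraction of an engineer, per month
RATE = 6.0         # dollars per GPU-hour in the cloud

onprem = monthly_onprem_cost(CAPEX, AMORTIZATION, OPEX)
crossover = break_even_utilization(onprem, GPUS, RATE)

steady_cloud = monthly_cloud_cost(GPUS, 1.0, RATE)  # inference around the clock
bursty_cloud = monthly_cloud_cost(GPUS, 0.2, RATE)  # research team, mostly idle

print(f"on-prem monthly cost: ${onprem:,.0f}")
print(f"break-even utilization: {crossover:.0%}")
print(f"steady load in cloud: ${steady_cloud:,.0f} per month")
print(f"bursty load in cloud: ${bursty_cloud:,.0f} per month")
print(f"payback, steady load: {payback_months(CAPEX, steady_cloud, OPEX)}")
print(f"payback, bursty load: {payback_months(CAPEX, bursty_cloud, OPEX)}")
```

With these placeholder inputs the steady load costs more in the cloud than on-prem and pays back the hardware in under a year, while the bursty load never pays it back. Changing only the utilization argument flips the answer, which is the point of the example.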

The arc, in one pass

It is worth seeing the whole journey at once, because the order was the argument. The series moved through seven phases, each building on the last. Foundations established what the stack is and how the GPUs and AI-factory systems are built. GPU infrastructure covered partitioning, the NVLink scale-up fabric, the InfiniBand and Spectrum-X scale-out fabric, the storage data path, and the power and cooling reality. The software platform took us through drivers, CUDA, the GPU and Network Operators, the NGC catalog, and air-gapped lifecycle. Inference covered NIM, TensorRT-LLM, Triton, Dynamo, and the economics that decide cost per token.

From there the series climbed the value chain. Customization covered the NeMo framework, LoRA and SFT and RLHF, data curation, and resilient multi-node training. Models and agents covered the open Nemotron family, NeMo Retriever and grounded RAG, and the Blueprints and agent toolkit that make agents observable. Operations closed it with DCGM and honest utilization, and this finale with the platform and cost decision. The shape is deliberate: you cannot reason about cost per token without inference, you cannot trust an agent without observability, and you cannot pick a platform without honest utilization. Each phase was a prerequisite for the verdict you are reading now.

[Figure: Seven phases, in order. Each phase is a prerequisite for the next: 1 Foundations, 2 GPU infra, 3 Software, 4 Inference, 5 Customize, 6 Models/agents, 7 Operations. Hardware at the left, the platform-and-cost verdict at the right.]
The full guide lives on the pillar page; this map is the one-screen version of how the thirty parts connect.
The recurring lesson: across every phase, the same theme held. The components are excellent and the integration is genuinely done, so the risk is never that the technology cannot work. The risk is that you deploy it without measuring it, size it on the wrong number, or buy capacity you will not keep busy. The teams that succeed with this stack are not the ones with the biggest cluster; they are the ones who instrument it, route work to the cheapest model that can do the job, and let utilization decide what they own. Pick the platform deliberately and measure everything, and the stack rewards you. Skip that discipline and it will quietly bill you for the privilege.

What I would actually choose

My recommendation, stated plainly. If you already operate VMware and have steady GPU demand and data that should stay on-premises, run the stack on VCF with Private AI Foundation; you reuse your platform team and get multi-tenancy and self-service without building them. Why not: if you have a strong bare-metal Kubernetes practice and chase the last few percent of performance, skip the virtualization layer. If your demand is bursty or you are still finding the use case, start in the cloud and move on-prem only once utilization is high enough to cross the break-even. What to validate first, before any of it: your honest utilization and your data-residency requirement, because those two numbers decide the platform more than any benchmark.
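For readers who prefer the recommendation as a rule they can argue with, here is one possible encoding in Python. It is a sketch of my reading of the paragraph above, not an official decision tool. The function name, the parameter names, and the handling of the cases the text does not spell out (data that must stay on-premises despite low utilization, or steady demand with neither a VMware team nor a bare-metal practice) are assumptions, and they are marked as such in the comments.

```python
def recommend_platform(
    utilization,
    break_even,
    runs_vmware,
    data_must_stay_onprem,
    strong_bare_metal_practice,
):
    """Return a platform suggestion from the two deciding numbers and team shape.

    utilization and break_even are fractions between 0 and 1; break_even is
    whatever crossover your own cost model produces.
    """
    steady = utilization >= break_even

    # Bursty or exploratory demand with no residency constraint: rent.
    if not steady and not data_must_stay_onprem:
        return "cloud"

    # Assumption: a hard residency requirement forces on-prem even when
    # utilization is below break-even; the article does not state this case.
    if strong_bare_metal_practice:
        return "bare-metal Kubernetes"
    if runs_vmware:
        return "VCF with Private AI Foundation"

    # Assumption: with neither team in place the platform choice stays open.
    return "on-prem, platform to be decided"


if __name__ == "__main__":
    print(recommend_platform(0.85, 0.5, True, True, False))
    print(recommend_platform(0.15, 0.5, True, False, False))
    print(recommend_platform(0.90, 0.5, False, False, True))
```

The ordering of the checks is a design choice: utilization and data residency are evaluated first because those are the two inputs this article says decide the platform, and team shape only picks between the on-prem options.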

The mistake I see most is buying an on-prem AI factory on excitement and running it at twenty percent occupancy, paying owner prices for renter usage. Build to own only when you will keep the GPUs busy. Until then, rent, and instrument relentlessly so you know the day the math flips.

The verdict on the whole stack

Thirty parts come down to a few honest claims. NVIDIA's advantage is not any single chip; it is that the whole stack, silicon, NVLink and the fabric, the GPU Operator, NIM and NeMo, the models, and the operational tooling, fits together and is supported as one platform. That coherence is real and it is worth paying for when you are running at scale, because the integration work it saves is the work that otherwise sinks projects. The flip side is lock-in and cost, and the discipline that keeps you honest is measurement: cost per token, SM activity, retrieval recall, profiled agents. Every part of this series pointed back to the same lesson, which is that the engineering is solved and the operations are where success is won or lost.

So here is the verdict. The NVIDIA AI stack is the most complete way to build an AI factory today, on-prem or on VCF, and for an enterprise with real workloads and real data-residency needs it is the safe, defensible choice. It is not the cheapest way to run a small or bursty workload; that is still the cloud. Decide with your utilization and your data, not with the hype, and you will build something that pencils out. If you take one thing from this whole series, take this: pick the platform with the cost model, instrument everything, and let the numbers, not the marketing, tell you when to own and when to rent. That is the end of the series; start your worksheet, and build accordingly.

Where to go from here: if you are starting fresh, begin with the two numbers that decide everything, your honest utilization and your data-residency requirement, and let them pick your platform before you pick a GPU. If you are already running, turn on the DCGM profiling fields and the agent profiler this week and look at the gap between what you are paying for and what you are actually using, because that gap is what funds the next decision. And if your road runs through VMware, the Private AI series picks up exactly where this one ends, with the VCF-specific deployment, sizing, and governance. Thirty parts on, the through-line is simple: the stack is ready, the integration is done, and the discipline of measuring before you spend is the part only you can supply. Thank you for reading the whole series, and build accordingly.


References

NVIDIA AI Enterprise
VMware Private AI Foundation with NVIDIA on VCF 9.0
NVIDIA DGX Cloud


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
