TL;DR · Key Takeaways
- Size memory in three buckets: weight memory (2 bytes per parameter at FP16), KV cache (scales with context length times concurrency), and a 10-15% runtime overhead. KV cache, not weights, is what caps your real concurrency.
- For most enterprise RAG and 7B-to-13B serving, the L40S (48 GB) is the value sweet spot. Reach for H100/H200 only when the model reaches ~34B or larger, context runs long, or concurrency is high.
- Software is the silent line item. NVIDIA AI Enterprise (per GPU per year) plus the Private AI Foundation add-on (per core) routinely rivals the amortised hardware cost over three years. Model it explicitly.
- Size for steady-state p95 load with a queue, not for a peak concurrency you will hit twice a year. Over-provisioning GPUs is the most expensive mistake in this stack.
Every Private AI sizing conversation I walk into starts in the wrong place. Someone has already picked the GPU. “We bought eight H100s, now help us size the platform.” The GPU is the most expensive decision in the rack and it is usually the one made with the least math. So let us do the math first, then talk about what it costs, because on a PAIF platform the bill that surprises people is rarely the silicon. It is the software subscription that rides on top of it for the next three years.
Steps 1 and 2: the memory math you can do on a napkin
GPU memory for an inference workload comes in three buckets, and you can estimate all of them before you talk to a vendor.
Weight memory is the easy one. At FP16, a model needs roughly 2 bytes per parameter. A 7B model is about 14 GB, a 13B model about 26 GB, a 34B model about 68 GB, and a 70B model about 140 GB. Quantise to FP8 and you halve that; quantise weights to INT4 and a 70B model drops to roughly 35 GB. Quantisation is not free: it costs you a little accuracy. But for retrieval-augmented workloads where the model is grounded on your own documents, FP8 weights are usually an easy call, and I recommend them by default for anything 34B and up.
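The weight bucket is plain arithmetic, so it is worth writing down once. Here is a minimal sketch in Python, assuming decimal gigabytes (1 GB = 10^9 bytes). The helper name `weight_memory_gb` is hypothetical, and the bytes-per-parameter values are nominal figures that ignore quantisation scales and other metadata.

```python
# Nominal bytes per parameter for each weight precision (illustrative values).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str = "fp16") -> float:
    """Approximate weight memory in GB, taking 1 GB as 1e9 bytes."""
    # (params_billion * 1e9 parameters) * bytes / 1e9 bytes-per-GB
    # simplifies to a single multiplication.
    return params_billion * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for size in (7, 13, 34, 70):
        print(
            f"{size}B: "
            f"fp16={weight_memory_gb(size, 'fp16'):.1f} GB, "
            f"fp8={weight_memory_gb(size, 'fp8'):.1f} GB, "
            f"int4={weight_memory_gb(size, 'int4'):.1f} GB"
        )
```

Treat the result as a floor, not a budget: it covers weights only, before cache and runtime overhead.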
KV cache is the bucket everyone forgets, and it is the one that actually decides how many concurrent users a GPU can hold. The cache grows with the number of tokens in flight, which is context length multiplied by concurrent requests. For a 70B model, a single 8K-token context costs roughly 2.5 GB of FP16 cache; push that to 32K and it is about 10 GB; at 128K you are looking at roughly 40 GB for the KV cache of a single long request. Now multiply by ten concurrent users at 32K context and you need around 100 GB of cache on top of the roughly 140 GB of weights. That is why a 70B model at FP16 with real concurrency wants four H100-class GPUs, not two. The weights fit across two 80 GB cards; the cache does not.
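The per-request cache figures can be reproduced from the model shape. The sketch below assumes a Llama-style 70B architecture with grouped-query attention (80 layers, 8 KV heads, head dimension 128) and an FP16 cache. Those shape values are my assumption, not something the sizing above depends on, so substitute the values from your own model's config. The function name `kv_cache_gib` is hypothetical, and sizes are reported in GiB (2^30 bytes).

```python
def kv_cache_gib(
    tokens: int,
    layers: int = 80,        # assumed: Llama-style 70B
    kv_heads: int = 8,       # assumed: grouped-query attention
    head_dim: int = 128,     # assumed
    bytes_per_elem: int = 2  # FP16 cache
) -> float:
    """KV cache size in GiB for a given number of tokens in flight."""
    # Each token stores one key and one value vector per layer per KV head.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * bytes_per_token / 2**30


if __name__ == "__main__":
    for context in (8 * 1024, 32 * 1024, 128 * 1024):
        print(f"{context:>7} tokens: {kv_cache_gib(context):.1f} GiB per request")

    # Tokens in flight = context length x concurrent requests.
    concurrent = 10
    print(f"{concurrent} x 32K: {kv_cache_gib(concurrent * 32 * 1024):.1f} GiB")
```

Under these assumptions the function gives 2.5, 10, and 40 GiB for a single 8K, 32K, and 128K request. A model without grouped-query attention (one KV head per attention head) would need several times more, which is why the shape parameters matter as much as the parameter count.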
Runtime overhead is the third bucket: CUDA context, the serving framework (vLLM, the NIM container, Infinity for embeddings), activation buffers, and fragmentation. Budget 10 to 15% of card memory and do not let anyone tell you a model “fits” because the weights come in under the VRAM number on the box. A 48 GB L40S does not serve a 26 GB model to fifty users; the weights leave you ~22 GB for cache and overhead, and that ceiling arrives fast.
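Putting the three buckets together gives a rough concurrency ceiling. This is a minimal sketch under simplifying assumptions: overhead is taken as a flat fraction of total card memory (12.5%, the midpoint of the 10 to 15% range), every session is assumed to hold its full context in cache, and multi-card setups are treated as one pooled memory space, which ignores per-card fragmentation under tensor parallelism. The function name `max_sessions` is hypothetical.

```python
import math


def max_sessions(
    card_gb: float,
    num_cards: int,
    weights_gb: float,
    kv_per_session_gb: float,
    overhead_frac: float = 0.125,  # assumed midpoint of the 10-15% range
) -> int:
    """Rough upper bound on concurrent sessions that fit in GPU memory."""
    total_gb = card_gb * num_cards
    # Memory left for KV cache once overhead and weights are paid for.
    cache_budget_gb = total_gb * (1 - overhead_frac) - weights_gb
    if cache_budget_gb <= 0:
        return 0
    return math.floor(cache_budget_gb / kv_per_session_gb)


if __name__ == "__main__":
    # 70B at FP16 (about 140 GB of weights), 32K context (about 10 GB per session).
    for cards in (2, 4):
        sessions = max_sessions(80, cards, weights_gb=140, kv_per_session_gb=10)
        print(f"{cards} x 80 GB cards: {sessions} sessions")
```

With these inputs, two 80 GB cards leave no cache budget at all once overhead is counted, while four cards leave room for about fourteen 32K sessions. Paged or shared-prefix caching in the serving framework can do better than this bound, so validate against a benchmark.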
Step 3: choosing the GPU (and how to partition it)
Once you know the memory footprint, GPU selection is mostly a lookup. My default for enterprise Private AI work is the L40S, and I will defend that against the reflex to buy H100s. For 7B to 13B models serving RAG and chat at moderate concurrency, the L40S delivers the tokens per second the business needs at roughly a third of the capital cost, and you can put more cards in more hosts for the same money, which buys you availability instead of raw single-model throughput. The detailed card-by-card comparison lives in the GPU selection part of this series; here the point is the sizing decision, not the spec sheet.
Reach for H100 or H200 in three situations: the model is large enough that weights plus cache will not fit a 48 GB card even quantised (roughly 34B and up at useful concurrency), the context window is long (64K and beyond, where KV cache dominates and H200’s 141 GB earns its premium), or you need tensor-parallel throughput on a single large model that one card cannot feed. Outside those cases, an H100 serving a 7B chatbot is an expensive way to leave 60 GB idle.
One sizing opinion that saves real money: do not time-slice GPUs for production inference. Time-sliced vGPU is excellent for data-science notebooks and dev sandboxes where bursty, tolerant workloads share a card. For a latency-sensitive serving endpoint it introduces tail latency under contention that your p99 will hate. Use MIG for hard isolation with predictable performance, or give the workload a full card. The partitioning trade-offs are covered in depth in the GPU partitioning part of this series, but the sizing rule is short: notebooks share, endpoints do not.
The sizing matrix: model to card to concurrency
Here is the starting-point matrix I hand clients. Treat these as design anchors to validate with your own benchmark, not guarantees; real throughput depends on prompt length, output length, batch settings, and the serving framework. The concurrency figures assume FP16 weights unless noted, a 4K to 8K working context, and headroom for cache.
| Workload | Model size | Recommended GPU | Partition | Rough concurrent sessions |
|---|---|---|---|---|
| Embeddings / RAG retrieval | < 1B (e5, bge) | 1x L40S | MIG or shared | High; embeddings are cheap |
| Chat / assistant | 7B-8B | 1x L40S | Full card | ~30-60 |
| RAG with reasoning | 13B | 1x L40S (FP8) or 1x H100 | Full card | ~15-30 (L40S), ~50+ (H100) |
| Code / heavier assistant | 34B | 1x H100 (FP8) | Full card | ~20-40 |
| Flagship / long context | 70B | 2-4x H100 or 2x H200 | Tensor parallel, passthrough | ~20-40, context-dependent |
Notice the pattern. The jump from a 13B model on one L40S to a 70B model on four H100s is not a 5x cost increase, it is closer to 15x once you count the cards and the per-GPU software. Model choice is a budget decision disguised as a quality decision, and a well-tuned 13B on your own grounded data often beats an under-provisioned 70B that queues under load. That is the single most useful thing I tell clients in a sizing workshop.
Step 4: from GPUs to hosts to a cluster
GPU count is not host count. A GPU workload domain needs hosts that can physically and electrically carry the cards: PCIe lanes and slot width, a power budget that accounts for 350W (L40S) to 700W (H100/H200 SXM) per GPU before the rest of the server, and cooling that the data centre can actually deliver. Two H100 SXM cards plus CPUs and NVMe will push a node past 1.5 kW under load, and I have watched more than one project stall because the rack had power for the servers but not for the GPUs.
Then add the platform tax that has nothing to do with AI. You still need N+1 for maintenance and failure, so a three-host minimum for the GPU domain is realistic even if two hosts hold all the cards you need, because vSphere has to drain a host for patching without taking your inference endpoint down. You need vSAN or external storage sized for model artifacts (a model store with several 30-to-140 GB models adds up), and you need network headroom for model pulls from the NGC catalog or your local registry. None of this is the GPU, and all of it is in the bill.
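The GPU-to-host step can be sketched in a few lines. The helper names below are hypothetical, and the 500 W base draw for CPUs, memory, NVMe, and fans is a placeholder of mine, not a measured figure; use your server vendor's power calculator for real numbers.

```python
import math


def node_power_watts(gpus_per_host: int, gpu_tdp_w: float, base_w: float = 500.0) -> float:
    """Peak node draw: GPUs at TDP plus an assumed base for the rest of the server."""
    return gpus_per_host * gpu_tdp_w + base_w


def gpu_hosts_needed(
    gpu_count: int,
    gpus_per_host: int,
    spare_hosts: int = 1,  # N+1 for maintenance and failure
    min_hosts: int = 3,    # realistic floor for the GPU domain
) -> int:
    """Hosts required to carry the cards, plus spare capacity, with a floor."""
    working_hosts = math.ceil(gpu_count / gpus_per_host)
    return max(working_hosts + spare_hosts, min_hosts)


if __name__ == "__main__":
    print(node_power_watts(2, 700), "W for a two-GPU H100 SXM node")
    print(gpu_hosts_needed(8, 4), "hosts for 8 GPUs at 4 per host")
    print(gpu_hosts_needed(4, 4), "hosts for 4 GPUs at 4 per host")
```

Note that the spare host in this sketch is counted as a host only. Whether it also carries GPUs, so that a drained host's workload has cards to land on, is a design decision that changes both the GPU count and the per-GPU licence count.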
Step 5: the cost model nobody runs until it is too late
Now the part that actually decides whether the project gets funded. A PAIF platform has four cost layers, and the GPU is only the first. Street prices move, so use these as order-of-magnitude anchors and get current quotes: an L40S lands around $8K to $10K, an H100 around $25K to $30K, and an H200 around $30K to $40K per card, with an 8-GPU H200 server quoted in the $300K range fully built.
On top of hardware sit three software and operating layers. NVIDIA AI Enterprise (now folded into the unified NVIDIA Enterprise SKU as of late 2025) is licensed per GPU. It is commonly quoted in the low four figures per GPU per year, with multi-year subscription terms discounted below that, and it is mandatory for the supported NIM and operator stack. The Private AI Foundation add-on is licensed on top of your VCF subscription and, like VCF 9, follows a per-core model across the GPU hosts. And then power: a single H100 pulling 700W at, say, $0.15 per kWh, running continuously, is on the order of $900 a year in electricity alone, before cooling overhead, and you multiply that across every card.
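The electricity figure is easy to check. This is a minimal sketch assuming the card runs at full TDP all year. The `pue` parameter is my addition to fold in cooling overhead, and the 1.5 used in the example is an illustrative value, not a measurement.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def annual_power_cost(watts: float, price_per_kwh: float, pue: float = 1.0) -> float:
    """Yearly electricity cost for a continuously loaded device.

    pue scales the IT load to include facility overhead such as cooling;
    1.0 means electricity for the card alone.
    """
    kwh_per_year = watts / 1000 * HOURS_PER_YEAR
    return kwh_per_year * price_per_kwh * pue


if __name__ == "__main__":
    card_only = annual_power_cost(700, 0.15)
    with_cooling = annual_power_cost(700, 0.15, pue=1.5)  # illustrative PUE
    print(f"H100 at 700 W, card only: ${card_only:,.0f} per year")
    print(f"Same card with PUE 1.5:  ${with_cooling:,.0f} per year")
```

The card-only case works out to about $920 a year, in line with the "order of $900" figure.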
The practical consequence: a sizing decision that adds two H100s to “be safe” does not just add ~$60K of hardware once. It adds the per-GPU software and a share of the power every year you run them. Right-sizing is therefore worth far more than it looks on the capex line, because the GPU you do not buy saves you three times: capital, subscription, and watts. This is why I push hard against the “round up to be safe” instinct and toward sizing for the p95 steady-state load with a request queue to absorb spikes.
| Cost layer | Basis | Cadence | What drives it up |
|---|---|---|---|
| GPU + servers | per card / per host | one-time capex | GPU class, card count, N+1 hosts |
| NVIDIA AI Enterprise | per GPU | annual subscription | number of GPUs, not their size |
| PAIF add-on + VCF | per core | annual subscription | core count across GPU hosts |
| Power + cooling | per kW | ongoing opex | card TDP, utilisation, PUE |
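The four layers in the table can be rolled into one small model. The sketch below is a starting point, not a pricing tool. Every number in the example is a placeholder chosen for round arithmetic, in particular the per-GPU and per-core subscription prices, which you should replace with current quotes. The class and field names are hypothetical, and power is counted for the GPUs only.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass
class PlatformCosts:
    gpu_count: int
    gpu_price: float           # per card, one-time
    server_capex: float        # hosts and everything that is not a GPU, one-time
    nvaie_per_gpu_year: float  # placeholder: per-GPU subscription
    per_core_year: float       # placeholder: PAIF add-on + VCF, per core
    licensed_cores: int        # cores across the GPU hosts
    gpu_watts: float
    price_per_kwh: float
    pue: float = 1.0

    def capex(self) -> float:
        return self.gpu_count * self.gpu_price + self.server_capex

    def annual_software(self) -> float:
        per_gpu = self.gpu_count * self.nvaie_per_gpu_year
        per_core = self.licensed_cores * self.per_core_year
        return per_gpu + per_core

    def annual_power(self) -> float:
        kwh = self.gpu_count * self.gpu_watts / 1000 * HOURS_PER_YEAR
        return kwh * self.price_per_kwh * self.pue

    def total(self, years: int = 3) -> float:
        return self.capex() + years * (self.annual_software() + self.annual_power())


# Placeholder inputs only; none of these are quoted prices.
example = PlatformCosts(
    gpu_count=4, gpu_price=25_000, server_capex=50_000,
    nvaie_per_gpu_year=2_000, per_core_year=300, licensed_cores=96,
    gpu_watts=700, price_per_kwh=0.15,
)

if __name__ == "__main__":
    print(f"Capex:           ${example.capex():,.0f}")
    print(f"Software / year: ${example.annual_software():,.0f}")
    print(f"Power / year:    ${example.annual_power():,.0f}")
    print(f"3-year total:    ${example.total(3):,.0f}")
```

Re-running the model with `gpu_count` raised by two shows the effect described above: the extra cards appear in capex once and in the software and power lines every year.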
Build on PAIF or rent from a cloud? My take
Every sizing exercise eventually hits the question of whether to build this on-premises at all. The honest framing is utilisation. Rented GPU capacity bills by the hour, so it wins decisively for bursty, experimental, or short-lived workloads where your cards would otherwise sit idle. An on-premises PAIF platform wins when utilisation is high and sustained, because at a steady high duty cycle the hourly rental adds up past the amortised on-prem cost within a year or two. You also get data residency, predictable latency, and no egress surprises as a bonus, and for regulated workloads that bonus is often the actual reason you are reading this series.
So the verdict is not ideological. If your GPUs will run above roughly 50-60% duty cycle on steady production inference, and especially if data cannot leave your walls, build it on PAIF and size it carefully. If your demand is spiky, exploratory, or you genuinely cannot predict it yet, rent first, measure the real utilisation curve for a quarter, and let that curve size the on-prem platform you eventually build. Sizing without a utilisation measurement is guessing, and guessing in this stack is expensive in three currencies at once.
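The build-or-rent threshold comes down to one division. This is a minimal sketch assuming rental is billed only for the hours a GPU is actually in use and on-prem cost is a fixed annual figure per GPU (amortised hardware plus subscription plus power). Both inputs in the example are placeholders picked for round arithmetic, not market prices, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def breakeven_duty_cycle(
    onprem_annual_cost_per_gpu: float,
    rental_rate_per_gpu_hour: float,
) -> float:
    """Fraction of the year a GPU must be busy for owning to match renting."""
    full_year_rental = rental_rate_per_gpu_hour * HOURS_PER_YEAR
    return onprem_annual_cost_per_gpu / full_year_rental


def cheaper_option(
    duty_cycle: float,
    onprem_annual_cost_per_gpu: float,
    rental_rate_per_gpu_hour: float,
) -> str:
    """Return 'build' when the measured duty cycle exceeds the break-even point."""
    threshold = breakeven_duty_cycle(onprem_annual_cost_per_gpu, rental_rate_per_gpu_hour)
    return "build" if duty_cycle > threshold else "rent"


if __name__ == "__main__":
    # Placeholder inputs: $17,520 per GPU per year on-prem, $4.00 per GPU-hour rented.
    threshold = breakeven_duty_cycle(17_520, 4.00)
    print(f"Break-even duty cycle: {threshold:.0%}")
    for measured in (0.2, 0.7):
        print(f"Measured {measured:.0%}: {cheaper_option(measured, 17_520, 4.00)}")
```

The useful input here is the measured duty cycle. The model ignores egress fees, reserved-instance discounts, and data-residency constraints, any of which can settle the question before the arithmetic does.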
The Bottom Line
Size the workload before the silicon, budget memory in three buckets and respect the KV cache, default to the L40S until the model or the concurrency forces an H100/H200, and build the cost model with all four layers because the recurring per-GPU and per-core software is the part that ambushes the business case. Most of all, size for the load you actually have, not the load you are afraid of. The cheapest GPU in any Private AI design is the one you talked yourself out of buying.
What is the GPU duty cycle on your current AI workloads, and have you ever actually measured it? If not, that is the number to go find before the next sizing meeting.