Dr. Pranay Jha

Running GPU and AI Workloads on VKS (VKS Series, Part 14)

GPUs are where VKS stops being interchangeable with generic Kubernetes. Here is the vGPU VM class, the GPU Operator, and how VKS becomes the substrate for VMware Private AI.

VKS Series · Part 14 of 17

TL;DR · Key Takeaways

  • GPU on VKS is the strongest VKS-specific use case. A custom vGPU or passthrough VM class gives worker nodes GPU access, and your GPU pool is just another node pool.
  • Inside the cluster, the NVIDIA GPU Operator manages drivers and the device plugin so pods request GPUs with a standard nvidia.com/gpu resource limit.
  • VKS clusters are the Kubernetes substrate underneath VMware Private AI Foundation. The GPU-cluster mechanics here are exactly what Private AI builds on.
  • The hard GPU-domain choices (vGPU vs MIG vs passthrough, GPU selection, sizing) belong to the Private AI series. This part stays on the VKS-cluster wiring.

Who this is for: teams running inference or training on VKS, and anyone evaluating VKS as the base for VMware Private AI.

Prerequisites: Part 5 (VM classes and node pools), GPU-certified hosts, and a vGPU VM class published to your namespace.

Everything in this series so far applies to any workload. This part is different, because GPUs are where VKS stops being interchangeable with generic Kubernetes and becomes a genuine reason to choose it. The same VM-class model that sizes ordinary nodes is what attaches NVIDIA GPUs to worker nodes, and the same cluster you provisioned in Part 4 becomes the substrate for serious AI workloads. I will keep this on the VKS-cluster mechanics and point to the Private AI series for the GPU-domain depth rather than duplicate it.

The GPU stack on a VKS cluster

A GPU-accelerated VKS cluster starts with a custom VM class. The vSphere administrator creates a VM class carrying an NVIDIA vGPU profile (a slice of a physical GPU) or a passthrough / dynamic DirectPath I/O profile (a whole GPU dedicated to the VM), makes it available in the namespace, and you build a worker node pool from it. From the cluster’s point of view this is just the node-pool pattern from Part 5: one more pool whose nodes happen to have GPUs. NVIDIA vGPU on VKS runs on NVIDIA GPU-certified servers, and the GPU profile is read from the device. Hardware access is only half the job, though: the cluster also has to schedule GPU pods, and that is the NVIDIA GPU Operator’s role.
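To make the "one more pool" point concrete, here is a minimal sketch of the workers section of a Cluster manifest with an ordinary pool and a GPU pool side by side. It assumes the Cluster API v1beta1 layout and a cluster class that exposes a vmClass variable with per-pool overrides; the pool names and the VM class name vgpu-worker-class are hypothetical, and the variable names should be checked against the cluster class in your VKS release.

```yaml
# Fragment of spec.topology in a cluster.x-k8s.io/v1beta1 Cluster (assumed layout).
workers:
  machineDeployments:
    - class: node-pool
      name: general-pool            # hypothetical: ordinary CPU workers
      replicas: 3
    - class: node-pool
      name: gpu-pool                # hypothetical: the GPU node pool
      replicas: 2
      variables:
        overrides:
          - name: vmClass           # assumed variable name; verify for your cluster class
            value: vgpu-worker-class  # hypothetical custom VM class published to the namespace
```

The only thing that distinguishes the GPU pool is the VM class override. Scaling, upgrades and the rest of the node-pool lifecycle stay the same as for any other pool.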

The GPU stack, bottom to top (listed here from the top layer down):

  • GPU pod: requests nvidia.com/gpu: 1 under resources.limits
  • NVIDIA GPU Operator + device plugin: driver lifecycle, and advertises GPUs to the scheduler
  • GPU worker node pool: built from the custom VM class (Part 5)
  • Custom vGPU / passthrough VM class: profile read from the device
  • NVIDIA GPU-certified host

Hardware up to pod: the VM class attaches the GPU, the operator makes it schedulable, the pod just asks for one.

Installed into the VKS cluster (typically via its Helm chart from the NGC catalog), the GPU Operator manages the GPU driver lifecycle on the GPU nodes and runs the Kubernetes device plugin that advertises GPUs as a schedulable resource. Once it is healthy, a pod requests a GPU with a standard resource limit and Kubernetes places it on a node from your GPU pool:

resources:
  limits:
    nvidia.com/gpu: 1   # scheduled onto a vGPU / passthrough node pool

For connected and disconnected (air-gapped) environments the installation path differs, and the Private AI documentation covers provisioning a GPU-accelerated VKS cluster in both. The GPU Operator install and driver matching are the parts worth rehearsing, because a driver or operator version mismatch is the most common reason GPU pods fail to schedule.
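A quick way to rehearse the result is a smoke-test pod that asks for one GPU and runs a small CUDA sample. The sketch below is illustrative only: the pod name is hypothetical, and the image tag is the CUDA vectorAdd sample as I recall it from NVIDIA's GPU Operator documentation, so confirm it still exists in NGC (or mirror it to your private registry if you run disconnected).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test              # hypothetical name
spec:
  restartPolicy: Never              # run once, then inspect status and logs
  containers:
    - name: cuda-vectoradd
      # Assumed image tag; verify in NGC or replace with your mirrored copy.
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1         # placed only on a node advertising a GPU
```

If the pod stays Pending, the GPU resource is not being advertised on any node, which usually points back at the operator or driver rather than at the workload.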


VKS as the substrate for Private AI

This is the strategic point. VMware Private AI Foundation with NVIDIA does not replace VKS; it builds on it. The GPU-accelerated VKS clusters described here are the Kubernetes layer Private AI uses to run model-serving and training workloads, and VCF Automation can provision these GPU clusters as self-service catalog items so data scientists get a cluster without writing YAML. Everything in this part is the foundation, and Private AI is the governed, self-service platform layered on top of it.

Because of that, I am deliberately not re-litigating the GPU-domain decisions here. Whether to use vGPU, MIG or passthrough, which GPU to buy, and how to size GPU memory are covered properly in the Private AI series: see "GPU partitioning: vGPU vs MIG vs passthrough" and "Installing the GPU Operator and vGPU drivers". The whole Private AI complete guide picks up where this part leaves off.

Match the driver to the operator: the single most common reason GPU pods sit unscheduled is a mismatch between the vGPU guest driver and the GPU Operator version. Pin and test that pairing before you scale a GPU pool, and rehearse the disconnected install path separately if you run air-gapped.

vGPU, passthrough and time-slicing: the cluster-side view

From the cluster operator’s seat, the GPU sharing mode you choose changes what the scheduler sees, and that is worth understanding even though the full decision lives in the Private AI series. With full passthrough or a dynamic DirectPath device, a node gets a whole physical GPU and a pod that requests one gets exclusive use of it, which is what you want for training or heavy inference that needs the entire card. With vGPU, a physical GPU is sliced into profiles and a node presents a vGPU, so several VMs can share one card, which suits many smaller inference workloads that would waste a whole GPU each. Time-slicing and MIG add further ways to share a GPU among pods, and they surface to Kubernetes as more schedulable GPU resources, so the scheduler can place several pods on what is physically one device.
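As an illustration of how a sharing mode changes what the scheduler sees, here is a sketch of a time-slicing configuration in the format the NVIDIA device plugin accepts. The ConfigMap name, the namespace and the replica count of 4 are assumptions for the example; the GPU Operator also has to be pointed at this ConfigMap (in the Helm chart that is the devicePlugin.config.name value, as I understand it).

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config         # hypothetical name
  namespace: gpu-operator           # assumed operator namespace
data:
  any: |-                           # "any" is an arbitrary label for this config entry
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4             # each physical GPU is advertised as 4 schedulable GPUs
```

With this applied, a node holding one GPU advertises nvidia.com/gpu: 4, so four pods that each request one GPU can land on the same device. Time-slicing does not partition GPU memory between them, which is where the contention comes from.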

The practical consequence is utilisation versus isolation. Sharing modes pack more workloads onto fewer cards and lift utilisation, which matters because GPUs are the most expensive thing in the rack, but they also mean workloads share a device and can contend. Exclusive access gives clean performance and isolation at the cost of stranding capacity when a workload does not use the whole card. Match the mode to the workload: exclusive for the heavy, latency-sensitive or training jobs, shared for the long tail of small inference. Get it wrong in the expensive direction and you have a rack of half-idle GPUs; get it wrong in the cheap direction and your latency-sensitive model is fighting a neighbour for the same silicon.

Driver and operator lifecycle across upgrades

The single most common reason a GPU cluster breaks is not the hardware; it is a version mismatch in the software chain. The vGPU guest driver, the GPU Operator, the host-side vGPU manager and the Kubernetes version all have compatibility relationships. A Kubernetes upgrade that moves the cluster forward without moving the GPU Operator and drivers to a matching, supported combination leaves GPU pods unschedulable while the rest of the cluster looks fine. This makes GPU clusters the ones to test most carefully before an upgrade, because the failure is silent: the cluster upgrades cleanly, and only the GPU workloads stop being placed, often noticed hours later by a data science team wondering why their jobs are pending.

Treat the GPU software stack as a versioned unit. Pin the driver and operator versions, validate the combination against the target Kubernetes release before you upgrade, and rehearse the whole thing on a GPU test cluster first. If you run disconnected, rehearse the air-gapped install path separately, because mirroring the GPU Operator images and the driver into a private registry is its own exercise and not something to discover during a production change.
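One way to treat the stack as a versioned unit is to keep the pinned values in a Helm values file under version control, next to the chart version you install with. The sketch below uses value keys from the GPU Operator Helm chart as I know them (driver.repository, driver.version, driver.imagePullSecrets, driver.licensingConfig.configMapName); the registry host, secret name and ConfigMap name are hypothetical, and the version string is deliberately left as a placeholder.

```yaml
# gpu-operator-values.yaml (hypothetical file name)
# Pin the chart itself at install time with: helm install ... --version <chart-version>
driver:
  repository: registry.example.internal/nvidia   # hypothetical private registry holding the vGPU guest driver image
  version: "<vgpu-guest-driver-version>"         # placeholder: the driver build validated against your host vGPU manager
  imagePullSecrets:
    - registry-secret                            # hypothetical pull secret
  licensingConfig:
    configMapName: licensing-config              # hypothetical ConfigMap carrying the vGPU licensing client configuration
```

Changing any one of these values, or the chart version, then becomes a reviewed change that can be rehearsed on the GPU test cluster before it reaches production.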

Monitoring GPU utilisation and cost

GPUs are too expensive to run blind, so the one piece of observability you must add for a GPU pool is GPU utilisation itself, exposed through NVIDIA’s Data Center GPU Manager (DCGM) and scraped into your metrics stack. Without it you cannot tell a pool that is genuinely saturated and needs more cards from a pool that is half-idle because workloads grabbed whole GPUs they barely use. That distinction is a direct cost decision (more cards, or better sharing), and you cannot make it from CPU and memory metrics alone. VCF Operations gives you the infrastructure correlation (host and power), and DCGM gives you the per-GPU utilisation, memory and temperature; together they tell you whether the expensive hardware is earning its place.
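The saturated-versus-idle distinction can be expressed as two alert rules. This is a sketch, assuming Prometheus scrapes the DCGM exporter and that the exporter publishes DCGM_FI_DEV_GPU_UTIL (a 0 to 100 percentage) with a Hostname label; the thresholds, windows and alert names are illustrative choices, not recommendations.

```yaml
groups:
  - name: gpu-pool-utilisation      # hypothetical rule group
    rules:
      - alert: GpuNodeUnderused
        # Average utilisation per node over the last 24h is below 20%.
        expr: avg by (Hostname) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[24h])) < 20
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "GPUs on {{ $labels.Hostname }} are mostly idle; consider a sharing mode"
      - alert: GpuNodeSaturated
        # Average utilisation per node over the last hour is above 90%.
        expr: avg by (Hostname) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h])) > 90
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "GPUs on {{ $labels.Hostname }} are saturated; the pool may need more capacity"
```

The first rule argues for better sharing, the second for more cards, which is the cost decision in alert form.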

This is also where the governance argument for putting GPUs on VKS rather than a standalone island becomes concrete. Because the GPU pool is a node pool under the same quotas, monitoring and chargeback as everything else, you can actually attribute GPU consumption to teams and see whether the utilisation justifies the spend. An ungoverned GPU box that one team owns gives you none of that, and it is exactly the situation that leads to a rack of underused cards nobody can account for.
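Inside the cluster, one simple form of that attribution is a per-team quota on the GPU resource. The sketch below uses a standard Kubernetes ResourceQuota; the namespace name and the limit of 4 are hypothetical, and it sits alongside whatever limits you set at the vSphere namespace level rather than replacing them.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota                   # hypothetical name
  namespace: team-inference         # hypothetical team namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"    # extended resources are quota'd with the requests. prefix
```

Pods in that namespace that would take the team past four GPUs are rejected at admission, so consumption per team stays visible and bounded.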

What I’d Do

If GPUs are why you are looking at VKS, you are looking for the right reason. I treat a GPU pool as exactly that, a node pool on a custom VM class, so it inherits all the governance, sizing and lifecycle discipline of the rest of the cluster rather than becoming a special snowflake. I pin the vGPU driver and GPU Operator versions together and test that pairing before scaling. And I think about the destination early: if multiple teams will compete for GPUs with a governance requirement, I plan for VMware Private AI Foundation on top of these clusters rather than hand-rolling self-service later. The continuity, from a single GPU node pool to a governed AI platform, is the thing bolt-on Kubernetes cannot match. Is your GPU capacity managed like the rest of your VKS estate, or is it a separate, ungoverned island that one team quietly owns?

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
