A data scientist files a ticket: “I need a GPU box with PyTorch and a notebook, today.” You could hand them a raw Ubuntu VM, attach a vGPU profile, and let them spend a day fighting CUDA versions and guest driver mismatches. Or you give them a Deep Learning VM. In VMware Private AI Foundation with NVIDIA, the Deep Learning VM (DLVM) is the unit of work that turns “I want to experiment” into a running, GPU-accelerated workstation in minutes. It is also the piece of the platform people most often misunderstand, because it looks like a production serving tier and it is not.
This post explains what a DLVM actually is, how the image is put together, what happens the first time it boots (the step that quietly breaks the most deployments), and where it stops being the right tool. If you have already installed the GPU Operator and vGPU drivers from Part 9, the DLVM is the first workload you can put on top of that foundation.
What a Deep Learning VM actually is
A DLVM is a Canonical Ubuntu virtual machine that NVIDIA and VMware have pre-validated for GPU work, delivered as an image you provision from a catalog. The point is not that it runs Ubuntu. The point is that the entire stack below your model (the OS, the container runtime, the conda manifests, and the GPU driver) has already been tested as a set, so a data scientist starts building instead of debugging compatibility.
It helps to split the image into two layers. Some software is baked into the image and ships with it. The rest is pulled and installed the first time the VM powers on, driven by a cloud-init script. That second layer is where most of the operational risk lives, so it is worth seeing the split clearly.
You pick the workload at deploy time, and the cloud-init script pulls the matching container from the NVIDIA NGC catalog. Choose PyTorch or TensorFlow and you get a ready JupyterLab instance at http://dl_vm_ip:8888, where dl_vm_ip is the address of the VM. Choose NVIDIA RAG and you get a sample chatbot at http://dl_vm_ip:3001/converse that you can point at your own knowledge base. Triton and the DCGM Exporter are there too, the latter giving you Prometheus-ready vGPU metrics without any extra setup.
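If you hand out DLVMs through a catalog, it helps to tell users exactly which URL to open. Here is a minimal sketch that maps the workload choice to its web endpoint. The ports and path come from the paragraph above; the workload keys and the function name are hypothetical labels I chose for illustration, not product identifiers.

```python
# Ports and paths are taken from the text above. The dictionary keys are
# hypothetical labels for this sketch, not official workload names.
WORKLOAD_ENDPOINTS = {
    "pytorch": (8888, ""),          # JupyterLab
    "tensorflow": (8888, ""),       # JupyterLab
    "rag": (3001, "/converse"),     # sample chatbot
}


def workload_url(dl_vm_ip: str, workload: str) -> str:
    """Build the URL a user would open for the chosen workload."""
    key = workload.lower()
    if key not in WORKLOAD_ENDPOINTS:
        raise ValueError(f"no known web endpoint for workload {workload!r}")
    port, path = WORKLOAD_ENDPOINTS[key]
    return f"http://{dl_vm_ip}:{port}{path}"
```

Something like `workload_url("10.0.0.5", "rag")` could feed the notification a user receives once provisioning finishes.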
Where it sits in the stack
The DLVM is a guest on top of the GPU-accelerated workload domain you built earlier in this series. The host vGPU driver lives in ESXi. The DLVM carries a matching guest driver, and a vGPU profile slices the physical GPU down to what the VM needs. Understanding this placement matters because the DLVM is one of two ways to consume GPUs on the platform. The other is a VKS Kubernetes cluster with GPU worker nodes, which is the path you take toward production. Same hardware, same drivers underneath, very different operating model on top.
What happens on first boot
Here is the part that trips teams up. The DLVM does not arrive ready to run. The vGPU guest driver and the deep learning workload are installed the first time you start the VM. That first boot reaches out to three places: NVIDIA’s licensing service for a vGPU license, a driver source for the guest driver that matches the host, and the NGC catalog for the workload container. If any one of those is unreachable or misconfigured, the VM boots fine and reports no GPU, which sends people chasing the wrong problem.
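Because the failure mode is silent, I like to probe the three dependencies from the same network segment before the first boot. The sketch below only checks that a TCP connection can be opened. It does not prove that the license, driver, or container is valid. The dependency names and any hostnames you pass in are assumptions for your environment; nothing here is a documented endpoint.

```python
import socket
from typing import Callable, Dict, List, Tuple

Endpoint = Tuple[str, int]  # (host, port)


def tcp_reachable(endpoint: Endpoint, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def first_boot_preflight(
    dependencies: Dict[str, Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_reachable,
) -> Dict[str, bool]:
    """Probe each first-boot dependency and report which ones answered."""
    return {name: probe(endpoint) for name, endpoint in dependencies.items()}


def unreachable(results: Dict[str, bool]) -> List[str]:
    """List the dependencies that failed the probe, sorted by name."""
    return sorted(name for name, ok in results.items() if not ok)
```

You would call `first_boot_preflight` with a dictionary of your own licensing, driver source, and registry endpoints, then fail the deployment request early if `unreachable` returns anything. The `probe` parameter exists so the check can be swapped for an HTTPS request or stubbed out in tests.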
In a disconnected or air-gapped site, none of that works out of the box. Your administrators have to stand up a local URL for the vGPU drivers, deploy and configure an NVIDIA DLS instance with a client configuration token, and mirror every required image and model into a local Harbor registry. Skip any one of these and first-boot provisioning stalls. This is the single most common support call I see on new DLVM rollouts, and it is almost never a GPU fault. It is usually a missing token or an unreachable registry.
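The three air-gapped prerequisites are easy to encode as a completeness check that runs before anyone requests a VM. This is a sketch under my own assumptions: the class and field names are hypothetical and do not correspond to real OVF properties or catalog inputs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AirGappedInputs:
    """Hypothetical container for the values an air-gapped deploy needs."""
    driver_url: Optional[str] = None        # local URL serving vGPU guest drivers
    dls_client_token: Optional[str] = None  # client configuration token from DLS
    harbor_registry: Optional[str] = None   # local Harbor registry with mirrored images


def missing_prerequisites(inputs: AirGappedInputs) -> List[str]:
    """Return a human-readable label for every input that is empty or unset."""
    checks = [
        ("local vGPU driver URL", inputs.driver_url),
        ("DLS client configuration token", inputs.dls_client_token),
        ("local Harbor registry", inputs.harbor_registry),
    ]
    return [label for label, value in checks if not (value and value.strip())]
```

An empty list means all three values were supplied. It does not mean they are correct, so pair it with a reachability test against the same endpoints.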
Three ways to deploy one
There is no single deploy button, and the right method depends on who is asking:

- A data scientist or DevOps engineer self-serves the AI Workstation catalog item in VCF Automation. This is the intended flow and the one worth standardizing on.
- A VI administrator deploys a DLVM directly on a vSphere cluster from the vSphere Client. This is handy for a quick template test but does not give your users self-service.
- A DevOps engineer deploys through the VM Service in the Supervisor with kubectl, which fits anyone already managing infrastructure as code.

My advice: invest in the VCF Automation catalog path and treat the other two as escape hatches. The catalog item is where you can enforce sizing, networking, and a known-good cloud-init instead of letting people hand-roll VM properties.
DLVM or VKS cluster: when to use which
The mistake I see most is teams trying to run production inference on a Deep Learning VM because it was fast to stand up. It works in a demo and then falls over the moment you need to scale, roll a model without downtime, or share a GPU across many requests. The DLVM is a workbench. When the workload graduates from “a person is experimenting” to “a service other systems call,” it belongs on a VKS GPU cluster with NIM, which is where this series goes next.
| Dimension | Deep Learning VM | VKS GPU cluster |
|---|---|---|
| Primary user | Data scientist, ML engineer | Platform / MLOps team |
| Best for | Prototyping, fine-tuning, validation, demos | Production inference and serving |
| Scaling model | One VM, one GPU slice, manual | Horizontal, scheduled across nodes |
| Interface | JupyterLab, SSH, single container | Kubernetes API, NIM endpoints |
| Time to value | Minutes, self-service | Longer, needs cluster lifecycle |
| Resilience | None built in, it is a single VM | Self-healing, rolling updates |
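The graduation rule behind the table can be written down as a tiny decision helper. It is only a sketch of the reasoning in this section. The field names are hypothetical, and a real placement decision involves more than four booleans.

```python
from dataclasses import dataclass


@dataclass
class WorkloadNeeds:
    """Hypothetical checklist of the signals discussed in this section."""
    called_by_other_systems: bool = False
    needs_horizontal_scaling: bool = False
    needs_zero_downtime_rollout: bool = False
    shares_gpu_across_many_requests: bool = False


def recommended_target(needs: WorkloadNeeds) -> str:
    """Any single production signal is enough to move off the workbench."""
    production_signals = (
        needs.called_by_other_systems,
        needs.needs_horizontal_scaling,
        needs.needs_zero_downtime_rollout,
        needs.shares_gpu_across_many_requests,
    )
    return "VKS GPU cluster" if any(production_signals) else "Deep Learning VM"
```

The deliberate design choice is `any` rather than `all`: one production requirement is enough to outgrow a single VM.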
The gotchas I flag to clients
Three things are worth knowing before you roll DLVMs out at scale. First, a live one: the key used to sign the published VMware Deep Learning VM Image releases expired on January 3, 2026. If your content library enforces certificate validity through a security policy, newly published images can no longer be synced or uploaded to it. Check your content library policy before you assume a sync failure is a network problem, and watch the image release notes for a re-signed build.
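To make the triage order concrete, here is a small sketch of the check I run in my head before blaming the network. The expiry date comes from the paragraph above. Treating the key as expired from that date onward is my assumption, and the function and parameter names are hypothetical.

```python
from datetime import date

# Date stated in the text above for the image signing key.
SIGNING_KEY_EXPIRY = date(2026, 1, 3)


def sync_blocked_by_expired_key(
    today: date,
    library_enforces_cert_validity: bool,
    expiry: date = SIGNING_KEY_EXPIRY,
) -> bool:
    """Return True when an expired signing key is a plausible cause of a failed sync.

    Assumption: the key counts as expired on the expiry date itself and after.
    """
    return library_enforces_cert_validity and today >= expiry
```

If this returns `True`, look at the content library security policy first and the network second.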
Second, the vGPU guest driver is matched to the host driver automatically, which is exactly what you want, but only if the host driver is the version you think it is. If Part 9 left you on a different host driver than the image expects, the guest install can fail or fall back in ways that are annoying to diagnose. Validate the host driver version first. Third, the DLVM has no resilience on its own. It is a single VM with a single GPU slice. Do not let an experiment quietly become a dependency that the business now relies on, because there is nothing underneath it to catch a failure.
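For the host driver check in the second point, a strict comparison is usually enough. The sketch below parses a dotted driver version string and compares it with the version you expect. The version numbers used with it are illustrative, not a statement of what any image requires. One way to obtain the string, assuming `nvidia-smi` is available on the host, is `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```python
from typing import Tuple


def parse_driver_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '550.90.07' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def host_driver_ok(actual: str, expected: str) -> bool:
    """Return True only when the host driver is exactly the expected version."""
    return parse_driver_version(actual) == parse_driver_version(expected)
```

Comparing parsed tuples rather than raw strings tolerates trailing newlines and surrounding whitespace in command output.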
For the design context behind all of this, the architecture and components breakdown from Part 2 shows where the DLVM fits among the other moving parts, and the reference architecture and sizing guidance from Part 7 helps you decide how much GPU to hand a single workstation.
What I’d do
Standardize on the VCF Automation AI Workstation catalog item, build a known-good cloud-init for your two or three common workloads, and validate the first-boot path end to end (driver source, DLS token, NGC or Harbor reachability) before you let a single data scientist near it. Treat the DLVM as what it is: a fast, disposable workbench for people. The moment a model needs to scale or serve, it moves off the VM and onto a cluster. Where do your data scientists hit the wall first, on GPU sizing or on the air-gapped first boot? That will tell you where to spend your hardening time.
References
- About Deep Learning VM Images in VMware Private AI Foundation with NVIDIA (Broadcom TechDocs)
- VMware Deep Learning VM Image Release Notes (Broadcom TechDocs)
- Deploy Deep Learning Virtual Machine with VMware Cloud Foundation Automation (VCF Blog)