Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture


Self-Service Private AI and GPU Catalog Items in VCF Automation (VCF Automation 9 Series, Part 27)

VCF Automation can turn scarce GPUs into self-service catalog items: deep learning VMs, RAG workstations and GPU Kubernetes clusters. The Quickstart builds the blueprints in minutes. The real work is editing them and rationing the GPUs with policy.

VCF Automation 9 Series · Part 27 of 41
TL;DR · Key Takeaways
  • The Private AI Foundation Quickstart in VCF Automation generates ready-made GPU catalog items: AI Workstation, RAG Workstation, Triton Inference Server, AI Kubernetes Cluster and AI Kubernetes RAG Cluster.
  • The wizard is the easy 20 minutes. The blueprints it generates are a starting point you must edit before production, not a finished catalog.
  • Prerequisites gate everything: vGPU-enabled VM classes, Private AI content-library images, an NVIDIA license, and a namespace with GPU capacity. Miss one and the catalog item fails at request time.
  • A GPU catalog without a lease policy and a profile allowlist is a budget incident waiting to happen. Idle deep learning VMs squat on whole vGPU profiles for weeks.
  • Default to fractional vGPU, gate the large profiles behind approval, and cap GPUs per project. The catalog is where you ration the most expensive thing you own.

Who this is for: Cloud and platform admins standing up GPU self-service in VCF Automation, and architects who have to make a fixed GPU budget serve many data science and DevOps teams.

Prerequisites: VCF Automation 9.x with Private AI Foundation with NVIDIA in place, a GPU workload domain with vGPU-enabled VM classes and Private AI images, an NVIDIA license, and the catalog and policy knowledge from the earlier Parts.

A GPU sitting idle in a forgotten virtual machine is the most expensive thing in your data center. That single sentence is the whole reason this Part exists. Self-service for AI is easy to switch on and easy to regret, because the same catalog that lets a data scientist get a GPU workstation in ten minutes also lets twenty of them hold full vGPU profiles for a month each. The platform makes provisioning trivial; you have to make reclamation and rationing equally deliberate.

This Part is the VCF Automation admin’s view of Private AI self-service: what the Quickstart builds, what you must change before you trust it, and how to govern scarce GPUs through the catalog. It is the platform side of the consumer-facing material in the Private AI Series catalog post, which is worth reading alongside this. I am writing against the current VCF 9.1 release.

What the Quickstart actually builds

The Private AI Foundation Quickstart is a wizard inside VCF Automation. You point it at a namespace with vGPU-enabled VM classes and Private AI images, answer a few questions, and it generates the cloud templates and publishes the catalog items for you. The first run creates a set of items aimed at two audiences: data scientists who want a GPU workstation, and DevOps engineers who want a GPU Kubernetes cluster.

[Figure: From Quickstart to a GPU catalog. The wizard generates the blueprints and publishes the items. Quickstart inputs (namespace, vGPU classes, AI images, NVIDIA license) feed generated, editable cloud templates (blueprints), which back the catalog items: AI Workstation, RAG Workstation, Triton Inference Server, AI K8s Cluster and AI K8s RAG Cluster.]
The Quickstart is a generator. What it produces is yours to tune, govern and own afterward.
| Catalog item | What it provisions | Who requests it |
|---|---|---|
| AI Workstation | Deep learning VM, configurable vCPU/vGPU/memory, PyTorch/CUDA/TensorFlow | Data scientist |
| AI RAG Workstation | GPU VM with a Retrieval Augmented Generation reference solution | Data scientist |
| Triton Inference Server | GPU VM running NVIDIA Triton | Data scientist / MLOps |
| AI Kubernetes Cluster | VKS cluster with GPU-capable worker nodes | DevOps engineer |
| AI Kubernetes RAG Cluster | VKS cluster plus pgvector via Data Services Manager | DevOps engineer |

The prerequisites that actually gate it

The Quickstart fails politely if the foundations are not there, but each of these gaps surfaces as a request-time failure rather than a save-time error, so verify them first. You need vGPU-enabled VM classes defined on the Supervisor, so the platform knows which hardware profiles exist. You need Private AI images in a content library, because the workstation boots from the deep learning VM image. You need a reachable NVIDIA license, because the driver and the AI Enterprise stack are licensed. And you need a namespace with actual GPU capacity assigned, because a catalog item placed onto a namespace with no free GPU will accept the request, attempt placement, and fail.

The Kubernetes items add the VKS layer from the All Apps organization, so GPU-capable node pools and the Supervisor must be ready before those items will provision. Treat the prerequisites as the real project; the wizard is the last step, not the first.
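The four prerequisites can be treated as a preflight checklist. The sketch below is a minimal illustration, not a VCF Automation API: `NamespaceState` and its fields are hypothetical names for facts you have already gathered by hand or with your own tooling, and the function only reports which of the four is missing.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class NamespaceState:
    """Hypothetical snapshot of what the admin has verified for one namespace."""
    vgpu_vm_classes: list[str]         # vGPU-enabled VM classes on the Supervisor
    content_library_images: list[str]  # Private AI images in the content library
    nvidia_license_reachable: bool     # result of your own license reachability check
    free_gpu_slices: int               # free GPU capacity assigned to the namespace


def missing_prerequisites(state: NamespaceState) -> list[str]:
    """Return the prerequisites that would cause a request-time failure."""
    missing = []
    if not state.vgpu_vm_classes:
        missing.append("vGPU-enabled VM class")
    if not state.content_library_images:
        missing.append("Private AI content-library image")
    if not state.nvidia_license_reachable:
        missing.append("NVIDIA license")
    if state.free_gpu_slices <= 0:
        missing.append("free GPU capacity in the namespace")
    return missing


ready = NamespaceState(["grid_a100-10c"], ["dlvm-ubuntu-2404"], True, 4)
not_ready = NamespaceState(["grid_a100-10c"], [], True, 0)

print(missing_prerequisites(ready))      # []
print(missing_prerequisites(not_ready))  # two items: image and capacity
```

An empty list is the only result that should let you open the wizard; anything else is a failure your users would otherwise discover for you.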

The blueprint is a starting point, not the finish

After the first run, the generated cloud templates back the catalog items, and you can modify them to fit your organization. You should. The defaults are sensible for a demo and wrong for production, because they expose more choice than you want a user to have. The most important edit is constraining the vGPU profile: turn the free-form size into a short allowlist so a user cannot request a full A100 when a fractional slice will do. Below is the shape of an AI workstation template after I have tightened it.

formatVersion: 1
inputs:
  vgpuProfile:
    type: string
    title: vGPU profile
    enum: [grid_a100-10c, grid_a100-20c]   # allowlist, not free choice
    default: grid_a100-10c                  # default to the smaller slice
  framework:
    type: string
    title: Pre-install framework
    enum: [pytorch, tensorflow, none]
    default: pytorch
resources:
  ai-workstation:
    type: Cloud.vSphere.Machine
    properties:
      image: dlvm-ubuntu-2404        # from the Private AI content library
      flavor: '${input.vgpuProfile}' # vGPU-enabled VM class mapping
      cloudConfig: |
        #cloud-config
        # bootstrap NVIDIA driver, license, and ${input.framework}

# The Quickstart generates a working template. This is what it looks
# like after pinning profiles and defaults for production.

In practice: I publish two workstation items rather than one item with every option. One is a small fractional-vGPU item open to all; the other is a large full-GPU item gated behind approval and a senior role. Mixing both into a single item with a giant dropdown trains users to pick the biggest profile because they can.

Governing scarce GPUs through the catalog

This is the part the Quickstart does not do for you, and the part that decides whether GPU self-service is an asset or a recurring budget fire. Four controls, all from the governance Part, wrap a GPU catalog item. A lease policy so idle workstations are reclaimed automatically. An approval policy on the large profiles so someone signs off before a full GPU leaves the pool. A project resource limit so no single team can drain the cluster. And the profile allowlist in the blueprint so the requestable sizes are bounded in the first place.

[Figure: Four guards around a GPU catalog item. The wizard builds the item; you build the guards. A GPU catalog item (deep learning VM) surrounded by profile allowlist, approval (large), lease (reclaim) and project GPU limit.]
Allowlist bounds the size, approval gates the big ones, lease reclaims the idle, the project limit caps the team.
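To make the interaction of the four guards concrete, here is a minimal sketch of the decision they add up to for a single request. It is plain Python, not how VCF Automation evaluates policies internally. The thresholds are assumptions for illustration: I treat the larger profile as the gated one, use a 14-day lease cap and cap each project at 8 slices.

```python
from dataclasses import dataclass

ALLOWED_PROFILES = {"grid_a100-10c", "grid_a100-20c"}  # the blueprint allowlist
APPROVAL_PROFILES = {"grid_a100-20c"}                  # assumption: larger slice is gated
MAX_LEASE_DAYS = 14                                    # assumption: hard lease cap
PROJECT_SLICE_LIMIT = 8                                # assumption: per-project cap


@dataclass
class Request:
    profile: str
    lease_days: int
    project_slices_in_use: int  # slices the project already holds


def evaluate(req: Request) -> str:
    """Apply the four guards in order and return the outcome."""
    if req.profile not in ALLOWED_PROFILES:
        return "rejected: profile not in allowlist"
    if req.lease_days > MAX_LEASE_DAYS:
        return "rejected: lease exceeds cap"
    if req.project_slices_in_use + 1 > PROJECT_SLICE_LIMIT:
        return "rejected: project GPU limit reached"
    if req.profile in APPROVAL_PROFILES:
        return "pending approval"
    return "approved"


print(evaluate(Request("grid_a100-10c", 5, 3)))  # approved
print(evaluate(Request("grid_a100-20c", 5, 3)))  # pending approval
print(evaluate(Request("grid_a100-10c", 5, 8)))  # rejected: project GPU limit reached
```

The ordering is a design choice: the cheap, absolute checks run first so that an approver is only ever asked about requests that could actually be fulfilled.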
Gotcha · GPU squat

A deep learning VM holds its vGPU profile whether the data scientist is training or on vacation. Without a lease policy, that GPU is gone from the pool until someone notices and asks. On a cluster with eight GPUs, three forgotten workstations each holding a full GPU are a 37.5 percent capacity loss that never shows up as an error. Lease every GPU item, with a short default and a hard total cap, and make extension a deliberate act.
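A short default with a hard total cap behaves like the sketch below: each extension moves the expiry forward, but never past the cap measured from creation. The 5-day default and 14-day cap are the figures from the worked example later in this Part; the function name and the dates are illustrative only.

```python
from datetime import datetime, timedelta

DEFAULT_LEASE = timedelta(days=5)  # short default lease
TOTAL_CAP = timedelta(days=14)     # hard cap on total lifetime


def extend_lease(created: datetime, current_expiry: datetime, extra_days: int) -> datetime:
    """Extend the lease, clamped so the deployment never outlives the total cap."""
    hard_limit = created + TOTAL_CAP
    return min(current_expiry + timedelta(days=extra_days), hard_limit)


created = datetime(2025, 1, 6)
expiry = created + DEFAULT_LEASE
print(expiry.date())  # 2025-01-11

expiry = extend_lease(created, expiry, 5)
print(expiry.date())  # 2025-01-16

expiry = extend_lease(created, expiry, 5)
print(expiry.date())  # 2025-01-20, clamped by the 14-day cap
```

The second extension is where the cap earns its keep: the user asked for five more days and got four, and a third request would get none.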

vGPU, MIG or passthrough at the catalog layer

The sharing model you bake into the VM class decides the economics of the catalog item, so choose it per item, not once for the whole platform. Time-sliced vGPU oversubscribes a GPU across many light users and is right for development workstations where utilization is bursty. MIG hard-partitions a GPU into isolated slices with guaranteed performance, which suits inference and shared production where one tenant must not starve another. Passthrough gives a workload the entire physical GPU for maximum performance and no sharing, which is for heavy training that earns the whole card. Expose the cheap, shared profiles as the default items, and make passthrough the exception that needs justification.

[Figure: Which GPU sharing model for this item. Choose per catalog item, by workload shape. Bursty dev with many light users: time-sliced vGPU. Shared production needing guaranteed isolation: MIG partitions. Heavy training needing maximum performance: passthrough (full GPU).]
The sharing model lives in the VM class behind the item. It is an economic choice as much as a technical one.
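The decision is small enough to write down as a lookup. This sketch only encodes the three workload shapes named above; the shape labels are my own hypothetical keys, and a real platform will have workloads that do not fit any of them cleanly.

```python
from enum import Enum


class Sharing(Enum):
    TIME_SLICED_VGPU = "time-sliced vGPU"
    MIG = "MIG partitions"
    PASSTHROUGH = "passthrough (full GPU)"


# Hypothetical workload-shape labels mapped to the sharing model behind the VM class.
SHARING_BY_SHAPE = {
    "bursty-dev": Sharing.TIME_SLICED_VGPU,   # many light users, bursty utilization
    "shared-production": Sharing.MIG,         # isolation and guaranteed performance
    "heavy-training": Sharing.PASSTHROUGH,    # the workload earns the whole card
}


def choose_sharing(workload_shape: str) -> Sharing:
    """Pick the sharing model for one catalog item; unknown shapes are an error."""
    try:
        return SHARING_BY_SHAPE[workload_shape]
    except KeyError:
        raise ValueError(f"unknown workload shape: {workload_shape!r}") from None


print(choose_sharing("bursty-dev").value)         # time-sliced vGPU
print(choose_sharing("shared-production").value)  # MIG partitions
```

Raising on an unknown shape, rather than defaulting to something, keeps passthrough from becoming the accidental answer for workloads nobody classified.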
Worked example · Making 8 GPUs serve 30 people

One host, 8 physical GPUs, 30 data scientists. Reserve 2 GPUs for a MIG-backed inference item and 1 for a passthrough training item gated behind approval. Expose the default AI Workstation on a time-sliced vGPU profile that splits each of the remaining 5 cards into four, giving 20 logical slices. (Slicing all 8 cards would give 32, but then nothing is left to reserve.) Twenty slices for 30 people works only because development use is bursty and idle slices are reclaimed.

Add a 5-day default lease with a 14-day cap on the workstation item, and a per-project limit of 8 concurrent slices. The result: bursty development shares the fractional pool, production inference is isolated, training is rationed and signed off, and idle workstations return their slices automatically. Without the fractional default, the lease and the limit, the same 8 GPUs serve about 8 people holding one full card each, and then the cluster is full.
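The slice arithmetic is worth checking explicitly, because reserved cards come out of the same pool. The sketch below assumes the 2 MIG and 1 passthrough GPUs are taken from the same 8 cards, and that every remaining card is split four ways; the function name is illustrative.

```python
def workstation_slices(total_gpus: int, slices_per_gpu: int,
                       reserved_mig: int, reserved_passthrough: int) -> int:
    """Logical time-sliced slices left for the default workstation item."""
    shared_gpus = total_gpus - reserved_mig - reserved_passthrough
    if shared_gpus < 0:
        raise ValueError("more GPUs reserved than exist")
    return shared_gpus * slices_per_gpu


all_sliced = workstation_slices(8, 4, 0, 0)
after_reservations = workstation_slices(8, 4, 2, 1)

print(all_sliced)               # 32
print(after_reservations)       # 20
print(30 / after_reservations)  # 1.5 data scientists per slice
```

With reservations in place the pool is 20 slices for 30 people, so the design depends on not everyone holding a slice at once, which is what the lease and the per-project limit enforce.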

Disclaimer: Editing generated blueprints and applying leases to GPU items affects real, expensive resources. Test changes in a non-production project, confirm the namespace has free GPU capacity, and warn users before applying a lease to existing GPU deployments, since reclamation destroys the VM and frees the GPU.

Day-2 actions that keep GPUs honest

Provisioning is half the lifecycle. What a user can do to a live GPU workstation matters just as much, because the expensive resource is attached the whole time it runs. Curate the day-2 actions on these items deliberately. Power-off should be easy and encouraged: a powered-off deep learning VM with a time-sliced profile can return its slice to the pool, so make it a one-click action and tell users to use it overnight. Resizing the vGPU profile should be possible but gated, since moving from a fractional slice to a full card is a budget decision, not a convenience. Delete should be open, because the faster a finished workstation goes away, the sooner the GPU is reusable.

Pair those actions with the day-2 policies from earlier in the series. The action defines what is possible; the policy decides who may run it and whether a profile bump needs approval. The combination is what lets you hand a GPU workstation to a data scientist without handing them the keys to the whole cluster.
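The split between action and policy can be sketched as a single decision function. This is an illustration of the intent, not the VCF Automation policy engine: the action names are hypothetical, and I assume the two-profile allowlist from the template, ordered smallest to largest.

```python
from typing import Optional

# Assumption: profiles ordered smallest to largest, taken from the template allowlist.
PROFILE_ORDER = ["grid_a100-10c", "grid_a100-20c"]
OPEN_ACTIONS = {"power_off", "delete"}  # hypothetical action names, open to all users


def day2_decision(action: str,
                  current_profile: Optional[str] = None,
                  new_profile: Optional[str] = None) -> str:
    """Return 'allow', 'needs approval' or 'deny' for a day-2 action."""
    if action in OPEN_ACTIONS:
        return "allow"
    if action == "resize_vgpu_profile":
        if current_profile not in PROFILE_ORDER or new_profile not in PROFILE_ORDER:
            return "deny"  # resizing outside the allowlist is never possible
        grows = PROFILE_ORDER.index(new_profile) > PROFILE_ORDER.index(current_profile)
        return "needs approval" if grows else "allow"
    return "deny"  # anything not curated is not offered


print(day2_decision("power_off"))                                              # allow
print(day2_decision("resize_vgpu_profile", "grid_a100-10c", "grid_a100-20c"))  # needs approval
print(day2_decision("resize_vgpu_profile", "grid_a100-20c", "grid_a100-10c"))  # allow
```

Shrinking a profile is allowed without approval in this sketch because it returns capacity to the pool; only growth is treated as a budget decision.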

Allocated is not the same as used

The catalog tells you what is allocated. It does not tell you what is actually being used, and on GPUs the gap between the two is where the money leaks. A workstation can hold a full vGPU profile while its GPU utilization sits near zero for days, and from the catalog point of view everything looks healthy. Lease policies catch the worst of it by reclaiming on a timer, but a lease is a blunt instrument that fires on the clock, not on idleness.

Close the loop with utilization signals from the operations side. Watch real GPU utilization per deployment, flag the workstations that have been allocated but idle for days, and use that data to tune lease lengths and profile sizes rather than guessing. If your fractional users never exceed a quarter of their slice, your default profile is too big and you are leaving capacity on the table. Self-service GPU provisioning without a feedback loop on utilization is how a cluster stays full and underused at the same time, and the fix is to let the numbers, not the requests, decide your defaults.
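A feedback loop of this kind can start very small. The sketch below assumes you can export one average GPU utilization figure per deployment per day from your operations tooling; the `Sample` shape, the 5 percent threshold and the 3-day window are all assumptions to tune, not product defaults. It flags deployments whose most recent samples have all been idle.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    deployment: str
    day: int                 # day index, one sample per deployment per day
    gpu_util_percent: float  # average GPU utilization for that day


def idle_deployments(samples, idle_below=5.0, min_idle_days=3):
    """Return deployments idle for at least min_idle_days of their latest samples."""
    by_deployment = defaultdict(list)
    for s in samples:
        by_deployment[s.deployment].append(s)

    flagged = []
    for name, rows in by_deployment.items():
        rows.sort(key=lambda s: s.day)
        streak = 0
        for s in reversed(rows):  # walk back from the newest sample
            if s.gpu_util_percent < idle_below:
                streak += 1
            else:
                break
        if streak >= min_idle_days:
            flagged.append(name)
    return sorted(flagged)


samples = (
    [Sample("ws-a", d, u) for d, u in enumerate([60.0, 2.0, 1.0, 0.0])]
    + [Sample("ws-b", d, u) for d, u in enumerate([1.0, 1.0, 1.0, 40.0])]
    + [Sample("ws-c", d, u) for d, u in enumerate([0.0, 0.0])]
)
print(idle_deployments(samples))  # ['ws-a']
```

Only the trailing streak counts, so a workstation that was idle last week but busy yesterday is left alone, and one that is too new to judge is not flagged either.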


Two checks before a GPU item goes live

Before a GPU catalog item reaches a tenant, confirm two things that the Quickstart will not confirm for you. First, that the backing namespace actually has free GPU capacity and the vGPU-enabled VM class the blueprint references, because a request that places onto a namespace with no free GPU fails after the user has waited, not before. Second, that a lease policy is attached, so the item cannot mint an immortal GPU deployment on its first day in production.

Prove it as a real user

Request the item once as an ordinary project member, not as an administrator, and watch the whole arc: it provisions, the vGPU attaches, the lease clock starts, and at expiry the GPU returns to the pool unaided. An item that an admin can deploy but a scoped user cannot, or one whose GPU never comes back, is not ready regardless of how clean the Quickstart looked. The end-to-end request as a real user is the only test that proves the governance binds to the people it is meant to bind to.

What I’d Do

Run the Quickstart to get working items fast, then treat its output as a draft. Pin the vGPU profiles to a short allowlist, default to the smallest sensible slice, and split workstation choice into a small open item and a large gated one. Before you publish to a single user, put a lease on every GPU item and a GPU limit on every project, because the absence of those two is what turns a GPU budget into a monthly surprise. I would not expose passthrough as a default or leave the generated template unedited, and I would not skip the prerequisite check, since vGPU classes, Private AI images, a license and free capacity are all request-time failures that look like the catalog is broken. Validate one end-to-end request as a real data scientist, watch the lease attach and the GPU return, and only then announce it.

Stand up one fractional AI Workstation item with a 5-day lease in a lab project this week and request it as an ordinary user. If the GPU comes back to the pool on expiry without anyone asking, your AI self-service is ready to grow.


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
