Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture


Self-Service AI Catalog Items with VCF Automation for VMware Private AI (Private AI Series, Part 16)

How to publish self-service GPU catalog items for VMware Private AI Foundation with the VCF Automation Quickstart, plus the namespace, vGPU class and quota bindings that decide whether the catalog is safe to hand out.

VMware Private AI Series · Part 16 of 24

TL;DR · Key Takeaways

  • Self-service is the whole point of Private AI Foundation. If a data scientist still files a ticket to get a GPU VM, you built an expensive lab, not a platform.
  • The fast path is the PAIF Quickstart in VCF Automation, which generates the AI Workstation, AI Kubernetes Cluster and Triton catalog items for you. Get the org, project and namespace plumbing right first, or the Quickstart has nothing to publish.
  • The catalog item is only a request form. What actually governs cost and blast radius is the VM class (vGPU profile), the project, and the namespace it lands in.
  • Field gotcha: the NVIDIA RAG catalog items are removed in PAIF 9.1 because NVIDIA stopped supporting the backing blueprints. If your design depended on the turnkey RAG Workstation, plan for it now.

Who this is for: VCF cloud admins and architects standing up the consumption layer for Private AI Foundation.

Prerequisites: A working PAIF deployment (GPU workload domain, vGPU drivers, VCF Automation 9.0 or 9.1 deployed), Private AI Services installed on the Supervisor, and a vGPU-backed VM class already published.

Every Private AI project I see starts the same way. The platform team builds a beautiful GPU stack, the drivers load, a test inference job runs, and everyone declares victory. Then the first real data scientist shows up and asks for a GPU workstation, and the answer is a Jira ticket with a three-day SLA. That is the moment the platform quietly fails. The infrastructure works, but the consumption model is still 2015.

Part 16 is about closing that gap. The goal is a VCF Automation catalog where a data scientist picks AI Workstation, sets a vGPU size and a framework, clicks request, and gets a running deep learning VM in minutes with no admin in the loop. This is the runbook to get there, plus the design decisions that decide whether your catalog is safe to hand out.

Figure: The Self-Service Stack (what a single catalog request actually traverses, top to bottom)

  • Consumer: data scientist or DevOps engineer, self-service only
  • Service Broker Catalog: AI Workstation · AI Kubernetes Cluster · Triton Inference Server
  • Organization & Project: governs who can request what, and which zones it can land in
  • Supervisor Namespace: Private AI Services activated, quotas and VM classes bound here
  • GPU Workload Domain: vGPU hosts, the actual silicon the request consumes

One click on a catalog item resolves all the way down to a vGPU profile on a physical host.

What you are actually publishing

VCF Automation (the product formerly sold as Aria Automation) ships a Quickstart for Private AI Foundation that generates the catalog items for you. You do not hand-author blueprints. The Quickstart produces three core items: an AI Workstation backed by a deep learning VM with PyTorch and TensorFlow pre-installed, an AI Kubernetes Cluster with GPU-capable worker nodes and the NVIDIA GPU Operator already deployed, and a GPU-enabled Triton Inference Server. Two RAG-flavoured items (AI RAG Workstation and AI Kubernetes RAG Cluster, both wired to a pgvector database on Data Services Manager) also existed in 9.0.

Read the matrix below before you promise anyone a RAG button. For the design behind the deep learning VM itself, see Part 10 on Deep Learning VMs, and for what the cluster items feed, Part 12 on the Model Store and Model Runtime.

Catalog item              | Backing workload                        | GPU               | Requires DSM | 9.1 status
--------------------------|-----------------------------------------|-------------------|--------------|-----------
AI Workstation            | Deep learning VM (PyTorch, TensorFlow)  | vGPU or none      | No           | Supported
AI Kubernetes Cluster     | VKS cluster, GPU Operator pre-installed | vGPU worker nodes | No           | Supported
Triton Inference Server   | DLVM running Triton                     | vGPU              | No           | Supported
AI RAG Workstation        | DLVM with NVIDIA RAG + pgvector         | vGPU              | Yes          | Removed
AI Kubernetes RAG Cluster | VKS cluster with pgvector on DSM        | vGPU worker nodes | Yes          | Removed

My take on the RAG removal. The 9.1 release notes are blunt: the NVIDIA RAG catalog items are removed because NVIDIA is no longer supporting the blueprints behind them. This is the kind of dependency that bites a platform team six months in, when a turnkey item silently disappears on upgrade. Do not architect a customer demo around the RAG Workstation if you are landing on 9.1. Build the RAG path yourself on top of the plain AI Workstation plus a pgvector instance, which is exactly the assembly we walked through in Part 15. You trade a button for control, and on a platform you support for years, control wins.


The setup flow, end to end

Figure: From Empty Org to Published Catalog

  1. Install PAI Services on Supervisor
  2. Org & Project + cloud account (zones, quotas)
  3. Import namespaces (activate PAI)
  4. Run PAIF Quickstart (content source)
  5. Catalog items created (3 core items)
  6. Share to users (entitlements)

Steps 1 to 3 are plumbing. The Quickstart in step 4 does the heavy lifting only if the plumbing is correct.

Disclaimer: This procedure creates consumable infrastructure that real users will provision against. Validate your vGPU profiles and VM classes against the supported BOM, confirm GPU Operator and driver interoperability for your target PAIF version, set quotas before you entitle anyone, and test the full request-to-running path with a throwaway project first.

Step by step

  1. Install Private AI Services on the Supervisor. Private AI Services runs on the vSphere Supervisor for the organization. Without it activated, the namespaces you create have nowhere to host model endpoints. If your Supervisor was deployed without NSX, validate the networking path early, because the supported topologies differ.
  2. Build the organization and project. In VCF Automation, the project is the governance boundary. It decides which users can request items and which Cloud Zones or Kubernetes Zones their requests can land in. Add the vCenter cloud account, define the zones that map to your GPU workload domain, and set capacity limits here, not later.
  3. Import the Private AI namespaces and activate Private AI Services. Import the Supervisor namespaces into VCF Automation, then have the org administrator activate Private AI Services in each namespace that should host endpoints. Bind the vGPU-backed VM class and the namespace quota at this point.
  4. Run the PAIF Quickstart. The Quickstart creates the content source and generates the catalog items. This is also where it wires the items to your project and zones, so the earlier steps have to be right or the Quickstart produces items that fail at request time.
  5. Entitle and publish. Catalog items are invisible until you entitle them to the project. Share only the items each audience needs. Data scientists rarely need the Kubernetes cluster item, and DevOps teams rarely need the single-VM workstation.
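
The audience split in step 5 is worth writing down before you open the entitlement screen. Here is a minimal sketch in plain shell. The item names are the three core Quickstart items. The audience labels, the function name, and the placement of Triton with the DevOps audience are assumptions for illustration, and nothing here calls a VCF Automation API.

```shell
#!/bin/sh
# Hypothetical audience-to-item map (illustrative only).
# Audience labels are made up; item names are the core Quickstart items.
items_for_audience() {
  case "$1" in
    data-scientist)
      printf '%s\n' "AI Workstation"
      ;;
    devops)
      # Assumption: inference serving is owned by the DevOps audience.
      printf '%s\n' "AI Kubernetes Cluster" "Triton Inference Server"
      ;;
    *)
      echo "unknown audience: $1" >&2
      return 1
      ;;
  esac
}

# Print what each audience should see in the catalog.
for audience in data-scientist devops; do
  echo "$audience:"
  items_for_audience "$audience" | sed 's/^/  - /'
done
```

Keeping the map in a reviewed file means an entitlement change becomes a diff someone approves rather than a click nobody remembers.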

If you want the underlying VCF Automation constructs (provider, organizations, projects, Service Broker) explained from the ground up, that is covered in VCF Automation in VCF 9 Explained. Here we assume you know those pieces and are wiring them to GPUs.

# Confirm the vGPU VM class the catalog item will request exists in the namespace
kubectl get virtualmachineclass -n <org-namespace>

# Confirm Private AI Services is activated for the namespace
kubectl get pods -n <org-namespace> | grep -i private-ai

# After a request deploys a DLVM, confirm the vGPU actually attached
kubectl get virtualmachine -n <org-namespace> -o wide
nvidia-smi   # run inside the deployed deep learning VM

The catalog item is just a request form

The single most common design mistake is treating the catalog item as the thing that matters. It is not. The item is a form. What it resolves to (the VM class, the framework image, the network, the namespace) is where all the governance and cost live. On a host with 192 GB of total GPU memory, a 48 GB vGPU profile instead of a shared 8 GB profile is the difference between four workstations and twenty-four.
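
The four-versus-twenty-four figure is plain division. Here is a minimal sketch, assuming a host with 192 GB of total GPU memory (for example four 48 GB cards; the host size is an assumption chosen to match the numbers above) and ignoring per-GPU placement rules and overhead:

```shell
#!/bin/sh
# workstations_per_host <host_gpu_mem_gb> <profile_gb>
# Integer division: how many vGPU profiles of one size fit on the host.
workstations_per_host() {
  echo $(( $1 / $2 ))
}

HOST_GPU_MEM_GB=192   # assumption: 4 x 48 GB GPUs in one host

echo "48 GB profile: $(workstations_per_host "$HOST_GPU_MEM_GB" 48) workstations"
echo "8 GB profile:  $(workstations_per_host "$HOST_GPU_MEM_GB" 8) workstations"
```

Treat the result as an upper bound. Actual density depends on which profiles your GPUs and driver release support.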

Figure: Anatomy of One Request. The AI Workstation item (the form the user sees) resolves to:

  • VM Class: vGPU profile, governs cost
  • Framework: PyTorch / TensorFlow image
  • Network: VPC / segment from the zone
  • Namespace: quota + blast radius

Tune the things the arrows point to. The form on top is the least interesting part.

Two design rules I apply on every engagement. First, publish a small number of vGPU profiles, not a free-text field. A curated list of, say, three sizes (shared, single-GPU, multi-GPU) keeps users out of the fractional-GPU weeds and keeps your host density predictable. Second, set the namespace quota to a number you can actually back with hardware. Self-service without a quota is just a faster way to exhaust your GPUs, and the failure mode is ugly: requests succeed in VCF Automation and then sit pending because there is no capacity to place them.
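
One way to sanity-check the second rule before entitling anyone is to compare the quota against what the zone can physically place. This is a rough sketch with made-up numbers. The function name, and expressing the quota as a count of VMs at the largest published profile, are assumptions for illustration and not a VCF Automation feature.

```shell
#!/bin/sh
# quota_is_backed <quota_vms> <profile_gb> <hosts> <gpu_mem_per_host_gb>
# Succeeds only if every VM the quota allows could be placed at the
# given vGPU profile size on the hardware behind the zone.
quota_is_backed() {
  capacity=$(( $3 * ($4 / $2) ))
  if [ "$1" -le "$capacity" ]; then
    echo "ok: quota $1 <= placeable $capacity"
  else
    echo "over-committed: quota $1 > placeable $capacity"
    return 1
  fi
}

# Example: 2 hosts with 192 GB of GPU memory each, 48 GB profile.
quota_is_backed 8 48 2 192            # placeable = 2 * 4 = 8, within capacity
quota_is_backed 12 48 2 192 || true   # 12 > 8, flagged as over-committed
```

Sizing against the largest profile is deliberately pessimistic. If the check passes there, smaller profiles will fit too.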

Validate before you hand it out

Do not entitle a catalog item to real users until you have run the full path yourself and watched it land. The failure points are predictable, and they almost always trace back to a missing binding rather than a broken Quickstart.

Figure: Request-to-Running Checkpoints

  1. Request deploys? (VM object created) No: project not entitled or zone has no capacity. Fix quota.
  2. vGPU attached? (class resolves) No: vGPU VM class not bound to the namespace. Bind it.
  3. Driver loaded? (nvidia-smi OK) No: guest driver / token mismatch. Check the DLVM image.
  4. Endpoint reachable

Walk down the checkpoints in order. Most production tickets are a binding that was skipped in steps 2 and 3.

The pattern is consistent: when a self-service request fails, the catalog is almost never the culprit. It is a project that was not entitled, a vGPU VM class that was never bound to the namespace, or a zone with no real GPU capacity behind it. Build the bindings deliberately, then let the catalog be boring. Boring is the goal. A boring catalog is one your users trust.
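
The checkpoints fit in a few lines of shell, which helps when you want the same triage order in a runbook or a support script. This is a sketch only. The function name and the yes/no arguments are hypothetical, and you would fill them in from the kubectl and nvidia-smi checks shown earlier.

```shell
#!/bin/sh
# triage <deployed> <vgpu_attached> <driver_loaded>   (each "yes" or "no")
# Reports the first failing checkpoint, in the order a request hits them.
triage() {
  if [ "$1" != yes ]; then
    echo "project not entitled or zone has no capacity: fix entitlement or quota"
  elif [ "$2" != yes ]; then
    echo "vGPU VM class not bound to the namespace: bind it"
  elif [ "$3" != yes ]; then
    echo "guest driver or token mismatch: check the DLVM image"
  else
    echo "all checkpoints passed: test endpoint reachability"
  fi
}

# Example: the VM object exists but no vGPU attached.
triage yes no no
```

The order matters. A failure at an earlier checkpoint makes the later answers meaningless, so the function stops at the first "no".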

What I’d Do

Stand up the three core items (AI Workstation, AI Kubernetes Cluster, Triton) with the Quickstart, publish a curated set of vGPU sizes, set hard namespace quotas, and entitle each item to only the audience that needs it. Skip the RAG catalog items entirely on 9.1 and assemble RAG yourself, because a turnkey item you cannot support is worse than no item at all. Then go validate the full request-to-running path before a single real user sees the catalog. Self-service that fails on first use does more damage to platform credibility than no self-service at all.

Where has your self-service catalog leaked the most: quota exhaustion, vGPU sizing, or entitlement sprawl? That is usually where the next round of governance work lives.


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
