Dr. Pranay Jha

Governing Private AI Consumption in VCF Automation: GPU Quotas, Policies and Roles (VCF Automation 9 Series, Part 35)

GPU is the most expensive resource you will ever put behind self-service. Here is how to govern Private AI consumption in VCF Automation 9.1 with namespace GPU quotas, consumption policies, roles and per-namespace activation, without touching the AI stack itself.

VCF Automation 9 Series · Part 35 of 41
TL;DR · Key Takeaways
  • This Part is about governing Private AI consumption, not the AI stack. The components and catalog items live in Part 27 and the Private AI Series.
  • GPU is governed like any scarce resource: a quota on the organization namespace, expressed as a device-plugin resource such as nvidia.com/gpu.
  • Org admins control who gets GPU and how much through consumption policies, templates and roles, and they activate Private AI Services per namespace, not everywhere.
  • In 9.1, providers grant quota across multiple supervisors in a region, and org admins delegate namespace creation to project admins and platform engineers within guardrails.
  • My recommendation: cap GPU per namespace, offer a short list of approved vGPU and MIG profiles as templates, and make idle GPU visible. An ungoverned GPU pool is the fastest way to a budget overrun and a queue.
Who this is for: provider and organization admins who run VCF Automation for AI tenants and have to keep GPU consumption controlled, fair and within budget.
Prerequisites: VCF Automation 9.x with Private AI Foundation in play, GPU hosts in the estate, organizations and projects defined, and the rights to set quotas, policies and roles. For the AI components themselves, see the Private AI Series.

A data science team asks for a few GPUs to experiment. Six weeks later you find they are holding eight high-end GPUs at four percent utilization, and another team that needs two cannot get any. GPU is the most expensive resource you will ever put behind self-service, and the one most likely to be quietly hoarded. This Part is about governing that consumption in VCF Automation: the quotas, policies and roles that keep GPU honest. It is deliberately not about the AI components, which the Private AI Series covers, or the GPU catalog items, which Part 27 covers. It is about the guardrails around them.

The governance stack for GPU

GPU governance is not one control; it is a stack, and each layer answers a different question. The provider decides how much GPU capacity an organization gets across a region. The organization carves that into namespace quotas that cap each team. A consumption policy decides who may request GPU-backed items at all, and roles decide who can do what. Activation decides which namespaces even have Private AI Services switched on. Skip a layer and you get the failure mode above: capacity handed out with no ceiling and no accountability.

Figure: The GPU governance stack. Each layer answers a different question.
  • Provider, region GPU capacity: how much GPU an org gets, across supervisors in the region.
  • Org, namespace GPU quota: the ceiling per team, as a device-plugin resource quota.
  • Consumption policy + templates: who may request GPU items, and which approved profiles.
  • Roles and delegation: who creates namespaces, requests GPU, and approves.
  • Activation, Private AI Services per namespace: switched on where it is needed, not estate-wide.
Five layers, top to bottom. The provider sets the outer bound; activation sets the inner one.

GPU as a quota, not a hope

The concrete control is the namespace quota. GPU surfaces to the platform as a device-plugin resource, so it is governed the same way you govern CPU and memory: a hard limit on the namespace. The org namespace gets a GPU ceiling, and requests beyond it are simply refused. That single number is what turns GPU from an open bar into a budget.

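# GPU is governed as a hard quota on the AI tenant namespace.
# Kubernetes accepts quota on extended resources such as GPUs only with
# the requests. prefix, so the GPU line reads requests.nvidia.com/gpu.
kubectl --context ai-tenant-ns describe resourcequota

# Name:                     ai-tenant-quota
# Resource                  Used   Hard
# --------                  ----   ----
# requests.cpu              12     32
# requests.memory           96Gi   256Gi
# requests.nvidia.com/gpu   3      4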

Expected result: the team can hold at most four GPUs in this namespace, and the fifth request is rejected at admission, not after the bill arrives. Note that partitioning can change the resource name: a full GPU is nvidia.com/gpu, while MIG slices exposed under the NVIDIA device plugin's mixed strategy surface as distinct names like nvidia.com/mig-1g.10gb (under the single strategy they still appear as nvidia.com/gpu). Decide on a partitioning strategy first, because the quota you write depends on it. The vGPU and MIG mechanics themselves are covered in the Private AI Series.
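As a minimal sketch, assuming the ceiling is expressed as a standard Kubernetes ResourceQuota, the object behind that output could be generated like this. In VCF Automation the quota is normally set through the namespace configuration rather than applied by hand, so treat this as a way to see the shape of the object, not as the platform's workflow. The namespace name, the helper function and the MIG ceiling are illustrative assumptions. Kubernetes only accepts quota on extended resources with the requests. prefix, which is why the keys are written that way.

```python
import json


def gpu_quota_manifest(namespace, name, full_gpus, mig_slices=None):
    """Build a ResourceQuota manifest that caps GPU in one namespace.

    mig_slices maps a MIG resource name (for example
    "nvidia.com/mig-1g.10gb") to its ceiling. Those names only exist
    if the GPUs are partitioned and exposed that way.
    """
    hard = {"requests.nvidia.com/gpu": str(full_gpus)}
    for resource, ceiling in (mig_slices or {}).items():
        hard[f"requests.{resource}"] = str(ceiling)
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"hard": hard},
    }


# Hypothetical values: four full GPUs plus seven small MIG slices.
manifest = gpu_quota_manifest(
    "ai-tenant-ns", "ai-tenant-quota", 4, {"nvidia.com/mig-1g.10gb": 7}
)
print(json.dumps(manifest, indent=2))
```

Keeping full GPUs and MIG slices as separate keys makes the partitioning decision explicit: each resource name gets its own ceiling, so a team cannot trade one for the other without a quota change.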

In practice: set the GPU quota lower than you think you need and raise it on request. Raising a quota is a two-minute change with a paper trail. Clawing back GPUs a team has grown attached to is a political negotiation. Start tight.

Policies, templates and who can ask

A quota caps the total; a consumption policy decides who may draw on it and what they may draw. Organization admins enforce control over GPUs through consumption policies, templates and defined user roles. Templates are the underused lever: instead of letting anyone request an arbitrary GPU shape, you publish a short list of approved profiles (a small inference slice, a single training GPU, a multi-GPU node), and consumers pick from those. That keeps requests standard, costable and supportable, and it stops the one-off configurations that become support burdens later. Tie GPU-backed items to the same approval and lease policies you use elsewhere, so an expensive GPU deployment cannot exist without an owner and an expiry.

Lever | What it controls | Who sets it
Region GPU capacity | Total GPU an org can use across supervisors | Provider admin
Namespace GPU quota | The GPU ceiling per team | Org / project admin
Consumption policy | Who may request GPU items and how much | Org admin
Templates | The approved GPU profiles on offer | Org admin / provider
Roles | Who creates namespaces, requests, approves | Org admin (delegated)
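To make the template lever concrete, here is a rough model of a published catalog, assuming three approved profiles. The template names, the class and the lease length are hypothetical and are not VCF Automation objects; the point is only that a request resolves to a known shape or is refused.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuTemplate:
    name: str
    resource: str    # device-plugin resource name the template consumes
    count: int       # how many of that resource one deployment takes
    lease_days: int  # default expiry, so nothing lives without an end date


# Hypothetical catalog: a short, fixed list of approved shapes.
APPROVED = {
    t.name: t
    for t in (
        GpuTemplate("inference-slice", "nvidia.com/mig-1g.10gb", 1, 14),
        GpuTemplate("single-gpu", "nvidia.com/gpu", 1, 14),
        GpuTemplate("training-node", "nvidia.com/gpu", 4, 14),
    )
}


def resolve(template_name):
    """Return the approved template, or raise if the shape is not on offer."""
    try:
        return APPROVED[template_name]
    except KeyError:
        raise ValueError(
            f"{template_name!r} is not an approved GPU profile"
        ) from None


print(resolve("training-node"))
```

Because every template carries its resource name and count, each request can be costed and checked against the namespace quota before anything is deployed.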

Roles and delegation

Roles decide who can pull each lever. In 9.1, an org admin can delegate namespace creation to project admins and platform engineers within defined guardrails, which removes the bottleneck of every namespace going through one team while keeping the limits enforced. The pattern that works: the org admin owns GPU quota and consumption policy, delegates namespace creation and day-to-day sizing to project admins, and leaves GPU requests to the consumers within the published templates. Each role can do its job without holding a key to the whole budget.

Figure: Delegation without losing control. Each role does its job; only one holds the GPU budget.
  • Org admin: GPU quota, consumption policy, templates and roles.
  • Project admin / platform engineer: creates namespaces, sizes within guardrails.
  • Consumer: requests GPU items from templates, within quota.
Delegate creation and sizing downward; keep quota, policy and templates with the org admin.
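The split can be written down as a simple permission map. This is a sketch of the pattern described here, not the product's role model: the role keys and action names are invented for illustration and do not correspond to VCF Automation role identifiers.

```python
from enum import Enum, auto


class Action(Enum):
    SET_GPU_QUOTA = auto()
    SET_CONSUMPTION_POLICY = auto()
    PUBLISH_TEMPLATE = auto()
    CREATE_NAMESPACE = auto()
    SIZE_NAMESPACE = auto()
    REQUEST_GPU_ITEM = auto()


# Hypothetical mapping: budget levers stay with the org admin,
# creation and sizing are delegated, consumers only request.
ROLE_ACTIONS = {
    "org_admin": {
        Action.SET_GPU_QUOTA,
        Action.SET_CONSUMPTION_POLICY,
        Action.PUBLISH_TEMPLATE,
        Action.CREATE_NAMESPACE,
        Action.SIZE_NAMESPACE,
    },
    "project_admin": {Action.CREATE_NAMESPACE, Action.SIZE_NAMESPACE},
    "consumer": {Action.REQUEST_GPU_ITEM},
}


def can(role, action):
    """True if the role is allowed to perform the action."""
    return action in ROLE_ACTIONS.get(role, set())


print(can("project_admin", Action.CREATE_NAMESPACE))  # True
print(can("project_admin", Action.SET_GPU_QUOTA))     # False
```

Writing the map out like this is a quick way to confirm that exactly one role can change the GPU budget before you configure the real roles.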

Activation, and provider-side capacity in 9.1

Private AI Services are activated per organization namespace, not switched on everywhere by default. That is a governance feature, not an inconvenience: a namespace without Private AI Services activated cannot consume the AI catalog at all, which gives you a clean, explicit boundary around who is in the AI program. Activate it for the namespaces that need it and leave the rest alone.

On the provider side, 9.1 makes the capacity math easier. A provider can grant an organization quota across multiple supervisors in a region, and share a region's capacity across supervisors for several organizations, rather than pinning each org to one supervisor. For GPU, where the hardware is lumpy and expensive, that flexibility is the difference between stranded GPUs sitting idle in the wrong supervisor and a pool that several teams can actually reach.
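A toy comparison shows why region-wide quota matters for lumpy hardware. It assumes a workload must fit inside a single supervisor and uses a best-fit rule chosen purely for illustration; it does not describe how VCF Automation actually places workloads, and the supervisor names and free counts are made up.

```python
def place(gpus_needed, free_by_supervisor, allowed):
    """Pick the allowed supervisor with the least free GPU that still fits.

    Returns the supervisor name, or None if nothing reachable has room.
    """
    candidates = [
        (free, name)
        for name, free in free_by_supervisor.items()
        if name in allowed and free >= gpus_needed
    ]
    return min(candidates)[1] if candidates else None


# Hypothetical region: supervisor-a is full, the others have spare GPUs.
free = {"supervisor-a": 0, "supervisor-b": 4, "supervisor-c": 2}

pinned = place(2, free, allowed={"supervisor-a"})  # org tied to one supervisor
pooled = place(2, free, allowed=set(free))         # org quota spans the region

print(pinned, pooled)  # None supervisor-c
```

With the org pinned to a full supervisor the request fails even though six GPUs sit idle elsewhere in the region; with quota granted across supervisors the same request finds room.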

Figure: What a GPU request passes through. Three gates before a GPU is ever attached.
  1. Consumption policy: allowed?
  2. Namespace GPU quota: room?
  3. Approval + lease applied
  4. GPU attached or request denied
Policy, then quota, then approval and lease. A GPU only attaches after all three pass.
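The gate order can be sketched as a short function that stops at the first failure. This is a simplified model under stated assumptions: the consumption policy is reduced to a set of allowed users, approval to a boolean, and the lease to a fixed number of days. The class, function and user names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    user: str
    namespace: str
    gpus: int
    approved: bool = False


def evaluate(req, allowed_users, quota_hard, quota_used, lease_days=14):
    """Run the gates in order: policy, then quota, then approval and lease."""
    if req.user not in allowed_users:
        return "denied: consumption policy"
    used = quota_used.get(req.namespace, 0)
    if used + req.gpus > quota_hard.get(req.namespace, 0):
        return "denied: namespace GPU quota"
    if not req.approved:
        return "pending: approval"
    quota_used[req.namespace] = used + req.gpus  # GPU now counts against quota
    return f"attached: lease {lease_days} days"


# Hypothetical state: a ceiling of four GPUs with three already in use.
hard = {"ai-tenant-ns": 4}
used = {"ai-tenant-ns": 3}
allowed = {"asha"}

print(evaluate(Request("asha", "ai-tenant-ns", 1, approved=True), allowed, hard, used))
print(evaluate(Request("asha", "ai-tenant-ns", 1, approved=True), allowed, hard, used))
```

The first call takes the fourth GPU and gets a lease; the second is the fifth GPU and is refused at the quota gate, before approval is even considered.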
Worked example
A region has 16 GPUs. You grant the AI organization a 12-GPU quota and keep 4 in reserve for bursts and failures. Inside the org, three teams get namespace quotas of 4, 4 and 2, leaving 2 unallocated for ad hoc requests. You publish three templates: a 1-slice MIG inference profile, a single full GPU for fine-tuning, and a 4-GPU training node, each with a default 14-day lease. Consumers self-serve within those limits. When a team needs more, they ask, you see the current utilization before deciding, and you raise a quota by a known amount rather than discovering the overcommit after the fact. Idle GPUs show up against the quota, so the four-percent-utilization hoard becomes a visible number instead of a surprise.
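The arithmetic in this example is easy to check mechanically. The following is a small sanity check, assuming the numbers above; the function and team names are placeholders.

```python
def check_plan(region_gpus, org_quota, team_quotas):
    """Return (provider_reserve, org_unallocated); raise if a layer overcommits."""
    reserve = region_gpus - org_quota
    unallocated = org_quota - sum(team_quotas.values())
    if reserve < 0:
        raise ValueError("org quota exceeds region capacity")
    if unallocated < 0:
        raise ValueError("team quotas exceed the org quota")
    return reserve, unallocated


reserve, unallocated = check_plan(
    region_gpus=16,
    org_quota=12,
    team_quotas={"team-a": 4, "team-b": 4, "team-c": 2},
)
print(reserve, unallocated)  # 4 2
```

Running the same check before every quota increase keeps the overcommit visible at the moment you approve it rather than after the fact.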
Disclaimer
GPU quota, consumption policy and Private AI Services activation are governance changes that affect what AI tenants can consume and spend. Validate in a non-production organization, confirm your vGPU or MIG partitioning strategy before writing quotas, and follow the current Broadcom documentation for Private AI Foundation and your exact VCF Automation build.

My Recommendation

Govern GPU before you publish a single AI catalog item. Set a namespace GPU quota lower than the ask, offer a short list of approved vGPU and MIG profiles as templates, activate Private AI Services only where it is needed, and put GPU-backed deployments under the same approval and lease policies as everything else. The reason is simple economics: GPU is scarce and costly, and an ungoverned pool turns into both a budget overrun and a queue at the same time, because the teams that grabbed early hold capacity the teams that need it cannot reach. When would I loosen this? For a dedicated research program with its own funded GPU pool and accountable owner, give them a large quota and get out of the way. For shared infrastructure, keep it tight and make idle GPU visible. The components and catalog belong to Part 27 and the Private AI Series; the discipline that keeps them affordable is yours. What is your GPU utilization across teams right now, and could you prove it?

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
