Dr. Pranay Jha

Why Automate VCF 9: The API-First Shift and What Actually Changed (Automating VCF Series, Part 1)

VCF 9 collapsed SDDC Manager, vCenter, NSX and vSAN behind one Unified REST API. Here is what changed, the five tools that share it, and where I would start automating.

Automating VCF Series · Part 1 of 30

TL;DR · Key Takeaways

  • VCF 9 put SDDC Manager, vCenter, NSX and vSAN behind one Unified VCF REST API. Automate against that contract, not the per-component legacy endpoints.
  • PowerVCF is archived. The supported PowerShell module is VCF.PowerCLI 9.1.x. Do not start new work on PowerVCF; the live job is migrating off it.
  • Five tools share the same API: VCF.PowerCLI, the vmware/vcf Terraform provider, Ansible, the Unified VCF SDK for Python and Java, and plain curl. Pick by job, not habit.
  • Aria is gone. VCF Operations (was Aria Operations) and VCF Automation (was Aria Automation) are the current names, and the doc paths moved with them.
  • The first thing that breaks is auth. The access token expires around the 60-minute mark, and a token that died mid-run is the most common pipeline failure I see.
Who this is for: VMware admins, platform and DevOps engineers, and architects moving from click-ops to code on VCF 9.

Prerequisites: a reachable VCF 9.0 or 9.1 instance, a service account on SDDC Manager, and working comfort with REST, PowerShell, or Terraform.

Run a script you wrote against last year’s VCF and watch half of it fall over on VCF 9. The PowerVCF cmdlet you had in muscle memory is gone, replaced by Connect-VcfSddcManager. The endpoint you used to POST to now sits behind a single unified contract. And the token you cached for the whole job expires under you at the one-hour mark, so the long task fails three quarters of the way through with a 401 and no clean rollback. None of that is bad luck. It is the API-first redesign doing exactly what it was meant to do, and it changes how you build automation for this platform.

What API-first actually changed

Earlier VCF made you speak four dialects. SDDC Manager had one API, vCenter another, NSX a third, vSAN a fourth, each with its own authentication, object model, and SDK. Your automation turned into glue: translate a domain object here, re-authenticate there, reconcile two inventories that disagreed. VCF 9 consolidates that into a single Unified VCF REST API that fronts SDDC Manager, vCenter, NSX, and vSAN. One base path. One token model. One convention for long-running work. That is the shift, and every later Part in this series sits on top of it.

Why you should care

If you wrote automation for VCF 5.x, a meaningful chunk of it assumes the old per-component split. Those endpoints still answer underneath, but the documented, supported, forward path is the unified contract. Building new work against legacy per-product endpoints is how you take on technical debt before the project even ships. What I tell clients: treat the Unified VCF REST API as the source of truth, and treat PowerCLI, Terraform, Ansible, and the SDK as different doors into the same house.

Figure: The VCF 9 automation stack. One contract underneath, many tools on top. VCF.PowerCLI (imperative), Terraform vmware/vcf, Ansible (idempotent), the Unified SDK (Python / Java), and curl (spikes / debug) all call the Unified VCF REST API (one base path, one token model, one async-task convention), which fronts SDDC Manager, vCenter, NSX, and vSAN.
Every tool in the series talks to the same unified contract, which fronts the four core platform components.

The toolchain, and what each tool is actually for

The unified endpoint speaks to all of them, so the real question is not which tool can do the job. Most can. The question is which one fits the shape of the work. Here is how I split them.

Tool | Reach for it when | When I would NOT
VCF.PowerCLI 9.1.x | Interactive and operational scripting: bring-up, host ops, day-2 tasks | As your system of record for desired state
Terraform vmware/vcf | Declarative infrastructure you keep in Git: domains, network pools, clusters | One-off imperative actions like a single host commission
Ansible | Sequencing config across a mixed estate, alongside OS and app config | Heavy stateful infra lifecycle where Terraform state serves you better
Unified VCF SDK (Python/Java) | Embedding VCF actions inside a larger application or service | A five-line task a PowerCLI one-liner already covers
curl | Spikes, debugging, confirming a raw request and response shape | Production workflows that need state, retries, and review

A few specifics worth pinning down, because versions matter here. VCF.PowerCLI is one Install-Module VCF.PowerCLI away, currently in the 9.1.x line. The Terraform provider is published as vmware/vcf (0.18.x), not under a hashicorp namespace, and it targets SDDC Manager and fleet resources. Keep it separate in your head from vmware/vcfa, which is the distinct provider for VCF Automation catalog and IaaS deployments. The Unified SDK installs with pip install vcf-sdk and runs on Python 3.10 through 3.14. Two Terraform paths, one provider name people constantly get wrong: that confusion alone causes a surprising number of failed terraform init runs.

My take: start with VCF.PowerCLI because it matches how admins already think, then push anything that represents desired state into the Terraform vmware/vcf provider so it lives in Git and gets reviewed. Reserve the SDK for when VCF is one part of a bigger program, and keep curl in your back pocket for when an API is misbehaving and you need the unvarnished response.

Authentication is the part everyone underestimates

The single most common automation failure I see on VCF is not a clever bug. It is a token that expired mid-run. Every call into the unified API needs a bearer token, and you get one by POSTing credentials to /v1/tokens. That returns a pair: a short-lived access token and a longer-lived refresh token. The access token is the one you put in the Authorization header, and it lapses around the 60-minute mark. Cache it once at the top of a 90-minute job and the back third of that job fails.

# 1. Get the token pair (access + refresh)
#    -k skips TLS verification; acceptable in a lab, not in production
curl -sk -X POST https://sddc-manager.lab.local/v1/tokens \
  -H 'Content-Type: application/json' \
  -d '{"username":"svc-automation@vsphere.local","password":"********"}'

# Response shape:
# {
#   "accessToken":  "eyJhbGciOiJSUzI1NiJ9.eyJ...",   <- expires ~60 min
#   "refreshToken": "9b3c0f2a-7e1d-4a5b-8c6e-..."     <- use to renew
# }

# 2. Use the access token on every call
curl -sk https://sddc-manager.lab.local/v1/domains \
  -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiJ9.eyJ...'

# 3. Renew BEFORE it dies, do not wait for the 401
curl -sk -X PATCH https://sddc-manager.lab.local/v1/tokens/access-token/refresh \
  -H 'Content-Type: application/json' \
  -d '{"refreshToken":"9b3c0f2a-7e1d-4a5b-8c6e-..."}'

# Failure mode: a cached access token used past ~60 min returns
# HTTP 401 with "The token has expired". Catch 401, refresh, retry once.

Two design rules fall out of this. First, never cache an access token for the lifetime of a long job. Refresh preemptively on a timer, or catch the 401, refresh once, and retry the failed call. Second, do not run automation under a personal login. Use a dedicated service account scoped through RBAC, so a token leak or a stale credential does not hand someone admin, and so revoking automation access does not lock out a human.
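
The first rule can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not SDK code: TokenManager, call_with_retry, and the Unauthorized exception are hypothetical names, and the login and refresh callables are placeholders you would wire to POST /v1/tokens and the refresh endpoint with whatever HTTP client you use. The 45-minute threshold is my own margin inside the roughly 60-minute lifetime.

```python
import time


class Unauthorized(Exception):
    """Hypothetical: raised by your HTTP wrapper when the API answers 401."""


class TokenManager:
    """Caches an access token and renews it before the ~60-minute expiry."""

    def __init__(self, login, refresh, refresh_after_s=45 * 60, clock=time.monotonic):
        self._login = login            # () -> (access_token, refresh_token)
        self._refresh = refresh        # (refresh_token) -> new access_token
        self._refresh_after_s = refresh_after_s
        self._clock = clock
        self._access = None
        self._refresh_token = None
        self._issued_at = 0.0

    def token(self):
        """Return a usable access token, renewing it if it is getting old."""
        if self._access is None:
            self._access, self._refresh_token = self._login()
            self._issued_at = self._clock()
        elif self._clock() - self._issued_at >= self._refresh_after_s:
            self.force_refresh()
        return self._access

    def force_refresh(self):
        """Renew immediately, for example after an unexpected 401."""
        self._access = self._refresh(self._refresh_token)
        self._issued_at = self._clock()
        return self._access


def call_with_retry(tokens, request):
    """Run request(access_token); on a 401, refresh once and retry once."""
    try:
        return request(tokens.token())
    except Unauthorized:
        return request(tokens.force_refresh())
```

Ask the manager for a token on every call instead of holding one in a variable for the whole job. The timer handles the normal case, and the single retry covers a token that died earlier than expected.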

The async task pattern you have to design around

Most write operations against VCF do not finish inside the HTTP call. POST a workload domain and you get back 202 Accepted plus a task id, then you poll the task until it reaches COMPLETED or FAILED. This is where non-idempotent code bites: if you miss the 202 and blindly re-POST, you can kick off the same domain creation twice. Capture the task id, poll it, and make your create calls safe to retry.
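
Here is a rough Python sketch of that capture, poll, and safe-to-retry pattern. Everything in it is illustrative: wait_for_task and ensure_domain are hypothetical helpers, get_task, list_domains, and create_domain stand in for your own wrappers around GET /v1/tasks/{id}, GET /v1/domains, and POST /v1/domains, and the "status" and "name" field names are assumptions. The COMPLETED and FAILED strings follow the wording above, so check them against the task payload your instance actually returns.

```python
import time


class TaskFailed(Exception):
    """Raised when a polled task ends in the FAILED state."""


def wait_for_task(get_task, task_id, interval_s=30.0, timeout_s=3 * 60 * 60,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll get_task(task_id) until the task is COMPLETED or FAILED."""
    deadline = clock() + timeout_s
    while True:
        task = get_task(task_id)
        status = task.get("status")      # field name is an assumption
        if status == "COMPLETED":
            return task
        if status == "FAILED":
            raise TaskFailed(f"task {task_id} failed: {task}")
        if clock() >= deadline:
            raise TimeoutError(f"task {task_id} still {status} after {timeout_s}s")
        sleep(interval_s)


def ensure_domain(name, list_domains, create_domain, get_task, **poll_kwargs):
    """Create the domain only if it is absent, then wait for the task.

    Returns None when the domain already exists, so a re-run after a
    missed 202 does not start a second creation.
    """
    if any(d.get("name") == name for d in list_domains()):
        return None
    task_id = create_domain(name)        # wraps the POST, returns the task id
    return wait_for_task(get_task, task_id, **poll_kwargs)
```

The existence check is what makes the create call safe to retry. A stricter version would also look for an in-flight task for the same domain before posting.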

Figure: Auth, then the async task loop. Write calls return a task id, not a result. (1) POST /v1/tokens returns the access and refresh tokens; (2) POST /v1/domains with the bearer token; (3) the API answers 202 plus a taskId; then poll GET /v1/tasks/{id} until COMPLETED or FAILED, refreshing the token if it expires.
Authenticate once, send the bearer request, then poll the returned task id rather than assuming the work is done.

Worked example

A workload domain bring-up across 9 hosts runs roughly 90 minutes in my experience. The access token lives about 60. Cache it once at the start and it dies near minute 60, leaving the final 30 minutes of the job throwing 401s with no completed domain to show for it. Refresh on a 45-minute timer (comfortably inside the 60-minute window) or catch the first 401 and renew, and the same job finishes clean. The fix is four lines of code. The bug is a 2am page.
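
The arithmetic is easy to check with a toy simulation. This is only a model of the numbers above (a 90-minute job, a 60-minute token, a 45-minute refresh timer), and failing_minutes is a hypothetical helper that assumes every renewal succeeds.

```python
def failing_minutes(job_min, token_ttl_min, refresh_every_min=None):
    """Count job minutes during which the held access token is already expired."""
    issued_at = 0
    failures = 0
    for minute in range(job_min):
        # Preemptive renewal on a timer, if one is configured.
        if refresh_every_min and minute - issued_at >= refresh_every_min:
            issued_at = minute
        if minute - issued_at >= token_ttl_min:
            failures += 1
    return failures


print(failing_minutes(90, 60))        # token cached once: 30
print(failing_minutes(90, 60, 45))    # refreshed on a 45-minute timer: 0
```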


The names changed, and it matters for automation

This is not cosmetic. When the product names moved, the documentation paths, module names, and API references moved with them. Search for old terms and you land on deprecated material that points at endpoints and cmdlets that no longer apply to VCF 9. Keep this mapping handy.

Old name (pre-9) | VCF 9 name | What it does for automation
VMware Aria Automation | VCF Automation | Self-service IaaS, catalog, declarative templates, multi-tenancy
VMware Aria Operations | VCF Operations | Fleet management, lifecycle, upgrades and patching, real-time metrics
PowerVCF (PowerShell module) | VCF.PowerCLI 9.1.x | Supported PowerShell automation; PowerVCF is archived
Separate per-product SDKs | Unified VCF SDK | One SDK across vSphere, vSAN, SDDC Manager, NSX (Python and Java)

In practice: the mistake teams make is copying a snippet from a 2023 blog post that still says Aria Automation and PowerVCF, then burning an afternoon on why a cmdlet does not exist. If a code sample predates VCF 9, assume the names and paths are wrong until proven otherwise.
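
One cheap guard is to scan a borrowed snippet for pre-9 names before you run it. The sketch below is an illustration only: LEGACY_NAMES is built from the mapping above, find_legacy_names is a hypothetical helper, and a plain substring match will not catch every outdated cmdlet or path.

```python
import re

# Old name -> VCF 9 name, taken from the mapping above.
LEGACY_NAMES = {
    "Aria Automation": "VCF Automation",
    "Aria Operations": "VCF Operations",
    "PowerVCF": "VCF.PowerCLI",
}


def find_legacy_names(text):
    """Return (line_number, old_name, new_name) for each pre-9 name found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for old, new in LEGACY_NAMES.items():
            if re.search(re.escape(old), line, flags=re.IGNORECASE):
                hits.append((lineno, old, new))
    return hits


snippet = "Import-Module PowerVCF\n# deploy via VMware Aria Automation\n"
for lineno, old, new in find_legacy_names(snippet):
    print(f"line {lineno}: '{old}' predates VCF 9, look for '{new}' instead")
```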

What I would automate first, and what I would leave alone

You do not have to codify the whole platform on day one, and you should not try. The first thing I reach for is read-only work: inventory pulls and drift detection. Low blast radius, immediate value, and it builds the auth and polling plumbing you will reuse everywhere else. From there I move to secrets and token handling, then to repeatable day-2 operations like host commission and decommission, then to full workload domain bring-up once the patterns are proven.

What I would leave alone at first: the management domain bring-up itself. The VCF Installer handles the initial deployment well, and hand-rolling that path before you trust your tooling buys risk without much reward. The ordering is deliberate. Start read-only because a buggy GET cannot corrupt your estate. Do not lead with domain creation because a half-finished async task is painful to unwind. And before any mutating call, validate that your service account has exactly the RBAC scope it needs and no more.
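
To make the read-only starting point concrete, here is a small drift check in Python. It is a sketch under assumptions: detect_drift is a hypothetical helper, the desired inventory would come from a file you keep in Git, the actual inventory from a read-only pull such as GET /v1/domains reshaped into a name-keyed dictionary, and the "clusters" field is invented for illustration.

```python
def detect_drift(desired, actual):
    """Compare desired and actual inventories, both keyed by object name.

    Only the fields listed in the desired entry are compared, so extra
    fields reported by the platform do not count as drift.
    """
    drift = {"missing": [], "unexpected": [], "changed": {}}
    for name, want in desired.items():
        have = actual.get(name)
        if have is None:
            drift["missing"].append(name)
            continue
        diffs = {k: (v, have.get(k)) for k, v in want.items() if have.get(k) != v}
        if diffs:
            drift["changed"][name] = diffs   # field -> (desired, actual)
    drift["missing"].sort()
    drift["unexpected"] = sorted(set(actual) - set(desired))
    return drift


desired = {"wld01": {"clusters": 2}, "wld02": {"clusters": 1}}
actual = {"wld01": {"clusters": 3, "status": "ACTIVE"}, "mgmt": {"clusters": 1}}
print(detect_drift(desired, actual))
```

Nothing here mutates the platform, which is the point: the worst a bug can do is produce a wrong report.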

Figure: Click-ops vs codified VCF. Same platform, very different operational cost. Click-ops: manual UI clicks; config drift between instances; no review, no audit trail; tribal knowledge; not repeatable at 2am. Codified (Git to API): desired state in version control; idempotent, repeatable runs; peer review on every change; drift detection in the pipeline; same result every time.
The point of automating VCF is not speed for its own sake. It is repeatability, review, and the end of drift.
Disclaimer: the commands here mutate platform state. Test against a lab or non-production instance first, use a scoped non-production service account, back up SDDC Manager before lifecycle operations, and dry-run any Terraform plan before you apply it. Treat a domain create or host decommission as a change with a rollback plan, not a one-liner.

My Take

VCF 9 is the first release where API-first is real rather than aspirational. One contract, one token model, one async convention, and a toolchain that all points at the same place. If you are still driving this platform by hand, you are paying an operational tax every day in drift and rework. My recommendation: learn the auth and async patterns first, because they are the load-bearing parts everything else depends on, then start with VCF.PowerCLI for muscle memory and move your desired state into the Terraform vmware/vcf provider so it lives in Git. The tools are ready. The naming is settled. The only thing left is to stop clicking.

Next in this series we map the full toolchain end to end. For the wider platform context, my VCF 9 Explained walkthrough sets the scene, the API-first runbook is a practical companion, and VCF Automation in VCF 9 Explained covers the self-service side. Which tool are you reaching for first on VCF 9, PowerCLI or Terraform? Tell me in the comments.

Automating VCF Series navigation:
Previous: this is Part 1 (start of the series).
Next: Part 2, the VCF 9 automation toolchain (coming soon).
Up: VCF Automation Guide (pillar).

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
