Troubleshooting, Hardening and the Verdict on VCF Automation (VCF Automation 9 Series, Part 30)

The finale: where to look when a VCF Automation deployment fails, the hardening that actually matters including the certificate-rotation trap, and an honest verdict after 30 parts on who the product is for and when it is overkill.

Dr. Pranay Jha

June 22, 2026

VCF Automation 9 Series · Part 30 of 41

TL;DR · Key Takeaways

Most VCF Automation deployment failures are environment hygiene, not product defects: DNS, NTP, naming rules, port 443 to vCenter, and certificates.
Diagnose in order: Fleet Management tasks first, then the installer log, then the services platform pods with kubectl and vracli on the appliance.
The certificate-rotation trap is real: the automated Rotate workflow does not cover some management components, so you must monitor expiry alerts and remediate manually before the window closes.
Hardening that matters is RBAC discipline, identity and service-account hygiene, TLS and the compliance baseline. Lean roles beat broad ones every time.
The verdict after 30 parts: VCF Automation rewards design discipline and punishes defaults. It is the right tool for governed multi-tenant self-service, and overkill if you only need a few VMs scripted.

Who this is for: Cloud and platform admins who run VCF Automation day to day and want a diagnostic and hardening reference, plus anyone weighing whether the product fits their environment.

Prerequisites: A VCF Automation 9.x instance, Fleet Management access, and SSH to the appliance for the deeper checks. The whole series behind you helps, but this Part stands on its own.

Most tickets that arrive as "VCF Automation is broken" are not about VCF Automation. They are DNS that does not resolve, NTP that drifted, a cluster name with a space in it, a firewall that blocks port 443 to vCenter, or a certificate that expired over a long weekend. The product sits on top of a Kubernetes-based services platform, and that platform is unforgiving about the environment underneath it. Learn to read the failure in the right order and most of these resolve in minutes instead of a day.

This is the final Part. I will cover the troubleshooting order I actually use, the hardening that earns its place, and then a real verdict after thirty parts on VCF Automation 9 (the product formerly known as VMware Aria Automation, and before that vRealize Automation). I am writing against the current VCF 9.1 release.

Where to look, in order

The mistake is diving straight into appliance logs. Start higher and move down only as needed. First, check the Fleet Management tasks: in VCF Operations, go to Fleet Management, then Lifecycle, then VCF Management, then Tasks, and open the VCF Automation task, where the failing step usually names its own cause. If that is not enough, the VCF Installer log at the OS level carries the orchestration detail. Only then do you SSH to the appliance and inspect the services platform pods directly. Ninety percent of the time you never reach the third step, because the task error has already told you it was a name with a space or a blocked port.

Start at the task, not the pod. The deepest layer is the last resort, not the first move.

The failures that actually bite

A handful of issues account for most failed deployments, and they share a theme: the environment broke a rule the Kubernetes services platform cannot bend. The services platform builds resources whose names must follow RFC 1123, so a datacenter or cluster name with a space, an underscore or an uppercase letter causes cluster creation to fail outright. A blocked port 443 between the Automation nodes and vCenter stalls the deploy. And setting NTP at the wrong moment during first boot has a known habit of leaving pods uncreated.

Symptom	Likely cause	Fix
Services platform cluster creation fails	Space/underscore/uppercase in datacenter or cluster name	Use RFC 1123 names: lowercase letters, digits and hyphens only
Deploy stalls connecting to vCenter	Port 443 blocked from Automation nodes	Open 443 node-to-vCenter, verify firewall
Pods never create on first boot	NTP set during/before firstboot	Follow the documented NTP sequence; redeploy
"Existing deployment already present"	Prior partial/failed deployment not cleaned	Clear the prior deployment before retrying
Logins fail after a healthy restore	Identity Broker not restored with Automation	Restore both as a set (see Part 29)
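
The naming rule in the first row is easy to check before a deploy. Here is a minimal sketch, assuming the Kubernetes reading of an RFC 1123 label: 1 to 63 characters, lowercase letters, digits and hyphens, with a letter or digit at both ends. The function name is_rfc1123_label and the sample names are hypothetical, and the exact validation the installer applies may differ for your release.

```shell
#!/bin/sh
# Hypothetical helper, not part of any VCF tooling.
# Returns 0 if the argument is a valid RFC 1123 label as Kubernetes defines it:
# 1-63 characters, lowercase letters, digits and hyphens, alphanumeric at both ends.
is_rfc1123_label() {
    printf '%s\n' "$1" | grep -Eq '^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$'
}

# Screen candidate datacenter and cluster names (sample values only).
for name in "mgmt-cluster-01" "Mgmt Cluster" "mgmt_cluster"; do
    if is_rfc1123_label "$name"; then
        echo "ok       $name"
    else
        echo "REJECTED $name"
    fi
done
```

Running a check like this against the names in your build sheet costs seconds; a failed cluster creation costs a redeploy.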

# Step 1 (UI): VCF Operations -> Fleet Management -> Lifecycle
#              -> VCF Management -> Tasks -> the VCF Automation task

# Step 2 (OS): orchestration detail on the VCF Installer
tail -f /var/log/vmware/vcf/domainmanager/domainmanager.log

# Step 3 (appliance): inspect the services platform pods
kubectl get pods -n prelude
kubectl describe pod <failing-pod> -n prelude

# Service status / health
vracli status
# look for errors in vracli-service-status.log

In practice: Before I open a single log on a fresh deploy, I check four things: forward and reverse DNS for every appliance, NTP in sync, names that pass RFC 1123, and port 443 to vCenter. That four-item preflight prevents more failed deployments than any amount of after-the-fact log reading.
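
That preflight can be scripted. What follows is a rough sketch, not a supported tool: it assumes a systemd-based Linux jump host on the management network with getent, nc and timedatectl available, and every function name, hostname and IP in it is a placeholder I made up for illustration. Note that getent also consults /etc/hosts, and that a port test from a jump host only approximates the path the Automation nodes will take.

```shell
#!/bin/sh
# Hypothetical preflight sketch; all hostnames and IPs are placeholders.
failures=0

# run_check <label> <command...>: run a command quietly, print PASS or FAIL,
# and count failures so the caller can decide whether to proceed.
run_check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        printf 'PASS  %s\n' "$label"
    else
        printf 'FAIL  %s\n' "$label"
        failures=$((failures + 1))
    fi
}

# valid_name <name>: lowercase letters, digits, hyphens; alphanumeric at both ends.
valid_name() {
    printf '%s\n' "$1" | grep -Eq '^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$'
}

# ntp_synced: succeeds when systemd reports the local clock as NTP-synchronized.
ntp_synced() {
    timedatectl show -p NTPSynchronized --value | grep -qx yes
}

# preflight <appliance-fqdn> <appliance-ip> <vcenter-fqdn> <cluster-name>
preflight() {
    run_check "forward DNS: $1" getent hosts "$1"
    run_check "reverse DNS: $2" getent hosts "$2"
    run_check "NTP in sync on this host" ntp_synced
    run_check "RFC 1123 name: $4" valid_name "$4"
    run_check "TCP 443 to $3" nc -z -w 5 "$3" 443
}

# Usage (placeholder values):
#   preflight auto01.example.com 192.0.2.10 vcenter.example.com mgmt-cluster-01
#   [ "$failures" -eq 0 ] || echo "$failures preflight check(s) failed"
```

The wrapper keeps going after a failure on purpose, so one run shows every broken item instead of only the first.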

Hardening that earns its place

Hardening VCF Automation is layered, and each layer has a control that matters more than the rest. Identity and access is first: a lean RBAC model with default system roles, custom provider roles published only to the organizations that need them, and tenant separation that holds. Resist role sprawl, because a broad custom role copied across twenty organizations is a privilege you cannot easily walk back. Use service accounts and OAuth 2.0 API tokens for automation rather than personal credentials, which ties directly to the token work earlier in the series.

Four layers. The certificate layer is the one most teams discover the hard way.

Gotcha · The certificate-rotation trap

The automated Rotate workflow does not cover every management component; Fleet Management itself, VCF Operations, and the Identity Broker are outside it. If you assume Rotate handles everything, a certificate expires unnoticed and authentication or lifecycle operations stop. Treat the 30-day and 7-day expiry alerts from VCF Operations as action items, and run the manual Update or Remediate workflow for those components well before the deadline. This is the single hardening detail I see bite mature environments most.
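
As an independent cross-check on those alerts, you can read a certificate's expiry yourself. This is a minimal sketch, assuming openssl is installed on the host you run it from and that the component presents its certificate on port 443; the function names and the hostname are hypothetical. It only reads certificates and is no substitute for the Update or Remediate workflow.

```shell
#!/bin/sh
# Hypothetical helpers; they read certificates and change nothing.

# fetch_cert <host> [port]: print the server's leaf certificate in PEM form.
fetch_cert() {
    openssl s_client -connect "$1:${2:-443}" -servername "$1" </dev/null 2>/dev/null |
        openssl x509 -outform PEM
}

# cert_window <pem-file>: print which expiry window the certificate falls in.
# openssl's -checkend N exits non-zero if the certificate expires within N seconds.
cert_window() {
    if ! openssl x509 -in "$1" -noout >/dev/null 2>&1; then
        echo "unreadable"
    elif ! openssl x509 -in "$1" -noout -checkend 0 >/dev/null; then
        echo "expired"
    elif ! openssl x509 -in "$1" -noout -checkend $((7 * 86400)) >/dev/null; then
        echo "7-day"
    elif ! openssl x509 -in "$1" -noout -checkend $((30 * 86400)) >/dev/null; then
        echo "30-day"
    else
        echo "ok"
    fi
}

# Usage (placeholder hostname):
#   fetch_cert fleet.example.com > /tmp/fleet.pem && cert_window /tmp/fleet.pem
```

The two thresholds mirror the 30-day and 7-day alerts, so anything other than "ok" is a prompt to schedule the manual remediation.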

What the whole series adds up to

Thirty parts, one recurring lesson: VCF Automation rewards design discipline and punishes defaults. The same pattern showed up everywhere. The catalog with no lease policy fills a cluster with zombies. The blocking subscription with no condition wedges every deployment. The custom resource with no read workflow acts on stale data. The GPU item with no profile allowlist burns a budget. None of these are product faults; they are the cost of accepting the default instead of making a decision.

The throughline is a loop. Design the tenancy deliberately, govern it with policy, extend it idempotently, and operate it with tested recovery, then feed what you learn back into the design. Teams that run that loop get a platform they can hand to tenants. Teams that skip a quarter of it get a help desk.

Run the whole loop. Skipping govern or recover is where self-service quietly turns into a support burden.

Go-live hardening checklist

Before a tenant touches the catalog: RBAC roles scoped and least-privilege; service accounts and API tokens in place of personal logins; certificate expiry monitoring active, with the components Rotate does not cover flagged for manual remediation; TLS ciphers and unused services reviewed against the VCF security configuration guide; a lease policy on every project; backups of Automation and the Identity Broker scheduled and restore-tested; and the four-item deploy preflight (DNS, NTP, RFC 1123 names, port 443) baked into your build process. If all seven are true, you are production-ready.

Disclaimer: Appliance-level commands and certificate or service remediation are production-changing operations on the management plane. Run them with care, in a maintenance window, and follow the version-specific Broadcom guidance and the VCF security configuration guide for your exact release.

When a healthy platform still misbehaves

Deployment-time failures are the loud kind. The quieter and more common day-two problem is a platform that installed cleanly and then misbehaves on a specific request: a deployment that hangs halfway, a day-2 action that errors, a catalog item that provisions for one project and not another. The instinct is to suspect the appliance. The cause is almost always one layer out from where the symptom appears.

A deployment that stalls partway is usually waiting on something external: a blocking subscription whose runnable item never returned, an IPAM or DNS call that timed out, an approval nobody actioned. Read the deployment’s request and event history before you read any log, because it will show you the exact step that is waiting. A day-2 action that fails on a live resource is usually a stale binding or a missing input rather than a broken engine. A catalog item that works for one project and not another is almost never the item; it is a content-sharing policy or an entitlement that was never scoped to the second project. In each case the symptom is in VCF Automation and the cause is in the configuration around it.

Resist the reflexive restart

The worst habit I see is restarting services or redeploying the moment something looks stuck. On a Kubernetes-based services platform that often makes diagnosis harder, because you destroy the state that would have told you what happened. Capture the failing deployment ID, the request details, and the relevant pod state first. A stuck workflow holding a lock will not be fixed by a reboot if the underlying external call is still failing; it will just fail again on the retry, and now you have lost the evidence. Slow down, read the request trail, and fix the cause. The platform is usually doing exactly what it was told; the instruction was wrong.
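
Capturing pod state first takes under a minute if it is scripted. A minimal sketch, assuming you are on the appliance with kubectl and vracli on the path and that the services run in the prelude namespace used in the commands earlier in this Part; the function names and the evidence directory layout are my own invention. It reads state only and restarts nothing.

```shell
#!/bin/sh
# Hypothetical evidence-capture sketch. It only reads state; it restarts nothing.

# capture <dir> <name> <command...>: save a command's stdout and stderr to
# <dir>/<name>.txt, and note the exit status instead of aborting on failure.
capture() {
    out_dir=$1
    out_name=$2
    shift 2
    "$@" >"$out_dir/$out_name.txt" 2>&1 ||
        echo "[exit status $?]" >>"$out_dir/$out_name.txt"
}

# collect_evidence <deployment-id>: snapshot pod and service state into a
# timestamped directory and print the directory name.
collect_evidence() {
    evidence_dir="evidence-$1-$(date +%Y%m%dT%H%M%S)"
    mkdir -p "$evidence_dir" || return 1
    capture "$evidence_dir" pods     kubectl get pods -n prelude -o wide
    capture "$evidence_dir" describe kubectl describe pods -n prelude
    capture "$evidence_dir" events   kubectl get events -n prelude --sort-by=.lastTimestamp
    capture "$evidence_dir" status   vracli status
    echo "$evidence_dir"
}

# Usage (placeholder deployment ID):
#   collect_evidence 1234-abcd
```

The request details and event history still come from the deployment itself in the UI; this only preserves the platform side of the picture before anyone reaches for a restart.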

This is the same lesson as the deploy-time hygiene, one level up: the product is rarely the thing that is broken. Treat VCF Automation as the messenger reporting a problem in the environment, the configuration, or an external dependency, and your mean time to resolution drops sharply. Treat it as the suspect and you spend the afternoon reading pod logs for a firewall rule.

My take: Keep a one-page runbook that maps the five most common symptoms to their real causes and the exact place to look. Most operational time on VCF Automation is lost not to hard problems but to looking in the wrong layer first, and a short, environment-specific map fixes that faster than any amount of product expertise.
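
One way to keep that runbook honest is to store it as a lookup rather than a wiki page. A small sketch, seeded with the five deploy-time symptoms from the failure table in this Part; the function name where_to_look and the match patterns are hypothetical, and your own map should carry your environment's symptoms instead.

```shell
#!/bin/sh
# Hypothetical runbook lookup seeded from the failure table in this Part.
# Replace the entries with the symptoms and causes from your own environment.
where_to_look() {
    case $1 in
        *"cluster creation"*) echo "Names: datacenter and cluster must pass RFC 1123" ;;
        *vCenter*)            echo "Network: port 443 from Automation nodes to vCenter" ;;
        *[Pp]ods*)            echo "Time: NTP sequence during first boot" ;;
        *"already present"*)  echo "Cleanup: prior partial deployment not cleared" ;;
        *[Ll]ogin*)           echo "Identity: restore Identity Broker with Automation" ;;
        *)                    echo "Unmapped: start at the Fleet Management task" ;;
    esac
}

# Usage:
#   where_to_look "Deploy stalls connecting to vCenter"
```

The fallback entry matters as much as the mapped ones: an unknown symptom still sends you to the task first, not the pod.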

The Verdict

After thirty parts, here is the honest call. VCF Automation is the right tool when you need governed, multi-tenant, self-service cloud on VCF: real provider and tenant separation, a catalog with policy behind it, extensibility, and an API and Terraform provider to run it as code. For that job it is strong and, in VCF 9, more coherent than the standalone Aria Automation era it replaces. It is overkill if your actual need is to script a handful of VMs or stand up one team’s infrastructure; for that, drive the VCF and vSphere APIs or Terraform directly and skip the platform, which is exactly what the separate Automating VCF series is for.

What I would do differently if starting today: decide VM Apps versus All Apps deliberately at the very beginning, because that choice shapes everything downstream; treat governance and recovery as day-one work, not day-ninety; and manage the config as code from the first organization so the platform is reproducible and auditable. The product does not save you from skipping those. It just makes the consequences arrive later and larger. Run the discipline loop and VCF Automation is a platform you can trust tenants with. That is the whole series in one sentence.

Thank you for reading all thirty parts. If you build one thing from this series, make it the governed lease-and-policy baseline before you open the catalog, and let everything else follow from there.


About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.


Tags: governance, hardening, Troubleshooting, VCF, VCF Automation, VCF Automation 9 Series
