Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture


VCF 9 Cutover and Decommission: The Teardown Failures That Actually Bite (VCF 9 Series, Part 18)

Cutting over to VCF 9 and tearing down the old environment? Here are the decommission failures that bite at workload domain delete, host removal, NSX edge and vSAN evacuation, and how to recover.

VCF 9 Series · Part 18 of 36

TL;DR · Key Takeaways

  • Decommission is an ordered teardown, not a single button. Workloads first, then NSX Edge, then hosts, then the cluster, then the domain. Skip a step and SDDC Manager stops you cold.
  • Deleting a workload domain is irreversible. Every VM, cluster and datastore inside it is destroyed, and the task can run for roughly 20 minutes with no undo.
  • The two failures that bite hardest: a shared NSX Manager that still references the domain, and a stale or dead host that will not leave the inventory.
  • Imported (brownfield) domains do not tear down through the normal UI workflow. They need the API and a documented KB path.
  • Decommission does not re-image hosts or reclaim every entitlement for you. Plan the cleanup of certs, DNS, backup jobs and monitoring before you walk away.
Who this is for: Architects and admins running the final cutover of a VCF 9 migration and tearing down the source environment.
Prerequisites: Workloads already migrated, a verified backup of anything you still need, and change-window approval. SDDC Manager 9.0 or 9.1, admin on the instance, SSH access to the SDDC Manager appliance.

The migration gets all the planning attention. The teardown is where things quietly go wrong. By the time you reach cutover, the workloads are running on the new VCF 9 instance, the project looks finished, and everyone wants the old hardware back. That pressure is exactly when a delete job stalls at 40 percent, an NSX Edge refuses to shrink, or a host sits in the inventory in an error state that no button will clear. This post is the field version of the teardown: the specific failures that show up at cutover and decommission, why they happen, and how to get past each one without losing data you needed.

Disclaimer: Everything below destroys infrastructure. Deleting a workload domain wipes its clusters, VMs and datastores permanently. Validate that every workload is migrated and registered on the target, confirm a tested backup, run the SDDC Manager precheck, and do the teardown inside an approved change window. Do not run a force decommission on a host that is still doing real work.
VCF 9 Teardown Order
Follow the sequence top to bottom. Each step gates the next.

  1. Migrate and verify workloads off the source. Confirm zero running VMs, templates, ISOs and registered objects on the datastore.
  2. Delete NSX Edge clusters on the domain. Detach T0 and T1 gateways first. You cannot shrink an edge cluster below two nodes.
  3. Remove hosts from clusters, then decommission. A host must not belong to any domain before it can be decommissioned.
  4. Delete the vSphere cluster (vSAN evacuation). Let full data evacuation and resync finish. Unmount any remote vSAN datastores first.
  5. Delete the workload domain (irreversible). Check no shared NSX Manager still references this domain before you commit.
  6. Clean up: re-image hosts, reclaim licensing, retire DNS, certs, backups. The part decommission does not do for you.

The VCF 9 teardown sequence. Each failure below maps to a step that was skipped or stalled.
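
The gating in that sequence can be sketched as a small runbook helper. This is a minimal illustration, not anything SDDC Manager exposes: the step names are labels invented for this sketch, and the only logic is "a step may start when every earlier step is complete".

```python
from typing import Optional

# Step names paraphrase the teardown order above. They are labels made up
# for this sketch, not SDDC Manager task names.
TEARDOWN_ORDER = [
    "migrate_and_verify_workloads",
    "delete_nsx_edge_clusters",
    "remove_and_decommission_hosts",
    "delete_vsphere_cluster",
    "delete_workload_domain",
    "post_teardown_cleanup",
]


def next_step(completed: set) -> Optional[str]:
    """Return the first step that is not yet complete, or None when all are done."""
    for step in TEARDOWN_ORDER:
        if step not in completed:
            return step
    return None


def can_start(step: str, completed: set) -> bool:
    """A step may start only when every earlier step is already complete."""
    index = TEARDOWN_ORDER.index(step)
    return all(earlier in completed for earlier in TEARDOWN_ORDER[:index])
```

Calling `can_start("delete_workload_domain", {"migrate_and_verify_workloads"})` returns `False`, which is the runbook equivalent of SDDC Manager stopping you cold.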

1. The cutover that left workloads behind

Symptom: The delete job fails fast, or worse, it succeeds and someone discovers a VM, a template, or a mounted ISO was still on the destroyed datastore.

Likely cause: “Cutover complete” was declared from a workload count, not from a full inventory sweep. Powered-off VMs, templates converted years ago, content library items, orphaned VMDKs and ISOs mounted from the local datastore all sit outside the running-VM list. They are still real data, and the domain delete will take them with it.

Fix: Before you touch decommission, treat the source domain as guilty until proven empty. Walk every cluster for powered-off VMs and templates, check content libraries, and confirm nothing is registered against the datastores you are about to destroy. If workloads still need to move, that is a migration task and not a teardown task. Use cross vCenter vMotion or HCX to relocate them first, which is exactly the work covered in How to Migrate Workloads into VCF 9 with HCX and vMotion. Decommission is the last step, never the way you discover stragglers.
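
One way to make "guilty until proven empty" concrete is to sweep an exported inventory for anything that still lives on a datastore you are about to destroy. The sketch below assumes you have already exported the inventory into simple records; the `InventoryItem` shape and the `kind` values are hypothetical, not a vCenter schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InventoryItem:
    """One exported inventory record (hypothetical shape for this sketch)."""
    name: str
    kind: str          # e.g. "vm", "template", "content_library_item", "vmdk", "iso"
    datastore: str
    powered_on: bool = False


def find_stragglers(items, doomed_datastores):
    """Return every object that still sits on a datastore about to be destroyed.

    Power state is deliberately ignored: powered-off VMs, templates and ISOs
    are exactly what a running-VM count misses.
    """
    return [item for item in items if item.datastore in doomed_datastores]


def is_empty(items, doomed_datastores) -> bool:
    """True only when nothing at all remains on the doomed datastores."""
    return not find_stragglers(items, doomed_datastores)
```

The design choice worth noting is that the check has no exceptions for "unimportant" object types. Anything left on the datastore fails the sweep, and a human decides whether it can go.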

2. The NSX Edge cluster that refuses to shrink

Symptom: You try to remove edge nodes or delete the edge cluster and the task fails. Removing a node would drop the cluster below two members, and SDDC Manager will not allow it.

Likely cause: The edge cluster is still hosting Tier-0 and Tier-1 gateways with active north-south routing, or it is below the minimum node count for a valid cluster. VCF treats the edge cluster as a managed object with guardrails, so it blocks any change that would leave an invalid topology or strand a gateway.

Fix: Work top down. Confirm no production traffic still rides the T0, detach or delete the Tier-1 gateways, then the Tier-0, and only then delete the edge cluster as a whole. During a teardown, deleting the entire edge cluster in one operation once routing is gone is cleaner than peeling off nodes one at a time, which runs straight into the two-node minimum. If the cluster is genuinely broken and the normal delete will not complete, Broadcom provides an edge-cluster removal tool, referenced in its knowledge base, for exactly this situation.

Tear down NSX top-down
Clear routing first, then delete the edge cluster in one operation.

  1. Confirm no prod traffic on T0
  2. Delete Tier-1 gateways
  3. Delete Tier-0 gateway
  4. Delete the whole edge cluster

Delete the whole edge cluster at once; do not shrink it node by node (SDDC Manager blocks below 2 nodes).

Work from the gateways down to the cluster, not the other way around.
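
The top-down order and the two-node guardrail can be written down as a tiny planning helper. This is a sketch under assumptions: the action labels are invented for illustration and nothing here calls NSX or SDDC Manager. It only produces the order you would then execute by hand or through your own tooling.

```python
MIN_EDGE_NODES = 2  # from the post: an edge cluster cannot shrink below two nodes


def can_remove_edge_node(current_nodes: int) -> bool:
    """True if removing one node still leaves a valid edge cluster."""
    return current_nodes - 1 >= MIN_EDGE_NODES


def edge_teardown_plan(tier1_gateways, tier0_gateways, edge_cluster):
    """Return (action, target) pairs in top-down order.

    Tier-1 gateways go first, then Tier-0, then the edge cluster as a whole.
    The action labels are hypothetical names for this sketch.
    """
    plan = [("delete_tier1", name) for name in tier1_gateways]
    plan += [("delete_tier0", name) for name in tier0_gateways]
    plan.append(("delete_edge_cluster", edge_cluster))
    return plan
```

For a two-node cluster, `can_remove_edge_node(2)` is `False`, which is why the plan ends with a single whole-cluster delete instead of per-node removals.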

3. The shared NSX Manager that blocks the domain delete

Symptom: The workload domain delete fails partway, with an error that points at NSX objects still being referenced. Nothing on the domain itself looks like it is in use.

Likely cause: Two domains share one NSX Manager cluster, which is a common and supported design. When VCF Operations for Networks latency collection is enabled against that NSX Manager, it holds references to NSX objects from the domain you are deleting, and that reference is enough to stop the teardown. This is the single most surprising blocker I see, because the dependency lives outside the domain you are looking at.

Fix: Map the NSX topology before you start. If the domain has its own dedicated NSX Manager, this never bites you. If it shares one, disable the VCF Operations for Networks latency collection for that NSX Manager cluster, confirm no other domain depends on the objects you are removing, then retry the delete. The broader lesson: a shared NSX Manager couples the lifecycle of every domain hanging off it, which is worth knowing at design time and not at teardown time.
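
Mapping the topology can be as simple as a domain-to-NSX-Manager table that you check before the delete. The sketch below assumes you maintain that mapping yourself (from design documents or an inventory export); the function names and the input shape are hypothetical.

```python
from collections import defaultdict


def shared_nsx_managers(domain_to_nsx: dict) -> dict:
    """Return {nsx_manager: [domains]} for every manager serving 2+ domains."""
    by_manager = defaultdict(list)
    for domain, manager in domain_to_nsx.items():
        by_manager[manager].append(domain)
    return {m: sorted(d) for m, d in by_manager.items() if len(d) > 1}


def coupled_domains(target: str, domain_to_nsx: dict) -> list:
    """Return the other domains that share an NSX Manager with the target."""
    manager = domain_to_nsx[target]
    return sorted(d for d, m in domain_to_nsx.items() if m == manager and d != target)
```

An empty result from `coupled_domains` means the domain has a dedicated NSX Manager. A non-empty result is the cue to check latency collection and cross-domain references before you retry the delete.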

4. The dead host that will not decommission

Symptom: A host is disconnected, in an error state, or has already failed, and the standard decommission refuses to remove it. The inventory keeps showing a host you physically pulled weeks ago.

Likely cause: The normal decommission flow expects a healthy host it can cleanly remove from a cluster and clean up. A stale or dead host fails the remove-from-cluster step, so the workflow stalls. A host also cannot be decommissioned while it is still assigned to a domain.

Fix: Since VCF 5.0, the decommission API carries a force flag, disabled by default, for exactly this case. You pass the FQDNs of the dead or stale hosts and force the decommission when the domain workflow cannot remove them cleanly.

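# Force-decommission a stale/dead host (SDDC Manager API)
# Only for hosts not part of any active workload domain.
# Decommission is a DELETE on /v1/hosts; a POST to the same path commissions hosts.
# Confirm the exact name of the force field in the API reference for your release.
DELETE https://sddc-manager.example.local/v1/hosts
[
  {
    "fqdn": "esxi-dead-07.example.local",
    "forceDecommission": true
  }
]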

My take

The force flag is a scalpel, not a hammer. Use it only on hosts you have confirmed are genuinely dead and unassigned. Forcing a host that is still doing real work is how a teardown turns into an outage.
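
A guard in front of the request keeps the scalpel from slipping. The sketch below only builds the JSON body and refuses to do so if any host is still assigned to a domain. It makes no API call, the set of assigned hosts is something you would supply from your own inventory check, and the `forceDecommission` field name is carried over from the snippet above rather than verified against the API reference.

```python
import json


def build_force_decommission_body(fqdns, assigned_fqdns) -> str:
    """Build the JSON request body for a forced decommission.

    Raises ValueError if any requested host is still assigned to a domain,
    so a forced removal can never be prepared for a host doing real work.
    """
    still_assigned = sorted(set(fqdns) & set(assigned_fqdns))
    if still_assigned:
        raise ValueError("still assigned to a domain: " + ", ".join(still_assigned))
    # "forceDecommission" mirrors the field name in the post's snippet;
    # treat it as an assumption and check it for your release.
    body = [{"fqdn": fqdn, "forceDecommission": True} for fqdn in fqdns]
    return json.dumps(body, indent=2)
```

Keeping the guard separate from the HTTP call means the refusal happens before any credentials or network access are involved.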

5. Imported domains do not tear down the normal way

Symptom: A brownfield domain you imported into VCF will not decommission through the UI. The options are greyed out, or add and decommission host actions fail on imported clusters.

Likely cause: Imported (brownfield) domains carry a different lifecycle state than domains VCF deployed itself. The standard UI workflow assumes VCF owns the full stack. For imported clusters, some operations are deliberately constrained, which is documented behaviour rather than a bug.

Fix: Follow the documented removal path for imported domains rather than fighting the UI. The Broadcom KB sequence is to remove the NSX cluster for the domain from the SDDC inventory with a DELETE API call, decommission the hosts (which can require SSH to the SDDC Manager appliance as the vcf user and elevating to root), and then remove the domain. If you imported the environment in the first place, the mechanics will be familiar from Bringing Existing vSAN and NSX Under VCF 9, just run in reverse.

6. vSAN evacuation: the silent multi-hour stall

Symptom: The cluster or domain delete sits at the same percentage for what feels like forever. Nothing has failed. Nothing is obviously moving either.

Likely cause: vSAN is doing exactly what it should. Tearing down a cluster triggers a full data evacuation and resync, and on a large or busy datastore that takes real time. If the domain has a remote vSAN datastore mounted, the delete will not proceed until you migrate any VMs off it and unmount it from vCenter first.

Fix: Plan for the evacuation window instead of assuming the job hung. Confirm the data evacuation mode is what you expect, watch the resync queue drain to zero, and for remote vSAN datastores do the unmount before you start the delete. Budget time. A workload domain delete itself can run up to roughly 20 minutes once it actually starts, and that clock does not include the evacuation that has to finish first.
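
Budgeting the window is simple arithmetic once you have two or more readings of the remaining resync data. The sketch below assumes you note the readings yourself as (minute, bytes left) pairs; it uses the average drain rate between the first and last reading and adds the roughly 20 minute figure quoted above for the domain delete.

```python
from typing import Optional

DOMAIN_DELETE_MINUTES = 20  # rough upper figure quoted in the post


def resync_eta_minutes(samples) -> Optional[float]:
    """Estimate minutes until resync hits zero from (minute, bytes_left) samples.

    Returns None when there are too few samples or the queue is not
    shrinking, which is the case that deserves investigation.
    """
    if len(samples) < 2:
        return None
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    if t1 <= t0 or b1 >= b0:
        return None
    rate = (b0 - b1) / (t1 - t0)  # bytes drained per minute
    return b1 / rate


def total_window_minutes(samples) -> Optional[float]:
    """Evacuation estimate plus the domain delete itself."""
    eta = resync_eta_minutes(samples)
    return None if eta is None else eta + DOMAIN_DELETE_MINUTES
```

A `None` result is the useful signal here: a job that reports no drain between readings is the one to look at, while a slow but steady drain is vSAN doing its work.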

7. The cleanup decommission does not do for you

Symptom: Weeks after the teardown, monitoring still alarms on dead FQDNs, backup jobs fail against hosts that no longer exist, the license count looks wrong, and a host you tried to reuse will not re-commission.

Likely cause: Decommission removes infrastructure from VCF. It does not clean up everything that pointed at that infrastructure. After you decommission a host, it must be re-imaged and re-commissioned before it can be used again, and the surrounding systems (DNS, certificates, backup catalogues and monitoring) keep their stale entries until someone removes them.

Fix: Treat cleanup as a named task with an owner, not an afterthought. Re-image each freed host before you try to reuse it. Verify license consumption in the License Management page so you actually reclaim the core and vSAN capacity entitlement you freed, and for subscription or VCF+ instances follow the documented disconnect steps. Then retire the DNS records, revoke or remove the old certificates, and delete the backup and monitoring jobs that target the dead environment. None of this is hard. All of it is forgotten.


What decommission does not clean up
It removes infrastructure from VCF, not the systems that pointed at it.

  • Freed hosts: re-image before any reuse
  • DNS records: retire the stale entries
  • Certificates: revoke or remove old ones
  • Backup jobs: delete jobs for dead hosts
  • Monitoring: remove alarms / dead FQDNs
  • Licensing: reclaim cores + vSAN TiB

Make cleanup a named task with an owner; none of this happens automatically.
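
Turning the checklist into named tasks can be sketched in a few lines. This assumes a list of decommissioned FQDNs and a mapping from cleanup area to owner; the area names, the `UNASSIGNED` marker and the task shape are all invented for this illustration.

```python
# Per-host cleanup areas and actions, taken from the checklist above.
PER_HOST_ACTIONS = {
    "host": "re-image before any reuse",
    "dns": "retire stale records",
    "certificates": "revoke or remove old certificates",
    "backup": "delete jobs that target this host",
    "monitoring": "remove alarms and dead FQDN targets",
}


def cleanup_tasks(fqdns, owners):
    """Expand the checklist into one task per host and area, plus licensing."""
    tasks = [
        {"target": fqdn, "area": area, "action": action,
         "owner": owners.get(area, "UNASSIGNED")}
        for fqdn in fqdns
        for area, action in PER_HOST_ACTIONS.items()
    ]
    # Licensing is checked once for the environment, not once per host.
    tasks.append({"target": "environment", "area": "licensing",
                  "action": "verify reclaimed core and vSAN capacity",
                  "owner": owners.get("licensing", "UNASSIGNED")})
    return tasks


def unowned(tasks):
    """Return the tasks nobody has been named for yet."""
    return [task for task in tasks if task["owner"] == "UNASSIGNED"]
```

If `unowned(...)` returns anything, the teardown is not ready to close, whatever the SDDC Manager task list says.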

What I’d Do

Run the teardown as a rehearsed, ordered change, not a victory lap. Prove the source domain is empty before you delete anything, map the NSX topology so a shared Manager does not ambush you, and keep the source instance powered and reachable until the new VCF 9 instance has carried production through at least one full business cycle. The hardware is not worth reclaiming a day early if rolling back becomes impossible. If your cutover follows a parallel-instance migration, the sequencing and rollback thinking in the VCF-to-VCF parallel-instance reference architecture pairs directly with this teardown. What is the one cleanup item your last decommission forgot? That is usually the first thing to put on the runbook.


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
