Dr. Pranay Jha

NSX 9 Manager Deployment and Cluster Bring-Up in VCF 9 (NSX Series, Part 5)

In VCF 9 you do not install NSX Manager; VCF does. Here is what the workflow deploys, the prerequisites that decide success, the shared versus dedicated choice, and how to verify the result.

NSX Series · Part 5 of 30

TL;DR · Key Takeaways

  • In VCF 9 you do not run the NSX Manager OVA install. VCF deploys the three-node cluster for you when you create the first VI workload domain.
  • The workflow stands up 3 NSX Managers in the management domain, configures a cluster VIP, and adds an anti-affinity rule so no two Managers share a host.
  • Later workload domains can share the existing NSX cluster or get a new one. That choice is an architecture decision, not a checkbox.
  • Deployment success is decided by prerequisites: forward and reverse DNS for every Manager and the VIP, NTP in sync, reserved IPs, and at least three commissioned hosts. Most failed deployments trace back to a missing PTR record.
  • After it builds, verify with get cluster status: every group type STABLE across all three nodes.

Who this is for: admins and architects deploying NSX 9 inside VCF 9.
Prerequisites: you have the design from Part 4 settled (transport zones, MTU, TEP plan) and a management domain already running.

The first time an NSX-T veteran deploys NSX in VCF 9, they go looking for the OVA and the install wizard and cannot find them. That is the point. In VCF 9 you do not install NSX Manager by hand. VCF builds the cluster for you as part of creating the first VI workload domain, which means the skill that used to matter, clicking through the appliance deployment, is gone, and the skill that always actually mattered, getting the prerequisites exactly right, is now the whole game. This part walks the bring-up the way it really happens: what VCF does, what you owe it first, and how to confirm the cluster is healthy before you build anything on top.

What VCF deploys for you

When you create your first VI workload domain, the VCF workflow does the NSX deployment as a series of automated steps. It places three NSX Manager appliances in the management domain, wires them into a cluster, assigns the cluster VIP you specified, and creates a vSphere anti-affinity rule so the three Managers never land on the same host. You provide inputs; VCF does the build and the post-config. Your job shifts from operator to designer: get the names, addresses, and sizing right, and let the automation run.

Figure: How the NSX cluster gets built in VCF 9
  1. Prerequisites: DNS, NTP, IPs, VIP FQDN
  2. Commission hosts: 3+ in SDDC inventory
  3. Create VI WLD: SDDC Manager workflow
  4. VCF deploys the cluster: 3 NSX Managers + cluster VIP + anti-affinity rule
  5. Verify STABLE: get cluster status, then build on top
Steps 1 and 2 are yours. Step 4 is automation. Most failures are a step-1 miss that only surfaces at step 4.
VCF owns the build. You own the prerequisites and the verification on either side of it.

Prerequisites: the boring list that decides everything

I have watched more NSX deployments fail on DNS than on anything technical about NSX. The automation is unforgiving about names and time, because the cluster nodes have to find and trust each other. Walk this table before you start the workflow, not after it fails halfway through.

Prerequisite | Requirement | How to verify
Forward DNS (A) | A record for each NSX Manager and the cluster VIP. | nslookup the FQDN returns the right IP.
Reverse DNS (PTR) | PTR record for every Manager IP and the VIP. | nslookup the IP returns the right FQDN.
NTP | All components synced to a reliable source. | Same time on SDDC Manager, vCenter, hosts.
IP addresses | Three Manager IPs plus the VIP, on the mgmt network. | Free, reserved, and not in any DHCP scope.
Hosts | At least three commissioned hosts with storage. | Visible and ready in SDDC Manager inventory.
VIP FQDN | Resolvable before you start; entered as the Appliance Cluster FQDN. | Forward and reverse both resolve.
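
The IP addresses row of the table can be checked mechanically before the change window. What follows is a minimal sketch, not a VCF tool: the function name check_ip_plan, the Manager addresses, the /24 network, and the DHCP range are all hypothetical illustration values (only the VIP 10.0.0.30 appears in this post). It confirms the four addresses are distinct, sit on the management network, and fall outside any DHCP range you pass in. It cannot tell you whether an address is already in use.

```python
import ipaddress


def check_ip_plan(manager_ips, vip, mgmt_cidr, dhcp_ranges=()):
    """Return a list of problems with the address plan (empty list = plan looks sane)."""
    problems = []
    network = ipaddress.ip_network(mgmt_cidr)
    addresses = [ipaddress.ip_address(ip) for ip in list(manager_ips) + [vip]]

    if len(manager_ips) != 3:
        problems.append(f"expected 3 Manager IPs, got {len(manager_ips)}")
    if len(set(addresses)) != len(addresses):
        problems.append("duplicate address in plan")

    for addr in addresses:
        # Every Manager and the VIP must live on the management network.
        if addr not in network:
            problems.append(f"{addr} is outside {network}")
        # dhcp_ranges is a sequence of (first_ip, last_ip) string pairs.
        for start, end in dhcp_ranges:
            if ipaddress.ip_address(start) <= addr <= ipaddress.ip_address(end):
                problems.append(f"{addr} falls inside DHCP range {start}-{end}")
    return problems


if __name__ == "__main__":
    # Hypothetical plan: three Managers plus the VIP used in this post's examples.
    for line in check_ip_plan(["10.0.0.31", "10.0.0.32", "10.0.0.33"], "10.0.0.30",
                              "10.0.0.0/24", [("10.0.0.100", "10.0.0.200")]) or ["plan OK"]:
        print(line)
```

An empty result only means the plan is internally consistent; reservation in your IPAM still has to be confirmed by hand.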

DNS is worth a hard check because the failure mode is ugly: the workflow gets a long way in, then stalls or rolls back, and the error rarely says “PTR record missing” in plain language. Two commands settle it.

# Forward: name should resolve to the VIP address
nslookup nsx-vip.mgmt.lab.local

# Reverse: the VIP address should resolve back to the same name
nslookup 10.0.0.30

# Do the same for each of the three Manager FQDNs and IPs.
# Both directions must match. A missing PTR is the #1 cause of a stalled build.

In practice: I run the forward and reverse lookups for all four names (three Managers plus the VIP) from SDDC Manager itself, not from my laptop. The resolver that matters is the one the workflow uses, and they are not always the same.
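
Running eight lookups by hand invites a skipped one, so the same check can be scripted. This is a sketch under assumptions: the function name check_dns_pair and the Manager names and addresses in the usage block are hypothetical, and Python's socket module asks the local system resolver, which may also consult a hosts file. Like the manual lookups, it is only meaningful when run from a machine that uses the same resolver as the workflow.

```python
import socket


def check_dns_pair(fqdn, expected_ip,
                   forward=socket.gethostbyname,
                   reverse=lambda ip: socket.gethostbyaddr(ip)[0]):
    """Return a list of problems for one name/IP pair (empty list = both directions match)."""
    problems = []
    try:
        resolved_ip = forward(fqdn)
        if resolved_ip != expected_ip:
            problems.append(f"A record: {fqdn} -> {resolved_ip}, expected {expected_ip}")
    except OSError as exc:  # socket.gaierror and socket.herror both derive from OSError
        problems.append(f"A record lookup failed for {fqdn}: {exc}")
    try:
        resolved_name = reverse(expected_ip)
        # Compare case-insensitively and ignore a trailing dot.
        if resolved_name.rstrip(".").lower() != fqdn.rstrip(".").lower():
            problems.append(f"PTR record: {expected_ip} -> {resolved_name}, expected {fqdn}")
    except OSError as exc:
        problems.append(f"PTR lookup failed for {expected_ip}: {exc}")
    return problems


if __name__ == "__main__":
    # Hypothetical names and addresses; replace with your own four pairs.
    plan = {
        "nsx-vip.mgmt.lab.local": "10.0.0.30",
        "nsx-mgr-1.mgmt.lab.local": "10.0.0.31",
        "nsx-mgr-2.mgmt.lab.local": "10.0.0.32",
        "nsx-mgr-3.mgmt.lab.local": "10.0.0.33",
    }
    for name, ip in plan.items():
        for problem in check_dns_pair(name, ip) or [f"{name} <-> {ip} OK"]:
            print(problem)
```

The forward and reverse arguments exist so the logic can be exercised with stub resolvers; in normal use you leave them at their defaults.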

Shared or dedicated NSX per workload domain

The first VI workload domain always deploys a fresh NSX Manager cluster. The real decision comes with the second and every domain after: point it at the existing NSX cluster, or stand up a new one. This is a genuine architecture fork, not a default to click past.

Figure: Shared vs dedicated NSX across domains. Shared: one NSX cluster serving WLD-1, WLD-2, and WLD-3. Dedicated: NSX clusters A, B, and C, one per workload domain (WLD-1, WLD-2, WLD-3).
One cluster serving many domains, or one cluster each. The trade-off is footprint versus blast radius.

Dimension | Shared NSX cluster | Dedicated per domain
Footprint | Lower: one 3-node cluster total. | Higher: three Managers per domain.
Blast radius | Wider: one cluster issue touches all domains. | Contained: a domain’s NSX problem stays local.
Isolation | Soft: use Projects/VPCs for tenancy. | Hard: separate control planes entirely.
Lifecycle | Simpler: one cluster to upgrade. | More work: each cluster on its own.
Use it when | Most estates; efficiency matters. | Strong separation, compliance, or different SLAs.

My default is shared. The footprint saving is real, the lifecycle is simpler, and NSX 9 gives you Projects and VPCs (Part 22) to handle tenant isolation inside one cluster, which is exactly what they are for. I go dedicated only when there is a hard reason: a regulated domain that must not share a control plane, wildly different upgrade cadences, or a separation requirement written into a contract. Do not split control planes out of vague unease; split them for a named requirement.
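
To put numbers on the footprint trade-off and make the "named requirement" rule explicit, here is a small sketch. Both function names are hypothetical, and the arithmetic simply restates the footprint comparison in this section: one three-node cluster in total when shared, three Managers per domain when dedicated.

```python
def manager_appliance_count(workload_domains, model):
    """NSX Manager appliances needed for the given number of VI workload domains."""
    if model == "shared":
        return 3 if workload_domains > 0 else 0
    if model == "dedicated":
        return 3 * workload_domains
    raise ValueError(f"unknown model: {model!r}")


def nsx_cluster_choice(named_requirements):
    """Return ('dedicated', reasons) only when a concrete requirement is written down."""
    reasons = [r.strip() for r in named_requirements if r and r.strip()]
    return ("dedicated", reasons) if reasons else ("shared", [])


if __name__ == "__main__":
    print(manager_appliance_count(3, "shared"), manager_appliance_count(3, "dedicated"))
    print(nsx_cluster_choice(["regulated domain must not share a control plane"]))
```

For three workload domains that is 3 appliances against 9, which is the gap a dedicated design has to justify with a requirement someone can name.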

The VIP, the three nodes, and anti-affinity

The cluster VIP is the single name you and your automation talk to. One of the three Managers owns the VIP at a time; if it fails, the VIP moves to a surviving node. That is why the VIP FQDN has to resolve before deployment and why it is the address you put in monitoring and runbooks, never an individual node IP. The anti-affinity rule VCF creates keeps the three appliances on separate hosts, so a single host failure can never take more than one Manager with it. Check that the rule survived; I have seen DRS settings and host maintenance flatten three Managers onto two hosts and quietly erode the HA you designed.

Figure: Cluster VIP and anti-affinity. Admins, API clients, and automation talk to the cluster VIP only. Behind it sit Manager 1 on ESXi host A, Manager 2 on ESXi host B, and Manager 3 on ESXi host C. Anti-affinity: never two on one host.
Talk to the VIP, never a node. Anti-affinity keeps the three Managers on three hosts.
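
The "three Managers on three hosts" condition is easy to assert once you have the current placement. The sketch below assumes you have already collected a Manager-to-host mapping from vCenter by whatever means you prefer; the function name shared_hosts and all VM and host names are hypothetical.

```python
from collections import defaultdict


def shared_hosts(placement):
    """Map each host running more than one Manager to the Managers on it."""
    by_host = defaultdict(list)
    for manager, host in placement.items():
        by_host[host].append(manager)
    # Only hosts that violate anti-affinity are returned.
    return {host: sorted(vms) for host, vms in by_host.items() if len(vms) > 1}


if __name__ == "__main__":
    placement = {"nsx-mgr-1": "esx-a", "nsx-mgr-2": "esx-a", "nsx-mgr-3": "esx-c"}
    print(shared_hosts(placement) or "placement OK")
```

An empty result means every Manager has a host to itself. It says nothing about whether the rule still exists and is enabled, so check that separately.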

Verify the bring-up before you build on it

VCF will report the domain as created, but I never hand a cluster to the next team on the strength of a green workflow alone. Two minutes of CLI confirms the control plane is genuinely healthy. SSH to any Manager node as admin and check it the same way you would during an incident (see Part 2).

# On any NSX Manager node
get cluster status      # every group type should read STABLE
get cluster config      # confirm 3 nodes, correct IPs and roles

# Group types to see STABLE across all three nodes:
#   MANAGER  POLICY  CONTROLLER  DATASTORE  HTTPS  CLUSTER_BOOT_MANAGER

# Then confirm the VIP answers and resolves:
ping nsx-vip.mgmt.lab.local
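
If you would rather script the STABLE check than read CLI output, the same information is exposed over the REST API. The sketch below is an assumption-heavy illustration: it expects a dictionary shaped like the response of GET /api/v1/cluster/status as documented for earlier NSX-T releases (detailed_cluster_status.groups, each with group_type, group_status, and members carrying member_status). Confirm those field names against the NSX 9 API reference before relying on it. The function name unstable_groups is hypothetical, and fetching the JSON is deliberately left out.

```python
def unstable_groups(cluster_status, expected_members=3):
    """Return a list of problems found in a cluster-status document (empty list = healthy)."""
    problems = []
    groups = cluster_status.get("detailed_cluster_status", {}).get("groups", [])
    if not groups:
        problems.append("no group status found in response")
    for group in groups:
        name = group.get("group_type", "UNKNOWN")
        # Every group type should report STABLE.
        if group.get("group_status") != "STABLE":
            problems.append(f"{name}: status is {group.get('group_status')}")
        # Every group should have all expected members UP.
        up = [m for m in group.get("members", []) if m.get("member_status") == "UP"]
        if len(up) != expected_members:
            problems.append(f"{name}: {len(up)} of {expected_members} members UP")
    return problems


if __name__ == "__main__":
    # Hand-built sample in the assumed shape, not captured from a real cluster.
    sample = {"detailed_cluster_status": {"groups": [
        {"group_type": "MANAGER", "group_status": "STABLE",
         "members": [{"member_status": "UP"}] * 3},
        {"group_type": "DATASTORE", "group_status": "DEGRADED",
         "members": [{"member_status": "UP"}] * 2 + [{"member_status": "DOWN"}]},
    ]}}
    print(unstable_groups(sample))
```

Treat this as a complement to the CLI check rather than a replacement; the CLI is still what you will have in front of you during an incident.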

Then close out the operational basics: confirm the anti-affinity rule is present and enabled, confirm the cluster certificate and the VIP certificate are what you expect, and configure NSX backup to your SFTP target straight away. A control plane with no backup is a single bad change away from a very long day, and backup is covered properly in Part 19.

Disclaimer: this is a production-change procedure. Validate against the current VCF 9 BOM, confirm host compatibility and free capacity, take a management-domain backup first, and run the workflow in a maintenance window. Re-verify the exact NSX 9.x version and prerequisites in the release notes before you start.

Why deployments stall, and the fix

When a VCF NSX deployment fails, it rarely fails for an interesting reason. The same handful of misses account for almost every stalled or rolled-back workflow I get called into, and all of them trace back to something that was true before the build ever started. The fix is almost never inside NSX; it is in DNS, time, addressing, or capacity. This is the short list I run down the moment a deployment task goes red.

Symptom | Likely cause | Fix
Workflow stalls partway, then rolls back | Missing reverse (PTR) record for a Manager or the VIP | Add the PTR, confirm forward and reverse match, retry.
Cluster forms but nodes will not trust each other | NTP skew between appliances | Point all components at the same NTP source, resync.
Deployment cannot place an appliance | Fewer than three healthy hosts, or no free capacity | Commission a third host or free resources, then retry.
VIP unreachable after a green workflow | VIP FQDN or IP wrong, or not on the mgmt subnet | Correct the record, confirm the VIP is on the right network.

Notice the pattern: not one of these is an NSX bug. They are environment facts the automation depends on, which is exactly why the prerequisites table earlier is the most important part of this whole post. If a deployment does fail, resist the urge to start over from scratch. Read the task error in SDDC Manager, fix the one underlying fact, and retry the failed task. The workflow is built to resume, and a from-scratch redo usually just wastes an hour reaching the same blocker.

My take: keep a one-page deployment runbook with the four FQDNs, four IPs, the NTP source, and the host list filled in before the change window opens. The teams that do this almost never have a failed NSX deployment; the teams that improvise the names on the day almost always do.
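
One way to keep that runbook honest is to hold it as structured data and refuse to open the change window while anything is blank. This is only a sketch of the idea: the class name NsxDeploymentRunbook and its field names are hypothetical and are not part of VCF or SDDC Manager.

```python
from dataclasses import dataclass, field


@dataclass
class NsxDeploymentRunbook:
    """Inputs to fill in before the change window."""
    manager_fqdns: list = field(default_factory=list)  # three Manager FQDNs
    manager_ips: list = field(default_factory=list)    # three Manager IPs
    vip_fqdn: str = ""
    vip_ip: str = ""
    ntp_source: str = ""
    hosts: list = field(default_factory=list)          # commissioned ESXi hosts

    def missing(self):
        """Return the names of items that are still blank or incomplete."""
        gaps = []
        if len(self.manager_fqdns) != 3:
            gaps.append("manager_fqdns (need 3)")
        if len(self.manager_ips) != 3:
            gaps.append("manager_ips (need 3)")
        for name in ("vip_fqdn", "vip_ip", "ntp_source"):
            if not getattr(self, name).strip():
                gaps.append(name)
        if len(self.hosts) < 3:
            gaps.append("hosts (need at least 3)")
        return gaps


if __name__ == "__main__":
    draft = NsxDeploymentRunbook(vip_fqdn="nsx-vip.mgmt.lab.local", vip_ip="10.0.0.30")
    print(draft.missing())
```

A runbook whose missing() list is empty covers the four FQDNs, four IPs, the NTP source, and the host list; it does not prove the values are correct, only that nobody will be inventing them on the day.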

What I’d Do

Spend your effort before the workflow, not during it. Build the DNS records forward and reverse, prove them from SDDC Manager, lock NTP, reserve the IPs and the VIP, and confirm three healthy hosts. Decide shared versus dedicated on a named requirement and default to shared. Then let VCF do the build, and verify with get cluster status before anyone celebrates. A clean NSX deployment is almost entirely a clean prerequisites list; the automation rarely fails for reasons of its own. Next up is Part 6: host transport node prep with VDS and EDP, where the design from Part 4 finally lands on real hosts. How solid is your reverse DNS right now, honestly?


The three-node cluster is a production requirement, not a recommendation

NSX Manager runs as a three-node cluster sitting behind a virtual IP, and that topology exists for management-plane resilience. A single node is perfectly fine in a lab and a genuine liability in production, because losing it takes your management and your Policy API with it. The three nodes share a distributed datastore, so the placement decision matters: spread them across hosts and failure domains with anti-affinity so that a single host or rack failure can never take two Managers at once. A three-node cluster crammed onto two hosts is a three-node cluster pretending to be resilient.

The bring-up sequence rewards patience. Deploy the nodes, form the cluster, and confirm that get cluster status reports STABLE on all three before you build a single segment or rule on top. Configuring networking against a cluster that is still forming is how you end up with half-realized objects that are maddening to clean up. In VCF this whole dance is orchestrated for you by SDDC Manager, which is one more reason to let the platform own the lifecycle rather than hand-driving it. Either way, the rule is the same: a stable, healthy cluster first, configuration second, never the two interleaved.

Watch the VIP and the certificates

The cluster presents a single virtual IP for management, and that VIP together with the Manager certificates are the two things teams forget about until they break. The VIP is how clients and automation reach the cluster regardless of which node is currently active, so it has to be reachable and correctly mapped, and a certificate that does not match the VIP name produces exactly the kind of intermittent trust error that swallows an afternoon of debugging. In VCF, SDDC Manager owns the certificate lifecycle for NSX, which means you rotate and renew through the platform rather than hand-editing certificates in NSX Manager. Reaching past the platform to swap a certificate directly is one of the most common ways to create drift that quietly disables the platform's ability to manage NSX later, and it always surfaces at the worst possible moment, usually mid-upgrade.
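
The name-mismatch case is simple to test for once you have the certificate's subject alternative names. The sketch below works on a dictionary in the shape returned by Python's ssl.SSLSocket.getpeercert(), which is only populated when the certificate was validated during the handshake. The function name cert_covers_name and the sample names are hypothetical, and the wildcard handling is a simplified left-most-label match rather than a full implementation of the hostname-matching rules.

```python
def cert_covers_name(cert, fqdn):
    """True if fqdn matches a DNS subjectAltName entry in a getpeercert()-style dict."""
    wanted = fqdn.rstrip(".").lower()
    for kind, value in cert.get("subjectAltName", ()):
        if kind != "DNS":
            continue
        pattern = value.rstrip(".").lower()
        if pattern == wanted:
            return True
        # Simplified wildcard: "*.example" covers exactly one extra left-most label.
        if pattern.startswith("*.") and "." in wanted:
            if wanted.split(".", 1)[1] == pattern[2:]:
                return True
    return False


if __name__ == "__main__":
    # Hand-built sample: a certificate issued for one node name but not the VIP.
    cert = {"subjectAltName": (("DNS", "nsx-mgr-1.mgmt.lab.local"),)}
    print(cert_covers_name(cert, "nsx-vip.mgmt.lab.local"))
```

A False result for the VIP FQDN is a finding to take to SDDC Manager's certificate workflow, in keeping with the point above about not replacing certificates directly in NSX Manager.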

The practical habit is to treat the management cluster as a managed appliance, not a server you log into and tinker with. Let the platform handle the cluster lifecycle, the VIP and the certificates, and reserve your direct NSX Manager access for the networking and security intent that genuinely belongs there. When the cluster does misbehave, start at get cluster status and at certificate and VIP health before you assume a deeper fault, because the boring causes account for the overwhelming majority of management-plane incidents. A calm, patient bring-up and a disciplined hands-off posture afterward are worth more than any clever recovery procedure you might need if you skip them.


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
