TL;DR · Key Takeaways
- Most failed VCF 9 bring-ups trace back to the physical network, not the software. Validate the fabric before you run the installer.
- Get MTU right end to end: at least 1600 on the host overlay and at least 1700 on the NSX Edge overlay. Jumbo frames (9000) everywhere is the safe target.
- Separate your traffic. Management, vMotion, vSAN, host overlay and Edge overlay each want their own VLAN, gateway and IP range.
- Host TEPs need addressing that actually works: provide DHCP for host Tunnel Endpoints, and keep Edge TEPs routable to host TEPs.
- VCF 9 changed the networking model. The single NSX Manager option, NSX Managers shared across domains, Transit Gateway and VPC change how you should design, not just how you deploy.
Ask anyone who has run a VMware Cloud Foundation bring-up where it went wrong, and the answer is almost never “the SDDC Manager workflow.” It is the network. A missed MTU value, a TEP pool that cannot route, a VLAN that was never trunked to the right ports: any one of these stalls the deployment at a step that gives you a frustratingly generic error. VCF 9 makes the software side smoother than ever, but it still assumes the underlay is correct on day one. This post walks through the seven network design mistakes that most reliably break a VCF 9 deployment, and the fix for each, so you can design the fabric right the first time.
If you have not already locked down your sizing and host counts, start with the VCF 9 planning and prerequisites checklist. And if the fleet, instance and domain model is still fuzzy, the VCF 9 architecture overview sets the context for everything below.
Mistake 1: Treating overlay MTU as an afterthought
This is the single most common cause of a stuck bring-up. NSX overlay traffic is encapsulated in Geneve, which wraps every frame in an outer IP header, a UDP header and the Geneve header, so the frame on the wire is larger than the payload it carries. If the path cannot carry the larger frame, host TEPs fail to form tunnels and segments never come up. The minimum is an MTU of 1600 on the NSX host overlay VLAN, enabled end to end through every switch and uplink in the path. The NSX Edge overlay needs at least 1700. In practice the cleanest design is jumbo frames (9000) on all infrastructure VLANs, which removes the guesswork entirely. The catch is that “end to end” means every device: leaf, spine, and any interconnect. One switch left at 1500 silently drops the encapsulated frames.
Prove it before you deploy. From an ESXi host, a large-packet ping with the do-not-fragment bit set tells you immediately whether the path honors your MTU:
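# Test the overlay path with an 8972-byte payload (9000 MTU minus 28 bytes of IP and ICMP headers), DF set
# NSX host TEP vmknics normally sit in the "vxlan" netstack, so name it explicitly
vmkping ++netstack=vxlan -I vmk10 -d -s 8972 <remote_tep_ip>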
# Confirm the configured MTU on each vmknic and the VDS uplinks
esxcli network ip interface list
esxcli network vswitch dvs vmware list | grep -i mtu
If the large ping fails but a default-size ping succeeds, you have an MTU mismatch somewhere in the physical path. Fix the fabric; do not lower the overlay MTU.
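The payload sizes for the other MTU targets follow from the same arithmetic: subtract the 20-byte IPv4 header and the 8-byte ICMP header from the MTU under test. A minimal Python sketch, assuming IPv4 without header options (the function name is illustrative, not part of any VMware tooling):

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def vmkping_payload(mtu: int) -> int:
    """Return the largest ICMP payload that fits in one unfragmented IPv4 packet."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError(f"MTU {mtu} leaves no room for a payload")
    return mtu - overhead


if __name__ == "__main__":
    # The three MTU values discussed in this post.
    for label, mtu in (("host overlay minimum", 1600),
                       ("Edge overlay minimum", 1700),
                       ("jumbo frames", 9000)):
        print(f"{label}: MTU {mtu} -> vmkping -d -s {vmkping_payload(mtu)}")
```

Test with the payload that matches the MTU you actually intend to run, not just the minimum, so that a device capped below your target shows up now rather than during bring-up.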
Mistake 2: Collapsing every traffic type onto one VLAN
VCF expects distinct, 802.1Q-tagged VLANs for management, vMotion, vSAN or NFS, the NSX host overlay, and the NSX Edge overlay, plus uplink VLANs for north-south peering. Trying to save VLANs by merging vMotion and vSAN, or by running the overlay on the management VLAN, creates contention and breaks the assumptions the VCF network pool makes about IP allocation. Each of these networks needs its own VLAN ID, its own gateway, and a CIDR range with at least one IP per host plus headroom for growth.
Design the VLAN plan as a table before you touch the installer: VLAN ID, subnet, gateway, MTU, and purpose for each. vMotion and vSAN are defined together in a VCF network pool, so make sure those ranges are sized and routable. Getting this right up front avoids re-IP work later, which is painful once domains are deployed.
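Once the table exists, it is cheap to lint it before anyone types a value into the installer. The sketch below is one way to do that in Python, under stated assumptions: the `VlanRow` fields mirror the table columns above, the VLAN IDs and subnets are made-up examples, and the only MTU rules enforced are the 1600 and 1700 overlay minimums from this post. It is not a VCF validation tool.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from itertools import combinations

# Overlay minimums from this post; other networks are not MTU-checked here.
MIN_MTU = {"host-overlay": 1600, "edge-overlay": 1700}


@dataclass(frozen=True)
class VlanRow:
    purpose: str
    vlan_id: int
    subnet: str
    gateway: str
    mtu: int


def check_vlan_plan(rows, host_count):
    """Return a list of problems; an empty list means the plan passed these checks."""
    problems = []
    owner_by_vlan = {}
    for row in rows:
        net = ip_network(row.subnet)
        if not 1 <= row.vlan_id <= 4094:
            problems.append(f"{row.purpose}: VLAN {row.vlan_id} is outside 1-4094")
        if row.vlan_id in owner_by_vlan:
            problems.append(
                f"{row.purpose}: VLAN {row.vlan_id} is already used by {owner_by_vlan[row.vlan_id]}")
        owner_by_vlan.setdefault(row.vlan_id, row.purpose)
        if ip_address(row.gateway) not in net:
            problems.append(f"{row.purpose}: gateway {row.gateway} is not inside {row.subnet}")
        usable = net.num_addresses - 3  # network, broadcast and gateway addresses
        if usable < host_count:
            problems.append(f"{row.purpose}: {usable} usable IPs for {host_count} hosts")
        minimum = MIN_MTU.get(row.purpose)
        if minimum is not None and row.mtu < minimum:
            problems.append(f"{row.purpose}: MTU {row.mtu} is below the {minimum} minimum")
    for a, b in combinations(rows, 2):
        if ip_network(a.subnet).overlaps(ip_network(b.subnet)):
            problems.append(f"{a.purpose} and {b.purpose}: {a.subnet} overlaps {b.subnet}")
    return problems


# Hypothetical plan: every ID and range below is an illustration only.
EXAMPLE_PLAN = [
    VlanRow("management", 1611, "10.11.11.0/24", "10.11.11.1", 9000),
    VlanRow("vmotion", 1612, "10.11.12.0/24", "10.11.12.1", 9000),
    VlanRow("vsan", 1613, "10.11.13.0/24", "10.11.13.1", 9000),
    VlanRow("host-overlay", 1614, "10.11.14.0/24", "10.11.14.1", 9000),
    VlanRow("edge-overlay", 1615, "10.11.15.0/24", "10.11.15.1", 9000),
]

if __name__ == "__main__":
    for problem in check_vlan_plan(EXAMPLE_PLAN, host_count=4) or ["plan passed"]:
        print(problem)
```

A check like this only catches internal inconsistencies in the table. Whether the VLANs are actually trunked and routed is still something you prove on the fabric.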
Mistake 3: No DHCP for host Tunnel Endpoints
The NSX host overlay relies on each ESXi host obtaining a TEP address. The standard design serves host TEP addresses from DHCP on the host overlay VLAN, and a missing or misconfigured DHCP scope is a classic reason hosts fail to join the overlay transport zone. Make sure a DHCP server is reachable on that VLAN, that the scope is large enough for every host (including future expansion), and that lease times are sane. Edge TEPs, by contrast, are typically assigned from a static IP pool you define in NSX, so do not assume the same mechanism covers both.
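Scope sizing is simple arithmetic, but it is easy to forget that a host may need more than one TEP address. The Python sketch below assumes two TEPs per host, which is what you typically get with two active uplinks; treat `teps_per_host` as an assumption to confirm against your own uplink profile, and the IP ranges as placeholders.

```python
from ipaddress import ip_address


def required_tep_addresses(hosts: int, teps_per_host: int = 2, growth_hosts: int = 0) -> int:
    """Addresses the DHCP scope must be able to lease at the same time."""
    return (hosts + growth_hosts) * teps_per_host


def scope_size(first_ip: str, last_ip: str) -> int:
    """Number of addresses in an inclusive DHCP range."""
    first, last = int(ip_address(first_ip)), int(ip_address(last_ip))
    if last < first:
        raise ValueError("scope end is below scope start")
    return last - first + 1


def scope_covers(first_ip: str, last_ip: str, hosts: int,
                 teps_per_host: int = 2, growth_hosts: int = 0) -> bool:
    """True when the range can serve every current and planned host TEP."""
    needed = required_tep_addresses(hosts, teps_per_host, growth_hosts)
    return scope_size(first_ip, last_ip) >= needed


if __name__ == "__main__":
    # Hypothetical scope on the host overlay VLAN: 4 hosts today, 4 more planned.
    print(required_tep_addresses(4, growth_hosts=4))              # addresses needed
    print(scope_size("10.11.14.10", "10.11.14.25"))               # addresses available
    print(scope_covers("10.11.14.10", "10.11.14.25", 4, 2, 4))    # large enough?
```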
Mistake 4: Edge TEP and host TEP in the same VLAN on 2-pNIC hosts
On hosts with only two physical NICs, the NSX Edge overlay and the host overlay cannot share the same TEP VLAN. The Edge TEP and the ESXi host TEP must sit in different VLANs that are routable to each other through an external router. Skip this and the Edge transport nodes never establish tunnels to the hosts, which surfaces late in the workflow as a tunnel-down state that is hard to diagnose. If you have four or more pNICs you have more flexibility, but the two-NIC design is common and this rule trips up a lot of first deployments. Plan two TEP VLANs and confirm they can route to one another.
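The two-VLAN rule can be checked on paper before any Edge is deployed. This Python sketch, with hypothetical VLAN IDs and subnets, flags a shared VLAN or overlapping subnets. It cannot prove that the two networks route to each other; that still needs a real ping across the router.

```python
from ipaddress import ip_network


def tep_separation_issues(host_vlan: int, host_subnet: str,
                          edge_vlan: int, edge_subnet: str) -> list:
    """Return design problems for host and Edge TEP networks on 2-pNIC hosts."""
    issues = []
    if host_vlan == edge_vlan:
        issues.append(f"host and Edge TEPs share VLAN {host_vlan}")
    if ip_network(host_subnet).overlaps(ip_network(edge_subnet)):
        issues.append(f"TEP subnets overlap: {host_subnet} and {edge_subnet}")
    return issues


if __name__ == "__main__":
    # Hypothetical values: two VLANs, two subnets, routed externally.
    print(tep_separation_issues(1614, "10.11.14.0/24", 1615, "10.11.15.0/24"))
    # Same VLAN and subnet for both, which is the mistake described above.
    print(tep_separation_issues(1614, "10.11.14.0/24", 1614, "10.11.14.0/24"))
```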
Mistake 5: Forgetting north-south routing and BGP on the uplinks
The overlay can be perfect and your workloads still have no path to the outside world if the Edge uplinks are not peered with the physical fabric. NSX Edge nodes establish north-south connectivity, commonly over BGP, to your top-of-rack or border routers across dedicated uplink VLANs. Design these uplink VLANs, the peer IPs, the autonomous system numbers, and any route filtering before the Edge cluster goes in. A quick sanity checklist:
- Two uplink VLANs (one per fabric side) for Edge redundancy.
- BGP neighbor IPs and ASNs agreed with the network team.
- MTU on uplink VLANs matching the rest of the design.
- Route advertisement and filtering policy decided, not improvised.
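The first three items on that checklist are mechanical enough to lint. The sketch below is an illustration in Python, not an NSX or switch API: the `Uplink` fields are hypothetical, the addresses come from the documentation ranges, and the ASNs are from the private range. The routing policy item is a design decision and is deliberately left out.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_interface


@dataclass(frozen=True)
class Uplink:
    vlan_id: int
    edge_ip: str    # Edge uplink interface with prefix length, e.g. "192.0.2.2/24"
    peer_ip: str    # BGP neighbor on the ToR or border router
    local_asn: int
    peer_asn: int
    mtu: int


def check_uplinks(uplinks, design_mtu):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    if len({u.vlan_id for u in uplinks}) < 2:
        problems.append("fewer than two distinct uplink VLANs")
    for u in uplinks:
        iface = ip_interface(u.edge_ip)
        peer = ip_address(u.peer_ip)
        if peer not in iface.network:
            problems.append(f"VLAN {u.vlan_id}: peer {peer} is not in {iface.network}")
        if peer == iface.ip:
            problems.append(f"VLAN {u.vlan_id}: peer and Edge use the same IP {peer}")
        for side, asn in (("local", u.local_asn), ("peer", u.peer_asn)):
            if not 1 <= asn <= 4294967294:
                problems.append(f"VLAN {u.vlan_id}: {side} ASN {asn} is not valid")
        if u.mtu != design_mtu:
            problems.append(f"VLAN {u.vlan_id}: MTU {u.mtu} differs from design MTU {design_mtu}")
    return problems


# Hypothetical peering plan, one uplink VLAN per fabric side.
EXAMPLE_UPLINKS = [
    Uplink(1616, "192.0.2.2/24", "192.0.2.1", 65001, 65000, 9000),
    Uplink(1617, "198.51.100.2/24", "198.51.100.1", 65001, 65000, 9000),
]

if __name__ == "__main__":
    for problem in check_uplinks(EXAMPLE_UPLINKS, design_mtu=9000) or ["uplink plan passed"]:
        print(problem)
```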
Mistake 6: Designing for the old NSX topology and ignoring the VCF 9 model
VCF 9 reworked the networking model, and carrying forward a VCF 5.x mental model leaves capacity and simplicity on the table. Three changes matter most at design time. First, VCF 9 supports deploying a single NSX Manager for environments that want to reduce the resource footprint, while a three-node cluster remains the choice for the highest availability. Second, the management domain can now share NSX Managers with workload domains, cutting both the resource cost and the lifecycle overhead of running NSX. Third, VCF 9 introduces a Transit Gateway, in centralized (CTGW) and distributed (DTGW) flavors, and the VMware Virtual Private Cloud (VPC) construct, which together change how you give workloads elastic, highly available connectivity. Decide deliberately whether you want single or clustered NSX Managers, whether to share managers across domains, and whether VPC and Transit Gateway fit your tenancy model, rather than defaulting to the older one-NSX-per-domain pattern.
Mistake 7: Skipping physical fabric validation
The recommended underlay for VCF is a leaf-spine fabric with redundant uplinks per host. Many problems that look like VCF software issues are actually a half-configured fabric: a port left as access instead of trunk, a VLAN not allowed on a trunk, an MTU set on the SVI but not the physical interface, or a single point of failure where redundancy was assumed. Before you start, walk the fabric and confirm that every host port trunks the right VLANs with 802.1Q tagging, that MTU is consistent on physical interfaces and SVIs, and that there is genuine link and device redundancy. This validation pass costs an hour and saves a day of deployment troubleshooting.
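If you can export port facts from your switches in any structured form, the walk-through can be partly automated. The Python sketch below assumes you have already gathered those facts into simple records; the `HostPort` fields, switch names and port names are hypothetical, and nothing here talks to a real switch.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class HostPort:
    host: str
    switch: str
    port: str
    mode: str                 # "trunk" or "access"
    allowed_vlans: frozenset
    interface_mtu: int


def check_fabric(ports, required_vlans, design_mtu, svi_mtu):
    """Return a list of problems found in the collected port and SVI facts."""
    problems = []
    switches_by_host = defaultdict(set)
    for p in ports:
        where = f"{p.switch} {p.port} ({p.host})"
        switches_by_host[p.host].add(p.switch)
        if p.mode != "trunk":
            problems.append(f"{where}: mode is {p.mode}, expected trunk")
        missing = sorted(set(required_vlans) - p.allowed_vlans)
        if missing:
            problems.append(f"{where}: VLANs not allowed on trunk: {missing}")
        if p.interface_mtu < design_mtu:
            problems.append(f"{where}: interface MTU {p.interface_mtu} is below {design_mtu}")
    for vlan, mtu in sorted(svi_mtu.items()):
        if mtu < design_mtu:
            problems.append(f"SVI for VLAN {vlan}: MTU {mtu} is below {design_mtu}")
    for host, switches in sorted(switches_by_host.items()):
        if len(switches) < 2:
            problems.append(f"{host}: all uplinks land on one switch, no device redundancy")
    return problems


if __name__ == "__main__":
    vlans = frozenset({1611, 1612, 1613, 1614, 1615})
    # Hypothetical facts: the second port was left as access with a default MTU.
    ports = [
        HostPort("esx01", "leaf-a", "Eth1/1", "trunk", vlans, 9216),
        HostPort("esx01", "leaf-b", "Eth1/1", "access", frozenset({1611}), 1500),
    ]
    svis = {1614: 9000, 1615: 1500}
    for problem in check_fabric(ports, vlans, 9000, svis):
        print(problem)
```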
Final Thoughts
Network design is where VCF 9 deployments are won or lost. None of these seven mistakes is exotic, which is exactly why they keep happening: they are easy to overlook in the rush to run the installer. Build a VLAN and MTU table, prove the overlay path with vmkping, confirm TEP addressing and Edge routing, and make a conscious decision about the new VCF 9 networking constructs. Do that and the SDDC Manager workflow tends to just work. Next in the series we move from the network to the storage layer and vSAN design.