Dr. Pranay Jha


VCF 9 Network Design: 7 Mistakes That Break Your Deployment (VCF 9 Series, Part 5)

The seven network design mistakes that most often break a VMware Cloud Foundation 9 bring-up, from overlay MTU and TEP addressing to the new VCF 9 networking model, with fixes for each.

VCF 9 Series · Part 5 of 36

TL;DR · Key Takeaways

  • Most failed VCF 9 bring-ups trace back to the physical network, not the software. Validate the fabric before you run the installer.
  • Get MTU right end to end: at least 1600 on the host overlay and at least 1700 on the NSX Edge overlay, with jumbo frames (9000) everywhere as the safe target.
  • Separate your traffic. Management, vMotion, vSAN, host overlay and Edge overlay each want their own VLAN, gateway and IP range.
  • Host TEPs need addressing that actually works: provide DHCP for host Tunnel Endpoints, and keep Edge TEPs routable to host TEPs.
  • VCF 9 changed the networking model. Single NSX Manager, shared NSX Managers across domains, Transit Gateway and VPC change how you should design, not just deploy.
Who this is for: Architects and admins designing the network for a new VMware Cloud Foundation 9 deployment.

Prerequisites: Working knowledge of VLANs, vSphere Distributed Switch, NSX overlay concepts and BGP basics.

Ask anyone who has run a VMware Cloud Foundation bring-up where it went wrong, and the answer is almost never “the SDDC Manager workflow.” It is the network. A missed MTU value, a TEP pool that cannot route, a VLAN that was never trunked to the right ports: any one of these stalls the deployment at a step that gives you a frustratingly generic error. VCF 9 makes the software side smoother than ever, but it still assumes the underlay is correct on day one. This post walks through the seven network design mistakes that most reliably break a VCF 9 deployment, and the fix for each, so you can design the fabric right the first time.

If you have not already locked down your sizing and host counts, start with the VCF 9 planning and prerequisites checklist. And if the fleet, instance and domain model is still fuzzy, the VCF 9 architecture overview sets the context for everything below.

Disclaimer: Network changes on a production fabric can take down running workloads. Validate your target BOM and interoperability, back up switch configs, stage changes in a maintenance window, and test MTU and routing with prechecks before you trust them in a VCF deployment.

Mistake 1: Treating overlay MTU as an afterthought

This is the single most common cause of a stuck bring-up. NSX overlay traffic is encapsulated in Geneve, which adds overhead on top of your payload. If the path cannot carry the larger frame, host TEPs fail to form tunnels and segments never come up. The minimum is an MTU of 1600 on the NSX host overlay VLAN, enabled end to end through every switch and uplink in the path. The NSX Edge overlay wants at least 1700. In practice the cleanest design is jumbo frames (9000) on all infrastructure VLANs, which removes the guesswork entirely. The catch is that “end to end” means every device: leaf, spine, and any interconnect. One switch left at 1500 silently drops the encapsulated frames.

Prove it before you deploy. From an ESXi host, a large-packet ping with the do-not-fragment bit set tells you immediately whether the path honors your MTU:

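# Test the overlay path with an 8972-byte payload (9000 MTU minus 28 bytes of IP and ICMP headers), DF set
# TEP vmknics live in the overlay netstack (named "vxlan" on ESXi), so select it explicitly
vmkping ++netstack=vxlan -I vmk10 -d -s 8972 <remote_tep_ip>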

# Confirm the configured MTU on each vmknic and the VDS uplinks
esxcli network ip interface list
esxcli network vswitch dvs vmware list | grep -i mtu

If the large ping fails but a default-size ping succeeds, you have an MTU mismatch somewhere in the physical path. Fix the fabric; do not lower the overlay MTU.
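The payload size passed to the ping is simple arithmetic, and it is easy to get wrong when you test a 1600 or 1700 MTU instead of 9000. A minimal Python sketch, assuming an IPv4 path with a 20-byte IP header (no options) and an 8-byte ICMP echo header; the function name is illustrative:

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo header


def vmkping_payload(mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    if mtu <= IPV4_HEADER + ICMP_HEADER:
        raise ValueError(f"MTU {mtu} is too small to carry an ICMP echo")
    return mtu - IPV4_HEADER - ICMP_HEADER


# Payload sizes for the three MTU values discussed above.
for mtu in (1600, 1700, 9000):
    print(f"MTU {mtu}: use -s {vmkping_payload(mtu)} with the DF bit set")
```

Testing one byte above the computed payload is a useful negative check: that ping should fail if the path MTU is exactly what you configured.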

Mistake 2: Collapsing every traffic type onto one VLAN

VCF expects distinct, 802.1Q-tagged VLANs for management, vMotion, vSAN or NFS, the NSX host overlay, and the NSX Edge overlay, plus uplink VLANs for north-south peering. Trying to save VLANs by merging vMotion and vSAN, or by running the overlay on the management VLAN, creates contention and breaks the assumptions the VCF network pool makes about IP allocation. Each of these networks needs its own VLAN ID, gateway, and a CIDR range with enough addresses for one IP per host (and headroom for growth).

Design the VLAN plan as a table before you touch the installer: VLAN ID, subnet, gateway, MTU, and purpose for each. vMotion and vSAN are defined together in a VCF network pool, so make sure those ranges are sized and routable. Getting this right up front avoids re-IP work later, which is painful once domains are deployed.
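Once the table exists, it can be checked mechanically before anyone opens the installer. The following is a rough sketch, not a VCF tool: the `Vlan` record, the `check_plan` function, the purpose labels and every address in the sample plan are hypothetical. It applies the same one-IP-per-host sizing to every row, which is a simplification.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from itertools import combinations

# Minimum MTU per overlay network, taken from the figures in Mistake 1.
MIN_MTU = {"host-overlay": 1600, "edge-overlay": 1700}


@dataclass
class Vlan:
    purpose: str
    vlan_id: int
    subnet: str
    gateway: str
    mtu: int


def check_plan(plan, host_count):
    """Return a list of problems; an empty list means the plan passed."""
    problems = []
    ids = [v.vlan_id for v in plan]
    for dup in sorted({i for i in ids if ids.count(i) > 1}):
        problems.append(f"VLAN ID {dup} is used more than once")
    for v in plan:
        net = ip_network(v.subnet)
        if ip_address(v.gateway) not in net:
            problems.append(f"{v.purpose}: gateway {v.gateway} is outside {v.subnet}")
        usable = net.num_addresses - 3  # minus network, broadcast and gateway
        if usable < host_count:
            problems.append(f"{v.purpose}: {usable} usable IPs for {host_count} hosts")
        if v.mtu < MIN_MTU.get(v.purpose, 1500):
            problems.append(f"{v.purpose}: MTU {v.mtu} is below the minimum")
    for a, b in combinations(plan, 2):
        if ip_network(a.subnet).overlaps(ip_network(b.subnet)):
            problems.append(f"{a.purpose} and {b.purpose}: subnets overlap")
    return problems


# Hypothetical plan for a 16-host domain; all values are placeholders.
plan = [
    Vlan("management", 1611, "10.11.11.0/24", "10.11.11.1", 9000),
    Vlan("vmotion", 1612, "10.11.12.0/24", "10.11.12.1", 9000),
    Vlan("vsan", 1613, "10.11.13.0/24", "10.11.13.1", 9000),
    Vlan("host-overlay", 1614, "10.11.14.0/24", "10.11.14.1", 9000),
    Vlan("edge-overlay", 1615, "10.11.15.0/24", "10.11.15.1", 9000),
]
for line in check_plan(plan, host_count=16) or ["plan looks consistent"]:
    print(line)
```

A check like this only proves the table is internally consistent. It says nothing about whether the switches are actually configured to match it.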


Mistake 3: No DHCP for host Tunnel Endpoints

The NSX host overlay relies on each ESXi host obtaining a TEP address. The standard design serves host TEP addresses from DHCP on the host overlay VLAN, and a missing or misconfigured DHCP scope is a classic reason hosts fail to join the overlay transport zone. Make sure a DHCP server is reachable on that VLAN, that the scope is large enough for every host (including future expansion), and that lease times are sane. Edge TEPs, by contrast, are typically assigned from a static IP pool you define in NSX, so do not assume the same mechanism covers both.
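Sizing the scope is a small calculation that is worth writing down. The sketch below is illustrative only: the function names are made up, and the default of two TEP addresses per host is an assumption (it depends on your uplink count and teaming policy), so adjust it to your design.

```python
from ipaddress import ip_address


def required_leases(hosts: int, teps_per_host: int = 2, growth_percent: int = 25) -> int:
    """Leases needed for current hosts plus headroom for expansion.

    teps_per_host=2 is an assumption, not a VCF constant.
    """
    extra_hosts = (hosts * growth_percent + 99) // 100  # integer ceiling
    return (hosts + extra_hosts) * teps_per_host


def scope_size(first_ip: str, last_ip: str) -> int:
    """Number of addresses in an inclusive DHCP range."""
    return int(ip_address(last_ip)) - int(ip_address(first_ip)) + 1


# Hypothetical 16-host domain and DHCP range on the host overlay VLAN.
needed = required_leases(hosts=16)
available = scope_size("10.11.14.10", "10.11.14.59")
print(f"need {needed} leases, scope offers {available}")
```

If the scope comes up short, enlarge it before bring-up rather than relying on short lease times to recycle addresses.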

Mistake 4: Edge TEP and host TEP in the same VLAN on 2-pNIC hosts

On hosts with only two physical NICs, the NSX Edge overlay and the host overlay cannot share the same TEP VLAN. The Edge TEP and the ESXi host TEP must sit in different VLANs that are routable to each other through an external router. Skip this and the Edge transport nodes never establish tunnels to the hosts, which surfaces late in the workflow as a tunnel-down state that is hard to diagnose. If you have four or more pNICs you have more flexibility, but the two-NIC design is common and this rule trips up a lot of first deployments. Plan two TEP VLANs and confirm they can route to one another.
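The two-NIC rule is easy to encode as a planning check. This is a small sketch with hypothetical names and addresses. It can confirm that the two TEP networks are distinct on paper, but it cannot confirm that they route to each other; only a ping between real TEP interfaces does that.

```python
from ipaddress import ip_network


def check_tep_vlans(host_vlan: int, host_subnet: str,
                    edge_vlan: int, edge_subnet: str) -> list:
    """Flag host and Edge TEP networks that are not separate."""
    problems = []
    if host_vlan == edge_vlan:
        problems.append(f"host and Edge TEPs share VLAN {host_vlan}")
    if ip_network(host_subnet).overlaps(ip_network(edge_subnet)):
        problems.append(f"{host_subnet} and {edge_subnet} overlap")
    return problems


# Hypothetical values: separate VLANs and subnets, so no problems are reported.
print(check_tep_vlans(1614, "10.11.14.0/24", 1615, "10.11.15.0/24"))
```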

Mistake 5: Forgetting north-south routing and BGP on the uplinks

The overlay can be perfect and your workloads still have no path to the outside world if the Edge uplinks are not peered with the physical fabric. NSX Edge nodes establish north-south connectivity, commonly over BGP, to your top-of-rack or border routers across dedicated uplink VLANs. Design these uplink VLANs, the peer IPs, the autonomous system numbers, and any route filtering before the Edge cluster goes in. A quick sanity checklist:

  • Two uplink VLANs (one per fabric side) for Edge redundancy.
  • BGP neighbor IPs and ASNs agreed with the network team.
  • MTU on uplink VLANs matching the rest of the design.
  • Route advertisement and filtering policy decided, not improvised.
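The first three items on that checklist can be verified against a planning sheet before the Edge cluster is deployed. A minimal sketch, assuming a hypothetical `Uplink` record per Edge uplink; the VLAN IDs, addresses and private ASNs are placeholders, and the ASN check only rejects values outside the 32-bit range or reserved at its ends.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_interface


@dataclass
class Uplink:
    vlan_id: int
    edge_ip: str    # Edge uplink address with prefix, e.g. "172.27.11.2/24"
    peer_ip: str    # top-of-rack or border router address
    local_asn: int
    peer_asn: int
    mtu: int


def check_uplinks(uplinks, design_mtu):
    """Return a list of problems with the uplink plan; empty means it passed."""
    problems = []
    if len({u.vlan_id for u in uplinks}) < 2:
        problems.append("fewer than two distinct uplink VLANs")
    for u in uplinks:
        if ip_address(u.peer_ip) not in ip_interface(u.edge_ip).network:
            problems.append(f"VLAN {u.vlan_id}: peer {u.peer_ip} is not on the Edge subnet")
        for asn in (u.local_asn, u.peer_asn):
            if not 1 <= asn <= 4294967294:
                problems.append(f"VLAN {u.vlan_id}: ASN {asn} is not usable")
        if u.mtu != design_mtu:
            problems.append(f"VLAN {u.vlan_id}: MTU {u.mtu} differs from design MTU {design_mtu}")
    return problems


# Hypothetical plan: one uplink VLAN per fabric side.
uplinks = [
    Uplink(2711, "172.27.11.2/24", "172.27.11.1", 65003, 65001, 9000),
    Uplink(2712, "172.27.12.2/24", "172.27.12.1", 65003, 65001, 9000),
]
print(check_uplinks(uplinks, design_mtu=9000) or "uplink plan looks consistent")
```

Route advertisement and filtering policy is deliberately left out: it is a design decision to agree with the network team, not something a script can validate for you.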

Mistake 6: Designing for the old NSX topology and ignoring the VCF 9 model

VCF 9 reworked the networking model, and carrying forward a VCF 5.x mental model leaves capacity and simplicity on the table. Three changes matter most at design time:

  • Single NSX Manager. VCF 9 supports deploying a single NSX Manager for environments that want to reduce the resource footprint, while a three-node cluster remains the choice for the highest availability.
  • Shared NSX Managers. The management domain can now share NSX Managers with workload domains, cutting both the resource cost and the lifecycle overhead of running NSX.
  • Transit Gateway and VPC. VCF 9 introduces a Transit Gateway, in centralized (CTGW) and distributed (DTGW) flavors, and the VMware Virtual Private Cloud (VPC) construct, which together change how you give workloads elastic, highly available connectivity.

Decide deliberately whether you want single or clustered NSX Managers, whether to share managers across domains, and whether VPC and Transit Gateway fit your tenancy model, rather than defaulting to the older one-NSX-per-domain pattern.

Mistake 7: Skipping physical fabric validation

The recommended underlay for VCF is a leaf-spine fabric with redundant uplinks per host. Many problems that look like VCF software issues are actually a half-configured fabric: a port left as access instead of trunk, a VLAN not allowed on a trunk, an MTU set on the SVI but not the physical interface, or a single point of failure where redundancy was assumed. Before you start, walk the fabric and confirm that every host port trunks the right VLANs with 802.1Q tagging, that MTU is consistent on physical interfaces and SVIs, and that there is genuine link and device redundancy. This validation pass costs an hour and saves a day of deployment troubleshooting.
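If you can export the host-facing port configuration from your switches, part of that walk can be automated. The sketch below assumes you have already parsed the export into simple records; the `Port` structure, the port names and the VLAN IDs are hypothetical, and nothing here talks to a real switch.

```python
from dataclasses import dataclass


@dataclass
class Port:
    name: str
    mode: str            # "trunk" or "access"
    allowed_vlans: set   # VLAN IDs permitted on the port
    mtu: int


def audit_ports(ports, required_vlans, expected_mtu):
    """Return findings for host-facing ports that do not match the design."""
    findings = []
    for p in ports:
        if p.mode != "trunk":
            findings.append(f"{p.name}: mode is {p.mode}, expected trunk")
        missing = sorted(required_vlans - p.allowed_vlans)
        if missing:
            findings.append(f"{p.name}: VLANs not allowed: {missing}")
        if p.mtu != expected_mtu:
            findings.append(f"{p.name}: MTU {p.mtu}, expected {expected_mtu}")
    return findings


# Hypothetical export: the second port is missing a VLAN and still at MTU 1500.
required = {1611, 1612, 1613, 1614, 1615}
ports = [
    Port("leaf1:Eth1/1", "trunk", {1611, 1612, 1613, 1614, 1615}, 9216),
    Port("leaf2:Eth1/1", "trunk", {1611, 1612, 1613, 1614}, 1500),
]
for finding in audit_ports(ports, required, expected_mtu=9216):
    print(finding)
```

The expected MTU of 9216 in the sample is only an example of a switch-side value set above the 9000 used on the hosts; use whatever your fabric standard is. The same pattern extends to SVIs if you export their MTU as well.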

Final Thoughts

Network design is where VCF 9 deployments are won or lost. None of these seven mistakes are exotic, which is exactly why they keep happening: they are easy to overlook in the rush to run the installer. Build a VLAN and MTU table, prove the overlay path with vmkping, confirm TEP addressing and Edge routing, and make a conscious decision about the new VCF 9 networking constructs. Do that and the SDDC Manager workflow tends to just work. Next in the series we move from the network to the storage layer and vSAN design.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
