VCF 9 Network Design: 7 Mistakes That Break Your Deployment (VCF 9 Series, Part 5)

Seven network design errors that fail a VCF 9 bring-up, framed as symptom, cause and fix. The MTU drop on a single transit hop is the one that wastes the most hours.

by Dr. Pranay Jha · June 13, 2026

VCF 9 Series · Part 5 of 37

TL;DR · Key Takeaways

The deployment-breaker that wastes the most hours is an invisible MTU drop on a single transit hop, not a wrong config value.
When NSX transport nodes use a VDS, the MTU on the VDS wins. Editing the uplink profile and expecting it to apply is a dead end.
Missing reverse DNS records and an under-sized TEP pool are the next two most common stalls.
Keep the NSX gateway interface MTU at least 200 bytes below the fabric MTU.
Design overlay, vSAN, and vMotion for 25 Gb or higher. Sharing a saturated 10 Gb pair is a latency time bomb.

You can pass every checkbox in the planning workbook and still watch a VCF 9 bring-up fail at network validation. The reasons are almost always the same handful of design errors, and they share a theme: they are invisible until the Installer pings across them. Here are the seven that break deployments, framed as symptom, cause, and fix.

VCF 9 management-domain traffic types, their VLANs and MTU, with the networking mistakes that break bring-up.

1. MTU: the 1600 floor and the VDS-owns-it trap

Symptom: the Installer fails at the network connectivity or NSX MTU validation step, or you raise the MTU, redeploy, and the overlay still fragments.
Cause: two problems that often travel together. First, the physical fabric or an L3 hop between racks has an MTU below 1600 bytes on the TEP VLAN, and as Broadcom puts it, a device that receives a frame larger than its MTU drops the frame. Second, when NSX transport nodes run on a vSphere VDS, the MTU set on the attached uplink profile is ignored, because the VDS MTU wins.
Fix: set the MTU on the VDS in vCenter, not the uplink profile, and configure the entire path (VMkernel, VDS, physical switches and routers) to at least 1600, with 1700 recommended and 9000 ideal. Keep the NSX gateway or logical interface MTU at least 200 bytes below the fabric MTU, and guest VM MTU 100 to 200 bytes below that, because asymmetric MTU is a classic silent dropper. Test between TEP VMkernels on different leaf pairs with vmkping -d -s 1572, since the drop is usually at the inter-rack boundary, not the access port. The prerequisite detail is in Part 4.
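
The 1572 in that vmkping command is the 1600-byte floor minus the headers the ping adds on top of its payload. Here is a minimal sketch of the arithmetic, assuming IPv4 without options (20 bytes) and a standard ICMP echo header (8 bytes). The function name is mine, not part of any VMware tool.

```python
# Header overhead that vmkping's -s payload size does not include.
IPV4_HEADER = 20  # bytes, IPv4 header without options (assumption)
ICMP_HEADER = 8   # bytes, ICMP echo header


def vmkping_payload(mtu: int) -> int:
    """Return the largest ICMP payload that fits in one unfragmented IPv4 packet."""
    if mtu <= IPV4_HEADER + ICMP_HEADER:
        raise ValueError(f"MTU {mtu} is too small to carry an ICMP echo")
    return mtu - IPV4_HEADER - ICMP_HEADER


# The three MTU targets named in the fix above.
for mtu in (1600, 1700, 9000):
    print(f"MTU {mtu}: vmkping -d -s {vmkping_payload(mtu)}")
```

If you standardise on 9000 end to end, test with the 9000-byte payload size as well as 1572, because a path can pass the floor and still drop jumbo frames.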

The worst version I lived through cost most of a Saturday. A four-rack management domain passed every workbook check, then bring-up failed the NSX MTU validation. The VDS was set to 9000, the switches reported 9000, and a vmkping inside a rack was clean. The drop was at one inter-rack Layer 3 hop, where a leaf pair had been rebuilt with a default 1500 MTU and nobody had reapplied the jumbo config. It took about five hours, most of them spent re-checking the VDS and the uplink profile, which were both fine, before a vmkping between TEP VMkernels on different leaf pairs finally showed the drop at the boundary. The lesson is blunt: test MTU across racks, not inside one, and never assume a switch rebuild kept your jumbo settings.

Each layer's MTU sits below that of the layer beneath it (guest VM below NSX gateway, gateway below physical fabric); mismatched MTU drops frames silently.

Set the overlay MTU on the VDS in vCenter; the uplink-profile value does not apply.
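
The layering rule from the fix above (fabric at 1600 or more, gateway at least 200 bytes below the fabric, guest at least 100 bytes below the gateway) is easy to check mechanically before bring-up. The sketch below is a hypothetical helper, not a VCF or NSX API. It takes three MTU values you have collected yourself and lists any rule they break.

```python
from typing import List

OVERLAY_FLOOR = 1600      # minimum fabric MTU for the overlay
GATEWAY_HEADROOM = 200    # gateway must sit at least this far below the fabric
GUEST_HEADROOM = 100      # guest must sit at least this far below the gateway


def check_mtu_layers(fabric: int, gateway: int, guest: int) -> List[str]:
    """Return a list of rule violations; an empty list means the layering is sound."""
    problems = []
    if fabric < OVERLAY_FLOOR:
        problems.append(f"fabric MTU {fabric} is below the {OVERLAY_FLOOR} floor")
    if fabric - gateway < GATEWAY_HEADROOM:
        problems.append(
            f"gateway MTU {gateway} is less than {GATEWAY_HEADROOM} bytes below fabric MTU {fabric}"
        )
    if gateway - guest < GUEST_HEADROOM:
        problems.append(
            f"guest MTU {guest} is less than {GUEST_HEADROOM} bytes below gateway MTU {gateway}"
        )
    return problems


# A jumbo fabric with correctly stepped-down layers passes cleanly.
print(check_mtu_layers(fabric=9000, gateway=8800, guest=8700))
# A fabric at the bare floor with default 1500 everywhere above it does not.
print(check_mtu_layers(fabric=1600, gateway=1500, guest=1500))
```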

2. Forward DNS without the reverse record

Symptom: bring-up or component registration fails early.
Cause: forward A records were created but the PTR (reverse) records were omitted.
Fix: every VCF component, and the Installer appliance itself, needs both forward and reverse records resolvable before deployment. The 9.x wizard often asks only for FQDNs and resolves the rest, so a missing PTR fails validation rather than prompting you.
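
To check a whole list of FQDNs in one pass, a forward-then-reverse lookup can be scripted. This is a minimal sketch using Python's standard resolver calls. The function names are mine, and the lookups are injectable so the logic can be exercised without a live DNS server.

```python
import socket
from typing import Callable


def forward_lookup(fqdn: str) -> str:
    """Resolve an FQDN to an IPv4 address using the system resolver."""
    return socket.gethostbyname(fqdn)


def reverse_lookup(ip: str) -> str:
    """Resolve an IP address back to its primary hostname via the PTR record."""
    return socket.gethostbyaddr(ip)[0]


def check_dns_pair(
    fqdn: str,
    forward: Callable[[str], str] = forward_lookup,
    reverse: Callable[[str], str] = reverse_lookup,
) -> str:
    """Return 'ok', or a short description of the missing or mismatched record."""
    try:
        ip = forward(fqdn)
    except OSError:  # socket.gaierror and socket.herror both derive from OSError
        return "missing A record"
    try:
        name = reverse(ip)
    except OSError:
        return f"missing PTR record for {ip}"
    # Compare case-insensitively and ignore a trailing root dot.
    if name.rstrip(".").lower() != fqdn.rstrip(".").lower():
        return f"PTR for {ip} returns {name}, expected {fqdn}"
    return "ok"
```

Call check_dns_pair() once for each component FQDN and for the Installer appliance; anything other than "ok" is a record to fix before you start the wizard.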

3. TEP pool sized 1:1 with hosts

Symptom: hosts fail to finish NSX transport-node preparation, or run out of TEP IPs partway through.
Cause: the TEP subnet was sized one address per host, ignoring that each active uplink gets its own TEP, so a 2-NIC VDS uses 2 TEPs per host.
Fix: size the TEP IP pool for roughly twice the host count, and leave headroom for growth.
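
The sizing arithmetic is worth doing explicitly. The sketch below assumes two TEPs per host (one per active uplink, as above) and one address in the subnet reserved for the gateway. Both are parameters you should adjust to your own design, and the function names are illustrative.

```python
import ipaddress


def teps_required(hosts: int, teps_per_host: int = 2) -> int:
    """TEP addresses consumed: one per active uplink on every host."""
    return hosts * teps_per_host


def usable_addresses(cidr: str, reserved: int = 1) -> int:
    """Assignable addresses in a subnet after removing network, broadcast and reserved IPs."""
    network = ipaddress.ip_network(cidr)
    # num_addresses counts the network and broadcast addresses, so subtract them.
    return max(0, network.num_addresses - 2 - reserved)


def pool_fits(cidr: str, hosts: int, teps_per_host: int = 2) -> bool:
    """True when the subnet can hold every TEP for the given host count."""
    return teps_required(hosts, teps_per_host) <= usable_addresses(cidr)


# 16 hosts with 2 uplinks each need 32 TEPs.
for cidr in ("10.0.40.0/27", "10.0.40.0/24"):
    print(cidr, usable_addresses(cidr), pool_fits(cidr, hosts=16))
```

Run it with the host count you expect to grow into, not the count you deploy on day one.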

4. Collapsing VLANs and missing tags

Symptom: one host fails transport-node prep while the rest succeed, or the Installer rejects the network layout.
Cause: either management and overlay traffic were merged onto one VLAN, or a required VLAN tag is missing on a single leaf switch.
Fix: keep separate VLANs for ESX management, VM management, vMotion, vSAN, host TEP, edge TEP, and uplinks, and confirm every required VLAN is trunked and tagged on every host uplink port. A single missing tag surfaces as one inconsistent host, which is a confusing way to spend an afternoon.
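
If you can export the allowed-VLAN list for each host-facing switch port, finding the one missing tag is a set difference. The sketch below is illustrative only: the port names are hypothetical, and the VLAN IDs are borrowed from the example plan later in this article.

```python
from typing import Dict, List, Set


def missing_vlans(trunks: Dict[str, Set[int]], required: Set[int]) -> Dict[str, List[int]]:
    """Map each uplink port to the sorted list of required VLANs it does not carry."""
    gaps = {}
    for port, allowed in trunks.items():
        missing = sorted(required - allowed)
        if missing:
            gaps[port] = missing
    return gaps


# Hypothetical data: management, vMotion, vSAN, host TEP, edge TEP.
REQUIRED = {10, 20, 30, 40, 50}
TRUNKS = {
    "esx01-vmnic0": {10, 20, 30, 40, 50},
    "esx01-vmnic1": {10, 20, 30, 40, 50},
    "esx02-vmnic0": {10, 20, 30, 40, 50},
    "esx02-vmnic1": {10, 20, 30, 50},  # host TEP VLAN 40 was never tagged on this leaf port
}

print(missing_vlans(TRUNKS, REQUIRED))  # prints {'esx02-vmnic1': [40]}
```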

5. Under-provisioned uplinks and edge SPOFs

Symptom: the deployment succeeds but storage and overlay latency spikes under load, or a single edge failure drops north-south traffic.
Cause: overlay, vSAN, and vMotion were designed to share a saturated 10 Gb pair, or a single NSX Edge was deployed without redundant uplinks and BGP peers.
Fix: 10 Gb is supported only on the smallest profiles, as an accommodation for legacy switches. Broadcom highly recommends 25 Gb or higher for greenfield vSAN and overlay networking. Size the edge cluster and Tier-0 uplinks for real throughput and HA before go-live, not after. The NSX constructs behind this are explained in Part 10.

6. Shared vMotion network and dynamic VMkernel IPs

Symptom: validation rejects the network layout, or vMotion stalls under load once you are live.
Cause: vMotion was placed on a shared network rather than its own dedicated VLAN, or VMkernel interfaces are pulling dynamically allocated IPs.
Fix: VCF expects a dedicated network for vMotion and statically assigned VMkernel IP addresses. If your hosts currently use DHCP on the VMkernel, move them to static before bring-up. The same rule that surfaces here also blocks a brownfield converge, so it is worth fixing once and properly.

7. NTP drift that surfaces as certificate failures

Symptom: components fail to register, or NSX and vCenter certificates are rejected even though the names resolve.
Cause: hosts and appliances are not time-synced, and clock skew breaks certificate validation and vSAN operations.
Fix: point every host and every appliance at the same reachable NTP source and confirm sync before deployment. Time is a silent dependency in VCF, and a few minutes of drift can masquerade as a DNS or certificate problem and send you debugging the wrong layer for an hour.

Mistake	Quick fix
1. MTU floor / VDS owns it	Set MTU on the VDS to 1600+ (not the uplink profile); vmkping across racks
2. Missing reverse DNS	Create PTR records for every component and the Installer appliance
3. TEP pool sized 1:1	Size the TEP subnet for roughly 2x the host count, plus headroom
4. Collapsed VLANs / missing tag	Separate VLANs per traffic type; trunk and tag on every host uplink
5. Thin uplinks / edge SPOF	25 Gb+ for vSAN and overlay; redundant edges with BGP peers
6. Shared vMotion / DHCP VMkernel	Dedicated vMotion VLAN; statically assigned VMkernel IPs
7. NTP drift	One reachable NTP source for all hosts and appliances; confirm sync

A two-minute pre-flight

Run this sweep before you ever open the Installer. It catches the three failures that account for most stalled bring-ups: MTU, DNS, and time.

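# MTU across racks on the TEP VMkernel (1600 floor): 1572-byte payload + 28 bytes of IP/ICMP headers
# -d sets don't-fragment; ++netstack=vxlan assumes the TEP vmk sits in the NSX overlay netstack
vmkping ++netstack=vxlan -I vmk10 -d -s 1572 <remote-tep-ip>

# Forward and reverse DNS for every component FQDN
nslookup vcenter.example.local   # expect the IP back
nslookup 10.0.10.21              # expect the FQDN back

# Time sync state on each host
esxcli system ntp test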

A VLAN and subnet plan you can copy

Most of the mistakes above trace back to a network plan that was never written down in one place. Here is a worked example for a single management domain that you can adapt. The point is not the exact numbers; it is having every traffic type, its VLAN, its MTU, and a right-sized subnet decided before anyone racks a host.

Traffic type	VLAN	Example subnet	MTU	Notes
Management	10	10.0.10.0/24	1500	vCenter, NSX, SDDC Manager, host management
vMotion	20	10.0.20.0/24	9000	dedicated, jumbo end to end
vSAN	30	10.0.30.0/24	9000	dedicated, jumbo end to end
Host TEP (overlay)	40	10.0.40.0/24	1600 min	size the pool for growth, not today
Edge TEP	50	10.0.50.0/24	1600 min	separate from host TEP
Uplink / peering	60, 61	10.0.60.0/30 x2	1500	two uplinks to two ToR switches, no shared point of failure

Size the TEP subnet for the host count you will grow into, not the count you deploy. A /24 gives you room, a /27 does not, and resizing an overlay pool later is not a five-minute change. Keep vMotion and vSAN on their own VLANs at 9000, and never let the overlay drop below 1600.
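
Once the plan is written down, it can be linted. This is a minimal sketch that checks a plan for reused VLAN IDs, overlapping subnets, and overlay networks below the 1600 floor. The rows mirror the table above, except the uplink row, which is left out because the two /30 subnets are not listed individually. The function and constant names are mine.

```python
import ipaddress
from itertools import combinations
from typing import List, Tuple

# (traffic type, VLAN, subnet, MTU), copied from the example plan.
PLAN = [
    ("Management", 10, "10.0.10.0/24", 1500),
    ("vMotion", 20, "10.0.20.0/24", 9000),
    ("vSAN", 30, "10.0.30.0/24", 9000),
    ("Host TEP", 40, "10.0.40.0/24", 1600),
    ("Edge TEP", 50, "10.0.50.0/24", 1600),
]
OVERLAY_TYPES = {"Host TEP", "Edge TEP"}
OVERLAY_FLOOR = 1600


def validate_plan(plan: List[Tuple[str, int, str, int]]) -> List[str]:
    """Return a list of problems found in the plan; an empty list means it passed."""
    problems = []
    vlans = [vlan for _, vlan, _, _ in plan]
    for vlan in sorted(set(vlans)):
        if vlans.count(vlan) > 1:
            problems.append(f"VLAN {vlan} is used more than once")
    for (name_a, _, net_a, _), (name_b, _, net_b, _) in combinations(plan, 2):
        if ipaddress.ip_network(net_a).overlaps(ipaddress.ip_network(net_b)):
            problems.append(f"{name_a} and {name_b} subnets overlap")
    for name, _, _, mtu in plan:
        if name in OVERLAY_TYPES and mtu < OVERLAY_FLOOR:
            problems.append(f"{name} MTU {mtu} is below {OVERLAY_FLOOR}")
    return problems


print(validate_plan(PLAN))  # prints []
```

Keeping the plan as data also means the same list can feed the DNS, TEP pool, and trunk checks, so all of them read from one source of truth.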

The network mistake that has burned me most is the MTU floor, and it never fails loudly. A deployment came up clean, then Geneve traffic started fragmenting under load because one uplink in the path was still at 1500 while everything else sat at 1600. Nothing errored; throughput just quietly collapsed on the busy segments. Chasing it took most of a day because the fabric looked correct everywhere I first checked. Test end-to-end MTU with a full-size, don't-fragment payload before you trust it, because a single device left at the default will pass a default-size ping and still fail the platform.

Start with MTU, then the rest

The hours-eater is almost never a wrong value you can see; it is an invisible MTU drop on a single transit hop. Validate MTU between TEP VMkernels on different racks before you touch the Installer, set MTU on the VDS rather than the uplink profile, and never paper over a failed MTU check with the skip flag, because you are converting an upfront red check into random production packet loss. Get those right and most of this list never happens. Which of these seven has burned you most, the MTU hop or the missing PTR?

About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
