Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture

NSX 9 Design and Planning: Transport Zones, MTU, TEP Pools and EDP (NSX Series, Part 4)

The NSX 9 design decisions that decide whether your overlay works: the single overlay transport zone, GENEVE MTU, TEP IP pools and VLANs, and EDP mode.

NSX Series · Part 4 of 30

TL;DR · Key Takeaways

  • In VCF 9 you get a single overlay transport zone per NSX instance, shared by every cluster across every workload domain. Stop designing for multiple overlay TZs; you cannot have them here.
  • GENEVE needs MTU 1600 minimum end to end. Set TEP and uplink MTU to 1700, or jumbo 9000, and make sure nothing in the path is smaller. One narrow link breaks the overlay.
  • Host TEPs and Edge TEPs belong on different VLANs when an Edge VM runs on a prepared host, and those VLANs must be routable to each other. This is the classic day-one outage.
  • EDP Standard is the default datapath in VCF 9 and the right answer for almost everyone. EDP Dedicated reserves physical CPUs and is for prescriptive, known-traffic builds only.
  • Get transport zones, MTU, TEP pools, and uplink profiles right on paper before you prepare a single host. Fixing them later means re-preparing transport nodes.

Who this is for: architects and admins planning an NSX 9 overlay in VCF 9.

Prerequisites: you understand the NSX 9 architecture (Part 2) and have a physical network team you can talk MTU and VLANs with.

The overlay either comes up clean or it fights you for a week, and which one you get is decided on paper, before anyone touches a host. I have walked into enough “NSX is broken” engagements to know the pattern: the build was fine, the design was not. A transport zone scoped wrong, an MTU that is 1500 somewhere it should be 1700, a TEP VLAN that cannot reach the Edge. None of those show up until traffic does. This part is the design pass I run with every client before we prepare a single transport node, in the order I actually do it.

The objects you build, and the order

NSX design is a dependency chain. Each object assumes the one before it is right, so getting the order straight is half the battle. You start in the physical underlay (VLANs and MTU your network team owns), land on the VDS, wrap that in an uplink profile, scope it with transport zones, hand out TEP addresses from an IP pool, package the lot into a transport node profile, and only then prepare the cluster. Verify last. Skip a step and you find out three objects later, in a place that does not obviously point back to the cause.

Build order: design top to bottom, verify last

  1. Underlay: VLANs + MTU
  2. VDS: host switch (7.0+)
  3. Uplink profile: teaming, VLAN, MTU
  4. Transport zones: overlay + VLAN
  5. TEP IP pool: host TEP addresses
  6. TN profile: cluster template
  7. Prepare cluster: apply to hosts
  8. Verify: vmkping the TEPs

Steps 4-6 are NSX objects and the ones most often scoped wrong. Steps 1-3 are shared with the team that owns the fabric.

The dependency chain. Decide every box on paper before step 7 touches a host.
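
One way to keep the chain honest in a planning workbook is a minimal sketch like the one below. It assumes nothing about NSX itself: the step names are simply the eight boxes above written as hypothetical identifiers, and the two helpers only report which box is still undecided.

```python
from typing import Optional, Set

# The eight steps from the build order above, as illustrative identifiers.
# These names are hypothetical labels, not NSX object types or API names.
BUILD_ORDER = [
    "underlay",
    "vds",
    "uplink_profile",
    "transport_zones",
    "tep_ip_pool",
    "tn_profile",
    "prepare_cluster",
    "verify",
]


def first_blocker(decided: Set[str]) -> Optional[str]:
    """Return the earliest step in the chain that is not decided yet, or None."""
    for step in BUILD_ORDER:
        if step not in decided:
            return step
    return None


def ready_to_prepare(decided: Set[str]) -> bool:
    """True only when every step before 'prepare_cluster' has been decided."""
    cutoff = BUILD_ORDER.index("prepare_cluster")
    return all(step in decided for step in BUILD_ORDER[:cutoff])


# Transport zones were decided but the uplink profile was skipped.
print(first_blocker({"underlay", "vds", "transport_zones"}))  # prints: uplink_profile
```

The point of returning the earliest gap, rather than any gap, is that it mirrors how the failure shows up in practice: the symptom appears three objects later, but the fix is at the first missing decision.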

Transport zones: the single-overlay reality in VCF 9

A transport zone defines what a transport node can reach. It identifies the traffic type, overlay or VLAN, and binds to a named VDS. Here is the VCF 9 constraint that resets old NSX-T habits: VCF supports a single overlay transport zone per NSX instance, and every vSphere cluster, within and across workload domains, shares it. If you came up building one overlay TZ per zone or per tenant, that pattern is gone. Tenancy and isolation now live in Projects and VPCs (Part 22), not in a wall of overlay transport zones.

VLAN transport zones are different: you can have several, and they carry the VLAN-backed segments that connect NSX to the physical world, including the uplinks the Edge uses to reach your Tier-0. So the mental model is one shared overlay fabric for east-west tenant traffic, plus the VLAN transport zones that stitch it to the outside.

One overlay TZ, shared by every cluster

Overlay transport zone (one per NSX instance):
  • Mgmt domain cluster
  • VI WLD-1 clusters
  • VI WLD-2 clusters

All clusters share one GENEVE overlay. Isolation is done with Projects and VPCs, not separate overlay transport zones.

VLAN transport zones (you can have several of these):
  • Edge uplinks to Tier-0
  • VLAN-backed segments

One overlay transport zone for the whole instance; multiple VLAN transport zones to reach the fabric.
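
The one-overlay rule is easy to lint on paper before anything is built. The sketch below is a hedged illustration only: `TransportZone` and `check_transport_zones` are hypothetical names for a design worksheet, not NSX API objects, and the zone names in the example are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TransportZone:
    """One planned transport zone. Illustrative worksheet row, not an NSX object."""
    name: str
    kind: str  # "overlay" or "vlan"


def check_transport_zones(zones: List[TransportZone]) -> List[str]:
    """Return design problems; an empty list means the layout fits the VCF 9 rule."""
    problems = []
    overlay = [z.name for z in zones if z.kind == "overlay"]
    if len(overlay) != 1:
        problems.append(
            f"expected exactly 1 overlay transport zone, found {len(overlay)}: {overlay}"
        )
    for z in zones:
        if z.kind not in ("overlay", "vlan"):
            problems.append(f"{z.name}: unknown transport zone type {z.kind!r}")
    return problems


# Hypothetical design: one shared overlay, several VLAN transport zones.
design = [
    TransportZone("tz-overlay", "overlay"),
    TransportZone("tz-vlan-edge-uplinks", "vlan"),
    TransportZone("tz-vlan-segments", "vlan"),
]
print(check_transport_zones(design))  # prints: []
```

A per-tenant overlay layout carried over from older designs would fail this check immediately, which is the habit the rule is meant to break.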

MTU: the number that quietly breaks overlays

GENEVE wraps every overlay frame in an outer header, so the packet on the wire is bigger than the guest ever sees. The guest still thinks it is sending 1500 bytes; the TEP adds roughly 50 to 90 bytes of encapsulation on top. That is why NSX requires the transport path to carry at least 1600 bytes, and why I set TEP and uplink MTU to 1700 as a habit, with the physical fabric on jumbo (9000) where the network team allows it. The rule that actually matters: MTU is a weakest-link property. The path is only as wide as its narrowest hop, so one access port left at 1500 silently drops large overlay frames while small pings sail through.

MTU is a weakest-link chain

  Guest VM (1500 bytes) → TEP, GENEVE (+ header = ~1600) → VDS uplink (set 1700) → Physical fabric (jumbo 9000 ideal)

If any single hop is below 1600, large overlay frames drop and small pings still pass. That is the signature: ping works, file copies and database traffic hang. Test the full path, not just reachability.

Encapsulation makes the on-wire frame larger than the guest’s. Every hop must carry it.
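
The weakest-link rule reduces to a `min()` over the hops. Here is a minimal sketch, assuming you have collected the configured MTU of each hop from the network team; the hop names are hypothetical, and 1600 is the minimum stated above.

```python
from typing import Dict, List

GENEVE_MIN_MTU = 1600  # minimum transport MTU stated in the article


def path_mtu(hops: Dict[str, int]) -> int:
    """The effective MTU of a path is the smallest MTU of any hop on it."""
    return min(hops.values())


def narrow_hops(hops: Dict[str, int], required: int = GENEVE_MIN_MTU) -> List[str]:
    """Names of hops whose MTU is below what the overlay needs."""
    return [name for name, mtu in hops.items() if mtu < required]


# Hypothetical path: everything is sized correctly except one access port.
hops = {"vds_uplink": 1700, "tor_access_port": 1500, "spine": 9000}
print(path_mtu(hops), narrow_hops(hops))  # prints: 1500 ['tor_access_port']
```

Configured values are still only the design document in another form, so a check like this narrows where to look; it does not replace testing the live path.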

Do not trust the design document; test the path. From a prepared host, ping another host’s TEP with the don’t-fragment bit set and a payload sized to exercise the full MTU. If 1572 passes with DF set, you have your 1600. If it fails while a normal ping works, you have found your narrow hop.

# From an ESXi host, test TEP-to-TEP MTU on the overlay netstack
# -d sets do-not-fragment; -s 1572 payload + 8 (ICMP) + 20 (IPv4) = a 1600-byte packet
vmkping ++netstack=vxlan -d -s 1572 <remote_host_TEP_IP>

# Success: replies return. Failure: "sendto() failed (Message too long)"
# or 100% loss while a normal-size vmkping still works = a narrow hop.

In practice: when someone reports “NSX is slow or flaky but not down”, MTU is my first suspect and the vmkping above is my first command. It has found the root cause more often than any log file.
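
The 1572 is not a magic number: it is the target MTU minus the 20-byte IPv4 header and the 8-byte ICMP header. A small sketch of that arithmetic follows, assuming IPv4 with no IP options. The helper names are hypothetical, and `192.0.2.10` is a documentation address standing in for a real TEP.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def vmkping_payload(target_mtu: int) -> int:
    """Payload size for -s so the resulting IP packet is exactly target_mtu bytes."""
    return target_mtu - IPV4_HEADER - ICMP_HEADER


def vmkping_command(remote_tep: str, target_mtu: int = 1600) -> str:
    """Build the don't-fragment test command for a given target MTU."""
    return f"vmkping ++netstack=vxlan -d -s {vmkping_payload(target_mtu)} {remote_tep}"


print(vmkping_command("192.0.2.10"))
# prints: vmkping ++netstack=vxlan -d -s 1572 192.0.2.10
print(vmkping_payload(1700), vmkping_payload(9000))  # prints: 1672 8972
```

Testing at 1672 or 8972 as well as 1572 tells you whether the path carries the MTU you actually configured, not just the minimum.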

TEP IP pools and the VLAN routing trap

Every transport node needs a tunnel endpoint address. For hosts, you hand these out from an NSX IP pool (or DHCP, but a pool is cleaner and what I use). Size the pool for the cluster you will grow into, not the one you have today, and remember that NSX uses multiple TEPs per host for load spreading, so the address count is hosts times TEPs-per-host plus headroom.

Worked example: sizing the host TEP pool

Target cluster: 24 hosts, 2 TEPs per host for multi-TEP load spreading.

Addresses needed = 24 × 2 = 48. Add growth headroom and round up to a /26 (62 usable).

A /27 (30 usable) would not even cover today. Pools are painful to resize after hosts are prepared, so size for the cluster’s full footprint on day one.
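
The same arithmetic as a sketch you can rerun for any cluster size. It assumes IPv4 and counts usable addresses the way the example above does (subnet size minus network and broadcast). It does not subtract the gateway address, so treat the result as a floor. The function names are illustrative.

```python
def tep_addresses_needed(hosts: int, teps_per_host: int, headroom: int = 0) -> int:
    """Addresses the pool must hold: hosts x TEPs per host, plus spare addresses."""
    return hosts * teps_per_host + headroom


def usable_addresses(prefix_len: int) -> int:
    """Usable IPv4 addresses in a subnet, excluding network and broadcast."""
    return 2 ** (32 - prefix_len) - 2


def smallest_prefix(addresses: int) -> int:
    """Longest prefix (smallest subnet) whose usable count covers the need."""
    for prefix_len in range(30, 7, -1):
        if usable_addresses(prefix_len) >= addresses:
            return prefix_len
    raise ValueError("requirement is larger than a /8")


needed = tep_addresses_needed(hosts=24, teps_per_host=2)
print(needed, smallest_prefix(needed), usable_addresses(26))  # prints: 48 26 62
```

With 48 addresses needed, a /26 leaves 14 spare. If the cluster's full footprint is more than seven additional two-TEP hosts, the next size up is the safer day-one choice.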

Now the trap that takes out more first-day deployments than anything else on this list. When you run NSX Edge nodes as VMs on a host that is itself a prepared transport node, the host TEP and the Edge TEP must sit on different VLANs and different subnets, and those two TEP VLANs have to be routable to each other. If you put both on the same VLAN, inter-TEP traffic between the host and its local Edge breaks in ways that look like everything and nothing. Plan two TEP VLANs, confirm the fabric routes between them, and you avoid the single most common NSX bring-up outage.

Host TEP and Edge TEP: two VLANs, routable

Prepared ESXi host:
  • Host TEP: VLAN 1644
  • Edge VM TEP: VLAN 1648

Different VLANs, different subnets. Same VLAN here breaks inter-TEP traffic between the host and its local Edge.

Physical fabric: must ROUTE between VLAN 1644 and 1648. Confirm the SVIs and the route before bring-up.

Edge-on-host means two TEP VLANs that the fabric routes between. Same VLAN is the classic day-one outage.
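
The two paper checks, different VLANs and different subnets, can be sketched with Python's standard `ipaddress` module. The VLAN IDs come from the figure above; the subnets are hypothetical placeholders. Routability between the two VLANs cannot be proven offline, so that part still needs a live test from the host.

```python
import ipaddress
from typing import List


def check_tep_vlans(host_vlan: int, host_subnet: str,
                    edge_vlan: int, edge_subnet: str) -> List[str]:
    """Return design problems for an Edge VM running on a prepared host."""
    problems = []
    host_net = ipaddress.ip_network(host_subnet)
    edge_net = ipaddress.ip_network(edge_subnet)
    if host_vlan == edge_vlan:
        problems.append(f"host and Edge TEPs share VLAN {host_vlan}")
    if host_net.overlaps(edge_net):
        problems.append(f"TEP subnets overlap: {host_net} and {edge_net}")
    return problems


# VLANs from the figure; the subnets are made-up examples.
print(check_tep_vlans(1644, "172.16.44.0/24", 1648, "172.16.48.0/24"))  # prints: []
```

An empty list here only means the plan is self-consistent. The fabric still has to route between the two subnets, which is a question for the network team and a vmkping, not for a script.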

EDP mode: Standard for almost everyone

Enhanced Data Path is the high-performance forwarding stack, and in VCF 9 it is the default for new workload domains and new NSX installs. There are two modes, and the choice is easier than people make it. EDP Standard runs the improved packet-forwarding path out of the box with no manual CPU reservation, and Broadcom recommends it for general compute and for Edge clusters. EDP Dedicated (the prescriptive performance mode) pins physical CPU cores to the datapath and is only worth it when you know your traffic profile cold and have a specific, measured reason. Pick Standard unless someone can name that reason.

| Mode | What it does | CPU | Use it for |
| --- | --- | --- | --- |
| EDP Standard | Improved forwarding stack, high packet-per-second out of the box. Default in VCF 9. | No manual reservation | General compute and Edge clusters. The default answer. |
| EDP Dedicated (Performance) | Prescriptive mode tuned to a known traffic pattern. | Reserves physical cores | NFV / telco / measured high-throughput cases only. |

One planning caveat: EDP wants a supported NIC and driver. Before you commit a host model, check the NIC against the VCF 9 EDP compatibility list, because an unsupported card quietly falls back and you lose the performance you designed for. There is more on this in the datapath deep dive (Part 26) and the broader picture in how NSX Enhanced Data Path delivers its throughput boost in VCF 9.

Uplink and transport node profiles

Uplink profile

The uplink profile is the policy that says how a transport node connects upward: the teaming policy across its uplinks, the transport (TEP) VLAN, and the MTU. Get the teaming policy deliberate here. Named teaming policies let you pin specific traffic, for example Edge uplinks, to specific physical NICs, which matters when you care about which cable carries north-south. Set the MTU on the uplink profile to 1700 to match the TEP and fabric.

Transport node profile

The transport node profile is the template you apply to a whole cluster so every host comes out identical: the VDS, the transport zones, the uplink profile, and the TEP pool, bundled. For stretched or multi-rack clusters where hosts in different racks need different TEP VLANs or subnets, a sub-transport node profile templates that subset without breaking the single cluster-wide profile. Use it; hand-configuring per-host networking is how drift and one-off outages start.
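
To make the "one profile, per-rack exceptions" idea concrete, here is a rough data-model sketch. It is a planning illustration only: `ClusterProfile`, `TepSettings`, the pool names, and the rack labels are all hypothetical and do not mirror the NSX API schema. VLAN 1644 is borrowed from the TEP figure earlier; 1652 is made up.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class TepSettings:
    """TEP VLAN and IP pool for a group of hosts (illustrative, not an NSX type)."""
    vlan: int
    ip_pool: str


@dataclass
class ClusterProfile:
    """Cluster-wide template with optional per-rack TEP overrides."""
    vds: str
    transport_zones: Tuple[str, ...]
    uplink_profile: str
    default_tep: TepSettings
    rack_overrides: Dict[str, TepSettings] = field(default_factory=dict)

    def tep_for(self, rack: str) -> TepSettings:
        """Hosts inherit the cluster default unless their rack has an override."""
        return self.rack_overrides.get(rack, self.default_tep)


profile = ClusterProfile(
    vds="vds-01",
    transport_zones=("tz-overlay", "tz-vlan-edge-uplinks"),
    uplink_profile="uplink-mtu1700",
    default_tep=TepSettings(vlan=1644, ip_pool="tep-pool-rack-a"),
    rack_overrides={"rack-b": TepSettings(vlan=1652, ip_pool="tep-pool-rack-b")},
)
print(profile.tep_for("rack-a").vlan, profile.tep_for("rack-b").vlan)  # prints: 1644 1652
```

Everything except the TEP settings stays identical across racks, which is the property that keeps drift out of the cluster.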

The design decisions on one page

This is the checklist I leave with clients. Decide every row before host preparation, because every one of them is painful to change after transport nodes are live.

| Decision | Recommendation | What bites you if wrong |
| --- | --- | --- |
| Overlay transport zone | One per NSX instance, shared by all clusters (VCF rule). | Designing for many overlay TZs; not supported in VCF. |
| GENEVE MTU | 1700 on TEP and uplinks; jumbo 9000 on the fabric. | One 1500 hop drops large frames; pings still pass. |
| Host TEP pool | Size for the full cluster, hosts × TEPs + headroom. | Pool exhaustion mid-expansion; painful to resize. |
| Host vs Edge TEP VLANs | Separate VLANs and subnets, routable, when Edge runs on host. | Same VLAN breaks inter-TEP; classic day-one outage. |
| EDP mode | Standard, unless you have a measured reason for Dedicated. | Unsupported NIC falls back; lost performance. |
| Profiles | Transport node profile per cluster; sub-profile for multi-rack. | Per-host hand config leads to drift and one-offs. |

Disclaimer: validate the design against the current VCF 9 BOM and NSX 9 configuration maximums, confirm physical MTU and TEP VLAN routing with the network team, and test on a non-production cluster before you prepare production transport nodes. Re-verify the exact NSX 9.x patch and any updated maximums before committing.

What I’d Do

Treat this as a one-page design that gets signed off before procurement closes, not a thing you figure out during bring-up. Lock the single overlay transport zone, set 1700 everywhere and jumbo on the fabric, size the TEP pool for the cluster you are growing into, split host and Edge TEP VLANs and prove the fabric routes between them, default to EDP Standard, and template everything with transport node profiles. Do that and the build is boring, which in networking is the highest compliment. Skip it and you will spend your first week chasing an overlay that pings fine and moves no data. Next up is Part 5: NSX Manager deployment and cluster bring-up. Which of these six decisions is least nailed down in your current design?


The design decisions you cannot easily change later

Some NSX choices are cheap to revisit and some are effectively permanent once workloads are running, and the design phase is where you separate them. The transport zone layout, the TEP pool sizing, and the underlay MTU are in the permanent column. A transport zone defines the span of your segments; a TEP pool that is sized too small quietly caps how many transport nodes you can ever add; and an underlay MTU left at the default 1500 instead of at least 1600 breaks the overlay the moment real traffic flows.

These are the decisions worth slowing down for, because changing them after deployment means touching every host in a maintenance window rather than editing a field. I would rather spend an extra afternoon in the planning workbook confirming that the TEP address space has headroom for years of growth and that at least 1600 is set end to end, including every routed hop between TEP subnets, than discover the constraint the day a new cluster will not join or large flows start hanging. Plan the things that are hard to undo as if they are hard to undo, because they are.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
