TL;DR · Key Takeaways
- The overwhelming majority of NSX overlay connectivity problems are MTU problems. GENEVE adds about 50 bytes, so your physical underlay needs at least 1600 bytes of MTU, end to end, including any L3 device between host TEPs and Edge TEPs.
- Test TEP reachability the right way: a vmkping over the GENEVE overlay network stack (still named vxlan on the host) with the don’t-fragment bit set and a 1572-byte payload, which is the largest that fits a 1600 MTU. A normal small ping that succeeds proves almost nothing.
- A tunnel stuck down with zero Rx or Tx packets is usually a firewall blocking BFD. The ports to clear are UDP 3784 and 4784, plus 9785, between transport nodes.
- Traceflow is the fastest way to see where a packet dies, but it needs the segment connected to a Tier-1 or Tier-0 with gateway connectivity on, or it fails for reasons that have nothing to do with your actual problem.
- Work the path in order: MTU first, then TEP reachability, then BFD and tunnels, then the logical path with Traceflow. Most teams jump straight to the logical layer and miss the MTU mismatch underneath.
I can usually guess the root cause of an NSX overlay outage before I log in. It is MTU. Not always, but often enough that MTU is the first thing I check and the last thing teams check, which is exactly backwards. The symptom is maddening because small packets work fine: you ping, it replies, everything looks healthy, and yet large flows hang, file copies stall, and applications time out intermittently. That signature, small traffic fine and large traffic broken, is the fingerprint of an MTU mismatch under a GENEVE overlay. So let me start there and work outward to the rest of the connectivity stack.
The GENEVE MTU math, and why 1600 is the number
NSX overlay segments tunnel your workload traffic inside GENEVE. A guest VM sends a normal frame with a 1500-byte payload, and NSX wraps it with GENEVE encapsulation that adds roughly 50 bytes of headers before it crosses the physical network between transport endpoints. So a 1500-byte inner frame becomes about 1550 bytes on the wire, and that is the minimum your underlay must carry without fragmenting. The reason everyone says 1600 rather than 1550 is headroom: GENEVE option fields can vary, so 1600 gives margin and is the long-standing recommendation. If any link, switch, or router in the transport path is left at the default 1500, those encapsulated frames get dropped or fragmented, and you get the small-works-large-fails signature.
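The arithmetic behind 1550 and 1600 can be sketched in a few lines of shell. This is a rough illustration, not NSX tooling: it assumes an IPv4 underlay, an untagged inner frame, and the standard header sizes (14-byte inner Ethernet, 8-byte GENEVE base header, 8-byte UDP, 20-byte outer IPv4). The function name `underlay_mtu_needed` is made up for this example.

```shell
#!/bin/sh
# Sketch: minimum underlay IP MTU needed to carry a guest packet inside GENEVE.
# Assumes an IPv4 outer header and an untagged inner Ethernet frame.
# underlay_mtu_needed is a hypothetical helper, not an NSX or ESX command.
underlay_mtu_needed() (
    guest_mtu=$1          # MTU configured inside the guest, e.g. 1500
    geneve_opts=${2:-0}   # bytes of GENEVE options, 0 when there are none
    inner_eth=14          # inner Ethernet header rides inside the tunnel
    geneve_base=8         # fixed GENEVE header
    udp=8                 # outer UDP header
    outer_ipv4=20         # outer IPv4 header without IP options
    echo $(( guest_mtu + inner_eth + geneve_base + geneve_opts + udp + outer_ipv4 ))
)

underlay_mtu_needed 1500       # prints 1550: the bare minimum
underlay_mtu_needed 1500 48    # prints 1598: 48 bytes of options still fit in 1600
```

The second call is the point of the headroom: once option bytes appear, 1550 is no longer enough, while 1600 still is.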
The part that catches people is the word end-to-end. It is not enough for the host uplinks to be at 1600. Every device the encapsulated traffic touches, including any L3 router between host TEPs and Edge TEPs on different subnets, has to honor the larger MTU. One forgotten interface on a transit switch is all it takes. I have spent more billable hours than I would like finding a single 1500-byte link in an otherwise correct 1600 fabric.
Test TEP reachability the right way
A plain ping between hosts proves the IPs can reach each other with tiny packets. It does not prove the overlay will work, because it never tests a full-size encapsulated frame. The correct test sends a large packet, with the don’t-fragment bit set, over the dedicated overlay network stack, so you are exercising exactly the path GENEVE uses. On a 1600 MTU network the largest ICMP payload that fits is 1572 bytes: 1600 minus a 20-byte IPv4 header and an 8-byte ICMP header. If the small ping succeeds and this large one fails, you have found your MTU mismatch.
# From an ESX host: large, do-not-fragment ping over the GENEVE TEP stack
# (the TEP netstack is still named "vxlan" on the host)
vmkping ++netstack=vxlan -d -s 1572 <remote-TEP-IP>
# Confirm the host TEP and tunnel state
esxcli network ip interface ipv4 get # TEP vmk IP settings
esxcli network ip interface list --netstack=vxlan # TEP vmk and its MTU
get vtep # NSX CLI: list TEPs
get tunnel-ports # NSX CLI: list tunnel ports
get tunnel <id> stats # NSX CLI: tunnel + BFD counters
# A small ping that works while -s 1572 fails == MTU mismatch in the path
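If your fabric runs something other than 1600, the payload size changes with it. The sketch below derives the largest do-not-fragment payload for any MTU and prints the matching vmkping line for each remote TEP. It only prints commands and never runs them. The helper names are hypothetical, and the TEP addresses are placeholders from the documentation range.

```shell
#!/bin/sh
# Sketch: derive the largest do-not-fragment ICMP payload for a given MTU
# and print the vmkping command for each remote TEP.
# max_ping_payload and print_tep_tests are made-up helper names.
max_ping_payload() (
    mtu=$1
    ipv4_hdr=20   # IPv4 header without options
    icmp_hdr=8    # ICMP echo header
    echo $(( mtu - ipv4_hdr - icmp_hdr ))
)

print_tep_tests() (
    mtu=$1; shift
    size=$(max_ping_payload "$mtu")
    for tep in "$@"; do
        # Only prints the command; run it yourself on the ESX host.
        echo "vmkping ++netstack=vxlan -d -s $size $tep"
    done
)

# Placeholder TEP addresses, replace with your own.
print_tep_tests 1600 192.0.2.11 192.0.2.12
```

Printing rather than executing keeps the sketch safe to run anywhere, and the output can be pasted into a host shell one line at a time.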
Tunnels down: look at BFD and the firewall
When a tunnel between transport nodes shows down, the counters tell you a lot. NSX uses BFD to monitor tunnel liveness, and if the BFD statistics show zero Rx or zero Tx packets, the traffic is not getting through at all, which points at something blocking the BFD path rather than a flapping link. The usual culprit is a firewall, often a gateway firewall rule or an upstream physical firewall, blocking the BFD ports. Clear UDP 3784 and 4784, and 9785, between your transport nodes, and a tunnel that looked permanently dead frequently comes straight up. The tunnel itself is GENEVE on UDP 6081, so that port has to pass between TEPs as well.
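The way I read those counters can be written down as a small decision helper. This is a sketch under assumptions: you type in the state and counters you read off the tunnel statistics, and `triage_tunnel` is a hypothetical function, not an NSX command. The last branch (non-zero counters on a down tunnel) is my extrapolation from the blocked-versus-flapping contrast above.

```shell
#!/bin/sh
# Sketch: map a tunnel's state and BFD counters to the next check.
# Inputs are numbers read off the tunnel statistics by hand;
# triage_tunnel is a hypothetical helper, not an NSX CLI command.
triage_tunnel() {
    state=$1    # "up" or "down" as reported for the tunnel
    bfd_rx=$2   # BFD packets received
    bfd_tx=$3   # BFD packets sent
    if [ "$state" = "up" ]; then
        echo "tunnel up: nothing to fix at this layer"
    elif [ "$bfd_rx" -eq 0 ] || [ "$bfd_tx" -eq 0 ]; then
        echo "BFD blocked: check firewalls between the TEPs"
    else
        # Assumption: traffic in both directions suggests loss, not a block.
        echo "BFD flowing but tunnel down: look for loss or a flapping link"
    fi
}

triage_tunnel down 0 1842
triage_tunnel down 911 907
```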
Traceflow, and working the path in order
Once the physical and tunnel layers are healthy, Traceflow is the tool that shows you the logical path: it injects a synthetic packet and renders every component it traverses, pinpointing exactly where a packet is dropped. It is excellent, with one prerequisite people miss. If your segment is not connected to a Tier-1 or Tier-0 gateway with gateway connectivity enabled, Traceflow can fail for that reason alone, which sends you chasing a phantom. Confirm the gateway attachment first, then trust the result.
The discipline that saves the most time is order. Work bottom-up: MTU end to end, then TEP reachability with a proper large do-not-fragment ping, then tunnel and BFD state, and only then the logical path with Traceflow. I cover the alarm and Traceflow tooling itself in Part 18 on monitoring and operations. Most failed troubleshooting sessions I am called into skipped the bottom of the stack and spent hours in the logical layer while a 1500-byte link sat quietly underneath.
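The ordering is simple enough to encode. The sketch below takes the pass or fail result of each check, as recorded by you, and reports the lowest layer that failed, which is the one to fix first. The function name and layer labels are illustrative only.

```shell
#!/bin/sh
# Sketch: report the lowest layer that failed, in the bottom-up order above.
# Each argument is "pass" or "fail", recorded by you after running the check;
# first_failed_layer and the layer labels are illustrative names only.
first_failed_layer() {
    for result in "mtu:$1" "tep-ping:$2" "tunnel-bfd:$3" "traceflow:$4"; do
        layer=${result%%:*}    # text before the colon
        status=${result#*:}    # text after the colon
        if [ "$status" != "pass" ]; then
            echo "$layer"
            return 0
        fi
    done
    echo "none"
}

# Traceflow failed, but so did the large ping: fix the lower layer first.
first_failed_layer pass fail pass fail
```

A failed Traceflow sitting on top of a failed large ping is not a Traceflow problem yet, and the helper makes that explicit by stopping at the first failure.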
| Symptom | Likely cause | Fix |
|---|---|---|
| Small pings work, large flows hang | MTU mismatch under GENEVE | Set 1600 end to end, retest with -d -s 1572 |
| Tunnel down, BFD Rx/Tx = 0 | Firewall blocking BFD | Open UDP 3784, 4784, 9785 between TNs |
| Traceflow fails immediately | Segment not gateway-connected | Attach to T1/T0, enable gateway connectivity |
| Transport node degraded, faulty TEP | TEP IP / uplink issue | Check TEP pool, vmk, uplink and re-verify |
| Intermittent drops across subnets | L3 device MTU between TEP subnets | Fix MTU on the routed interfaces too |
Worked example
New segment, two VMs on different hosts, cannot talk. Plain ping between the hosts succeeds, so the team blames the firewall rules and spends an afternoon there. Run vmkping ++netstack=vxlan -d -s 1572 between the TEPs and it fails. The host uplinks are 1600, but the leaf-to-spine link a network team rebuilt last week came back at 1500. Raise that one interface to 1600, the large ping succeeds, and the VMs talk. Total fix time once you tested at the right layer: ten minutes. Total time wasted in the firewall first: most of the day.
The Bottom Line
Overlay troubleshooting is mostly discipline, not cleverness. When connectivity breaks, resist the urge to start in the logical layer where the interesting features live, and work the stack from the bottom. Confirm 1600 MTU end to end, including every routed hop between TEP subnets, and prove it with a don’t-fragment 1572-byte ping over the overlay stack rather than a small ping that lies to you. If a tunnel is down with zero BFD counters, open UDP 3784, 4784, and 9785 before you suspect anything exotic. Only once the physical and tunnel layers are clean should you reach for Traceflow, and check the gateway attachment so it does not fail for the wrong reason. Internalize the small-works-large-fails fingerprint and you will diagnose the most common NSX outage in minutes instead of an afternoon. Did you actually test a full-size encapsulated packet, or just a ping that made you feel better?
The asymmetric-MTU trap on routed TEP networks
The MTU problem gets sneakier the moment your host TEPs and Edge TEPs sit on different subnets. When TEPs share a subnet, a single fabric misconfiguration tends to break everything uniformly, which is at least easy to spot. When they are routed, the Layer 3 device between the two TEP networks also has to honor the larger MTU, and that interface is the one everybody forgets. The result is a maddening partial failure: same-subnet overlay traffic works perfectly while host-to-Edge overlay traffic, the path that actually carries north-south, silently drops its full-size frames on that one undersized routed hop.
This is why the rule is end to end and not just on the hosts. Every interface the encapsulated traffic crosses, including the routed interfaces between TEP subnets, needs the 1600 MTU, and a single 1500 link anywhere in that path is enough to produce the small-works-large-fails signature on exactly the flows you care about most. When the symptom is that intra-rack overlay is fine but cross-rack or host-to-Edge overlay is broken, the routed TEP path is the first place to look, because that asymmetry is the fingerprint of one forgotten interface rather than a fabric-wide misconfiguration.
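The asymmetry falls out of one rule: a path can carry only what its smallest hop allows. Here is a minimal sketch with invented hop values, comparing a same-subnet TEP path with a routed host-to-Edge path where one interface was left at the default. `path_mtu` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: the usable MTU of a path is the smallest MTU of any hop on it.
# Hop values below are invented for illustration; path_mtu is a made-up helper.
path_mtu() (
    min=$1; shift
    for hop in "$@"; do
        if [ "$hop" -lt "$min" ]; then
            min=$hop
        fi
    done
    echo "$min"
)

# host uplink, leaf switch, host uplink
echo "same-subnet TEP path: $(path_mtu 1600 1600 1600)"
# host uplink, leaf, routed interface left at default, leaf, Edge uplink
echo "routed host-to-Edge path: $(path_mtu 1600 1600 1500 1600 1600)"
```

The first path reports 1600 and the second 1500, which is exactly the case where intra-rack overlay looks healthy while north-south traffic drops full-size frames.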
When you are stuck, capture beats theorizing
Overlay arguments go in circles when people debate MTU and firewall rules from memory instead of measuring. The fastest way out is to capture. A packet capture on the TEP vmkernel interface, or on the Edge, tells you definitively whether the encapsulated frames are arriving, whether they are full size, and whether the BFD packets that keep the tunnel alive are getting through. That evidence ends the debate, because you stop guessing which layer is broken and start seeing it.
The discipline pairs with the bottom-up ladder. Use the large do-not-fragment ping to localize an MTU problem, check the tunnel and BFD counters to localize a liveness problem, and reach for a full capture when the simpler checks disagree or point somewhere ambiguous. The goal at every step is to replace an opinion with an observation, because overlay faults are precisely the kind that feel like they could be three different things at once. Measure the path, read the counters, capture the frames, and the genuinely confusing cases resolve into the single concrete cause they always had.
Document the working MTU as part of the build
When you finally have the overlay working end to end at 1600, write it down, because an undocumented MTU design is one that gets painfully rediscovered six months later. Record which devices carry the larger MTU, which routed hops sit between the TEP subnets, and how to run the verification ping, so the next engineer, or you after you have forgotten the details, can confirm the path in minutes instead of re-deriving it during an outage. The overlay MTU is invisible until it breaks, which is exactly why a clear record of it is worth the few minutes it takes to capture.
That documentation pays off most during change. When a network team rebuilds a switch or replaces a router, the 1600 MTU is the kind of non-default setting that quietly reverts to 1500, and the only thing standing between that and a confusing partial outage is someone knowing the setting was supposed to be there and where to check it. Make the MTU design part of the build record and part of the change checklist for the underlay, and the most common overlay outage becomes one you catch in review rather than in production.
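One way to make the record usable in a change checklist is to keep the intended MTU next to each device and compare it with what you observe after the change. The sketch below does only the comparison. The device names and values are invented, the `device:recorded:observed` format is an assumption of this example, and gathering the observed values is left to whatever tooling your switches provide.

```shell
#!/bin/sh
# Sketch: compare recorded MTUs against values observed after a change.
# Each argument is "device:recorded:observed"; the names, values and format
# are invented for illustration. mtu_drift is a hypothetical helper.
mtu_drift() {
    drift=0
    for entry in "$@"; do
        device=${entry%%:*}      # text before the first colon
        rest=${entry#*:}         # "recorded:observed"
        recorded=${rest%%:*}
        observed=${rest#*:}
        if [ "$recorded" != "$observed" ]; then
            echo "DRIFT $device: recorded $recorded, observed $observed"
            drift=1
        fi
    done
    if [ "$drift" -eq 0 ]; then
        echo "no drift"
    fi
}

# After a switch rebuild: one interface came back at the default.
mtu_drift host1-uplink:1600:1600 leaf1-spine1:1600:1500 edge1-uplink:1600:1600
```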