Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture


Troubleshooting NSX 9 Connectivity and Overlay: GENEVE, MTU and the Tunnel That Will Not Come Up (NSX Series, Part 29)

Most NSX overlay outages trace back to one thing: MTU. Here is the GENEVE encapsulation math, how to test TEP connectivity properly, the BFD ports that silently block tunnels, and a symptom-to-fix path for the connectivity problems that actually bite.

NSX Series · Part 29 of 30

TL;DR · Key Takeaways

  • The overwhelming majority of NSX overlay connectivity problems are MTU problems. GENEVE adds about 50 bytes, so your physical underlay needs at least 1600 bytes of MTU, end to end, including any L3 device between host TEPs and Edge TEPs.
  • Test TEP reachability the right way: a vmkping over the overlay network stack (named vxlan on the host, even though it carries GENEVE) with the don’t-fragment bit set and a 1572-byte payload, which is the largest ICMP payload that fits a 1600 MTU (1600 minus 20 bytes of IPv4 header and 8 bytes of ICMP header). A normal small ping that succeeds proves almost nothing.
  • A tunnel stuck down with zero Rx or Tx packets is usually a firewall blocking BFD. The ports to allow are UDP 3784 and 4784, plus 9785, between transport nodes.
  • Traceflow is the fastest way to see where a packet dies, but it needs the segment connected to a Tier-1 or Tier-0 with gateway connectivity on, or it fails for reasons that have nothing to do with your actual problem.
  • Work the path in order: MTU first, then TEP reachability, then BFD and tunnels, then the logical path with Traceflow. Most teams jump straight to the logical layer and miss the MTU mismatch underneath.
Who this is for: NSX admins and network engineers debugging overlay connectivity, dropped traffic, or tunnels that will not come up.
Prerequisites: the overlay and transport-zone concepts from Parts 4 and 8, host access for esxcli, and NSX Manager for Traceflow.

I can usually guess the root cause of an NSX overlay outage before I log in. It is MTU. Not always, but often enough that MTU is the first thing I check and the last thing teams check, which is exactly backwards. The symptom is maddening because small packets work fine: you ping, it replies, everything looks healthy, and yet large flows hang, file copies stall, and applications time out intermittently. That signature, small traffic fine and large traffic broken, is the fingerprint of an MTU mismatch under a GENEVE overlay. So let me start there and work outward to the rest of the connectivity stack.

The GENEVE MTU math, and why 1600 is the number

NSX overlay segments tunnel your workload traffic inside GENEVE. A guest VM sends a normal frame with a 1500-byte payload, and NSX wraps it with GENEVE encapsulation that adds roughly 50 bytes of headers before it crosses the physical network between transport endpoints. So a 1500-byte inner frame becomes about 1550 bytes on the wire, and that is the minimum your underlay must carry without fragmenting. The reason everyone says 1600 rather than 1550 is headroom: GENEVE option fields can vary, so 1600 gives margin and is the long-standing recommendation. If any link, switch, or router in the transport path is left at the default 1500, those encapsulated frames get dropped or fragmented, and you get the small-works-large-fails signature.
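
The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, assuming an IPv4 underlay and a GENEVE header with no options; the header sizes are the standard fixed sizes for those protocols, while the constant and function names are made up for this example. An IPv6 underlay or GENEVE options would add more bytes, which is what the headroom up to 1600 is for.

```python
# Fixed header sizes in bytes, assuming an IPv4 underlay and no GENEVE options.
INNER_ETHERNET = 14  # the guest's own Ethernet header travels inside the tunnel
GENEVE_BASE = 8      # GENEVE header without any option fields
OUTER_UDP = 8
OUTER_IPV4 = 20


def geneve_overhead(option_bytes: int = 0) -> int:
    """Bytes added on top of the guest's IP packet, as counted against the underlay MTU."""
    return INNER_ETHERNET + GENEVE_BASE + option_bytes + OUTER_UDP + OUTER_IPV4


def required_underlay_mtu(guest_mtu: int = 1500, option_bytes: int = 0) -> int:
    """Smallest underlay MTU that carries a full-size guest packet without fragmenting."""
    return guest_mtu + geneve_overhead(option_bytes)


print(geneve_overhead())               # 50
print(required_underlay_mtu())         # 1550
print(1600 - required_underlay_mtu())  # 50 bytes of headroom left for options
```

With the defaults, the numbers match the text: about 50 bytes of overhead, 1550 bytes on the wire, and 50 bytes of margin at a 1600 MTU.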

The part that catches people is the word end-to-end. It is not enough for the host uplinks to be at 1600. Every device the encapsulated traffic touches, including any L3 router between host TEPs and Edge TEPs on different subnets, has to honor the larger MTU. One forgotten interface on a transit switch is all it takes. I have spent more billable hours than I would like finding a single 1500-byte link in an otherwise correct 1600 fabric.

[Diagram 1 content: a guest frame with a 1500-byte inner payload gains about 50 bytes of GENEVE encapsulation, for roughly 1550 bytes on the wire. An underlay MTU of 1500 drops or fragments it; 1600 or more gives headroom and must hold end to end, on every hop.]
Diagram 1: The encapsulation overhead is why 1500 is not enough. Set 1600 on every device in the transport path.

Test TEP reachability the right way

A plain ping between hosts proves the IPs can reach each other with tiny packets. It does not prove the overlay will work, because it never tests a full-size encapsulated frame. The correct test sends a large packet, with the don’t-fragment bit set, over the dedicated overlay network stack, so you are exercising exactly the path GENEVE uses. On a 1600 MTU network the largest payload that fits is 1572 bytes. If the small ping succeeds and this large one fails, you have found your MTU mismatch.

# From an ESX host: large, do-not-fragment ping over the overlay TEP netstack
# (the stack is named "vxlan" on the host even though it carries GENEVE)
vmkping ++netstack=vxlan -d -s 1572 <remote-TEP-IP>

# Confirm the host TEP and tunnel state
esxcli network ip interface list               # TEP vmk: MTU and netstack instance
get vtep                                       # NSX CLI: list TEPs
get tunnel-ports                               # NSX CLI: list tunnel ports
get tunnel <id> stats                          # NSX CLI: tunnel + BFD counters

# A small ping that works while -s 1572 fails == MTU mismatch in the path
In practice: always include the don’t-fragment flag. Without it, a router that should not be fragmenting will quietly fragment your test packet, the ping succeeds, and you conclude the path is fine while production still breaks. The don’t-fragment bit is what turns a feel-good ping into a real MTU test.
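
The 1572 figure is not magic: it is the path MTU minus the 20-byte IPv4 header and the 8-byte ICMP header. A small Python sketch can derive the payload for any MTU and print the matching test command. The helper names are hypothetical, the vmkping flags are the ones used in this article, and 192.0.2.10 is a documentation placeholder rather than a real TEP address.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_df_ping_payload(path_mtu: int) -> int:
    """Largest ICMP payload that fits the path MTU without fragmentation."""
    return path_mtu - IPV4_HEADER - ICMP_HEADER


def vmkping_command(remote_tep: str, path_mtu: int = 1600, netstack: str = "vxlan") -> str:
    """Build the large do-not-fragment ping command as a string (nothing is executed)."""
    size = max_df_ping_payload(path_mtu)
    return f"vmkping ++netstack={netstack} -d -s {size} {remote_tep}"


print(max_df_ping_payload(1600))      # 1572
print(vmkping_command("192.0.2.10"))  # vmkping ++netstack=vxlan -d -s 1572 192.0.2.10
```

If your fabric runs a larger MTU, such as 9000, the same calculation gives the payload to test with; validate the netstack name against your own hosts before relying on the generated command.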

Tunnels down: look at BFD and the firewall

When a tunnel between transport nodes shows down, the counters tell you a lot. NSX uses BFD to monitor tunnel liveness, and if the BFD statistics show zero Rx or zero Tx packets, the traffic is not getting through at all, which points at something blocking the BFD path rather than a flapping link. The usual culprit is a firewall, often a gateway firewall rule or an upstream physical firewall, blocking the BFD ports. Allow UDP 3784 and 4784, and 9785, between your transport nodes, and a tunnel that looked permanently dead frequently comes straight up.
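
The reasoning in this section reduces to two small checks, sketched here in Python under stated assumptions. The port set is simply the list given in this article, not an authoritative list for every NSX release. The function names and the idea of passing in the UDP ports your firewall allows are illustrative; nothing here queries NSX or a firewall.

```python
# UDP ports as listed in this article; validate them for your NSX release.
BFD_PORTS_FROM_ARTICLE = {3784, 4784, 9785}


def blocked_bfd_ports(allowed_udp_ports) -> list:
    """Ports from the article's list that the given allow-list does not cover."""
    return sorted(BFD_PORTS_FROM_ARTICLE - set(allowed_udp_ports))


def read_bfd_counters(rx: int, tx: int) -> str:
    """Interpret tunnel BFD counters the way the text describes."""
    if rx == 0 or tx == 0:
        return "blocked path suspected: check firewall rules for BFD"
    return "BFD packets flowing: look elsewhere"


print(blocked_bfd_ports([3784]))  # [4784, 9785]
print(read_bfd_counters(0, 120))  # blocked path suspected: check firewall rules for BFD
```

Feeding it the allow-list from a firewall rule review gives a quick answer to which of the listed ports still need opening.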

[Diagram 2 content: the tunnel path runs from the host TEP (vmk, MTU 1600) through the L3 path or firewall, which must honor both the MTU and the BFD ports (UDP 3784 / 4784 / 9785), to the Edge TEP (MTU 1600). A drop on the wire points to MTU; zero BFD counters point to blocked ports.]
Diagram 2: Two things break tunnels: an MTU drop on the wire, or a firewall eating the BFD liveness packets.

Traceflow, and working the path in order

Once the physical and tunnel layers are healthy, Traceflow is the tool that shows you the logical path: it injects a synthetic packet and renders every component it traverses, pinpointing exactly where a packet is dropped. It is excellent, with one prerequisite people miss. If your segment is not connected to a Tier-1 or Tier-0 gateway with gateway connectivity enabled, Traceflow can fail for that reason alone, which sends you chasing a phantom. Confirm the gateway attachment first, then trust the result.

The discipline that saves the most time is order. Work bottom-up: MTU end to end, then TEP reachability with a proper large do-not-fragment ping, then tunnel and BFD state, and only then the logical path with Traceflow. I cover the alarm and Traceflow tooling itself in Part 18 on monitoring and operations. Most failed troubleshooting sessions I am called into skipped the bottom of the stack and spent hours in the logical layer while a 1500-byte link sat quietly underneath.
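
The bottom-up order can be written down as an ordered list of checks that stops at the first failure. The sketch below is only a skeleton: the lambdas are stand-ins that return fixed values, where in practice each one would wrap a real observation such as the large do-not-fragment ping or the tunnel counters. All names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]


def first_failing_rung(ladder: List[Check]) -> Optional[str]:
    """Run checks bottom-up and return the name of the first one that fails."""
    for name, check in ladder:
        if not check():
            return name  # stop here: higher layers are not worth testing yet
    return None


# Stand-in results for illustration only.
ladder = [
    ("MTU 1600 end to end", lambda: True),
    ("TEP reachability (large DF ping)", lambda: True),
    ("Tunnel and BFD state", lambda: False),
    ("Logical path (Traceflow)", lambda: True),
]

print(first_failing_rung(ladder))  # Tunnel and BFD state
```

The early return is the point of the design: Traceflow is never consulted while a lower rung is still failing.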

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Small pings work, large flows hang | MTU mismatch under GENEVE | Set 1600 end to end, retest with -d -s 1572 |
| Tunnel down, BFD Rx/Tx = 0 | Firewall blocking BFD | Open UDP 3784, 4784, 9785 between TNs |
| Traceflow fails immediately | Segment not gateway-connected | Attach to T1/T0, enable gateway connectivity |
| Transport node degraded, faulty TEP | TEP IP / uplink issue | Check TEP pool, vmk, uplink and re-verify |
| Intermittent drops across subnets | L3 device MTU between TEP subnets | Fix MTU on the routed interfaces too |

Worked example

New segment, two VMs on different hosts, cannot talk. Plain ping between the hosts succeeds, so the team blames the firewall rules and spends an afternoon there. Run vmkping ++netstack=vxlan -d -s 1572 between the TEPs and it fails. The host uplinks are 1600, but the leaf-to-spine link a network team rebuilt last week came back at 1500. Raise that one interface to 1600, the large ping succeeds, and the VMs talk. Total fix time once you tested at the right layer: ten minutes. Total time wasted in the firewall first: most of the day.
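
The worked example comes down to one fact: the usable MTU of a path is the smallest MTU of any hop on it. A short Python sketch makes that concrete. The hop names and values are invented to mirror the scenario above, and in practice the list would come from your own switch and host configuration.

```python
def effective_path_mtu(path) -> int:
    """The usable MTU of a path is the minimum across all of its hops."""
    return min(mtu for _, mtu in path)


def undersized_hops(path, required_mtu: int = 1600) -> list:
    """Return every (hop, mtu) pair that falls below the required MTU."""
    return [(name, mtu) for name, mtu in path if mtu < required_mtu]


# Hypothetical hops matching the worked example.
path = [
    ("host-a uplink", 1600),
    ("leaf-1 to spine-1", 1500),  # the link that was rebuilt last week
    ("spine-1 to leaf-2", 1600),
    ("host-b uplink", 1600),
]

print(effective_path_mtu(path))  # 1500
print(undersized_hops(path))     # [('leaf-1 to spine-1', 1500)]
```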

The Bottom Line

Overlay troubleshooting is mostly discipline, not cleverness. When connectivity breaks, resist the urge to start in the logical layer where the interesting features live, and work the stack from the bottom. Confirm 1600 MTU end to end, including every routed hop between TEP subnets, and prove it with a don’t-fragment 1572-byte ping over the overlay stack rather than a small ping that lies to you. If a tunnel is down with zero BFD counters, open UDP 3784, 4784, and 9785 before you suspect anything exotic. Only once the physical and tunnel layers are clean should you reach for Traceflow, and check the gateway attachment so it does not fail for the wrong reason. Internalize the small-works-large-fails fingerprint and you will diagnose the most common NSX outage in minutes instead of an afternoon. Did you actually test a full-size encapsulated packet, or just a ping that made you feel better?

Disclaimer: commands and ports here are representative for NSX overlay troubleshooting and may vary by version and platform. Validate exact CLI syntax, network stack names, and BFD port numbers against current Broadcom NSX documentation for your release, and make firewall or MTU changes through your change process, testing on a non-critical path first.
[Diagram 3 content: work the stack from the bottom, because most failed sessions start at the top and miss the MTU sitting underneath. Rung 1: MTU, 1600 end to end. Rung 2: TEP reachability, DF 1572 ping. Rung 3: Tunnel / BFD, ports 3784/4784. Rung 4: Traceflow, logical path.]
Diagram 3: The triage ladder. Clear each rung before climbing to the next, and you find the MTU problem in minutes.

The asymmetric-MTU trap on routed TEP networks

The MTU problem gets sneakier the moment your host TEPs and Edge TEPs sit on different subnets. When TEPs share a subnet, a single fabric misconfiguration tends to break everything uniformly, which is at least easy to spot. When they are routed, the Layer 3 device between the two TEP networks also has to honor the larger MTU, and that interface is the one everybody forgets. The result is a maddening partial failure: same-subnet overlay traffic works perfectly while host-to-Edge overlay traffic, the path that actually carries north-south, silently drops its full-size frames on that one undersized routed hop.

This is why the rule is end to end and not just on the hosts. Every interface the encapsulated traffic crosses, including the routed interfaces between TEP subnets, needs the 1600 MTU, and a single 1500 link anywhere in that path is enough to produce the small-works-large-fails signature on exactly the flows you care about most. When the symptom is that intra-rack overlay is fine but cross-rack or host-to-Edge overlay is broken, the routed TEP path is the first place to look, because that asymmetry is the fingerprint of one forgotten interface rather than a fabric-wide misconfiguration.
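
A quick way to know whether this trap applies is to check if two TEPs sit on different subnets, since that means a routed hop lies between them. The sketch below uses Python's standard ipaddress module. The function name is hypothetical and the addresses are documentation ranges, not real TEP pools.

```python
import ipaddress


def teps_are_routed(tep_a: str, tep_b: str) -> bool:
    """True when two TEP interfaces (CIDR notation) are on different subnets."""
    a = ipaddress.ip_interface(tep_a)
    b = ipaddress.ip_interface(tep_b)
    return a.network != b.network


# Placeholder addresses from the documentation ranges.
print(teps_are_routed("192.0.2.10/24", "192.0.2.20/24"))     # False: same subnet
print(teps_are_routed("192.0.2.10/24", "198.51.100.10/24"))  # True: routed hop in between
```

When it returns True, the routed interfaces between the two TEP subnets belong on the list of places to verify the 1600 MTU.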

When you are stuck, capture beats theorizing

Overlay arguments go in circles when people debate MTU and firewall rules from memory instead of measuring. The fastest way out is to capture. A packet capture on the TEP vmkernel interface, or on the Edge, tells you definitively whether the encapsulated frames are arriving, whether they are full size, and whether the BFD packets that keep the tunnel alive are getting through. That evidence ends the debate, because you stop guessing which layer is broken and start seeing it.

The discipline pairs with the bottom-up ladder. Use the large do-not-fragment ping to localize an MTU problem, check the tunnel and BFD counters to localize a liveness problem, and reach for a full capture when the simpler checks disagree or point somewhere ambiguous. The goal at every step is to replace an opinion with an observation, because overlay faults are precisely the kind that feel like they could be three different things at once. Measure the path, read the counters, capture the frames, and the genuinely confusing cases resolve into the single concrete cause they always had.

Document the working MTU as part of the build

When you finally have the overlay working end to end at 1600, write it down, because an undocumented MTU design is one that gets painfully rediscovered six months later. Record which devices carry the larger MTU, which routed hops sit between the TEP subnets, and how to run the verification ping, so the next engineer, or you after you have forgotten the details, can confirm the path in minutes instead of re-deriving it during an outage. The overlay MTU is invisible until it breaks, which is exactly why a clear record of it is worth the few minutes it takes to capture.

That documentation pays off most during change. When a network team rebuilds a switch or replaces a router, the 1600 MTU is the kind of non-default setting that quietly reverts to 1500, and the only thing standing between that and a confusing partial outage is someone knowing the setting was supposed to be there and where to check it. Make the MTU design part of the build record and part of the change checklist for the underlay, and the most common overlay outage becomes one you catch in review rather than in production.
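
One way to make that record checkable is to keep the intended MTU per interface as plain data and compare it with what is observed after a change. This is a minimal sketch with hypothetical interface names and values; how the observed values are collected is left to your own tooling.

```python
def mtu_drift(recorded: dict, observed: dict) -> dict:
    """Return {interface: (recorded, observed)} where the observed MTU differs or is missing."""
    return {
        name: (want, observed.get(name))
        for name, want in recorded.items()
        if observed.get(name) != want
    }


# Hypothetical build record and a post-change observation.
recorded = {"host-a vmk10": 1600, "leaf-1 to spine-1": 1600, "tep-router ge-0/0/1": 1600}
observed = {"host-a vmk10": 1600, "leaf-1 to spine-1": 1500}

print(mtu_drift(recorded, observed))
# {'leaf-1 to spine-1': (1600, 1500), 'tep-router ge-0/0/1': (1600, None)}
```

An empty result means the underlay still matches the documented design; anything else is a candidate for the change review before it becomes an outage.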


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
