Dr. Pranay Jha

NSX 9 Tier-0 Gateways and North-South Routing: BGP, ECMP and VRF (NSX Series, Part 9)

The Tier-0 is where NSX peers with your physical fabric. This part covers BGP design, ECMP and the URPF trap, route redistribution, and VRF-Lite for multi-tenant routing.

NSX Series · Part 9 of 30

TL;DR · Key Takeaways

  • The Tier-0 gateway is where NSX peers with your physical fabric. It lives on the Edge cluster and reaches the outside world through VLAN-backed uplink interfaces.
  • Use BGP over static routes for anything real. Peer to two top-of-rack routers, use BFD for fast failure detection, and control what you advertise with route redistribution, prefix lists, and route maps.
  • ECMP spreads north-south traffic across Edge nodes on a 5-tuple hash. The trap: set URPF to None on ECMP uplinks, or strict reverse-path checking silently drops asymmetric return traffic.
  • Your overlay networks do not reach the fabric until you redistribute them into BGP. Connected and NAT routes are the usual candidates.
  • VRF-Lite gives each tenant its own routing table on a shared Tier-0 and Edge, inheriting the parent’s HA mode and Edge cluster while keeping routes separate.

Who this is for: network architects and admins wiring NSX 9 into the physical fabric.
Prerequisites: an Edge cluster (Part 7), segments and gateways (Part 8), and a network team you can agree BGP details with.

This is the handshake. Everything inside NSX, the segments, the Tier-1s, the distributed firewall, is yours to design freely, but the moment traffic needs to leave for the physical world it has to agree with routers you may not own, speaking a protocol that does not forgive sloppiness. The Tier-0 is where that conversation happens. Get the BGP design right and a failed Edge or a failed uplink is a non-event nobody notices. Get it wrong and you get the worst kind of outage, the intermittent one: traffic that works until a path changes, then black-holes for reasons that take a packet capture to find. I have spent long nights on exactly that, and almost always the cause was a routing decision made casually months earlier.

What the Tier-0 connects to

The Tier-0 gateway runs on your Edge cluster and faces north through uplink interfaces. Those uplinks are VLAN-backed: each one sits on a transit VLAN that the Edge shares with a physical router, usually your top-of-rack pair. You peer BGP across those uplinks, the Tier-0 learns the fabric’s routes and the fabric learns yours, and traffic flows. For resilience you do this at least twice, one uplink to each of two top-of-rack routers, so the loss of a single router or a single uplink leaves a working path. The Tier-1 gateways from Part 8 connect southward to the Tier-0 over an internal transit, and the Tier-0 is the single point where all of that internal routing meets the external world.

[Figure: Tier-0 to the fabric: two peers, two paths. Tier-1 gateways connect southbound over an internal transit to the Tier-0 on the Edge cluster, which has two VLAN uplink interfaces, one to ToR router A and one to ToR router B, each a BGP peer. ECMP load-shares across both uplinks. BFD detects a dead peer in well under a second. Lose one router or one uplink and the other path carries on.]
Two uplinks to two routers, BGP plus BFD. This is the baseline north-south design, not a luxury.

BGP, done the way that survives failures

Use BGP. Static routes are fine for a lab and a trap in production, because they do not react to a failed path. On the Tier-0 you set a local AS number, define a neighbor for each top-of-rack router with its peer AS, and enable BFD so a dead peer is detected in milliseconds rather than waiting out BGP hold timers. Then you decide, deliberately, what you advertise and what you accept, using route redistribution to inject your networks and prefix lists or route maps to keep the advertisement tidy. The single most common BGP mistake I see is advertising everything by accident, or accepting a full table you never wanted, because nobody put a filter in place. Decide your prefixes on purpose.

BGP element | What to set | Why it matters
Local AS / peer AS | Agreed with the network team up front. | Mismatched AS, no session. Settle it before config.
BFD | Enabled on each neighbor. | Sub-second failure detection vs slow BGP timers.
Redistribution | Connected and NAT routes, deliberately. | Nothing reaches the fabric until you redistribute it.
Prefix lists / route maps | Filter what you advertise and accept. | Stops accidental over-advertisement and bad imports.
ECMP + URPF | ECMP on, URPF set to None on those uplinks. | Strict URPF drops asymmetric ECMP return traffic.
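
To make "decide your prefixes on purpose" concrete, here is a minimal Python sketch of how an outbound prefix list is conventionally evaluated: first match wins, and anything unmatched is denied. It is a model for reasoning, not NSX's implementation. The Rule type, the OUT_FILTER entries, and the candidate prefixes are all hypothetical, and the implicit deny at the end is an assumption you should confirm against your NSX 9 build.

```python
from dataclasses import dataclass
from ipaddress import IPv4Network, ip_network
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    action: str                  # "permit" or "deny"
    network: IPv4Network
    le: Optional[int] = None     # also match longer prefixes up to this length


def matches(rule: Rule, prefix: IPv4Network) -> bool:
    """Exact match by default; with 'le', match covered prefixes up to that length."""
    if not prefix.subnet_of(rule.network):
        return False
    max_len = rule.le if rule.le is not None else rule.network.prefixlen
    return rule.network.prefixlen <= prefix.prefixlen <= max_len


def permitted(rules: List[Rule], prefix: IPv4Network) -> bool:
    """First matching rule decides; no match means deny (assumed implicit deny)."""
    for rule in rules:
        if matches(rule, prefix):
            return rule.action == "permit"
    return False


def advertised(rules: List[Rule], candidates: List[IPv4Network]) -> List[str]:
    return [str(p) for p in candidates if permitted(rules, p)]


# Hypothetical outbound filter: hide one management subnet, allow the rest of
# the tenant block down to /24, and let everything else fall to implicit deny.
OUT_FILTER = [
    Rule("deny", ip_network("10.20.99.0/24")),
    Rule("permit", ip_network("10.20.0.0/16"), le=24),
]

CANDIDATES = [
    ip_network("10.20.1.0/24"),     # tenant segment
    ip_network("10.20.99.0/24"),    # management, explicitly denied
    ip_network("10.20.2.128/25"),   # longer than /24, not matched
    ip_network("192.168.50.0/24"),  # outside the block, not matched
]

print(advertised(OUT_FILTER, CANDIDATES))
```

Only the first candidate survives. The other three show the three usual ways a prefix gets left out: an explicit deny, a length mismatch, and no matching rule at all.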

You can read the routing state from the Edge CLI, which is where I go first when north-south misbehaves. Confirm the BGP session is established and that you are learning and advertising the prefixes you expect.

# On an Edge node, enter the Tier-0 service router (SR) context first
get logical-routers         # note the VRF ID listed for the Tier-0 SR
vrf <vrf-id>                # <vrf-id> is that ID

get bgp neighbor            # sessions should read Established
get bgp neighbor summary    # prefixes received / advertised per peer
get route bgp               # the BGP routes actually in the table
get forwarding              # what the datapath will really use

# If a session is stuck, check AS numbers, the transit VLAN,
# uplink MTU, and that BFD is up on both ends.

In practice: the ECMP black-hole is the one that wastes a whole evening. You turn on active-active ECMP, traffic load-shares across two Edges, and return packets arrive on a different uplink from the one they left on. With URPF in its default strict mode the Edge drops them as spoofed. Set URPF to None on the ECMP uplinks and the black-hole disappears.
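
The drop described above can be reproduced in a few lines of Python. This is a minimal sketch of a generic strict reverse-path check, not NSX's datapath. The FIB contents, the interface names, and the urpf_accepts helper are hypothetical, and it assumes the lookup yields a single best interface per source prefix.

```python
from ipaddress import ip_address, ip_network
from typing import Dict, Optional

# Hypothetical forwarding table: prefix -> the one interface used to reach it.
FIB: Dict = {
    ip_network("0.0.0.0/0"): "uplink-a",     # best path to the outside world
    ip_network("10.20.0.0/16"): "downlink",  # overlay segments
}


def best_interface(fib: Dict, address: str) -> Optional[str]:
    """Longest-prefix match: the interface this router would use to reach address."""
    addr = ip_address(address)
    covering = [net for net in fib if addr in net]
    if not covering:
        return None
    return fib[max(covering, key=lambda net: net.prefixlen)]


def urpf_accepts(fib: Dict, mode: str, src: str, in_interface: str) -> bool:
    """Strict mode: accept only if the packet arrived on the interface that the
    router itself would use to send traffic back to the source."""
    if mode == "none":
        return True
    if mode == "strict":
        return best_interface(fib, src) == in_interface
    raise ValueError(f"unknown URPF mode: {mode}")


client = "203.0.113.10"  # documentation address standing in for an outside host

# Return traffic that comes back on the uplink the router prefers is fine.
print(urpf_accepts(FIB, "strict", client, "uplink-a"))
# The fabric chose the other uplink for the return path: strict mode drops it.
print(urpf_accepts(FIB, "strict", client, "uplink-b"))
# With the check disabled, the asymmetric return is accepted.
print(urpf_accepts(FIB, "none", client, "uplink-b"))
```

Nothing about the packet is wrong in the second case. The only thing that differs is the interface it arrived on, which is why the symptom is intermittent and looks like a path problem rather than a configuration error.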

Route redistribution: making the overlay reachable

Here is a thing that surprises people new to NSX: you can build a perfect overlay, attach segments to Tier-1s, connect the Tier-1s to the Tier-0, peer BGP cleanly, and still have nothing reachable from outside. The reason is that the Tier-0 does not advertise your internal networks until you tell it to, through route redistribution. You choose which categories of route get injected into BGP, typically the connected segment subnets and any NAT addresses you want reachable, and only those leave for the fabric. This is a feature, not a chore: it means your internal topology is private by default and you expose exactly what you intend. Forget it, though, and you will stare at a healthy BGP session that advertises nothing.

[Figure: Nothing leaves until you redistribute it. Connected (segment) routes and NAT addresses are redistributed into BGP, filtered by a prefix list, and advertised by the Tier-0 over BGP to the fabric. Private by default, exposed on purpose. Pick the route types and filter the prefixes you advertise.]
Redistribution is the on switch for reachability. A clean BGP session that advertises nothing usually means you skipped it.
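
A small Python sketch of that behavior, under stated assumptions: the Route type, the source labels ("connected", "nat", "static"), and the sample prefixes are hypothetical stand-ins, not NSX object names. The point it illustrates is that the set of enabled route types is the gate, and an empty set advertises nothing however healthy the session is.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Route:
    prefix: str
    source: str  # hypothetical labels: "connected", "nat", "static"


# Hypothetical Tier-0 routing table contents.
ROUTES = [
    Route("10.20.1.0/24", "connected"),
    Route("10.20.2.0/24", "connected"),
    Route("198.51.100.10/32", "nat"),
    Route("10.99.0.0/24", "static"),
]


def redistribute(routes: List[Route], enabled_sources: FrozenSet[str]) -> List[str]:
    """Only routes whose source type is enabled are handed to BGP."""
    return [r.prefix for r in routes if r.source in enabled_sources]


# Redistribution never configured: the session is up and advertises nothing.
print(redistribute(ROUTES, frozenset()))

# Connected and NAT enabled: the static route stays private.
print(redistribute(ROUTES, frozenset({"connected", "nat"})))
```

In a real design the result would still pass through a prefix list or route map before it reaches the peer.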

VRF-Lite: many tenants, one Tier-0

When several tenants or environments need their own routing table and their own BGP relationship with the fabric, but you do not want to build a separate Edge stack for each, VRF-Lite is the answer. A Tier-0 VRF gateway links to a parent Tier-0 and gets its own isolated routing table, its own BGP peering, and its own northbound uplinks on a subset of VLANs carried over trunk interfaces. It inherits the heavy, shared things from the parent, the HA mode, the Edge cluster, the transit subnets, so you are not duplicating infrastructure, just separating routes. Each VRF can keep the parent’s BGP local AS or override it per peer. The result is tenant routing isolation without a tenant-by-tenant Edge sprawl.

[Figure: VRF-Lite: separate routes, shared Edge. A parent Tier-0 on the shared Edge cluster, from which VRFs inherit HA mode, Edge cluster, and transit subnets. Below it, three VRFs (tenant A, tenant B, tenant C), each with its own table, its own BGP, and its own uplink VLAN (a, b, c).]
Each VRF is a separate routing world on the same Edge and parent Tier-0. Isolation without duplication.

Inherited from the parent Tier-0 | Set per VRF
HA mode (active-standby or active-active) | Its own routing table
Edge cluster placement | BGP neighbors and (optionally) local AS
Internal transit subnets | Northbound uplink VLANs (trunked subset)
Base BGP configuration | Redistribution, prefix lists, route maps

Disclaimer: changing Tier-0 routing affects live north-south traffic. Agree AS numbers, transit VLANs, and prefix filters with the network team in advance, stage changes in a maintenance window, and confirm failover by actually failing a path in test. Validate against the current VCF 9 BOM and NSX 9 maximums first.
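
The split between what a VRF inherits and what it owns can be sketched as a small Python data model. This is an illustration of the idea only: the class names, fields, AS numbers, VLAN IDs, and next hops are hypothetical and do not mirror the NSX API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class ParentTier0:
    name: str
    ha_mode: str        # shared by every VRF linked to this parent
    edge_cluster: str   # shared by every VRF linked to this parent
    local_as: int


@dataclass
class VrfGateway:
    name: str
    parent: ParentTier0
    uplink_vlan: int                              # per-VRF northbound VLAN
    local_as_override: Optional[int] = None       # optional per-VRF local AS
    routes: Dict[str, str] = field(default_factory=dict)  # prefix -> next hop

    @property
    def ha_mode(self) -> str:
        return self.parent.ha_mode                # inherited, not configurable here

    @property
    def edge_cluster(self) -> str:
        return self.parent.edge_cluster           # inherited, not configurable here

    @property
    def local_as(self) -> int:
        if self.local_as_override is not None:
            return self.local_as_override
        return self.parent.local_as


parent = ParentTier0("t0-shared", "active-active", "edge-cluster-01", 65001)
tenant_a = VrfGateway("vrf-tenant-a", parent, uplink_vlan=101)
tenant_b = VrfGateway("vrf-tenant-b", parent, uplink_vlan=102, local_as_override=65102)

# The same prefix can exist in both tenants without conflict,
# because each VRF consults only its own table.
tenant_a.routes["10.0.0.0/24"] = "172.16.101.1"
tenant_b.routes["10.0.0.0/24"] = "172.16.102.1"

print(tenant_a.ha_mode, tenant_a.local_as, tenant_a.routes["10.0.0.0/24"])
print(tenant_b.ha_mode, tenant_b.local_as, tenant_b.routes["10.0.0.0/24"])
```

Modeling the inherited settings as read-only properties mirrors the design constraint: to change HA mode or Edge placement for one tenant, you change it for all of them, or you build a separate Tier-0.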

Routing failures and how to read them

North-south problems are usually one of a handful of failures, and the symptoms are specific enough to diagnose fast if you know the pattern. The hardest are the partial ones, where some traffic works and some does not, because they point at a path or a filter rather than a clean down state. This is the table I keep next to the Edge CLI commands above.

Symptom | Likely cause | Where to look
BGP session never comes up | AS mismatch, wrong peer IP, or transit VLAN/MTU | get bgp neighbor; check the uplink VLAN and MTU.
Session up, nothing reachable inbound | Redistribution missing or prefix list too tight | Redistribution config and advertised prefix count.
Intermittent drops under ECMP | URPF in strict mode dropping asymmetric returns | Set URPF to None on the ECMP uplink interfaces.
Failover is slow, seconds of loss | BFD not enabled; waiting on BGP hold timers | Enable BFD on both ends of each peering.
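
The same table can be written as a triage function, which is a handy way to check that the order of the checks makes sense. This is a hedged Python sketch: the Observation fields are hypothetical values you would fill in by hand from the Edge CLI and the NSX configuration, and nothing here queries NSX.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Observation:
    session_established: bool
    prefixes_advertised: int
    ecmp_enabled: bool
    urpf_mode: str      # "strict" or "none"
    bfd_enabled: bool


def triage(obs: Observation) -> List[str]:
    """Return the table rows that apply, most fundamental first."""
    # Without a session nothing else is meaningful, so stop at the first row.
    if not obs.session_established:
        return ["Session down: check AS numbers, peer IP, transit VLAN and MTU."]

    findings = []
    if obs.prefixes_advertised == 0:
        findings.append("Nothing advertised: check redistribution and the outbound prefix list.")
    if obs.ecmp_enabled and obs.urpf_mode == "strict":
        findings.append("ECMP with strict URPF: set URPF to None on the ECMP uplinks.")
    if not obs.bfd_enabled:
        findings.append("No BFD: failover waits on BGP hold timers; enable BFD on both ends.")
    return findings


# A session that is up but was built with defaults and no redistribution.
for line in triage(Observation(True, 0, True, "strict", False)):
    print(line)
```

The early return encodes the one ordering rule in the table: a session that never comes up makes every other symptom unreadable, so it is ruled out first.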

The discipline that prevents most of these is boring and effective: write the routing design down before you build it. The AS numbers, the transit VLANs and their MTUs, the exact prefixes you will advertise and accept, the HA mode, and whether ECMP is in play with URPF adjusted. Hand that one page to the network team and to whoever operates the Edge, and the late-night routing mysteries mostly stop happening. The Tier-0 is not where you want to improvise, because a routing mistake here does not break one segment, it breaks the door that every segment uses to reach the world.

My take: the Tier-0 is the one place in NSX where I insist on a joint design review with the physical network team. NSX gives you a powerful router, but BGP is a two-party agreement, and half the outages I see come from the two sides assuming different things about AS, prefixes, or timers.

What I’d Do

Peer BGP to two top-of-rack routers, enable BFD, and never rely on static routes for production north-south. Decide your advertised and accepted prefixes deliberately with redistribution and filters, and remember that nothing leaves the overlay until you redistribute it. If you run active-active ECMP, set URPF to None on those uplinks before you go live, or you will meet the asymmetric black-hole the hard way. Use VRF-Lite when you need per-tenant routing isolation without standing up an Edge per tenant. And test failover for real, by pulling a path in a maintenance window, because a failover you have never exercised is a hope, not a design. Next up is Part 10: Tier-1 gateways and east-west routing, the distributed half of the routing story. Is your URPF set correctly on every ECMP uplink right now, or is that an assumption?


HA mode and ECMP decide your north-south resilience

The Tier-0 high-availability mode is one of the most consequential choices in the whole design. Active-active with ECMP gives you the highest aggregate north-south throughput by spreading flows across multiple Edges, but it is stateless, which means you cannot run stateful services on that Tier-0 without inviting trouble. Active-standby gives you stateful services and a simpler failure model at the cost of lower aggregate throughput, because only one Edge forwards at a time. Underneath either choice sits your BGP design with the physical fabric, and that wants BFD for fast failure detection and a sane graceful-restart posture so a control-plane blip does not become a data-plane outage.

The trap that catches people is mixing active-active ECMP with stateful services and then discovering asymmetric routing, where the return path of a flow lands on a different Edge than the forward path and the stateful service drops it. The clean answer is to keep stateful services off an active-active Tier-0, pushing them to a Tier-1 or choosing active-standby where they belong, and to size the Edges for the real north-south peak with headroom. Decide the HA mode against your actual service and throughput requirements, not by copying whatever the last deployment used.
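
The asymmetry is easy to see in a toy Python model. The sketch below assumes a stateful service that keeps a per-Edge session table and accepts return traffic only for flows it has seen leave. The StatefulEdge class, the pick_edge hash, and the addresses are all hypothetical, and the hash is a stand-in rather than the algorithm NSX or any switch uses.

```python
import hashlib
from typing import List, Set, Tuple

FiveTuple = Tuple[str, str, int, int, str]  # src, dst, sport, dport, proto


def pick_edge(flow: FiveTuple, edges: List[str]) -> str:
    """Deterministic stand-in for an ECMP hash: same flow, same Edge."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return edges[int.from_bytes(digest[:4], "big") % len(edges)]


def reverse(flow: FiveTuple) -> FiveTuple:
    src, dst, sport, dport, proto = flow
    return (dst, src, dport, sport, proto)


class StatefulEdge:
    """An Edge running a stateful service with its own local session table."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.sessions: Set[FiveTuple] = set()

    def forward_outbound(self, flow: FiveTuple) -> None:
        self.sessions.add(flow)

    def accept_return(self, return_flow: FiveTuple) -> bool:
        # Return traffic is valid only if this Edge saw the outbound half.
        return reverse(return_flow) in self.sessions


edge_1, edge_2 = StatefulEdge("edge-1"), StatefulEdge("edge-2")
flow: FiveTuple = ("10.20.1.15", "203.0.113.10", 51514, 443, "tcp")

edge_1.forward_outbound(flow)                # NSX sends the request via edge-1
print(edge_1.accept_return(reverse(flow)))   # reply returns via edge-1
print(edge_2.accept_return(reverse(flow)))   # fabric hashed the reply to edge-2
```

The outbound Edge is picked on the NSX side and the return Edge is picked by the physical fabric's own hash, so the two choices are independent. pick_edge is included only to show that each side is deterministic on its own, which does not make the two directions agree.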

The BGP choices that decide whether the fabric is stable

North-south stability is mostly a BGP design question, and a few choices carry most of the weight. You are typically running eBGP from the Tier-0 to the top-of-rack or spine, so the autonomous-system design, the route filtering and the aggregation you apply at that boundary determine how much churn the fabric and the Tier-0 inflict on each other. Advertise only the prefixes you actually mean to expose, aggregate where you can, and filter inbound so a mistake upstream does not flood your Tier-0 with routes it should never carry. A Tier-0 that accepts and re-advertises everything is a Tier-0 that turns a small upstream error into your problem.
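
"Aggregate where you can" has a precise meaning that Python's standard ipaddress module can demonstrate. The sketch below uses hypothetical segment subnets. collapse_addresses merges only prefixes that are exactly adjacent and aligned, so the result never covers address space that was not in the input.

```python
from ipaddress import collapse_addresses, ip_network
from typing import List

# Hypothetical overlay segments behind the Tier-0.
SEGMENTS = [
    "10.20.0.0/24",
    "10.20.1.0/24",
    "10.20.2.0/24",
    "10.20.3.0/24",
    "10.20.8.0/24",   # not contiguous with the others
]


def aggregate(prefixes: List[str]) -> List[str]:
    """Merge contiguous, aligned prefixes into the fewest covering networks."""
    return [str(net) for net in collapse_addresses(ip_network(p) for p in prefixes)]


print(aggregate(SEGMENTS))  # ['10.20.0.0/22', '10.20.8.0/24']
```

Five advertisements become two. A looser summary such as 10.20.0.0/20 would be shorter still, but it would also attract traffic for subnets that do not exist behind the Tier-0, which is a design decision rather than a free optimization.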

Underneath the prefixes sits failure detection, and this is where BFD earns its place. BGP timers alone are slow to notice a dead path, so pairing BGP with BFD gives you sub-second detection and failover that actually matches the resilience the rest of the design assumes. Add a sensible graceful-restart posture so a control-plane event does not needlessly tear down forwarding, and you have a north-south edge that fails over quickly and predictably. Get these few BGP and BFD decisions right up front and routing becomes the part of the platform you stop thinking about, which is exactly what you want from routing.
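
The gap between the two detection mechanisms is simple arithmetic, worked here in Python. The timer values are illustrative assumptions chosen for the example, not NSX defaults. Use the intervals you have actually agreed with the network team.

```python
def bfd_detection_ms(interval_ms: int, multiplier: int) -> int:
    """A peer is declared down after 'multiplier' consecutive missed packets."""
    return interval_ms * multiplier


# Illustrative values only (assumptions, not NSX defaults).
BFD_INTERVAL_MS = 300
BFD_MULTIPLIER = 3
BGP_HOLD_TIME_S = 180

bfd_ms = bfd_detection_ms(BFD_INTERVAL_MS, BFD_MULTIPLIER)
hold_ms = BGP_HOLD_TIME_S * 1000

print(f"BFD declares the peer down after {bfd_ms} ms")
print(f"BGP alone waits up to {hold_ms} ms")
print(f"BFD is {hold_ms / bfd_ms:.0f}x faster with these values")
```

With these numbers BFD detects the failure in 900 ms, while a session relying on the hold timer could keep forwarding into a dead path for up to three minutes.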

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
