TL;DR · Key Takeaways
- The Tier-0 gateway is where NSX peers with your physical fabric. It lives on the Edge cluster and reaches the outside world through VLAN-backed uplink interfaces.
- Use BGP over static routes for anything real. Peer to two top-of-rack routers, use BFD for fast failure detection, and control what you advertise with route redistribution, prefix lists, and route maps.
- ECMP spreads north-south traffic across Edge nodes on a 5-tuple hash. The trap is URPF: set it to None on ECMP uplinks, or strict reverse-path checking silently drops asymmetric return traffic.
- Your overlay networks do not reach the fabric until you redistribute them into BGP. Connected and NAT routes are the usual candidates.
- VRF-Lite gives each tenant its own routing table on a shared Tier-0 and Edge, inheriting the parent’s HA mode and Edge cluster while keeping routes separate.
This is the handshake. Everything inside NSX, the segments, the Tier-1s, the distributed firewall, is yours to design freely, but the moment traffic needs to leave for the physical world it has to agree with routers you may not own, speaking a protocol that does not forgive sloppiness. The Tier-0 is where that conversation happens. Get the BGP design right and a failed Edge or a failed uplink is a non-event nobody notices. Get it wrong and you get the worst kind of outage, the intermittent one: traffic that works until a path changes, then black-holes for reasons that take a packet capture to find. I have spent long nights on exactly that, and almost always the cause was a routing decision made casually months earlier.
What the Tier-0 connects to
The Tier-0 gateway runs on your Edge cluster and faces north through uplink interfaces. Those uplinks are VLAN-backed: each one sits on a transit VLAN that the Edge shares with a physical router, usually your top-of-rack pair. You peer BGP across those uplinks, the Tier-0 learns the fabric’s routes and the fabric learns yours, and traffic flows. For resilience you do this at least twice, one uplink to each of two top-of-rack routers, so the loss of a single router or a single uplink leaves a working path. The Tier-1 gateways from Part 8 connect southward to the Tier-0 over an internal transit, and the Tier-0 is the single point where all of that internal routing meets the external world.
BGP, done the way that survives failures
Use BGP. Static routes are fine for a lab and a trap in production, because they do not react to a failed path. On the Tier-0 you set a local AS number, define a neighbor for each top-of-rack router with its peer AS, and enable BFD so a dead peer is detected in milliseconds rather than waiting out BGP hold timers. Then you decide, deliberately, what you advertise and what you accept, using route redistribution to inject your networks and prefix lists or route maps to keep the advertisement tidy. The single most common BGP mistake I see is advertising everything by accident, or accepting a full table you never wanted, because nobody put a filter in place. Decide your prefixes on purpose.
| BGP element | What to set | Why it matters |
|---|---|---|
| Local AS / peer AS | Agreed with the network team up front. | Mismatched AS, no session. Settle it before config. |
| BFD | Enabled on each neighbor. | Sub-second failure detection vs slow BGP timers. |
| Redistribution | Connected and NAT routes, deliberately. | Nothing reaches the fabric until you redistribute it. |
| Prefix lists / route maps | Filter what you advertise and accept. | Stops accidental over-advertisement and bad imports. |
| ECMP + URPF | ECMP on, URPF set to None on those uplinks. | Strict URPF drops asymmetric ECMP return traffic. |
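The table above lends itself to a pre-flight check. What follows is a minimal sketch in Python, assuming a planning structure of my own invention: `BgpNeighbor`, `Tier0Plan`, and `check_plan` are hypothetical names, not NSX API objects, and the addresses and AS numbers are documentation examples.

```python
from dataclasses import dataclass, field


@dataclass
class BgpNeighbor:
    # Hypothetical planning record, not an NSX API object.
    address: str
    peer_as: int
    bfd_enabled: bool = False
    has_out_filter: bool = False  # prefix list or route map on advertisements
    has_in_filter: bool = False   # prefix list or route map on accepted routes


@dataclass
class Tier0Plan:
    local_as: int
    ecmp: bool
    urpf_mode: str  # "NONE" or "STRICT"
    redistribute: set = field(default_factory=set)
    neighbors: list = field(default_factory=list)


def check_plan(plan):
    """Return human-readable findings; an empty list means nothing was flagged."""
    findings = []
    if len(plan.neighbors) < 2:
        findings.append("fewer than two BGP neighbors: no redundant path")
    for n in plan.neighbors:
        if not n.bfd_enabled:
            findings.append(f"{n.address}: BFD disabled, failover waits on BGP timers")
        if not (n.has_out_filter and n.has_in_filter):
            findings.append(f"{n.address}: missing inbound or outbound filter")
    if not plan.redistribute:
        findings.append("nothing redistributed: the fabric learns no overlay routes")
    if plan.ecmp and plan.urpf_mode != "NONE":
        findings.append("ECMP with strict URPF: asymmetric return traffic may be dropped")
    return findings


plan = Tier0Plan(
    local_as=65001,
    ecmp=True,
    urpf_mode="STRICT",
    redistribute={"connected", "nat"},
    neighbors=[
        BgpNeighbor("192.0.2.1", 65000, bfd_enabled=True,
                    has_out_filter=True, has_in_filter=True),
        BgpNeighbor("192.0.2.5", 65000, bfd_enabled=False,
                    has_out_filter=True, has_in_filter=True),
    ],
)
for finding in check_plan(plan):
    print(finding)
```

The sample plan is deliberately wrong in two places, one neighbor without BFD and strict URPF alongside ECMP, so the check has something to report.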
You can read the routing state from the Edge CLI, which is where I go first when north-south misbehaves. Confirm the BGP session is established and that you are learning and advertising the prefixes you expect.
```
# On an Edge node, in the Tier-0 service router context
get bgp neighbor            # sessions should read Established
get bgp neighbor summary    # prefixes received / advertised per peer
get route bgp               # the BGP routes actually in the table
get route forwarding        # what the datapath will really use
# If a session is stuck, check AS numbers, the transit VLAN,
# uplink MTU, and that BFD is up on both ends.
```
Route redistribution: making the overlay reachable
Here is a thing that surprises people new to NSX: you can build a perfect overlay, attach segments to Tier-1s, connect the Tier-1s to the Tier-0, peer BGP cleanly, and still have nothing reachable from outside. The reason is that the Tier-0 does not advertise your internal networks until you tell it to, through route redistribution. You choose which categories of route get injected into BGP, typically the connected segment subnets and any NAT addresses you want reachable, and only those leave for the fabric. This is a feature, not a chore: it means your internal topology is private by default and you expose exactly what you intend. Forget it, though, and you will stare at a healthy BGP session that advertises nothing.
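To make the "healthy session, nothing advertised" case concrete, here is a small Python model of the decision. It is a simplification under my own assumptions: `Route`, `advertised`, and the source labels are illustrative names rather than NSX objects, and the `permit` list is a crude stand-in for a real prefix list.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    source: str  # illustrative labels: "connected", "nat", "static"


def advertised(routes, redistribute, permit):
    """Prefixes that would leave for the fabric in this simplified model.

    A route is advertised only if its source type is redistributed AND it
    falls inside one of the permitted ranges (a stand-in for a prefix list).
    """
    allowed = [ipaddress.ip_network(p) for p in permit]
    out = []
    for r in routes:
        net = ipaddress.ip_network(r.prefix)
        if r.source in redistribute and any(net.subnet_of(a) for a in allowed):
            out.append(r.prefix)
    return out


routes = [
    Route("10.10.1.0/24", "connected"),
    Route("10.10.2.0/24", "connected"),
    Route("203.0.113.10/32", "nat"),
    Route("172.16.0.0/24", "static"),
]
permit = ["10.10.0.0/16", "203.0.113.0/24"]

# Redistribution forgotten: the session can be up and still advertise nothing.
print(advertised(routes, set(), permit))

# Connected and NAT redistributed: only those categories leave.
print(advertised(routes, {"connected", "nat"}, permit))
```

The first call returns an empty list no matter how permissive the filter is, which is the trap in miniature: the filter only narrows what redistribution has already let through.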
VRF-Lite: many tenants, one Tier-0
When several tenants or environments need their own routing table and their own BGP relationship with the fabric, but you do not want to build a separate Edge stack for each, VRF-Lite is the answer. A Tier-0 VRF gateway links to a parent Tier-0 and gets its own isolated routing table, its own BGP peering, and its own northbound uplinks on a subset of VLANs carried over trunk interfaces. It inherits the heavy, shared things from the parent, the HA mode, the Edge cluster, the transit subnets, so you are not duplicating infrastructure, just separating routes. Each VRF can keep the parent’s BGP local AS or override it per peer. The result is tenant routing isolation without a tenant-by-tenant Edge sprawl.
| Inherited from the parent Tier-0 | Set per VRF |
|---|---|
| HA mode (active-standby or active-active) | Its own routing table |
| Edge cluster placement | BGP neighbors and (optionally) local AS |
| Internal transit subnets | Northbound uplink VLANs (trunked subset) |
| Base BGP configuration | Redistribution, prefix lists, route maps |
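The split in the table can be sketched as a tiny inheritance model. This is Python for illustration only: `ParentTier0`, `VrfGateway`, and `effective` are hypothetical names I made up, and the VLAN IDs and AS numbers are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParentTier0:
    name: str
    ha_mode: str       # "ACTIVE_ACTIVE" or "ACTIVE_STANDBY"
    edge_cluster: str
    local_as: int


@dataclass(frozen=True)
class VrfGateway:
    name: str
    parent: ParentTier0
    uplink_vlans: tuple                      # set per VRF
    local_as_override: Optional[int] = None  # optional, per the table

    def effective(self):
        """Resolve what this VRF actually runs with."""
        local_as = self.parent.local_as
        if self.local_as_override is not None:
            local_as = self.local_as_override
        return {
            "ha_mode": self.parent.ha_mode,            # inherited
            "edge_cluster": self.parent.edge_cluster,  # inherited
            "local_as": local_as,                      # inherited unless overridden
            "uplink_vlans": self.uplink_vlans,         # per VRF
        }


parent = ParentTier0("t0-prod", "ACTIVE_ACTIVE", "edge-cluster-01", 65001)
tenant_a = VrfGateway("vrf-tenant-a", parent, uplink_vlans=(101, 102))
tenant_b = VrfGateway("vrf-tenant-b", parent, uplink_vlans=(201, 202),
                      local_as_override=65101)

print(tenant_a.effective())
print(tenant_b.effective())
```

Notice there is no way in this model to give a VRF its own HA mode or Edge cluster. That is the point: those belong to the parent, so a design that needs them to differ needs a separate Tier-0.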
Routing failures and how to read them
North-south problems are usually one of a handful of failures, and the symptoms are specific enough to diagnose fast if you know the pattern. The hardest are the partial ones, where some traffic works and some does not, because they point at a path or a filter rather than a clean down state. This is the table I keep next to the Edge CLI commands above.
| Symptom | Likely cause | Where to look, and the fix |
|---|---|---|
| BGP session never comes up | AS mismatch, wrong peer IP, or transit VLAN/MTU mismatch | get bgp neighbor; check the uplink VLAN and MTU. |
| Session up, nothing reachable inbound | Redistribution missing or prefix list too tight | Redistribution config and advertised prefix count. |
| Intermittent drops under ECMP | URPF in strict mode dropping asymmetric returns | URPF mode on the ECMP uplink interfaces; set it to None. |
| Failover is slow, seconds of loss | BFD not enabled; waiting on BGP hold timers | BFD state on each peering; enable it on both ends. |
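The URPF row is the one people find hardest to picture, so here is a deliberately simplified Python model of the check. It assumes one best interface per source prefix; the function names, the `fib` dictionary, and the interface labels are mine, not anything the Edge exposes.

```python
import ipaddress


def reverse_path_interface(fib, src_ip):
    """Longest-prefix match: which interface would this router use to reach src_ip?"""
    addr = ipaddress.ip_address(src_ip)
    matches = []
    for prefix, ifname in fib.items():
        net = ipaddress.ip_network(prefix)
        if addr in net:
            matches.append((net.prefixlen, ifname))
    if not matches:
        return None
    return max(matches)[1]


def urpf_accepts(mode, fib, src_ip, arrival_if):
    """Strict mode accepts a packet only if it arrived on the reverse-path interface."""
    if mode == "NONE":
        return True
    if mode == "STRICT":
        return reverse_path_interface(fib, src_ip) == arrival_if
    raise ValueError(f"unknown URPF mode: {mode}")


# Simplified forwarding table: the route back to the client points at uplink-1.
fib = {"0.0.0.0/0": "uplink-1", "198.51.100.0/24": "uplink-1"}

# The fabric's ECMP choice delivered the packet on uplink-2 instead.
print(urpf_accepts("STRICT", fib, "198.51.100.7", "uplink-2"))
print(urpf_accepts("NONE", fib, "198.51.100.7", "uplink-2"))
```

Nothing about the packet is wrong in the strict case. It simply arrived on an interface the router would not have chosen for the reply, which is why the drop looks intermittent: only the flows the fabric happens to hash onto the other uplink are affected.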
The discipline that prevents most of these is boring and effective: write the routing design down before you build it. That means the AS numbers, the transit VLANs and their MTUs, the exact prefixes you will advertise and accept, the HA mode, and whether ECMP is in play with URPF adjusted to match. Hand that one page to the network team and to whoever operates the Edge, and the late-night routing mysteries mostly stop happening. The Tier-0 is not where you want to improvise, because a routing mistake here does not break one segment, it breaks the door that every segment uses to reach the world.
What I’d Do
Peer BGP to two top-of-rack routers, enable BFD, and never rely on static routes for production north-south. Decide your advertised and accepted prefixes deliberately with redistribution and filters, and remember that nothing leaves the overlay until you redistribute it. If you run active-active ECMP, set URPF to None on those uplinks before you go live, or you will meet the asymmetric black-hole the hard way. Use VRF-Lite when you need per-tenant routing isolation without standing up an Edge per tenant. And test failover for real, by pulling a path in a maintenance window, because a failover you have never exercised is a hope, not a design. Next up is Part 10: Tier-1 gateways and east-west routing, the distributed half of the routing story. Is your URPF set correctly on every ECMP uplink right now, or is that an assumption?
HA mode and ECMP decide your north-south resilience
The Tier-0 high-availability mode is one of the most consequential choices in the whole design. Active-active with ECMP gives you the highest aggregate north-south throughput by spreading flows across multiple Edges, but it is stateless, which means you cannot run stateful services on that Tier-0 without inviting trouble. Active-standby gives you stateful services and a simpler failure model at the cost of lower aggregate throughput, because only one Edge forwards at a time. Underneath either choice sits your BGP design with the physical fabric, and that wants BFD for fast failure detection and a sane graceful-restart posture so a control-plane blip does not become a data-plane outage.
The trap that catches people is mixing active-active ECMP with stateful services and then discovering asymmetric routing, where the return path of a flow lands on a different Edge than the forward path and the stateful service drops it. The clean answer is to keep stateful services off an active-active Tier-0, pushing them to a Tier-1 or choosing active-standby where they belong, and to size the Edges for the real north-south peak with headroom. Decide the HA mode against your actual service and throughput requirements, not by copying whatever the last deployment used.
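Asymmetry falls out of hashing almost by itself, and a short Python sketch shows why. The assumptions are loud ones: SHA-256 stands in for whatever hash the Edges and the physical routers really use, and `pick_path`, `reverse`, and the Edge names are hypothetical. The only claim is that two independent 5-tuple decisions need not agree.

```python
import hashlib


def pick_path(five_tuple, paths):
    """Deterministically map a flow's 5-tuple to one of the available paths."""
    key = "|".join(str(f) for f in five_tuple).encode()
    digest = hashlib.sha256(key).digest()  # stand-in hash, not the real one
    return paths[int.from_bytes(digest[:4], "big") % len(paths)]


def reverse(five_tuple):
    """The same flow as seen on the return path: endpoints and ports swapped."""
    src, dst, proto, sport, dport = five_tuple
    return (dst, src, proto, dport, sport)


edges = ["edge-1", "edge-2"]
flows = [("10.10.1.20", "198.51.100.7", "tcp", 40000 + i, 443) for i in range(200)]

# A flow is asymmetric when its return direction lands on a different Edge.
asymmetric = [f for f in flows
              if pick_path(f, edges) != pick_path(reverse(f), edges)]
print(f"{len(asymmetric)} of {len(flows)} flows return via a different Edge")
```

With two paths and an even hash, expect something in the region of half the flows to come back through the other Edge. A stateless forwarder does not care. A stateful service that only saw the forward half of the conversation does, and that is the drop.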
The BGP choices that decide whether the fabric is stable
North-south stability is mostly a BGP design question, and a few choices carry most of the weight. You are typically running eBGP from the Tier-0 to the top-of-rack or spine, so the autonomous-system design, the route filtering and the aggregation you apply at that boundary determine how much churn the fabric and the Tier-0 inflict on each other. Advertise only the prefixes you actually mean to expose, aggregate where you can, and filter inbound so a mistake upstream does not flood your Tier-0 with routes it should never carry. A Tier-0 that accepts and re-advertises everything is a Tier-0 that turns a small upstream error into your problem.
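Aggregation and inbound filtering are both easy to reason about with Python's standard `ipaddress` module. This is a sketch with made-up prefixes, and `aggregate` and `accept_inbound` are my own helper names. The inbound filter here is an exact-match allow-list, which is stricter than most real prefix lists.

```python
import ipaddress


def aggregate(prefixes):
    """Collapse adjacent or overlapping prefixes into the fewest covering networks."""
    nets = (ipaddress.ip_network(p) for p in prefixes)
    return [str(n) for n in ipaddress.collapse_addresses(nets)]


def accept_inbound(received, allow):
    """Keep only received prefixes that exactly match the allow-list."""
    allowed = {ipaddress.ip_network(p) for p in allow}
    return [p for p in received if ipaddress.ip_network(p) in allowed]


# Four contiguous segment subnets plus one that stands alone.
segments = ["10.10.0.0/24", "10.10.1.0/24", "10.10.2.0/24",
            "10.10.3.0/24", "10.20.5.0/24"]
print(aggregate(segments))

# The fabric sends more than the default route; accept only what was agreed.
received = ["0.0.0.0/0", "192.168.50.0/24", "10.99.0.0/16"]
print(accept_inbound(received, allow=["0.0.0.0/0"]))
```

`collapse_addresses` only merges ranges that are fully covered by the input, so it never produces an aggregate that claims address space you did not list. That is the property you want from a summary: five advertisements become two without promising reachability to anything you do not own.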
Underneath the prefixes sits failure detection, and this is where BFD earns its place. BGP timers alone are slow to notice a dead path, so pairing BGP with BFD gives you sub-second detection and failover that actually matches the resilience the rest of the design assumes. Add a sensible graceful-restart posture so a control-plane event does not needlessly tear down forwarding, and you have a north-south edge that fails over quickly and predictably. Get these few BGP and BFD decisions right up front and routing becomes the part of the platform you stop thinking about, which is exactly what you want from routing.
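The gap between BFD and bare BGP timers is plain arithmetic: BFD declares a peer down after a set number of missed packets, so detection time is roughly the interval multiplied by the multiplier. A quick Python sketch, with values that are purely illustrative and not NSX or vendor defaults:

```python
def bfd_detection_ms(interval_ms, multiplier):
    """Time to declare a peer down: the interval times the missed-packet budget."""
    return interval_ms * multiplier


# Illustrative values only; check what your Edges and ToRs actually negotiate.
bfd_ms = bfd_detection_ms(interval_ms=300, multiplier=3)
bgp_hold_ms = 180 * 1000  # assumed BGP hold time of 180 seconds

print(f"BFD detects failure in about {bfd_ms} ms")
print(f"BGP hold timer alone could take up to {bgp_hold_ms} ms")
print(f"Ratio in this example: {bgp_hold_ms // bfd_ms}x")
```

Run the same sum with your real interval and multiplier before you promise anyone sub-second failover, because both ends have to support the interval you choose.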
References
- Configure BGP in NSX (Broadcom TechDocs, VCF 9)
- NSX Tier-0 VRF Gateways (Broadcom TechDocs, VCF 9)
- NSX 9 Edge Transport Nodes and Edge Clusters (NSX Series, Part 7)



