Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture


NSX 9 Edge Transport Nodes and Edge Clusters: VM vs Bare-Metal (NSX Series, Part 7)

Edge nodes carry every flow that leaves the overlay. Here is what they do, VM vs bare-metal, sizing, Tier-0 HA modes, and how Edge clusters fail over.

NSX Series · Part 7 of 30

TL;DR · Key Takeaways

  • Edge transport nodes carry everything that leaves the overlay: Tier-0 routing, NAT, DHCP, VPN, the gateway firewall, and load balancing. North-south scales with the Edge, not the hosts.
  • Edges come as a VM or on bare-metal, both DPDK-accelerated. Bare-metal has the highest ceilings; the VM is right for the large majority of deployments.
  • VM Edges have four form factors (Small to Extra Large). Medium is the production minimum, Large is the practical default, and every node in a cluster must be the same size.
  • Tier-0 runs active-standby (stateful services) or active-active ECMP (stateless, scale-out throughput). The HA mode is a design decision, not a toggle.
  • Always deploy an Edge cluster, never a lone Edge. The cluster is what gives you failover, and under-sizing it bottlenecks the whole platform.
Who this is for: architects and admins designing the north-south edge of an NSX 9 fabric.
Prerequisites: the architecture from Part 2 and prepared host transport nodes from Part 6.

Here is the asymmetry I keep coming back to in this series: east-west traffic is distributed across every host and scales for free as you add them, but north-south traffic, every flow that leaves the overlay for the physical world, funnels through the Edge. That makes the Edge the one tier where a sizing mistake caps the entire platform. I have seen estates with hundreds of beefy hosts throttled to a crawl because someone deployed two Small Edge VMs and called it done. This part is how you avoid that: what the Edge actually does, whether to run it as a VM or on bare-metal, how to size it, and how the Edge cluster keeps it alive.

What the Edge actually does

An Edge transport node is a dedicated forwarding appliance that hosts the centralized services NSX cannot distribute into every hypervisor. The Tier-0 gateway lives here and peers BGP with your physical fabric. So do the stateful services: source and destination NAT, DHCP, the gateway firewall, IPsec and L2 VPN, and NSX native load balancing. When a VM on an overlay segment talks to the internet, the path climbs from its segment to a Tier-1, up to the Tier-0 on an Edge node, and out to the fabric, picking up NAT and firewalling on the way. The Edge is the door between your software-defined world and everything outside it, and like any door, it is a chokepoint if you build it too narrow.

[Figure: The Edge is the north-south chokepoint. Path: overlay segments (VMs on hosts) to Tier-1 gateway (distributed) to Edge cluster (Tier-0 gateway with BGP; NAT, DHCP, VPN, gateway FW, LB) to physical fabric / WAN. East-west never touches this path; north-south has no other way out. Size the Edge for the real egress load. One Tier-0 instance lives per Edge node, so the Edge cluster, not a single node, carries your throughput.]
Every flow leaving the overlay passes through the Edge cluster. It is the tier most often under-built.

VM or bare-metal

The Edge runs in one of two form factors, and both use DPDK to accelerate packet processing. The VM Edge is an appliance you deploy onto your ESXi clusters like any other; the bare-metal Edge is NSX installed directly on dedicated physical servers, with no hypervisor in the path. Bare-metal gives you the highest throughput and the lowest, most predictable latency, because the packets never cross a virtualization layer. It also costs you dedicated hardware, more rack space, and a separate operational pattern. The VM Edge is simpler to deploy, lives in the clusters you already run, scales out by adding more appliances, and is more than fast enough for the overwhelming majority of enterprise workloads.

| Dimension | VM Edge | Bare-metal Edge |
|---|---|---|
| Throughput ceiling | High, scales by adding nodes. | Highest, single-node line rate. |
| Latency | Low, with hypervisor in the path. | Lowest and most predictable. |
| Footprint | Runs on existing clusters. | Dedicated physical servers. |
| Operations | Familiar VM lifecycle. | Separate hardware lifecycle. |
| Use it when | Almost always. The sane default. | Sustained very high north-south or strict latency SLAs. |

My take: reach for bare-metal only when you can point at a number: a sustained multi-gigabit north-south requirement, or a hard latency SLA that a VM Edge has been measured and shown to miss. “It feels faster” is not a reason to take on a separate hardware fleet. Most teams who think they need bare-metal actually need a couple more Large VM Edges.

Sizing the VM Edge

VM Edges come in four form factors, and the choice sets the ceiling on how much work a single node can do. Small exists for labs and for light functions, not for carrying production traffic. Medium is the supported production minimum. Large is what I deploy by default for real north-south workloads, and Extra Large is for the heaviest throughput and service density. There is one hard rule that trips people up: every Edge node in a cluster must be the same size. You cannot mix a Large and a Medium in one cluster to save resources, so pick the size for your busiest requirement and build the cluster uniformly.

| VM form factor | Use it for | In production? |
|---|---|---|
| Small | Labs, PoC, light non-forwarding functions. | No. |
| Medium | Smaller production with modest north-south. | Minimum supported. |
| Large | Typical production with real egress and services. | Recommended default. |
| Extra Large | High throughput, dense stateful services, LB. | For demanding cases. |

Remember that a Tier-0 gateway is limited to a single instance per Edge node. That, more than raw CPU, is why throughput scales with the cluster rather than with one big node: you grow capacity by adding Edge nodes and spreading gateways and ECMP paths across them. The performance and sizing math gets a dedicated treatment in Part 26; here the design point is simply to size the node for the service it hosts and the cluster for the aggregate load.
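The sizing rules in this section are easy to check mechanically before a design review. Here is a minimal Python sketch of such a check. It is not an NSX API call: the `EdgeNode` type, the form-factor labels, and the node names are hypothetical stand-ins for whatever inventory export you have.

```python
from __future__ import annotations
from dataclasses import dataclass

# Form factors named in this article, smallest to largest (labels are illustrative).
FORM_FACTORS = ["small", "medium", "large", "xlarge"]


@dataclass(frozen=True)
class EdgeNode:
    name: str         # hypothetical node name
    form_factor: str  # one of FORM_FACTORS


def validate_edge_cluster(nodes: list[EdgeNode], production: bool = True) -> list[str]:
    """Return a list of rule violations; an empty list means the cluster passes."""
    problems = []
    if len(nodes) < 2:
        problems.append("cluster has fewer than two Edge nodes")
    sizes = {n.form_factor for n in nodes}
    unknown = sizes - set(FORM_FACTORS)
    if unknown:
        problems.append(f"unknown form factor(s): {sorted(unknown)}")
    if len(sizes) > 1:
        problems.append(f"mixed form factors: {sorted(sizes)}")
    if production and "small" in sizes:
        problems.append("Small is not for production; Medium is the minimum")
    return problems


if __name__ == "__main__":
    cluster = [EdgeNode("edge-01", "large"), EdgeNode("edge-02", "medium")]
    for problem in validate_edge_cluster(cluster):
        print(problem)
```

Returning a list of findings rather than raising on the first one means a single pass reports every rule the cluster breaks.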

Tier-0 HA: active-standby vs active-active

The Tier-0 runs in one of two high-availability modes, and the choice shapes both your throughput and which services you can offer. Active-standby keeps one Edge node forwarding while another stands ready to take over, and it is the mode you must use when the Tier-0 hosts stateful services like stateful NAT, the stateful gateway firewall, or VPN, because that state has to live in one place and fail over cleanly. Active-active spreads traffic across multiple Edge nodes using ECMP, scaling north-south throughput out across the cluster, but it forwards statelessly, so it cannot run the stateful services. The decision is not cosmetic: pick active-active for raw routed throughput, active-standby when you need the services, and design the Edge cluster around whichever you chose.

[Figure: Two Tier-0 HA models. Active-standby: Edge A (active, forwarding) and Edge B (standby, ready); supports stateful NAT, gateway FW, VPN; one node’s worth of throughput; clean failover; use when you need the services. Active-active (ECMP): Edge A, Edge B, Edge C and more, all forwarding at once; throughput scales out; stateless, so no stateful services on this Tier-0; use for maximum routed throughput.]
Active-standby buys you stateful services; active-active buys you scale-out throughput. You choose one per Tier-0.
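The decision rule described here can be written down as a few lines of Python. This is only a sketch of the reasoning in this section, not anything NSX exposes: the service labels and the function name are hypothetical, and the stateful set mirrors the services named above.

```python
from __future__ import annotations

# Services this article says force active-standby (labels are illustrative).
STATEFUL_SERVICES = {"stateful_nat", "gateway_firewall", "vpn"}


def choose_tier0_ha_mode(required_services: set[str]) -> str:
    """Pick the Tier-0 HA mode from the services the gateway must host."""
    needs_state = required_services & STATEFUL_SERVICES
    return "active-standby" if needs_state else "active-active"


if __name__ == "__main__":
    print(choose_tier0_ha_mode({"bgp"}))         # routed throughput only
    print(choose_tier0_ha_mode({"bgp", "vpn"}))  # a stateful service is required
```

The point of encoding it this way is that the mode falls out of the service list, which keeps the choice from being made by reflex.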

The Edge cluster and failover

You never deploy a single Edge. You deploy an Edge cluster, a group of identically sized Edge nodes that together provide availability and capacity. The cluster is what lets a gateway survive a node failure: when an Edge goes down, the services and routing it hosted move to a surviving member, and traffic continues. How quickly and how cleanly depends on the HA mode you chose and on whether you left enough headroom. The trap is sizing the cluster so tightly that it runs fine until the day a node fails, at which point the survivors are saturated and you have a performance outage on top of a hardware one. Build in N+1 so a single failure is a non-event, not a second incident.

[Figure: Size for the failure, not the happy path. Edge 1 (gateways + services), Edge 2 (failed, down), Edge 3 (takes over Edge 2’s role); services and routing fail over. N+1 means the survivors have spare capacity. Size to full load without the failed node, or failover just moves the outage.]
Failover only helps if the survivors can carry the load. Plan N+1, not just-enough.
Disclaimer: Edge sizing and HA mode affect production traffic. Validate against the current VCF 9 BOM and NSX 9 configuration maximums, confirm physical uplink bandwidth and BGP peering with the network team, and test failover deliberately before you rely on it. Re-verify the exact NSX 9.x version and Edge maximums before committing.
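The N+1 arithmetic is simple enough to sketch in Python. Treat it as an illustration under stated assumptions: it models a scale-out (active-active ECMP) cluster where capacity adds up across nodes, and `per_node_gbps` is a placeholder you must replace with a figure measured in your own environment, not a published NSX number.

```python
import math


def edge_nodes_required(peak_gbps: float, per_node_gbps: float, spares: int = 1) -> int:
    """Nodes needed to carry the peak load, plus spare nodes (N+1 by default)."""
    if peak_gbps < 0 or per_node_gbps <= 0 or spares < 0:
        raise ValueError("peak must be >= 0, per-node > 0, spares >= 0")
    base = max(1, math.ceil(peak_gbps / per_node_gbps))
    return base + spares


def survives_failure(node_count: int, peak_gbps: float,
                     per_node_gbps: float, failed: int = 1) -> bool:
    """True if the surviving nodes can still carry the full peak load."""
    return (node_count - failed) * per_node_gbps >= peak_gbps


if __name__ == "__main__":
    # Placeholder numbers: 30 Gbps peak, 10 Gbps per node (assumed, not measured).
    nodes = edge_nodes_required(30, 10)
    print(nodes, survives_failure(nodes, 30, 10))
    print(3, survives_failure(3, 30, 10))  # sized for the happy path only
```

With these placeholder inputs, three nodes carry the load only while all three are healthy; the fourth is what turns a node failure into a non-event.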

Where the Edge VMs actually run

A VM Edge has to live on some cluster, and that placement is a real design decision with two common shapes. In a dedicated model, the Edge VMs sit on their own cluster built for north-south, which keeps their resources isolated from tenant workloads and makes the uplink networking clean, at the cost of dedicated hosts. In a collapsed model, the Edge VMs share a cluster with other management or workload VMs, which is cheaper and perfectly fine at smaller scale, as long as you protect the Edges from noisy neighbours. VCF documents these as Edge cluster models, and the right one depends on your scale and how hard you push north-south.

Whichever model you pick, two things are non-negotiable. Give the Edge VMs reserved CPU and memory so they never lose a scheduling fight to a tenant VM; an Edge starved of CPU drops throughput in ways that look like a network fault and waste hours of troubleshooting. And place the Edge VMs with anti-affinity so the members of one Edge cluster never share a host, because the whole point of a multi-node Edge cluster evaporates the moment two of its nodes ride the same failure domain. I check both of these on every Edge deployment I review, and I find at least one of them wrong more often than not.

In practice: the most common Edge performance complaint I get is not a sizing problem at all but a placement problem: unreserved Edge VMs contending for CPU on a busy collapsed cluster. Reserve the resources and the “slow Edge” usually fixes itself.
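Both placement checks, reservations and anti-affinity, can be run against an inventory export. The Python sketch below assumes you already have each Edge VM's cluster membership, current host, and reservation status; the `EdgeVM` type and every field name are hypothetical, and nothing here queries vCenter or NSX.

```python
from __future__ import annotations
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeVM:
    name: str            # hypothetical VM name
    edge_cluster: str    # Edge cluster the VM belongs to
    host: str            # ESXi host the VM currently runs on
    cpu_reserved: bool   # True if CPU is fully reserved
    mem_reserved: bool   # True if memory is fully reserved


def placement_findings(vms: list[EdgeVM]) -> list[str]:
    """Flag unreserved Edge VMs and Edge cluster members that share a host."""
    findings = []
    by_cluster_host = defaultdict(list)
    for vm in vms:
        by_cluster_host[(vm.edge_cluster, vm.host)].append(vm.name)
        if not (vm.cpu_reserved and vm.mem_reserved):
            findings.append(f"{vm.name}: CPU/memory not fully reserved")
    for (cluster, host), names in by_cluster_host.items():
        if len(names) > 1:
            findings.append(f"{cluster}: {sorted(names)} share host {host}")
    return findings


if __name__ == "__main__":
    inventory = [
        EdgeVM("edge-01", "edge-cluster-1", "esx-01", True, True),
        EdgeVM("edge-02", "edge-cluster-1", "esx-01", True, False),
    ]
    for finding in placement_findings(inventory):
        print(finding)
```

A check like this only reports where the VMs sit right now; the anti-affinity rule itself still has to exist in the cluster so the placement holds after a migration or restart.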

Edge design rules I do not break

A few rules have earned their place by being violated in front of me. Always deploy a cluster of at least two, never a lone Edge. Keep every node in a cluster the same size. Choose the Tier-0 HA mode from the services you need, not by reflex. Default to Large VM Edges and only move to Extra Large or bare-metal against a measured requirement. Spread Edge VMs across hosts with anti-affinity so a single host failure cannot take down your whole Edge cluster. And size for the failed-node case, because the day a node dies is precisely the day you need the others to cope.

The Bottom Line

For the large majority of NSX 9 deployments the verdict is the same: a cluster of Large VM Edges, Tier-0 in the HA mode your services dictate, sized with N+1 headroom and spread across hosts. Bare-metal is a deliberate choice for a measured throughput or latency requirement, not a default, and the same goes for Extra Large. The Edge is the one tier where being a little generous pays off, because it is the chokepoint for everything leaving your overlay, and the cost of under-building it is a platform that disappoints under exactly the load you bought it for. For how this slots into the wider VCF picture, see NSX in VCF 9 Explained. Next up is Part 8: segments and logical switching, where we finally start building the networks the VMs actually sit on.

Are you running VM or bare-metal Edges in your design, and can you defend the choice?


Edge placement is the failure domain you are actually choosing

Picking VM versus bare-metal Edges is not only a throughput decision; it is a decision about your failure and placement model. VM Edges live on your existing hosts and share their resources, so they need CPU and memory reservations to perform predictably, and they need anti-affinity rules so that the two Edges in an active-standby pair never land on the same host. The moment both Edges of a pair share a host, a single host failure is a total north-south outage, and you have quietly built a single point of failure into a design that looks redundant on the diagram. Bare-metal Edges sidestep that by being dedicated tin, at the cost of being dedicated tin.

Whichever form factor you choose, size the Edge cluster for N+1, not N. The reason shows up later in the series: a serial upgrade or a node failure takes one Edge out of service, and if you sized for exactly the throughput two Edges provide, that event becomes a brownout rather than a non-event. Designing the Edge cluster with a spare node is the cheapest insurance you will buy, and it is far cheaper than explaining a peak-hour slowdown that only happens during maintenance.



About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
