Dr. Pranay Jha


NSX 9 NAT, DHCP and DNS Forwarder: The Gateway Services That Need the Edge (NSX Series, Part 11)

NAT, DHCP and the DNS forwarder are the first services that require a service router on the Edge. This part covers the NAT types and the active-active trap, the DHCP modes, and DNS forwarding.

NSX Series · Part 11 of 30

TL;DR · Key Takeaways

  • NAT, DHCP and the DNS forwarder are stateful gateway services. They run on the service router on the Edge, so the gateway that hosts them needs an Edge cluster.
  • SNAT hides internal sources behind an external IP for outbound traffic; DNAT publishes an internal service on an external IP for inbound. Both are stateful and need active-standby.
  • The trap: stateful SNAT/DNAT do not work on an active-active (ECMP) Tier-0. There you use reflexive (stateless) NAT, or you put the NAT on a Tier-1 instead.
  • DHCP runs as a local server (NSX hands out leases) or a relay (NSX forwards to your existing DHCP). Pick relay when you already have IPAM you trust.
  • The DNS forwarder gives clients a local listener IP and forwards queries upstream, with per-domain conditional forwarding when you need it.
Who this is for: admins configuring gateway services on NSX 9.
Prerequisites: Tier-0 and Tier-1 routing (Parts 9 and 10) and an Edge cluster (Part 7), because these services live on the service router (SR).

Up to now the routing has been mostly distributed and almost free. These services are different. NAT, DHCP, and the DNS forwarder are the first features in this series that genuinely need the service router, which means they pull traffic to the Edge and they care about your HA mode in ways the distributed router never did. That is not a reason to avoid them; they are bread-and-butter services. It is a reason to understand where they run before you turn them on, because the most common surprise here is a NAT rule that quietly does nothing when the gateway it sits on cannot support it.

NAT: SNAT, DNAT, and the active-active trap

NAT does two everyday jobs. SNAT, source NAT, rewrites the source address of outbound packets so a pool of internal VMs appears to the outside world as one or a few external addresses. DNAT, destination NAT, rewrites the destination of inbound packets so an external address maps to an internal service, which is how you publish a web server that lives on a private overlay segment. You also get No-SNAT and No-DNAT rules, which are exceptions: a way to say “translate everything except this range,” typically so internal routed traffic between your own subnets is not needlessly NATed.

[Figure: SNAT goes out, DNAT comes in. Outbound: internal VM 10.0.1.20 is SNATed and appears as 203.0.113.5. Inbound: an internet client hits 203.0.113.9, which DNAT maps to the internal service at 10.0.2.40.]
SNAT masks internal sources outbound; DNAT exposes an internal service on a public address inbound.
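
To make the rewrite concrete, here is a minimal Python sketch of what the three rule types do to a packet's addresses. It is a toy model, not NSX code: the addresses come from the figure, while the `Packet` class, the 10.0.1.0/24 SNAT source range, and the 10.0.0.0/8 No-SNAT range are assumptions made purely for illustration.

```python
from dataclasses import dataclass, replace
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


# Illustrative values: addresses from the figure, rule layout is hypothetical.
NO_SNAT_RANGES = [ip_network("10.0.0.0/8")]  # internal destinations: skip NAT
SNAT_SOURCES = ip_network("10.0.1.0/24")     # assumed internal source range
SNAT_ADDRESS = "203.0.113.5"
DNAT_RULES = {"203.0.113.9": "10.0.2.40"}    # external IP -> internal service


def translate_outbound(pkt: Packet) -> Packet:
    """Check the No-SNAT exception first, then apply SNAT."""
    if any(ip_address(pkt.dst) in net for net in NO_SNAT_RANGES):
        return pkt  # exception matched: leave the packet untouched
    if ip_address(pkt.src) in SNAT_SOURCES:
        return replace(pkt, src=SNAT_ADDRESS)
    return pkt


def translate_inbound(pkt: Packet) -> Packet:
    """Apply DNAT: rewrite the destination if it is a published address."""
    internal = DNAT_RULES.get(pkt.dst)
    return replace(pkt, dst=internal) if internal else pkt
```

A packet from 10.0.1.20 to an internet address leaves with source 203.0.113.5, while the same VM talking to 10.0.2.40 is left alone by the No-SNAT exception. What the sketch deliberately omits is the connection table that maps return traffic back, which is exactly the state the next paragraph is about.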

Now the trap, and it is a good one because it ties directly to the Tier-0 HA decision from Part 9. Stateful SNAT and DNAT track connection state, and that state has to live in one place, so they are only supported on a gateway running in active-standby. If your Tier-0 is active-active for ECMP throughput, stateful NAT is off the table there, because two Edges forwarding independently cannot share that state and asymmetric paths would break it. Your options are to use reflexive NAT, a stateless translation that survives asymmetric paths, or, far more commonly, to do the NAT on a Tier-1 in active-standby and keep the Tier-0 active-active for raw throughput. That second pattern is what I reach for almost every time.

NAT type | What it does | Stateful? | HA mode
SNAT | Rewrites source on outbound. | Yes | Active-standby only.
DNAT | Rewrites destination on inbound. | Yes | Active-standby only.
Reflexive NAT | Stateless 1:1 translation. | No | For active-active Tier-0.
No-SNAT / No-DNAT | Exception: skip NAT for a range. | n/a | Either.
In practice: the support case that comes up again and again is “my SNAT rule is configured but nothing is being translated.” Nine times out of ten the gateway is active-active. Move the NAT to an active-standby Tier-1, or accept reflexive NAT, and the rule starts working. Check the HA mode before you debug the rule.
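
The "check the HA mode first" habit is easy to turn into a pre-flight check. The sketch below is one hedged way to do it in Python, following the rule as stated in the table above. The `Gateway` class and the string labels for HA modes and NAT actions are hypothetical stand-ins for whatever your inventory export gives you, not an NSX API.

```python
from dataclasses import dataclass

# Per the table above: these actions track connection state.
STATEFUL_ACTIONS = {"SNAT", "DNAT"}


@dataclass
class Gateway:
    name: str
    ha_mode: str            # assumed labels: "ACTIVE_ACTIVE" or "ACTIVE_STANDBY"
    nat_actions: list[str]  # e.g. ["SNAT", "NO_SNAT", "REFLEXIVE"]


def unsupported_nat(gw: Gateway) -> list[str]:
    """Return the stateful NAT actions this gateway's HA mode cannot support."""
    if gw.ha_mode != "ACTIVE_ACTIVE":
        return []
    return [action for action in gw.nat_actions if action in STATEFUL_ACTIONS]
```

Run it over every gateway before you open a support case: a non-empty result on an active-active gateway is the "configured but nothing is translated" symptom waiting to happen.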

DHCP: local server or relay

NSX can hand out IP addresses two ways, and the choice usually comes down to where you want IPAM to live. A local DHCP server means NSX itself owns the scope and leases addresses to VMs on a segment; it is self-contained and quick to stand up, ideal for overlay segments that have no business talking to your enterprise DHCP. A DHCP relay means NSX does not lease anything itself but forwards DHCP requests from the segment to your existing external DHCP servers, which keeps a single source of truth for addressing and is what I recommend whenever a team already runs DHCP and IPAM they trust. Both attach to a gateway or segment, and the local server, being stateful, lives on the SR like NAT does.

[Figure: Two ways to give VMs addresses. Local server (NSX leases): segment VMs get leases from the NSX DHCP server, with the scope on the SR. Self-contained; good for isolated overlays. Relay (forward to your DHCP): segment VMs reach the corporate DHCP through the relay. One source of truth; good when IPAM exists.]
Local server when NSX should own the scope; relay when your enterprise DHCP already does.

The DNS forwarder

The DNS forwarder gives a gateway a local listener IP that VMs point at for name resolution. It receives client queries and forwards them to upstream DNS servers, using its forwarder IP as the source toward those upstreams. The useful part is conditional forwarding: you can send queries for a specific domain to one set of resolvers and everything else to another, which is how you keep internal zones resolving against internal DNS while general lookups go to a public or corporate resolver. It is a small service, but it removes a dependency: the VMs talk to a local NSX address instead of reaching across the network to a distant resolver for every query. It also gives you a clean place to steer resolution per domain.

[Figure: DNS forwarder: one listener, steered upstreams. Client VMs point at the forwarder IP; the DNS forwarder applies conditional forwarding by domain, sending the corp.local zone to internal DNS and everything else to the upstream resolver.]
Clients query one local address; the forwarder steers each domain to the right upstream.
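
Conditional forwarding is easiest to reason about as a suffix match on the query name. The Python sketch below shows that selection logic under stated assumptions: the `corp.local` zone comes from the figure, but the upstream addresses and the "most specific zone wins" rule are illustrative choices of mine, not a description of how NSX implements the forwarder.

```python
# Hypothetical zone-to-upstream mapping; the addresses are illustrative only.
CONDITIONAL_ZONES = {
    "corp.local": ["10.0.0.53"],
}
DEFAULT_UPSTREAMS = ["192.0.2.53"]


def pick_upstreams(qname: str) -> list[str]:
    """Return upstreams for the most specific matching zone, else the default."""
    labels = qname.rstrip(".").lower().split(".")
    # Try the full name first, then progressively shorter suffixes.
    for i in range(len(labels)):
        zone = ".".join(labels[i:])
        if zone in CONDITIONAL_ZONES:
            return CONDITIONAL_ZONES[zone]
    return DEFAULT_UPSTREAMS
```

Matching on whole labels matters: `app.corp.local` goes to the internal resolver, while `notcorp.local` falls through to the default, because the comparison is on label boundaries rather than a raw string suffix.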

They all live on the service router

Tie these three back to Part 10 and the picture is consistent: NAT, the local DHCP server, and the DNS forwarder are stateful services, so they run on the service router, which means the gateway hosting them must have an Edge cluster attached. That is the legitimate reason to attach an Edge cluster to a Tier-1, the one I told you to demand a justification for. If a Tier-1 runs any of these services, the SR and the Edge cluster are not optional, they are the point. The discipline is simply to attach the Edge cluster to the gateways that actually run services and leave the pure east-west Tier-1s distributed, so you spend Edge capacity only where a service earns it.
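
That discipline can be checked mechanically. Here is a small Python sketch that audits a list of Tier-1 gateways against the rule in the paragraph above. Everything in it is hypothetical: the `Tier1` class, the service labels, and the finding strings are made up for illustration, and only the three services from this part are modelled, so any other SR-hosted feature you run would need adding to the set.

```python
from dataclasses import dataclass, field

# Only the three services covered in this part; extend for other SR features.
SR_SERVICES = {"NAT", "DHCP_SERVER", "DNS_FORWARDER"}


@dataclass
class Tier1:
    name: str
    has_edge_cluster: bool
    services: set[str] = field(default_factory=set)


def audit(gateways: list[Tier1]) -> dict[str, str]:
    """Flag gateways whose Edge cluster attachment does not match their services."""
    findings: dict[str, str] = {}
    for gw in gateways:
        needs_sr = bool(gw.services & SR_SERVICES)
        if needs_sr and not gw.has_edge_cluster:
            findings[gw.name] = "services configured but no Edge cluster"
        elif gw.has_edge_cluster and not needs_sr:
            findings[gw.name] = "Edge cluster attached without a service to justify it"
    return findings
```

Gateways that are consistent, either service-bearing with an Edge cluster or purely distributed without one, produce no finding at all.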

When something here misbehaves, the Edge is where you confirm it. A couple of CLI checks tell you whether the service router is actually doing the translation or forwarding you configured.

# On the Edge node hosting the SR
get firewall                 # NAT and FW are processed here
get nat rules                # confirm SNAT/DNAT rules are present and hit
get dns-forwarder            # forwarder status and upstreams
get dhcp lease               # active leases if running a local DHCP server

# If a NAT rule shows zero hits, re-check the gateway HA mode
# and that traffic is actually routed through this SR.
Disclaimer: NAT, DHCP, and DNS changes affect live connectivity and addressing. Stage changes in a maintenance window, confirm the gateway HA mode supports the NAT you intend, and avoid overlapping a new DHCP scope with an existing one. Validate against the current VCF 9 BOM and NSX 9 maximums before you start.

Local DHCP or relay, decided

The DHCP choice is less about technology and more about who owns addressing in your organisation. If a networking or platform team already runs DHCP and IPAM as the authoritative source, relay keeps that authority intact and avoids two systems disagreeing about who owns a lease. If a segment is genuinely self-contained, a lab overlay or a tenant network that should never touch enterprise addressing, the local server is simpler and removes an external dependency. The table below is the quick version I use when a team cannot decide.

Consideration | Local DHCP server | DHCP relay
Source of truth | NSX owns the scope. | Your existing DHCP/IPAM.
External dependency | None. | Reachability to the DHCP servers.
Best for | Isolated overlays, labs, tenants. | Estates with trusted enterprise IPAM.
Runs on | The SR (stateful, needs an Edge). | Forwarding only; lighter footprint.

Where to place these services

There is a design question that sits underneath all three services: which gateway tier should host them, Tier-0 or Tier-1? My default is to push services down to the Tier-1 wherever I can. Putting NAT, DHCP, and DNS on per-tenant Tier-1 gateways keeps each tenant’s services isolated, lets you run the Tier-0 active-active for throughput, and contains the blast radius of a service change to a single tenant rather than the shared north-south gateway. Reserve Tier-0 services for the genuinely shared, edge-of-network cases, for example a NAT that has to apply to everything leaving the estate. This is the same instinct as the routing parts: keep the shared Tier-0 lean and fast, and let the per-tenant Tier-1s carry the stateful, opinionated work.

The mistakes I see cluster around three things. Stateful NAT placed on an active-active gateway, which silently does nothing, is the big one. Overlapping DHCP scopes, where a new local server hands out addresses that collide with an existing range, is the second, and it produces intermittent, hard-to-trace connectivity loss. The third is forgetting that the DNS forwarder, NAT, and local DHCP all consume the SR, so a Tier-1 that picked up three services has quietly become an Edge-resource consumer that belongs in your sizing math. None of these are hard to avoid once you know to look for them, which is the whole point of understanding where these services run before you switch them on.
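
The overlapping-scope mistake is the cheapest of the three to catch before it bites. A minimal Python sketch, assuming you can list each DHCP range as an inclusive start and end address (the function name and tuple layout are my own, not an NSX construct):

```python
from ipaddress import ip_address


def ranges_overlap(a: tuple[str, str], b: tuple[str, str]) -> bool:
    """True when two inclusive (start, end) DHCP ranges share any address."""
    a_start, a_end = (int(ip_address(x)) for x in a)
    b_start, b_end = (int(ip_address(x)) for x in b)
    return a_start <= b_end and b_start <= a_end
```

For example, `ranges_overlap(("10.0.1.100", "10.0.1.200"), ("10.0.1.150", "10.0.1.250"))` is true, while two ranges that merely sit next to each other, ending at .149 and starting at .150, do not overlap. Compare any new local scope against the ranges your existing DHCP already serves before you enable it.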

My take: services on the Tier-1, throughput on the Tier-0. That one rule resolves most of the placement debates I have with clients, and it keeps the shared north-south path fast while giving each tenant a clean place for its own NAT, DHCP, and DNS.

What I’d Do

Put stateful NAT on an active-standby Tier-1 and keep the Tier-0 active-active for throughput, which sidesteps the most common NAT-does-nothing surprise entirely. Use No-SNAT rules to keep your own internal routed traffic un-NATed rather than building elaborate exceptions later. Reach for DHCP relay when you already run trusted IPAM and a local DHCP server only for genuinely isolated overlays. Stand up the DNS forwarder to give clients a local listener and use conditional forwarding to keep internal zones internal. And remember the through-line of this part: every one of these services lives on the SR, so it is the honest reason a gateway needs an Edge cluster.

Next up is Part 12: the Distributed Firewall, where micro-segmentation finally begins. Is your stateful NAT on a gateway that can actually support it?


These services pin workloads to the Edge

NAT, DHCP and the DNS forwarder share a property that shapes your design: they run as stateful services on the gateway service router, which lives on the Edge. The moment you enable any of them, the traffic that depends on them has to traverse the Edge, so a segment that was happily forwarding east-west on the hosts now has an Edge dependency for its address assignment or its name resolution or its outbound translation. That is fine when you plan for it and a surprise when you do not, because it quietly turns the Edge into part of the critical path for functions that feel like they should be everywhere.

The design implications follow directly. Size the Edge for the load these services add, not just for raw forwarding, because a busy DHCP scope or a heavily used NAT rule set consumes real Edge resources. Place the services deliberately, and monitor them as the dependencies they are, because a DNS forwarder or a DHCP server on the Edge failing is an application outage even though no workload moved. The mental model that keeps you out of trouble is to treat every gateway service as something you are pinning to the Edge on purpose, with a capacity cost and a failure mode you have accounted for, rather than a convenient checkbox you flipped without thinking about where it actually runs.

The DNS forwarder is a dependency worth watching

Of the gateway services, the DNS forwarder deserves special attention because of how much depends on it and how quietly it fails. Everything behind it relies on it for name resolution, so a forwarder that is mis-scoped, points at an unreachable upstream, or simply runs on an Edge that is having a bad day takes down name resolution for a whole swathe of workloads at once. The applications do not report a DNS problem; they report timeouts and half-broken behaviour that sends you looking in the wrong place.

Treat the forwarder as the shared dependency it is. Monitor it actively, confirm its upstreams are reachable and sensible, and design its placement and redundancy with the same care you would give any service that an entire tier leans on. When a swathe of workloads behind a gateway suddenly behaves strangely and nothing obvious changed, the DNS forwarder is worth checking early, because a name-resolution failure wears a lot of disguises before anyone names it.


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
