Dr. Pranay Jha


Networking for VMware Private AI Workloads: Segmentation, Ingress and the East-West Path (Private AI Series, Part 26)

Model serving lives or dies on the network nobody designed. Here is how to segment AI namespaces with NSX, expose inference endpoints through the Gateway API and the load balancer, and keep RAG east-west traffic fast and private.

VMware Private AI Series · Part 26 of 30

Here is a pattern I see on almost every Private AI engagement. The GPU hosts are sized correctly, the NIM Operator is humming, the models load, and then the first real RAG query takes four seconds because the embedding service, the vector database and the LLM are scattered across three namespaces with traffic hairpinning through an edge it never needed to touch. Nobody designed the network. They designed the compute and assumed the network would sort itself out. It does not. On a platform where a single user request fans out into half a dozen internal model calls, the network is a first-class part of the architecture, not plumbing.

This post covers the three networking problems that matter on Private AI: isolating AI projects from each other, getting inference endpoints exposed safely, and keeping the chatty east-west traffic inside a RAG stack fast and private.

Three traffic patterns, three different problems

Before you reach for NSX policies, separate the traffic in your head. North-south is users and applications calling your inference endpoints from outside the cluster. East-west is the internal chatter between model components: an LLM calling an embedding service, which in turn calls a reranker. And there is a third, often forgotten flow: the egress a NIMService makes to pull a model or send a license heartbeat. Each has a different latency tolerance, a different exposure risk and a different control.

| Traffic | Example | Latency sensitivity | Primary control |
|---|---|---|---|
| North-south | App calling the chat endpoint | Moderate | Gateway API + load balancer + TLS |
| East-west | LLM to embedding to vector DB | High, every hop counts | Distributed firewall, same-zone placement |
| Egress | Model pull from NGC, license check | Low | Egress firewall rules or a mirror |
Figure: The three flows on one map. Users reach the AI namespace (one tenant) north-south through the load balancer and Gateway API. Inside the namespace, the LLM NIM, the embedding and reranker services and the pgvector vector DB talk east-west (keep it in-zone), with the distributed firewall wrapping each pod. Egress goes to NGC or a mirror.
One user request becomes several internal hops. The east-west path is where latency hides.

Segmentation: with NSX or without it

The first design fork on Private AI is whether the Supervisor uses NSX networking or VDS-based networking without NSX. Broadcom documents both, and they are not equivalent for AI. Without NSX you get vSphere namespaces and basic isolation, which is enough for a single-team lab. With NSX you get the distributed firewall, micro-segmentation down to the pod, and proper multi-tenant separation between AI projects that happen to share GPU hosts. If two business units share a cluster and one is handling regulated data, that distinction is not optional.

Figure: Two supervisor networking models.
Without NSX (VDS): namespace-level isolation only; no distributed firewall; simpler to stand up; external load balancer required. Fit: single team, lab, PoC.
With NSX: micro-segmentation per pod; distributed firewall rules; NSX Advanced LB for L4 and L7; true multi-tenant separation. Fit: shared, regulated, production.
If AI projects share hosts or touch regulated data, NSX is the answer. Decide before you build the Supervisor.

Exposing inference endpoints without leaking them

The NIM Operator moved to the Kubernetes Gateway API as the default routing mechanism in 3.1, with traditional ingress deprecated. That is the right call: the Gateway API gives you a clean split between the platform team that owns the gateway and the app team that attaches routes to it. On Private AI the gateway is typically backed by NSX Advanced Load Balancer, which provides the actual L4 and L7 data path. Terminate TLS at the gateway, require authentication at that boundary, and never expose a NIMService directly with a LoadBalancer service type unless you genuinely want the model reachable from the data center network. I have walked into more than one environment where an internal chat model was one routable hop from the entire corporate LAN because someone took the quick path.

Figure: The safe ingress chain. Client app -> NSX Adv LB (VIP, L4/L7) -> Gateway (TLS + authN) -> HTTPRoute (path rules) -> NIMService (pods). Authentication and TLS belong at the gateway, not inside the model pod.
Front every endpoint with the gateway. A direct LoadBalancer on a NIMService is a shortcut you will regret.
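
To make the platform-team and app-team split concrete, here is a minimal sketch of the object the app team would own: an HTTPRoute that attaches to a gateway owned by someone else. It is written as Python that builds the manifest as a plain dict and prints it as JSON, which kubectl accepts. The route, namespace, gateway and service names and port 8000 are all hypothetical placeholders, not values from a real deployment, and the sketch assumes the Gateway's listeners are configured to accept routes from the app namespace.

```python
import json


def build_http_route(name, namespace, gateway_name, gateway_namespace,
                     path_prefix, service_name, service_port):
    """Return a Gateway API HTTPRoute manifest as a plain dict.

    The route attaches to an existing Gateway (parentRefs) and forwards
    requests under path_prefix to a ClusterIP Service in the app namespace.
    """
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # The Gateway is owned by the platform team, in its own namespace.
            "parentRefs": [
                {"name": gateway_name, "namespace": gateway_namespace}
            ],
            "rules": [
                {
                    "matches": [
                        {"path": {"type": "PathPrefix", "value": path_prefix}}
                    ],
                    # Backend is an in-cluster Service, never a LoadBalancer.
                    "backendRefs": [
                        {"name": service_name, "port": service_port}
                    ],
                }
            ],
        },
    }


# All names and the port below are hypothetical examples.
route = build_http_route(
    name="chat-llm-route",
    namespace="ai-team-a",
    gateway_name="inference-gateway",
    gateway_namespace="platform-gateways",
    path_prefix="/v1/chat",
    service_name="chat-llm",
    service_port=8000,
)
print(json.dumps(route, indent=2))
```

TLS termination and authentication are deliberately absent from the route: they sit on the Gateway listener, which is exactly the division of ownership described above.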

The east-west path is where latency hides

A RAG request is not one call but a chain: embed the query, search the vector store, rerank, then generate. If those services sit in different namespaces on different worker nodes, every hop adds a network round trip, and you pay it on every single query. The cheapest performance win in most RAG deployments is not a faster GPU; it is placing the embedding service, reranker and vector database close to the LLM so the east-west traffic stays on the same node or at least in the same zone. Use pod affinity to co-locate the components of one pipeline, and use the distributed firewall to allow exactly those flows and deny the rest. You tighten security and cut latency with the same design move.
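
The two halves of that design move can be sketched as Kubernetes objects. The Python below builds a pod affinity stanza that prefers the same zone as the LLM pods, and a standard NetworkPolicy that lets only named callers reach the vector database. This is a sketch under assumptions: the `app` labels, namespace, policy name and port 5432 are hypothetical, and it assumes your cluster enforces Kubernetes NetworkPolicy. Whether you express the rule this way or directly as NSX distributed firewall policy is a platform decision this example does not settle.

```python
import json


def colocate_with(app_label, topology_key="topology.kubernetes.io/zone",
                  weight=100):
    """Return a pod-spec 'affinity' stanza that prefers scheduling into the
    same topology domain as pods labeled app=<app_label>.

    Pass topology_key="kubernetes.io/hostname" to prefer the same node.
    """
    return {
        "podAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "podAffinityTerm": {
                        "labelSelector": {"matchLabels": {"app": app_label}},
                        "topologyKey": topology_key,
                    },
                }
            ]
        }
    }


def allow_only_from(name, namespace, target_app, client_apps, port):
    """Return a NetworkPolicy: pods labeled app=<target_app> accept ingress
    only from the listed apps in the same namespace, on one TCP port.

    Once a pod is selected by an Ingress policy, other ingress is denied.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": target_app}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [
                        {"podSelector": {"matchLabels": {"app": app}}}
                        for app in client_apps
                    ],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


# Hypothetical labels: the embedding pods prefer the zone of the LLM pods,
# and only the RAG orchestrator may talk to the vector database.
affinity = colocate_with("chat-llm")
policy = allow_only_from("vector-db-ingress", "ai-team-a", "vector-db",
                         ["rag-orchestrator"], 5432)
print(json.dumps({"affinity": affinity, "policy": policy}, indent=2))
```

The affinity is a preference rather than a hard requirement on purpose: a required rule can leave pods unschedulable when the LLM's zone is full, which is a worse failure than one extra hop.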

Disclaimer: changing Supervisor networking or distributed firewall policy on a live cluster can sever running inference traffic. Validate rules in a staging namespace, stage default-deny policies with logging before enforcing, and confirm your load balancer and certificate design with the network team before cutover.

What I would do

For anything beyond a single-team lab, build the Supervisor on NSX from day one. Retrofitting micro-segmentation onto a running AI platform is painful and you will not get permission for the maintenance window once models are in production. Front all inference with the Gateway API and NSX Advanced Load Balancer, terminate TLS and authenticate at that edge, and treat a bare LoadBalancer service on a NIMService as a red flag in code review. Then spend an afternoon on pod affinity for your RAG stacks, because that one change quietly removes more tail latency than most GPU tuning. Networking is not the exciting part of Private AI, but it is the part that decides whether the platform feels fast and stays contained. For the security model that sits on top of this, see the Private AI security and data privacy post.
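
The code-review red flag can be partly automated. As a rough sketch, the function below scans already-rendered manifests (parsed into dicts) and reports every Service of type LoadBalancer that is not on an explicit allow list. The function name, the allow list and the sample manifests are hypothetical, and the check only sees plain Service objects; it does not inspect how a NIMService custom resource asks the operator to expose itself.

```python
def bare_loadbalancer_services(manifests, allowed=frozenset()):
    """Return (namespace, name) for each Service of type LoadBalancer
    that is not explicitly allowed.

    manifests: iterable of Kubernetes objects parsed into dicts.
    allowed:   set of (namespace, name) pairs that may keep a LoadBalancer,
               for example the gateway's own Service.
    """
    findings = []
    for manifest in manifests:
        if manifest.get("kind") != "Service":
            continue
        meta = manifest.get("metadata", {})
        key = (meta.get("namespace", "default"), meta.get("name", ""))
        service_type = manifest.get("spec", {}).get("type", "ClusterIP")
        if service_type == "LoadBalancer" and key not in allowed:
            findings.append(key)
    return findings


# Hypothetical sample: one gateway Service that is allowed to be a
# LoadBalancer, one model Service that took the quick path.
sample = [
    {"kind": "Service",
     "metadata": {"name": "inference-gateway", "namespace": "platform-gateways"},
     "spec": {"type": "LoadBalancer"}},
    {"kind": "Service",
     "metadata": {"name": "chat-llm", "namespace": "ai-team-a"},
     "spec": {"type": "LoadBalancer"}},
    {"kind": "Service",
     "metadata": {"name": "embedding", "namespace": "ai-team-a"},
     "spec": {"type": "ClusterIP"}},
]
findings = bare_loadbalancer_services(
    sample, allowed={("platform-gateways", "inference-gateway")})
for namespace, name in findings:
    print(f"red flag: Service {namespace}/{name} is type LoadBalancer")
```

Run as a CI step over rendered manifests, a check like this turns the review rule into something that fails a pipeline instead of relying on a reviewer noticing one line.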

How is your supervisor networking set up today, NSX or VDS? If you are on VDS and sharing GPU hosts across teams, that is the first thing I would revisit.


About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
