Networking for VMware Private AI Workloads: Segmentation, Ingress and the East-West Path (Private AI Series, Part 26)

Model serving lives or dies on the network nobody designed. Here is how to segment AI namespaces with NSX, expose inference endpoints through the Gateway API and the load balancer, and keep RAG east-west traffic fast and private.

by

Dr. Pranay Jha

June 17, 2026


VMware Private AI Series · Part 26 of 30

Here is a pattern I see on almost every Private AI engagement. The GPU hosts are sized correctly, the NIM Operator is humming, the models load, and then the first real RAG query takes four seconds because the embedding service, the vector database and the LLM are scattered across three namespaces with traffic hairpinning through an edge it never needed to touch. Nobody designed the network. They designed the compute and assumed the network would sort itself out. It does not. On a platform where a single user request fans out into half a dozen internal model calls, the network is a first-class part of the architecture, not plumbing.

This post covers the three networking problems that matter on Private AI: isolating AI projects from each other, getting inference endpoints exposed safely, and keeping the chatty east-west traffic inside a RAG stack fast and private.

Three traffic patterns, three different problems

Before you reach for NSX policies, separate the traffic in your head. North-south is users and applications calling your inference endpoints from outside the cluster. East-west is the internal chatter between model components: an LLM calling an embedding service, which in turn calls a reranker. And there is a third, often forgotten flow: the egress a NIMService makes to pull a model or send a license heartbeat. Each has a different latency tolerance, a different exposure risk and a different control.

Traffic	Example	Latency sensitivity	Primary control
North-south	App calling the chat endpoint	Moderate	Gateway API + load balancer + TLS
East-west	LLM to embedding to vector DB	High, every hop counts	Distributed firewall, same-zone placement
Egress	Model pull from NGC, license check	Low	Egress firewall rules or a mirror
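As a rough way to make the three categories concrete, here is a minimal Python sketch that labels a flow by which side of the cluster boundary each end sits on. The CIDR ranges are placeholders I picked for illustration, not values from any real deployment, and the model is simplified: in practice north-south traffic lands on a load balancer VIP before it reaches a pod.

```python
import ipaddress

# Hypothetical address ranges; substitute the CIDRs your cluster actually uses.
CLUSTER_CIDRS = [
    ipaddress.ip_network("10.244.0.0/16"),  # assumed pod network
    ipaddress.ip_network("10.96.0.0/12"),   # assumed service network
]


def in_cluster(address: str) -> bool:
    """Return True if the address falls inside any assumed cluster CIDR."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in CLUSTER_CIDRS)


def classify_flow(src: str, dst: str) -> str:
    """Label a flow by where its two ends sit relative to the cluster."""
    src_inside, dst_inside = in_cluster(src), in_cluster(dst)
    if src_inside and dst_inside:
        return "east-west"
    if dst_inside:
        return "north-south"  # outside caller reaching an in-cluster endpoint
    if src_inside:
        return "egress"       # in-cluster workload calling out
    return "external"         # neither end belongs to the cluster


print(classify_flow("10.244.1.5", "10.244.2.9"))
print(classify_flow("192.168.10.20", "10.96.0.15"))
print(classify_flow("10.244.1.5", "203.0.113.7"))
```

Running something like this over flow logs is a quick way to see how much of your traffic falls into each bucket before you write a single firewall rule.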

One user request becomes several internal hops. The east-west path is where latency hides.
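To put rough numbers on that, the sketch below multiplies the hop count by a per-hop round trip. The round-trip figures are assumptions chosen purely for illustration, not measurements, and the model ignores compute time, queuing and serialization. Measure your own values before drawing conclusions.

```python
def network_overhead_ms(hops: int, rtt_ms: float) -> float:
    """Time spent purely on sequential network round trips, in milliseconds."""
    return hops * rtt_ms


# Illustrative values only, not benchmarks.
HOPS_PER_REQUEST = 6     # "half a dozen internal model calls" per user request
SAME_NODE_RTT_MS = 0.25  # assumed round trip when the pods share a node
HAIRPIN_RTT_MS = 2.0     # assumed round trip when traffic detours via an edge

for label, rtt in [("co-located", SAME_NODE_RTT_MS), ("hairpinned", HAIRPIN_RTT_MS)]:
    overhead = network_overhead_ms(HOPS_PER_REQUEST, rtt)
    print(f"{label}: {overhead:.1f} ms of network time per request")
```

The absolute numbers matter less than the shape: the overhead scales with the hop count, so a per-hop penalty that looks small is paid several times on every query.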

Segmentation: with NSX or without it

The first design fork on Private AI is whether the Supervisor uses NSX networking or VDS-based networking without NSX. Broadcom documents both, and they are not equivalent for AI. Without NSX you get vSphere namespaces and basic isolation, which is enough for a single-team lab. With NSX you get the distributed firewall, micro-segmentation down to the pod, and proper multi-tenant separation between AI projects that happen to share GPU hosts. If two business units share a cluster and one is handling regulated data, that distinction is not optional.

If AI projects share hosts or touch regulated data, NSX is the answer. Decide before you build the Supervisor.
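One way to express per-project isolation is with standard Kubernetes NetworkPolicy objects, assuming your Supervisor's networking stack enforces them (verify this for your version; NSX distributed firewall rules can also be authored directly in NSX). The Python sketch below builds a default-deny policy and a same-namespace allow policy as plain dictionaries and prints them as JSON, which `kubectl apply -f` accepts. The namespace name `ai-team-a` is hypothetical.

```python
import json


def default_deny(namespace: str) -> dict:
    """NetworkPolicy that selects every pod and allows no ingress or egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},                     # empty selector matches all pods
            "policyTypes": ["Ingress", "Egress"],  # no rules listed means deny both
        },
    }


def allow_same_namespace(namespace: str) -> dict:
    """NetworkPolicy that re-opens traffic between pods in the same namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-same-namespace", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress", "Egress"],
            # A bare podSelector in from/to matches pods in this namespace only.
            "ingress": [{"from": [{"podSelector": {}}]}],
            "egress": [{"to": [{"podSelector": {}}]}],
        },
    }


for policy in (default_deny("ai-team-a"), allow_same_namespace("ai-team-a")):
    print(json.dumps(policy, indent=2))
```

Note that denying all egress also blocks DNS lookups and model pulls, so a real policy set needs explicit allowances for those flows.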

Exposing inference endpoints without leaking them

The NIM Operator moved to the Kubernetes Gateway API as the default routing mechanism in 3.1, with traditional ingress deprecated. That is the right call: the Gateway API gives you a clean split between the platform team that owns the gateway and the app team that attaches routes to it. On Private AI the gateway is typically backed by NSX Advanced Load Balancer, which handles the actual L4 and L7 data path. Terminate TLS at the gateway, require authentication at that boundary, and never expose a NIMService directly with a LoadBalancer service type unless you genuinely want the model reachable from the data center network. I have walked into more than one environment where an internal chat model was one routable hop from the entire corporate LAN because someone took the quick path.

Front every endpoint with the gateway. A direct LoadBalancer on a NIMService is a shortcut you will regret.
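The platform-team and app-team split can be sketched with two Gateway API objects: a Gateway that terminates TLS, and an HTTPRoute that attaches to it. The Python below builds both as dictionaries and prints JSON. Every name in it (the gateway class, namespaces, hostname, secret, backend service and port) is a placeholder I made up for illustration. Authentication is not part of the core Gateway API and depends on your gateway implementation, so it is not shown.

```python
import json

GATEWAY_API = "gateway.networking.k8s.io/v1"


def tls_gateway(name: str, namespace: str, gateway_class: str,
                hostname: str, cert_secret: str) -> dict:
    """Platform-owned Gateway with one HTTPS listener that terminates TLS."""
    return {
        "apiVersion": GATEWAY_API,
        "kind": "Gateway",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "gatewayClassName": gateway_class,  # hypothetical class name
            "listeners": [{
                "name": "https",
                "protocol": "HTTPS",
                "port": 443,
                "hostname": hostname,
                "tls": {
                    "mode": "Terminate",
                    "certificateRefs": [{"kind": "Secret", "name": cert_secret}],
                },
                # Let routes from other namespaces attach to this listener.
                "allowedRoutes": {"namespaces": {"from": "All"}},
            }],
        },
    }


def inference_route(name: str, namespace: str, gateway: dict,
                    service: str, port: int) -> dict:
    """App-owned HTTPRoute that sends /v1 traffic to an in-cluster Service."""
    listener = gateway["spec"]["listeners"][0]
    return {
        "apiVersion": GATEWAY_API,
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "parentRefs": [{
                "name": gateway["metadata"]["name"],
                "namespace": gateway["metadata"]["namespace"],
                "sectionName": listener["name"],
            }],
            "hostnames": [listener["hostname"]],
            "rules": [{
                "matches": [{"path": {"type": "PathPrefix", "value": "/v1"}}],
                "backendRefs": [{"name": service, "port": port}],
            }],
        },
    }


gw = tls_gateway("ai-gateway", "platform-gateways", "example-gateway-class",
                 "chat.ai.example.internal", "chat-tls-cert")
route = inference_route("chat-route", "ai-team-a", gw, "chat-llm", 8000)
print(json.dumps([gw, route], indent=2))
```

In production you would likely narrow `allowedRoutes` to a label selector so only approved namespaces can attach routes. The backend stays a ClusterIP service, so the gateway is the only path in.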

The east-west path is where latency hides

A RAG request is not one call, it is a chain: embed the query, search the vector store, rerank, then generate. If those services sit in different namespaces on different worker nodes, every hop adds a network round trip, and you pay it on every single query. The cheapest performance win in most RAG deployments is not a faster GPU, it is placing the embedding service, reranker and vector database close to the LLM so the east-west traffic stays on the same node or at least the same zone. Use pod affinity to co-locate the components of one pipeline, and use the distributed firewall to allow exactly those flows and deny the rest. Tighten security and cut latency with the same design move.
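Co-location of this kind is expressed through the `affinity` stanza of a pod spec. The sketch below generates that stanza in Python, assuming each component of a pipeline carries a shared label; the label key `rag-pipeline` and the pipeline name are hypothetical. How you inject the stanza depends on how each component is deployed, so check what your operator or Helm chart exposes.

```python
import json


def colocate_with(pipeline: str,
                  topology_key: str = "topology.kubernetes.io/zone",
                  required: bool = False) -> dict:
    """Build a pod affinity stanza that pulls pods of one pipeline together.

    topology_key picks the granularity: "kubernetes.io/hostname" for the
    same node, "topology.kubernetes.io/zone" for the same zone.
    """
    term = {
        "labelSelector": {"matchLabels": {"rag-pipeline": pipeline}},
        "topologyKey": topology_key,
    }
    if required:
        # Hard rule: the pod stays Pending if it cannot be co-located.
        rules = {"requiredDuringSchedulingIgnoredDuringExecution": [term]}
    else:
        # Soft rule: the scheduler prefers co-location but can fall back.
        rules = {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {"weight": 100, "podAffinityTerm": term}
            ]
        }
    return {"podAffinity": rules}


# Prefer the same node for the embedding service in a pipeline named "support-bot".
print(json.dumps(colocate_with("support-bot", "kubernetes.io/hostname"), indent=2))
```

A soft preference is usually the safer starting point on GPU clusters, because a hard same-node rule can leave pods unschedulable when the node hosting the LLM has no room left.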

Disclaimer: changing Supervisor networking or distributed firewall policy on a live cluster can sever running inference traffic. Validate rules in a staging namespace, stage default-deny policies with logging before enforcing, and confirm your load balancer and certificate design with the network team before cutover.

What I would do

For anything beyond a single-team lab, build the Supervisor on NSX from day one. Retrofitting micro-segmentation onto a running AI platform is painful and you will not get permission for the maintenance window once models are in production. Front all inference with the Gateway API and NSX Advanced Load Balancer, terminate TLS and authenticate at that edge, and treat a bare LoadBalancer service on a NIMService as a red flag in code review. Then spend an afternoon on pod affinity for your RAG stacks, because that one change quietly removes more tail latency than most GPU tuning. Networking is not the exciting part of Private AI, but it is the part that decides whether the platform feels fast and stays contained. For the security model that sits on top of this, see the Private AI security and data privacy post.
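The LoadBalancer red flag is easy to automate. Here is a minimal sketch of a review check, assuming your manifests have already been parsed into Python dictionaries (for example from rendered YAML or JSON). It only inspects plain Service objects; if a custom resource exposes a service type of its own, that needs a separate check. The sample manifests are invented for the demonstration.

```python
def bare_loadbalancer_services(manifests: list[dict]) -> list[str]:
    """Return namespace/name for every Service of type LoadBalancer."""
    findings = []
    for manifest in manifests:
        if manifest.get("kind") != "Service":
            continue
        if manifest.get("spec", {}).get("type") == "LoadBalancer":
            meta = manifest.get("metadata", {})
            namespace = meta.get("namespace", "default")
            findings.append(f"{namespace}/{meta.get('name', '<unnamed>')}")
    return findings


# Invented sample input: one safe service, one that should be flagged.
sample = [
    {"kind": "Service",
     "metadata": {"name": "chat-llm", "namespace": "ai-team-a"},
     "spec": {"type": "ClusterIP"}},
    {"kind": "Service",
     "metadata": {"name": "embedder", "namespace": "ai-team-a"},
     "spec": {"type": "LoadBalancer"}},
    {"kind": "Deployment",
     "metadata": {"name": "reranker", "namespace": "ai-team-a"},
     "spec": {}},
]

for finding in bare_loadbalancer_services(sample):
    print(f"review: {finding} is exposed with a LoadBalancer service")
```

Wired into a CI step, a check like this turns the review guideline into something that fails a build instead of relying on someone noticing.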

How is your Supervisor networking set up today, NSX or VDS? If you are on VDS and sharing GPU hosts across teams, that is the first thing I would revisit.


About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
