Here is a pattern I see on almost every Private AI engagement. The GPU hosts are sized correctly, the NIM Operator is humming, the models load, and then the first real RAG query takes four seconds because the embedding service, the vector database and the LLM are scattered across three namespaces with traffic hairpinning through an edge it never needed to touch. Nobody designed the network. They designed the compute and assumed the network would sort itself out. It does not. On a platform where a single user request fans out into half a dozen internal model calls, the network is a first-class part of the architecture, not plumbing.
This post covers the three networking problems that matter on Private AI: isolating AI projects from each other, getting inference endpoints exposed safely, and keeping the chatty east-west traffic inside a RAG stack fast and private.
Three traffic patterns, three different problems
Before you reach for NSX policies, separate the traffic in your head. North-south is users and applications calling your inference endpoints from outside the cluster. East-west is the internal chatter between model components: an LLM calling an embedding service calling a reranker. And there is a third, often forgotten flow: the egress a NIMService makes to pull a model or send a license heartbeat. Each has a different latency tolerance, a different exposure risk and a different control.
| Traffic | Example | Latency sensitivity | Primary control |
|---|---|---|---|
| North-south | App calling the chat endpoint | Moderate | Gateway API + load balancer + TLS |
| East-west | LLM to embedding to vector DB | High, every hop counts | Distributed firewall, same-zone placement |
| Egress | Model pull from NGC, license check | Low | Egress firewall rules or a mirror |
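To see why the east-west row says every hop counts, here is a back-of-the-envelope sketch of the arithmetic. The four hops mirror a typical RAG chain; the round-trip times are made-up illustrative values, not measurements from any environment, so substitute your own numbers.

```python
# Back-of-the-envelope model: sequential hops each pay one network round trip.
# All round-trip times below are illustrative assumptions, not measurements.

RAG_HOPS = ["embed query", "vector search", "rerank", "generate"]


def network_overhead_ms(hops, rtt_ms):
    """Network time added to one request when every hop costs rtt_ms."""
    return len(hops) * rtt_ms


# Hypothetical per-hop round-trip times for three placements.
PLACEMENTS_MS = {
    "same node": 0.25,
    "same zone, different node": 0.5,
    "hairpin through an edge": 5.0,
}

overhead = {
    name: network_overhead_ms(RAG_HOPS, rtt)
    for name, rtt in PLACEMENTS_MS.items()
}

for name, ms in overhead.items():
    print(f"{name}: {ms:.2f} ms of network time per query")
```

The absolute values matter less than the multiplier: whatever one hop costs, a chained request pays it once per hop, on every query.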
Segmentation: with NSX or without it
The first design fork on Private AI is whether the Supervisor uses NSX networking or VDS-based networking without NSX. Broadcom documents both, and they are not equivalent for AI. Without NSX you get vSphere namespaces and basic isolation, which is enough for a single-team lab. With NSX you get the distributed firewall, micro-segmentation down to the pod, and proper multi-tenant separation between AI projects that happen to share GPU hosts. If two business units share a cluster and one is handling regulated data, that distinction is not optional.
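As a minimal sketch of what per-project isolation can look like, the snippet below builds a standard Kubernetes NetworkPolicy that only admits ingress from pods in the same namespace. The namespace name is hypothetical, and how (or whether) the policy is enforced depends on your CNI and NSX integration, so treat this as a starting point to verify rather than a finished design.

```python
import json


def same_namespace_only_policy(namespace):
    """Build a NetworkPolicy that admits ingress only from the same namespace.

    An empty podSelector selects every pod in the namespace; a 'from' entry
    with only a podSelector matches pods in the policy's own namespace.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": "allow-same-namespace-only",
            "namespace": namespace,
        },
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }


# "ai-project-finance" is a hypothetical namespace name for illustration.
policy = same_namespace_only_policy("ai-project-finance")
print(json.dumps(policy, indent=2))
```

The JSON output can be applied with `kubectl apply -f -`, since kubectl accepts JSON as well as YAML.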
Exposing inference endpoints without leaking them
The NIM Operator moved to the Kubernetes Gateway API as the default routing mechanism in 3.1, with traditional ingress deprecated. That is the right call: the Gateway API gives you a clean split between the platform team that owns the gateway and the app team that attaches routes to it. On Private AI the gateway is typically backed by NSX Advanced Load Balancer, which carries the actual L4 and L7 data path. Terminate TLS at the gateway, require authentication at that boundary, and never expose a NIMService directly with a LoadBalancer service type unless you genuinely want the model reachable from the data center network. I have walked into more than one environment where an internal chat model was one routable hop from the entire corporate LAN because someone took the quick path.
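A hedged sketch of both halves of that advice: an HTTPRoute the app team attaches to a platform-owned Gateway, and a small review-time check that flags Services of type LoadBalancer. Every name here (gateway, namespaces, hostname, service, port) is hypothetical, and TLS termination is assumed to be configured on the Gateway's listener, which is not shown.

```python
def http_route(name, namespace, gateway, gateway_namespace, hostname, service, port):
    """Build a Gateway API HTTPRoute that sends a hostname to one backend Service."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # The Gateway lives in the platform team's namespace.
            "parentRefs": [{"name": gateway, "namespace": gateway_namespace}],
            "hostnames": [hostname],
            "rules": [{"backendRefs": [{"name": service, "port": port}]}],
        },
    }


def directly_exposed_services(manifests):
    """Return names of Service manifests that use type LoadBalancer."""
    return [
        m["metadata"]["name"]
        for m in manifests
        if m.get("kind") == "Service"
        and m.get("spec", {}).get("type") == "LoadBalancer"
    ]


# Hypothetical names throughout, for illustration only.
route = http_route(
    "chat-route", "ai-project-finance",
    "inference-gateway", "platform-gateways",
    "chat.ai.example.internal", "chat-llm", 8000,
)

manifests = [
    {"kind": "Service", "metadata": {"name": "chat-llm"}, "spec": {"type": "ClusterIP"}},
    {"kind": "Service", "metadata": {"name": "quick-path-llm"}, "spec": {"type": "LoadBalancer"}},
    route,
]
flagged = directly_exposed_services(manifests)
print("Services bypassing the gateway:", flagged)
```

A check like this is easy to run in CI against rendered manifests, so the quick path gets caught before it reaches the cluster rather than after.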
The east-west path is where latency hides
A RAG request is not one call but a chain: embed the query, search the vector store, rerank, then generate. If those services sit in different namespaces on different worker nodes, every hop adds a network round trip, and you pay it on every single query. The cheapest performance win in most RAG deployments is not a faster GPU but better placement: put the embedding service, reranker and vector database close to the LLM so the east-west traffic stays on the same node, or at least in the same zone. Use pod affinity to co-locate the components of one pipeline, and use the distributed firewall to allow exactly those flows and deny the rest. You tighten security and cut latency with the same design move.
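The co-location idea can be sketched as a pod affinity stanza. This is one possible shape, not a recommendation for every cluster: it requires the same zone and merely prefers the same node, and the `rag-pipeline` label key and its value are hypothetical names you would replace with your own.

```python
def colocate_with_pipeline(label_key, label_value, same_node_weight=100):
    """Build a pod 'affinity' stanza for one pipeline's components.

    Hard rule: schedule in the same zone as pods carrying the label.
    Soft rule: prefer the same node when capacity allows.
    """
    if not 1 <= same_node_weight <= 100:
        raise ValueError("weight must be between 1 and 100")
    selector = {"matchLabels": {label_key: label_value}}
    return {
        "podAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": selector,
                    "topologyKey": "topology.kubernetes.io/zone",
                }
            ],
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": same_node_weight,
                    "podAffinityTerm": {
                        "labelSelector": selector,
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ],
        }
    }


# "rag-pipeline" / "support-bot" are hypothetical label names.
affinity = colocate_with_pipeline("rag-pipeline", "support-bot")

# The stanza goes under spec.template.spec.affinity of each component.
embedding_pod_spec = {"affinity": affinity, "containers": []}
```

Keeping the node rule soft is a deliberate choice in this sketch: a hard same-node requirement can leave pods pending when the GPU host is full, while a soft preference still captures most of the latency benefit.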
What I would do
For anything beyond a single-team lab, build the Supervisor on NSX from day one. Retrofitting micro-segmentation onto a running AI platform is painful and you will not get permission for the maintenance window once models are in production. Front all inference with the Gateway API and NSX Advanced Load Balancer, terminate TLS and authenticate at that edge, and treat a bare LoadBalancer service on a NIMService as a red flag in code review. Then spend an afternoon on pod affinity for your RAG stacks, because that one change quietly removes more tail latency than most GPU tuning. Networking is not the exciting part of Private AI, but it is the part that decides whether the platform feels fast and stays contained. For the security model that sits on top of this, see the Private AI security and data privacy post.
How is your Supervisor networking set up today, NSX or VDS? If you are on VDS and sharing GPU hosts across teams, that is the first thing I would revisit.
References
- Deploying VCF Private AI Services: Supervisor architectures with and without NSX
- NVIDIA NIM Operator release notes (Gateway API routing)
- Kubernetes Gateway API documentation