Dr. Pranay Jha

VKS Networking: NSX, VPCs and the CNI Options (VKS Series, Part 6)

Antrea by default, NSX VPCs or VDS on 9.1, and CIDRs that must not collide. Here is the VKS networking design that keeps clusters reachable instead of stuck.

VKS Series · Part 6 of 17

TL;DR · Key Takeaways

  • Antrea is the default CNI and the one with the deepest NSX integration. Since VKS 3.6 you can bring your own (Calico, Cilium), but you trade native NSX visibility for it.
  • On VCF 9.1 the supported Supervisor networking is VDS (VLAN-backed) or NSX VPCs. NSX Classic Tier-0/Tier-1 is gone from 9.1; if your design still draws it, it is legacy.
  • NSX VPCs behave like a public-cloud VPC: per-project isolation, self-service, automatic routing and NAT, which maps cleanly onto multi-tenant namespaces.
  • Plan pod and service CIDRs deliberately. They must not overlap each other, the node network, or anything they route to, and they are painful to change later.
Who this is for: platform and network teams designing VKS connectivity, especially anyone carrying a vSphere 8 pattern forward.

Prerequisites: a decision on VDS versus NSX, and a block of IP space you control.

Networking is where VKS stops being generic Kubernetes and starts being VCF. The CNI inside the cluster is standard, but how that cluster gets its addresses, its load-balancer VIPs and its isolation is a VCF networking decision. Get this layer right and clusters just work; get it wrong and you collect the stuck-provisioning and no-external-access tickets that fill troubleshooting threads. For the platform-level NSX picture, pair this with my VCF 9 explainer on NSX, transit gateways and VPCs.

The CNI choice: Antrea, or bring your own

VKS lets you pick the cluster CNI at creation time. Antrea is the default and the one VMware integrates most deeply, through the Antrea-NSX adapter, so it is the CNI behind the NSX integration, the VPC features, and the VCF 9.1 multi-network capability that lets a pod attach a secondary interface. Because VKS is CNCF-conformant and bring-your-own-CNI has been allowed since VKS 3.6, you can run Calico or Cilium inside guest clusters instead. Just know the trade: you give up native NSX visibility for that flexibility. In practice I keep Antrea unless a workload has a hard, specific dependency on another CNI, because swapping it to match a runbook from another platform is a cost you pay forever in troubleshooting. The CNI is also a per-cluster decision baked in at provisioning, so decide before you create, not after.

Supervisor networking: VDS or NSX VPC, and nothing else in 9.1

Underneath the CNI, the Supervisor runs on one of two network backings, and this is the part most likely to be wrong if you carried a vSphere 8 design forward. NSX with VPCs is the software-defined option: NSX provides overlay segments, routing and the load balancing that Service type LoadBalancer depends on, and VPCs give each project a self-service slice of isolated networking. VDS networking is the lighter, VLAN-backed option for environments not running full NSX. On VCF 9.1 these two are the supported configurations; the older NSX Classic model of hand-built Tier-0/Tier-1 segments was supported up to 9.0.2 and is gone from 9.1 onward.

| Option | What you get | Choose when |
|---|---|---|
| NSX with VPCs | Overlay, per-project isolation, native LB, microsegmentation | You run NSX and want self-service multi-tenant networking |
| VDS networking | VLAN-backed connectivity, no NSX overlay | No NSX; simpler or lab footprints; external LB like Avi |
| NSX Classic (T0/T1) | Hand-managed segments (legacy) | Existing only; removed in 9.1, not for new builds |

VPCs are the construct to understand for multi-tenancy. Each NSX project gets a VPC, and a VKS cluster provisioned into that project’s namespace gets isolated segments and IP space without anyone hand-cutting Tier-1 routers. That is what makes per-team self-service realistic at scale, and it ties straight into how load balancing works for VKS, which is Part 7.

Figure: NSX VPC per project, CIDRs that don’t collide
  • VPC project-a: node network 10.10.0.0/22 (VKS node VMs), pod CIDR 100.96.0.0/11, service CIDR 100.64.0.0/13
  • VPC project-b: node network 10.20.0.0/22 (isolated from project-a), with its own non-overlapping pod / service CIDRs that must not collide with anything they route to
Each project’s VPC isolates its networking; the pod and service CIDRs inside still have to avoid every range they route to.

CIDR planning you will not regret

Every VKS cluster carries a pod CIDR and a service CIDR, the internal ranges Kubernetes hands to pods and services. They are easy to set and painful to change, so plan them once, properly. The non-negotiables: pod and service ranges must not overlap each other, must not overlap the node network, and must not collide with any network they need to route to, whether on-premises ranges, peered clouds or partner networks. A lazy default that happens to overlap a corporate subnet is the classic cause of “my pods cannot reach the database” weeks after the cluster was built, and the tell is that internal cluster traffic works but one specific external range does not: the cluster treats the overlapping addresses as its own, so traffic for them never leaves. Give yourself room, too: a too-small pod CIDR caps how many pods the cluster can ever schedule, and with NSX VPCs you also want the cluster’s needs to fit inside the project’s allocated IP space.
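To put a rough number on “room”, here is a minimal sketch of the pod-capacity arithmetic. It assumes the cluster hands each node a fixed-size block out of the pod CIDR, and it uses a /24 per node purely as an illustrative assumption (that is the upstream Kubernetes default for IPv4; check what your VKS release and CNI configuration actually allocate). The function name and the small example range are mine, not anything VKS ships.

```python
import ipaddress


def pod_capacity(pod_cidr: str, node_prefix: int = 24) -> dict:
    """Estimate how many nodes and pod IPs a pod CIDR can serve.

    node_prefix is an ASSUMPTION: the prefix length of the block
    each node receives from the pod CIDR.
    """
    net = ipaddress.ip_network(pod_cidr)
    if not net.prefixlen <= node_prefix <= net.max_prefixlen:
        raise ValueError("per-node block does not fit inside the pod CIDR")
    return {
        # How many per-node blocks fit in the pod CIDR.
        "max_nodes": 2 ** (node_prefix - net.prefixlen),
        # Raw addresses in one per-node block (upper bound on pods per node).
        "ips_per_node": 2 ** (net.max_prefixlen - node_prefix),
        "total_ips": net.num_addresses,
    }


# The pod CIDR from the figure: plenty of headroom.
print(pod_capacity("100.96.0.0/11"))

# A hypothetical, too-small range: only four node blocks, ever.
print(pod_capacity("192.168.0.0/22"))
```

Under that assumption the second range runs out at four nodes no matter how much compute you add, which is exactly the ceiling you only discover when the cluster needs to scale.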

CIDR overlap is self-inflicted: it is the single most common VKS networking wound, and it is entirely preventable. Spend fifteen minutes with your network team allocating non-overlapping pod and service ranges before you provision, and you dodge a category of failure that is genuinely miserable to fix in production.
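That fifteen-minute check is easy to make mechanical. A minimal sketch using Python’s standard `ipaddress` module: list every range the cluster owns or routes to, and flag any pair that overlaps. The node, pod and service ranges are the ones from the figure; `corp_db` and the “bad” pod range are hypothetical values I made up to show a collision.

```python
import ipaddress
from itertools import combinations


def find_overlaps(ranges: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named CIDRs that overlap each other."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]


plan = {
    "node": "10.10.0.0/22",
    "pod": "100.96.0.0/11",
    "service": "100.64.0.0/13",
    "corp_db": "10.50.0.0/16",  # hypothetical on-premises range the pods must reach
}
print(find_overlaps(plan))  # [] : nothing collides

# Same plan, but someone accepted a pod range that swallows the database subnet.
bad = dict(plan, pod="10.48.0.0/13")
print(find_overlaps(bad))  # [('pod', 'corp_db')]
```

Feed it the whole estate rather than one cluster at a time: the overlap that hurts is usually with a range nobody on the platform team knew existed.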

How Antrea and NSX actually integrate

The reason Antrea is more than just a default is the Antrea-NSX adapter, which connects the in-cluster CNI to the NSX control plane. With it, Kubernetes network policies and the cluster’s pods become visible to NSX, so you can see and secure container traffic in the same place you manage the rest of your east-west security rather than in a separate Kubernetes-only tool. That visibility is the quiet payoff of staying on Antrea: a network team that already lives in NSX does not have to learn a parallel toolchain to understand what the clusters are doing, and microsegmentation policy can extend down to the pod. Run Calico or Cilium instead and that bridge is gone; you get a perfectly good CNI, but NSX no longer sees inside the cluster, and you own that observability gap yourself.

This is the trade I weigh for every client who asks about bring-your-own-CNI. The flexibility is real and occasionally necessary (a workload with a hard Cilium eBPF dependency, say), but the default is integrated for a reason, and the cost of leaving it is paid slowly, in every future troubleshooting session where the network team can see the VMs but not the pods. Unless there is a specific, named requirement, the integrated path is worth more than the flexibility.

Secondary interfaces and multi-network in 9.1

VCF 9.1 added multi-network support to VKS, which matters for a specific but common class of workload. By default a pod has a single network interface, and for most applications that is exactly right. But some workloads (network functions, certain data or telco apps, monitoring tools that need to tap a separate network) genuinely need a second interface on a different network. In 9.1 a VKS cluster and the pods within it can be deployed with a secondary vNIC using the Antrea CNI, so those workloads get their additional network path without hacks. If you have been carrying a design that assumed one network per pod and worked around it with host networking or privileged sidecars, this is the feature that lets you do it properly.

Do not reach for it by default, though. A second interface is operational surface area: more addressing to plan, more routing to reason about, more to get wrong. Use it where a workload genuinely requires network separation, and keep everything else single-homed. The feature exists to solve a real problem, not to be sprinkled on clusters that do not have that problem.

Egress, SNAT and reaching the outside world

Inbound traffic gets all the attention, but egress, meaning how pods reach things outside the cluster, is where a surprising number of real problems live. By default pod traffic leaving the cluster is source-NATed behind the node or a gateway address, which is fine until something on the outside needs to allow-list the source: a firewall rule, a database that only accepts known IPs, a partner API with IP-based auth. Then you care a great deal about which address your egress traffic actually presents, and whether it is stable. With NSX VPCs the egress path and its NAT behaviour are part of the VPC configuration, and Antrea offers egress controls so you can pin specific workloads to specific egress addresses when an external system demands a predictable source IP.

The failure pattern is always the same and always weeks after go-live: the app works fine talking to everything inside the cluster, then cannot reach one external system that allow-lists by IP, and nobody connects it to egress NAT because the cluster itself looks healthy. Plan your egress addressing alongside your pod and service CIDRs, decide early which workloads need a stable, allow-listable egress IP, and document it. It is far cheaper to design than to diagnose.
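“Document it” can be as simple as a reviewable list. The sketch below is one hypothetical way to keep that list honest: it flags any workload marked as needing a stable egress IP that has none recorded, and any recorded egress IP that sits inside a pod or service CIDR (an address the outside world would never see). The `EgressEntry` type, the workload names and the addresses are all invented for illustration; nothing here talks to NSX or Antrea.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class EgressEntry:
    workload: str
    needs_stable_ip: bool
    egress_ip: Optional[str] = None  # the address external systems will allow-list


def review_egress_plan(entries: list[EgressEntry], internal_cidrs: list[str]) -> list[str]:
    """Return human-readable problems found in a documented egress plan."""
    internal = [ipaddress.ip_network(cidr) for cidr in internal_cidrs]
    problems = []
    for entry in entries:
        if not entry.needs_stable_ip:
            continue  # default SNAT behaviour is acceptable for this workload
        if entry.egress_ip is None:
            problems.append(f"{entry.workload}: needs a stable egress IP but none is documented")
            continue
        ip = ipaddress.ip_address(entry.egress_ip)
        if any(ip in net for net in internal):
            problems.append(f"{entry.workload}: {ip} sits inside a pod/service CIDR")
    return problems


egress_plan = [
    EgressEntry("billing-api", needs_stable_ip=True, egress_ip="10.10.0.50"),
    EgressEntry("batch-reports", needs_stable_ip=True),
    EgressEntry("web-frontend", needs_stable_ip=False),
    EgressEntry("ledger-sync", needs_stable_ip=True, egress_ip="100.96.4.20"),
]
problems = review_egress_plan(egress_plan, ["100.96.0.0/11", "100.64.0.0/13"])
for line in problems:
    print(line)
```

The value is less in the code than in forcing the question per workload before go-live, while the answer is still a design decision rather than an incident.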

What I’d Do

I keep Antrea unless a workload truly forces another CNI, standardise on NSX VPCs where NSX is in play so tenant onboarding is self-service rather than a ticket, and reserve VDS for genuinely NSX-free or lab footprints. Most importantly, I treat pod and service CIDRs as a network-team decision made once, from a planned block, never a per-cluster default someone accepts under time pressure. If you are still drawing NSX Classic Tier-0/Tier-1 topologies for new clusters, stop: that design is already legacy on 9.1 and you will rework it at upgrade time. Networking is the layer that punishes shortcuts the longest. Are your cluster CIDRs documented and non-overlapping across the whole estate, or allocated ad hoc one cluster at a time?

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
