Tag: NVIDIA AI Series
-
Deploying and Autoscaling NIM in Production on Kubernetes (NVIDIA AI Series, Part 17)
How to deploy NVIDIA NIM in production using the NIM Operator and Helm, wire autoscaling on the right GPU and KV-cache signals instead of CPU, handle cold-start model load, and run blue-green rollouts without dropping throughput.
-
TensorRT and TensorRT-LLM: Optimization, Quantization, and Engine Building (NVIDIA AI Series, Part 18)
What TensorRT does at build time versus what TensorRT-LLM adds at runtime — kernel fusion, paged KV cache, in-flight batching, and quantization choices from FP8 to NVFP4 — and when to hand-build engines instead of relying on a NIM.
-
NVIDIA NIM Inference Microservices: What a NIM Is and How It Serves a Model (NVIDIA AI Series, Part 16)
NVIDIA NIM packages a model, an optimized inference engine, and an OpenAI-compatible API into a single container. Pull it, pass your NGC API key, and you have a production inference endpoint on your own GPU infrastructure in minutes.
-
NVIDIA Network Operator on Kubernetes: RDMA, SR-IOV, and the Accelerated Fabric (NVIDIA AI Series, Part 13)
The NVIDIA Network Operator provisions the MOFED drivers, the RDMA shared device plugin, SR-IOV VFs, and Multus secondary networks that Kubernetes pods need. This is how GPUDirect RDMA actually works at scale on ConnectX-7 and NDR InfiniBand clusters.
-
NVIDIA Drivers, CUDA, and the Container Toolkit: Building a Clean GPU Host Baseline (NVIDIA AI Series, Part 11)
The GPU host stack has three distinct layers: the data-center driver (open kernel module now required for Hopper and Blackwell), the CUDA Toolkit, and the NVIDIA Container Toolkit. Get the install order or versions wrong and containers fail silently. Here is the right sequence, the compatibility matrix, and the failure modes.
-
Air-Gapped Deployment, Lifecycle and CVE Patching for the NVIDIA Stack (NVIDIA AI Series, Part 15)
Running NVIDIA AI Enterprise in an air-gapped environment requires mirroring nvcr.io containers, Helm charts, and model weights before you cut the wire. Here is the branch selection, driver patch cadence, and CVE triage workflow that keeps regulated deployments defensible.
-
NGC Catalog: Containers, Models, Helm Charts and How to Consume Them (NVIDIA AI Series, Part 14)
The NGC catalog is your upstream source for NVIDIA GPU-optimized containers, pretrained models, and Helm charts. Here is how the nvcr.io registry, org/team/API-key model, and NVAIE entitlement actually work, with a full operational pull-and-deploy walkthrough.
-
NVIDIA GPU Operator on Kubernetes: ClusterPolicy, Components, and Day-2 Ops (NVIDIA AI Series, Part 12)
The NVIDIA GPU Operator automates every software layer a GPU node needs in Kubernetes, from kernel driver to DCGM metrics, via a single ClusterPolicy CRD. Here is what it installs, how the reconciliation loop works, when to disable the driver component, and the failure modes that will catch you on first install.
-
InfiniBand vs Spectrum-X Ethernet: Choosing Your AI Cluster Scale-Out Fabric (NVIDIA AI Series, Part 8)
InfiniBand Quantum-X800 and Spectrum-X Ethernet both run at 800 Gb/s — but they are not the same choice. A direct comparison of SHARPv4 in-network reduction, lossless fabric mechanisms, rail-optimized topology, multi-tenant isolation, and operational trade-offs, with a clear verdict on which fabric wins for dedicated AI training versus shared enterprise GPU platforms.
-
GPUDirect Storage: The DMA Path from NVMe to GPU Memory (NVIDIA AI Series, Part 9)
GPUDirect Storage (GDS) creates a direct DMA path from NVMe or networked storage straight into GPU HBM, bypassing the CPU bounce buffer entirely. Here is when it helps, what the cuFile API requires, and the filesystem and NIC prerequisites to validate before enabling in production.
-
GPU Power, Cooling and Density: Why Blackwell Forces Liquid (NVIDIA AI Series, Part 10)
The GB200 NVL72 draws ~120 kW per rack and ships liquid-cooled by design. Learn why Blackwell-class systems make direct-to-chip cooling mandatory, how CDUs and facility water loops work, and what to validate before ordering.
-
NVLink and NVSwitch: How NVIDIA Builds the Scale-Up Fabric (NVIDIA AI Series, Part 7)
Fifth-generation NVLink delivers 1.8 TB/s per GPU, and NVSwitch builds a non-blocking 130 TB/s all-to-all fabric across 72 GPUs in the GB200 NVL72. Here is how the domain forms, why it determines your tensor and expert parallelism strategy, and where the boundary falls.
About the Author

Dr. Pranay Jha
Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
