Tag: NVIDIA AI Series
-
Running NVIDIA AI On-Prem and on VCF: Cost, Trade-offs and the Verdict (NVIDIA AI Series, Part 30)
The finale: running the NVIDIA AI stack on bare metal, on VMware Cloud Foundation, or in the cloud; the real total cost of an AI factory; and the verdict on when to build versus rent.
-
GPU Observability and Multi-Tenancy: DCGM, Honest Utilization, and Sharing (NVIDIA AI Series, Part 29)
Why GPU utilization lies, the DCGM profiling fields that tell the truth (SM and Tensor activity), dcgm-exporter into Prometheus, and choosing MIG vs time-slicing for multi-tenancy.
-
NVIDIA Blueprints and Agentic AI: AI-Q and the NeMo Agent Toolkit (NVIDIA AI Series, Part 28)
NVIDIA Blueprints, the AI-Q enterprise research agent, and the framework-agnostic NeMo Agent Toolkit: how to build agents you can profile, afford, and trust in production.
-
The NVIDIA NeMo Framework: Training and Fine-Tuning at Scale (NVIDIA AI Series, Part 22)
What the NVIDIA NeMo framework is: Megatron-Core parallelism, NeMo 2.0 Python recipes and NeMo-Run, Megatron Bridge for Hugging Face interop, and when to fine-tune instead of pretrain.
-
NVIDIA NeMo Retriever: RAG with Embeddings, Reranking and Guardrails (NVIDIA AI Series, Part 27)
How NVIDIA’s NeMo Retriever builds enterprise RAG: extraction, embedding and reranking NIMs, the open Nemotron Retriever models, and NeMo Guardrails, plus the retrieval failures they quietly fix.
-
NVIDIA Nemotron Foundation Models: Open Weights from Nano to Ultra (NVIDIA AI Series, Part 26)
NVIDIA’s Nemotron family explained: genuinely open weights, data and recipes; the hybrid Mamba-Transformer MoE architecture; Nano, Super and Ultra; and when to self-host open models instead of calling a proprietary API.
-
Multi-Node LLM Training: Scheduling, Checkpointing and Fault Tolerance (NVIDIA AI Series, Part 25)
At thousands of GPUs, failures are routine. This part covers gang scheduling (Slurm vs Kubernetes vs NVIDIA Run:ai), async distributed checkpointing with NeMo, and the NVIDIA Resiliency Extension stack for fault tolerance, straggler detection, and elastic restart.
-
Inference Economics: Throughput, Latency, Batching and Cost Per Token (NVIDIA AI Series, Part 21)
TTFT, ITL, continuous batching, KV cache pressure, and FP8 quantization: how to compute and drive down $/1M tokens on NVIDIA H100, H200, and B200 GPUs without breaking your latency SLOs.
-
NeMo Customization: LoRA, SFT, and RLHF on NVIDIA NeMo (NVIDIA AI Series, Part 23)
A practical decision guide for AI infrastructure architects on the full NeMo customization spectrum: when to use LoRA, full SFT, DPO, or GRPO, what data and GPU budget each method needs, and how the NeMo Customizer microservice ties it all together.
-
Data Preparation at Scale with NeMo Curator (NVIDIA AI Series, Part 24)
NeMo Curator is NVIDIA’s GPU-accelerated data curation toolkit that runs exact dedup, fuzzy dedup, semantic dedup, heuristic filtering, classifier-based quality filters, and PII redaction at trillion-token scale using RAPIDS cuDF and Dask. Learn why investing in data curation beats buying more GPUs.
-
Triton Inference Server vs NIM: When to Use Which (NVIDIA AI Series, Part 19)
Triton Inference Server and NVIDIA NIM solve different problems. This guide breaks down when to use each — and when to run both — covering backends, ensembles, dynamic batching, and the NIM packaged-microservice approach for LLM serving.
-
NVIDIA Dynamo Disaggregated Inference: Prefill, Decode, and KV-Aware Routing at Scale (NVIDIA AI Series, Part 20)
NVIDIA Dynamo separates prefill and decode onto independent GPU pools, routing requests via a KV-aware smart router and transferring KV cache blocks via NIXL. Here is when disaggregation wins, when it does not, and what to validate before committing to the architecture.
About the Author

Dr. Pranay Jha
Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
