Dr. Pranay Jha

VMware • Cloud • AI • Enterprise Architecture

Tag: NVIDIA AI Series

AI Stack, AI/ML, VMware & Cloud

Running NVIDIA AI On-Prem and on VCF: Cost, Trade-offs and the Verdict (NVIDIA AI Series, Part 30)

Dr. Pranay Jha

June 23, 2026

The finale: running the NVIDIA AI stack on bare metal, on VMware Cloud Foundation, or in the cloud; the real total cost of an AI factory; and the verdict on when to build versus rent.
AI Stack, AI/ML

GPU Observability and Multi-Tenancy: DCGM, Honest Utilization, and Sharing (NVIDIA AI Series, Part 29)

Dr. Pranay Jha

June 23, 2026

Why GPU utilization lies, the DCGM profiling fields that tell the truth (SM and Tensor activity), dcgm-exporter into Prometheus, and choosing MIG vs time-slicing for multi-tenancy.
AI Stack, AI/ML

NVIDIA Blueprints and Agentic AI: AI-Q and the NeMo Agent Toolkit (NVIDIA AI Series, Part 28)

Dr. Pranay Jha

June 23, 2026

NVIDIA Blueprints, the AI-Q enterprise research agent, and the framework-agnostic NeMo Agent Toolkit: how to build agents you can profile, afford, and trust in production.
AI Stack, AI/ML

The NVIDIA NeMo Framework: Training and Fine-Tuning at Scale (NVIDIA AI Series, Part 22)

Dr. Pranay Jha

June 23, 2026

What the NVIDIA NeMo framework is: Megatron-Core parallelism, NeMo 2.0 Python recipes and NeMo-Run, Megatron Bridge for Hugging Face interop, and when to fine-tune instead of pretrain.
AI Stack, AI/ML

NVIDIA NeMo Retriever: RAG with Embeddings, Reranking and Guardrails (NVIDIA AI Series, Part 27)

Dr. Pranay Jha

June 23, 2026

How NVIDIA’s NeMo Retriever builds enterprise RAG: extraction, embedding and reranking NIMs, the open Nemotron Retriever models, and NeMo Guardrails, plus the retrieval failures they quietly fix.
AI Stack, AI/ML

NVIDIA Nemotron Foundation Models: Open Weights from Nano to Ultra (NVIDIA AI Series, Part 26)

Dr. Pranay Jha

June 23, 2026

NVIDIA’s Nemotron family explained: genuinely open weights, data and recipes; the hybrid Mamba-Transformer MoE architecture; Nano, Super and Ultra; and when to self-host open models instead of calling a proprietary API.
AI Stack, AI/ML

Multi-Node LLM Training: Scheduling, Checkpointing and Fault Tolerance (NVIDIA AI Series, Part 25)

Dr. Pranay Jha

June 22, 2026

At thousands of GPUs, failures are routine. This part covers gang scheduling (Slurm vs Kubernetes vs NVIDIA Run:ai), async distributed checkpointing with NeMo, and the NVIDIA Resiliency Extension stack for fault tolerance, straggler detection, and elastic restart.
AI Stack, AI/ML

Inference Economics: Throughput, Latency, Batching and Cost Per Token (NVIDIA AI Series, Part 21)

Dr. Pranay Jha

June 22, 2026

TTFT, ITL, continuous batching, KV cache pressure, and FP8 quantization: how to compute and drive down $/1M tokens on NVIDIA H100, H200, and B200 GPUs without breaking your latency SLOs.
AI Stack, AI/ML

NeMo Customization: LoRA, SFT, and RLHF on NVIDIA NeMo (NVIDIA AI Series, Part 23)

Dr. Pranay Jha

June 22, 2026

A practical decision guide for AI infrastructure architects on the full NeMo customization spectrum: when to use LoRA, full SFT, DPO, or GRPO, what data and GPU budget each method needs, and how the NeMo Customizer microservice ties it all together.
AI Stack, AI/ML

Data Preparation at Scale with NeMo Curator (NVIDIA AI Series, Part 24)

Dr. Pranay Jha

June 22, 2026

NeMo Curator is NVIDIA’s GPU-accelerated data curation toolkit that runs exact dedup, fuzzy dedup, semantic dedup, heuristic filtering, classifier-based quality filters, and PII redaction at trillion-token scale using RAPIDS cuDF and Dask. Learn why investing in data curation beats buying more GPUs.
AI Stack, AI/ML

Triton Inference Server vs NIM: When to Use Which (NVIDIA AI Series, Part 19)

Dr. Pranay Jha

June 22, 2026

Triton Inference Server and NVIDIA NIM solve different problems. This guide breaks down when to use each, and when to run both, covering backends, ensembles, dynamic batching, and the NIM packaged-microservice approach for LLM serving.
AI Stack, AI/ML

NVIDIA Dynamo Disaggregated Inference: Prefill, Decode, and KV-Aware Routing at Scale (NVIDIA AI Series, Part 20)

Dr. Pranay Jha

June 22, 2026

NVIDIA Dynamo separates prefill and decode onto independent GPU pools, routing requests via a KV-aware smart router and transferring KV cache blocks via NIXL. Here is when disaggregation wins, when it does not, and what to validate before committing to the architecture.

About the Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
