Tag: GenAI Series

Generative AI

The Economics and Future of Generative AI: An Honest Take (GenAI Series, Part 30)

Dr. Pranay Jha

June 18, 2026

An honest take to close the series: why GPU utilization is the real cost lever, a blunt verdict on the hype, what is actually coming, and a recap with reading paths.
Generative AI

Mixture-of-Experts and Where AI Architecture Is Heading (GenAI Series, Part 29)

Dr. Pranay Jha

June 18, 2026

Mixture-of-experts models hold enormous capacity but activate only a few experts per token, so they run cheaply. How MoE works, its memory catch, and the trends to watch.
Generative AI

What It Takes to Train a Model Across Thousands of GPUs (GenAI Series, Part 28)

Dr. Pranay Jha

June 18, 2026

Training a frontier model coordinates thousands of GPUs for months. How data, tensor, pipeline and expert parallelism, the memory math, and checkpointing make it possible.
Generative AI

On-Prem vs Cloud vs Hybrid for GenAI: An Honest Verdict (GenAI Series, Part 27)

Dr. Pranay Jha

June 18, 2026

Where should generative AI run? An honest framework weighing data sovereignty, the cost crossover, and control, and why most large organisations end up hybrid.
Generative AI

The Network and Storage Behind Large-Scale AI (GenAI Series, Part 26)

Dr. Pranay Jha

June 18, 2026

At scale, the network between GPUs is often the real bottleneck. How NVLink, InfiniBand and RoCE, collective operations like all-reduce, and high-throughput storage keep GPUs fed.
Generative AI

Scaling Inference: The Latency vs Throughput Trade-Off (GenAI Series, Part 25)

Dr. Pranay Jha

June 18, 2026

Scaling AI inference means choosing a point on the latency-versus-throughput curve. How batching, tensor and pipeline parallelism, and autoscaling on the right signal work.
Generative AI

vLLM vs TensorRT-LLM vs SGLang: Which Inference Engine, and When (GenAI Series, Part 24)

Dr. Pranay Jha

June 18, 2026

The inference engine decides whether a GPU serves five users or fifty. How continuous batching and paged attention work, and when to choose vLLM, TensorRT-LLM, SGLang or NIM.
Generative AI

Why GenAI Runs on GPUs, and the Memory Wall That Limits It (GenAI Series, Part 23)

Dr. Pranay Jha

June 18, 2026

Models run on GPUs for parallel matrix math, but generating text is limited by memory, not compute. Why bandwidth caps speed, VRAM caps what runs, and the KV cache fills the gap.
Generative AI

Where the Money Actually Goes in Generative AI (GenAI Series, Part 22)

Dr. Pranay Jha

June 18, 2026

Almost every dollar in generative AI is GPU time, metered as tokens. The real cost drivers, why output tokens cost more than input, and the build-versus-buy decision.
Generative AI

Guardrails and Responsible AI: What They Catch, and What They Miss (GenAI Series, Part 21)

Dr. Pranay Jha

June 18, 2026

Guardrails screen what goes into and out of an AI model. What they catch (harmful content, jailbreaks, prompt injection, data leaks), and why safety must be layered rather than rely on a single filter.
Generative AI

Quantization: Running Big Models on Smaller GPUs (GenAI Series, Part 20)

Dr. Pranay Jha

June 18, 2026

Quantization stores a model at lower precision so it needs far less memory. How FP16, INT8 and INT4 trade a little quality for big savings, plus distillation and pruning.
Generative AI

Why Data, Not Model Size, Usually Decides Quality (GenAI Series, Part 19)

Dr. Pranay Jha

June 18, 2026

A smaller model trained on more, cleaner data often beats a bigger one. Why parameter count is overrated, what the Chinchilla result showed, and how data curation decides quality.

About the Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
