Tag: inference
-
Scaling Inference: The Latency vs Throughput Trade-Off (GenAI Series, Part 25)
Scaling AI inference means choosing a point on the latency-versus-throughput curve. How batching, tensor and pipeline parallelism, and autoscaling on the right signal work.
-
vLLM vs TensorRT-LLM vs SGLang: Which Inference Engine, and When (GenAI Series, Part 24)
The inference engine decides whether a GPU serves five users or fifty. How continuous batching and paged attention work, and when to choose vLLM, TensorRT-LLM, SGLang or NIM.
-
Why GenAI Runs on GPUs, and the Memory Wall That Limits It (GenAI Series, Part 23)
Models run on GPUs for parallel matrix math, but generating text is limited by memory, not compute. Why bandwidth caps speed, VRAM caps what runs, and the KV cache fills the gap.
-
Quantization: Running Big Models on Smaller GPUs (GenAI Series, Part 20)
Quantization stores a model at lower precision so it needs far less memory. How FP16, INT8 and INT4 trade a little quality for big savings, plus distillation and pruning.
-
Training vs Inference: Why Using AI Is the Real Cost (GenAI Series, Part 9)
Training builds a model once in three stages; inference runs it on every request, forever. Why the recurring inference bill, not the headline training cost, decides AI economics.
About the Author

Dr. Pranay Jha
Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
